MapReduce has been the standard model for processing large data sets for years. Apache Hadoop is an open-source framework that implements the MapReduce model for distributed computing.
The MapReduce model splits a computation into two phases: the map phase and the reduce phase. The map phase processes the input data and produces a set of key-value pairs, and the reduce phase aggregates the map output to produce the final result. Both phases run in parallel across the cluster, and the results of each phase are written to disk. The model is simple and easy to understand, but its limitations are clear. The most significant one is that it is slow for iterative algorithms, such as machine learning algorithms, because intermediate results are written to disk after every iteration.
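The shape of the model is easier to see with a small example. The sketch below simulates a word-count job in plain Python rather than the Hadoop API: the map phase emits key-value pairs, a simulated shuffle groups them by key, and the reduce phase aggregates each group. Function names and the in-memory "shuffle" are illustrative only; a real Hadoop job implements Mapper and Reducer classes and runs on a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) key-value pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all values that share the same key into a single result.
    return word, sum(counts)

def run(lines):
    # "Shuffle": group the mapped key-value pairs by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In Hadoop, the grouped intermediate pairs would be spilled to disk between the two phases, which is exactly the cost that hurts iterative workloads.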
This is where Apache Spark comes in. The major change in Apache Spark is that it keeps intermediate data in memory, which makes it much faster than MapReduce for many workloads. It introduces concepts such as the Spark Job, Spark Context, and Spark Session, which provide a more user-friendly API for working with large data sets, and it breaks each job into Spark Stages and Spark Tasks to optimize execution. Spark also introduces data structures such as Resilient Distributed Datasets (RDDs) and DataFrames, which are more expressive and efficient than MapReduce's key-value model.
A Spark Job is a set of transformations and actions on data that are executed in parallel on a cluster of machines. It is the basic unit of work in Apache Spark. A Spark Job is composed of one or more stages, which are split at shuffle boundaries. Each stage is composed of one or more tasks, and each task processes a single partition of the data.
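As a rough sketch (assuming a local PySpark installation, with a placeholder application name and made-up input), the job below contains two stages: the narrow transformations before reduceByKey stay in the first stage, and the shuffle introduced by reduceByKey starts a second one.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog"], numSlices=2)

counts = (lines.flatMap(lambda line: line.split())   # narrow transformation: same stage
               .map(lambda word: (word, 1))          # narrow transformation: same stage
               .reduceByKey(lambda a, b: a + b))     # wide transformation: shuffle, new stage

# Nothing has executed yet; transformations are lazy. The action below submits
# a job, which Spark splits into two stages at the shuffle boundary, with one
# task per partition in each stage.
print(counts.collect())

spark.stop()
```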
A Spark Context is the entry point to the low-level Spark API. It is used to create RDDs, broadcast variables, and accumulators, and it coordinates with the cluster manager to acquire resources such as memory and CPU for the application and to schedule the Spark Job on the cluster. A Spark Session is a unified entry point to the Spark API, introduced in Spark 2.0; it subsumes the older SQL Context and Hive Context and wraps a Spark Context internally. The Spark Session provides a more user-friendly API for working with structured data, such as DataFrames and Datasets, whose queries are planned by the Catalyst optimizer.
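A minimal sketch of the two entry points is shown below; the application name and local master URL are placeholders for illustration. The Spark Session wraps a Spark Context, so the older RDD-oriented API remains reachable through it.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("entry-point-demo")
         .master("local[*]")
         .getOrCreate())

# Low-level RDD API via the SparkContext wrapped by the session.
sc = spark.sparkContext
rdd = sc.parallelize(range(10))
print(rdd.sum())

# Structured data goes through the SparkSession directly.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```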
RDDs are the core data structure in Apache Spark. They are immutable, distributed collections of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. RDDs are fault-tolerant because each one tracks the lineage of operations that produced it, so a lost partition can be recomputed rather than restored from a replica.
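The sketch below (local mode, made-up data) shows the transformation/action split and the lineage that makes recomputation possible.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 101), numSlices=4)   # RDD with 4 partitions

evens   = numbers.filter(lambda n: n % 2 == 0)   # transformation: lazy, records lineage
squares = evens.map(lambda n: n * n)             # transformation: lazy, records lineage

# Actions trigger the computation and return results to the driver.
print(squares.count())    # 50
print(squares.take(5))    # [4, 16, 36, 64, 100]

# The lineage (parallelize -> filter -> map) lets Spark recompute any
# lost partition instead of relying on replicated copies of the data.
print(squares.toDebugString().decode())

spark.stop()
```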
DataFrames are a distributed collection of data organized into named columns, similar to tables in a relational database or data frames in R and pandas. They can be created from a variety of data sources, such as structured data files, Hive tables, external databases, or existing RDDs, and they provide a more user-friendly API for working with structured data. DataFrames are often more efficient than hand-written RDD code because their queries go through the Catalyst optimizer, which plans the execution of the job.
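Here is a short sketch of the DataFrame API; the column names and rows are invented for illustration. The operations are declarative, so Catalyst decides how the aggregation is actually executed, and explain() prints the resulting physical plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "books", 5.0), ("alice", "games", 20.0)],
    ["user", "category", "amount"],
)

# Column-oriented operations; Catalyst plans the actual execution.
summary = (df.filter(F.col("amount") > 4)
             .groupBy("category")
             .agg(F.sum("amount").alias("total")))

summary.show()
summary.explain()   # prints the optimized physical plan

spark.stop()
```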
Apache Spark provides a more efficient and user-friendly data processing engine than the traditional MapReduce model. A Spark Job is typically both faster to run, thanks to in-memory processing and query optimization, and faster to write, thanks to its higher-level API, than an equivalent MapReduce Job.