
Apache Spark Explained

· 11 min read
Parham Parvizi

A quick deep-dive into Apache Spark, the most popular distributed data engineering tool.

What is Spark? Why is it so popular? When and how to use it?

Learn the difference between the subcomponents (RDDs, DataFrames, SQL, Streaming, ...), set up PySpark, and learn how to write Spark transformations using Python and Jupyter Notebook.

Overview

The Apache Spark™ homepage says:

Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Spark is widely regarded as the leading distributed data processing engine. It runs parallel data processing and analytical workloads on very large datasets. It is very versatile, providing APIs in a variety of languages such as Python, Scala, Java, and SQL, and it can run on a standalone machine, a cluster of machines, Docker or Kubernetes containers, or on the Cloud.

The What/Why/How

Before we introduce Spark in depth, let's take a look at the basics:

  1. What is Spark? And what are its components?
  2. Why is Spark the most prominent tool? And when should we use it?
  3. How is Spark used? How does it achieve parallel processing?

The What?

As mentioned above, Spark is the "Unified analytics engine for large-scale data processing". Spark is mainly a data processing engine. It does not store data itself, but it enables large-scale processing of data stored on distributed systems such as Hadoop and cloud storage. It provides a unified API in Java, Scala, Python, SQL, and R that abstracts away the complexity of distributed data processing and makes it accessible to all developers.

Spark is used by both Data Scientists and Data Engineers due to its expansive APIs, which cover Python, SQL, and ML libraries. Most developers find Spark very easy to use since it builds on familiar concepts of SQL and DataFrames.

Spark is very feature-rich and consists of many sub-components, which can be confusing for people who are just beginning to learn Spark. Here, we aim to quickly explain each component and its use:

  • Spark RDDs

    This is the legacy core component of Spark. It stands for Resilient Distributed Dataset (RDD). It breaks up datasets into smaller, distributed, resilient chunks and provides a series of transformation methods that make data processing accessible and scalable. These transformation methods (the RDD API) enable developers to provide custom functions in Python (or Scala, or Java) to map, filter, join, merge, and store data.

    info

    RDDs are the lowest level of interaction with Spark. While they provide a great deal of functionality, they are typically overshadowed by newer and more feature-rich components such as Spark SQL and DataFrames.

  • SQL & DataFrames

    Spark SQL provides a mixed use of SQL and the familiar DataFrame APIs (known from Pandas) to query and transform structured data. It loads and stores data in a wide range of formats such as CSV, JSON, Parquet, ORC, and Hive, on local or Cloud storage, enabling developers to merge data across all these sources. It further allows developers to write complex custom transformation logic with the DataFrame APIs alongside their SQL code (a short sketch follows the component list below).

    Spark SQL and DataFrames are the most widely used components of Spark.

  • Spark Streaming

    Spark Streaming is an extension of the core Spark API that enables real-time processing of high-throughput sources such as apps, Twitter and other social media feeds, websites, or stock feeds. Data can be ingested from streaming sources like Kafka, Google Cloud Pub/Sub, AWS Kinesis, or simple TCP sockets, and processed using transformation logic expressed in high-level functions like map, reduce, join, and window. Finally, processed data is pushed out to file systems, databases, and live dashboards.

  • Spark ML or MLlib

    MLlib is the Machine Learning library of Spark. It provides a wide range of ML algorithms such as regressions, classifications, and clustering methods on top of the familiar DataFrame APIs.

  • GraphX

    GraphX is the Spark component for graphs and graph-parallel computation. It provides processing and analytics on data that is stored as graphs. Graphs are commonly used in social networking platforms to store people's relationships to one another. GraphX enables developers to traverse and explore this data at large scale.

info

The most prominent components used by Data Engineers are Spark SQL, the DataFrame APIs, and Streaming.
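
To make the SQL & DataFrames component concrete, here is a minimal PySpark sketch that aggregates a small arrivals dataset once through the DataFrame API and once through plain SQL. The SparkSession setup, app name, and column names are illustrative assumptions, not part of a specific dataset:

from pyspark.sql import SparkSession

# Entry point for Spark SQL and DataFrames (assumes a local PySpark install)
spark = SparkSession.builder.appName("arrivals-demo").getOrCreate()

# A small in-memory DataFrame; the column names are hypothetical
df = spark.createDataFrame(
    [("PDX", 1), ("LAX", 5), ("PDX", 2)],
    ["airport", "arrivals"],
)

# DataFrame API: group and sum
df.groupBy("airport").sum("arrivals").show()

# The same aggregation expressed in SQL
df.createOrReplaceTempView("arrivals")
spark.sql("SELECT airport, SUM(arrivals) AS total FROM arrivals GROUP BY airport").show()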

The Why?

Spark has been the most dominant tool due to a few simple facts:

Spark is Fast:

Spark is much faster than some of its predecessors, such as MapReduce. Spark does most of its computation in memory while providing scalable, reliable, and fault-tolerant data processing across machines. It also has much less overhead to launch and coordinate jobs on a cluster, making it more suitable for applications that require low latency.

Easy to use:

Spark provides APIs in a wide range of familiar interfaces such as Python, Scala, Java, SQL, R, and DataFrames. This makes it very easy for developers from various backgrounds to adopt. It's also very easy to mix these workloads together.

Runs Everywhere:

Spark can run on a single standalone machine, a cluster of machines (managed by YARN, Mesos, or Spark's standalone scheduler), Docker or Kubernetes containers, or on the Cloud. The same code that is developed on a single machine can run on a cluster of machines, providing scalability. All major Cloud vendors provide a managed Spark service on their platform.

info

It's also fair to point out why Spark is NOT always used on the Cloud. Spark services are typically expensive to run on the Cloud, so serverless options such as Cloud Functions and Cloud Containers are often preferred for their lower cost. But if you need data processing at a truly large scale, there's arguably no replacement for Spark, and teams pay the price.

The How?

Spark loosely works based on Scatter/Gather distributed processing techniques. Large data sources are divided up into smaller, more manageable, chunks and are distributed across many machines (nodes). Data processing functions are distributed on each node and run in parallel, which is referred to as the scatter phase. When data needs to be aggregated and collected back, it is gathered back into individual nodes, which is referred to as the gather phase. Spark automatically handles these distribution and collection tasks which frees users from having to worry about complex fault-tolerance and scalability topics.
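
As a minimal sketch of the scatter/gather idea (using the PySpark setup introduced in the next sections), the map step below runs on each partition in parallel (scatter), while reduceByKey and collect() bring the partial results back together (gather):

import pyspark

sc = pyspark.SparkContext('local[*]')

# Scatter: the list is split into partitions and mapped in parallel
pairs = sc.parallelize(["spark", "hadoop", "spark", "kafka"]).map(lambda w: (w, 1))

# Gather: per-key partial counts are combined and collected back to the driver
print(pairs.reduceByKey(lambda a, b: a + b).collect())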

Spark applications can be developed in a variety of languages such as Python, Java, Scala, and SQL. In this course, we focus on the Python interface for Spark, called PySpark. Spark runs applications through its internal interface called SparkContext. The next sections will explain how to develop and run Spark applications using the SparkContext.

Getting started with PySpark

So how are we going to use this engine? While Spark itself is written in Scala, PySpark gives us Python programmers a library to run Spark. To install the PySpark module, run the following to create a Python virtualenv and install pyspark:

# create and activate virtualenv if you haven't done so already
python3.7 -m venv venv
source venv/bin/activate

# install pyspark
pip install pyspark
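
Note that PySpark also needs a Java runtime available on the machine. As a quick sanity check (optional, not part of the original setup steps), you can confirm that the module imports and print its version:

# Verify the installation from inside the virtualenv
import pyspark
print(pyspark.__version__)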

SparkContext: how to interact with data

Now that we have PySpark installed, let's figure out how to actually use it. Start a new notebook in your Python environment and follow along with these exercises. Our first goal is to be able to load and transform our data files. Because Spark is designed to work with data in HDFS (Hadoop Distributed File System) and other big data storage tools as well as local data, it has its own set of data interface objects. A SparkContext is the most basic of these for connecting to your data. In larger-scale operations, a SparkContext could be configured to work across clusters or other cloud computing setups, but for now we're just using the little local cluster that is the machine running this notebook. To do this, we create a SparkContext object with the 'local[*]' option. The [*] gives this SparkContext access to all local cores; you can manually set this number lower if you would like to limit the context. Let's initialize a context:

import pyspark
sc = pyspark.SparkContext('local[*]')

Resilient Distributed Datasets (RDD)

Spark's core data structure is the Resilient Distributed Dataset, or RDD. As the name "distributed" suggests, Spark spreads the data out and works on it in parallel. The resiliency comes from the redundant nature of that distribution, similar to what we saw in HDFS. In our little local example, we won't be leveraging all of this power, but we can start by taking a simple Python data structure and parallelizing it into an RDD with the following code:


# reuse the SparkContext (sc) created above

list_of_arrivals = [
    ("PDX", 1),
    ("LAX", 5),
    ("DEN", 3),
    ("PDX", 2),
    ("JFK", 9),
    ("DEN", 5),
    ("PDX", 7),
    ("JFK", 10),
]
arrivals_rdd = sc.parallelize(list_of_arrivals)
print(arrivals_rdd)

Great! We have an RDD object! But how do we see what is in our RDD? In order to avoid unnecessary computation or memory usage, Spark uses lazy evaluation.

Nothing is computed before it is needed. In order to view the data, we need to tell Spark to collect the distributed data and luckily the .collect() method does just that. It is worth noting that collect() does exactly what it sounds like and collects all the data distributed across the nodes by the RDD. This means it is not always the best way to view our data once we are using truly big data that cannot all be collected on one node. For those cases, we can use tools like count() to see how much data we have without collecting that data together:

print("This is our RDD", arrivals_rdd.collect())
print("It has {} elements".format(arrivals_rdd.count()))

Filter data

Now that we've created our RDD, let's go over some basic functions. First, let's look at filtering data based on a condition. If you're not familiar with Python lambda functions, here is a quick overview: a lambda is a small anonymous function defined in a single expression. We will use one to filter our data. The x in our lambda function is one of the elements of our RDD; for this RDD, it is a tuple of ('airport code', 'arrival count'). Let's filter the data to just Portland arrivals:

pdx_arrivals = arrivals_rdd.filter(lambda x: x[0] == "PDX")
print(pdx_arrivals.collect())

Group by key

In addition to filtering our data, we may also want to pull the data together based on a key. In our tuples, the first element, the airport code, functions as a key:

grouped_arrivals = arrivals_rdd.groupByKey()
print(grouped_arrivals.collect())

As you can see above, this creates an iterable object for each airport code. This is the lazy evaluation noted earlier: the iterable allows all the arrival counts to be grouped together without having to load them into memory. If we would like to combine the arrival counts in each group, we can use the mapValues() function. The argument of this function is the function to apply to each group's iterator. In our case, we want the total arrivals per airport code, so we can use the basic Python sum function as the argument:

grouped_arrivals_count = grouped_arrivals.mapValues(sum)
print(grouped_arrivals_count.collect())
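
As a side note, groupByKey() ships every individual value across the cluster before we sum them. When all we need is a per-key total, reduceByKey() is usually the cheaper choice because it combines values on each partition before shuffling. A minimal sketch on the same RDD:

# Same result as groupByKey().mapValues(sum), but values are pre-aggregated
# on each partition before being shuffled
summed_arrivals = arrivals_rdd.reduceByKey(lambda a, b: a + b)
print(summed_arrivals.collect())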

Conclusion


You just learned the basics of using Spark RDD transformations. To continue learning Spark SQL and DataFrames, please refer to the Introduction to Apache Spark episode (/docs/ch3/c3-intro) of our Data Engineering Bootcamp. That episode will teach you how to use Spark to read, transform, and write data files. Have fun!