Fundamentals of BIG DATA with PySpark

Aruna Singh
Jun 4, 2021 · 9 min read


This article introduces the exciting world of Big Data, along with the key concepts and the different frameworks for processing it. You will also understand why Apache Spark is considered the best framework for Big Data.

Big Data concepts and Terminology

What exactly is Big Data? It is a term that refers to the study and application of data sets that are too complex for traditional data-processing software.

There are three Vs of Big Data that are used to describe its characteristics: Volume refers to the size of the data, Variety refers to the different sources and formats of the data, and Velocity is the speed at which the data is generated and made available for processing.

Now, let’s take a look at some of the concepts and terminology of Big Data.

Clustered computing is the pooling of the resources of multiple machines to complete jobs.

Parallel computing is a type of computation in which many calculations are carried out simultaneously.

Distributed computing involves nodes, or networked computers, that run jobs in parallel.

Batch processing refers to breaking the data into smaller pieces and running each piece on an individual machine.

Real-time processing demands that information is processed and made ready immediately.

Big Data Processing Systems

Hadoop/MapReduce: An open-source, scalable framework for processing batch data.

Apache Spark: It is also open source and is suited for both batch and real-time data processing. It is a fast and general-purpose framework for Big data processing. Apache Spark provides high-level APIs in Scala, Java, Python, and R. It runs most computations in memory and thereby provides better performance for applications such as interactive data mining.

It is a powerful alternative to Hadoop MapReduce, with rich features like machine learning, real-time stream processing, and graph computations. At the center of the ecosystem is the Spark Core which contains the basic functionality of Spark. The rest of Spark’s libraries are built on top of it.

Spark runs in two modes. The first is local mode, where you run Spark on a single machine such as your laptop; it is very convenient for testing, debugging, and demonstration purposes. The second is cluster mode, where Spark runs on a cluster; this mode is mainly used for production.

Spark’s Python API: PySpark

Apache Spark is originally written in the Scala programming language. To support Python with Spark, PySpark was developed, offering computation power similar to Scala. The APIs in PySpark are similar to the pandas and scikit-learn Python packages.

Spark comes with an interactive Python shell in which PySpark is already set up. It is particularly helpful for fast interactive prototyping before running jobs on a cluster. Unlike most other shells, the Spark shell allows you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of distributing the processing automatically.

Spark provides the shell in three programming languages: spark-shell for Scala, PySpark for Python, and sparkR for R. Similar to the Scala shell, the PySpark shell has been augmented to support connecting to a cluster.

In the PySpark shell, a SparkContext represents the entry point to Spark functionality. PySpark automatically creates a SparkContext for you in the PySpark shell (so you don't have to create it yourself) and exposes it via a variable named sc. It’s like the key to your house: without the key you cannot enter the house, and similarly, without an entry point you cannot run any PySpark jobs.

Now, let’s take a look at some of the important attributes of SparkContext.

sc.version shows the version of Spark that you are currently running.

sc.pythonVer shows the version of Python that Spark is currently using.

sc.master shows the URL of the cluster, or the string “local” when running in local mode.
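
For instance, inside the PySpark shell (where sc already exists) you can inspect these attributes directly; the values shown in the comments are only illustrative:

```python
# Inside the PySpark shell, sc is created for you automatically.
print(sc.version)    # e.g. '3.1.2'   -- the Spark version you are running
print(sc.pythonVer)  # e.g. '3.8'     -- the Python version Spark is using
print(sc.master)     # e.g. 'local[*]' or the URL of the cluster master
```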

The easiest way to demonstrate the power of PySpark’s shell is to start using it. Let’s take an example of a simple list containing numbers ranging from 1 to 100 in the PySpark shell.

The most important thing to understand here is that we are not creating any SparkContext object, because PySpark automatically creates the SparkContext object named sc by default in the PySpark shell.
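
A minimal sketch of that example, assuming you are in the PySpark shell where sc is already available:

```python
# Create a Python list of numbers from 1 to 100
numbers = list(range(1, 101))

# Distribute the list across the cluster as an RDD using the existing sc
numbers_rdd = sc.parallelize(numbers)

# Run a simple action to confirm the RDD works
print(numbers_rdd.count())  # 100
```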

You can load your raw data into PySpark using the SparkContext by two different methods, which will be discussed later:

  1. SparkContext’s parallelize method
  2. SparkContext’s textFile method

Use of Lambda Function in Python

Python supports the creation of anonymous functions, that is, functions that are not bound to a name at runtime, using a construct called lambda. Lambdas are used in conjunction with typical functional concepts like the map and filter functions. Like def, lambda creates a function to be called later in the program. Let’s look at some of its uses:

Lambda with map(): The map function is called with all the items in a list and returns a new list containing the result of applying the function to each item.
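
A small illustration with plain Python (no Spark needed here):

```python
items = [1, 2, 3, 4]

# map() applies the lambda to every item; wrap it in list() to see the result
squared = list(map(lambda x: x * x, items))
print(squared)  # [1, 4, 9, 16]
```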

Lambda with filter(): The filter function is called with all the items in a list and returns a new list containing only the items for which the function evaluates to True.
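
For example:

```python
items = [1, 2, 3, 4, 5, 6]

# filter() keeps only the items for which the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, items))
print(evens)  # [2, 4, 6]
```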

Introduction to PySpark RDD

An RDD (Resilient Distributed Dataset) is simply a collection of data distributed across the cluster. It is the fundamental and backbone data type in PySpark.

Now, let’s take a look at the different features of RDD. The name RDD captures 3 important properties:

Resilient, which means the ability to withstand failures and recompute missing or damaged partitions.

Distributed, which means spanning the jobs across multiple nodes in the cluster for efficient computation.

Datasets, which means a collection of partitioned data, e.g., arrays, tables, tuples or other objects.

There are three different methods for creating RDDs; you have already seen two of them, which were mentioned earlier.

From a collection (SparkContext’s parallelize method):
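
For example, again assuming the PySpark shell’s sc:

```python
# Create an RDD from an in-memory Python collection
words_rdd = sc.parallelize(["hello", "big", "data", "with", "pyspark"])
print(words_rdd.count())  # 5
```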

From external datasets (SparkContext’s textFile method):

You can load files stored in HDFS, objects in Amazon S3 buckets, or lines from a text file stored locally by passing the path to SparkContext’s textFile method.
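
A short sketch, assuming a local file named sample.txt exists (the file name is a placeholder):

```python
# Load a local text file; each line becomes one element of the RDD
# (an HDFS path like 'hdfs://...' or an S3 path like 's3a://...' works the same way)
lines_rdd = sc.textFile("sample.txt")
print(lines_rdd.first())  # prints the first line of the file
```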

From existing RDDs (transforming RDDs):

A transformation is the way to create a new RDD from an already existing RDD.
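
For example, applying a map transformation to an existing RDD produces a new one:

```python
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# map() is a transformation: it produces a brand-new RDD,
# while the original numbers_rdd is left unchanged (RDDs are immutable)
doubled_rdd = numbers_rdd.map(lambda x: x * 2)
print(doubled_rdd.collect())  # [2, 4, 6, 8, 10]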

Partitioning in PySpark

Data partitioning is an important concept in Spark, and understanding how Spark deals with partitions allows one to control parallelism. A partition in Spark is a logical division of a large dataset, with the parts stored in multiple locations across the cluster. By default, Spark partitions the data at the time of creating the RDD based on several factors, such as the available resources and the external dataset; however, this behavior can be controlled by passing a second argument, called minPartitions, which defines the minimum number of partitions to be created for an RDD.
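
A minimal sketch of controlling partitions, assuming the PySpark shell:

```python
# Let Spark decide the number of partitions
default_rdd = sc.parallelize(range(1, 101))
print(default_rdd.getNumPartitions())

# Ask for 6 partitions via the second argument
partitioned_rdd = sc.parallelize(range(1, 101), 6)
print(partitioned_rdd.getNumPartitions())  # 6

# textFile accepts a minPartitions argument as well, e.g.
# lines_rdd = sc.textFile("sample.txt", minPartitions=4)
```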

Introduction to RDDs in PySpark

There are various operations that RDDs support in PySpark. RDDs in PySpark support two different types of operations: Transformations and Actions.

Transformations are operations on RDDs that return a new RDD.

Actions are operations that perform some computation on the RDD and return a value.

Lazy evaluation: the most important feature, which helps RDDs with fault tolerance and with optimizing resource use. Spark creates a graph from all the operations you perform on an RDD, and execution of the graph starts only when an action is performed on the RDD.
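
A small demonstration of this laziness, using a hypothetical slow function purely for illustration:

```python
import time

def slow_double(x):
    time.sleep(0.01)  # simulate an expensive computation
    return x * 2

rdd = sc.parallelize(range(1000))

# This returns immediately: map() only records the transformation in the graph
doubled = rdd.map(slow_double)

# The work actually happens here, when an action is called
print(doubled.count())  # 1000
```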

Transformations on RDDs:

The map() takes in a function and applies it to each element in the RDD.

The filter() takes in a function and returns an RDD that only has elements that pass the condition.

The flatMap() is similar to map transformation except it returns multiple values for each element in the source RDD.

The union() returns the union of one RDD with another RDD.
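
A combined sketch of these four transformations, assuming the PySpark shell:

```python
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])
lines_rdd = sc.parallelize(["hello world", "big data"])

squared_rdd = numbers_rdd.map(lambda x: x * x)               # [1, 4, 9, 16, 25]
even_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)          # [2, 4]
words_rdd = lines_rdd.flatMap(lambda line: line.split(" "))  # ['hello', 'world', 'big', 'data']
combined_rdd = numbers_rdd.union(even_rdd)                   # [1, 2, 3, 4, 5, 2, 4]

print(combined_rdd.collect())
```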

Actions on RDDs:

The collect() action returns the complete list of elements from the RDD.

The take(N) action returns the first N elements from the RDD.

The count() action returns the total number of rows/elements in the RDD.

The first() action returns the first element in an RDD.
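
And the four actions in use:

```python
numbers_rdd = sc.parallelize([10, 20, 30, 40, 50])

print(numbers_rdd.collect())  # [10, 20, 30, 40, 50] -- the full list of elements
print(numbers_rdd.take(3))    # [10, 20, 30]         -- the first 3 elements
print(numbers_rdd.count())    # 5                    -- total number of elements
print(numbers_rdd.first())    # 10                   -- the first element
```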

Introduction to pair RDDs in PySpark

Real-world datasets are generally key/value pairs. Each row is a key that maps to one or more values. In order to deal with this kind of dataset, PySpark provides a special data structure called pair RDDs. In a pair RDD, the key refers to the identifier, whereas the value refers to the data.

The two most common ways of creating pair RDDs are as follows:

The first step in creating pair RDDs is to get the data into key/value form.

Next, we create a pair RDD using the map function, which returns tuples of key/value pairs, with the name as the key and the age as the value, as sketched below.
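
A minimal sketch, using a made-up list of name/age strings:

```python
# Raw data as "name age" strings (sample values, just for illustration)
raw_rdd = sc.parallelize(["Sam 23", "Mona 28", "Raj 31"])

# Split each record and build (name, age) tuples: name is the key, age the value
pair_rdd = raw_rdd.map(lambda record: (record.split(" ")[0], int(record.split(" ")[1])))
print(pair_rdd.collect())  # [('Sam', 23), ('Mona', 28), ('Raj', 31)]

# A pair RDD can also be created directly from a list of tuples
pair_rdd2 = sc.parallelize([("Sam", 23), ("Mona", 28)])
```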

Transformations on pair RDDs:

reduceByKey: The most popular pair RDD transformation, which combines the values with the same key using a function. reduceByKey runs several parallel operations, one for each key in the dataset, and returns a new RDD consisting of each key and the reduced value for that key.

sortByKey: We can sort a pair RDD as long as there is an ordering defined on the key; it returns an RDD sorted by key in ascending or descending order.

groupByKey: It groups all the values with the same key in the pair RDD.

join: Applying the join transformation merges two pair RDDs together by grouping elements based on the same key.
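
A combined sketch of these pair RDD transformations, with made-up sample data:

```python
scores_rdd = sc.parallelize([("alice", 5), ("bob", 3), ("alice", 2), ("bob", 4)])
ages_rdd = sc.parallelize([("alice", 30), ("bob", 25)])

# reduceByKey: sum the values for each key
totals_rdd = scores_rdd.reduceByKey(lambda x, y: x + y)
print(totals_rdd.collect())            # [('alice', 7), ('bob', 7)] (order may vary)

# sortByKey: sort by key in descending order
print(totals_rdd.sortByKey(ascending=False).collect())

# groupByKey: group all values that share a key
grouped = scores_rdd.groupByKey().mapValues(list)
print(grouped.collect())               # [('alice', [5, 2]), ('bob', [3, 4])] (order may vary)

# join: merge two pair RDDs on their keys
print(totals_rdd.join(ages_rdd).collect())  # [('alice', (7, 30)), ('bob', (7, 25))]
```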

Advanced Actions on pair RDDs:

reduce() operates on two elements of the same type from the RDD and returns a new element of the same type. The function passed to it should be commutative and associative so that it can be computed correctly in parallel.
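
For example, summing the elements of an RDD with addition, which is commutative and associative:

```python
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# Addition is commutative and associative, so it is safe to compute in parallel
total = numbers_rdd.reduce(lambda x, y: x + y)
print(total)  # 15
```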

In many cases, it is not advisable to run the collect action on RDDs because of the huge size of the data. In these cases, it’s common to write the data out to a distributed storage system.

saveAsTextFile() saves the RDD with each partition as a separate file inside a directory by default. However, you can first reduce the RDD to a single partition using the coalesce method so that it is written out as a single file.
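
A sketch, with hypothetical output directory names:

```python
numbers_rdd = sc.parallelize(range(1, 101), 4)

# Writes one part-* file per partition into the 'output_numbers' directory
numbers_rdd.saveAsTextFile("output_numbers")

# Coalesce into a single partition first so only one part file is written
numbers_rdd.coalesce(1).saveAsTextFile("output_numbers_single")
```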

countByKey() is only available on RDDs of type (key, value). With the countByKey operation, we can count the number of elements for each key. One thing to note is that countByKey should only be used on a dataset whose size is small enough to fit in memory.
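
For example:

```python
pair_rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# countByKey returns a dictionary-like object mapping each key to its element count
print(dict(pair_rdd.countByKey()))  # {'a': 3, 'b': 1}
```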

collectAsMap() returns the key-value pairs in the RDD to the driver as a dictionary, which can be used for downstream analysis. Similar to countByKey, this action should only be used if the resulting data is expected to be small, as all of it is loaded into the driver’s memory.
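
For example:

```python
pair_rdd = sc.parallelize([("a", 2), ("b", 4), ("c", 6)])

# collectAsMap brings the pairs back to the driver as a plain Python dictionary;
# if a key appears more than once, only one of its values is kept
result = pair_rdd.collectAsMap()
print(result)  # {'a': 2, 'b': 4, 'c': 6}
```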

Hence, we have covered an overview of Big Data fundamentals with PySpark and, in the process, learned some useful syntax for RDD transformations and actions. There is another article that covers SQL, DataFrames, and machine learning with MLlib; to explore it, you can click on this link.

Stay Connected and Enjoy Reading!

