Spark RDD with an example

Brinda Potluri
5 min read · Dec 1, 2023

Interested in learning Big Data concepts, from basic to advanced, through short articles? Hit the follow button!

In the previous article, we focused on Spark and an introduction to Databricks. Now, let’s look at Spark RDD.

Blocks and Partitions in RDD:

In a Spark cluster, the data nodes are better called worker nodes. Memory means temporary storage (RAM). When blocks are loaded from disk into memory, we call them partitions (a block is on disk, a partition is in memory). All the partitions together form an RDD. An RDD is therefore a distributed in-memory collection: it is spread across a cluster of machines, and it lives in memory.

If I have a file in HDFS and need to load it into memory, I write Spark code that loads it into an RDD; let’s call it RDD1. Then we apply the necessary transformations, such as ‘map’ and ‘filter’, and get the results through ‘collect()’.

RDD1 = load the file from HDFS to memory

RDD2 = RDD1.map

RDD3 = RDD2.filter

RDD3.collect()

When you execute the first 3 lines of code, nothing is computed; internally, Spark only adds these lines to a plan of what to execute later. Once we call ‘collect’, the whole plan is executed.

There are 2 kinds of operations in Apache Spark:

  1. Transformation (the 1st 3 lines of code).
  2. Action (last line of code (collect)).

Transformations are lazy but actions are not!
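The same behaviour can be imitated in plain Python with generators. This is only an analogy, not Spark code: the names ‘load’, ‘calls’ and the sample numbers are made up for illustration. The first three lines build a plan, and work happens only when the last line asks for results.

```python
calls = []  # records every time real work happens


def load():
    """Stand-in for loading a file: yields records one at a time."""
    for n in [1, 2, 3, 4]:
        calls.append(("load", n))
        yield n


rdd1 = load()                        # "load": nothing runs yet
rdd2 = (n * 10 for n in rdd1)        # "map": still nothing runs
rdd3 = (n for n in rdd2 if n > 20)   # "filter": still nothing runs

work_before_action = len(calls)      # no records have been touched so far

result = list(rdd3)                  # "collect": now the whole chain executes
work_after_action = len(calls)       # every record has been loaded once
```

Before the final line, ‘calls’ is empty; afterwards it holds one entry per record, which is the plain-Python version of "transformations are lazy, actions are not".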

If we draw the 4 steps as a flow chart (the execution plan), the result is called a DAG (Directed Acyclic Graph).

RDD stands for Resilient Distributed Dataset; the R means resilient to failure. Example: if we somehow lose RDD3, we don’t have to start from the beginning (the execution plan might be very big). Spark can take the parent of RDD3, which is RDD2, and apply ‘filter’ to it again. We don’t have to do this ourselves; the system does it. Spark achieves this fault tolerance by keeping track of the lineage (the flow path) of every RDD.

RDDs are immutable, i.e., we cannot change an existing RDD; each transformation creates a new one.
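Here is a minimal plain-Python sketch of recovery through lineage. It is not how Spark is implemented; the ‘Step’ class and the ‘runs’ list are hypothetical names used only to show the idea. It also assumes RDD2 is still available (for example, because it was cached); if it were not, the recomputation would simply walk further back along the lineage.

```python
runs = []  # names of the steps that actually executed


class Step:
    """One node in the lineage: knows its parent and how to derive itself."""

    def __init__(self, name, fn, parent=None):
        self.name, self.fn, self.parent = name, fn, parent
        self.data = None  # None means "lost or not computed yet"

    def compute(self):
        if self.data is None:
            parent_data = self.parent.compute() if self.parent else None
            runs.append(self.name)
            self.data = self.fn(parent_data)
        return self.data


rdd1 = Step("load", lambda _: [1, 2, 3, 4])
rdd2 = Step("map", lambda rows: [n * 10 for n in rows], parent=rdd1)
rdd3 = Step("filter", lambda rows: [n for n in rows if n > 20], parent=rdd2)

first = rdd3.compute()       # runs load, map and filter
rdd3.data = None             # simulate losing RDD3
runs.clear()
recovered = rdd3.compute()   # rebuilt from its parent RDD2
steps_rerun = list(runs)     # only the filter step ran again
```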

Why are transformations lazy?

Let’s say we have 1 billion records, we load them in the first step, and the next step just prints the first 5 records. Loading everything would be a waste of resources. That’s why Spark only records the transformations at first. Once we call an action, it knows the whole plan, can work out what is required and what is not, and does only the necessary work.

RDD1 = To load the file from HDFS (1 billion)

RDD2 = RDD1.map (1 billion)

RDD3 = RDD2.filter (5 records)

RDD3.collect()

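If Spark were not lazy, it would load all 1 billion records, store the 1 billion mapped records as RDD2, and only then filter them down to 5. Because Spark is lazy, it sees the whole plan before running anything. It does not reorder the steps here (the filter works on the output of the map, so every record still passes through both), but it pipelines them: each record is loaded, mapped, and filtered in a single pass, and the intermediate 1-billion-record RDD2 is never stored. And if the action needs only a few records, such as the first 5, Spark reads only as many partitions as it needs instead of the whole file.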
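A plain-Python sketch can show how much work laziness saves when only 5 results are needed. This is an analogy rather than Spark code: ‘mapper’, ‘map_calls’, and the doubling and divisible-by-3 rules are made up for illustration, and ‘islice’ plays the role of an action that asks for just the first 5 records.

```python
from itertools import islice

map_calls = 0  # how many records the map step actually processed


def mapper(n):
    """The "map" step: counts its own calls so we can measure the work done."""
    global map_calls
    map_calls += 1
    return n * 2


records = range(1_000_000)                        # stand-in for the huge file
mapped = map(mapper, records)                     # lazy, nothing runs yet
filtered = filter(lambda n: n % 3 == 0, mapped)   # lazy, nothing runs yet

first_five = list(islice(filtered, 5))            # ask for only 5 results
```

Each record flows through the map and the filter one at a time, and the chain stops as soon as 5 results are found, so ‘map_calls’ ends up being a tiny fraction of the million records available.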

Apache Spark with word count example:

We begin by creating a Spark Session. It is the entry point to the Spark cluster; without it, we would just be working on our gateway node.

To load the file from HDFS into Spark, we first go through the Spark Session to reach the Spark Context, and then use the Spark Context to read the file.
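The article’s original code is not shown here, so the following is a sketch of one common way to do these two steps in PySpark. The function names ‘start_session’ and ‘load_lines’ and the application name are assumptions made for illustration; ‘SparkSession.builder’, ‘sparkContext’ and ‘textFile’ are the standard PySpark calls.

```python
def start_session(app_name="word-count-sketch"):
    """Create (or reuse) a SparkSession, the entry point to the cluster."""
    # Imported inside the function so this file can still be imported
    # on a machine that does not have PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_lines(spark, path):
    """Go from the SparkSession to its SparkContext and load a text file."""
    sc = spark.sparkContext
    return sc.textFile(path)  # lazy: this only adds the load step to the plan
```

A typical use would be ‘spark = start_session()’ followed by ‘rdd1 = load_lines(spark, path)’, where ‘path’ is the HDFS location of your own input file.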

flatMap works row by row and flattens the result, giving a single list of all the words. Here x is just the variable we use to refer to each line (row), and the lambda function performs the operation ‘split’ on it. (Let’s say our input is “big data is interesting”; the output is four rows, one per word.)
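To see what "flatten" means, here is a plain-Python comparison of a map-style split and a flatMap-style split on the sample input. The helper name ‘split_line’ is hypothetical; in PySpark the corresponding call would look like ‘rdd.flatMap(lambda x: x.split(" "))’.

```python
def split_line(x):
    """The operation applied to each row: split a line into words."""
    return x.split(" ")


lines = ["big data is interesting"]

# map-style: one row in, one row out (the row is a list of words)
mapped = [split_line(x) for x in lines]

# flatMap-style: one row in, many rows out (the lists are flattened)
flat_mapped = [word for x in lines for word in split_line(x)]
```

With map we would get one row holding a list of four words; with flatMap we get four rows, one word each.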

Now each word is in its own row. When we want an input of n rows to produce an output of n rows, we use ‘map’. Here we map each word to a pair with ‘1’, so ‘big’ becomes (‘big’, 1).
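In plain Python, this pairing step can be sketched as follows; in PySpark the corresponding call would look like ‘rdd.map(lambda x: (x, 1))’. The sample words are the ones from the earlier input.

```python
words = ["big", "data", "is", "interesting"]

# map: every input row produces exactly one output row, a (word, 1) pair
pairs = list(map(lambda x: (x, 1), words))
```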

reduceByKey first brings all the pairs with the same key together (identical words end up in one group; this is grouping, not sorting) and then adds up their values to give the count of each word. (Let’s say our input was “data big data big data”; the output is ‘data’ with a count of 3 and ‘big’ with a count of 2.)
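The two phases, grouping by key and then reducing each group, can be sketched in plain Python like this. It only illustrates the idea on one machine; in PySpark the corresponding call would look like ‘rdd.reduceByKey(lambda a, b: a + b)’, and Spark performs the grouping across the cluster.

```python
from collections import defaultdict
from functools import reduce

pairs = [(word, 1) for word in "data big data big data".split(" ")]

# Phase 1: bring all values with the same key together
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Phase 2: reduce each group with the same function reduceByKey would get
counts = {
    key: reduce(lambda a, b: a + b, values)
    for key, values in groups.items()
}
```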

collect() brings the entire result back to the gateway node. Our data might be 1 TB, and all of it would be loaded onto that single machine. Therefore, for large results we should write the output to HDFS directly from the cluster instead of collecting it.
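The original command is not shown here. One common way to write an RDD to HDFS is ‘saveAsTextFile’, so the sketch below assumes that method; the function name ‘word_count_to_hdfs’ is hypothetical. It puts the whole word count together, taking an RDD of lines and an output directory.

```python
def word_count_to_hdfs(lines_rdd, output_path):
    """Count words in an RDD of lines and write the result to output_path."""
    counts = (
        lines_rdd
        .flatMap(lambda x: x.split(" "))   # one row per word
        .map(lambda x: (x, 1))             # pair every word with 1
        .reduceByKey(lambda a, b: a + b)   # add up the 1s per word
    )
    # Action: the workers write the result themselves, one part file per
    # partition, so nothing has to fit on the gateway node.
    counts.saveAsTextFile(output_path)
    return counts
```

Unlike collect(), this never gathers the full result in one place, which is why it is the safer choice when the output is large.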


References:

  1. Sumit Sir’s Big Data
