Introduction to Spark and Databricks
Interested in learning Big Data concepts, from the basics to advanced, just through small articles? Hit the follow button!
In previous articles, we focused on Hadoop architecture, data warehousing techniques, and MapReduce in depth; now let's look at Spark and an introduction to Databricks.
What is Spark?
It is a general-purpose, in-memory compute engine that, unlike MapReduce, supports real-time processing.
Apache Spark is a computing engine that replaces only the MapReduce layer; since it does not include storage or resource management of its own, we need to plug it into a storage layer (like HDFS, S3, or ADLS Gen2) and a resource manager (like YARN, Mesos, or Kubernetes).
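As a minimal sketch (in PySpark) of what that plugging-in looks like, here is a session started against a resource manager and reading from external storage; the YARN master and the S3 bucket/path are placeholders I've assumed for illustration:

```python
# Minimal sketch: Spark plugged into a resource manager (YARN) and a storage
# layer (S3). The bucket name and path below are made-up placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("intro-to-spark")
    .master("yarn")   # resource manager: YARN (could also be Kubernetes or local[*])
    .getOrCreate()
)

# Storage is pluggable: the same API reads from HDFS, S3, ADLS Gen2, etc.
df = spark.read.csv("s3a://my-bucket/input/data.csv", header=True)
df.show(5)
```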
Why general-purpose?
Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on a single node or on a cluster. That's why we say Apache Spark is general-purpose (it serves DE, DS, and ML alike). The languages it supports are Scala, Python, Java, and R.
What is an in-memory compute engine?
Let's say we have 5 MapReduce (MR) jobs. MR1 takes the data from HDFS, processes it, and writes its output back to disk (HDFS); MR2 then reads that output from disk, processes it, and writes it back to disk, and so on. When we have a series of MR jobs where the output of one is the input of the next, a lot of disk I/O (input/output) is involved, and reading from and writing to disk so many times is time-consuming.
In Spark, by contrast, once we load the data, the first transformation processes it and passes its output straight to the next transformation. Everything happens in memory, and we only take the final answer from the last transformation. (In practice there may still be some disk I/O; the scenario we just saw is the ideal case.)
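As a rough illustration, here is a hedged PySpark sketch of such a chain; the column names and HDFS paths are invented for the example. Each step would roughly correspond to a separate MR job, but in Spark the intermediate results stay in memory until the final write:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders")      # load once from disk

result = (
    orders
    .filter(F.col("status") == "COMPLETE")               # step 1: filter
    .withColumn("year", F.year("order_date"))            # step 2: derive a column
    .groupBy("year", "customer_id")                       # step 3: aggregate
    .agg(F.sum("amount").alias("total_spend"))
)

# Only this final step writes back to disk.
result.write.mode("overwrite").parquet("hdfs:///output/customer_spend")
```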
In MapReduce, we were mainly focused on how to do it, whereas in Spark we mainly focus on what to do: Spark abstracts away the fact that our code runs across a cluster.
Apache Spark vs. Databricks:
Apache Spark is open source (the source code is free; you can modify and use it as per your needs).
Databricks is a company, and its product is also named Databricks. Databricks is still Spark internally, but with extra features:
- It is like Spark on the cloud (AWS, Azure, or GCP).
- It is optimized and fine-tuned; Databricks claims it can be up to 5x faster than open-source Spark.
- Cluster management is possible with just a few clicks.
- Code can easily be shared with colleagues for collaboration.
- Delta Lake.
- Good security features.
Databricks was founded by the people who developed Spark. Spark (and therefore Databricks) exposes two layers of APIs: the Spark core APIs (we can write code in Scala, Python, Java, and R against them) and the higher-level APIs (developed to make our life easier). We will work with the higher-level APIs most of the time because they are easier to use and give better performance. When you work with the Spark core APIs, you work at the RDD (Resilient Distributed Dataset) level; we can write code based on RDDs, but it is the hardest way to work with Apache Spark. The Spark RDD is the fundamental unit of Spark that holds data.
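To get a feel for what RDD-level code looks like, here is a small word-count sketch using the core API (the classic example; the input strings are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is general purpose"])
counts = (
    lines.flatMap(lambda line: line.split(" "))   # split each line into words
         .map(lambda word: (word, 1))             # pair every word with a count of 1
         .reduceByKey(lambda a, b: a + b)         # sum the counts per word
)
print(counts.collect())   # e.g. [('spark', 2), ('is', 2), ('fast', 1), ...]
```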
Higher-level APIs available:
- Spark SQL / DataFrames.
- Structured Streaming.
- MLlib (for machine learning).
- GraphX (for graph-based processing).
When we say Spark SQL and DataFrames, this is the usual way we work on Spark. Spark SQL is the same as working with a database and its tables: whatever SQL query you write will execute across the cluster and give you parallelism, yet it won't feel like you are writing for a cluster (that is Spark's specialty). This is the easiest way to work on Spark.
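A minimal sketch of the Spark SQL way, assuming a small made-up employees dataset: register a DataFrame as a temporary view, then query it like an ordinary table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

employees = spark.createDataFrame(
    [("Asha", "DE", 90000), ("Ravi", "DS", 85000), ("Meera", "DE", 95000)],
    ["name", "team", "salary"],
)
employees.createOrReplaceTempView("employees")

# Plain SQL, but it executes across the cluster.
spark.sql("""
    SELECT team, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY team
""").show()
```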
The Spark DataFrame sits between Spark SQL and RDDs, meaning it has a medium level of complexity. Even when you write code at a higher level, it is internally converted to RDD operations.
- RDD: toughest, but the most flexible (every problem's solution can be derived at this level).
- DataFrame: medium difficulty, medium flexibility.
- Spark SQL: easiest, but the least flexible (not every solution can be coded in SQL; sometimes we simply cannot derive a solution in SQL, and the DataFrame API is a little more flexible than this).
Therefore, go to Spark SQL first; if you can't solve the problem there, go to DataFrames, and let RDDs be your last option.
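For comparison, here is the same aggregation from the Spark SQL sketch above written at the DataFrame level; it is a little more code than SQL but more flexible, and far easier than raw RDDs:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

employees = spark.createDataFrame(
    [("Asha", "DE", 90000), ("Ravi", "DS", 85000), ("Meera", "DE", 95000)],
    ["name", "team", "salary"],
)

# Same result as the SQL version, expressed with DataFrame methods.
(employees
    .groupBy("team")
    .agg(F.avg("salary").alias("avg_salary"))
    .show())
```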
When we work on Spark, we aim for 3 things (a small end-to-end sketch follows this list):
- Loading the data from a source, be it HDFS, S3, or any other data lake (these are just common sources; there are other locations too).
- Performing different transformations like filtering, joining, grouping, etc.
- Delivering the results to a target location like HDFS, S3, or any other data lake (these are just common targets; there are other locations too).
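Here is one hedged end-to-end sketch putting the three steps together; the S3 paths and column names are placeholders, not real locations:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# 1. Load from a source (S3 here; could be HDFS, ADLS Gen2, a database, ...).
sales = spark.read.json("s3a://my-bucket/raw/sales/")

# 2. Transform: filter and aggregate.
daily = (
    sales.filter(F.col("amount") > 0)
         .groupBy("sale_date", "store_id")
         .agg(F.sum("amount").alias("daily_revenue"))
)

# 3. Deliver the result to the target location.
daily.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_revenue/")
```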
Hit the clap button and comment your views if you got any value from this article (you can clap up to 50 times!); your appreciation means a lot to me :)
Feel free to connect and message me on my LinkedIn.