Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Spark limits it's scope to a computing engine ie spark handles loading data from storage system and performs computation on it and not permanent storage as the end itself . Spark focuses on performing computing over data , no matter where it resides whereas Hadoop is Map reduce + HDFS.
Spark's final component is it's libraries which build on it's design as a unified engine to provide a unified API for common data analysis task .
Apache Spark began at UC-Berkeley in 20019 as spark research project . Research was on how to optimise machine learning in which data is required to be read 10 or 20 times . MR(map reduce) - each step is is a job which were launched separately on cluster and load data from scratch . For this Spark API perform efficient in-memory data sharing across computation step.
Spark manages and coordinates the execution of tasks on data across a cluster of computer. The cluster of machines that spark will use to execute tasks is managed by a cluster manager like spark's standalone cluster manager , YARN or Mesos .
Spark application consists of a driver process and a set of executor process . Driver process runs main() function , sits on a node in a cluster and is responsible for : - Maintaining info. about spark application .
- Responding to user's program or i/p .
- Analysing , distributing and scheduling work across executors.
Executors are responsible for actually carrying out work that driver assigns them . Also reporting state of completion on executor back to driver node.
In local mode driver and executors run (as threads) on individual computer instead of a cluster.
- Spark employs a cluster manager that keeps track of resource available.
- Driver process is responsible for executing the driver programs commands across the executors to complete a given task.
- Executors will always be running spark code.
Spark API:
1 . Low - Level API - unstructured API's (eg RDD etc)
2 . High-Level API - structured API's (eg Data Frames etc).
Spark Session manages spark application : we control spark application through a driver process called spark Session . There is 1:1 correspondence b/w spark session and spark application .
eg : Simple spark program
DATA FRAMES : Most common structured API , Simply represents a table of data with rows and columns . The list that defines the columns and the types with those columns is called the schema. It is a spreadsheet with name given as spark data frame ,it span over thousands of computers.Python data frame resides on one system/computer only ,we can although convert those to spark df which is distributed.
PARTITION: To allow executors to perform work in parallel , Spark breaks up data into chunks called partitions . A partition is a collection of rows that sit on one physical machine in the cluster.If we have many partitions but only one executor spark still have a parallelism but only one because there is one computational resource.
In Spark core data structures are immutable ie they cannot be changed. To change or get new value we use transformations (from one to another immutable).
eg :
Transformations are the core of how we express our business logic using spark. There are 2 types of transformations -
1 : Narrow transformation (1 to 1 ) : Each partition to another partition eg - filter transformation .
2 : Wide Transformation (1 to N shuffles) : Each partition to multiple shuffles eg - aggregation , sort (rows compared with each other).
With narrow transformations , spark will automatically perform an operation called pipelining ie if we specify multiple filters on df (data frames) they all will be performed in memory , whereas when we perform a shuffle spark writes result to disk .
Lazy Evaluation : It means spark will wait until the very last moment to execute the graph of computation instruction . Spark compiles plan from raw df transformation to a streamlined physical plan that will run as efficiently as possible across the clusters. eg - predicate pushdown on df ie if we build up a large spark job and specify filter at end to fetch only one row then most efficient way to execute is to access the single record we need , Spark will automatically optimise this by pushing down filters automatically.
Actions : To trigger the computation , we run an action . An action instructs spark to compute a result from a series of transformation. eg - divisibleBy2.count() There are 3 kinds of actions :
1: Action to view data in console.
2: Action to collect data to native objects in respective languages.
3: Action to write output to data sources.
Spark UI : To monitor progress of a job through webUI we use port 4040.
To read data we use a DataFrameReader associated with sparkSession .
- schema inference :- Spark takes best guess at what schema of our df should be it does this by reading first few rows and tries to interpret data types while parsing. In production we define schema.
eg :
Each df have a set of columns with unspecified number of rows . The reason being reading data is transformation and it is a lazy operation.
eg : sort transformation :- it does not modify a df , rather it returns a new df .
Explain Plan : Read plan from top to bottom , top being end result and bottom being source of data .
By default when we perform a shuffle , spark outputs 200 shuffle partitions . We can set value by : spark.conf.set("spark.sql.shuffle.partitions","5").
The logical plan of transformations that we build up defines a lineage for the df so that at any given point in time , spark knows how to recompute any partition by performing all the operations it had performed before on the same input data.
SPARK SQL : With spark SQL , we can register any df as a table or view (temporary table) and query it using pure SQL . There is no performance difference between writing SQL queries or writing df code , they both compile to same underlying plan that we specify in df code . We prefer df code over SQL queries as they are scalable ie in case we need to change any table from oracle to hive changing in df code is easy compared to sql queries.
eg :
Max transformation : It scans each value in relevant column in df and checks whether it is greater than previous value that has been seen or not. This is a transformation because we are effectively filtering down to one row. eg:
Comments