What is Apache Spark and History of Spark

February 8, 2016

The Spark project started at UC Berkeley's AMPLab; it is now part of the Apache Software Foundation.

Spark uses the MapReduce model to process data in parallel across hundreds to thousands of machines.

Spark jobs are memory-hungry. Spark breaks a job into smaller tasks and is intelligent about detecting task failures, so it preserves MapReduce's fault-tolerance and scalability properties. If you are familiar with directed acyclic graphs (DAGs), Spark passes the intermediate data (map output) directly to the next stage (another RDD).

Spark supports in-memory processing. Its RDD abstraction gives developers the flexibility to process data in memory: in normal MapReduce the map output is stored on the mapper's local disk, whereas in Spark it can be kept in memory and reused later. Because of this in-memory property, Spark is well suited to iterative algorithms that require multiple passes over the input data. Spark's APIs provide support for data transformation, statistics, machine learning, graph processing, and so on.

Spark's caching feature helps a lot in machine-learning workloads, where it allows passing over the training set multiple times using in-memory operations; the data can stay available in memory for a long time.
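
As a minimal sketch (assuming a spark-shell session where sc is the SparkContext, and training.txt is a placeholder input file), caching marks the RDD to be kept in memory so later passes reuse it:

val training = sc.textFile("training.txt").cache() // mark the RDD to be kept in memory
val pass1 = training.count() // first pass reads from disk and fills the cache
val pass2 = training.count() // later passes read straight from memory
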
Spark supports the Hadoop MapReduce I/O API, so using this API we can read and write data in any supported format.

Example – Spark's saveAsTextFile function
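
For instance, a hedged sketch (input.txt and output-dir are placeholder paths) of round-tripping data through the Hadoop-compatible formats:

val lines = sc.textFile("input.txt") // reads via Hadoop's TextInputFormat
lines.saveAsTextFile("output-dir") // writes via Hadoop's TextOutputFormat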

Spark Processing Model

The Spark processing model is divided into two main phases:

Transformation

In simple words, a transformation is nothing but passing data into another RDD (an RDD is an array of partitions) for the next action or transformation.

Example –

val readLines = sc.textFile("toodey.txt")
val readLines2 = readLines.map(s => s.length)
We pass the result of readLines to readLines2 by applying a map transformation.
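
Transformations are lazy; nothing runs until an action is called. As a small illustrative addition (totalChars is a hypothetical name), an action such as reduce forces the computation:

val totalChars = readLines2.reduce((a, b) => a + b) // action: sums the line lengths, triggering execution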

Action
In simple words, an action on an RDD produces and stores the final result, that is, the output data.

Example – counts.saveAsTextFile(file)

Here counts is an RDD in which we have counted words or something similar; we then use Spark's built-in saveAsTextFile function, and file is nothing but the location where the output is written.
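
As a hedged sketch of how such a counts RDD might be built (a classic word count; toodey.txt and counts-output are placeholder paths):

val counts = sc.textFile("toodey.txt")
  .flatMap(line => line.split(" ")) // split each line into words
  .map(word => (word, 1)) // pair each word with a count of 1
  .reduceByKey(_ + _) // add up the counts per word
counts.saveAsTextFile("counts-output") // action: writes one part file per partition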

Note – This article is in my own words; my apologies if you don't find it informative.
