MapReduce interview questions and answers – MapReduce in depth – Part 1
We are all already aware that MapReduce is a programming model used for processing large datasets in parallel on top of Hadoop. MapReduce is divided into two phases: the map phase and the reduce phase.
In this article, we will discuss only the map phase in more depth. The map and reduce concepts are well known from functional languages such as Lisp and others.
What is a Map (mapper) in MapReduce ?
In simple words, a map (mapper) is a user-written function containing the logic for reading the dataset from a particular file or block (in Hadoop). Remember that the mapper only understands key-value pairs of data, so MapReduce provides a way to first convert raw file data into key-value pairs before passing it to the mapper. How this happens, we will see now.
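To make the idea concrete, here is a minimal sketch in plain Python (not the actual Hadoop API) of what a word-count mapper does conceptually: it receives a (key, value) pair and emits intermediate (word, 1) pairs.

```python
# Conceptual sketch (plain Python, not Hadoop code) of a word-count mapper:
# it receives one (key, value) pair -- here (byte offset, line of text) --
# and emits a (word, 1) pair for every word in the line.
def word_count_mapper(key, value):
    for word in value.split():
        yield (word, 1)

# The mapper never sees raw bytes, only key-value pairs:
pairs = list(word_count_mapper(0, "hadoop map reduce map"))
# pairs == [("hadoop", 1), ("map", 1), ("reduce", 1), ("map", 1)]
```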
There is a data locality concept in MapReduce (Hadoop): each datanode runs the map function on its local data (block), and the output of the map is stored in temporary storage before being sent to the reducer. One thing we need to make sure of is that all data corresponding to one specific key is routed to the same place (MapReduce does this by default); if you have your own custom mapper or reducer functions, you must take care of this point yourself.
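The default mechanism that routes all values for one key to the same reducer is hash partitioning. A minimal plain-Python sketch of the idea (an illustration, not Hadoop's actual HashPartitioner class):

```python
# Sketch of the hash-partitioning idea: every occurrence of the same key
# hashes to the same reducer index, so all of a key's values end up
# together. (Plain Python illustration, not the Hadoop API.)
def partition(key, num_reducers):
    return hash(key) % num_reducers

# The same key always maps to the same reducer index within a job:
assert partition("hadoop", 4) == partition("hadoop", 4)
```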
Note – Before passing data to the mapper, the data is logically divided into InputSplits, and for each InputSplit (and hence each mapper) there is a RecordReader.
Now let’s understand the process of how key-value pair data is generated for the mapper. Before that, let’s understand what an InputSplit and a RecordReader are in this context.
What is an InputSplit ?
An InputSplit is a logical representation of data; it doesn’t contain any data itself, it is just a reference.
What is a RecordReader ?
The RecordReader communicates with the InputSplit and converts the data into key-value pairs. By default it uses the TextInputFormat class for this conversion (there are some more input formats that we will discuss in a separate article). With TextInputFormat, it assigns a byte offset to each record (line) as the key.
Note – Until the file reading is completed, the RecordReader keeps communicating with the InputSplit and assigning a byte offset to each record (line). This will become clearer with a diagram.
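The byte-offset idea can be illustrated with a few lines of plain Python (a sketch of what a TextInputFormat-style RecordReader produces, not the actual Hadoop class):

```python
# Sketch of what a TextInputFormat-style RecordReader produces: for each
# line, the key is the byte offset where the line starts and the value
# is the line's text. (Plain Python illustration, not Hadoop code.)
def record_reader(data):
    offset = 0
    for line in data.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)  # next key = current offset + line length

records = list(record_reader("hello hadoop\nhello world\n"))
# records == [(0, "hello hadoop"), (13, "hello world")]
```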
Let’s consider that I have a 100 MB text file in Hadoop with some data and I want to count the occurrences of words in this file. Consider the default Hadoop block size to be 64 MB; let’s understand diagrammatically how many mappers, InputSplits and RecordReaders will be created for this file.
InputSplits convert the physical representation of blocks into a logical one for the mapper. The 100 MB file will be read by two InputSplits: as per the Definitive Guide, the MapReduce framework creates one InputSplit per block, and for each InputSplit one RecordReader and one mapper are created.
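Following the one-split-per-block rule, the count for this example can be checked with a one-line calculation (the 100 MB file size and 64 MB block size are the assumptions stated above):

```python
import math

file_size_mb = 100   # example file from the article
block_size_mb = 64   # default block size assumed above

# One InputSplit (and hence one RecordReader and one mapper) per block:
num_splits = math.ceil(file_size_mb / block_size_mb)
print(num_splits)  # 2 -> two InputSplits, two RecordReaders, two mappers
```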
Note – It is not always the case that the number of InputSplits depends on the number of blocks; you can customize the number of splits for a particular file by setting the mapred.max.split.size property during your job execution.
The RecordReader keeps reading and converting data into key-value pairs until the end of the file. The RecordReader’s responsibility is to assign a byte offset (a unique number) to each line present in the file; this key-value data is then sent to the mapper.
As you can see in the diagram, the map sends this data to the reducer, but before it reaches the reducer there is some logic in between, which we will cover in the reducer article.
In short and simple words, the map reads the dataset and creates key-value pair data. This output of the mapper program is called intermediate data (key/value data that is understandable to the reducer). The framework is then responsible for grouping this intermediate key-value data so that each key corresponds to the set of all its respective values before it is passed to the reducer.
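A minimal plain-Python sketch of this grouping step (what the shuffle between map and reduce does conceptually, not the Hadoop implementation itself):

```python
from collections import defaultdict

# Conceptual sketch of the grouping step between map and reduce:
# intermediate (word, 1) pairs are grouped so that each key carries
# the list of all its values. (Plain Python, not the Hadoop shuffle.)
def group_intermediate(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

intermediate = [("map", 1), ("reduce", 1), ("map", 1)]
grouped = group_intermediate(intermediate)
# grouped == {"map": [1, 1], "reduce": [1]}
```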
Next we will see the reducer in depth, so stay tuned.