Hadoop interview questions and answers – Map Reduce in depth – Part 2
In the last article I explained the mapper process in depth. Before reading about the reducer process, I suggest you go through the mapper process once again.
Here is the link – Map Reduce in depth – Part 1
What is the Reduce phase (reducer) in MapReduce?
Like the mapper, the reducer is a user-written function. It accepts data from the mappers: in short, the mapper output is the input to the reducer. The reducer receives intermediate data in the form of a key and the set of values belonging to that key.
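To make the "key plus set of values" input concrete, here is a minimal plain-Java sketch of a reduce function. It does not use any Hadoop classes; the class name `ReduceFunctionSketch`, the `reduce` signature and the word-count style summing are assumptions chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of a reduce function; no Hadoop classes are used.
class ReduceFunctionSketch {

    // Receives one key and every value the mappers emitted for that key,
    // and returns a single aggregated value (here: the sum of the values).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical input: the key "hadoop" was emitted three times with the value 1.
        int total = reduce("hadoop", Arrays.asList(1, 1, 1));
        System.out.println("hadoop -> " + total);
    }
}
```

In a real job the framework calls the reduce function for you; here `main` plays that role.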
There can be many mappers feeding data to one single reducer. Let’s see the diagram.
In the above diagram you can see two mappers communicating with a single reducer, but it is not as simple as it looks in the diagram. There are some steps between the mapper and the reducer, so let’s see what those steps are and why they are required.
While explaining the process between the mapper and the reducer, I will introduce you to
Shuffling & Sorting
Mappers emit intermediate key/value data in which the same key can appear many times, and a mapper does not shuffle or sort that data itself. So before the data reaches the reducer, the shuffling phase brings together all pairs that share a key and builds a list of all values belonging to that particular key.
Shuffling and sorting are handled internally by the MapReduce framework, so you don’t have to worry about this phase. Sorting means comparing keys with each other, and the framework compares keys through the WritableComparable interface, which Hadoop provides and which extends Java’s Comparable interface. See the diagram below.
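The grouping and ordering described above can be imitated in a few lines of plain Java. This is only a sketch of the idea, not how Hadoop implements it internally: the `ShuffleSortSketch` class, the `Pair` type and the sample words are hypothetical, and a `TreeMap` stands in for the framework's sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSortSketch {

    // Hypothetical stand-in for one intermediate (key, value) record.
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Collects the output of all mappers and groups the values by key.
    // TreeMap keeps the keys in their natural (Comparable) order,
    // which imitates the sort step.
    static Map<String, List<Integer>> shuffleAndSort(List<List<Pair>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Pair> output : mapperOutputs) {
            for (Pair pair : output) {
                grouped.computeIfAbsent(pair.key, k -> new ArrayList<>()).add(pair.value);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Pair> mapper1 = Arrays.asList(new Pair("pig", 1), new Pair("hive", 1), new Pair("pig", 1));
        List<Pair> mapper2 = Arrays.asList(new Pair("hive", 1), new Pair("hdfs", 1));

        Map<String, List<Integer>> grouped = shuffleAndSort(Arrays.asList(mapper1, mapper2));

        // Each entry printed here corresponds to one call of the reduce function.
        grouped.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Five records from two mappers collapse into three keys, each with its own list of values, and the keys come out in sorted order.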
If you look at the diagram, two mappers are feeding data to a single reducer. Remember one thing: the reduce function is executed once for every unique key, together with the list of values for that key. If we have 100 unique keys after shuffling and sorting, the reduce function is executed 100 times. So the problem here is that with 1 million or 1 billion key/value pairs the reducer has a correspondingly large amount of data to receive and process.
In this case the reducer’s performance may go down on a high number of key/value pairs. To overcome this issue one more important concept is used: the combiner. A combiner is a mini-reducer that runs on the mapper’s output and pre-combines the values associated with each key, so less data has to be sent to the reducer.
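Here is a small plain-Java sketch of what a combiner buys you, assuming a word-count style job where every key is emitted with a count of 1. The class name `CombinerSketch` and the sample words are hypothetical, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Pre-aggregates the output of ONE mapper locally: the counts are summed
    // per key, so each key leaves the mapper at most once.
    static Map<String, Integer> combine(List<String> mapperKeys) {
        Map<String, Integer> partialCounts = new TreeMap<>();
        for (String key : mapperKeys) {
            // Every key is assumed to carry an implicit count of 1.
            partialCounts.merge(key, 1, Integer::sum);
        }
        return partialCounts;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: five records, but only two distinct keys.
        List<String> mapperKeys = Arrays.asList("pig", "hive", "pig", "pig", "hive");
        Map<String, Integer> combined = combine(mapperKeys);

        System.out.println("records before combining: " + mapperKeys.size());
        System.out.println("records after combining:  " + combined.size());
        System.out.println(combined);
    }
}
```

Pre-combining like this is only safe when the operation gives the same result no matter how the values are grouped, as is the case for a sum. The reducer still has to add up the partial counts coming from the different mappers.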
Note – The above diagram explains a single reducer. For multiple reducers we have to use the partitioner concept to make sure that all key/value pairs with a particular key go to the same reducer. We will see more about the partitioner and combiner in a separate article.
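As a quick preview of the partitioner idea, here is a plain-Java sketch of hash partitioning, which to my understanding is the same approach Hadoop's default HashPartitioner takes. The class name `PartitionerSketch`, the sample keys and the choice of 3 reducers are assumptions for illustration only.

```java
class PartitionerSketch {

    // Maps a key to a reducer index in the range [0, numReducers).
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode can never produce a negative reducer index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3; // hypothetical number of reducers
        String[] keys = {"pig", "hive", "hdfs", "pig"};

        for (String key : keys) {
            // The same key always lands on the same reducer,
            // no matter which mapper emitted it.
            System.out.println(key + " -> reducer " + partition(key, numReducers));
        }
    }
}
```

Because the reducer index depends only on the key, both occurrences of "pig" are sent to the same reducer.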
RecordWriter writes each record into the output file. There are no local data writes here: the RecordWriter output is the final output of the reducer and is stored directly on HDFS.
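There are two types of reducer
The identity reducer is the default reducer provided by Hadoop. You can use it by setting it as the reducer class of your job (in the old mapred API it is the IdentityReducer class). The identity reducer passes every key/value pair through unchanged, so the output is simply the shuffled and sorted mapper output.
Note – Like the identity reducer, there is also an identity mapper, which passes its input records through unchanged. We can use it when all we need from the job is the sorting done by the framework.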
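The behaviour of an identity reducer can be sketched in plain Java as follows. This only imitates the idea and is not Hadoop's own class: `IdentityReducerSketch`, `identityReduce`, `run` and the tab-separated output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IdentityReducerSketch {

    // Emits every (key, value) pair unchanged: one output line per value.
    static List<String> identityReduce(String key, List<Integer> values) {
        List<String> lines = new ArrayList<>();
        for (int value : values) {
            lines.add(key + "\t" + value);
        }
        return lines;
    }

    // Imitates the framework: visits the keys in sorted order and
    // calls the reduce function once per key.
    static List<String> run(Map<String, List<Integer>> groupedInput) {
        Map<String, List<Integer>> sorted = new TreeMap<>(groupedInput);
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : sorted.entrySet()) {
            output.addAll(identityReduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical grouped input, deliberately inserted in unsorted order.
        Map<String, List<Integer>> grouped = new HashMap<>();
        grouped.put("pig", Arrays.asList(1, 1));
        grouped.put("hdfs", Arrays.asList(1));

        run(grouped).forEach(System.out::println);
    }
}
```

Nothing is aggregated: the only visible effect on the data is the key ordering, which comes from the sort step and not from the reducer itself.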
Custom or own reducer class
This is a user-written reducer class. The framework still shuffles and sorts the data, and the reducer then applies your own logic to each key and its values.
Next we will look at the combiner and partitioner in more detail with the help of diagrams.