As all readers know, this website is all about teaching complex topics in a simple way, so in this post we will discuss the core MapReduce model and how it works.
What is Hadoop?
Hadoop is an open source project under the Apache community. It is designed to solve problems that traditional relational databases struggle with, and it provides features like:
- Distributed data storage
- Fault tolerance
- Parallel processing
These features are not cheap or easily available in relational databases.
Hadoop is used for storing both structured and unstructured data.
What is the MapReduce model and how does it work?
Before starting, let me give one simple example of why we need the MapReduce framework in Hadoop. If you are a beginner, this question is for you.
SQL databases like MySQL, Oracle, and SQL Server have one thing in common: a common language to read or query data, and that language is SQL. We need SQL to read data from these databases, so in this case SQL is the framework for communicating with them (data resides in rows and columns in relational databases).
In the same way, MapReduce is a data processing framework that runs on top of Hadoop to query data from HDFS (the Hadoop Distributed File System).
The MapReduce framework is divided into two phases: Map and Reduce.
Map is the phase where you write your initial transformation logic. The framework reads the input records in parallel, passing each record (typically one line of input) to your map function. The output the mappers generate is stored locally, and we call it intermediate data.
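To make the Map phase concrete, here is a minimal sketch in plain Python (not the actual Hadoop API) using the classic word-count example. The mapper reads one line at a time and emits a `(word, 1)` pair for every word it sees:

```python
def map_phase(line):
    """Emit an intermediate (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word, 1)

# Each input record is processed independently, which is why Hadoop can run
# many mappers in parallel across the cluster.
lines = ["to be or not to be"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(intermediate)
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

In real Hadoop you would implement this logic in a `Mapper` class, but the idea is the same: transform each input record into intermediate key-value pairs.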
See this post for in-depth knowledge about the Map phase.
Reduce is the aggregation (or collection) phase. It receives the intermediate data produced by the mappers; the reducer may then sum the values per key, take the distinct set of keys, or sort them, and finally it generates the final output.
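Continuing the word-count sketch in plain Python (again, a simplified illustration, not the real Hadoop API): the framework groups the intermediate pairs by key, and the reducer then sums the values of each group to produce the final counts:

```python
from collections import defaultdict

def reduce_phase(pairs):
    """Group intermediate (word, count) pairs by key, then sum each group."""
    groups = defaultdict(list)
    for word, count in pairs:        # shuffle step: collect all values per key
        groups[word].append(count)
    return {word: sum(counts)        # reduce step: aggregate each key's values
            for word, counts in groups.items()}

intermediate = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(reduce_phase(intermediate))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The grouping step here stands in for Hadoop's shuffle-and-sort, which guarantees that all values for the same key arrive at the same reducer.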
See this post for in-depth knowledge about the Reduce phase.