In Part 1, we discussed the basic master/slave workflow. In this tutorial we will cover an overview of the Hortonworks HDP 2.2 architecture.
Let’s get started.
Before going further, let’s understand: what is Hadoop?
- Hadoop is an open source Apache product, written in Java.
- Hadoop can store large amounts of structured, semi-structured and unstructured data.
- Hadoop allows us to process large amounts of data in parallel across the cluster, on multiple DataNodes (example: the master/slave workflow from Part 1).
- Hadoop stores data block by block. The block size is configurable, for example 64 MB (the Hadoop 1.x default; Hadoop 2.x, which HDP 2.2 is built on, defaults to 128 MB). The two examples that follow assume 64 MB blocks.
- Example: if you store a 100 MB file in Hadoop, it is split into 2 blocks, 64 MB for the first block and 36 MB for the second. People are often confused about whether the second block wastes the remaining 28 MB of a 64 MB block. It does not: the last block is only as large as the bytes remaining in the file.
- The next file always starts in a new block, so a following 200 MB file is stored as blocks of 64, 64, 64 and 8 MB, because only 8 MB remain after three full blocks. This is part of Hadoop's internal architecture (the NameNode), which we will discuss later.
- Hadoop provides data locality, data availability and fault tolerance. We will discuss these points in another post.
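The block arithmetic from the two file examples above can be checked with a small sketch. This is only an illustration of the sizes involved, not how HDFS is implemented; the function name `split_into_blocks` and the 64 MB default are assumptions chosen to match the examples.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Hypothetical helper for illustration only; it mirrors the arithmetic
    in the text, not the actual HDFS code path.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the remaining bytes, so no space is wasted.
        sizes.append(remainder)
    return sizes

for size_mb in (100, 200):
    blocks = split_into_blocks(size_mb * MB)
    print(size_mb, "MB ->", [b // MB for b in blocks])
```

Running it prints `[64, 36]` for the 100 MB file and `[64, 64, 64, 8]` for the 200 MB file, matching the examples.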
The points above are just an overview of Hadoop. If you look at the Hortonworks documentation and its architectural diagram, it can be hard for beginners to understand, so I have prepared a simplified diagram below. Let’s discuss it in a simplified way.
Let’s go step by step.
Here I will not go into depth on each component; rather, I will cover each of these points in more detail in separate posts.
Step 1 – HDFS ( Hadoop Distributed File System )
- The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage. Data is stored in blocks, mainly as flat files (text files).
- HDFS is optimized for streaming reads and writes of very large files.
Step 2 – YARN ( Resource Management )
- YARN is the second part of Hadoop. It is commonly referred to as the MR2 (MapReduce 2) architecture of Hadoop.
- YARN works together with HDFS so that both are treated as a single cluster. This makes it possible to move the computation to the data: processing happens on the DataNodes that hold the data, which is called data locality. With YARN, the storage system does not need to be physically separate from the processing system.
- YARN runs on top of Hadoop's storage layer (HDFS).
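To make the idea of data locality a little more concrete, here is a toy sketch of "move the computation to the data". It is not YARN's scheduler; the function `pick_node`, the block IDs and the node names are all made up for illustration. The only idea it shows is: prefer a free node that already stores the block, and fall back to any free node otherwise.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Choose a node to process a block, preferring a node that stores it.

    Hypothetical helper, not a YARN API. Returns (node, is_local).
    """
    # Nodes that hold the block AND currently have free capacity.
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    # No local node is free: the data would have to travel over the network.
    return sorted(free_nodes)[0], False

# Assumed example data: which nodes store which block, and which are free.
block_locations = {"blk_1": ["node1", "node2"], "blk_2": ["node3"]}
free_nodes = {"node2", "node4"}

print(pick_node("blk_1", block_locations, free_nodes))  # ('node2', True)
print(pick_node("blk_2", block_locations, free_nodes))  # ('node2', False)
```

In the first call the work runs where the data already is; in the second call no node holding the block is free, so the data has to be read remotely.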
Step 3 – Miscellaneous components on top of Hadoop.
- Hive provides a data warehouse infrastructure.
- Hive enables ad hoc querying and analysis of large data sets.
- Hive's query language is HiveQL (HQL), which is similar to SQL.
- Hive is a good fit for non-programmers who are familiar with SQL.
- Pig provides Pig Latin, a high-level scripting language.
- Pig is designed for expressing data analysis programs as a long series of data processing steps that run on the MapReduce framework.
- Pig optimizes the execution of scripts automatically, and Pig Latin can be extended with your own user-defined functions (UDFs).
Spark – according to the documentation:
- Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley.
- Spark is an in-memory data processing framework in which data is represented as RDDs (Resilient Distributed Datasets) that are divided into smaller partitions.
- Spark can be up to 100 times faster than Hadoop MapReduce for some applications.
- Spark is also aimed at data scientists, for machine learning and analytics.
- Storm is used as a real-time data processing framework.
- Storm is also well known in the complex event processing world.
- MapReduce is a framework for writing applications on top of Hadoop that process large amounts of structured, semi-structured and unstructured data in parallel across multiple nodes, in a reliable and fault-tolerant way.
- The map phase is mainly responsible for reading the input data from HDFS and emitting intermediate key/value pairs.
- Each reducer fetches its partition of the map output from the nodes where the map tasks ran (map tasks are scheduled close to their data, which is data locality), then writes its output back to HDFS.
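The map and reduce steps described above can be sketched in plain Python with the classic word count example. This runs in a single process and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the in-memory `shuffle` stands in for the step where reducers collect their partition from the map nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as each reducer would receive them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Run every line through the map phase, group by word, then reduce.
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

result = word_count(["hadoop stores data", "hadoop processes data in parallel"])
print(result)
```

In a real cluster the input lines would come from HDFS blocks, many map tasks would run in parallel on different nodes, and the reducers would write the final counts back to HDFS.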
Step 4 – Batch and Real Time Data Processing.
What is batch processing?
- In batch mode, you are normally not concerned about the time taken for query execution.
- You are only interested in the query result, even if it takes 1 hour or 1 day.
What is real-time data processing?
- The response time of a particular request must be in milliseconds or seconds.
- Near real time is a separate category, where the response time may be up to 5 minutes.
- In short, you fire a query and get the result within a few seconds.
We will discuss more Hortonworks components in Part 3.