Hortonworks Hadoop (HDP 2.2)
Hortonworks is open source, and you can get commercial support as well. In the Hortonworks VM (sandbox) you will find a list of Apache components running on top of Hadoop. These days it is not sufficient to know only Hadoop itself; you should also be familiar with the other components that run on top of Hadoop, such as Hive, Pig, Flume, etc.
In this tutorial I will not cover much theory; instead I am focusing on the practical side of the Hortonworks sandbox. I have been working on Hortonworks Hadoop clusters for the last couple of years, so I am going to share my experience and opinions here. When I started learning Hadoop there was only one distribution, Apache Hadoop; later Cloudera, Hortonworks, MapR, and others came out with their own Hadoop distributions.
In the end the architecture of all these Hadoop distributions is roughly the same. Hortonworks provides many open source components on top of Hadoop; let's discuss some of them.
Hortonworks Hadoop –
- This framework enables the distributed processing of large structured, semi-structured, and unstructured data sets across clusters of commodity hardware using the MapReduce programming model.
- Hadoop itself is designed to scale up from a single server to thousands of machines. If you are not familiar with the master-slave architecture, take a look at the diagram below.
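To make the MapReduce model concrete, here is a minimal sketch in plain Python (not Hadoop's actual API; all function names are illustrative) that simulates the map, shuffle, and reduce phases of a word count, the classic MapReduce example:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real cluster the map and reduce functions run on many slave nodes in parallel, and the shuffle happens over the network; the logic, however, is the same.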
In a traditional setup like an Oracle or MySQL server, a single server (the master) is responsible for query execution.
In the master-slave case above, the master is only responsible for storing metadata about the datanodes (slave nodes), such as each datanode's storage capacity, health, and whether it is alive or dead. But what is a datanode? A datanode is just a physical entity; in simple words, a machine with a unique IP address, a physical address, an OS, and storage capacity (RAM/hard drive).
With an old MySQL or Oracle server, a big query result might take an hour to produce. This is where master-slave helps: the client sends a query request to the master, and the master examines the request and assigns tasks to multiple slave nodes to compute the result in parallel.
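The fan-out just described can be sketched in Python (the shard layout and function names here are illustrative, not a real Hadoop API): the master sends the same query to every slave, each slave computes a partial result over its own shard of the data, and the master merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative: each slave node holds one shard of the rows.
shards = {
    "s1": [3, 1, 4],
    "s2": [1, 5, 9],
    "s3": [2, 6, 5],
}

def slave_count(rows, predicate):
    # Each slave counts matching rows in its own shard only.
    return sum(1 for row in rows if predicate(row))

def master_query(predicate):
    # The master fans the query out to all slaves in parallel
    # and merges the partial counts into the final answer.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: slave_count(rows, predicate), shards.values())
    return sum(partials)

print(master_query(lambda row: row > 3))  # 5
```

Because each slave scans only its own shard, adding more slaves shrinks the work per machine, which is why the cluster answers large queries faster than a single server.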
The master is the only entity responsible for storing information about the datanodes/slave nodes: their locations, file locations, sizes, and so on. The master node never stores the data itself. Here the slaves s1, s2, s10, or s11 contain the data. In some cases the client can connect directly to one of the slaves/datanodes, communicating via SSH.
To summarize, the main points to understand about master-slave are:
- Master – responsible for storing metadata about all slave nodes.
- Datanode / SlaveNode – responsible for storing and processing data.
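The two roles above can be sketched in plain Python (all class and method names are illustrative, not Hadoop's real API): the master holds only metadata about where each block lives, while the slaves hold the actual data blocks.

```python
class Slave:
    # Illustrative slave/datanode: stores actual data blocks.
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> data

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class Master:
    # Illustrative master: stores only metadata (which slave holds which block).
    def __init__(self):
        self.block_locations = {}  # block_id -> slave

    def write(self, block_id, data, slave):
        slave.store(block_id, data)             # the data goes to the slave...
        self.block_locations[block_id] = slave  # ...the master keeps only the location

    def locate(self, block_id):
        return self.block_locations[block_id]

# A client asks the master where a block lives, then reads it from that slave.
s1, s2 = Slave("s1"), Slave("s2")
master = Master()
master.write("blk_1", "hello", s1)
master.write("blk_2", "world", s2)

slave = master.locate("blk_2")
print(slave.name, slave.read("blk_2"))  # s2 world
```

Note that the master's `block_locations` dictionary never contains file contents, only pointers to slaves; this is the sense in which the master "never stores the data on it".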
We will discuss more Hortonworks components in Part 2.