Hortonworks Hadoop 2.2 – Introduction Part 3

By | July 15, 2015


In Part 2, we covered the basic architecture of Hadoop (HDP). In this tutorial we look at the GOVERNANCE topic from the Hortonworks documentation. In Hortonworks terminology, GOVERNANCE covers the additional components that work alongside Hadoop or on top of it. Let's go through the components that fall under the GOVERNANCE category in HDP 2.2.

(Diagram: HDP 2.2 GOVERNANCE category)

The name itself describes the meaning, but governance here is specifically about data, so it is usually called data governance. The data governance tools (components) let an enterprise load data from external sources into Hadoop. With these tools you can manage the data workflow: apply data policy and security, improve data quality, and control the data lifecycle. This post gives an overview of each component; later posts will cover each one in more depth.



  • Sqoop is a tool that transfers bulk data (complete data sets) between Hadoop and structured datastores such as relational databases.
  • Sqoop supports loading data from MySQL, Oracle and other relational databases into Hadoop.
  • Recent versions of Sqoop also provide an efficient way to load data back from Hadoop into relational databases.
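
As a rough illustration of the two directions described above, the Python sketch below only assembles the command lines for `sqoop import` and `sqoop export`; it does not run them. The JDBC URL, user name, table names and HDFS paths are placeholders invented for this example, and the helper names (`sqoop_import_args`, `sqoop_export_args`, `as_command_line`) are hypothetical.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=1):
    """Argument list for `sqoop import`: relational table -> HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                        # prompt for the password on the console
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def sqoop_export_args(jdbc_url, username, table, export_dir):
    """Argument list for `sqoop export`: HDFS directory -> relational table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
    ]


def as_command_line(args):
    """Render an argument list as a shell-quoted string for display."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Placeholder connection details, invented for this example.
    url = "jdbc:mysql://db.example.com/sales"
    print(as_command_line(
        sqoop_import_args(url, "etl_user", "orders", "/data/raw/orders")))
    print(as_command_line(
        sqoop_export_args(url, "etl_user", "order_totals", "/data/out/order_totals")))
```

Building the arguments as a list keeps each flag and its value separate, so the same list could later be handed to a process launcher on a machine where Sqoop is installed.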




  • Flume is a distributed, reliable service for efficiently collecting, aggregating and moving large amounts of data (mainly log data).
  • It has a flexible architecture based on continuous / streaming data flows.
  • It is robust and fault tolerant.
  • It provides failover and recovery mechanisms so that data is not lost.




  • Apache Kafka is fast, scalable, durable and distributed by design.
  • Apache Kafka is a publish-subscribe messaging system.
  • A single Kafka broker can handle hundreds of megabytes of reads and writes per second.
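
To make the publish-subscribe idea above a little more concrete, here is a toy in-memory model in Python. It is not the Kafka client API and does not talk to a broker; the class `ToyBroker` and its methods are hypothetical names made up for this sketch. It only shows the general pattern: publishers append messages to a topic, and each subscriber keeps its own read position, so several subscribers can read the same messages independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)     # topic -> list of messages
        self._positions = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic log and return its position."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, subscriber, topic, max_messages=10):
        """Return unread messages for this subscriber and advance its position."""
        start = self._positions[(subscriber, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._positions[(subscriber, topic)] = start + len(batch)
        return batch


if __name__ == "__main__":
    broker = ToyBroker()
    broker.publish("clicks", "page=/home")
    broker.publish("clicks", "page=/cart")
    # Two independent subscribers each see every message on the topic.
    print(broker.poll("analytics", "clicks"))
    print(broker.poll("audit", "clicks"))
```

Messages stay in the log after they are read, which is why a second subscriber can still consume them from the beginning.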




  • WebHDFS is a protocol that provides HTTP REST access to HDFS.
  • It enables external users to connect to a Hortonworks cluster from outside.
  • The HTTP REST API supports operations such as:
    • HTTP GET
    • HTTP PUT
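
Assuming the protocol described above is WebHDFS, the Python sketch below shows how the REST URLs are typically shaped and which HTTP method goes with a few common operations. It only builds URLs and does not contact a cluster. The host name and paths are placeholders, `webhdfs_url` is a hypothetical helper, and port 50070 is assumed as the NameNode HTTP port; your cluster may use a different one.

```python
from urllib.parse import quote, urlencode

# A few WebHDFS operations and the HTTP method each one uses.
HTTP_METHOD = {
    "OPEN": "GET",           # read a file
    "LISTSTATUS": "GET",     # list a directory
    "GETFILESTATUS": "GET",  # status of a file or directory
    "CREATE": "PUT",         # create and write a file
    "MKDIRS": "PUT",         # create a directory
}


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a URL of the form http://host:port/webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = {"op": op}
    query.update(params or {})
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(query))


if __name__ == "__main__":
    # Placeholder host and paths, invented for this example.
    for op, path in [("OPEN", "/user/demo/data.txt"), ("MKDIRS", "/user/demo/new")]:
        print(HTTP_METHOD[op], webhdfs_url("namenode.example.com", path, op))
```

Any HTTP client can then send the request with the listed method, which is what lets users outside the cluster reach HDFS without Hadoop libraries installed.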




  • Falcon plays an important role in data governance on a Hortonworks cluster.
  • Apache Falcon provides a framework for automating data governance by defining data pipelines.
  • Falcon provides an interface and also allows dynamic changes to a pipeline.
  • Falcon mainly offers:
    • Data Replication
    • Data Lifecycle Management
    • Dataset Traceability

That's all for the governance tab in the Hortonworks Hadoop architecture.

We will discuss more of the Hortonworks architecture in Part 4.
