Today, let’s discuss Cloudera CDH 5.3, the latest version. We will go deeper and deeper in upcoming tutorials, so stay tuned.
I hope you are enjoying all my tutorials! If not, have a look at the Hortonworks tutorials.
Let’s get started!
Are you a business person (sales), manager, consultant, or architect who is new to Hadoop? Then you are in the right place. These tutorials are even more useful for developers and for anyone who wants to switch careers from other technologies to Hadoop or Big Data. So I welcome you all to the Big Data world and to this site.
What is CDH (Cloudera's Distribution Including Apache Hadoop)?
- The Cloudera distribution bundles Apache Hadoop with a number of additional components that run on top of it
- Cloudera builds its own Hadoop distribution from the Apache Hadoop source, which allows it to offer commercial support for Hadoop
- Commercial support covers management as well as governance
Cloudera also offers an enterprise edition aimed at user and business needs, and it has been an established commercial vendor for the last couple of years.
This is a rough architectural diagram of CDH (Cloudera). You will find the same components in other distributions (vendors) such as Hortonworks and MapR, though each has its own features. We will discuss the role Cloudera plays.
Data Integration
- Data integration relies on open source components such as Flume, Kafka, and Sqoop, which help move data from one system to another. This pattern is usually described as source and sink, and Flume adds a channel that carries data from the source to the sink
- Kafka, on the other hand, has proven itself as a publish/subscribe (pub/sub) messaging system, commonly used for log aggregation
- Sqoop manages efficient data flow between relational databases and Hadoop. It can import data from sources such as MySQL or Oracle into Hadoop for further analytics, and export the results back to relational databases
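To make the source, channel, and sink idea a bit more concrete, here is a tiny toy model in plain Python. It is only a sketch of the pattern and does not use Flume's real API or configuration. The names `Channel`, `run_source`, `run_sink`, and `hdfs_stand_in` are made up for this illustration.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer sitting between a source and a sink (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # Refuse new events when the buffer is full so the source can back off.
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self):
        # Hand over the oldest event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: push each incoming line into the channel; return what did not fit."""
    rejected = []
    for line in lines:
        if not channel.put(line):
            rejected.append(line)
    return rejected


def run_sink(channel, store):
    """Sink: drain the channel into the destination store."""
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event)


channel = Channel(capacity=2)
hdfs_stand_in = []  # hypothetical stand-in for the real destination, e.g. HDFS
rejected = run_source(["log line 1", "log line 2", "log line 3"], channel)
run_sink(channel, hdfs_stand_in)
print(hdfs_stand_in)  # ['log line 1', 'log line 2']
print(rejected)       # ['log line 3']
```

The channel decouples the two sides: the source only needs to know whether the channel accepted an event, and the sink only needs to know how to take the next one.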
Data Storage Layer
- You can store any type of data in a CDH cluster. The main point is that Cloudera introduced the Apache Sentry component so that this data can be accessed and handled more securely
- The purpose of Apache Sentry is to provide access control for the HDFS storage layer as well as for other components
Batch Processing
- In batch processing there is no strict time constraint: a query may take an hour or a day, depending on the data volume
- In Cloudera, batch processing can be done with MapReduce, Hive, or Pig
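If you have never seen the MapReduce model, the classic word count shows the idea. The sketch below runs in plain Python on a single machine and does not use the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data lake"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'lake': 1}
```

On a real cluster the map and reduce steps run in parallel on many nodes and the shuffle moves data over the network, which is why such jobs are treated as batch work.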
Stream processing & Machine learning
- Cloudera includes Apache Spark for stream processing, so you can apply real-time logic or machine learning algorithms to real-time data
- Spark also ships MLlib, a machine learning library usable from Scala, Java, and Python
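Spark Streaming handles a stream as a sequence of small batches (micro-batches). The toy sketch below imitates that idea in plain Python by keeping a running mean that is updated after every batch. It does not use the Spark API, and the names `micro_batches`, `running_means`, and `readings` are assumptions made for this example.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an incoming stream into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_means(stream, batch_size):
    """Update a running mean after each micro-batch and record it."""
    count, total = 0, 0.0
    means = []
    for batch in micro_batches(stream, batch_size):
        count += len(batch)
        total += sum(batch)
        means.append(total / count)
    return means


readings = [10, 20, 30, 40, 50]  # hypothetical sensor values arriving over time
print(running_means(readings, batch_size=2))  # [15.0, 25.0, 30.0]
```

The important point is that state (here `count` and `total`) is carried from one batch to the next, so results are available while the data is still arriving.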
Some key features of Cloudera CDH 5.3
- Flexible – In Cloudera you can work with any component of the Hadoop ecosystem, and any type of data can be stored and retrieved securely. You can do both batch and stream processing, run SQL efficiently through Impala, do free-text search with Solr, and use Spark MLlib for machine learning and statistical computation
- Security – Access control over data and users, with multi-tenancy support
- Balanced – The complete Cloudera package is available for quick access, and you can try it out in the Cloudera sandbox (the QuickStart VM)
In Part 2 we will discuss Cloudera further.