Latest Cloudera Hadoop CDH 5.4.x – Part 4 Cloudera Search

By | July 15, 2015


In Part 3, we covered an overview of Impala. In this tutorial, let’s explore the Cloudera Search mechanism.

What is Cloudera Search?


  • Cloudera Search provides near-real-time indexing of, and full-text search over, data stored in HDFS.
  • Because it searches enterprise data in place, there is no need to run complex ETL tasks to move data from one machine to another or to export it into a real-time BI reporting system.
  • Cloudera Search is built on Apache Solr, which in turn is based on Apache Lucene. This functionality has been available since CDH 4.


Some key features of Cloudera Search


Search data criteria


  • Cloudera Search relies on Apache Tika to detect and parse file formats; the extracted content is then used for indexing.
  • Supported file formats include SequenceFile, text, Avro, JSON, XML, and others.


User friendly Hue UI and Data Visualization


  • Cloudera Search can be thought of as a Hue add-on or plugin that lets users explore and query data in Hadoop.
  • The Hue UI uses the Cloudera Search API, which is based on Apache Lucene and Solr.
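
Since the search API is Solr-based, a query ultimately becomes an HTTP request to a Solr `select` handler. Here is a minimal sketch of how such a request URL could be built. The host `search-host`, the collection name `logs`, and the field `message` are assumptions made up for illustration; they are not part of this tutorial's cluster.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, collection, query, rows=10):
    """Build a Solr select URL for a full-text query.

    base_url and collection are placeholders; substitute the values
    of your own Cloudera Search deployment.
    """
    # q = the query string, rows = max hits, wt = response format
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


# Hypothetical host, collection, and field name, for illustration only
url = build_solr_query_url("http://search-host:8983/solr", "logs", "message:error")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; Hue does the equivalent behind its search UI.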


Data Indexing


  • Consider Flume as an example: a Flume source reads data from an external system and the Flume agent writes it into Cloudera Hadoop. At write time, the Flume data can be written directly into the Cloudera Search index.
  • Cloudera Search supports indexing of large data sets. It comes with built-in MapReduce jobs for indexing large volumes of data stored in HDFS; this is normally called batch indexing using MapReduce.
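
Batch indexing is typically launched with the MapReduceIndexerTool that ships with Cloudera Search. The sketch below only assembles the command line; it does not run it. Every path, host, and collection name passed in the usage example is a hypothetical placeholder (the job jar location in particular differs between package and parcel installs), so treat this as an outline rather than a ready-made command.

```python
def build_batch_index_command(job_jar, morphline_file, output_dir,
                              zk_host, collection, input_dir, go_live=True):
    """Assemble a `hadoop jar` command line for MapReduceIndexerTool."""
    cmd = [
        "hadoop", "jar", job_jar,
        "org.apache.solr.hadoop.MapReduceIndexerTool",
        "--morphline-file", morphline_file,  # how records are parsed into Solr documents
        "--output-dir", output_dir,          # HDFS directory for the generated index shards
        "--zk-host", zk_host,                # ZooKeeper ensemble used by Solr
        "--collection", collection,          # target Solr collection
    ]
    if go_live:
        # Merge the generated shards into the live Solr servers
        cmd.append("--go-live")
    cmd.append(input_dir)                    # HDFS input data to index
    return cmd


# All values below are assumptions for illustration only
command = build_batch_index_command(
    job_jar="/path/to/search-mr-job.jar",
    morphline_file="morphline.conf",
    output_dir="hdfs:///user/demo/outdir",
    zk_host="zk-host:2181/solr",
    collection="logs",
    input_dir="hdfs:///user/demo/indir",
)
print(" ".join(command))
```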



Let’s understand the basic workflow of Cloudera Search


Streaming data is indexed through Cloudera Search: Flume events are indexed using a Solr indexing schema, and the indexes are written directly into HDFS with Solr index support, where they are easily accessible via the Cloudera Search interface.

  • Streaming data comes from many different sources (e.g., log data).
  • The Flume channel acts as the pipeline between the Flume source (start point) and the Flume sink (end point).
  • Data is indexed as it passes through the Flume channel. The Lucene-based indexing is handled by the Flume agent's sink, and with the help of the Flume agent the data is written directly into HDFS or HBase.
  • Later, the indexes are loaded from Hadoop into Solr for searching.
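
To make the source, channel, and sink roles concrete, here is a toy model of the workflow in plain Python. It does not use the Flume or Solr APIs: the classes `ToyChannel` and `ToyIndexingSink` are invented for illustration, and the "index" is a simple in-memory inverted index standing in for the Lucene-based one.

```python
from collections import defaultdict, deque


class ToyChannel:
    """Stand-in for a Flume channel: a FIFO buffer between source and sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty
        return self._events.popleft() if self._events else None


class ToyIndexingSink:
    """Stand-in for an indexing sink: drains the channel into an inverted index."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of document ids

    def process(self, channel):
        count = 0
        event = channel.take()
        while event is not None:
            doc_id, body = event
            for term in body.lower().split():
                self.index[term].add(doc_id)
            count += 1
            event = channel.take()
        return count

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))


# The "source" here is just a list of log lines pushed into the channel
channel = ToyChannel()
for doc_id, line in enumerate(["ERROR disk full", "INFO job started", "ERROR job failed"]):
    channel.put((doc_id, line))

sink = ToyIndexingSink()
sink.process(channel)
print(sink.search("error"))
```

In a real deployment the sink would hand documents to Solr instead of a Python dictionary, but the flow of events is the same.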


We will discuss Cloudera further in Part 5.
