Latest Cloudera Hadoop CDH 5.4.x – Introduction Part 3: Impala Overview

July 15, 2015


In Part 2, we discussed the three most important components available in Cloudera: Cloudera Manager, CDH (Cloudera Hadoop Distribution), and Cloudera Navigator. In this tutorial we will discuss the Impala workflow and give an overview of Impala. Let's go further and see how Impala works.

Some important entities in the Impala workflow

 

HDFS

 

  • HDFS (Hadoop Distributed File System) is the entity used for Impala data storage; alternatively, Impala can query data stored directly in HBase.

 

Impala Daemon (impalad)

 

  • The Impala daemon (impalad) runs on each DataNode and works in a parallel manner; the daemons coordinate with each other, for example to share resources, and each node acts as a worker node.
  • Each impalad process can receive queries and query execution plans submitted by Impala users.

 

JDBC/ODBC Client

 

  • JDBC – Impala supports the standard JDBC interface, which gives custom software written in Java and other languages access to the tables defined in the Hive metastore. You can write your own Java client to communicate with Impala databases and run various queries (a minimal sketch follows this list).
  • ODBC – Third-party products, such as custom data-warehouse tools, are supported through ODBC; connectors are available on Cloudera's official site.
  • Example – You can access Hive/Impala tables in Excel using an ODBC connector.
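
The sketch below shows a minimal JDBC client for Impala, assuming an unsecured cluster, that Impala is listening on its default HiveServer2-compatible port 21050, and that the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath; the hostname and the customers table are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        // Placeholder URL: Impala speaks the HiveServer2 protocol, typically on port 21050.
        // ";auth=noSasl" assumes a cluster without Kerberos/LDAP authentication.
        private static final String JDBC_URL = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";

        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the driver
            try (Connection conn = DriverManager.getConnection(JDBC_URL);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT name, city FROM customers LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " - " + rs.getString("city"));
                }
            }
        }
    }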

 

Hue GUI

 

  • Hue is a nice GUI for accessing the various tools available in Cloudera; it allows you to create, configure, deploy, and run MapReduce jobs, Hive or Pig scripts, and more through a simple web interface.
  • Hue provides the Beeswax server, which sends queries from the Hue browser to the HiveQL engine.

 

Impala Shell

 

  • impala-shell is a simple command-line shell where you can create databases and run queries; you can even submit complex query statements here.
  • Using command-line options, you can pass an Impala script file to the shell to automate your SQL statements (for example, with the -f option).

 

Hive Metastore

 

  • The Hive metastore stores information about all the tables and data required by Impala.
  • Impala refers to the Hive metadata to learn which databases are available and in what format their data is stored.
  • The Impala catalog service is responsible for broadcasting metadata changes to all nodes in the cluster (see the sketch after this list).
  • Example – information about table deletions, updates, etc.
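
When tables or data files change outside Impala (for example, through Hive or a direct HDFS load), Impala can be told to reload the metadata with the REFRESH and INVALIDATE METADATA statements. Below is a minimal sketch using the same placeholder JDBC connection details as above; sales_data is a hypothetical table name.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ImpalaMetadataRefresh {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";  // placeholder host/port
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                // Pick up new data files added to an existing table (e.g. loaded via Hive or HDFS).
                stmt.execute("REFRESH sales_data");
                // Pick up tables created or dropped outside Impala.
                stmt.execute("INVALIDATE METADATA");
            }
        }
    }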

 

Query execution process

(Figure: Impala query execution workflow)

  • The client sends a SQL statement to Impala through the Hue browser, JDBC, or ODBC; the client can connect to any impalad service in the Cloudera cluster.
  • In the second step, the impalad that received the statement processes the query and prepares a query execution plan; the query then executes across the cluster in parallel (see the EXPLAIN sketch after this list).
  • Each impalad accesses data from its local HDFS location.
  • One impalad daemon acts as the coordinator; after local query execution on every node, the results are sent back to the coordinator impalad, which then returns the combined result to the client (Hue).
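
You can inspect the execution plan that the coordinator prepares by prefixing a query with EXPLAIN. A minimal sketch, assuming the same placeholder connection details and a hypothetical customers table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaExplainExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";  // placeholder
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 // EXPLAIN returns the distributed plan without actually running the query.
                 ResultSet rs = stmt.executeQuery(
                         "EXPLAIN SELECT city, COUNT(*) FROM customers GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));  // each row is one line of the plan
                }
            }
        }
    }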

 

Impala features

 

  • Familiar SQL syntax that is easy to understand for database users (data scientists, database developers).
  • It can query data from the Hadoop ecosystem using its parallel execution feature.
  • Simple ETL model for analytics, where you can automate your SQL scripts for the data extraction, transformation, and load process.
  • Supports a shared-data mechanism that lets you write data with Impala statements and access it through the Hive interface.
  • It is a cost-effective and reliable solution that runs on commodity hardware.
  • Kerberos authentication support is available for Impala.
  • A separate Cloudera Impala query UI is available.
  • Support for different file formats such as Avro, text, SequenceFile, and Parquet, and compression codecs such as Snappy and GZIP (see the sketch after this list).
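
As an illustration of the file-format and compression support, the sketch below creates a Parquet table and writes Snappy-compressed data into it through the same placeholder JDBC connection used earlier; the web_logs tables and columns are hypothetical, and COMPRESSION_CODEC is the Impala query option that controls the codec used when writing Parquet files.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ImpalaParquetExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";  // placeholder
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                // Hypothetical table stored in the Parquet file format.
                stmt.execute("CREATE TABLE IF NOT EXISTS web_logs_parquet "
                           + "(ip STRING, url STRING, hits BIGINT) STORED AS PARQUET");
                // Choose the compression codec Impala uses when writing Parquet data.
                stmt.execute("SET COMPRESSION_CODEC=snappy");
                // Copy rows from a hypothetical text-format table into the Parquet table.
                stmt.execute("INSERT INTO web_logs_parquet SELECT ip, url, hits FROM web_logs_text");
            }
        }
    }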

 

We will discuss more on Cloudera in Part 4.

