What is HCatalog in Hadoop Hive?
- HCatalog is a table and storage management service that runs on top of Hadoop.
- HCatalog supports several Hadoop components, such as MapReduce, Hive, and Pig, so they can share the same table definitions.
- HCatalog provides access to the Hive metastore through the hcat command-line tool and a REST API (WebHCat), both of which you can use to create and manage tables.
For storage, it supports file formats such as plain text, SequenceFile, RCFile, CSV, and JSON. You can also plug in a custom storage format by providing an InputFormat, an OutputFormat, and a SerDe for it.
Let's walk through how the components fit together:
- MapReduce uses Hadoop's InputFormat and OutputFormat to read and write data. In the same way, HCatalog provides HCatInputFormat and HCatOutputFormat for MapReduce jobs.
- HCatInputFormat is responsible for reading data from a table, which is internally partitioned into multiple directories.
- HCatOutputFormat is responsible for writing data to a table. It can create a new partition or write to an existing one; in both cases you identify the partition with two parameters, the partition key and its value.
- HCatLoader and HCatStorer are the two classes used from Pig; they implement Pig's load and store interfaces.
- HCatLoader is responsible for reading. When calling HCatLoader, we can restrict the read to a particular partition.
- HCatStorer is responsible for writing. When calling HCatStorer, we can create a new partition or use an existing one by passing the partition key and value as parameters.
- HCatalog is connected directly to the Hive metastore, so Hive needs no separate loader or storer to work with the data.
- HCatalog presents a relational view of the data: whatever the underlying file format, the data appears in tabular form.
- You can create databases and tables through HCatalog; their definitions are stored in the Hive metastore by default.
- In HCatalog, each record is divided into columns, and every column has two attributes: a name and a data type.
- In short, HCatalog is a good tool on top of the Hive metastore for working with partitioned data.
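The idea of a table that is "internally partitioned into multiple directories" and addressed by a partition key and value can be sketched in a few lines of Python. This is only an illustration of the Hive-style key=value directory layout, not the HCatalog API: the function names, the /warehouse/web_logs path, and the dt and country keys are all made up for the example, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
from pathlib import PurePosixPath


def partition_path(table_dir, spec):
    """Build the directory for one partition.

    spec maps partition keys to values in partition-key order, e.g.
    {"dt": "2024-01-01", "country": "us"} -> <table_dir>/dt=2024-01-01/country=us
    """
    path = PurePosixPath(table_dir)
    for key, value in spec.items():  # dicts keep insertion order
        path = path / f"{key}={value}"
    return str(path)


def parse_partition(table_dir, directory):
    """Recover the partition key/value pairs from a partition directory."""
    relative = PurePosixPath(directory).relative_to(table_dir)
    return dict(part.split("=", 1) for part in relative.parts)


def select_partitions(table_dir, directories, wanted):
    """Keep only the directories whose partition values match `wanted`.

    This mimics a reader that is restricted to particular partitions.
    """
    selected = []
    for directory in directories:
        values = parse_partition(table_dir, directory)
        if all(values.get(key) == value for key, value in wanted.items()):
            selected.append(directory)
    return selected


# Hypothetical table location and partitions, for illustration only.
table = "/warehouse/web_logs"
existing = [
    partition_path(table, {"dt": "2024-01-01", "country": "us"}),
    partition_path(table, {"dt": "2024-01-01", "country": "in"}),
    partition_path(table, {"dt": "2024-01-02", "country": "us"}),
]

# "Writing" to a partition means targeting the directory named by key and value.
print(partition_path(table, {"dt": "2024-01-03", "country": "us"}))

# "Reading" a particular partition means scanning only the matching directories.
print(select_partitions(table, existing, {"country": "us"}))
```

In the real system, HCatInputFormat, HCatOutputFormat, HCatLoader, and HCatStorer resolve these locations through the metastore, so jobs refer to tables and partition values rather than to file paths.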
In the next tutorial, we will cover a practical use case and how to use HCatalog with Hive, MapReduce, and Pig.