Monday, February 20, 2017

How To Fix Hive – Partitioned Table Query Fails When Stored As Parquet

This article is about a bug in Hive's filtering logic that appears when a partitioned table stored as Parquet is queried on a partition column. The post below shows how to reproduce the problem and how to work around it.
Big data is a term for data sets that are so large, complex, or fast-growing that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy.


Quantifying Big Data

  • National Medical Records, Weather Images
  • Machine logs, RFID reader, sensor networks, Rich media
  • Retail and enterprise transactions
  • Publicly available data from different sources
  • Airlines & Share market

Technology:

There are different ecosystems for processing large datasets in the Big Data world, but one of the most commonly used components of the Hadoop ecosystem is Apache Hive.
Hive was developed at Facebook to analyze and extract useful information from its huge data sets, but it is now very popular in other organizations too, such as Netflix and FINRA.
Even though Big Data is gaining popularity day by day, some issues seem so obvious that they are hardly worth considering; we recently found one such issue in Apache Hive that is worth sharing.

Use-case:

Nowadays most of us use various techniques to optimize queries, that is, to improve the performance of Hive queries. The two most common techniques are:
  1. Partitioning
  2. Storing data in Parquet format.
Partitioning is a well-known concept to anyone processing, analyzing, or aggregating data through Apache Hive, and the Parquet file format incorporates several features that make it well suited to data-warehouse-style workloads.
But many of us are unaware that Apache Hive fails on exactly this combination: when a partitioned table is stored in Parquet format, executing a query that filters on a partition column raises an error.

Let's have a detailed look at it.

Below is the pipe-delimited sample data in HDFS that we will load into a managed, non-partitioned Hive table.
[Screenshot 1]
The steps below create a managed Hive table named "hive_emp1".
[Screenshot 2]
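For reference, the DDL behind this step would look roughly like the sketch below. The column names and types (emp_id, emp_name, salary, dept, sex) are assumptions based on the data described later in the walkthrough; the actual schema is shown in the screenshot.

  -- Hypothetical schema; the real columns are in the screenshot above.
  CREATE TABLE hive_emp1 (
    emp_id   INT,
    emp_name STRING,
    salary   INT,
    dept     STRING,
    sex      STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE;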
Next, load the data from HDFS into the Hive table (hive_emp1) created in the step above.
[Screenshots 3 and 4]
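The load step would be something along these lines; the HDFS path is a placeholder and should point at wherever the pipe-delimited file actually lives.

  -- Hypothetical HDFS path; use the location of your own data file.
  LOAD DATA INPATH '/user/hive/data/emp.txt' INTO TABLE hive_emp1;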
Take a look at the data now present in the Hive table created above.
There are a few males and two females, represented by 'M' and 'F' respectively in the last column (sex).
[Screenshot 5]
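A quick way to verify the load is a plain select (the actual rows are in the screenshot):

  SELECT * FROM hive_emp1;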
Now we will create another Hive table named "hive_emp_dynpart", partitioned on two columns (dept and gender), with its data stored in Parquet format.
[Screenshot 6]
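The DDL for the partitioned Parquet table would be roughly as follows. The regular columns reuse the assumed schema from the earlier sketch; dept and gender as partition columns come from the article itself.

  -- Partition columns are declared separately from the regular columns.
  CREATE TABLE hive_emp_dynpart (
    emp_id   INT,
    emp_name STRING,
    salary   INT
  )
  PARTITIONED BY (dept STRING, gender STRING)
  STORED AS PARQUET;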
Set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict so that the data can be loaded into the table's partitions dynamically.
[Screenshot 7]
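These are standard Hive properties and are set in the session before running the insert:

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;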
Next, insert the data from the hive_emp1 table into the hive_emp_dynpart table, populating the partitions dynamically.
[Screenshot 8]
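The dynamic-partition insert would look roughly like this; with dynamic partitioning, the partition columns (dept and gender, fed here from dept and sex) must come last in the SELECT list. The column names are the assumed ones from the earlier sketch.

  INSERT INTO TABLE hive_emp_dynpart PARTITION (dept, gender)
  SELECT emp_id, emp_name, salary, dept, sex
  FROM hive_emp1;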

Issue:

When querying the hive_emp_dynpart table with a filter on one of the partition columns, you will get the following error; filters on all the regular columns work fine.
[Screenshot 9]
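A query of this shape, filtering on one of the partition columns, is what triggers the failure (the same query with a filter on a regular column runs fine):

  -- Filtering on the partition column 'gender' fails on affected Hive versions.
  SELECT * FROM hive_emp_dynpart WHERE gender = 'F';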

Error Description:

This is a known bug in Apache Hive's filtering option (HIVE-11401): filtering on a partition column fails when the partitioned table is stored as Parquet.

Resolution:

A known workaround is to disable predicate pushdown by setting property hive.optimize.index.filter to false.
[Screenshot 10]
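The workaround is a single session-level setting:

  SET hive.optimize.index.filter=false;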
Now query the table using the same command.
[Screenshots 11 and 12]

Conclusion:

You need to set this property to false in every Hive session before executing the query, since SET only applies to the current session; alternatively, the property can be configured in hive-site.xml to make the workaround permanent.

Hive update, delete and insert ERROR in CDH 5.4.2


Getting the following error in CDH 5.4.2:

Error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

Solution:

hive.auto.convert.join.noconditionaltask.size = 10000000;
hive.support.concurrency = true;
hive.enforce.bucketing = true;
hive.exec.dynamic.partition.mode = nonstrict;
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive.compactor.initiator.on = true;
hive.compactor.worker.threads = 1 ;
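As a sketch, the same properties expressed as Hive SET statements are shown below. In practice the compactor properties belong in hive-site.xml on the metastore side, while the others can be set per session; note also that Hive ACID update/delete only works on bucketed ORC tables created with TBLPROPERTIES ('transactional'='true').

  SET hive.auto.convert.join.noconditionaltask.size=10000000;
  SET hive.support.concurrency=true;
  SET hive.enforce.bucketing=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
  -- The compactor settings below take effect on the metastore service (hive-site.xml),
  -- not in a client session; they are listed here only for completeness.
  SET hive.compactor.initiator.on=true;
  SET hive.compactor.worker.threads=1;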

Import SQL Query Results into HDFS


We can use Sqoop to import the results of a SQL query from a relational database into HDFS.
Let us take the following MySQL table called student.
Here we will import only the rows where std_id = 103 into HDFS using the Sqoop import command.

Let us run the above Sqoop import and look at the execution flow when using --query.
Note that the --table parameter is not needed when --query is used.
If your query has a WHERE clause, make sure it ends with AND $CONDITIONS and that the whole query is enclosed in single or double quotes.
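A hypothetical invocation is sketched below; the JDBC URL, database name (testdb), credentials, and target directory are all placeholders. Single quotes around the query keep the shell from expanding $CONDITIONS, and -m 1 (or --split-by) is required when importing a free-form query.

  sqoop import \
    --connect jdbc:mysql://localhost/testdb \
    --username root -P \
    --query 'SELECT * FROM student WHERE std_id = 103 AND $CONDITIONS' \
    --target-dir /user/hadoop/student_103 \
    -m 1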
Let us check the output in HDFS.
