This article is about a bug in Hive's filtering option that appears when a partitioned table is stored as Parquet and queried on a partition column. In this post, our big data developers show how to reproduce the bug and work around it.
Big data is a term for data sets that are so large, complex, or rapidly growing that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy.
Quantifying Big Data
- National medical records, weather images
- Machine logs, RFID readers, sensor networks, rich media
- Retail and enterprise transactions
- Publicly available data from different sources
- Airline and share-market data
Technology:
There are different ecosystems used to process large datasets in the big data world, but the most commonly used component of the Hadoop ecosystem is Apache Hive.
Hive was developed by Facebook to analyze and extract useful information from its huge data sets, but it is now very popular in other organizations too, such as Netflix and FINRA.
Even as big data gains popularity day by day, some issues seem so obvious that they are hardly worth considering. We found one such issue in Apache Hive that is worth sharing.
Use-case:
Nowadays most of us use different techniques to optimize Hive queries, that is, to improve their performance. The two most common techniques are:
- Partitioning
- Storing data in Parquet format.
Partitioning is a well-known concept to anyone processing, analyzing, or aggregating data through Apache Hive, and the Parquet file format incorporates several features that make it highly suited to data warehouse-style workloads.
But most of us are unaware that Apache Hive fails on such a query when a partitioned table is stored in Parquet format and the query filters on a partition column. Let's take a detailed look at it.
Below is the pipe-delimited sample data present in HDFS, which we will load into a managed, non-partitioned Hive table.
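For illustration, a hypothetical pipe-delimited file matching the steps below (the original records are not reproduced here) could look like this, with columns empid, name, salary, dept, and sex:

    101|John|50000|IT|M
    102|Alex|45000|HR|M
    103|Maria|55000|IT|F
    104|David|40000|Sales|M
    105|Sara|60000|HR|F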
The steps below create a managed Hive table named "hive_emp1".
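A sketch of the DDL, assuming the hypothetical five-column schema shown above:

    CREATE TABLE hive_emp1 (
      empid  INT,
      name   STRING,
      salary INT,
      dept   STRING,
      sex    STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;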
Next, load the data from HDFS into the Hive table (hive_emp1) created above.
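Assuming the file sits at /user/hive/emp_data.txt in HDFS (a hypothetical path):

    LOAD DATA INPATH '/user/hive/emp_data.txt' INTO TABLE hive_emp1;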
Take a look at the data now present in the Hive table.
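A simple SELECT is enough to verify the load:

    SELECT * FROM hive_emp1;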
We have a few males and two females, represented by 'M' and 'F' respectively in the last column (sex).
Now we will create another table in Hive named "hive_emp_dynpart", which will be partitioned on two columns (dept and gender); the data in this table will be stored in Parquet format.
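A sketch of the DDL, again assuming the hypothetical schema; note that in Hive the partition columns go in the PARTITIONED BY clause, not in the column list:

    CREATE TABLE hive_emp_dynpart (
      empid  INT,
      name   STRING,
      salary INT
    )
    PARTITIONED BY (dept STRING, gender STRING)
    STORED AS PARQUET;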
Set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict so that the data can be loaded dynamically into the partitioned table.
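In the Hive shell:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;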
We will insert the data from the hive_emp1 table into the hive_emp_dynpart table, populating the partitions dynamically.
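With dynamic partitioning enabled, Hive takes the partition values from the last columns of the SELECT list, in the same order as the PARTITION clause (column names follow the hypothetical schema above):

    INSERT INTO TABLE hive_emp_dynpart PARTITION (dept, gender)
    SELECT empid, name, salary, dept, sex
    FROM hive_emp1;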
Issue:
While querying the hive_emp_dynpart table with a filter on one of the partition columns, you will get an error; filtering on any regular column works fine.
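For example, filtering on the gender partition column (the exact error text varies by Hive version, so it is not reproduced here):

    -- Fails with the HIVE-11401 error described below:
    SELECT * FROM hive_emp_dynpart WHERE gender = 'F';

    -- Filtering on a regular column works fine:
    SELECT * FROM hive_emp_dynpart WHERE salary > 50000;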
Error Description:
This is a known bug in Apache Hive's filtering option (HIVE-11401), triggered when a partitioned table is stored as Parquet.
Resolution:
A known workaround is to disable predicate pushdown by setting the property hive.optimize.index.filter to false.
Now query the table using the same command; this time it runs successfully.
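In the same session, reusing the hypothetical query from above:

    SET hive.optimize.index.filter=false;
    SELECT * FROM hive_emp_dynpart WHERE gender = 'F';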
Conclusion:
You need to set this property to false in every Hive session before executing such queries; the SET command does not persist across sessions.