This article is about a bug in Hive's filtering option that appears when a partitioned table is stored as Parquet and queried on a partition column. In this post, our big data developers show how to reproduce the bug and work around it.
Big data is a term for data sets that are so large, complex, or rapidly growing that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy.
Quantifying Big Data
- National medical records, weather images
- Machine logs, RFID readers, sensor networks, rich media
- Retail and enterprise transactions
- Publicly available data from different sources
- Airline and share-market data
Technology:
There are different ecosystems used to process large datasets in the big data world, but the most commonly used component of the Hadoop ecosystem is Apache Hive.
Hive was developed by Facebook to analyze and extract useful information from its huge data sets, but it is now very popular in other organizations too, such as Netflix and FINRA.
Even as big data gains popularity day by day, some issues seem so obvious that they are hardly worth considering. We found one such issue in Apache Hive that is worth sharing.
Use-case:
Nowadays most of us use different techniques to optimize Hive queries, that is, to improve their performance. The two most common techniques are:
- Partitioning
- Storing data in Parquet format.
Partitioning is a well-known concept to anyone processing, analyzing, or aggregating data through Apache Hive, and the Parquet file format incorporates several features that make it highly suited to data warehouse-style workloads.
But most of us are unaware that Apache Hive fails on such a query when a partitioned table is stored in Parquet format and the query filters on a partition column. Let's take a detailed look at it.
Below is the pipe-delimited sample data present in HDFS, which we will load into a managed, non-partitioned Hive table.
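For illustration, a hypothetical pipe-delimited file matching the steps below (the original records are not reproduced here) could look like this, with columns empid, name, salary, dept, and sex:

    101|John|50000|IT|M
    102|Alex|45000|HR|M
    103|Maria|55000|IT|F
    104|David|40000|Sales|M
    105|Sara|60000|HR|F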
The steps below create a managed Hive table named "hive_emp1".
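A sketch of the DDL, assuming the hypothetical five-column schema shown above:

    CREATE TABLE hive_emp1 (
      empid  INT,
      name   STRING,
      salary INT,
      dept   STRING,
      sex    STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;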
Next, load the data from HDFS into the Hive table (hive_emp1) created above.
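Assuming the file sits at /user/hive/emp_data.txt in HDFS (a hypothetical path):

    LOAD DATA INPATH '/user/hive/emp_data.txt' INTO TABLE hive_emp1;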
Take a look at the data now present in the Hive table.
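A simple SELECT is enough to verify the load:

    SELECT * FROM hive_emp1;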
We have a few males and two females, represented by 'M' and 'F' respectively in the last column (sex).
Now we will create another table in Hive named "hive_emp_dynpart", which will be partitioned on two columns (dept and gender); the data in this table will be stored in Parquet format.
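A sketch of the DDL, again assuming the hypothetical schema; note that in Hive the partition columns go in the PARTITIONED BY clause, not in the column list:

    CREATE TABLE hive_emp_dynpart (
      empid  INT,
      name   STRING,
      salary INT
    )
    PARTITIONED BY (dept STRING, gender STRING)
    STORED AS PARQUET;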
Set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict so that the data can be loaded dynamically into the partitioned table.
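In the Hive shell:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;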
We will insert the data from the hive_emp1 table into the hive_emp_dynpart table, populating the partitions dynamically.
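With dynamic partitioning enabled, Hive takes the partition values from the last columns of the SELECT list, in the same order as the PARTITION clause (column names follow the hypothetical schema above):

    INSERT INTO TABLE hive_emp_dynpart PARTITION (dept, gender)
    SELECT empid, name, salary, dept, sex
    FROM hive_emp1;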
Issue:
While querying the hive_emp_dynpart table with a filter on one of the partition columns, you will get an error; filtering on any regular column works fine.
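For example, filtering on the gender partition column (the exact error text varies by Hive version, so it is not reproduced here):

    -- Fails with the HIVE-11401 error described below:
    SELECT * FROM hive_emp_dynpart WHERE gender = 'F';

    -- Filtering on a regular column works fine:
    SELECT * FROM hive_emp_dynpart WHERE salary > 50000;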
Error Description:
This is a known bug in Apache Hive's filtering option (HIVE-11401), triggered when a partitioned table is stored as Parquet.
Resolution:
A known workaround is to disable predicate pushdown by setting the property hive.optimize.index.filter to false.
Now query the table using the same command; this time it runs successfully.
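In the same session, reusing the hypothetical query from above:

    SET hive.optimize.index.filter=false;
    SELECT * FROM hive_emp_dynpart WHERE gender = 'F';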
Conclusion:
You need to set this property to false in every Hive session before executing such queries; the SET command does not persist across sessions.