How can I properly load data into Hive using a spark?

advertisements

I want to input data which looks as-

"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""

My create table statement is as-

sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")

When I run the statement-

sqlContext.sql("from dummy11 select age").show()

OR

sqlContext.sql("from dummy11 select y").show()

It returns NULL value instead of correct values, though other values are visible

So how do I correct this??


As you are using Hive QL syntax, you need to validate the input data before processing.

In your data, few records have lesser columns - than the actual columns defined in DDL.

So, for those records, the rest columns (from last) are set as NULL; as that row does not have enough values.

That's why, the last column y has values NULL.

Also, in DDL, first field's data type is INT; but in record, first field values are:

"58
"44
"33

Because of ", the values are not type-casted to INT; setting field value as NULL.

As per the DDL and data - you provided, values are getting set as:

age             "58
job             ""management""
marital         ""married""
education       ""tertiary""
default         ""no""
housing         2143
loan            ""yes""
contact         ""no""
month           ""unknown""
day_of_week     5
duration        ""may""
campaign        261
pday            1
previous        -1
poutcome        0
emp_var_rate    ""unknown""
cons_price_idx  ""no""
cons_price_idx  NULL
cons_conf_idx   NULL
euribor3m int   NULL
nr_employed     NULL
y               NULL

Check the NULL values for last 5 columns.

So, if that is not expected, you need to validate the data first before proceeding.

And for the column age, if you need it in INT type, cleanse the data to remove unwanted " character.


WORKAROUND

As workaround, you can define age as STRING at beginning, as use spark transformations to parse the first field and convert it to INT

import org.apache.spark.sql.functions._
val ageInINT = udf { (make: String) =>
  Integer.parseInt(make.substring(1))
}
df.withColumn("ageInINT", ageInINT(df("age"))).show

Here df is your dataframe created while executing the hive DDL with column age as sTRING.

Nnow, you can perform operation on new column ageInINT rather than column age with INTEGER values.