Incremental data selection from multiple tables in Hive

I have five tables (A, B, C, D, E) in a Hive database, and I have to union the data from these tables based on logic over the column "id". The condition is: select * from A UNION select * from B (only ids not already in A) UNION select * from C (only ids not already in A or B), and so on for D and E.
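One way to express this precedence in Spark is a chain of left_anti joins, so each table contributes only the rows whose ids are not already covered by the tables before it. A minimal sketch, assuming the five tables share one schema and Spark 2.0+ (for the left_anti join type):

```scala
val a = spark.table("A")
val b = spark.table("B")
val c = spark.table("C")

// ids already covered after each step
val seenAfterA  = a.select("id")
val seenAfterAB = seenAfterA.union(b.select("id"))

// left_anti keeps only rows of the left side whose id has no match on the right
val result = a
  .union(b.join(seenAfterA, Seq("id"), "left_anti"))
  .union(c.join(seenAfterAB, Seq("id"), "left_anti"))
// ...extend the same pattern for D and E
```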

How to convert RDD[List[Int]] to DataFrame?

I have an RDD[List[Int]] and I don't know the length of each List[Int]. I want to convert the RDD[List[Int]] to a DataFrame. How should I do it? This is my input: val l1 = Array(1,2,3,4); val l2 = Array(1,2,3,4); val Lz = Seq(l1,l2); val rdd1 = sc.parallelize(Lz,2), and this is my expected output: …
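A sketch of one way to do this: keep each list as a single array-typed column, which works even when the lists have different lengths. This assumes a SparkSession named `spark` is in scope.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lists-to-df").getOrCreate()
import spark.implicits._

val l1 = List(1, 2, 3, 4)
val l2 = List(1, 2, 3, 4)
val rdd1 = spark.sparkContext.parallelize(Seq(l1, l2), 2)

// Wrap each list in a Tuple1 so the implicit product encoder applies,
// producing one row per list with a single array column.
val df = rdd1.map(Tuple1(_)).toDF("values")
df.printSchema()
// root
//  |-- values: array (nullable = true)
//  |    |-- element: integer (containsNull = false)
```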

How to add an empty map type column to a DataFrame?

I want to add a new map type column to a DataFrame, like this: |-- cMap: map (nullable = true) | |-- key: string | |-- value: string (valueContainsNull = true). I tried the code df.withColumn("cMap", lit(null).cast(MapType)).printSchema and got an error: …
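The cast fails because MapType on its own is the abstract type constructor; the cast needs concrete key and value types. A sketch:

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{MapType, StringType}

// Cast the null literal to a fully specified map<string,string>.
val withMap = df.withColumn("cMap", lit(null).cast(MapType(StringType, StringType)))
withMap.printSchema()
// |-- cMap: map (nullable = true)
// |    |-- key: string
// |    |-- value: string (valueContainsNull = true)
```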

How to calculate a moving median in a DataFrame?

Is there a way to calculate a moving median for an attribute in a Spark DataFrame? I was hoping it would be possible to compute a moving median using a window function (by defining a window with rowsBetween(0,10)), but there is no built-in function to calculate a median over a window.
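There is no exact moving-median builtin, but on Spark 3.1+ the native percentile_approx aggregate can be evaluated over a window frame, which gives an approximate moving median. A sketch with illustrative column names `ts` and `value`:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.expr

// A frame covering the current row and the 10 rows after it, as in the question.
val w = Window.orderBy("ts").rowsBetween(0, 10)

// percentile_approx(value, 0.5) approximates the median of each frame.
val withMedian = df.withColumn("moving_median",
  expr("percentile_approx(value, 0.5)").over(w))
```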

Attach files in Spark

I have a file like this, code_count.csv: code,count,year AE,2,2008 AE,3,2008 BX,1,2005 CD,4,2004 HU,1,2003 BX,8,2004. And another file like this, details.csv: code,exp_code AE,Aerogon international BX,Bloomberg Xtern CD,Classic Divide HU,Honololu. I want the exp_code attached to each row of the first file …
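A sketch of attaching the description: read both CSVs with headers and join on the shared code column.

```scala
// header=true turns the first line of each file into column names.
val counts  = spark.read.option("header", "true").csv("code_count.csv")
val details = spark.read.option("header", "true").csv("details.csv")

// A left join keeps every count row, even for codes missing from details.csv.
val joined = counts.join(details, Seq("code"), "left")
joined.show()
```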

Spark DataFrame: strange error parsing a CSV with non-US number format

I have a DataFrame in Spark containing a column where df.select("y_wgs84").show gives: +----------------+ | y_wgs84| +----------------+ |47,9882373902965| |47,9848921211406| |47,9781530280939| |47,9731284286555| |47,9889813907224| |47,9881440349524| … The values use a comma as the decimal separator.
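A cast to double returns null for "47,98..." style strings because Spark expects a dot as the decimal separator. A sketch of one fix: rewrite the comma, then cast.

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// Swap the decimal comma for a dot so the string parses as a double.
val fixed = df.withColumn("y_wgs84",
  regexp_replace(col("y_wgs84"), ",", ".").cast("double"))
```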

Loading Spark XML File

How can I load XML files in Spark 2.0? val rd = spark.read.format("com.databricks.spark.xml").load("C:/Users/kumar/Desktop/d.xml") I'm getting an error that com.databricks.spark.xml is not available: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml.
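The spark-xml data source is not bundled with Spark, so the class is only found if the package is on the classpath. A sketch, where the exact version coordinates are an assumption to match against your Scala version:

```scala
// Start the shell with the package, e.g.:
//   spark-shell --packages com.databricks:spark-xml_2.11:0.4.1

val rd = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")  // rowTag names the XML element that becomes one row
  .load("C:/Users/kumar/Desktop/d.xml")
```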

Spark Streaming - TIMESTAMP field-based processing

I'm pretty new to Spark Streaming and I need some basic clarification that I couldn't fully get from the documentation. The use case is that I have a set of files containing dumped EVENTS, and each event already carries a TIMESTAMP field inside it …
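With Structured Streaming you can group by the event's own TIMESTAMP rather than by arrival time, using a watermark to bound lateness. A sketch with illustrative names (`eventSchema`, the path, and a `timestamp` column are assumptions):

```scala
import org.apache.spark.sql.functions.{col, window}

// File-based streaming sources require an explicit schema.
val events = spark.readStream
  .schema(eventSchema)
  .json("/path/to/event/dumps")

// Windowed counts keyed on event time, tolerating 10 minutes of lateness.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()
```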

How to access nested fields in a DataFrame (.proto / ScalaPB)?

The following is my DataFrame schema: root |-- name: string (nullable = true) |-- addresses: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- street: string (nullable = true) | | |-- city: string (nullable = true). I want to output …
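A sketch of one way to flatten this: explode the array into one row per address, then reach into the struct with dot notation.

```scala
import org.apache.spark.sql.functions.{col, explode}

val flat = df
  .select(col("name"), explode(col("addresses")).as("addr"))   // one row per address
  .select(col("name"), col("addr.street"), col("addr.city"))   // pull struct fields out
```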

How to use java.time.LocalDate in a Cassandra Spark request?

We have a table in Cassandra with a column start_time of type date. When we execute the following code: val resultRDD = inputRDD.joinWithCassandraTable(KEY_SPACE, TABLE).where("start_time = ?", java.time.LocalDate.now) we get the following error: com.data…
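One workaround sketch: older versions of the connector bind the Java driver's own LocalDate rather than java.time.LocalDate, so convert before passing it to where.

```scala
import com.datastax.spark.connector._
import com.datastax.driver.core.{LocalDate => DriverLocalDate}

// Convert java.time.LocalDate into the driver's LocalDate representation.
val now = java.time.LocalDate.now
val driverDate = DriverLocalDate.fromYearMonthDay(
  now.getYear, now.getMonthValue, now.getDayOfMonth)

val resultRDD = inputRDD.joinWithCassandraTable(KEY_SPACE, TABLE)
  .where("start_time = ?", driverDate)
```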

Apache Spark - Back-end servers

I've developed a reporting application in PHP. The application is built with HTML, CSS, JavaScript libraries, a charting library (Highcharts), and MySQL to store data. The user chooses some options in the front end and clicks a "Submit" button …

Search for terms in DataFrames

I'm very new to Apache Spark and this is mostly an exercise for myself. I have two JSON files. File 1 (companies.json): [ {"symbol":…, "name":…, "description":…}, … ] File 2 (emails.json): [ {"from":…, "to":…, …} …
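A sketch of one way to start: load each file (multiLine because each file is a single JSON array, Spark 2.2+) and join emails to companies wherever the body mentions the company name. The `body` column is an assumption about emails.json.

```scala
// multiLine lets one JSON value span multiple lines, as in a JSON array file.
val companies = spark.read.option("multiLine", "true").json("companies.json")
val emails    = spark.read.option("multiLine", "true").json("emails.json")

// Keep email/company pairs where the email body contains the company name.
val mentions = emails.join(companies,
  emails("body").contains(companies("name")))
```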

Weird results when showing with SQL: Spark

I am trying to do some analysis with Spark. I tried the same query with foreach, which shows the results correctly, but if I use show or run it as SQL the behavior is weird: it does not show anything. sqlContext.sql("select distinct device from TestTable1 where id = 2…
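A debugging sketch for narrowing this down: print the result with truncation disabled and compare it against the rows collected to the driver (the id literal below follows the excerpt, which cuts off).

```scala
val result = sqlContext.sql(
  "select distinct device from TestTable1 where id = 2")

result.show(false)                  // show full cell contents, no truncation
result.collect().foreach(println)   // the same rows, printed on the driver
```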

Convert a PySpark string column to date format

I have a PySpark DataFrame with a string column in the format MM-dd-yyyy, and I am attempting to convert this into a date column. I tried df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone help?
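The nulls come from the default parser expecting yyyy-MM-dd. From Spark 2.2 the format can be passed directly to to_date, and the same two-argument form exists in PySpark; a sketch in Scala syntax:

```scala
import org.apache.spark.sql.functions.{col, to_date}

// Tell the parser the actual layout of the string column.
val parsed = df.select(to_date(col("STRING_COLUMN"), "MM-dd-yyyy").as("new_date"))
```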

Understanding the physical plan

I'm trying to understand physical plans in Spark, but some parts don't make sense to me because they look different from traditional RDBMS plans. For example, the plan below is for a query over a Hive table. The query is: select l_return…
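To line the physical plan up against the logical stages, it helps to print the whole plan chain. A sketch (the query is a stand-in, since the one in the question is cut off):

```scala
val q = spark.sql("select l_returnflag, count(*) from lineitem group by l_returnflag")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
q.explain(true)
```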

How to use countDistinct in Scala with Spark?

I've tried to use the countDistinct function, which should be available since Spark 1.5 according to the Databricks blog. However, I got the following exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function countDistinct…
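The usual fix is to call the function from org.apache.spark.sql.functions on a DataFrame instead of naming it inside a SQL string, which is where the "undefined function" error tends to appear in 1.5. Column names below are illustrative:

```scala
import org.apache.spark.sql.functions.countDistinct

val result = df.groupBy("someKey")
  .agg(countDistinct(df("someValue")).as("distinct_values"))
```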

How to create a permanent table in Spark SQL

In my project, I'm transferring data from MongoDB to a Spark SQL table for SQL-based queries. But Spark SQL only lets me create temporary tables. When I want to query something, the execution time is very high, because the data transfer and mapping operations take …
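A sketch of one way to persist the result: write it as a managed table through the Hive metastore so later sessions query it without re-importing from MongoDB. This assumes the SparkSession was built with enableHiveSupport(); the table name is illustrative.

```scala
// Write once; the table survives across sessions.
df.write.mode("overwrite").saveAsTable("mongo_import.my_table")

// Later queries read the stored table directly.
spark.sql("select count(*) from mongo_import.my_table")
```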