Getting `Any` in place of Seq[DataFrame]

I want to slightly improve the following code: val filePathsList = Seq("path_1","path_2") var seqdf = filePathsList.map(path => { try { sqlContext.read.format("json").load(path) } catch { case e: Exception => e.printStackTrac
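A minimal sketch of one way to keep the result typed as Seq[DataFrame], assuming the Spark 1.x sqlContext from the snippet is in scope (as in spark-shell). The catch branch returning Unit is what widens the element type to Any; wrapping the load in Try avoids that:

```scala
import scala.util.Try
import org.apache.spark.sql.DataFrame

// Placeholder paths from the question. Failed loads are simply dropped, so the
// result stays Seq[DataFrame] instead of Seq[Any].
val filePathsList = Seq("path_1", "path_2")
val seqDf: Seq[DataFrame] =
  filePathsList.flatMap(path => Try(sqlContext.read.format("json").load(path)).toOption)
```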

Attach (join) two files in Spark

I have a file like this, code_count.csv: code,count,year AE,2,2008 AE,3,2008 BX,1,2005 CD,4,2004 HU,1,2003 BX,8,2004 Another file like this, details.csv: code,exp_code AE,Aerogon international BX,Bloomberg Xtern CD,Classic Divide HU,Honololu I want the
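A minimal sketch of one way to combine the two files, assuming Spark 2.x (a `spark` session) and the header rows shown in the excerpt:

```scala
// Read both CSVs with headers, then join on the shared "code" column so each
// count row carries its expanded code name from details.csv.
val counts  = spark.read.option("header", "true").csv("code_count.csv")
val details = spark.read.option("header", "true").csv("details.csv")

val joined = counts.join(details, Seq("code"))
joined.show()
```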

Spark DataFrame: parsing a CSV with non-US number format gives a strange error

I have a DataFrame in Spark which contains a column: df.select("y_wgs84").show +----------------+ | y_wgs84| +----------------+ |47,9882373902965| |47,9848921211406| |47,9781530280939| |47,9731284286555| |47,9889813907224| |47,9881440349524| |
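A hedged sketch of one way to handle the decimal comma, assuming the `df` and column name from the excerpt: replace the comma with a dot, then cast to double.

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// "47,9882373902965" uses a decimal comma, so the CSV reader keeps it as a string
// (or fails to cast). Normalize the separator first, then cast.
val fixed = df.withColumn("y_wgs84",
  regexp_replace(col("y_wgs84"), ",", ".").cast("double"))
```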

Addition of vectors present in two different RDDs (Scala Spark)

I have two RDDs with this structure org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] Here each row of the RDD contains a Long index and an org.apache.spark.mllib.linalg.Vector. I want to add each component of the Vector into
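A minimal sketch of one way to add the vectors element-wise, assuming the two RDDs share the same Long indices:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Join the two RDDs on the Long key, then add the paired vectors component by
// component via their Array representations.
def addVectors(rdd1: RDD[(Long, Vector)], rdd2: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  rdd1.join(rdd2).mapValues { case (v1, v2) =>
    Vectors.dense(v1.toArray.zip(v2.toArray).map { case (a, b) => a + b })
  }
```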

Spark Dataframe / Dataset: generic conditional cumulative sum

I have a DataFrame which has a few attributes (C1 to C2), an offset (in days) and a few values (V1, V2). val inputDF = spark.sparkContext.parallelize(Seq((1,2,30, 100, -1),(1,2,30, 100, 0), (1,2,30, 100, 1),(11,21,30, 100, -1),(11,21,30, 100, 0), (11,
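A hedged sketch of one way to phrase a conditional cumulative sum with a window, assuming Spark 2.1+; the key columns (C1, C2), ordering column (offset) and summed value (V1) follow the excerpt's naming, while the `offset <= 0` condition is only an illustrative placeholder:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

// Running sum per (C1, C2) group, ordered by offset, counting only rows that
// satisfy the condition and contributing 0 otherwise.
val w = Window
  .partitionBy("C1", "C2")
  .orderBy("offset")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withCumSum = inputDF.withColumn(
  "cumV1",
  sum(when(col("offset") <= 0, col("V1")).otherwise(lit(0))).over(w))
```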

Converting yyyymmdd to MM-dd-yyyy format in PySpark

I have a large DataFrame df containing a date column in the format yyyymmdd; how can I convert it into MM-dd-yyyy in PySpark? from datetime import datetime from pyspark.sql.functions import col,udf from pyspark.sql.types import DateType rdd = sc
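The question targets PySpark; the same built-in DataFrame functions exist on the Scala side, shown here to keep the examples in one language. A sketch assuming Spark 2.2+ (to_date with a format string) and a column named "date" holding yyyymmdd values:

```scala
import org.apache.spark.sql.functions.{col, date_format, to_date}

// Parse the yyyyMMdd value into a date, then render it back as MM-dd-yyyy,
// without needing a Python/Scala UDF.
val converted = df.withColumn(
  "date",
  date_format(to_date(col("date").cast("string"), "yyyyMMdd"), "MM-dd-yyyy"))
```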

Spark DataFrame column-name case sensitivity

When I am querying DataFrames in spark-shell (version 1.6), the column names are case insensitive. In spark-shell: val a = sqlContext.read.parquet("<my-location>") a.filter($"name" <=> "andrew").count() a.filter($"
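A sketch using the Spark 1.6 SQLContext API from the excerpt: column-name case sensitivity is governed by the spark.sql.caseSensitive conf, and flipping it changes the behaviour described above.

```scala
import sqlContext.implicits._

// Turn strict case matching on; with this set, "name" must match the column's exact case.
sqlContext.setConf("spark.sql.caseSensitive", "true")

val a = sqlContext.read.parquet("<my-location>")
a.filter($"name" <=> "andrew").count()
```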

Multiple aggregations in Spark Structured Streaming

I would like to do multiple aggregations in Spark Structured Streaming. Something like this: read a stream of input files (from a folder); perform aggregation 1 (with some transformations); perform aggregation 2 (and more transformations). When I run th
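A sketch of the pipeline shape described above, assuming Spark 2.x, a JSON folder source and a hypothetical schema value `inputSchema`. Note that chaining a second aggregation onto a streaming aggregation is rejected by Structured Streaming in these versions as an unsupported operation:

```scala
import org.apache.spark.sql.functions.col

// Streaming source reading JSON files dropped into a folder (path is a placeholder).
val input = spark.readStream.schema(inputSchema).json("path/to/input-folder")

val agg1 = input.groupBy(col("key")).count()       // a single streaming aggregation works
// val agg2 = agg1.groupBy(col("count")).count()   // a second aggregation on the stream raises AnalysisException
```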

How to sort a column with Date and Time values in Spark?

Note: I have this as a DataFrame in Spark. These Time/Date values constitute a single column in the DataFrame. Input: 04-NOV-16 03.36.13.000000000 PM 06-NOV-15 03.42.21.000000000 PM 05-NOV-15 03.32.05.000000000 PM 06-NOV-15 03.32.14.000000000 AM Expec
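A hedged sketch, assuming Spark 2.2+ and that the values live in a string column named "event_time" (the excerpt does not give the column name): parse into a real timestamp, then sort on that instead of the raw string.

```scala
import org.apache.spark.sql.functions.{col, to_timestamp}

// Pattern matches the sample values: two-digit year, dotted 12-hour time with an AM/PM marker.
val parsed = df.withColumn("event_ts",
  to_timestamp(col("event_time"), "dd-MMM-yy hh.mm.ss.SSSSSSSSS a"))

val sorted = parsed.orderBy(col("event_ts"))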

Search for terms in DataFrames

I'm very new to Apache Spark and this is mostly an exercise for myself. I have two JSON files. File 1) companies.json: [ {"symbol":...,"name":...,"description":...} . . ] File 2) emails.json: [ {"from":...,"to"
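A hedged sketch of one way to search for company names inside emails, assuming Spark 2.2+ (the files are JSON arrays, hence the multiLine option). The companies/emails field names follow the excerpt; the email "body" field and the cross-join approach are assumptions (reasonable when the companies list is small):

```scala
// Load both JSON array files, then keep the (email, company) pairs where the
// email body mentions the company name.
val companies = spark.read.option("multiLine", "true").json("companies.json")
val emails    = spark.read.option("multiLine", "true").json("emails.json")

val mentions = emails
  .crossJoin(companies)
  .filter(emails("body").contains(companies("name")))
```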

Write a Spark DataFrame in CSV format with partitions

I'm trying to write a DataFrame in Spark to an HDFS location, and I expect that if I add the 'partitionBy' notation, Spark will create partition folders (similar to writing in Parquet format) in the form "partition_column_name=partition_value"
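A short sketch, assuming Spark 2.x: partitionBy is supported by the CSV writer as well and produces the same "column=value" sub-folder layout as Parquet.

```scala
df.write
  .partitionBy("partition_column_name")   // placeholder column name from the question
  .option("header", "true")
  .csv("hdfs:///path/to/output")          // hypothetical HDFS path
```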

Randomly shuffle a column in a Spark RDD or DataFrame

Is there any way I can shuffle a column of an RDD or DataFrame such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task. While one cannot just shuffle a single column directly - it
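A hedged sketch of one workaround, assuming Spark 2.x and a target column named "c" (a hypothetical name): pull the column out, order it randomly, give both halves a row index, and join them back together.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.LongType

// Attach a stable row index so the shuffled column can be re-aligned with the rest.
def withIndex(d: DataFrame): DataFrame = {
  val indexed = d.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  d.sparkSession.createDataFrame(indexed, d.schema.add("_idx", LongType))
}

val shuffledCol = df.select("c").orderBy(rand())
val result = withIndex(df.drop("c"))
  .join(withIndex(shuffledCol), "_idx")
  .drop("_idx")
```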

How can I pass additional parameters to UDFs in Spark SQL?

I want to parse the date columns in a DataFrame, and for each date column, the resolution for the date may change (i.e. 2011/01/10 => 2011/01 if the resolution is set to "Month"). I wrote the following code: def convertDataFrame(dataframe: D
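A sketch of one common pattern: bake the extra "resolution" parameter into the UDF by closing over it in a curried function. The slash-separated date format and the "date" column name are assumptions based on the example in the excerpt.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Each call to truncateDate(...) builds a UDF with its own resolution baked in.
def truncateDate(resolution: String) = udf { date: String =>
  resolution match {
    case "Year"  => date.split("/")(0)
    case "Month" => date.split("/").take(2).mkString("/")
    case _       => date
  }
}

val byMonth = df.withColumn("date", truncateDate("Month")(col("date")))
```

Another common route is to pass the parameter as a literal column, e.g. a two-argument UDF called as myUdf(col("date"), lit("Month")).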

Spark SQL Queries vs DataFrame Functions

To get good performance with Spark, I'm wondering whether it is better to use SQL queries via SQLContext or DataFrame functions like df.select(). Any idea? :) There is no performance difference whatsoever. Both met
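A small sketch of why the two routes are equivalent, assuming Spark 2.x and placeholder names ("people", "name", "age"): both produce the same logical plan, so Catalyst optimizes them identically, which explain(true) makes visible.

```scala
import org.apache.spark.sql.functions.col

df.createOrReplaceTempView("people")

val viaSql = spark.sql("SELECT name FROM people WHERE age > 21")
val viaDsl = df.filter(col("age") > 21).select("name")

// The analyzed/optimized plans printed here end up the same for both.
viaSql.explain(true)
viaDsl.explain(true)
```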

Spark aggregation and custom aggregation

I have data as below: n1 d1 un1 mt1 1 n1 d1 un1 mt2 2 n1 d1 un1 mt3 3 n1 d1 un1 mt4 4 n1 d2 un1 mt1 3 n1 d2 un1 mt3 3 n1 d2 un1 mt4 4 n1 d2 un1 mt5 6 n1 d2 un1 mt2 3 I want to get the output as below: n1 d1 un1 0.75 n1 d2 un1 1.5 i.e., do a groupby on 1
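A sketch of the grouping shape only: group on the first three columns and compute one custom aggregate per group. The exact formula behind the expected 0.75 / 1.5 values is not recoverable from the excerpt, so the ratio of two conditional sums below is just a placeholder, and all column names are assumptions.

```scala
import org.apache.spark.sql.functions.{col, sum, when}

// groupBy the first three columns, then a single derived metric per group.
val result = df
  .groupBy("c1", "c2", "c3")
  .agg((sum(when(col("metric") === "mt3", col("value")))
      / sum(when(col("metric") === "mt4", col("value")))).as("ratio"))
```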

Spark: Is the DataFrame sample method uniform sampling?

I want to randomly choose a set number of rows from a DataFrame, and I know the sample method does this, but I am concerned about whether the randomness is uniform sampling. So, I was wondering if the sample method of Spark DataFrames is uniform or not
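A short sketch for context: without replacement, sample() keeps each row independently with the given probability (Bernoulli sampling), which is uniform over rows but only returns approximately fraction * n rows. For an exact number of uniformly chosen rows, ordering by rand() and taking a limit is a common alternative; the fraction, seed and limit below are placeholders.

```scala
import org.apache.spark.sql.functions.rand

val approx = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
val exactN = df.orderBy(rand(42L)).limit(100)
```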