How to load hiveContext in Zeppelin?

I am new to the Zeppelin notebook, but I noticed that, unlike in spark-shell, a hiveContext is not automatically created in Zeppelin when I start the notebook. And when I tried to manually load the hiveContext in Zeppelin like: import org.apache.spark
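A minimal sketch of creating the hiveContext by hand in a Zeppelin Scala paragraph, assuming Spark 1.x, that Zeppelin's %spark interpreter already provides `sc`, and that the Spark build includes Hive support:

```scala
// A minimal sketch, assuming Spark 1.x and that Zeppelin's %spark interpreter
// already provides the SparkContext as `sc`, with Hive support in the build.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW TABLES").show()
```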

Spark DataFrame: strange error parsing CSV with non-US number format

I have a DataFrame in Spark which contains a column of values that use a comma as the decimal separator. df.select("y_wgs84").show gives: +----------------+ | y_wgs84| +----------------+ |47,9882373902965| |47,9848921211406| |47,9781530280939| |47,9731284286555| |47,9889813907224| |47,9881440349524| |
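One way to handle the comma decimal separator is to normalise the string and cast afterwards; a minimal sketch, assuming `df` is the DataFrame and "y_wgs84" was read as a string column:

```scala
// Normalise the decimal separator, then cast the string column to double.
import org.apache.spark.sql.functions.{col, regexp_replace}

val parsed = df.withColumn(
  "y_wgs84",
  regexp_replace(col("y_wgs84"), ",", ".").cast("double")
)
parsed.printSchema()
```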

Mapping a Kafka partition to a specific Spark executor

So I need to specify how an executor should consume data from a Kafka topic. Let's say I have 2 topics: t0 and t1, with two partitions each, and two executors e0 and e1 (both can be on the same node, so the assign strategy does not work since in the case
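With the spark-streaming-kafka-0-10 direct stream, LocationStrategies.PreferFixed lets you pin each TopicPartition to a host (so it controls placement at the host level rather than per executor). A sketch only, where "host-e0"/"host-e1" are placeholder hostnames and `ssc` / `kafkaParams` are assumed to exist already:

```scala
// Pin each TopicPartition to a preferred host with PreferFixed; the hostnames
// are placeholders, and ssc / kafkaParams are assumed to be defined elsewhere.
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val partitions = Seq(
  new TopicPartition("t0", 0), new TopicPartition("t0", 1),
  new TopicPartition("t1", 0), new TopicPartition("t1", 1)
)

val hostMap: Map[TopicPartition, String] = Map(
  partitions(0) -> "host-e0", partitions(1) -> "host-e0",
  partitions(2) -> "host-e1", partitions(3) -> "host-e1"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferFixed(hostMap),
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams)
)
```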

pyspark saveAsTextFile works for python 2.7 but not 3.4

I'm running pyspark on an Amazon EMR cluster. I have a very simple test script to see if I can write data to s3 using spark-submit ... from pyspark import SparkContext sc = SparkContext() numbers = sc.parallelize(range(100)) numbers.saveAsTextFile("s

scala spark version mismatch

I am getting the below exception while running a word count program in Scala. Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps; at org.apache.spark.util.Utils$.get
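This NoSuchMethodError on scala.Predef usually means the project was compiled against a different Scala major version than the one Spark was built with. A minimal build.sbt sketch; the versions are examples only:

```scala
// Keep scalaVersion in line with the Scala major version of the Spark
// artifacts (and cluster); %% then selects the matching binary.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)
```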

PySpark: casting string to float when reading a csv file

I'm reading a csv file into a dataframe: datafram = spark.read.csv(fileName, header=True) but the data type in datafram is String, and I want to change the data type to float. Is there any way to do this efficiently? If you want to do the casting when reading the

Scala Data Analysis on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are: Customer Lifetime Value (CLV), centrality measures (degree, eigenvector, edge-betweenness
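As a starting point for the centrality part, a toy GraphX sketch computing vertex degrees; the vertices and edges here are made-up placeholders, and `sc` is assumed to be an existing SparkContext:

```scala
// Build a tiny graph and compute degree centrality with GraphX.
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))

val graph = Graph(vertices, edges)
graph.degrees.collect().foreach { case (id, deg) => println(s"vertex $id has degree $deg") }
```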

State change of Apache Spark Count

I am new to Apache Spark (PySpark) and would be glad to get some help resolving this problem. I am currently using PySpark 1.6 (I had to ditch 2.0 since there is no MQTT support). So, I have a data frame that has the following information, +----------

SLF4J logger missing on Spark workers

I am trying to run a job via spark-submit. The error that results from this job is: Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDe
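The NoClassDefFoundError for org/slf4j/Logger usually means the SLF4J API jar is not on the classpath that spark-submit builds for the job. A minimal build.sbt sketch of bundling it with the application (versions are examples only); alternatively the jars can be shipped at submit time with --jars:

```scala
// Bundle the SLF4J API (and a binding) with the application jar so the
// workers can resolve org.slf4j.Logger at class-loading time.
libraryDependencies ++= Seq(
  "org.slf4j" % "slf4j-api"     % "1.7.25",
  "org.slf4j" % "slf4j-log4j12" % "1.7.25"
)
```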

DataFrame creation in Scala

wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word']) This is a way of creating a DataFrame from a list of tuples in Python. How can I do this in Scala? I'm new to Scala and I'm facing a problem in figu
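A minimal Scala equivalent, assuming a Spark 1.6-style sqlContext as in the Python snippet above; toDF comes from the sqlContext implicits:

```scala
// Create a one-column DataFrame from a local sequence.
import sqlContext.implicits._

val wordsDF = Seq("cat", "elephant", "rat", "rat", "cat").toDF("word")
wordsDF.show()
```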

How to build DStream from continuous RDDs?

I'm reading data from ElasticSearch into Spark every 5 minutes, so there will be an RDD every 5 minutes. I hope to construct a DStream based on these RDDs, so that I can get a report for data within the last 1 day, last 1 hour, last 5 minutes and so on. To constr
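One option to sketch is StreamingContext.queueStream, feeding it the RDD produced by each 5-minute read; here `loadFromEs()` is a placeholder for the ElasticSearch query and `sc` is an existing SparkContext:

```scala
// Turn periodically produced RDDs into a DStream via a queue-backed stream,
// then window over it for the longer-range reports.
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc      = new StreamingContext(sc, Seconds(300))
val rddQueue = new mutable.Queue[RDD[String]]()

val stream = ssc.queueStream(rddQueue, oneAtATime = true)
stream.window(Seconds(3600), Seconds(300)).count().print()

ssc.start()
// Then, every 5 minutes: rddQueue += loadFromEs()
```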

Apache Spark RDD[Vector] Immutability Problem

I know that RDDs are immutable and therefore their value cannot be changed, but I see the following behaviour: I wrote an implementation of the FuzzyCMeans (https://github.com/salexln/FinalProject_FCM) algorithm and now I'm testing it, so I run the follow
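For reference, a small sketch of what immutability does and does not guarantee: transformations return new RDDs rather than modifying the original, so apparent "changes" usually come from mutable objects referenced by the RDD rather than from the RDD itself:

```scala
// Transformations return new RDDs and never modify the original. If the
// elements themselves are mutable objects (e.g. arrays or mutable vectors),
// changing them in place is a separate issue from RDD immutability.
val original = sc.parallelize(Seq(1.0, 2.0, 3.0))
val scaled   = original.map(_ * 10)

println(original.collect().mkString(","))  // 1.0,2.0,3.0 -- unchanged
println(scaled.collect().mkString(","))    // 10.0,20.0,30.0
```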

Does caching in Spark Streaming increase performance?

So I'm performing multiple operations on the same RDD in a Kafka stream. Is caching that RDD going to improve performance? When running multiple operations on the same dstream, cache will substantially improve performance. This can be observed on the
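A sketch of the pattern, assuming a kafka010 DStream `messages` of ConsumerRecord[String, String] and several independent outputs run each batch:

```scala
// Cache the derived DStream so each batch's RDD is computed once and reused
// by every downstream output.
val values = messages.map(_.value())
values.cache()  // or values.persist(StorageLevel.MEMORY_ONLY_SER)

values.filter(_.contains("ERROR")).count().print()
values.map(v => (v, 1)).reduceByKey(_ + _).print()
```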

Specifying an External Configuration File for Apache Spark

I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime. Edit: It turns out I was pretty confused about how to go about doing this. Ignore the rest of this question.
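For the record, spark-submit has a built-in --properties-file flag for exactly this. A minimal Scala sketch of doing the equivalent by hand, with spark-app.properties as a hypothetical file name:

```scala
// Load key/value pairs from a properties file and apply them to the SparkConf
// before creating the context.
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.{SparkConf, SparkContext}

val props = new Properties()
props.load(new FileInputStream("spark-app.properties"))

val conf = new SparkConf()
props.asScala.foreach { case (k, v) => conf.set(k, v) }

val sc = new SparkContext(conf)
```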

Convert CSV to RDD

I tried the accepted solution in "How do I convert csv file to rdd"; I want to print out all the users except "om": val csv = sc.textFile("file.csv") // original file val data = csv.map(line => line.split(",").map(elem =>
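A minimal sketch of the filter, assuming the user name is the first comma-separated field of each row:

```scala
// Keep every row whose first field is not "om".
val csv  = sc.textFile("file.csv")
val rows = csv.map(line => line.split(",").map(_.trim))

val withoutOm = rows.filter(fields => fields(0) != "om")
withoutOm.collect().foreach(fields => println(fields.mkString(",")))
```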

Spark - What language should I use?

Currently Spark supports several languages for using its functionality, e.g., Scala, Java, and Python, but which one should I choose to work with Spark? Can someone explain the pros and cons of using each language on Spark? I have a little experience with Sp

Spark UnsupportedOperationException: empty collection

Does anyone know possible causes of this error while trying to execute Spark MLlib ALS using the hands-on lab provided by Databricks? 14/11/20 23:33:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/11/20 23:33:39 WARN SizeEsti
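For what it's worth, this message typically comes from running an action that needs at least one element on an empty RDD; a minimal sketch of the cause and a guard:

```scala
// Actions that require at least one element (reduce, first, max, ...) throw
// "empty collection" on an empty RDD; isEmpty lets you guard against it.
val empty = sc.parallelize(Seq.empty[Double])
// empty.reduce(_ + _)  // java.lang.UnsupportedOperationException: empty collection

val total = if (empty.isEmpty()) 0.0 else empty.reduce(_ + _)
println(total)
```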

PySpark Drop Rows

How do you drop rows from an RDD in PySpark? Particularly the first row, since that tends to contain column names in my datasets. From perusing the API, I can't seem to find an easy way to do this. Of course I could do this via Bash / HDFS, but I jus

Cannot add a Spark job on an EC2 cluster

I am new to Spark. I am able to launch, manage and shut down Spark clusters on Amazon EC2 from http://spark.incubator.apache.org/docs/0.7.3/ec2-scripts.html, but I am not able to add the below job on the cluster. package spark.examples import spark.SparkCont