Hive: Cannot retrieve a column that is not present in GROUP BY

I have a table in Hive called purchase_data that lists all the purchases made. I need to query this table and find the cust_id, product_id and price of the most expensive product purchased by each customer. The data in purchase_data table looks lik
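
One common way to get non-grouped columns alongside an aggregate is to compute the aggregate in a subquery and join it back to the table. The sketch below uses Python's built-in sqlite3 as a stand-in for Hive so that it runs anywhere. The sample rows and column types are assumptions, since the original data is not shown in full.

```python
import sqlite3

# Hypothetical sample rows (cust_id, product_id, price); not from the question.
rows = [
    (1, 101, 25.0),
    (1, 102, 40.0),
    (2, 101, 25.0),
    (2, 103, 10.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_data (cust_id INTEGER, product_id INTEGER, price REAL)"
)
conn.executemany("INSERT INTO purchase_data VALUES (?, ?, ?)", rows)

# Aggregate per customer first, then join back to recover product_id,
# which is not part of the GROUP BY.
query = """
SELECT p.cust_id, p.product_id, p.price
FROM purchase_data p
JOIN (
    SELECT cust_id, MAX(price) AS max_price
    FROM purchase_data
    GROUP BY cust_id
) m ON p.cust_id = m.cust_id AND p.price = m.max_price
ORDER BY p.cust_id
"""
top_purchases = conn.execute(query).fetchall()
print(top_purchases)
```

If a customer bought two products at the same maximum price, this form returns both rows.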

DB benchmarks: Cassandra vs BigTable vs Hadoop (s)

I am looking to evaluate the possibility of using Cassandra, BigTable, or a Hadoop-solution. Are there any places that have an up-to-date comparison on how these three compare and perform on a set of benchmark tests? I found a few from perhaps five y

Datanode does not start: incompatible clusterID Hadoop

I am trying to start the Hadoop 2.7.3 services, but the datanode does not start: Incompatible clusterIDs in /opt/hadoop/tmp/dfs/data: namenode clusterID = CID-4808098e-de31-469d-9761-9a4558fdaf70; datanode clusterID = CID-492135f3-fc08-46f1-a574-

Installing HAWQ on Redhat

I am installing HAWQ on RedHat servers provisioned on Amazon EC2. I already have HDP 2.3 setup on the cluster. I have cloned HAWQ from Github. First I run ./configure --prefix=/opt/hawq. In the second step, I run make. The dependencies are compiling

Hive UDAF to find the most frequently occurring column value

I am trying to create a Hive UDAF to find the most frequently appearing column (string) value (the exact column value is used, not a single character or substring). Assume that the following is my table called my_table (dashes are used for separating columns
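
The aggregation logic such a UDAF needs can be sketched in plain Python: count values per partition, merge the partial counts, then pick the most common value. This is only an illustration of the logic, not Hive UDAF code. The function names and sample values are hypothetical.

```python
from collections import Counter

def partial_counts(values):
    """Count exact column values in one partition, skipping NULLs (None)."""
    return Counter(v for v in values if v is not None)

def merge_counts(partials):
    """Combine partial counts from several partitions into one Counter."""
    merged = Counter()
    for part in partials:
        merged.update(part)
    return merged

def most_frequent(counts):
    """Return the value with the highest count, or None if nothing was counted."""
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical data split across two partitions.
chunk_a = ["red", "blue", "red", None]
chunk_b = ["blue", "red", "green"]
merged = merge_counts([partial_counts(chunk_a), partial_counts(chunk_b)])
mode_value = most_frequent(merged)
print(mode_value)
```

Keeping the counts mergeable is what lets the work be split across mappers and combined later.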

Pig error: unexpected character '\'

I am new to Hadoop and Pig. I am trying to run a sample Pig script in a CentOS 6 environment on VMware: records = LOAD '2013_subset.csv' USING PigStorage(',') AS (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,\ CRSArrTime,UniqueCarrier,Fli

How to write a Hive query with GROUP BY

This is my hive table: course dept subject status btech cse java pass btech cse hadoop fail btech cse cg detained btech cse cc pass btech it daa pass btech it wt pass btech it cnn pass mba hr hrlaw pass mba hr hrguid absent mtech cs java pass mtech c

DELETE FROM tablename Cloudera Impala

I'm new to Impala, and I'm trying to understand how to delete records from a table. I've tried looking for delete commands, but didn't find understandable instructions. This is my table structure: create table Installs (BrandID INT, Publish

Kafka | Cannot publish data to broker - ClosedChannelException

I am trying to run a simple Kafka producer/consumer example on HDP but I am getting the exception below. [2016-03-03 18:26:38,683] WARN Fetching topic metadata with correlation id 0 for topics [Set(page_visits)] from broker [BrokerEndPoint(0,sandbox.hortonworks.

Lambda Architecture - Why the Batch Process Layer

I am going through the lambda architecture and trying to understand how it can be used to build fault-tolerant big data systems. I am wondering how the batch layer is useful when everything can be stored in the realtime view and the results generated from it? is i

Hadoop - LeaseExpiredException

I have multiple compressed files, and each compressed file contains 8 XML files of size 5-10 KB. I took this data for testing purposes; the live data has thousands of XML files. I wrote a map-only program to uncompress the compressed files for(FileStatus

Hive: find the top n pages by plays

We have a table in which we want to store data for the top 100 pages. The destination table top100Pages has the columns uid string, mid string, pageurl string, plays string, device string. To fill this table, I can run: SELECT uid, mid,pageurl,sum(plays),de
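
The underlying top-n logic (sum the plays per page, then keep the n largest totals) can be sketched in plain Python. This is a simplified illustration under stated assumptions: it groups by pageurl only, the sample rows are invented, and plays is converted from string to int because the table stores it as a string.

```python
import heapq
from collections import defaultdict

def top_pages(rows, n):
    """Sum plays per pageurl and return the n pages with the highest totals."""
    totals = defaultdict(int)
    for pageurl, plays in rows:
        totals[pageurl] += int(plays)  # plays arrives as a string
    # nlargest avoids fully sorting every page when n is small.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

# Hypothetical (pageurl, plays) rows.
sample_rows = [("/a", "3"), ("/b", "5"), ("/a", "4"), ("/c", "1")]
top_two = top_pages(sample_rows, 2)
print(top_two)
```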

Connect to Hive using Beeline

I am trying to connect to Hive installed on my machine through the Beeline client. When I run the 'beeline' command and connect to Hive, the client asks for a user name and password: !connect jdbc:hive2://localhost:10000/default I have no idea what is

Is FairScheduler applicable to a single-node cluster?

I am just using my laptop with 6 GB of RAM. I just want to know whether the 'Fair Scheduler' can work on a single-node cluster. Yes: there can be multiple mappers and reducers on one node, so the Fair Scheduler will still prioritise.

Hadoop Buffering vs Streaming

Could someone please explain to me what is the difference between Hadoop Streaming vs Buffering? Here is the context I have read in Hive : In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas

Call sqoop in a pig script

I want to call the sqoop export command from my Pig script. Basically, once I have stored my result, I want to call sqoop export immediately. Is it possible? If yes, please let me know how to call it. Thanks, Selvam. No, you can't do Sqoop export from Apache P

Hadoop / Hive upgrade performance

We are testing Hive and Hadoop for digging into our data. A while back I installed Hadoop 1.2.1 and Hive 0.11 (the stable version at the time). The test server has 4 cores and 16 GB of RAM. Now I want to know if switching to Hive 0.12 and Hadoop 2.2 is worth the

Hadoop sshfence (Permission denied)

I am trying Hadoop 2 High Availability for HDFS. I set up passwordless ssh connection among NameNodes under user hafence. That I verified - and it works. However I am getting following (Permission Denied) when using this sshfence setup. 2014-01-20 12

Architecture to process data from very large database tables

I have a problem to solve and was wondering if I am right to use something like Hadoop for this problem to distribute it across multiple nodes or use something else.. The Problem: I have a very large database table with potentially a huge amount of r

Does the libhdfs C/C++ API support reading and writing compressed files?

I found a discussion from around 2010 saying that libhdfs does not support reading or writing gzip files. I downloaded the newest hadoop-2.0.4 and read hdfs.h. It has no compression arguments either. Now I am wondering if it supports reading compressed files now? If it

Sorting a huge text file using hadoop

Is it possible to sort a huge text file lexicographically using a MapReduce job which has only map tasks and zero reduce tasks? The records of the text file are separated by a newline character and the size of the file is around 1 terabyte. It will b

HBase "between" filters

I'm trying to retrieve rows within a range, using a FilterList, but I'm not successful. Below is my code snippet. I want to retrieve data between 1000 and 2000. HTable table = new HTable(conf, "TRAN_DATA"); List<Filter> filters = new ArrayLis
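
A "between" condition is two comparisons that must both pass (lower bound and upper bound). The plain-Python sketch below illustrates that logic, along with one pitfall worth checking: HBase compares raw bytes lexicographically, so numbers stored as strings do not order numerically. Whether this is the cause of the problem in the question is an assumption. The helper name and sample values are hypothetical.

```python
def between(values, low, high):
    """Keep values that pass both bounds, like two filters combined with AND."""
    return [v for v in values if low <= v and v <= high]

# Numeric comparison behaves as expected.
numeric_hits = between([150, 1000, 1500, 2000, 2500], 1000, 2000)

# Byte-string comparison is lexicographic: b"150" sorts between
# b"1000" and b"2000" even though 150 is numerically out of range.
byte_hits = between([b"150", b"1000", b"1500", b"2500"], b"1000", b"2000")

print(numeric_hits)
print(byte_hits)
```

Storing values in a fixed-width encoding (for example zero-padded strings) makes byte order match numeric order.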