We have a website that generates megabytes to terabytes of data that needs to be mined. What technologies should we use to process terabytes of data in real time? Hadoop and Cassandra are good for batch processing, but not for real time. By real time, we mean processing the data as it arrives and showing reports on it. Any ideas or suggestions?
Have you looked into the Storm project? It's used by Twitter. It's like real-time Hadoop.
We use it for one of our stream processing projects. It's awesome: the documentation, development experience, deployment, and scalability are all excellent. We recently ran a load of 20K messages/sec with processing (storing in Cassandra, modifying and broadcasting, calculating a mean), and it worked reliably, like magic. Definitely worth giving a shot. The mailing list is very friendly, though I rarely had to ask a question there.
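To give a rough idea of what the "calculating mean" step looks like when data is handled one message at a time, here is a minimal sketch in plain Python. It does not use Storm's API, and the names `RunningMean`, `update`, and `process_stream` are hypothetical, made up for illustration. It assumes each message carries a single numeric value.

```python
class RunningMean:
    """Incrementally tracks the mean of a stream without storing past values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new value into the mean and return the updated mean."""
        self.count += 1
        # Incremental update: shift the mean toward the new value by 1/count.
        self.mean += (value - self.mean) / self.count
        return self.mean


def process_stream(messages):
    """Yield the up-to-date mean after each incoming message."""
    tracker = RunningMean()
    for value in messages:
        yield tracker.update(value)


if __name__ == "__main__":
    # Hypothetical sample values standing in for a live message feed.
    for current_mean in process_stream([10, 20, 30, 40]):
        print(current_mean)
```

The point is that the state kept per stream is constant in size (a count and a mean), so a report can be refreshed after every message instead of waiting for a batch job to finish. In a stream processing framework, logic like this would typically sit inside one processing step that receives messages as they arrive.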