I recently created a python script that performed some natural language processing tasks and worked quite well in solving my problem. But it took 9 hours. I first investigated using hadoop to break the problem down into steps and hopefully take advantage of the scalable parallel processing I'd get by using Amazon Web Services.
But a friend of mine pointed out the fact that Hadoop is really for large amounts of data store on disk, for which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low 100s of Mbs) on which I perform many complex operations, taking up a lot of memory during the process, and taking many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
Parallel Python is one option for distributing things over multiple machines in a cluster.