Hadoop – Does it solve all of your data problems?



Hadoop adoption is growing very fast

ThoughtWorks recently published a fascinating post, “To Hadoop or not to Hadoop?“. Their very first point matches something we’ve observed at many (but not all) customer sites:

Mention “Big Data” or “Analytics” and pat comes the reply: Hadoop!

So let’s examine a bit deeper:

Hadoop, however, was purpose-built for a clear set of problems; for some it is, at best, a poor fit and others, even worse, a mistake.

Hold on – so Hadoop isn’t the solution to all so-called “big” data problems? The Q&A of the post continues, analyzing several key questions that organizations should be asking themselves if and when considering Hadoop.

  • Do I have several terabytes of data or more?
  • Do I have a steady, huge influx of data?
  • How much of my data am I going to operate on?

We’ve learned a lot from our customers in the past 8 months. Some of our key learnings in those areas:

  • Very few customers have terabytes of data (or terabytes of relevant data to look at). Just a fact.
  • Except for those capturing clickstream or social media data, most don’t have incoming data streams (or real-time requirements – another story; see the comments in the ThoughtWorks post about Hadoop for real-time) that exceed the capabilities of existing queue systems.
  • This is the most crucial question. Many of our customers actually choose FlockData because they’re not sure what data is relevant, and FlockData can surface useful information from data quickly. The truth is that most companies start their data projects with 1-4 smaller data sets, often sub-sets.

And then the key questions – how do you want to examine the data?

ThoughtWorks lays out a few key questions and requirements for Hadoop examination:

Analysts sorely miss SQL. Hadoop doesn’t function well for random access to its datasets (even with Hive, which basically makes MapReduce jobs of your query).

  • Is the underlying structure of my data as vital as the data itself?

Some tasks/jobs/algorithms simply do not yield to the programming model of MapReduce.
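To make that constraint concrete, here is a minimal Python sketch of the MapReduce programming model itself, using word count (the canonical example). The function names are illustrative, not Hadoop APIs. Problems that decompose into independent per-record map steps and per-key reduce steps fit this shape; iterative algorithms or anything needing shared state across records generally does not.

```python
# Illustrative sketch of the map -> shuffle -> reduce shape that
# Hadoop's programming model enforces. Names are ours, not Hadoop's.
from collections import defaultdict

def map_phase(documents):
    # Emit (key, value) pairs independently for each input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key (Hadoop performs this between map and reduce).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the values for each key into a single result.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["Hadoop is a framework", "Hadoop is not a database"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Word counting collapses neatly into this pipeline; a graph traversal or an iterative optimization, where each step depends on the previous one’s global state, has to be contorted into repeated MapReduce passes, which is exactly the poor fit ThoughtWorks describes.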

Herein lies the rub – can you live with these constraints? As the common saying goes in tech, what’s your use case? Do your needs map to what Hadoop is good at?

For most companies we’ve talked to, the answer is, at best, that some of their needs align well with Hadoop. Most don’t.