Wait, there's more! EMC didn't want to be without its own answer to this, so it has Greenplum/HAWQ from its new Pivotal division. Those are most decidedly not open source. Not to be outdone, Hortonworks and others are backing Tez, which claims it will offer a thousand-fold improvement.
You should probably know at least what Hive is and some basics of how to use it. That's somewhat transferable to knowledge of Impala if you end up working for someone that uses Cloudera's distribution. I'd have my eye on Tez and not really bother learning the others unless you work somewhere that decides the vendor lock-in is really worth ditching the existing expensive proprietary data warehouse infrastructure for a new expensive proprietary data warehouse infrastructure in Pivotal.
Ecosystem (this you should know)
You're not going to be able to claim you "know Hadoop" if you're ignorant of the ecosystem around it. So familiarize yourself with the following:
- HBase/Cassandra: Both are column family databases that tie into Hadoop, though in different ways: HBase runs on top of HDFS, while Cassandra has its own storage layer and offers Hadoop integration. They are in many ways very similar, but their differences are substantial. If running MapReduce jobs against your column family store is a big deal to you, then you should probably go with HBase. If you're doing time-series data -- more of an operational store -- and you need nice, pretty management tools, dashboards, and so on, then you may find Cassandra is your best buddy even if she's a bit cursed.
DataStax and Rackspace seem to be the big backers of Cassandra. HBase seems to be supported by the major Hadoop vendors and, oddly, Microsoft. I'd learn either one, as the key knobs are the same, then transfer the skills. Cassandra will probably have a gentler learning curve if you're starting from scratch, but HBase will be more familiar if you're already heavily into Hadoop. To complicate matters, if you think you'll need more database support for contextual security, you may also want to look at Accumulo.
- Spark/Shark and Storm: HDFS is high latency. What if you want to do grand MapReduce distributed computing, but latency doesn't work for you? Well, shell out the cash for more memory and go with Spark, which integrates on top of Hadoop and HDFS but runs jobs in memory. Spark can also "stream," where you basically have long-running jobs that continuously return results as new data comes in. Shark is Hive on Spark. Storm is very similar as far as capabilities go. Databricks and Cloudera seem to be backing Spark, and Hortonworks seems to be backing Storm but has Spark in a preview. I'd settle for knowing what these things are for now unless you need to stream or work in one of the industries where low latency is a must and not a "would be nice."
- Oozie: This is basically workflow-based job control for Hadoop. Mainframers are nodding their heads and can read on. Basically any given system has a lot of repetitive tasks, and probably most of your MapReduce jobs are not really "ad hoc" even if you conceive of them that way. That is, businesses cycle, computing cycles, and thus your jobs are repetitive and cyclic. However, jobs often depend on other jobs and events, meaning you're not going to run end-of-month reporting until the end of the month, and based on the results of that job you may need to run another job and so on. Once your organization has rolled out Hadoop and starts running anything regularly, I'd invest in learning Oozie.
- Ambari: This is more of a tool for administrators, but as a developer you should know a bit about it. While rolling out Hadoop nodes using the command line is incredibly relaxing and rewarding, Ambari can automate these tasks. Moreover, while logging into each node and looking at its stats is fun for the whole family, Ambari ties it up in a nice dashboard with pretty graphs. Ambari doesn't really support Windows, and I'd say Hadoop overall is still alpha-ish on Windows (which has been great for the consulting business, BTW). Hortonworks includes Ambari in its distribution, while Cloudera rolled its own with Cloudera Manager.
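To make the Oozie item above a bit more concrete, here is a minimal sketch of the core idea of dependency-driven job ordering. It is plain Python, not Oozie's own workflow format or API, and the job names and the `run_order` helper are hypothetical, chosen only to mirror the end-of-month example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job names; each job maps to the set of jobs it depends on.
JOB_DEPENDENCIES = {
    "load_sales": set(),
    "end_of_month_report": {"load_sales"},
    "regional_rollup": {"end_of_month_report"},
    "email_summary": {"end_of_month_report", "regional_rollup"},
}


def run_order(dependencies):
    """Return a job order in which every job follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in run_order(JOB_DEPENDENCIES):
        print("would run:", job)
```

A real workflow engine layers scheduling, retries, and branching on job results on top of this ordering, which is the part you would actually be learning Oozie for.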