HBase Hackathon Wrap-up

February 4, 2009

HBase contributors came together last weekend for the first-ever HBase Hackathon here at Streamy HQ in Manhattan Beach, California.  In attendance were most of the HBase committers and guys from Sun and StumbleUpon… nearly 20 developers in total.  There are photos posted on the Meetup Page (for members only).  If anyone else who attended has pictures, please post links in the comments!

We spent a great deal of time discussing the new features set for the 0.20 release of HBase.  You can follow all the issues slated for 0.20 here.  It’s been a few days and so much has already come out of the weekend so I thought I’d post a quick follow-up to share some of the cool stuff being worked on now.

Cascading Support for HBase
Chris Wensel, of Concurrent Inc., has successfully implemented the first version of HBase adapters for Cascading.  Streamy devs are really looking forward to refactoring some of their MapReduce jobs for Cascading!  We will report back soon.

HBase New File Format
We have hit a performance wall in HBase with the Hadoop MapFile format.  It was never intended for a random-access read pattern, which is really the primary purpose of HBase.  Building on the hard work by guys over at Yahoo on the TFile binary file format, work is well under way on a new HBase-specific file format, currently being called HFile.  Michael Stack of Powerset and Ryan Rawson of StumbleUpon are leading the effort.

The emphasis is on speed and efficiency.  By switching to a block-based segmenting/indexing of the file, we can have predictable memory usage and an ideal abstraction for caching.  Once in memory, we can use something like Java’s NIO ByteBuffers to allow high numbers of concurrent scanners with minimal memory copying.  Remember, even random-access reads require scanning.  The new format also supports meta blocks for additional indexes, bloom filters, meta data and anything else we want to add in the future.
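To make the block-index idea concrete, here is a minimal sketch of how a lookup might find the one block worth scanning.  This is not the HFile code: the class name `BlockIndexSketch`, the use of `String` keys instead of byte arrays, and the offsets are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a block index: first key of each block mapped to its file offset. */
final class BlockIndexSketch {
    // Sorted map so we can find the last block whose first key is <= the lookup key.
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    /** Registers a block by the first key it holds and where it starts in the file. */
    void addBlock(String firstKey, long offset) {
        firstKeyToOffset.put(firstKey, offset);
    }

    /**
     * Returns the offset of the only block that could contain the key,
     * or -1 if the key sorts before the first block.
     */
    long blockOffsetFor(String key) {
        Map.Entry<String, Long> entry = firstKeyToOffset.floorEntry(key);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch();
        index.addBlock("a", 0L);        // offsets are made-up 64 KB block boundaries
        index.addBlock("m", 65536L);
        index.addBlock("t", 131072L);
        System.out.println("row 'c' lives in the block at offset " + index.blockOffsetFor("c"));
    }
}
```

Only the index has to stay in memory; once the block is located it can be read (or served from a cache) and scanned for the exact key, which is why even a random-access read ends in a short scan.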

Cell Caching
Streamy’s own Erik Holstad is wrapping up testing, benchmarking and optimizing the new Cell Cache.  We’re seeing a 5-10X improvement in random-access speed when serving out of the cache.  A big part of this feature is implementing a memory-aware LRU in Java.  Since Java will not tell you the size of an Object in memory, we have had to hack our way around through profiling and some tools we’ve built to determine sizes.  More on this in a later post.
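For readers wondering what "memory-aware" means here, the following is a rough sketch of an LRU bounded by estimated bytes rather than by entry count.  It is not Erik's Cell Cache: the class name `SizeBoundedLruCache` and especially the `estimateSize` formula (a fixed 64-byte overhead guess plus key and value lengths) are assumptions for illustration, standing in for the profiled sizes mentioned above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an LRU cache that evicts by estimated heap bytes, not entry count. */
final class SizeBoundedLruCache {
    private final long maxBytes;
    private long usedBytes = 0;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> map = new LinkedHashMap<>(16, 0.75f, true);

    SizeBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Assumed estimate: fixed per-entry overhead + 2 bytes per key char + value length. */
    static long estimateSize(String key, byte[] value) {
        return 64L + 2L * key.length() + value.length;
    }

    synchronized void put(String key, byte[] value) {
        byte[] old = map.remove(key);
        if (old != null) {
            usedBytes -= estimateSize(key, old);
        }
        map.put(key, value);
        usedBytes += estimateSize(key, value);
        evictUntilWithinBudget();
    }

    /** A hit also marks the entry as most recently used. */
    synchronized byte[] get(String key) {
        return map.get(key);
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    // Drops least recently used entries until the estimate fits the budget.
    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            usedBytes -= estimateSize(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    public static void main(String[] args) {
        SizeBoundedLruCache cache = new SizeBoundedLruCache(400);
        cache.put("a", new byte[100]);
        cache.put("b", new byte[100]);
        cache.get("a");                 // touch "a" so "b" becomes the eviction candidate
        cache.put("c", new byte[100]);  // pushes the estimate over budget
        System.out.println("b still cached? " + (cache.get("b") != null));
    }
}
```

The hard part in practice is the size estimate, not the eviction loop; a cache like this is only as accurate as the numbers fed into `estimateSize`.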

Zookeeper Integration
Jean-Daniel Cryans and Nitay Joffe have made leaps and bounds with Zookeeper integration into HBase.  The integration was initially designed to remove the single point of failure, but discussions at the Hackathon opened the door to future improvements such as configuration management and even to eventually distributing master functionality and eliminating the HMaster altogether.  They have already committed 5 issues to 0.20 trunk.  You can follow their progress here.

Datanode Network I/O Improvements
Andrew Purtell at Trend Micro is working on a Hadoop issue that creates a big headache for the users of HBase.  HBase keeps a large number of files open at a time, and since the Datanode currently uses a thread-per-connection model, we end up with thousands of idle threads and have to keep increasing the total number of receivers.  I’m not sure where this currently stands.

My Random Contributions
In addition to voicing my opinion and contributing to design and decision making, I’m currently working on a number of small issues:  a binary key range splitting algorithm, the ability to run more than one mapper per region or to specify start and stop rows for MR jobs sourcing from HBase, and a number of benchmarking tools to evaluate all the new stuff.
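As a rough idea of what splitting a binary key range can look like, here is one possible approach: treat both keys as unsigned big-endian numbers, pad them to the same length, and take the midpoint.  This is a sketch under those assumptions and not necessarily the algorithm that will land in HBase; `KeyRangeSplitter` and its methods are hypothetical names.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical sketch: find a split key roughly halfway between two binary row keys. */
final class KeyRangeSplitter {

    /** Returns a key that sorts strictly after start and before end. */
    static byte[] midpoint(byte[] start, byte[] end) {
        // Right-pad both keys with zeros to a common length, plus one extra byte
        // so that even adjacent keys (e.g. 0x01 and 0x02) have room for a midpoint.
        int len = Math.max(start.length, end.length) + 1;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len)); // signum 1 = unsigned
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        if (lo.compareTo(hi) >= 0) {
            // Also rejects keys that differ only by trailing zero bytes.
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger mid = lo.add(hi).shiftRight(1);
        return toFixedLength(mid, len);
    }

    /** Converts back to exactly len bytes, dropping any sign byte and left-padding with zeros. */
    static byte[] toFixedLength(BigInteger value, int len) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[] mid = midpoint(new byte[] {0x01}, new byte[] {0x02});
        System.out.println(Arrays.toString(mid));
    }
}
```

Applying `midpoint` repeatedly to the resulting sub-ranges would give 4, 8, and so on roughly even slices of a region, which is the kind of thing needed to run more than one mapper per region.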

All in all, it was a terrific weekend.  The weather was absolutely perfect and it was as friendly and smart a group as I could have imagined.  Thanks to everyone who came, especially those who made the trip down from Norcal and JD who came from Canada to defrost for a bit in the Cali sun.  Summer is coming soon, stay tuned for the next beach city hackathon 🙂

To help kick off development on some of the major improvements coming in the 0.20 release of HBase, Streamy will be hosting a hackathon at our office in Manhattan Beach on January 30th.

More information and to RSVP

Also of interest this month is the San Francisco based HBase User Group meetup on January 14th and the Los Angeles Hadoop Meetup on January 13th.

There are references around the web regarding changing the replication factor on a running Hadoop system. For example, if you don’t have even distribution of blocks across your Datanodes, you can increase replication temporarily and then bring it back down.

To set replication of an individual file to 4:

./bin/hadoop dfs -setrep -w 4 /path/to/file

You can also do this recursively. To change replication of the entire HDFS to 1:

./bin/hadoop dfs -setrep -R -w 1 /

Interesting conversation between Chris Wensel, founder of Concurrent Inc and author of Cascading, and Sohrab Modi, VP Chief Technology Office at Sun Microsystems.

Sun has definitely been paying attention to Hadoop (and increasingly HBase), so it will be interesting to see if they can make a case for using (typically high-end) Sun hardware to run this new distributed, commodity hardware driven software model. Sohrab mentions increasing the Disk-to-Core ratio on Hadoop nodes above the 1-to-1 ratio typical in many clusters today.

This thinking seems at odds with most of the Hadoop community, who are often CPU-bound or who feel that more nodes with fewer disks are better than fewer nodes with more disks.

They don’t speak about HBase, but from that perspective it might make sense to squeeze more disks into each node and run fewer total nodes, especially with the new findings from George Porter about tapping the local FS when possible. However, it still seems to me that on the surface Sun hardware does not necessarily fit the new distributed, commodity hardware model.

Part one of that conversation (originally posted here):

Part two:

Another Hadoop-related video of interest by Stefan Groschupf of Scale Unlimited visualizing the evolution of the Hadoop codebase:

Hadoop and HBase Presentation

December 12, 2008

Today I gave a presentation on Hadoop, MapReduce, and HBase to the Los Angeles CTO Forum.  In addition to introducing the technologies and basic information about their implementations, there was a focus on how they compare to a traditional RDBMS.

You can find the presentation here