Tuesday, September 21, 2010

JavaOne 2010: Choosing the Right NoSQL Database

Tobias Ivarsson presented "Choosing the Right NoSQL Database" at JavaOne 2010. He works at Neo Technology, which provides the Neo4j graph database. He stated that his approach to this presentation would be to look at various hypothetical problems and determine which kind of database works best for each particular set of storage requirements. He stated that his examples would focus on implementations of Graph Databases (Neo4j specifically), Document Databases (MongoDB specifically), and Column Family Databases (Apache Cassandra specifically).

Neo4j is a JVM-based graph database, storing nodes and the relationships between nodes. There are numerous graph databases out there. Other graph databases listed in the presentation: Sones GraphDB, InfiniteGraph, AllegroGraph, Hypergraph, InfoGrid, DEX, VertexDB, and FlockDB.

Document Databases store their data as structured documents and collections of documents.  They tend to store JSON-based documents in their databases.  Examples of document databases include MongoDB, Riak, Apache CouchDB, and SimpleDB (Amazon internal).  The speaker commented that Lucene is structured like a document database.

The ColumnFamily databases are inspired by Google's internally-used BigTable. Other examples include Cassandra, HBase (Hadoop's database), and Hypertable.  Cassandra is the implementation used in this presentation for demonstration purposes.

The first hypothetical example used was that of a blog system. The blog system would need to support an arbitrary number of posts and comments as well as the ability to query and filter blog posts by date and possibly by tag. The speaker led us through design decisions to store posts as documents and to store comments as nested documents within the post documents. A Document Database seemed obvious at this point, so he showed some code for creating a blog post using MongoDB. This code example demonstrated using MongoDB classes Mongo, DB, DBCollection, DBObject, and BasicDBObject. Ivarsson also showed the code necessary to retrieve blog posts from MongoDB. The MongoDB APIs seem straightforward to apply, but the obvious drawback is their lack of standardization: the code is very MongoDB-specific.
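To make the "comments nested inside the post document" idea concrete, here is a minimal sketch in plain Java. It is not the code from the slides and it does not use the MongoDB driver; the class `BlogDocumentSketch`, its method names, and the field names ("title", "tags", "comments") are all my own assumptions, with a `List` of `Map`s standing in for a document collection.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a document collection (NOT the MongoDB API).
class BlogDocumentSketch {

    // Build a post as a map-based document with an initially empty list of nested comments.
    static Map<String, Object> newPost(String title, String body, LocalDate date, List<String> tags) {
        Map<String, Object> post = new LinkedHashMap<>();
        post.put("title", title);
        post.put("body", body);
        post.put("date", date);
        post.put("tags", new ArrayList<String>(tags));
        post.put("comments", new ArrayList<Map<String, Object>>());
        return post;
    }

    // A comment is itself a small document stored inside its post.
    @SuppressWarnings("unchecked")
    static void addComment(Map<String, Object> post, String author, String text) {
        Map<String, Object> comment = new LinkedHashMap<>();
        comment.put("author", author);
        comment.put("text", text);
        ((List<Map<String, Object>>) post.get("comments")).add(comment);
    }

    // Filter posts by tag.
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> findByTag(List<Map<String, Object>> posts, String tag) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> post : posts) {
            if (((List<String>) post.get("tags")).contains(tag)) {
                result.add(post);
            }
        }
        return result;
    }

    // Filter posts dated on or after the given day.
    static List<Map<String, Object>> findSince(List<Map<String, Object>> posts, LocalDate from) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> post : posts) {
            if (!((LocalDate) post.get("date")).isBefore(from)) {
                result.add(post);
            }
        }
        return result;
    }
}
```

The point of the shape is that fetching one post brings its comments along with it, so no join is needed. A real document database would take over the storage and the querying that the two loops do here.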

The second hypothetical example was a Twitter clone. In this case, each post is very small, but needs to be visible to all followers. The load is high, and the write load in particular. The application should also retrieve all posts by a specific user ordered by date as well as all followers by date. Cassandra is designed for handling a heavy write load and is a good fit here. Using the Cassandra-provided Twissandra made this presentation's demonstration even easier.

Ivarsson called it "amazing" that Cassandra scales linearly for writes, but pointed out that as with any tactics for performance gain, there are trade-offs. In this case, developers take on more responsibility for maintaining data consistency.
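The consistency trade-off is easier to see in a small sketch. The following plain-Java model is my own illustration, not Cassandra's API and not necessarily how Twissandra lays out its data. The class `TimelineSketch` and the names `userline`, `timeline`, and `followers` are assumptions. Each "row" is keyed by user name and holds columns sorted by timestamp, and every post is deliberately written more than once so that reads are a single row lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of column-family style storage (NOT the Cassandra API).
class TimelineSketch {
    // Row key (user name) -> columns keyed by timestamp, newest first.
    private final Map<String, TreeMap<Long, String>> userline = new HashMap<>();
    private final Map<String, TreeMap<Long, String>> timeline = new HashMap<>();
    // Followed user -> the users who follow them.
    private final Map<String, List<String>> followers = new HashMap<>();

    void follow(String follower, String followed) {
        followers.computeIfAbsent(followed, k -> new ArrayList<>()).add(follower);
    }

    // One logical post becomes several physical writes: the author's own row
    // plus one copy per follower. Keeping these copies in step is the
    // application's job, which is the trade-off described above.
    void post(String author, long timestamp, String text) {
        row(userline, author).put(timestamp, text);
        for (String follower : followers.getOrDefault(author, Collections.emptyList())) {
            row(timeline, follower).put(timestamp, text);
        }
    }

    // All posts by one user, newest first: a single row read.
    List<String> postsBy(String user) {
        return read(userline, user);
    }

    // Everything visible to one user, newest first: also a single row read.
    List<String> timelineOf(String user) {
        return read(timeline, user);
    }

    private static TreeMap<Long, String> row(Map<String, TreeMap<Long, String>> family, String key) {
        return family.computeIfAbsent(key,
                k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()));
    }

    private static List<String> read(Map<String, TreeMap<Long, String>> family, String key) {
        List<String> result = new ArrayList<>();
        TreeMap<Long, String> columns = family.get(key);
        if (columns != null) {
            result.addAll(columns.values());
        }
        return result;
    }
}
```

Nothing in this model keeps the copies in step. If one of the follower writes were lost, the follower's timeline would simply disagree with the author's row until the application repaired it.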

The blog example covered large documents with less traffic, and the Twitter clone example covered smaller documents with high traffic. The third example in Ivarsson's quest for "world domination" is to build a social network like Facebook. For this example application, a graph database will be used.

Individual people are represented by Nodes in the graph database for the social networking application.  Groups are also represented by Nodes and friendship is represented by Relationships. A slide showed "a small social graph example" based on The Matrix movie characters (I insist that there was only one Matrix movie!) and their relationships with one another and with their ship.

The social networking example was implemented with Neo4j and its API looked similarly easy to use, but again is proprietary/non-standard.
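As a rough illustration of why a graph model suits this kind of problem, here is a plain-Java sketch of people as nodes, friendships as relationships, and a breadth-first traversal that answers "who is within N hops of this person?" It does not use the Neo4j API, and the class `SocialGraphSketch`, its methods, and the names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory social graph (NOT the Neo4j API).
class SocialGraphSketch {
    // Node (person) -> the nodes it is connected to by a friendship relationship.
    private final Map<String, Set<String>> friends = new HashMap<>();

    // Friendship is stored in both directions so it can be traversed from either side.
    void addFriendship(String a, String b) {
        friends.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    // Breadth-first traversal: everyone reachable from start in at most maxDepth hops,
    // excluding start itself.
    Set<String> withinDepth(String start, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = new ArrayList<>();
        frontier.add(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String person : frontier) {
                for (String friend : friends.getOrDefault(person, Collections.emptySet())) {
                    if (seen.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);
        return seen;
    }
}
```

For example, after `addFriendship("ann", "ben")` and `addFriendship("ben", "cat")`, calling `withinDepth("ann", 2)` yields both ben and cat. A graph database makes this kind of traversal a first-class operation instead of something assembled from repeated joins or lookups.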


Ivarsson summarized the lessons gleaned from his presentation of the different example applications implemented with different types of NoSQL databases. He stated that Document Databases are often best when dealing with collections of similar entities (but the entities do not need to be perfectly alike). He stated that ColumnFamily Databases are best when scalability (particularly write scalability) is the main issue. The cost is that developers must write more complicated code to do some things explicitly. Graph Databases are often best when deep traversals are important, for complex domains, or in cases where "how entities are related" is very important.


One of the things that impressed me about Ivarsson's presentation was his willingness to cover multiple types of NoSQL databases and even talk briefly about Graph Databases other than Neo4j. I was further impressed when he showed a slide on when NoSQL may not be the most appropriate choice.


Ivarsson stated that an RDBMS is better at some things, particularly reporting. There is a large ecosystem of reporting tools built around RDBMSs. Working systems built on an RDBMS should also be left alone. However, he added this important bullet: "But please don't use a Relational database for persisting objects." He also asserted verbally: "Object-relational mappers are the worst abomination I have seen in years."

I liked Ivarsson's use of the term "Polyglot Persistence." He recommended what should be the obvious: use the right tool for each job. He then asked, "Why limit self to one database?" He suggested, as examples, possible combinations such as an RDBMS for structured data with a Graph database for storing relationships between entries, or a Graph database for the domain model with a Document Database for large data chunks. I do think it's best to be able to use the correct persistence approach for the job or even use more than one together when the costs of multiple approaches are justified by the benefits.

I definitely got what I wanted out of this presentation: an overview of the NoSQL landscape with some ideas on what's available and how to select the best tool for the job.
