About a week ago, following nosql east in Atlanta, Jonathan Ellis from the Cassandra project published a fantastic overview of the current NOSQL ecosystem. He analyzes 10 popular NOSQL databases along three axes: horizontal scalability, data model and internal persistence design. It's a great read.
The third axis (internal persistence design) may not be terribly relevant for users of NOSQL systems [1], but the positions on the first two axes reveal some important underlying assumptions. In particular, they reveal a focus: is this NOSQL project oriented around scaling to size or scaling to complexity? [2]
The four main NOSQL data models
Now, there are four main categories of NOSQL databases today. Before we get into how they differ in focus, let me just quickly run through them and outline a few key characteristics:
Key-Value Stores
- Lineage: Amazon's Dynamo paper and Distributed HashTables.
- Data model: A global collection of key-value pairs.
- Example: Voldemort, Dynomite, Tokyo Cabinet
BigTable Clones (aka "ColumnFamily")
- Lineage: Google's BigTable paper.
- Data model: Column family, i.e. a tabular model where each row at least in theory can have an individual configuration of columns.
- Example: HBase, Hypertable, Cassandra [3]
Document Databases
- Lineage: Inspired by Lotus Notes.
- Data model: Collections of documents, where each document is itself a collection of key-value pairs.
- Example: CouchDB, MongoDB, Riak
Graph Databases
- Lineage: Draws from Euler and graph theory.
- Data model: Nodes & relationships, both of which can hold key-value pairs.
- Example: AllegroGraph, InfoGrid, Neo4j
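To make the four shapes a bit more concrete, here is a minimal sketch that stores the same tiny set of facts (two users, one of whom follows the other) in each model. Plain Python dicts and lists stand in for the stores. None of this is the API of any product listed above, and every name and value in it (`kv_store`, `column_family`, the `since` property and so on) is made up for illustration.

```python
# The same facts, "alice and bob are users; alice follows bob",
# expressed in four data-model shapes using plain Python structures.

# 1. Key-value store: one global collection of key -> value pairs.
kv_store = {
    "user:alice:name": "Alice",
    "user:bob:name": "Bob",
    "user:alice:follows": "bob",
}

# 2. Column family: rows keyed by id; each row can have its own columns.
column_family = {
    "alice": {"name": "Alice", "follows:bob": "1"},
    "bob": {"name": "Bob"},  # this row simply has no "follows:" columns
}

# 3. Document: a collection of key-value documents.
documents = [
    {"_id": "alice", "name": "Alice", "follows": ["bob"]},
    {"_id": "bob", "name": "Bob", "follows": []},
]

# 4. Graph: nodes and relationships, both of which hold key-value pairs.
nodes = {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}}
relationships = [
    {"start": "alice", "end": "bob", "type": "FOLLOWS", "since": 2009},
]


def follows_kv(user):
    value = kv_store.get(f"user:{user}:follows")
    return [value] if value is not None else []


def follows_cf(user):
    prefix = "follows:"
    return [col[len(prefix):] for col in column_family[user] if col.startswith(prefix)]


def follows_doc(user):
    for doc in documents:
        if doc["_id"] == user:
            return list(doc["follows"])
    return []


def follows_graph(user):
    return [r["end"] for r in relationships
            if r["start"] == user and r["type"] == "FOLLOWS"]


for lookup in (follows_kv, follows_cf, follows_doc, follows_graph):
    print(lookup.__name__, lookup("alice"))
```

All four lookups give the same answer. What differs is how much of the structure the model itself knows about and how much is a naming convention that the application has to maintain.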
Scalability focus
How then do these data models scale to size and complexity? Check out this slide from my presentation at nosql east:
The exact positions in the picture above are obviously debatable but I think it serves to illustrate my point: the key-value stores and BigTable clones of the world handle size really well. This is because they have data models that can easily be partitioned horizontally, which is great for scaling out, for example, simple two-column data like a whole bunch of username/password pairs.
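As a rough illustration of why key-value data partitions so easily, here is a small sketch in which the shard is chosen from the key alone. The `ShardedKV` class and `shard_for` function are hypothetical names. The hash-modulo scheme is the simplest thing that works for a demo, not what any particular store does (Dynamo-style systems use consistent hashing instead).

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from the key alone, using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardedKV:
    """Toy key-value store in which each 'machine' is just a dict."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value: str) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedKV(num_shards=4)
for i in range(1000):
    store.put(f"user{i}", f"password{i}")

# Every read or write touches exactly one shard; no pair refers to another.
print([len(shard) for shard in store.shards])
print(store.get("user42"))
```

Because no pair refers to any other pair, the shards never have to talk to each other to answer a lookup.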
The drawback however is that by constraining themselves to simpler data models, they've pushed complexity up the stack. So if you have data with a non-trivial structure, then you have to compensate for a simple data model by adding more complex functionality in the upper layers. [4]
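Here is a small sketch of what "pushing complexity up the stack" can look like. The store below only knows `get` on opaque string values, so a friends-of-friends question has to be answered entirely in application code. The key layout (`friends:<user>`), the JSON encoding and the sample data are all assumptions made for the example.

```python
import json

# The store only sees opaque strings; the structure lives in our heads.
kv = {
    "friends:alice": json.dumps(["bob", "carol"]),
    "friends:bob": json.dumps(["alice", "dave"]),
    "friends:carol": json.dumps(["dave", "erin"]),
    "friends:dave": json.dumps([]),
    "friends:erin": json.dumps([]),
}


def friends_of(user):
    """Decode one value; the application owns the key naming and the format."""
    raw = kv.get(f"friends:{user}")
    return json.loads(raw) if raw is not None else []


def friends_of_friends(user):
    """Two-hop traversal done by hand: one extra lookup per direct friend."""
    direct = set(friends_of(user))
    second_hop = set()
    for friend in direct:
        second_hop.update(friends_of(friend))
    # Drop the user and their direct friends from the result.
    return sorted(second_hop - direct - {user})


print(friends_of_friends("alice"))  # ['dave', 'erin']
```

The traversal, the de-duplication and the encoding are all the application's problem here, and each extra hop means another round of lookups.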
Document databases and graph databases, on the other hand, have opted for richer data models. This means that they have more powerful abstractions that make it easy to model both simple and complex domains. But these richer data models introduce more coupling of data and therefore it's more challenging to get them to scale to size.
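One way to see the coupling is to count how many relationships end up spanning machines once the nodes are spread out. This is a toy sketch with a hand-picked shard assignment. The names (`shard_of`, `cross_shard_relationships`) and the data are hypothetical, and real graph partitioning is a much harder problem than this suggests.

```python
def cross_shard_relationships(relationships, shard_of):
    """Return the relationships whose two end nodes live on different shards."""
    return [(a, b) for a, b in relationships if shard_of[a] != shard_of[b]]


# Four nodes spread over two shards.
shard_of = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}

relationships = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("dave", "alice"),
]

crossing = cross_shard_relationships(relationships, shard_of)
print(crossing)  # [('bob', 'carol'), ('dave', 'alice')]
```

Any traversal that follows one of the crossing relationships has to hop between machines, which independent key-value pairs never need to do.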
Size matters (but you're not Google so complexity matters more)
Now, size gets a lot of attention because scaling out to hundreds of machines is very sexy. But here's the kicker: the majority of the use cases out there don't need to store hundreds of billions of objects and scale out to truckloads of machines.
At the end of the day, there are only so many projects of Amazon and Google scale out there. A lot of projects fit within a couple of BILLIONS of objects. For most people, it's a lot more important to have a rich data model that lends itself to easily represent their domain.
Ben Scofield of Viget Labs expresses it eloquently in NoSQL Misconceptions:
"... there's a lot more to NoSQL than just performance and scaling. Most importantly (for me, at least) is that NoSQL DBs often provide better substrates for modeling business domains. I've spent more than two years struggling to map just part of the comic book business onto MySQL, for instance, where something like a graph database would be a vastly better fit."
Choose your hammer wisely
It's important to note that these data models are all isomorphic, which is a fancy way of saying that you can express any dataset in any one of them. For example, you can decompose any data into a collection of key-value pairs.
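As a sketch of that decomposition, the snippet below flattens a nested record into path-style key-value pairs and rebuilds it again. It assumes string keys that contain no "/" and no empty nested dicts (those would vanish in the flat form). The helper names and the sample record are made up for illustration.

```python
def flatten(doc, prefix=""):
    """Decompose nested dicts into a flat {"a/b/c": value} mapping."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dicts from the path-style keys."""
    doc = {}
    for path, value in flat.items():
        *parents, leaf = path.split("/")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return doc


person = {
    "name": "Alice",
    "address": {"city": "Atlanta", "country": "US"},
    "employer": {"name": "Example Corp"},
}

flat = flatten(person)
print(flat)
print(unflatten(flat) == person)  # True
```

Nothing is lost in the round trip, but the nesting now exists only as a key-naming convention that every reader and writer has to agree on.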
But that's a bit like claiming you can write any program in any Turing complete programming language: sure, it's true in theory but just because you can doesn't mean that you should. In practice there's a bunch of programming languages that are a poor fit for many use cases. And the same is true of data models.
I think it's clear that we're rapidly moving beyond the era of the One Size Fits All database. Whereas in the past you could always trust that any decent-sized app had a relational database as backend, it's now increasingly about matching your dataset to whatever data model fits best. NOSQL is not No To SQL. NOSQL means Not Only SQL, as in: in the future, our backends will consist of Not Only SQL databases but also key-value stores, graph databases and more.
NOSQL is about choice and picking the right tool for the job. When you look at adding a NOSQL database to your current project, consider your requirements both for scaling to size and for scaling to complexity.
[1] Few developers care whether their RDBMS implementation uses hash joins or nested loop joins.
[2] Scaling to size and scaling to complexity was introduced (at least to me) in O'Reilly's Beautiful Data by Toby Segaran and Jeff Hammerbacher. The graph of the various NOSQL data models was first visualized by my friend and colleague Peter Neubauer.
[3] Cassandra is actually the first of the "second-generation" NOSQL databases and it combines the decentralized scale out architecture of the Dynamo clones with the data model of BigTable.
[4] As an analogy, imagine writing any piece of software and the only construct you had for storing state was a single global hashtable. No linked lists, no arrays, no structs, no objects. Imagine how much code you'd have to add just to work around that hashtable! Now, a key-value store is basically a distributed hashtable. This is why they have problems with scaling to complexity.
InfoGrid actually, not InfoGraph ;-) The link to http://infogrid.org/ is correct, though.
Posted by: Johannes Ernst | Monday, November 16, 2009 at 06:41
Very nicely put! Using the right tool for the job and the practice of right-engineering (http://faassen.n--tree.net/blog/view/weblog/2007/11/01/0) is extremely valuable when developing software that is to be put to real use.
/Herbjörn
Posted by: twitter.com/herbjorn_w | Monday, November 16, 2009 at 12:21
@Johannes: Thanks! Updated the name.
@Herbjörn: I concur. But the link seems dead...
-EE
Posted by: twitter.com/emileifrem | Tuesday, November 17, 2009 at 22:31
Sorry, the end parenthesis was not supposed to be in the link: http://faassen.n--tree.net/blog/view/weblog/2007/11/01/0
/Herbjörn
Posted by: twitter.com/herbjorn_w | Wednesday, November 18, 2009 at 08:02
@Herbjörn: Great link.
-EE
Posted by: twitter.com/emileifrem | Wednesday, November 18, 2009 at 11:45
... and a possible continuation of your excellent post: http://nosql.mypopescu.com/post/287581423/the-new-dimension-of-nosql-scalability-complexity
Unfortunately I couldn't find a way to link to the follow up distributed conversation we've had on Twitter :-).
Posted by: Alex Popescu | Wednesday, January 06, 2010 at 15:29
Thanks for the article. Regarding "using the right tool for the job", maybe we are missing some use cases, information, and pros and cons of document DBs "vs" graph DBs.
Graph DBs deserve more popularity, while document DBs have been especially successful (MongoDB is a good example), as if they were a nearly default NoSQL choice, much as SQL was decades ago.
Documentation, tutorials, small talks and books are part of the key to improving the level of information and popularity. I mention it because I don't see many of these (when they exist at all) for the graph DBs we could be interested in, and personally, I would be pleased to see more of them on Neo4j. This would also help us contribute in turn.
Posted by: Alexandre Emeriau | Wednesday, August 04, 2010 at 15:15