At Loomia we recently decided that for improved scalability, we need to move some of our data out of MySQL and into another form of persistent storage better suited for what is mostly key-value data.
To that end, we started researching various ‘no sql’ options. What follows is a discussion of how we went about our research and how we reached our decision. I found the articles listed in the references below to be valuable in our research; my hope is that this write-up may be of use to others.
Our approach was to draft a longish list of candidates for a first round of analysis and winnow that down to a short list of candidates for actual deployment and testing.
Although many of the solutions we looked at are more than key-value stores, for simplicity's sake I refer to them all that way (or sometimes as ‘kv stores’). And apologies in advance to anyone working on these various projects if I've misrepresented anything about them -- please let me know and I will make the appropriate corrections.
RDBMS alternatives are an active topic of discussion in the technology world at the moment, and listed below are a number of helpful articles written on the subject. I am very grateful to the authors for their insights:
- Leonard Lin presents a nice general comparison of key-value stores (the comments, in particular, are surprisingly intelligent and useful): http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores
- Another nice general comparison by Richard Jones: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
- A video of Bob Ippolito's talk 'Drop Acid and Think About Data' - not to be missed: http://blip.tv/file/1949416/
- Discussion of the Tokyo suite vs. Mongo at Blue Collar Informatics: http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/
- Jonathan Ellis (who works on Cassandra) does a nice round-up here: http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/#
We defined the following requirements for our candidates:
- The kv store needs to be able to handle arbitrarily large amounts of data, which means that we need to be able to scale writes across multiple machines
- The read performance needs to be excellent (since most of the queries will be cache misses, a caching layer won't help much)
- The write performance needs to be good
- The kv store needs to be readable and writable by many clients at once
- The kv store needs to be highly available
- Python bindings for the kv store should exist
- There should be some facility for back-up and recovery of the data
- The kv store needs to be accessible over the network (e.g. Berkeley DB is not network accessible)
- Ideally, the kv store would be fairly easy to deploy
- The kv store should have a substantial user community and should have been deployed in existing production applications
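To make the read and write performance requirements concrete, here is a minimal sketch of the kind of timing harness one might use to compare candidates. The `benchmark` helper and the in-memory `DictStore` stand-in are hypothetical names invented for this example; they are not part of any of the stores discussed below.

```python
import time

class DictStore:
    """In-memory stand-in exposing a get/set interface; swap in a
    real client (e.g. a Tokyo Tyrant or Mongo wrapper) to get
    meaningful numbers."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def benchmark(store, n=100000):
    """Time n sequential writes, then n sequential reads, against a
    store exposing set(key, value) / get(key); returns ops/second."""
    start = time.time()
    for i in range(n):
        store.set("key:%d" % i, "value:%d" % i)
    write_rate = n / (time.time() - start)

    start = time.time()
    for i in range(n):
        store.get("key:%d" % i)
    read_rate = n / (time.time() - start)
    return write_rate, read_rate

writes, reads = benchmark(DictStore(), n=50000)
print("writes/sec: %d  reads/sec: %d" % (writes, reads))
```

A real comparison would of course run many clients concurrently over the network, since that is how the store will actually be used.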
Round 1 Candidates
We immediately ruled out two solutions: Redis, because the entire data set needs to fit in memory, and Scalaris, because it does not persist data to disk (at least as of now).
We then settled on the following list of candidates:
- LightCloud
- Cassandra
- MongoDB
- Tokyo Tyrant
- CouchDB
- MemcacheDB
- Project Voldemort
Assessment of the Candidates
LightCloud
- Built on top of Tokyo Tyrant
- Used by the social networking service plurk.com: http://opensource.plurk.com/LightCloud/
- LightCloud is a distributed db -- it adds a universal hashing layer to Tokyo Tyrant
- Discussion of LC here: http://news.ycombinator.com/item?id=498581
- Supports scripting in Lua (which is supposedly extremely fast) - this is interesting because it might provide us a way to do auto increments
- Conclusion: LightCloud is definitely worth looking at given that Tokyo Tyrant gets good reviews
Cassandra
- Open-sourced by Facebook
- From the web page:
- "Cassandra was open sourced by Facebook in 2008, where it was designed by one of the authors of Dynamo. In a lot of ways you can think of Cassandra as Dynamo 2.0. Cassandra is in production use at Rackspace, Digg, and a number of other companies, but is still under heavy development."
- No compression
- Easy to administer, but not polished (according to Bob Ippolito)
- Used by Digg with a LOT of data (3TB): http://about.digg.com/blog/looking-future-cassandra
- Digg has open-sourced a Python client: http://github.com/digg/lazyboy/blob/master/README.mkd
- Nice write-up here: http://blog.tonybain.com/tony_bain/2009/12/is-cassandra-winning-the-nosql-race.html
- Conclusion: Cassandra meets our requirements well and it has the most institutional support of any of the solutions we're looking at.
MongoDB
- Good online documentation
- Increasing number of deployments, albeit none on the scale of Tokyo's
- Performance is supposedly very good, with claims of 20k inserts per second
- Supports sharding, but current sharding support is described as alpha: http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
- Supports replication
- Written in C++
- Easy to deploy with well-documented Python bindings (PyMongo)
- Jonathan Ellis claims that MongoDB is not a distributed database (probably because sharding is alpha): http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/#
- My guess is that by the time we need sharding in MongoDB, it will be production quality (~1 year from now)
- Conclusion: Mongo is a strong contender for our kv store.
Tokyo Tyrant
- Used by Mixi, the Japanese equivalent of Facebook
- Good performance (1.5M writes/second)
- Nice presentation here: http://www.scribd.com/doc/12016121/Tokyo-Cabinet-and-Tokyo-Tyrant-Presentation
- Replication strategy similar to MySQL
- Supports compression
- It's been suggested that it chokes at about 70GB of data: http://news.ycombinator.com/item?id=841942
- In order for Tokyo Tyrant to be a distributed database, we would need to put a consistent hashing layer in the application code that uses it (which is essentially what LightCloud does)
- Scripting with Lua
- Conclusion: given that TT is not distributed, we're passing on it in favor of its distributed offspring, LightCloud.
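The consistent hashing layer mentioned above -- which LightCloud provides on top of Tokyo Tyrant -- can be sketched in a few lines of Python. This is an illustrative toy, not LightCloud's actual implementation; the `HashRing` class and the node names are invented for the example.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring. Each node is hashed to `replicas`
    points on a ring of md5 values; a key lives on the first node
    clockwise from the key's own hash. Adding a node therefore
    remaps only the keys falling just before its points, instead of
    reshuffling almost everything."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}      # ring position -> node name
        self.points = []    # sorted ring positions
        for node in nodes:
            self.add(node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self.ring[self._hash("%s:%d" % (node, i))] = node
        self.points = sorted(self.ring)

    def node_for(self, key):
        idx = bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[self.points[idx]]

# e.g. route keys across three (hypothetical) Tokyo Tyrant nodes:
ring = HashRing(["tt1:1978", "tt2:1978", "tt3:1978"])
print(ring.node_for("user:42"))
```

Adding a fourth node with `ring.add(...)` would move only about a quarter of the keys to the new node, which is what makes growing the cluster tractable.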
CouchDB
- Written in Erlang
- CouchDB is often compared to Mongo, but is said to have inferior performance
- Example: http://www.snailinaturtleneck.com/blog/?p=74 (note, however, that this was written by a staffer at 10gen, the company behind Mongo)
- Conclusion: given that Couch is said to have inferior performance to Mongo, and given that Couch doesn't boast a larger install base than Mongo (it's not as though Couch is used by Facebook, for example), we decided to eliminate it.
MemcacheDB
- Persistent key-value store that uses the memcache protocol
- I believe it uses Berkeley DB, so this might be a deal-breaker
- MemcacheDB isn't a distributed database: you could treat it like a distributed database by implementing hashing at the app level, but as I understand it that's the only way to scale writes across multiple machines.
- Performance is excellent: http://memcachedb.org/benchmark.html
- Used by Digg: http://highscalability.com/blog/2009/2/14/scaling-digg-and-other-web-applications.html
- Conclusion: the fact that Digg uses MemcacheDB is certainly a point in its favor, but they seem to be the only ones, and I don't have the feeling that MemcacheDB has enough momentum behind it. This is in contrast to Mongo, Tokyo, and Cassandra, which are definitely in the ascendant.
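To see why app-level hashing is painful to scale, consider the naive approach of hashing each key modulo the server count. The `server_for` helper below is a hypothetical illustration, not part of MemcacheDB. When a fourth server joins a pool of three, roughly three-quarters of the keys map to a different server and would have to be migrated:

```python
import hashlib

def server_for(key, num_servers):
    """Naive app-level sharding: hash the key and take it modulo
    the number of servers."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

# Fraction of keys that move when the pool grows from 3 to 4 servers.
keys = ["user:%d" % i for i in range(10000)]
moved = sum(server_for(k, 3) != server_for(k, 4) for k in keys)
print("remapped: %.0f%%" % (100.0 * moved / len(keys)))  # roughly 75%
```

This remapping cost is exactly what consistent hashing (as used by LightCloud) avoids.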
Project Voldemort
- Produced and open-sourced by LinkedIn
- No re-balancing (if you set up a cluster with 10 nodes and then need an 11th, you have to migrate the data to a new cluster)
- Apparently there is garbage collection in spite of what Bob Ippolito said in his talk (http://groups.google.com/group/project-voldemort/browse_thread/thread/2746cd8bd3700162)
- From their website:
- Data is automatically replicated over multiple servers.
- Data is automatically partitioned so each server contains only a subset of the total data
- Server failure is handled transparently
- Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, and Java Serialization
- Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
- Each node is independent of other nodes with no central point of failure or coordination
- Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor
- Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.
- Limited Python support -- this appears to be all there is: http://github.com/etrepum/voldemort_client.
- Conclusion: the limited Python support makes me wary of using PV when other good options exist.
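The versioning bullet above refers to vector clocks, which Voldemort uses to detect conflicting concurrent writes. The sketch below is a simplified illustration of the idea -- the class and function names are invented for the example, and this is not Voldemort's actual Java implementation:

```python
class VectorClock:
    """Simplified vector clock: a map from node id to a counter of
    the writes that node has coordinated. Comparing two clocks
    tells us whether one version supersedes the other or whether
    the writes were concurrent and must be reconciled."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node):
        """Return a new clock recording one more write at `node`."""
        counts = dict(self.counts)
        counts[node] = counts.get(node, 0) + 1
        return VectorClock(counts)

    def descends_from(self, other):
        """True if this clock has seen every write the other has."""
        return all(self.counts.get(n, 0) >= c
                   for n, c in other.counts.items())

def compare(a, b):
    if a.descends_from(b) and b.descends_from(a):
        return "equal"
    if a.descends_from(b):
        return "a supersedes b"
    if b.descends_from(a):
        return "b supersedes a"
    return "conflict"  # concurrent writes: the client must reconcile

v0 = VectorClock()
v1 = v0.increment("node1")                 # write handled by node1
v2 = v0.increment("node2")                 # concurrent write at node2
print(compare(v1, v2))                     # conflict
print(compare(v1.increment("node1"), v1))  # a supersedes b
```

The "conflict" case is what lets the system stay available during partitions: both versions are kept, and the client reconciles them on the next read.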
Of our first round candidates, the following made it on to round 2:
- LightCloud
- Cassandra
- MongoDB
In the next article on the topic, we discuss how we assessed the remaining candidates.