I’m a huge fan of all the cloud technologies. I’ve been working on an M2M project on top of Cassandra and I can really say I love this distributed database. I’d like to give my feedback on it.
Easy management
Cassandra doesn’t require any manual management for complex operations like sharding data across nodes, restoring a crashed server, or putting a new or previously disconnected node back into the cluster. You just tell a node to join the cluster and watch it do all the work.
It’s obviously a bit harder to get started with Cassandra than with MySQL, but it’s conceptually easier to understand. Management tools are clearly lacking though.
Data paradigm change
You have extraordinary flexibility with Cassandra: you can add columns to column families (the “table” equivalent) at any time. But you can’t use indexes the same way you do in relational databases. For large indexed data, like time series, you need to build your own indexes.
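For example, adding a column to a live column family is a one-line operation. Here is a minimal sketch using CQL through the DataStax Python driver; the keyspace, table and column names are made up for the example and assume the keyspace already exists:

    # Minimal sketch; the 'm2m' keyspace and 'equipment' table are
    # illustrative, not my real schema.
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')

    # Add a new column to an existing column family: no downtime, no migration.
    session.execute("ALTER TABLE equipment ADD firmware_version text")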
Because everything is retrieved on a per-row basis and each row can live on a different server, you need to retrieve as much data as possible per row. This means you sometimes have to forget about storing a link to the data and store the data itself instead. In my case, data coming from equipment is stored twice: once keyed by equipmentId then time, and once keyed by equipmentId and dataType, then time.
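To give an idea, here is a sketch of that double-write pattern with CQL and the DataStax Python driver; the schema and names are illustrative, not my actual project:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')  # illustrative keyspace

    # One table per read pattern: by equipment, and by equipment + data type.
    session.execute("""
        CREATE TABLE IF NOT EXISTS data_by_equipment (
            equipment_id text, ts timestamp, data_type text, value text,
            PRIMARY KEY (equipment_id, ts, data_type))""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS data_by_equipment_and_type (
            equipment_id text, data_type text, ts timestamp, value text,
            PRIMARY KEY ((equipment_id, data_type), ts))""")

    def save(equipment_id, data_type, ts, value):
        # The same value is written twice, once per read pattern.
        session.execute(
            "INSERT INTO data_by_equipment (equipment_id, ts, data_type, value) "
            "VALUES (%s, %s, %s, %s)", (equipment_id, ts, data_type, value))
        session.execute(
            "INSERT INTO data_by_equipment_and_type (equipment_id, data_type, ts, value) "
            "VALUES (%s, %s, %s, %s)", (equipment_id, data_type, ts, value))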
There are some very interesting articles about this:
I found that in many cases, saving objects directly in JSON form made my life a lot easier. And as all data is compressed internally, it doesn’t take up too much additional space.
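As a sketch, storing a reading as JSON just means serializing it before the insert (same illustrative table as above):

    import json
    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')  # illustrative keyspace

    # The whole object goes into a single text column as JSON.
    reading = {"temperature": 21.5, "humidity": 0.4}
    session.execute(
        "INSERT INTO data_by_equipment (equipment_id, ts, data_type, value) "
        "VALUES (%s, %s, %s, %s)",
        ("eq-42", datetime.utcnow(), "weather", json.dumps(reading)))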
Particularities
As said earlier, it’s best to store rows with a lot of columns in Cassandra. Columns are often used in a completely different way than in relational databases; they can be time values, for instance. But you also have to take care not to create too many columns. I use rows of 100,000 columns without any problem. With 1M columns or more, data retrieval can take a long time (a matter of seconds). I discovered this while doing some profiling, and it came as a surprise because Cassandra is “advertised” as being able to handle billions of columns per row. So sure, it can handle billions of columns, but you shouldn’t do it.
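One way to avoid pulling millions of columns at once is to only ask for a slice of the row instead of the whole thing. A sketch, still with the illustrative schema from above:

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')  # illustrative keyspace

    # Only the columns between the two timestamps are read, not the full row.
    rows = session.execute(
        "SELECT ts, value FROM data_by_equipment "
        "WHERE equipment_id = %s AND ts >= %s AND ts < %s",
        ("eq-42", datetime(2013, 1, 1), datetime(2013, 1, 2)))
    for row in rows:
        print(row.ts, row.value)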
Cassandra supports TTL (Time To Live), which is very useful for temporary data like sessions or cached values. Expired data is garbage collected automatically.
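A sketch of what that looks like in CQL (illustrative table, one-hour TTL):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')  # illustrative keyspace

    session.execute(
        "CREATE TABLE IF NOT EXISTS sessions (session_id text PRIMARY KEY, payload text)")
    # The inserted values expire one hour after the write and are then
    # garbage collected automatically.
    session.execute(
        "INSERT INTO sessions (session_id, payload) VALUES (%s, %s) USING TTL 3600",
        ("s-123", "serialized session data"))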
Because it’s a distributed database, Cassandra distributes deletions as if they were values. A deleted column is in fact a column whose value has a deleted state (a tombstone). The data is actually removed one week after it was marked as deleted. This mechanism allows a failing node to be plugged back into the cluster up to one week after it disconnected.
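That grace period is a per-column-family setting, gc_grace_seconds. A sketch of setting it to one week on the illustrative table from above:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('m2m')  # illustrative keyspace

    # 604800 seconds = 7 days before tombstones become eligible for removal.
    session.execute("ALTER TABLE data_by_equipment WITH gc_grace_seconds = 604800")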
Deleted columns count as regular columns internally, so you might end up with serious performance issues if you delete and create a huge number of columns at the same time.
It eats all your memory
Cassandra with its default settings eats a lot of memory. With 2 GB it will throw OutOfMemoryErrors; with 4 GB it will flush data very frequently. It runs fine with 8 GB, and in production I like to give it 12 GB. (The memory allocated to the JVM can be adjusted with MAX_HEAP_SIZE in conf/cassandra-env.sh.) It’s not really a problem, you just have to buy bigger servers. But if you sell your software to be installed on a client’s infrastructure, this can be a bit more problematic.