One of the biggest issues with distributed databases is finding the right model to store your data. On a recent project, I decided to use a registry model.

The registry idea

The idea behind writing a registry is to have an easy way to both store and view data.

For a given device that has a {UUID} id:

  • I will access “/device/{UUID}/”.
  • Any properties will be stored in “/device/{UUID}/properties/”.
  • Deleting the device will delete everything stored under its path.
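
As a rough illustration of the path layout above, here is a minimal in-memory sketch in Python. The `Registry` class and its method names are hypothetical, and a plain dict stands in for the actual Cassandra storage. It only shows how prefix-based paths give you both a tree-like listing and recursive deletion.

```python
class Registry:
    """Toy path-based registry; a dict stands in for the real datastore."""

    def __init__(self):
        self._data = {}

    def put(self, path, value):
        self._data[path] = value

    def get(self, path):
        return self._data.get(path)

    def list(self, prefix):
        # Every key under a prefix, sorted so it reads like a tree listing.
        return sorted(k for k in self._data if k.startswith(prefix))

    def delete(self, prefix):
        # Deleting a node removes everything stored beneath it.
        for key in self.list(prefix):
            del self._data[key]


registry = Registry()
device_id = "1234"  # placeholder for a real UUID
registry.put(f"/device/{device_id}/properties/name", "thermostat")
registry.put(f"/device/{device_id}/properties/owner", "alice")

listing = registry.list(f"/device/{device_id}/")
registry.delete(f"/device/{device_id}/")
remaining = registry.list(f"/device/{device_id}/")
```

Here `listing` holds the two property paths and `remaining` is empty after the delete. A real implementation would map the prefix scan onto whatever range or slice query the datastore offers.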

Classical column-families to index data

The problem comes with the data we need to index. We could store everything in a registry manner, for example with a path like “/device/by-owner/{UUID}”:["{UUID1}","{UUID2}"]. But it’s just easier to use Cassandra secondary indexes and have each property of each entity written to the indexed columns of the column family.
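
To show why the registry-style index is more work, here is a small Python sketch of a hand-maintained “by-owner” path. Everything in it is an assumption made for illustration: `save_device`, `devices_of` and the two dicts are hypothetical, in-memory stand-ins, not Cassandra or Hector calls.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins: entity rows and a hand-maintained index.
devices = {}
by_owner = defaultdict(list)  # "/device/by-owner/{owner}" -> [device ids]


def save_device(device_id, owner_id):
    previous = devices.get(device_id)
    if previous is not None:
        # The index entry for the old owner must be cleaned up by hand.
        by_owner[f"/device/by-owner/{previous['owner']}"].remove(device_id)
    devices[device_id] = {"owner": owner_id}
    by_owner[f"/device/by-owner/{owner_id}"].append(device_id)


def devices_of(owner_id):
    return list(by_owner[f"/device/by-owner/{owner_id}"])


save_device("d1", "alice")
save_device("d2", "alice")
save_device("d1", "bob")  # d1 changes owner, so two index paths are touched
```

Every write has to update the index path as well as the entity, and a change of owner touches two index entries. With a secondary index on the owner column, the database takes care of that bookkeeping.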

Sample use case: file storage

So that is the basic “Registry” model. Storing files on top of it is quite easy: I simply treat a file as a sequence of chunks of data. So if I want to store a picture for a user, I can store it like this:

  • “/user/{UUID}/picture/” becomes the path of the picture.
  • “/user/{UUID}/picture/type” describes the type of this file (“file” or “directory”)
  • “/user/{UUID}/picture/filetype” describes the content type of this file (“text/plain” for example)
  • “/user/{UUID}/picture/size” describes the size of the file
  • “/user/{UUID}/picture/chunk-size” describes the size of each chunk that we will save
  • Then we will save each chunk from “/user/{UUID}/picture/0” to “/user/{UUID}/picture/X”.
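
The layout above can be sketched in a few lines of Python. This is only an illustration under assumptions: `store_file` and `load_file` are hypothetical helpers, a plain dict stands in for the registry, and the tiny chunk size is chosen to make the result easy to inspect.

```python
def store_file(registry, path, content, filetype, chunk_size=4):
    # Metadata entries live next to the chunks, under the same path.
    registry[path + "type"] = "file"
    registry[path + "filetype"] = filetype
    registry[path + "size"] = len(content)
    registry[path + "chunk-size"] = chunk_size
    # Chunks are numbered from 0; the last one may be shorter.
    for index, start in enumerate(range(0, len(content), chunk_size)):
        registry[f"{path}{index}"] = content[start:start + chunk_size]


def load_file(registry, path):
    size = registry[path + "size"]
    chunk_size = registry[path + "chunk-size"]
    count = (size + chunk_size - 1) // chunk_size  # ceiling division
    return b"".join(registry[f"{path}{i}"] for i in range(count))


registry = {}
picture_path = "/user/1234/picture/"  # "1234" is a placeholder for a real UUID
store_file(registry, picture_path, b"hello world", "text/plain", chunk_size=4)
restored = load_file(registry, picture_path)
```

Because the size and chunk size are stored with the file, a reader can work out how many chunks to fetch without listing the whole path first.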

Hector object mapper

I have to say that, until recently, I didn’t know this project existed.

I think HOM is a much better option in pretty much all cases. Still, having a simple tree view of your data can be a very useful feature for analyzing what you are working on.