One of the biggest issues with distributed databases is finding the right model to store your data. On a recent project, I decided to use a registry model.
The registry idea
The idea behind writing a registry is to have an easy way to both store and view data.
For a given device that has a {UUID} id:
- I will access “/device/{UUID}/”.
- Any properties will be stored in “/device/{UUID}/properties/”.
- Deleting the device deletes everything stored under its path.
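The path layout above can be sketched with a small in-memory stand-in. This is only an illustration of the idea, not the code from the project: the `Registry` class, its method names, and the sample property names are all hypothetical, and a plain dictionary replaces Cassandra.

```python
class Registry:
    """Toy in-memory registry mapping slash-separated paths to values.

    Hypothetical sketch: a dict stands in for the real storage backend.
    """

    def __init__(self):
        self._store = {}

    def put(self, path, value):
        self._store[path] = value

    def get(self, path):
        return self._store.get(path)

    def list_paths(self, prefix):
        """Return every stored path that starts with the prefix, sorted."""
        return sorted(p for p in self._store if p.startswith(prefix))

    def delete_tree(self, prefix):
        """Delete the entry at the prefix and everything stored below it."""
        for path in self.list_paths(prefix):
            del self._store[path]


registry = Registry()
device_id = "1234"  # placeholder standing in for a real {UUID}
registry.put(f"/device/{device_id}/", "device")
registry.put(f"/device/{device_id}/properties/name", "thermostat")
registry.put(f"/device/{device_id}/properties/owner", "alice")

# Deleting the device path removes its properties as well.
registry.delete_tree(f"/device/{device_id}/")
```

Keeping the trailing slash on the device path matters in this sketch: it stops a prefix match on one device from catching another device whose id merely starts with the same characters.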
Classical column-families to index data
The problem comes with the data we need to index. We could store everything registry-style, for example with a path such as “/device/by-owner/{UUID}”: [“{UUID1}”, “{UUID2}”]. But it is simply easier to use Cassandra secondary indexes and have each property of each entity written to the indexed columns of the column family.
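To make the trade-off concrete, here is a rough in-memory sketch of what an indexed column gives you: a lookup by property value without maintaining “by-owner” paths by hand. It does not use Cassandra or reproduce how its secondary indexes are implemented; the `IndexedDevices` class, the `owner` column, and the sample ids are assumptions made up for the example.

```python
from collections import defaultdict


class IndexedDevices:
    """Toy stand-in for a column family with a few indexed columns."""

    def __init__(self, indexed_columns):
        self._rows = {}
        # One reverse map per indexed column: value -> set of row keys.
        self._indexes = {col: defaultdict(set) for col in indexed_columns}

    def put(self, device_id, properties):
        old = self._rows.get(device_id, {})
        for col, index in self._indexes.items():
            # Drop the stale index entry before adding the new one.
            if col in old:
                index[old[col]].discard(device_id)
            if col in properties:
                index[properties[col]].add(device_id)
        self._rows[device_id] = dict(properties)

    def find_by(self, column, value):
        """Return the sorted row keys whose indexed column equals value."""
        return sorted(self._indexes[column].get(value, set()))


devices = IndexedDevices(indexed_columns=["owner"])
devices.put("uuid-1", {"owner": "alice", "name": "thermostat"})
devices.put("uuid-2", {"owner": "alice", "name": "camera"})
devices.put("uuid-3", {"owner": "bob", "name": "lock"})
print(devices.find_by("owner", "alice"))  # ['uuid-1', 'uuid-2']
```

With the registry-style alternative, the list stored at the “by-owner” path would have to be updated by the application on every write; here the index is maintained as a side effect of `put`.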
Sample use case: file storage
So that is the basic “Registry” model. Storing files on top of it is quite easy: I simply decided that a file is a set of chunks of data. So if I want to store a picture for a user, I could store it like this:
- “/user/{UUID}/picture/” becomes the path of the picture.
- “/user/{UUID}/picture/type” describes the type of this file (“file” or “directory”)
- “/user/{UUID}/picture/filetype” describes the content type of this file (“text/plain” for example)
- “/user/{UUID}/picture/size” describes the size of the file
- “/user/{UUID}/picture/chunk-size” describes the size of each chunk that we will save
- Then we will save each chunk from “/user/{UUID}/picture/0” to “/user/{UUID}/picture/X”.
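The file layout above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: a plain dictionary stands in for the registry, and the `store_file` and `load_file` helpers, the user id, and the chunk size of 4 bytes are all hypothetical.

```python
def store_file(store, path, data, filetype, chunk_size):
    """Write the file metadata, then one entry per chunk, under the path."""
    store[path + "type"] = "file"
    store[path + "filetype"] = filetype
    store[path + "size"] = len(data)
    store[path + "chunk-size"] = chunk_size
    # Chunks are numbered from 0; the last one may be shorter.
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        store[path + str(index)] = data[offset:offset + chunk_size]


def load_file(store, path):
    """Rebuild the file by reading its chunks back in order."""
    size = store[path + "size"]
    chunk_size = store[path + "chunk-size"]
    chunk_count = -(-size // chunk_size)  # ceiling division
    return b"".join(store[path + str(i)] for i in range(chunk_count))


store = {}
picture_path = "/user/1234/picture/"  # 1234 is a placeholder for {UUID}
store_file(store, picture_path, b"hello world", "text/plain", chunk_size=4)
restored = load_file(store, picture_path)
```

Storing the size and the chunk size next to the chunks lets a reader work out how many chunks to fetch without listing the whole path first.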
Hector object mapper
I have to admit that, until recently, I didn’t know this project existed.
I think HOM is a much better option in pretty much all cases. Still, having a simple tree view of your data can be a very useful feature for analyzing what you are working on.