Tag Archives: cassandra

Cassandra on droid.io

droid.io is another Travis CI clone.

I gave it a try to run some automated tests on top of Cassandra. Unfortunately, it doesn't support Cassandra out of the box, but adding support for it is actually quite easy.

Here is a script that downloads Cassandra, starts it, and waits for it to accept connections:

#!/bin/bash
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

# Download and unpack Cassandra
curl -LO http://archive.apache.org/dist/cassandra/2.0.5/apache-cassandra-2.0.5-bin.tar.gz
tar xf apache-cassandra-2.0.5-bin.tar.gz
ln -s apache-cassandra-2.0.5 cassandra
cd cassandra

# Create the data and log directories Cassandra expects
sudo mkdir -p /var/lib/cassandra /var/log/cassandra
sudo chown `whoami` /var/lib/cassandra /var/log/cassandra

# Start the server in the background
bin/cassandra

# Poll the CQL native port (9042) for up to 30 seconds
for i in {0..30}; do echo "Waiting for server ($i)..." ; nc localhost 9042 </dev/null && exit 0 ; sleep 1; done

# Give up: Cassandra never became reachable
exit 1

Cassandra CQL3 internal data structure

I'm a huge fan of Cassandra; I've been playing with it since 0.7 and I've never stopped using it. I would say its most amazing features are its simple, always-working replication and its predictable performance.

I was very happy when it went from a key-value store to a well-structured database with CQL. With CQL you can focus on your data rather than on how to organize your own structures to handle it properly. Still, under the hood it works the same way (it's still a KV store), which is why it's very important to understand how the internal structure is laid out.
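As a rough sketch (the table and values here are invented for the example), a CQL3 table with a clustering column is stored as one wide internal row per partition key, and each CQL row is flattened into cells whose names combine the clustering values with the CQL column names:

CREATE TABLE sensor_data (
    equipment_id text,
    event_time timestamp,
    value double,
    PRIMARY KEY (equipment_id, event_time)
);

-- Internally, every row sharing an equipment_id lives in a single wide row
-- keyed by the partition key; each CQL row becomes composite cells, roughly:
--   RowKey: "pump-42"
--     ("2014-01-01 00:00", "value") -> 12.5
--     ("2014-01-01 00:01", "value") -> 12.7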

That said, CQL is not a perfect replacement for the low-level interface yet. For example, you can't get a collection element's timestamp (called writetime in CQL): "SELECT map['value'] FROM table;" doesn't exist (it isn't valid CQL), so "SELECT writetime(map['value']) FROM table;" doesn't exist either, unfortunately.

This limitation is known to Cassandra's dev team, but there is indeed a syntax issue to solve first.
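To make this concrete, here is a small illustration (the table is invented for the example); writetime() works on a regular column, but there is no syntax to address a single collection element:

CREATE TABLE items (
    id text PRIMARY KEY,
    name text,
    properties map<text, text>
);

-- Works: the write timestamp of a regular column
SELECT writetime(name) FROM items WHERE id = 'a';

-- Not valid CQL: single map elements cannot be addressed,
-- so neither of these statements parses:
-- SELECT properties['color'] FROM items;
-- SELECT writetime(properties['color']) FROM items;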

Cassandra

I'm a huge fan of cloud technologies. I've been working on an M2M project on top of Cassandra, and I can really say I love this distributed database. I'd like to give my feedback on it.

Easy management

Cassandra doesn't require any manual management for complex operations like sharding data across nodes, restoring a crashed server, or putting a new or previously disconnected node back into the cluster. You just tell a node to join the cluster and watch it do all the work.

It's obviously a bit more difficult to get started with Cassandra than with MySQL, but it's conceptually easier to understand. Management tools are clearly lacking, though.

Data paradigm change

Cassandra gives you extraordinary flexibility: you can add columns to column families (the equivalent of tables) at any time. But you can't use indexes the way you do in relational databases; for large indexed data such as time series, you need to build your own indexes.

Because everything is retrieved on a per-row basis, and each row can live on a different server, you need to retrieve as much data as possible per row. This means you sometimes need to forget about storing a link to the data and store the data itself instead. In my case, data coming from equipment is stored twice: once keyed by equipmentId and then time, and once keyed by equipmentId and dataType, then time, as sketched below.
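Here is what that duplication could look like in CQL3 (the table and column names are mine, not the actual project's); each measurement is written to both tables, whose primary keys match the two read paths:

-- All data for a piece of equipment, ordered by time
CREATE TABLE data_by_time (
    equipment_id text,
    event_time timestamp,
    data_type text,
    payload text,
    PRIMARY KEY (equipment_id, event_time)
);

-- The same data, partitioned per (equipment, data type)
CREATE TABLE data_by_type (
    equipment_id text,
    data_type text,
    event_time timestamp,
    payload text,
    PRIMARY KEY ((equipment_id, data_type), event_time)
);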
There are some very interesting articles about this.

I found that in many cases, saving objects directly in JSON form made my life a lot easier. And as all data is compressed internally, it doesn't take much additional space.
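For example (a made-up schema along those lines), the whole object can be serialized into a single text column:

CREATE TABLE equipment_state (
    equipment_id text PRIMARY KEY,
    state_json text
);

INSERT INTO equipment_state (equipment_id, state_json)
VALUES ('pump-42', '{"temperature": 21.5, "status": "ok"}');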

Particularities

As said earlier, it's best to store rows with a lot of columns in Cassandra. Columns are often used in a completely different way than in relational databases; they can be time values, for instance. But you also have to take care not to create too many columns. I use 100,000 columns per row without any problem, but with 1M or more columns, data retrieval can take a lot of time (a matter of seconds). I discovered this while doing some profiling, and it came as a surprise because Cassandra is "advertised" as being able to handle billions of columns per row. So, sure, it can handle billions of columns, but you shouldn't do it.

Cassandra supports TTLs (Time To Live), which are very useful for temporary data like sessions or cached values: expired data is garbage-collected automatically.
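For example (a made-up session table), the row simply disappears one hour after the write:

CREATE TABLE sessions (
    session_id text PRIMARY KEY,
    user_id text
);

INSERT INTO sessions (session_id, user_id)
VALUES ('abc123', 'user42')
USING TTL 3600;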

Because it's a distributed database, Cassandra distributes deletions as if they were values: a deleted column is in fact a column whose value is in a deleted state (a tombstone). The data is only physically removed after a grace period following the deletion (gc_grace_seconds, ten days by default). This mechanism allows a failed node to be plugged back into the cluster, as long as it reconnects within that grace period.
Deleted columns count as regular columns internally, so you can end up with serious performance issues if you delete and create a huge number of columns at the same time.
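The grace period is tunable per table through the gc_grace_seconds property; for example, to make tombstones on the sessions table above eligible for removal after one day:

ALTER TABLE sessions WITH gc_grace_seconds = 86400;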

It eats all your memory

Cassandra with its default settings eats a lot of memory. With 2GB, it throws OutOfMemoryErrors; with 4GB, it flushes data very frequently. It runs OK with 8GB, and in production I like to give it 12GB of memory (the heap size is configured in conf/cassandra-env.sh). It's not really a problem, you just have to buy bigger servers. But if you sell your software to be installed on a client's architecture, this can be a bit more problematic.

Dom2Dom

UPDATE (25 July 2013):
I shut the project down because it was consuming a lot of resources for no particular result; it was merely a test of Cassandra. By the way, Cassandra didn't have any problem handling the load generated by this project: everything went very smoothly.

UPDATE (07 Nov 2011):
I reworked the same project with a totally different architecture, using Java/Servlets/Glassfish + Cassandra. It's just a test project to see how I could apply this kind of NoSQL database to other projects, so it's really simple.

UPDATE (02 Oct 2010):
The service is available here. Here is an example with Apple.

This is a little experimental project of mine. The goal is to be able to tell which domains are hosted on the same hosts as another domain. Some services already offer this, but they do a very crappy job. This service will do only that.

The program just goes from link to link to find new domain names. It stays on each domain as little as possible and doesn't store any information other than the domain name.

The only address I seeded the program with is the address of this blog; all the others were discovered.

Please tell me if you would be interested in this service too.

06/07/10:
In less than a week, the program has already collected a little over 400,000 hosts (and, I think, a little over 350,000 domains), and there really are a lot of porn sites.
I think I will start the DNS-requesting part in two weeks and the little web interface two weeks later. I could do it sooner, but the results would be very crappy anyway.

10/07/10:
We've now reached 1 million indexed hosts.

12/07/10:
I've added the DNS-requesting code. It's working fine (it's much easier to write and maintain). It has indexed 60,000 hosts so far. I've made a little web interface, but I'm not releasing it right away because the SQL queries aren't optimized yet (no worries, the database is).

15/07/10:
1.65 M hosts indexed
420 k hosts IP-linked
136 k different IP addresses found

21/07/10:
2.5 M hosts indexed
1.0 M hosts IP-linked
340 k IP addresses

02/08/10:
I fixed the program. It now identifies itself as "Dom2Dom/0.1.3865.36392_2010-08-01_20:13:04". It can no longer make too many requests to the same server (I limited it explicitly), and it should crawl the web more efficiently.
Statistics:
3.7 M hosts indexed
(almost) all hosts IP-linked
900 k IP addresses

04/08/10:
4.1 M hosts indexed
all hosts IP-linked (and it's starting to re-link old hosts to their potentially changed IP addresses)
957 k IP addresses

20/08/10:
5.8 M hosts indexed
all hosts IP-linked
1.2 M IP addresses

28/08/10:
I still can't find time to build the web interface, but the program keeps running and the great comments keep coming.
I took ten minutes to make a little stats page if you're interested:
http://dom2dom.webingenia.com/stats