Open-sourcing the content of this blog

Hi everyone,

Over the last few years, I launched the javacint Google group, which has grown into a good community of professionals working with the Cinterion (Java-enabled) chips. I also created a TC65 development document. All the questions and feedback you gave me on development around these chips helped me a lot to improve (what was) my document and (what was) my FAQ.

You helped me so much, in fact, that I believe this content should now be open for everyone to modify. That’s why I created the javacint wiki.

So from now on, for all your TC65i-related questions and feedback, please go to the javacint discussion group or the javacint wiki. And please share your knowledge on the javacint wiki.

I still provide development services around the Cinterion chips through my company, but I am trying to focus more on creating products with a few partners.

xrdp and the ulimits / nofile issue

You might have noticed that with xrdp on Debian (and quite possibly with a lot of other Linux tools and distributions), the user limits described in /etc/security/limits.conf are not enforced. In my case, that meant any session opened with xrdp had its maximum number of open files (nofile) set to 1024.

To fix this, edit the file /etc/pam.d/common-session and add the following line:

session    required   pam_limits.so

Limiting number of connections per IP with ufw

This is a personal reminder post.

The easiest attack one can perform on a web server is to open lots of connections and do nothing with them. Fortunately, iptables has a “connlimit” module to prevent this. If you’re using ufw like me, you will want to keep your rules well integrated with it.

In the /etc/ufw/before.rules file, after these lines:

# Don't delete these required lines, otherwise there will be errors
*filter
:ufw-before-input - [0:0]
:ufw-before-output - [0:0]
:ufw-before-forward - [0:0]
:ufw-not-local - [0:0]
# End required lines

You can add this to limit the number of concurrent connections:

# Limit to 10 concurrent connections on port 80 per IP
-A ufw-before-input -p tcp --syn --dport 80 -m connlimit --connlimit-above 10 -j DROP

And this to limit the rate of new connections:

# Limit to 20 connections on port 80 per 2 seconds per IP
-A ufw-before-input -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --set
-A ufw-before-input -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --update --seconds 2 --hitcount 20 -j DROP

This second set of rules might create some issues with HTTP clients that don’t support keep-alive (are there any?).
If you want to run some benchmarks (with ApacheBench, for example), you need to enable keep-alive and set the maximum number of keep-alive requests per connection very high (or unlimited).
In the Apache config, it is set with:

MaxKeepAliveRequests 0

Cassandra as registry

One of the biggest issues with distributed databases is finding the right model to store your data. On a recent project, I decided to use a registry model.

The registry idea

The idea behind writing a registry is to have an easy way to both store and view data.

For a given device that has a {UUID} id (a Hector sketch follows the list):

  • I will access “/device/{UUID}/”.
  • Any property will be stored under “/device/{UUID}/properties/”.
  • Deleting the device will delete all the contents this device contains.
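
Concretely, with the Hector Java client, one possible mapping is to use the registry path as the row key and each property name as a column. This is only a minimal sketch: the “Registry” column family, the keyspace name and the variables are assumptions, not the actual schema.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Assumed setup: a "Registry" column family in a "registry" keyspace.
Cluster cluster = HFactory.getOrCreateCluster("cluster", "localhost:9160");
Keyspace keyspace = HFactory.createKeyspace("registry", cluster);
StringSerializer ss = StringSerializer.get();

// Write a property: the path is the row key, the property name is the column.
Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
mutator.addInsertion("/device/" + uuid + "/properties/", "Registry",
        HFactory.createStringColumn("label", "living room sensor"));
mutator.execute();

// Deleting the device means deleting its rows, which removes everything it contains.
mutator.addDeletion("/device/" + uuid + "/properties/", "Registry");
mutator.execute();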

Classical column-families to index data

The problem comes with the data we need to index. We could store everything in a registry manner, for example with a path “/device/by-owner/{UUID}”:[“{UUID1}”,“{UUID2}”]. But it’s just easier to use Cassandra secondary indexes and have each property of each entity written to the indexed columns of the column family.
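
With Hector, such an indexed lookup could look like the sketch below; the “Device” column family and its indexed “owner” column are assumptions for the example.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.IndexedSlicesQuery;

StringSerializer ss = StringSerializer.get();

// Assumes a "Device" column family with a secondary index on the "owner" column.
IndexedSlicesQuery<String, String, String> query =
        HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss);
query.setColumnFamily("Device");
query.addEqualsExpression("owner", ownerUuid);
query.setRange(null, null, false, 100);

// Returns every device row whose "owner" column equals ownerUuid.
OrderedRows<String, String, String> rows = query.execute().get();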

Sample use case: file storage

So you get the basic “Registry” model. Storing files on top of it is quite easy: I just treated files as chunks of data. So if I want to store a picture for a user, I could store it like this (a sketch follows the list):

  • “/user/{UUID}/picture/” becomes the path of the picture.
  • “/user/{UUID}/picture/type” describes the type of this entry (“file” or “directory”).
  • “/user/{UUID}/picture/filetype” describes the content type of this file (“text/plain” for example).
  • “/user/{UUID}/picture/size” describes the size of the file.
  • “/user/{UUID}/picture/chunk-size” describes the size of each chunk that we will save.
  • Then we will save each chunk, from “/user/{UUID}/picture/0” to “/user/{UUID}/picture/X”.
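
Here is a minimal sketch of the chunked write, assuming the picture path is the row key and each sub-path (“type”, “size”, “0”, “1”, …) is a column name; the “Registry” column family and the chunk size are assumptions.

import java.util.Arrays;
import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Hypothetical helper: store a file as fixed-size chunks under its registry path.
static void storeFile(Mutator<String> mutator, String path, String mimeType, byte[] data) {
    final int chunkSize = 64 * 1024; // arbitrary chunk size for the example
    StringSerializer ss = StringSerializer.get();
    mutator.addInsertion(path, "Registry", HFactory.createStringColumn("type", "file"));
    mutator.addInsertion(path, "Registry", HFactory.createStringColumn("filetype", mimeType));
    mutator.addInsertion(path, "Registry", HFactory.createStringColumn("size", String.valueOf(data.length)));
    mutator.addInsertion(path, "Registry", HFactory.createStringColumn("chunk-size", String.valueOf(chunkSize)));
    for (int offset = 0, chunk = 0; offset < data.length; offset += chunkSize, chunk++) {
        byte[] part = Arrays.copyOfRange(data, offset, Math.min(offset + chunkSize, data.length));
        mutator.addInsertion(path, "Registry",
                HFactory.createColumn(String.valueOf(chunk), part, ss, BytesArraySerializer.get()));
    }
    mutator.execute(); // one batch mutation for the metadata and all the chunks
}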

Hector object mapper

I have to say that, until not that long ago, I didn’t know this project existed.

I think HOM is a much better option in pretty much all cases. Still, having a simple tree view of your data can be a very interesting feature for analyzing what you are working on.
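
To give a rough idea of what HOM looks like, here is a sketch of an annotated POJO; treat the exact annotations and the EntityManagerImpl usage as assumptions from memory rather than a reference.

import java.util.UUID;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// A POJO mapped to a "Device" column family (sketch).
@Entity
@Table(name = "Device")
public class Device {
    @Id
    private UUID id;

    @Column(name = "owner")
    private String owner;

    // getters and setters omitted for brevity
}

// Persisting and loading (the package name is a placeholder):
// EntityManagerImpl em = new EntityManagerImpl(keyspace, "com.example.model");
// em.persist(device);
// Device device = em.find(Device.class, id);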

TINC – Simple P2P VPN

The world is full of good surprises.

If, like me, you joined the NoSQL gang, chose Cassandra to store your data, and distributed your system among different datacenters, wouldn’t it be great to interconnect all your nodes on a virtual private network with no single point of failure? Well, TINC does just that. In fact, it does a little bit more, because it’s able to establish a meshed network even if hosts can’t directly contact each other (in case of a routing issue, a NAT firewall, etc.).

One of the amazing things about this software is that it’s really simple to set up. I followed some setup instructions and it just worked. I didn’t have to increase the verbosity or check any logs; it just worked everywhere.


The Mystery of the Duqu Framework

Update 2012-03-25:
It turns out it’s just some object-oriented C:
Kaspersky Lab experts now say with a high degree of certainty that the Duqu framework was written using a custom object-oriented extension to C, generally called “OO C” and compiled with Microsoft Visual Studio Compiler 2008 (MSVC 2008) with special options for optimizing code size and inline expansion.
Source


If you missed it in the news, you should definitely read this: The Mystery of the Duqu Framework.

I have a bit of culture around languages and frameworks, mostly because I’ve worked with C, C++, Objective-C, C# .NET, Java, JavaScript and PHP, but I’ve also read about or even run a few tests with languages like Python, Scala, Erlang, Caml, F#, VB or the D language. It has always been a great pleasure to discover these new languages, because each one shows how some human beings decided to create a new way of organizing intelligence. What usually happens around new languages (and frameworks) is that we talk about them and they either get adopted by developers or nothing is done with them. But in the beginning, at least, we usually talk a lot more about the language than about the projects done with it.

Here, security companies first discovered a virus (yet another one) and then discovered it was embedding its own framework (and we can pretty much guess there’s a dedicated language behind it). This story uncovers a real mystery with its set of questions: why did some people decide to create a new framework? Why was it only used (or seen) in a virus? How could it be especially applicable to a virus? Why did they decide to build everything internally rather than use standard C/C++ compilers?

There are a few things that are very interesting in this framework. It’s a low-level framework (no standard library), yet it’s totally event-driven, which is quite innovative. In modern sales-speak it means: very light and very scalable. “Function table is placed directly into the class instance and can be modified after construction”: you can change the behavior of any method of your object at any time. It’s quite a good idea (it can easily be done in JavaScript too, but that’s because JavaScript is super-permissive).

The conclusions of this article are:

  • The Duqu Framework appears to have been written in an unknown programming language.
  • Unlike the rest of the Duqu body, it’s not C++ and it’s not compiled with Microsoft’s Visual C++ 2008.
  • The highly event driven architecture points to code which was designed to be used in pretty much any kind of conditions, including asynchronous communications.
  • Given the size of the Duqu project, it is possible that another team was responsible for the framework than the team which created the drivers and wrote the system infection and exploits.
  • The mysterious programming language is definitively NOT C++, Objective C, Java, Python, Ada, Lua and many other languages we have checked.
  • Compared to Stuxnet (entirely written in MSVC++), this is one of the defining particularities of the Duqu framework.

Who could have built this framework? Well…
– Stuxnet, the last virus that mostly attacked Iran, was at least backed (and maybe created) by the USA.
– This one required the workforce of a pretty big organization (a lot of smart people put together to do evil things).
I hope we’ll discover who is behind this someday.

Source: The Mystery of the Duqu Framework

Cassandra

I’m a huge fan of all the cloud technologies. I’ve been working on an M2M project on top of Cassandra, and I can really say I love this distributed database. I’d like to give my feedback on it.

Easy management

Cassandra doesn’t require any kind of manual management for complex operations like sharding data across nodes, restoring a crashed server, or putting a new or previously disconnected node back into the cluster. You just have to tell a node to join the cluster and watch it do all the work.

It’s obviously a little more difficult to get started with Cassandra than with MySQL, but it’s conceptually easier to understand. Management tools are clearly lacking, though.

Data paradigm change

You have extraordinary flexibility with Cassandra: you can add columns to column families (the “table” equivalent) at any time. But you can’t use indexes the same way you do in relational databases. For large indexed datasets, such as time series, you need to build your own indexes.

Because everything is retrieved on a per-row basis, and each row can live on a different server, you need to retrieve as much data as possible per row. This means you sometimes need to forget about storing a link to the data and store the data itself instead. In my case, data coming from equipment is stored twice: once keyed by the equipmentId and then the time, and once keyed by the equipmentId and the dataType, then the time.
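
Here is a sketch of that double write with Hector; the column family names and key layouts are assumptions, the point being one wide row per equipment and one per equipment/dataType pair, with the time as the column name.

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

StringSerializer ss = StringSerializer.get();
LongSerializer ls = LongSerializer.get();
long time = System.currentTimeMillis();

Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
// Row per equipment: the time is the column name, the value carries the data.
mutator.addInsertion(equipmentId, "DataByEquipment",
        HFactory.createColumn(time, dataType + "=" + value, ls, ss));
// Row per equipment + dataType: the same data, keyed for the other access pattern.
mutator.addInsertion(equipmentId + ":" + dataType, "DataByEquipmentAndType",
        HFactory.createColumn(time, value, ls, ss));
mutator.execute();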

I found that in many cases, saving objects directly in JSON form made my life a lot easier. And as all data is compressed internally, it doesn’t take too much additional space.
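
For example (Gson is an assumption here, any JSON library does the job):

import com.google.gson.Gson;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Serialize the object once and store the JSON string in a single column.
Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
String json = new Gson().toJson(measurement);
mutator.addInsertion(equipmentId, "DataByEquipment",
        HFactory.createStringColumn("latest-json", json));
mutator.execute();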

Particularities

As said earlier, it’s best to store rows with a lot of columns in Cassandra. Columns are often used in a completely different way than in relational databases; they can be time values, for instance. But you also have to take care not to create too many columns. I use 100,000 columns without any problem, but with 1M or more columns, retrieving data can take a lot of time (a matter of seconds). I discovered this while doing some profiling, and it came as a surprise to me because Cassandra is “advertised” as being able to handle billions of columns. So, sure, it can handle billions of columns, but you shouldn’t do it.
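
In practice, this means paging through wide rows instead of fetching everything in one slice; here is a Hector sketch (the column family and key are assumptions):

import java.util.List;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

// Read a wide row 1000 columns at a time instead of in one huge query.
SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
query.setColumnFamily("DataByEquipment");
query.setKey(equipmentId);

Long start = null;
while (true) {
    query.setRange(start, null, false, 1000);
    List<HColumn<Long, String>> columns = query.execute().get().getColumns();
    if (start != null && !columns.isEmpty()) {
        columns = columns.subList(1, columns.size()); // drop the overlapping start column
    }
    if (columns.isEmpty()) {
        break;
    }
    for (HColumn<Long, String> column : columns) {
        // process column.getName() / column.getValue() here
    }
    start = columns.get(columns.size() - 1).getName();
}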

Cassandra supports TTLs (Time To Live), which are very useful for temporary data like sessions or cached values. The data is garbage collected automatically.
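
With Hector, the TTL is set on the column itself; here is a sketch for session data (the “Sessions” column family and the variables are assumptions):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// A session token that Cassandra expires (and garbage collects) on its own.
HColumn<String, String> column = HFactory.createStringColumn("token", sessionToken);
column.setTtl(3600); // in seconds: the column disappears one hour after the write

Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
mutator.addInsertion(sessionId, "Sessions", column);
mutator.execute();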

Because it’s a distributed database, Cassandra distributes deletions as if they were values. A deleted column is in fact a column whose value has a deleted state (a “tombstone”). The data is actually deleted one week after it was marked as deleted. This mechanism allows failing nodes to be plugged back into the cluster up to one week after they disconnected.
Deleted columns count as classical columns internally, so you might end up with serious performance issues if you delete and create a huge number of columns at the same time.

It eats all your memory

Cassandra with its default settings eats a lot of memory. With 2GB, it will throw some OutOfMemoryErrors; with 4GB, it will flush data very frequently. It runs OK with 8GB, and in production I like to give it 12GB of memory. It’s not really a problem, you just have to buy bigger servers. But if you sell your software to be installed on a client’s architecture, this can be a little more problematic.