UPDATE (25 July 2013)

I closed the project because it was consuming a lot of resources without producing any special result. It was merely a test of Cassandra. BTW, Cassandra didn’t have any problem handling the load generated by this project. It went very smoothly.

UPDATE (07 Nov 2011)

I reworked the same project with a totally different architecture. I used Java/Servlet/Glassfish + Cassandra. It’s just a test project to see how I could apply this kind of NoSQL DB to other projects, so it’s really simple.

UPDATE (02 Oct 2010)

Service is available here. Here is an example with Apple.

This is an experimental little project of mine. The goal is to be able to tell which domains are hosted on the same hosts as another domain. Some services already offer this, but they do a very crappy job. This service will do only that.

The program just goes from link to link to find new domain names. It spends as little time as possible on each domain, and it doesn’t store any information other than the domain name.

The only address I put into the program is the address of this blog. All the others were discovered.
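To illustrate the link-to-link discovery described above, here is a minimal sketch of the "keep only the domain name" step. It is not the original program: the names `LinkHostParser` and `extract_hosts` are hypothetical, and it assumes the crawler only looks at `<a href>` links over HTTP or HTTPS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkHostParser(HTMLParser):
    """Collects the hostnames referenced by <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL first.
                parts = urlsplit(urljoin(self.base_url, value))
                # Ignore mailto:, javascript: and other non-web schemes.
                if parts.scheme in ("http", "https") and parts.hostname:
                    self.hosts.add(parts.hostname)


def extract_hosts(html, base_url):
    """Return the set of hostnames linked from one page, and nothing else."""
    parser = LinkHostParser(base_url)
    parser.feed(html)
    return parser.hosts
```

A crawler built on this would fetch one page per newly seen host, call `extract_hosts` on it, and push any unseen hostnames onto its queue, starting from a single seed address.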

Please tell me if you would also be interested in this service.

06/07/10

In less than 1 week, the program has already collected a little more than 400 000 hosts (and I think a little more than 350 000 domains), and there really are a lot of porn sites.

I think I will start the DNS requesting part in two weeks and the little web interface two weeks later. I could do it sooner, but the results would be very crappy anyway.

10/07/10

We’ve now reached 1 million indexed hosts.

12/07/10

I’ve added the DNS requesting code. It’s working fine (it’s much easier to write and maintain). It has indexed 60 000 hosts. I’ve made a little web interface, but I’m not releasing it right away because the SQL queries aren’t optimized yet (but no worries, the database is).
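For readers curious what a DNS requesting step can look like, here is a minimal sketch using Python's standard `socket.getaddrinfo`. It is an assumption about the approach, not the original code: the function name `resolve_ipv4` and the injectable `lookup` parameter are hypothetical, and it only handles IPv4.

```python
import socket


def resolve_ipv4(host, lookup=socket.getaddrinfo):
    """Return the sorted IPv4 addresses for host, or [] if resolution fails."""
    try:
        infos = lookup(host, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # Unknown or unreachable name: report no addresses instead of crashing.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})
```

The `lookup` parameter exists only so the function can be exercised without network access. In normal use, `resolve_ipv4("example.com")` performs a real lookup.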

15/07/10

  • 1.650 M hosts indexed
  • 420 k hosts have been IP linked
  • 136 k different IP addresses have been found
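"IP linked" here means that a host has been associated with the address it resolves to. The real service stored this in an SQL database; as a rough in-memory illustration only, the sketch below keeps the links in two dictionaries so that "which other hosts share an address with this one?" is a direct lookup. The class `HostIndex` and its method names are hypothetical.

```python
from collections import defaultdict


class HostIndex:
    """Two-way mapping between hostnames and the IP addresses they resolve to."""

    def __init__(self):
        self.ips_by_host = {}
        self.hosts_by_ip = defaultdict(set)

    def link(self, host, ips):
        """Record (or refresh) the addresses a host resolves to."""
        # Drop stale links first, so a host that moved is no longer
        # listed under its old addresses.
        for old_ip in self.ips_by_host.get(host, ()):
            self.hosts_by_ip[old_ip].discard(host)
        self.ips_by_host[host] = set(ips)
        for ip in ips:
            self.hosts_by_ip[ip].add(host)

    def neighbours(self, host):
        """Hosts that share at least one address with host, excluding itself."""
        found = set()
        for ip in self.ips_by_host.get(host, ()):
            found |= self.hosts_by_ip[ip]
        found.discard(host)
        return found
```

Calling `link` again for an already known host replaces its previous addresses, which is one way to handle hosts whose IP address has changed.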

21/07/10

  • 2.5 M hosts indexed
  • 1.0 M hosts IP linked
  • 340 k IP addresses

02/08/10

I fixed the program. It now identifies itself as “Dom2Dom/0.1.3865.36392_2010-08-01_20:13:04”. It can no longer make too many requests to the same server (I limited it explicitly), and it should crawl the web more efficiently.
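One simple way to limit requests to the same server is to enforce a minimum delay between two requests to it. The sketch below shows that idea only; it is an assumption, not the limit the program actually uses, and `PerHostThrottle`, `try_acquire` and the interval value are hypothetical names and numbers.

```python
import time


class PerHostThrottle:
    """Allows at most one request per key every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_request = {}

    def try_acquire(self, host):
        """Return True and record the request if host may be contacted now."""
        now = self.clock()
        last = self.last_request.get(host)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_request[host] = now
        return True
```

The key can be a hostname or an IP address. Keying on the IP address limits the load on a shared server even when many different domains point to it.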

Statistics are:

  • 3.7 M hosts indexed
  • (almost) all hosts IP linked
  • 900 k IP addresses

04/08/10

  • 4.1 M hosts indexed
  • all hosts IP linked (and it’s starting to relink old hosts to their potentially changed IP addresses)
  • 957 k IP addresses

20/08/10

  • 5.8 M hosts indexed
  • all hosts IP linked
  • 1.2 M IP addresses

28/08/10

  • I still can’t find time to do the web interface, but the program keeps running and the great comments keep coming.
  • I took ten minutes to make a little stats page, if you’re interested:

http://dom2dom.webingenia.com/stats