Dom2Dom
May 1, 2010 - Florent Clairambault
UPDATE (07 Nov 2011):
I reworked the same project with a totally different architecture. I used Java/Servlet/Glassfish + Cassandra. It’s just a test project to see how I could apply this kind of NoSQL DB to other projects, so it’s really simple.
This is an experimental little project of mine. The goal is to be able to tell which domains are hosted on the same hosts as another domain. Some services already offer this, but they do a very crappy job. This service will do only that.
The program just goes from link to link to find new domain names. It stays as briefly as possible on each domain, and it doesn’t store any information other than the domain name.
The only address I put into the program is the address of this blog. All the others were discovered.
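To illustrate the link-to-link idea, here is a minimal Python sketch of the "keep only the domain name" step. It is not the actual Dom2Dom code: the names `LinkHostParser` and `extract_hosts` are made up for this example, and I'm assuming that only `http`/`https` links in `<a href>` tags are followed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkHostParser(HTMLParser):
    """Collects the host names found in the <a href="..."> links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            try:
                # Resolve relative links against the URL of the page.
                parsed = urlparse(urljoin(self.base_url, value))
                host = parsed.hostname
            except ValueError:
                continue  # malformed URL, ignore it
            # Assumption: only web links matter (mailto:, javascript:, ... are skipped).
            if parsed.scheme in ("http", "https") and host:
                self.hosts.add(host)


def extract_hosts(base_url, html):
    """Return the set of host names linked from one HTML page (hypothetical helper)."""
    parser = LinkHostParser(base_url)
    parser.feed(html)
    return parser.hosts
```

Everything else on the page is thrown away. A crawler built this way would queue each newly seen host and move on.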
Please tell me if you would also be interested in this service.
In less than 1 week, the program already collected a little bit more than 400 000 hosts (and I think a little bit more than 350 000 domains), and there really are a lot of porn sites.
I think I will start the DNS requesting part in two weeks and the little web interface two weeks later. I could do it sooner but the results would be very crappy anyway.
We’ve now reached 1 million indexed hosts.
I’ve added the DNS requesting code. It’s working fine (it’s much easier to write and maintain). It has indexed 60 000 hosts. I’ve made a little web interface but I’m not releasing it right away because the SQL queries aren’t optimized yet (but no worries, the database is).
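For the curious, the "IP linking" step can be sketched roughly like this in Python. Again, this is only an illustration under my own assumptions, not the real code: `link_hosts_to_ips` and `hosts_sharing_ip` are hypothetical names, the data lives in a dictionary instead of the SQL database, and each host is resolved to a single IPv4 address with `socket.gethostbyname`.

```python
import socket
from collections import defaultdict


def link_hosts_to_ips(hosts, resolve=socket.gethostbyname):
    """Group host names by the IP address they resolve to.

    `resolve` can be swapped for a fake resolver (handy for tests).
    """
    by_ip = defaultdict(set)
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError:
            # socket.gaierror is an OSError: the host does not resolve, skip it.
            continue
        by_ip[ip].add(host)
    return by_ip


def hosts_sharing_ip(host, by_ip):
    """Return the other hosts that were linked to the same IP as `host`."""
    for hosts in by_ip.values():
        if host in hosts:
            return hosts - {host}
    return set()
```

Once the hosts are grouped by IP, answering "what else is hosted with this domain" is just a lookup.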
1.65 M hosts indexed
420 k hosts IP linked
136 k IP addresses
2.5 M hosts indexed
1.0 M hosts IP linked
340 k IP addresses
I fixed the program. It is now identified as “Dom2Dom/0.1.3865.36392_2010-08-01_20:13:04”. It can’t make too many requests to the same server now (I limited it explicitly) and it should crawl the web more efficiently.
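One simple way to put an explicit limit on requests per server is a minimum delay between two requests to the same host. The Python sketch below shows that idea only. The class name `PerHostLimiter` and the 5-second default are assumptions made for the example, not the values the program actually uses.

```python
import time


class PerHostLimiter:
    """Allows at most one request per host every `min_delay` seconds."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay      # assumed value, tune as needed
        self.clock = clock              # injectable so it can be tested
        self.last_request = {}          # host -> time of the last allowed request

    def allow(self, host):
        """Return True (and record the request) if `host` may be contacted now."""
        now = self.clock()
        last = self.last_request.get(host)
        if last is not None and now - last < self.min_delay:
            return False
        self.last_request[host] = now
        return True
```

A crawler would call `allow(host)` before each request and put the URL back in the queue when it returns `False`, so a single server never gets hammered while other hosts keep being crawled.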
Statistics are:
3.7 M hosts indexed
(almost) all hosts IP linked
900 k IP addresses
4.1 M hosts indexed
all hosts IP linked (and it’s starting to relink old hosts to their potentially changed IP addresses)
957 k IP addresses
5.8 M hosts indexed
all hosts IP linked
1.2 M IP addresses
I still can’t find time to do the web interface, but the program keeps running and the great comments keep coming.
I took ten minutes to make a little stats page if you’re interested: