Dom2Dom

UPDATE (07 Nov 2011) :
I reworked the same project with a totally different architecture. I used Java/Servlet/Glassfish + Cassandra. It’s just a test project to see how I could apply this kind of NoSQL DB to other projects, so it’s really simple.

UPDATE (02 Oct 2010) :
Service is available here. Here is an example with Apple.

This is a small experimental project of mine. The goal is to tell which domains are hosted on the same hosts as another domain. Some services already offer this, but they do a very crappy job. This service will do only that.

The program just follows links from page to page to find new domain names. It stays as briefly as possible on each domain, and it doesn’t store any information other than the domain name.
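As an illustration of the link-following step described above, here is a minimal Python sketch of how host names could be pulled out of a page while discarding everything else. It is not the actual Dom2Dom code; the names LinkHostParser and hosts_in_page are hypothetical, and it only relies on the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkHostParser(HTMLParser):
    """Collects the host names of every <a href> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                parsed = urlparse(urljoin(self.base_url, value))
                if parsed.scheme in ("http", "https") and parsed.hostname:
                    # Only the host name is kept, nothing else from the page.
                    self.hosts.add(parsed.hostname.lower())


def hosts_in_page(base_url, html):
    """Returns the set of host names linked from the given HTML text."""
    parser = LinkHostParser(base_url)
    parser.feed(html)
    return parser.hosts
```

A crawler built on this would fetch a page, call hosts_in_page on it, record any host it has not seen before, and move on to another host.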

The only address I put in the program is the address of this blog. All the others were discovered.

Please tell me if you would also be interested in this service.

06/07/10 :
In less than a week, the program has already collected a little more than 400 000 hosts (and, I think, a little more than 350 000 domains), and there really are a lot of porn sites.
I think I will start the DNS resolution part in two weeks and the little web interface two weeks later. I could do it sooner but the results would be very crappy anyway.

10/07/10 :
We’ve now reached 1 million indexed hosts.

12/07/10 :
I’ve added the DNS resolution code. It’s working fine (it’s much easier to write and maintain). It has indexed 60 000 hosts. I’ve made a little web interface but I’m not releasing it right away because the SQL queries aren’t optimized yet (but no worries, the database is).
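To make the "IP linking" idea concrete, here is a hedged Python sketch: each host is resolved to an IP address, and hosts are then grouped by that address. This is an assumption about the general approach, not the real implementation. The function names are hypothetical, and the resolver is passed as a parameter so the logic can be tried without network access (socket.gethostbyname is the default).

```python
import socket
from collections import defaultdict


def group_hosts_by_ip(hosts, resolve=socket.gethostbyname):
    """Maps each IP address to the set of hosts that resolve to it."""
    by_ip = defaultdict(set)
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError:
            # The host does not resolve (anymore); skip it.
            continue
        by_ip[ip].add(host)
    return dict(by_ip)


def neighbours(host, by_ip, resolve=socket.gethostbyname):
    """Returns the other known hosts sharing the IP address of `host`."""
    return by_ip.get(resolve(host), set()) - {host}
```

In a real database the same grouping would be an indexed lookup on the IP column rather than an in-memory dictionary.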

15/07/10 :
1.65 M hosts indexed
420 k hosts IP linked
136 k different IP addresses found

21/07/10 :
2.5 M hosts indexed
1.0 M hosts IP linked
340 k IP addresses

02/08/10 :
I fixed the program. It is now identified as “Dom2Dom/0.1.3865.36392_2010-08-01_20:13:04”. It can no longer make too many requests to the same server (I limited it explicitly) and it should crawl the web more efficiently.
Statistics are :
3.7 M hosts indexed
(almost) all hosts IP linked
900 k IP addresses
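The explicit per-server limit mentioned in this update could look something like the following Python sketch. The class name PerHostLimiter and the default cap of 5 requests are hypothetical; the post does not say what limit was actually chosen or how it was enforced.

```python
from collections import Counter
from urllib.parse import urlparse


class PerHostLimiter:
    """Refuses further requests to a host once a fixed cap is reached."""

    def __init__(self, max_requests_per_host=5):  # 5 is an arbitrary example value
        self.max_requests_per_host = max_requests_per_host
        self.counts = Counter()

    def allow(self, url):
        """Returns True and counts the request if the host is under its cap."""
        host = urlparse(url).hostname
        if host is None or self.counts[host] >= self.max_requests_per_host:
            return False
        self.counts[host] += 1
        return True
```

The crawler would call allow(url) before every fetch and simply drop the URL when it returns False, which keeps the bandwidth for hosts it has not visited yet.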

04/08/10 :
4.1 M hosts indexed
all hosts IP linked (and it’s starting to relink old hosts to their potentially changed IP addresses)
957 k IP addresses

20/08/10 :
5.8 M hosts indexed
all hosts IP linked
1.2 M IP addresses

28/08/10 :
I still can’t find time to do the web interface, but the program keeps running and the great comments keep coming.
I took ten minutes to make a little stats page if you’re interested :
http://dom2dom.webingenia.com/stats


76 Responses to “Dom2Dom”

  1. Anton Sherwood Says:

    Okay … Why?

  2. Florent Clairambault Says:

    Hi Anton,

    Mostly because I have wanted a simple service like this for years, and a lot of other people might be interested. It’s just another service that gives a little more data for analyzing a website (a competitor’s, for instance).

  3. Anthony Milas Says:

    Just wanted to chime in and say I think it’s a great idea.

  4. cunt Says:

    wanker!!

  5. Florent Clairambault Says:

    Hi “Cunt”,

    I discovered a new word with your last comment.

    Best regards,
    Florent

  6. stephen Says:

    awesome idea

  7. Targenor Says:

    Interesting idea, but is the sole purpose to find out which sites are on the same host?

  8. Florent Clairambault Says:

    Hi Targenor,

    Yes, but this is the first step. Mostly because this kind of database takes a lot of time to create.

    By the way, it’s not necessarily the same host. It could be a multi-host hosting architecture.

  9. Evaluator Says:

    It is interesting to know our server neighbors.
    The idea is very good and we hope it will be kept free in the future.

  10. Florent Clairambault Says:

    Hi Evaluator,

    Thank you for your kind comment.

    No worries, the service will remain free.

  11. Evaluator Says:

    Perfect.
    Your service helped us to make a decision about our future hosting strategy.

  12. Florent Clairambault Says:

    Hum. Well, the service isn’t open yet, so I’m a little bit surprised. What hosting strategy did you choose ?

  13. Graphic Designer Says:

    That’s a rather good idea, and you’re right, there’s sites out there that already offer something similar, but it’s very limited in functionality. In actual fact, many only show about 10 results.

  14. classifieds Says:

    It is a great idea, but when will the service be available, and under what domain? Also, it would be great if you could determine the PR and AR of the domain name.

  15. classifieds Says:

    Also a subpage with traffic stats and keywords linked to that domain. It would be great if you could do this for free :) .

  16. Host Says:

    Thanks for links! Author bravo!

  17. Aiko Says:

    Found your referrer in my log files and I just was curious about what’s behind it :-) Don’t think it’s very useful to me at this moment but I don’t mind you visiting my site.

    Good luck with your project!

  18. Melvin Harr Says:

    I think it is interesting myself. As you can see, I am a web hosting company. How can I help you? As a web provider, is there something I might do to help you? I mean, why not.

  19. Consultanta SEO Says:

    So, did you find out whether more than one website is hosted on the same host ?

  20. Florent Clairambault Says:

    Yes, I’m already getting some results, but the program is still indexing. Your domain “consultanta-seo.ro”, for instance, isn’t indexed yet.

  21. Saundra Says:

    I don’t see why you have to hit my site over and over to figure this out- I’m an author, and each of the domains are clearly marked as belonging to me, considering they’re for my books. This is throwing off my stats, which I report to my publisher for advertising reasons.

  22. Florent Clairambault Says:

    Hi Saundra,

    Well, the “over and over” part is a mistake that I should fix.

    The good thing is that it shouldn’t come back to your website anytime soon.

    Best regards,

  23. Jason Cole Says:

    Hi Florent,
    I noticed my site got hit as well, about 155 times. What’s the overall goal and purpose here, if you don’t mind me asking?

  24. Florent Clairambault Says:

    Well well well. It seems like I have a pretty big bug. The idea is to stay on each server as little as possible. So it’s clearly not intentional.
    The crawler could potentially come back to a server that it has already visited (because it could help find more links). But that’s clearly not the goal (155 requests are 154 requests it could have spent on someone else’s server).

    I’ll try to fix this pretty soon.

    Sorry & Best regards,

  25. Ken Paulman Says:

    I may be having the same issue as Jason. I run a news aggregation site (St. Paul, MN, USA) and use bit.ly to track traffic on my outgoing links. I’ve recorded 102 clickthroughs today from your site, 66 of which occurred simultaneously at about 11:54 a.m. CDT. Looks like it hit every link on my page twice at that time.

    Sounds like you’re already on it, just hoping the additional information will be helpful.

  26. M Says:

    You’ve hit my site 160 times… mind fixing this?

  27. Florent Clairambault Says:

    Hi everyone,

    Well, I have changed the program but I still need to check that it no longer requests a lot of pages from the same server.
    Believe me, it is in my own interest to fix this. It has become a lose/lose situation because my server spends all its bandwidth on one server instead of crawling the web. 150 requests on one server means 149 requests that could have been made to other servers.

    Best regards,

  28. webhost-choice.com Says:

    my site also received 6 hits today from this url : )

    just wanted to let you know, and uhm keep up the good work i love this blog

  29. Dennis Short Says:

    Sounds like a cool tool… do you have a beta online somewhere?

    As above I found you in my stats…

  30. Florent Clairambault Says:

    The tool is already working, but I’d like to make a pretty interface. We (I’m working with another guy on a development services company) are currently building a website. The tool should be integrated as a “lab service” of that website and should come out just after the website is ready.

  31. cncfraese.com Says:

    In my case, several of my projects got hits from this URL. Amazing!

  32. London Product Photographer Says:

    One earlier commenter re hosting strategy might have had a good point. Scanning ISPs for the numbers of domains hosted on their servers (i.e. shared hosting) could help people decide which shared hosts are more likely to give them a good service. What if you used your algorithm to collect an average response time per ISP? Then publish the stats for each host with their average response times, maybe make a league table?

    I expect there are services doing these sorts of calculations anyway, but just a thought.

  33. Directory Says:

    Great tool! I use shared hosting (yes yes I know) and I always worry about bad neighbourhoods.

    Had problems with spammers/ domains with penalty being on the same IP as one of my sites before and took me ages to realise that my SEO problems had such a simple cause… yet it took me ages to realise what it was.

  34. webhost-choice.com Says:

    I keep getting more and more hits. Went from 11 to 34 in 3days! did u find out the problem yet? maybe you should first pull the url and check if its already crawled or not?

  35. V2J Says:

    Good idea, why not.

  36. Fatih Güner Says:

    Seems like a great thing, maybe you can send me some details about the program. Thanks

  37. Ciupanezul Says:

    Is a very good idea.

  38. Anton Sherwood Says:

    Dom2Dom accessed my site once in June and 268 times in July. It will be interesting to see what happens in August.

  39. Weekly Alibi Says:

    Well, if nothing else, you have found a way to attract the attention of various webmasters all over the world. :-) We’re all, “Who is this guy?” Glad to see you’re a nerd rather than a referrer spammer. Cheers!

  40. Stephen Says:

    Love to find out more info –what you have set up is a good
    I’ll post again soon

  41. Dan Milu Says:

    I was “touched” by you today. Interesting, but google still hates me :)

  42. Danica Says:

    FYI I see you in the August logs 88 times.

    But as was pointed out earlier, glad to see it’s not just referrer spam.

  43. mercado Says:

    Is a very good idea.

  44. Dominica Hotels Says:

    I was visited by your robot. Interesting to see how this all pans out.

  45. China Expats Says:

    Just to add that my website is getting hit as well, but only 14 so far in August. This is no problem for us or our stats, as we get a lot of traffic each month.

    As with Weekly Alibi above I’m thinking Who’s this? Just to know you’re a real guy trying to do something that seems pretty useful is no problem – thanks!

  46. Eddy Says:

    Thank you for nice idea.

    My site got many clicks from your URL address.
    What does it mean ?

    Is it the referrer of your software ?

    Thanks for your response.
    Eddy

  47. Andreas Says:

    Nice idea! Your bot visited my site too ;-) Greetings to all of the curious webmasters ;-)
    Looking forward to seeing the results- thx!

  48. Tony Lukasavage Says:

    You found me!

  49. Broken Arrow Says:

    Added Dom2Dom bot to our detection list :) (this doesn’t mean we prevent you from crawling)

  50. how to beijing Says:

    So that’s who was knocking…

  51. pedro Says:

    Thanks very much for your hundreds of visits. Since then we have got much more spam than before. Really nice.

  52. php developer Says:

    Interesting idea, in some situations there will be some parked domains, that do not consume server resources but just ‘count’ on same server.

    By the way, how and where will this information be used? For example, we have http://www.pkshops.com [no 1 web development and designing service]: if we know how many other domains are hosted on our server, where and how would we use this information?

    to know about resource sharing on server ????

    Thanks for your efforts.

  53. pooopbear Says:

    wouldn’t it be easier to get all that information from dns, instead of _really_ crawling all the sites?!?!
    *facepalm*

  54. Florent Clairambault Says:

    Hi,
    there are two main purposes :
    - SEO : Search engines don’t like hosts serving porn or phishing websites.
    - Intelligence : In a lot of cases, you can see what other websites a company owns.

    For “pkshops.com”, the dom2dom database currently shows :
    http://www.tajgames.com
    http://www.accountingformanagement.com
    http://www.manglavision.com
    http://www.mobileztotal.com
    http://www.123mobiles.info
    http://www.apnimodels.com
    http://www.stage.pk
    http://www.123filipina.com
    http://www.koolgirls.net
    http://www.123pakistani.com
    blog.anasimtiaz.com
    http://www.puttingblogsfirst.com
    http://www.dl4fun.com
    http://www.humanityworld.com
    http://www.hazara.edu.pk
    http://www.freeakhbar.com
    http://www.zeeshanusmani.com

    Still no interface because I don’t have a lot of time right now but the project continues.

    Best regards,

  55. JohnnyTurbo Says:

    Thank You All, I’ve thoroughly and utterly enjoyed your conversations, voted a couple of times and laughed (someone with the nickname c*** calling someone else a w***** [without any comment], terrific! - do you do that often, c***?)
    Still, what is the purpose of the crawl? Hosting companies that host 200.000 domains on one server will not survive anyway. But, what about 200 virtual machines on one [very large] server, each with their own IP address, each hosting tens or hundreds of domains. Can you tell the difference? How can you draw any conclusions? What are the underlying assumptions? I love the idea (I used to program, pre-internet days) but what do the results of the program mean? Enjoy and keep up the good work!
    [In the next few days I'll find out if this is the most elaborate trap ever to get 'live' email addresses of unsuspecting suckers (like me) and spam them to hell!!]

  56. Florent Clairambault Says:

    Hi Johnny,

    First of all, your emails won’t be disclosed to anyone. So this isn’t a trap. BUT some bots seem to index the pages of my blog, because the number of spam comments has significantly increased on my blog and someone reported that it has increased on his website. But in your case, Johnny, you didn’t specify any website so that won’t be a problem.

    So, about the actual question : this is a tool. It doesn’t always work, but in a lot of cases it works just fine. For a company like Google (I searched google.com and it gave me 321 hosts), it makes it possible to discover some pretty interesting projects they have launched, like http://rechargeit.org/ , http://www.466453.com and http://www.fiberforcommunities.com, or some weird Google aliases : http://www.western.com to Google, http://measuremap.com to Analytics. It can also show that http://directory.opensocial.org is hosted by the “standard” Google servers but http://www.opensocial.org is not.

    Short version: this is just a tool. It might be integrated as part of a bigger tool or even a so called “solution” but the current step is about this simple tool.

  57. Some Geeky Backend Web Stats for this Blog — Chris Abraham Says:

    [...] http://florent.clairambault.fr/dom2dom [...]

  58. johntrathome Says:

    Don’t understand a word of this, but I like people hitting on my Flickr site. I hope the bot enjoyed it – what does it like looking at?

  59. Seo tips Says:

    As a seasoned search engine pro, I can easily accede on what you’ve said. Still, a vital detail for anyone to always remember is that search engines will rank your blog, forum, or whatever high if you find juicy DoFollow backlinks to your site with proper anchored text. Do that, and nothing else sort of carries weight.

  60. Roger Says:

    Found your referrer in my log files and I just was curious about what’s behind it :-) Don’t think it’s very useful to me.

    Don’t see either why a French hoster wants to work in the Netherlands

  61. BigV Says:

    Dear Webmaster, with a name like Florent and the fact that you like crawling around peoples backends, I assume you are some sort of pervert. Would you mind crawling back under your rock and stop snooping around our servers.

  62. Steve Says:

    Hi,

    Interesting idea, but you appear not to have found the bug yet – you have visited my site 17 times this month.

    S.

  63. notjakebuttheotherguy Says:

    looks to me like a tool hackers could use to find (and subsequently exploit) other sites hosted on the same server as the target site they are looking to hack… :(

  64. M-Akif Says:

    Interesting idea…

  65. Your Mother Says:

    Looks like somebody’s BEGGING for a DDoS attack…

  66. Anton Sherwood Says:

    One hit in June, 268 in July, 99 in August.

  67. Darryl Says:

    Man… some people get all grumpy about crawlers.. It makes me wonder how Google ever made it big-time.

    Does your crawler acknowledge robots.txt? It should.. just my advice if it doesn’t.

  68. PK Directory Says:

    What a nice project, Really enjoyed the conversation. Great job.

  69. Dom Says:

    Hi,

    sounds great.. if your project pops up one day and if it still remains free as you wrote.

    As mentioned before, you got the attention of “many webmasters”… could be useful for other projects ;-)

    Be careful about what you are doing and how you will display the information. Depending on the country, you may have some legal issues… (not all webhosting companies may like your project).

    Good luck.

  70. john barrett Says:

    I saw 37 hits by your bot in my stats today.
    Well… welcome to my site if that helps you with your project.
    I find your project useful for webmasters

  71. Florent Clairambault Says:

    UPDATE (02 Oct 2010) :
    Service is available here. Here is an example with Apple.

  72. Tonuri De Apel Says:

    Why doesn’t it work with my domain? “This domain is unknown !”

  73. Florent Clairambault Says:

    I abandoned the project because MySQL really didn’t scale well: updates and inserts were becoming far too slow. I think this really is the kind of application where we need NoSQL databases.

  74. Tonuri de Apel Says:

    Interesting idea, it is a great tool.
    Too bad that you have abandoned the project. :(

  75. Forum Auto Says:

    Sounds like you’re already on it, just hoping the additional information will be helpful.
    Thanks.

  76. Tonuri de apel Says:

    You did a good job.
    I’m glad it worked and you could apply this kind of NoSQL DB to other projects.
    An update would be welcome.
