Dom2Dom
May 1, 2010 - Florent Clairambault
UPDATE (07 Nov 2011) :
I reworked the same project with a totally different architecture. I used Java/Servlet/Glassfish + Cassandra. It’s just a test project to see how I could apply this kind of NoSQL DB to other projects, so it’s really simple.
UPDATE (02 Oct 2010) :
Service is available here. Here is an example with Apple.
This is an experimental little project of mine. The goal is to be able to tell which domains are hosted on the same hosts as another domain. Some services already offer this but they do a very crappy job. This service will do only that.
The program just goes from link to link to find new domain names. It stays as little as possible on each domain, and it doesn’t store any information other than the domain name.
The only address I put in the program is the address of this blog. All the others were discovered.
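To make the link-to-link discovery concrete, here is a minimal sketch in Python. It is not the actual Dom2Dom code; `LinkHostParser` and `discover_hosts` are hypothetical names, and the sketch assumes the page HTML has already been downloaded. It keeps nothing but host names, as described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkHostParser(HTMLParser):
    """Collects the host names referenced by <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                parts = urlsplit(urljoin(self.base_url, value))
                # Ignore mailto:, javascript: and other non-web links.
                if parts.scheme in ("http", "https") and parts.hostname:
                    self.hosts.add(parts.hostname.lower())


def discover_hosts(page_url, html, known_hosts):
    """Return the host names linked from `html` that are not yet known."""
    parser = LinkHostParser(page_url)
    parser.feed(html)
    return parser.hosts - known_hosts
```

Starting from a single seed page, each newly discovered host would be added to the known set and queued for a later visit.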
Please tell me if you would also be interested in this service.
06/07/10 :
In less than 1 week, the program has already collected a little bit more than 400 000 hosts (and I think a little bit more than 350 000 domains), and there really are a lot of porn sites.
I think I will start the DNS requesting part in two weeks and the little web interface two weeks later. I could do it sooner but results would be very crappy anyway.
10/07/10 :
We’ve now reached 1 million indexed hosts.
12/07/10 :
I’ve added the DNS requesting code. It’s working fine (it’s much easier to do and maintain). It has indexed 60 000 hosts. I’ve made a little web interface but I’m not releasing it right away because the SQL queries aren’t optimized yet (but no worries, the database is).
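As an illustration of the "IP linking" step, here is a minimal Python sketch, not the real implementation. It resolves each host name with the standard `socket.gethostbyname_ex` call and groups hosts by IP address. The function names and the in-memory dictionary are assumptions made for the example; the real project stores this in a database.

```python
import socket
from collections import defaultdict


def resolve_ipv4(hostname):
    """Return the IPv4 addresses of `hostname`, or an empty list on failure."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except (OSError, UnicodeError):
        # Unknown host, DNS failure or invalid name: nothing to link.
        return []
    return addresses


def link_hosts_to_ips(hostnames, resolver=resolve_ipv4):
    """Build a mapping from IP address to the set of hosts resolving to it."""
    ip_to_hosts = defaultdict(set)
    for hostname in hostnames:
        for ip in resolver(hostname):
            ip_to_hosts[ip].add(hostname)
    return ip_to_hosts


def neighbours(hostname, ip_to_hosts, resolver=resolve_ipv4):
    """Return the other indexed hosts sharing at least one IP with `hostname`."""
    found = set()
    for ip in resolver(hostname):
        found |= ip_to_hosts.get(ip, set())
    found.discard(hostname)
    return found
```

The `resolver` parameter is only there so the lookup can be swapped out (for a cache, or for a fake in tests) without touching the grouping logic.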
15/07/10 :
1.650 M hosts indexed
420 k hosts IP linked
136 k different IP addresses have been found
21/07/10 :
2.5 M hosts indexed
1.0 M hosts IP linked
340 k IP addresses
02/08/10 :
I fixed the program. It is now identified as “Dom2Dom/0.1.3865.36392_2010-08-01_20:13:04”. It can’t make too many requests on the same server now (I limited it explicitly) and it should crawl the web more efficiently.
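One way such a per-server limit could work is sketched below in Python. This is an assumption about the general technique, not the fix that was actually deployed: `HostBudget` and the default of 5 requests per host are made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Refuses URLs already requested and caps the requests made per host."""

    def __init__(self, max_requests_per_host=5):
        # The default of 5 is an arbitrary value chosen for this sketch.
        self.max_requests_per_host = max_requests_per_host
        self.requests = Counter()
        self.seen_urls = set()

    def allow(self, url):
        """Return True and record the request if `url` may be fetched."""
        host = urlsplit(url).hostname
        if host is None or url in self.seen_urls:
            return False
        if self.requests[host] >= self.max_requests_per_host:
            return False
        self.seen_urls.add(url)
        self.requests[host] += 1
        return True
```

The crawler would call `allow(url)` before every request and simply skip the URL when it returns `False`, so the bandwidth goes to servers that have not been visited yet.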
Statistics are :
3.7 M hosts indexed
(almost) all hosts IP linked
900 k IP addresses
04/08/10 :
4.1 M hosts indexed
all hosts IP linked (and it’s starting to relink old hosts to their potentially changed IP addresses)
957 k IP addresses
20/08/10 :
5.8 M hosts indexed
all hosts IP linked
1.2 M IP addresses
28/08/10 :
I still can’t find time to do the web interface but the program and the great comments continue.
I took ten minutes to make a little stats page if you’re interested :
http://dom2dom.webingenia.com/stats
July 2, 2010 at 3:43 am
Okay … Why?
July 2, 2010 at 8:37 am
Hi Anton,
mostly because I have wanted a simple service like this for years, and a lot of other people might be interested. It’s just another service that gives a little more data for analyzing a website (a competitor’s, for instance).
July 5, 2010 at 10:53 am
Just wanted to chime in and say I think it’s a great idea.
July 7, 2010 at 3:12 pm
wanker!!
July 7, 2010 at 7:49 pm
Hi “Cunt”,
I discovered a new word with your last comment.
Best regards,
Florent
July 11, 2010 at 3:22 am
Awesome idea.
July 12, 2010 at 3:14 am
Interesting idea, but is the sole purpose to find out which sites are on the same host?
July 12, 2010 at 9:27 am
Hi Targenor,
Yes, but this is the first step. Mostly because this kind of database takes a lot of time to create.
By the way, it’s not necessarily the same host. It could be a multi-host hosting architecture.
July 13, 2010 at 2:00 pm
It is interesting to know our server neighbors.
The idea is very good and we hope it will be kept free in the future.
July 13, 2010 at 9:26 pm
Hi Evaluator,
Thank you for your kind comment.
No worries, the service will remain free.
July 13, 2010 at 9:29 pm
Perfect.
Your service helped us to make a decision about our future hosting strategy.
July 13, 2010 at 9:35 pm
Hum. Well, the service isn’t open yet, so I’m a little bit surprised. What hosting strategy did you choose?
July 16, 2010 at 12:18 am
That’s a rather good idea, and you’re right, there’s sites out there that already offer something similar, but it’s very limited in functionality. In actual fact, many only show about 10 results.
July 16, 2010 at 3:51 pm
It is a great idea, but when will the service be available, and under what domain? Also, it would be great if you could determine the PR and AR of the domain name.
July 16, 2010 at 3:52 pm
Also a subpage with traffic stats and keywords linked to that domain. It would be great if you could do this for free.
July 17, 2010 at 8:28 am
Thanks for links! Author bravo!
July 18, 2010 at 3:11 pm
Found your referrer in my log files and I just was curious about what’s behind it
Don’t think it’s very useful to me at this moment but I don’t mind you visiting my site.
Good luck with your project!
July 21, 2010 at 1:37 am
I think it is interesting myself. As you can see, I run a web hosting company. How can I help you? As a web provider, there may be something I can do to help. I mean, why not?
July 21, 2010 at 8:23 am
So, did you find out whether more than one website is hosted on the same host?
July 21, 2010 at 9:05 am
Yes, I’m already getting some results but the program is still indexing. Your domain “consultanta-seo.ro”, for instance, isn’t indexed yet.
July 21, 2010 at 3:48 pm
I don’t see why you have to hit my site over and over to figure this out- I’m an author, and each of the domains are clearly marked as belonging to me, considering they’re for my books. This is throwing off my stats, which I report to my publisher for advertising reasons.
July 21, 2010 at 10:10 pm
Hi Saundra,
Well, the “over and over” part is a mistake that I should fix.
The good thing is that it shouldn’t come back to your website anytime soon.
Best regards,
July 22, 2010 at 12:48 am
Hi Florent,
I noticed my site got hit as well, about 155 times. What’s the overall goal and purpose here, if you don’t mind me asking?
July 22, 2010 at 1:04 am
Well well well. It seems like I have a pretty big bug. The idea is to avoid staying on a server as much as possible. So it’s clearly not intentional.
The crawler could potentially come back to a server that it has already visited (because it could help find more links). But that’s clearly not the goal (155 requests are 154 requests it could have spent on someone else’s server).
I’ll try to fix this pretty soon.
Sorry & Best regards,
July 23, 2010 at 4:08 am
I may be having the same issue as Jason. I run a news aggregation site (St. Paul, MN, USA) and use bit.ly to track traffic on my outgoing links. I’ve recorded 102 clickthroughs today from your site, 66 of which occurred simultaneously at about 11:54 a.m. CDT. Looks like it hit every link on my page twice at that time.
Sounds like you’re already on it, just hoping the additional information will be helpful.
July 23, 2010 at 6:49 am
You’ve hit my site 160 times… mind fixing this?
July 24, 2010 at 1:28 am
Hi everyone,
Well, I have changed the program but I still need to check that it no longer requests a lot of pages on the same server.
Believe me, it is in my own interest to fix this. It has become a lose/lose situation because my server spends all its bandwidth on one server instead of crawling the web. 150 requests on one server means 149 requests that could have been made to other servers.
Best regards,
July 25, 2010 at 4:25 am
my site also received 6 hits today from this url : )
just wanted to let you know, and uhm keep up the good work i love this blog
July 25, 2010 at 5:01 am
Sounds like a cool tool… do you have a beta online somewhere?
As above I found you in my stats…
July 25, 2010 at 2:19 pm
The tool is already working but I’d like to make a pretty interface. We (I’m working with another guy on a development services company) are currently building a website; the tool should be integrated as a “lab service” of that website and will come just after the website is ready.
July 26, 2010 at 9:00 am
In my case, several of my projects got hits from this URL. Amazing!
July 28, 2010 at 9:30 am
One earlier commenter re hosting strategy might have had a good point. Scanning ISPs for the number of domains hosted on their servers (i.e. shared hosting) could help people decide which shared hosts are more likely to give them a good service. What if you used your algorithm to collect an average response time per ISP? Then publish the stats for each host with their average response times, maybe make a league table?
I expect there are services doing these sorts of calculations anyway, but just a thought.
July 28, 2010 at 11:52 am
Great tool! I use shared hosting (yes yes I know) and I always worry about bad neighbourhoods.
Had problems with spammers/ domains with penalty being on the same IP as one of my sites before and took me ages to realise that my SEO problems had such a simple cause… yet it took me ages to realise what it was.
July 29, 2010 at 5:16 pm
I keep getting more and more hits. Went from 11 to 34 in 3 days! Did you find the problem yet? Maybe you should first pull the URL and check whether it’s already been crawled?
July 29, 2010 at 8:30 pm
Good idea, why not.
July 31, 2010 at 2:15 pm
Seems like a great thing, maybe you can send me some details about the program. Thanks
August 2, 2010 at 1:28 am
Is a very good idea.
August 2, 2010 at 4:40 am
Dom2Dom accessed my site once in June and 268 times in July. It will be interesting to see what happens in August.
August 4, 2010 at 5:48 pm
Well, if nothing else, you have found a way to attract the attention of various webmasters all over the world.
We’re all, “Who is this guy?” Glad to see you’re a nerd rather than a referrer spammer. Cheers!
August 5, 2010 at 1:51 am
Love to find out more info –what you have set up is a good
i,ll post again soon
August 5, 2010 at 10:38 pm
I was “touched” by you today. Interesting, but Google still hates me.
August 7, 2010 at 3:08 pm
FYI I see you in the August logs 88 times.
But as was pointed out earlier, glad to see it’s not just referrer spam.
August 7, 2010 at 7:02 pm
Is a very good idea.
August 7, 2010 at 10:04 pm
I was visited by your robot. Interesting to see how this all pans out.
August 8, 2010 at 1:33 am
Just to add that my website is getting hit as well, but only 14 so far in August. This is no problem for us or our stats, as we get a lot of traffic each month.
As with Weekly Alibi above, I’m thinking “Who’s this?” Just knowing you’re a real guy trying to do something that seems pretty useful means it’s no problem. Thanks!
August 9, 2010 at 2:39 am
Thank you for nice idea.
My site got many clicks from your URL-address.
What does it mean?
Is it the referrer of your software?
Thanks for your response.
Eddy
August 9, 2010 at 11:57 am
Nice idea! Your bot visited my site too
Greetings to all of the curious webmasters 
Looking forward to see the results- thx!
August 12, 2010 at 2:49 pm
You found me!
August 13, 2010 at 9:26 pm
Added Dom2Dom bot to our detection list
(this doesn’t mean we prevent you from crawling)
August 14, 2010 at 1:28 pm
So that’s who was knocking…
August 15, 2010 at 3:34 pm
Thanks very much for your hundreds of visits. Since then we have got much more spam than before. Really nice.
August 16, 2010 at 11:48 am
Interesting idea, in some situations there will be some parked domains, that do not consume server resources but just ‘count’ on same server.
By the way, how and where will this information be used? For example, we have http://www.pkshops.com [no 1 web development and designing service]; if we know how many other domains are hosted on our server, where and how would we use this information?
To know about resource sharing on the server?
Thanks for your efforts.
August 19, 2010 at 1:50 pm
Wouldn’t it be easier to get all that information from DNS, instead of _really_ crawling all the sites?!?!
*facepalm*
August 19, 2010 at 8:14 pm
Hi,
there are two main purposes :
- SEO : Search engines don’t like hosts serving porn or phishing websites.
- Intelligence : In a lot of cases, you can see what other websites a company owns.
For “pkshops.com”, the dom2dom database currently shows :
http://www.tajgames.com
http://www.accountingformanagement.com
http://www.manglavision.com
http://www.mobileztotal.com
http://www.123mobiles.info
http://www.apnimodels.com
http://www.stage.pk
http://www.123filipina.com
http://www.koolgirls.net
http://www.123pakistani.com
blog.anasimtiaz.com
http://www.puttingblogsfirst.com
http://www.dl4fun.com
http://www.humanityworld.com
http://www.hazara.edu.pk
http://www.freeakhbar.com
http://www.zeeshanusmani.com
Still no interface because I don’t have a lot of time right now but the project continues.
Best regards,
August 20, 2010 at 10:28 pm
Thank You All, I’ve thoroughly and utterly enjoyed your conversations, voted a couple of times and laughed (someone with the nickname c*** calling someone else a w***** [without any comment], terrific! Do you do that often, c***?)
Still, what is the purpose of the crawl? Hosting companies that host 200.000 domains on one server will not survive anyway. But, what about 200 virtual machines on one [very large] server, each with their own IP address, each hosting tens or hundreds domains. Can you tell the difference? How can you draw any conclusions? What are the underlying assumptions? I love the idea (I used to program, pre-internet days) but what do the results of the program mean? Enjoy and keep up the good work!
[In the next few days I'll find out if this is the most elaborate trap ever to get 'live' email addresses of unsuspecting suckers (like me) and spam them to hell!!]
August 20, 2010 at 11:21 pm
Hi Johny,
First of all, your emails won’t be disclosed to anyone. So this isn’t a trap. BUT some bots seem to index the pages of my blog, because the number of spam comments on my blog has significantly increased and someone reported that it has also increased on his website. But in your case, Johny, you didn’t specify any website so that won’t be a problem.
So about the actual question : This is a tool. It doesn’t always work but in a lot of cases it works just fine. For a company like Google (I searched google.com and it gave me 321 hosts), it lets you discover some pretty interesting projects that they have launched, like http://rechargeit.org/ , http://www.466453.com and http://www.fiberforcommunities.com, or some weird Google aliases : http://www.western.com to google, http://measuremap.com to analytics. It can also show that http://directory.opensocial.org is hosted by the “standard” Google servers but http://www.opensocial.org is not.
Short version: this is just a tool. It might be integrated as part of a bigger tool or even a so called “solution” but the current step is about this simple tool.
August 21, 2010 at 6:17 pm
[...] http://florent.clairambault.fr/dom2dom [...]
August 23, 2010 at 4:05 pm
Don’t understand a word of this, but I like people hitting on my Flickr site. I hope the bot enjoyed it – what does it like looking at?
August 25, 2010 at 4:55 am
As a seasoned search engine pro, I can easily accede on what you’ve said. Still, a vital detail for anyone to always remember is that search engines will rank your blog, forum, or whatever high if you find juicy DoFollow backlinks to your site with proper anchored text. Do that, and nothing else sort of carries weight.
August 27, 2010 at 5:49 pm
Found your referrer in my log files and I just was curious about what’s behind it
Don’t think it’s very useful to me.
Don’t see either why a French hoster wants to work in the Netherlands.
August 28, 2010 at 12:38 pm
Dear Webmaster, with a name like Florent and the fact that you like crawling around peoples backends, I assume you are some sort of pervert. Would you mind crawling back under your rock and stop snooping around our servers.
August 29, 2010 at 9:20 pm
Hi,
Interesting idea, but you appear not to have found the bug yet – you have visited my site 17 times this month.
S.
August 31, 2010 at 11:55 am
Looks to me like a tool hackers could use to find (and subsequently exploit) other sites hosted on the same server as the target site they are looking to hack…
August 31, 2010 at 5:30 pm
Interesting idea…
September 1, 2010 at 10:26 pm
Looks like somebody’s BEGGING for a DDoS attack…
September 3, 2010 at 12:57 am
One hit in June, 268 in July, 99 in August.
September 3, 2010 at 6:44 am
Man… some people get all grumpy about crawlers.. It makes me wonder how Google ever made it big-time.
Does your crawler acknowledge robots.txt? It should.. just my advice if it doesn’t.
September 4, 2010 at 7:20 pm
What a nice project, Really enjoyed the conversation. Great job.
September 6, 2010 at 3:43 pm
Hi,
Sounds great, if your project pops up one day and if it still remains free as you wrote.
As mentioned before, you got the attention of “many webmasters”, which could be useful for other projects.
Be careful about what you are doing and how you will display the information. Depending on the country, you may run into some legal issues (not all web hosting companies may like your project).
Good luck.
September 6, 2010 at 10:11 pm
I saw 37 hits by your bot in my stats today.
Well… welcome to my site if that helps you with your project.
I find your project useful for webmasters
October 2, 2010 at 1:47 am
UPDATE (02 Oct 2010) :
Service is available here. Here is an example with Apple.
July 22, 2011 at 1:04 pm
Why doesn’t it work with my domain? “This domain is unknown !”
July 27, 2011 at 12:03 am
I abandoned the project because MySQL really didn’t scale well: updates and inserts were becoming far too slow. I think this really is the kind of application where we need NoSQL databases.
March 6, 2012 at 5:08 pm
Interesting idea, it is a great tool.
Too bad that you have abandoned the project.
March 8, 2012 at 1:42 pm
Sounds like you’re already on it, just hoping the additional information will be helpful.
Thanks.
March 19, 2012 at 11:13 pm
You did a good job.
I’m glad it worked and you could apply this kind of NoSQL DB to other projects.
An update would be welcome.