A new robot crawled my site on 2nd Jan. I know because it crawled my download script, which is prohibited by my robots.txt. The download log that was emailed to me showed the following info.

IP Address: 216.179.125.69
Host: tubgirl.biz
User-agent: WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803
Date: 2006-01-02 19:24:23
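For anyone wondering how a well-behaved robot would have known to stay away, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the `/download.php` path are stand-ins I made up for illustration; they are not my real file or script name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the rule and path are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /download.php
"""

# User-agent string exactly as it appeared in the download log.
AGENT = "WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803"


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(is_allowed(ROBOTS_TXT, AGENT, "/download.php"))  # disallowed path
print(is_allowed(ROBOTS_TXT, AGENT, "/index.html"))    # ordinary page
```

A polite crawler runs a check like this before every request. Nothing enforces it, though, which is why a hit on a disallowed script is such a clear signal.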

My interest piqued, I followed the trail. The robot belongs to a project run by Dennis Brown; his first post explains it best.

At this time, not many other folks seem to have commented on the WebVulnCrawl robot: Technorati gives nothing, and a Google search currently turns up a single blog entry, which seems quite negative in its reaction. I'm personally not that bothered. Dennis' research intrigues me, and I for one will be subscribing to his feed and waiting for his results, due in March.

For those who can't quite see what Dennis is trying to do: it reminded me of this great article on The Register, "Crackers use search engines to exploit weak sites":

"HotBot advanced search allows you to specify your search with file extensions, looking for sites or directories that include .dat files and the words 'index of' and 'admin' or 'customer'", Utreg says.

He showed us a file named data.txt on ISP Lanline.com's servers which contained the personal information of several hundred people, including their names, addresses, social security numbers and credit card account details - and all of it in plain text.

The article includes other tips on how one might locate such data:

Nothing listed in a 'robots.txt' file will turn up in a search query; but once a person has seen the directory and file names it contains, they can type them directly into their browser to access the various subdirectories and pages which the site administrators would rather keep hidden. These are of course the very subdirectories and files most likely to be of interest to crackers.
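To make that point concrete, here is a rough sketch of how little effort it takes to turn a robots.txt file into a list of "interesting" paths. The sample file and its paths are invented for illustration, and the parsing is deliberately simplistic (it ignores which user-agent each rule belongs to).

```python
# Hypothetical robots.txt: these paths are made up for illustration.
SAMPLE_ROBOTS_TXT = """\
# keep robots out of the private bits
User-agent: *
Disallow: /admin/
Disallow: /customers/   # order records
Disallow:
Allow: /public/
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, regardless of user-agent."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


print(disallowed_paths(SAMPLE_ROBOTS_TXT))
```

Each path it returns is one a visitor can type straight into a browser, which is exactly the weakness the article describes: robots.txt is a request to crawlers, not access control.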

The article finishes by noting that of course:

For Web site operators afraid of falling prey to such backdoor inquiries, the solution is painfully obvious and quite simple. Stop putting sensitive data in public places. A file which you would not print out and post on a billboard simply has no business being posted on a Web site.

Dennis' bot is an attempt at researching how many web site operators are putting sensitive data in public places regardless. Oops. If Dennis handles the data he collects professionally, notifying any vulnerable sites of their accidentally exposed data, then all power to WebVulnCrawl, say I.

Epilogue:

Amusingly, tubgirl.biz (the hostname returned by a reverse lookup of the WebVulnCrawl IP) never seems to load its index page, but a Google search for the domain gives the following snippet. Nice touch, Dennis.

tubgirl.biz listing
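For the curious, the reverse lookup mentioned above can be sketched in a few lines of Python with the standard `socket` module. The `reverse_lookup` helper and its `resolver` parameter are names of my own; the parameter exists only so the function can be tried without touching live DNS.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname a PTR lookup gives for ip, or None if there is none."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        # No reverse record, or the resolver could not answer.
        return None
    return hostname


def fake_resolver(ip):
    """Stand-in resolver (an assumption for offline demonstration only)."""
    return ("host.example.invalid", [], [ip])


print(reverse_lookup("192.0.2.1", resolver=fake_resolver))
```

Calling `reverse_lookup("216.179.125.69")` with the default resolver would query real DNS, and whatever it returns today may well differ from what my log recorded in 2006.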

Keeping it light-hearted, should WebVulnCrawl come visiting here again, it will find a new page to analyse. First person to post the "phrase of the day" in a comment will win instant fame and fortune ;)