Response to julie:
I think you have misunderstood. You say
"So the claim that he's only going through googled links or hacks to find sites is bogus."
I've not seen anyone claim that's what he's doing.
What he is doing is simply making his way one by one through a big list of dotcoms. Your dotcom is obviously one of them, as is mine. For each dotcom, his bot looks at the robots.txt and deliberately does the opposite of what a regular bot does - it downloads all the disallowed content for later analysis.
Why? Because some people are silly enough to put sensitive restricted material on their websites and then try to "hide" it by disallowing bots from indexing it. They think this means that because it isn't listed in a search engine, no-one will be able to find it, so it will be safe. But of course anyone can download it once they know the URL so it is far from safe. In fact, by listing it in robots.txt, a foolish webmaster is providing the means to work out the URL, therefore actually advertising the fact that the "hidden" data exists!
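To make the mechanics concrete, here is a minimal Python sketch of both behaviours, run against a made-up robots.txt for www.example.com. This is not Dennis' code, which I haven't seen; the sample file, the ExampleBot agent name and both function names are invented for illustration. The first function does what a regular bot does, using the standard library's urllib.robotparser. The second just reads off the Disallow lines and turns them into full URLs, without touching the network.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Made-up robots.txt, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /secret-reports/   # meant to stay hidden
Disallow: /admin/backup.zip
"""

BASE_URL = "http://www.example.com/"  # placeholder site


def polite_bot_may_fetch(robots_txt, url, agent="ExampleBot"):
    """What a regular bot does: obey the Disallow rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def disallowed_urls(base_url, robots_txt):
    """What anyone can do: read the Disallow lines as a list of URLs."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or not a "field: value" line
        path = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and path:
            urls.append(urljoin(base_url, path))
    return urls


if __name__ == "__main__":
    hidden_page = urljoin(BASE_URL, "/secret-reports/q3.html")
    print(polite_bot_may_fetch(SAMPLE_ROBOTS_TXT, hidden_page))
    for url in disallowed_urls(BASE_URL, SAMPLE_ROBOTS_TXT):
        print(url)
```

Pasting your own robots.txt into the sample string is a quick way to see exactly which locations you are advertising to anyone who cares to look.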
Dennis' bot is automatically gathering the material for his research, which will reveal how widespread such foolishness is. He will have to manually analyse all the data collected to work out if any of it is sensitive material. Hopefully he can help a few webmasters secure their stuff properly.
So WebVulnCrawl downloaded all the disallowed stuff on my site, which piqued my curiosity but didn't cause me any trouble or reveal any embarrassing secrets. I'm surprised at what you say it did on your site - why would it look for mtfiles if that is not in your robots.txt? It didn't do that on my site. Perhaps Dennis' implementation is a bit buggy?