Every time I create or use a pipe, I'm indirectly causing hits on some third party's website. So I was curious to learn how the Yahoo! Pipes backend behaves. What caching does it do? Is it a good, well-behaved web citizen in general?

There isn't much official documentation to go on. The Pipes Troubleshooting guide has some notes on how to stop Pipes from downloading a feed too frequently and how to stop Pipes from using feeds at all.

So, to find out more, I put myself in the shoes of the third-party website that Pipes hits.

I created a simple pipe to pull a page from my own website, set up a cron job to request the output of that pipe from Yahoo! every 60 seconds, and sat back a while (okay... for nearly two months!) to see what the logs reveal.
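
The polling setup was nothing elaborate. A minimal sketch of the sort of thing I ran (the pipe ID, the exact URL format, and the log path here are placeholders rather than my actual values):

    #!/usr/bin/env python
    # fetch_pipe.py -- request the pipe's RSS output and log the result.
    # Driven by cron every minute, with an entry along the lines of:
    #   * * * * * /usr/bin/python /home/me/fetch_pipe.py
    import time
    import urllib.request

    # Placeholder pipe URL -- substitute your own pipe's id.
    PIPE_URL = "http://pipes.yahoo.com/pipes/pipe.run?_id=YOUR_PIPE_ID&_render=rss"

    with urllib.request.urlopen(PIPE_URL) as response:
        body = response.read()

    # Note when we polled and how much came back.
    with open("/home/me/pipe_poll.log", "a") as log:
        log.write("%s %d bytes\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), len(body)))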

The general impression is good. Even though I was hitting the Pipe hosted by Yahoo! every 60 seconds, Yahoo! in turn only hits the original source on my website once every 10 minutes or so. The interval between requests varies from 3 to 17 minutes, but a 10 minute interval is the most common (see the log-analysis sketch after the IP lists below).

The bot runs with a user agent of "Yahoo Pipes 1.0", and requests originate mainly from two IPs:

  • 69.147.112.168 (db1.pipes.re3.yahoo.com)
  • 69.147.112.169 (db2.pipes.re3.yahoo.com)

although a handful of requests originate from:

  • 68.180.194.242 (rsscache1.pipes.sp1.yahoo.com)
  • 68.180.194.243 (rsscache2.pipes.sp1.yahoo.com)
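
Those figures came from a bit of scripting over my webserver's access log. Here's a rough sketch of the sort of analysis, assuming Apache's combined log format (which records the user agent); the log path is specific to my server:

    #!/usr/bin/env python
    # pipes_log_stats.py -- summarise Yahoo Pipes bot activity from an
    # Apache access log in the combined format.
    import re
    from collections import Counter
    from datetime import datetime

    LOG_PATH = "/var/log/apache2/access.log"
    # e.g. 69.147.112.168 - - [12/Nov/2007:10:04:31 +0000] "GET /feed HTTP/1.1" ...
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

    hits_per_ip = Counter()
    timestamps = []

    for line in open(LOG_PATH):
        if "Yahoo Pipes" not in line:   # filter on the bot's user agent
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, stamp = m.groups()
        hits_per_ip[ip] += 1
        # Drop the timezone offset for simplicity.
        timestamps.append(datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S"))

    print("Requests per IP:")
    for ip, count in hits_per_ip.most_common():
        print("  %s  %d" % (ip, count))

    # Assumes the log is already in chronological order.
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(timestamps, timestamps[1:])]
    if gaps:
        print("Interval between requests: min %.1f min, max %.1f min, mean %.1f min"
              % (min(gaps), max(gaps), sum(gaps) / len(gaps)))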

The Pipes bot periodically checks my robots.txt, despite what the official documentation would have you believe:

  "Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file."

So, I tried the obvious, using each of the following in turn in my robots.txt:

User-agent: Yahoo Pipes
Disallow: /

User-agent: Pipes
Disallow: /

Unfortunately, neither attempt stopped the bot from continuing to request the source from my site. I think Yahoo! needs to be a bit clearer on their policy here. Either they don't support robots.txt, in which case they should stop downloading it, or they do support it, in which case they need to state explicitly what user-agent string we should use to block it.
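
As a sanity check that the rules themselves were sound, it's worth seeing how a compliant client would read them. This is a sketch using Python's standard urllib.robotparser module, not anything Yahoo! runs, and the URL is just an example:

    #!/usr/bin/env python
    # What a robots.txt-honouring client would make of the second rule above.
    from urllib.robotparser import RobotFileParser

    rules = ["User-agent: Pipes", "Disallow: /"]
    rp = RobotFileParser()
    rp.parse(rules)

    # robotparser matches user agents by case-insensitive substring, so a
    # rule for "Pipes" applies to the full "Yahoo Pipes 1.0" string.
    print(rp.can_fetch("Yahoo Pipes 1.0", "http://example.com/feed"))  # False

So either rule ought to have been enough for a bot that honoured the protocol.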

I have suggested as much on the Pipes Suggestion Board; please vote for the suggestion if you agree.

The Pipes bot does cache robots.txt: my logs show the request being answered with a 304 (Not Modified) response, which suggests the bot is making conditional requests using the If-Modified-Since or If-None-Match header. Oddly, though, every other request is served up as a 200 by my server, so either the Pipes bot is flushing its cache very frequently for some reason, or my webserver is acting up... again :)
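
If you've not come across conditional requests before, the mechanics are easy to see for yourself. A minimal sketch using Python's standard library; the URL is a placeholder, and it assumes the server sends an ETag or Last-Modified header in the first place:

    #!/usr/bin/env python
    # A conditional GET: send back the validators from the first response
    # and a well-behaved server answers 304 without re-sending the body.
    import urllib.error
    import urllib.request

    URL = "http://example.com/robots.txt"  # placeholder

    # First request: an ordinary GET; note the validators in the response.
    first = urllib.request.urlopen(URL)
    etag = first.headers.get("ETag")
    last_modified = first.headers.get("Last-Modified")
    first.read()

    # Second request: echo the validators back to the server.
    req = urllib.request.Request(URL)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)

    try:
        second = urllib.request.urlopen(req)
        print(second.status)   # 200 -- resource changed (or no validators sent)
    except urllib.error.HTTPError as err:
        # urllib raises on 304, since it isn't a 2xx response.
        print(err.code)        # 304 -- Not Modified, body not re-sent

A cache that is working should produce mostly 304s for a file that never changes, which is what makes the alternating 200s in my logs look odd.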

There's a lot more analysis that can still be done, but I thought I'd share some initial findings.