Yahoo! recently released a new Fetch Page module which dramatically increases the number of useful things that Pipes can do. With this new "pipe input" module we're no longer restricted to working with well-organised data sets in supported formats such as CSV, RSS, Atom, XML, JSON, iCal or KML. Now we can grab any HTML page we like and use the power of the Regex module to slice and dice the raw text into shape.

In a nutshell, the Fetch Page module turns Yahoo! Pipes into a fully fledged web scraping IDE!

Yahoo! Pipes is a web scraping IDE in a nutshell

As it happens, I already have a web scraping project which has been broken for some time now. I don't have the energy to check out the hacky old PHP scrapers and debug the problem. But with Yahoo! Pipes and the Fetch Page module to hand, I can throw away my PHP scripts and their associated libraries, delete the cron jobs and free my overloaded webserver from the onerous responsibility. Time to get cracking.

The Task at Hand

My web hosting provider (LunarPages - affiliate link alert!) runs a useful web forum at lunarforums.com. There are many different boards, the most useful being the announcement style boards such as Server Information. Unfortunately, the forum software only provides a single feed that contains the last 10 posts across all the boards on the entire site. But I want a feed of the last few posts to the boards I'm interested in, so I can keep up to date on the latest patches and hardware upgrades that might affect my sites without having to wade through posts from the "spaghetti crocheting" board.

So, what will this entail?

I can get most of the information I need from the first page of the Server Information board. Each row of the table shows a topic posted to the board, and will become a single item in the RSS feed. Each row provides the URL for the topic (to use as the item link), the item title, the name of the author and the published date of the post.

board

For the content of each item in the feed, I'll have to follow the link to the topic and extract the content of the first post. I'm only interested in the announcement style boards, so I don't really need entire threads of posts to a topic.

topic

Starting the Pipe

It's time to head on over to Yahoo! Pipes, log in, and create a new pipe. If you don't want to follow along at home, you can view the source for one I made earlier at http://pipes.yahoo.com/daybarr/lunarforums. Note how Pipes lets you edit your pipe URLs now. Much better than http://pipes.yahoo.com/pipes/pipe.info?_id=sm_Vqdqk3BGGw_O5yp1_DQ

To get started, drag a Fetch Page module from the Sources section of the Library pane onto the canvas, enter the URL for the Server Information board into the URL field, refresh the pipe debugger and behold - a single item with a content field containing the source of the Server Info board.

step01

Fantastic!

We don't have to restrict the pipe to just this one board, so we can make the URL configurable by adding a URL Input module and wiring it up to the URL field of the Fetch Page module.

step02

Use a Name of "Board" for the URL Input and give it a suitable Prompt. The Prompt appears next to the form field on the pipe's run page and the Name is used in the query string when running the pipe.
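To make the role of the Name concrete, here is a rough Python sketch of how a run URL for the pipe might be put together. The pipe.run path and the example.com board URL are my assumptions for illustration, so check the links on the pipe's own run page for the real thing. The only point being made is that the Name ("Board") becomes the query string key.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint: the article only shows the pipe.info URL, so the
# pipe.run path is a guess at the matching "run" address.
RUN_ENDPOINT = "http://pipes.yahoo.com/pipes/pipe.run"

# Hypothetical board URL, used only for illustration.
board_url = "http://example.com/board/1"

params = {
    "_id": "sm_Vqdqk3BGGw_O5yp1_DQ",  # the pipe id quoted earlier in the post
    "Board": board_url,               # key is the Name given to the URL Input
}

# urlencode percent-encodes the board URL so it survives inside another URL.
run_url = RUN_ENDPOINT + "?" + urlencode(params)
print(run_url)

# Round trip: decoding the query string gives back the original board URL.
print(parse_qs(urlsplit(run_url).query)["Board"][0])
```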

step03

So we have the Fetch Page module happily returning the HTML for any board. Now we just need to take the huge string of tag soup and trim it into shape. We can use the "cut" feature of the Fetch Page module to cut out just the table that we're interested in. The "split" feature can then be used to split the table into rows, one row per post. The split will create an item in the pipe output for each row in the table that we cut out. To do their job, both "cut" and "split" require us to provide some appropriate text to use as delimiters.
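If you think better in code than in boxes and wires, the cut and split operations behave roughly like the Python sketch below. The function names and the tiny sample page are mine, and I'm not claiming the real module treats its delimiters exactly this way (whether it keeps or drops the markers, for instance). It is only meant to show the idea of "cut out one region, then split it into items".

```python
def cut(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    begin = text.find(start)
    if begin == -1:
        return ""  # start marker not found: nothing to cut
    begin += len(start)
    stop = text.find(end, begin)
    return text[begin:] if stop == -1 else text[begin:stop]


def split(text, delimiter):
    """Split the cut text into items, dropping pieces that are only whitespace."""
    return [piece for piece in text.split(delimiter) if piece.strip()]


# A made-up page standing in for the forum board.
page = (
    "<html><body><h1>Board</h1>"
    '<table class="bordercolor">'
    "<tr><th>Subject</th></tr>"
    "<tr><td>Topic A</td></tr>"
    "<tr><td>Topic B</td></tr>"
    "</table><p>footer</p></body></html>"
)

table = cut(page, '<table class="bordercolor">', "</table>")
rows = split(table, "</tr>")
for row in rows:
    print(row)
```

Note that the start marker has to match the page source character for character, which is exactly the trap described next.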

To find the candidate strings for cutting and splitting, use the trusty Firebug inspector to point at the main table on the Server Information board and find the relevant part of the DOM.

step04

It looks like

<table width="100%" cellspacing="1" cellpadding="4" border="0" class="bordercolor">

would be a good place to cut from, and we can use </table> to stop cutting. But this doesn't seem to work.

step05

Viewing the source of the page in Firefox and searching for "bordercolor", we can find the same part of the document as it is actually written in the HTML. Aha. The attributes are shown in a different order by Firebug because it's using the abstracted DOM rather than the literal page source. We can confirm the correct string to use by finding it in the Pipe Debugger too.

step06

Using this string for the cut and </tr> for the split, we get 20 items in the output shown in the pipe debugger. One item for each row of the table as required.

step07

But we've also picked up the header row of the table as item 0. We don't want that, so let's remove it. Pipes has a couple of Operator modules that look promising: Truncate and Tail. Truncate will output the first 'n' items. Tail will output the last 'n' items. Given their equivalent utilities in the UNIX world, Head might be a better name for the Truncate module, but never mind.

What we need to do is output the last 'n-1' items, where 'n' is the total number of items in the feed. Rather than hard code this number into the pipe (it's 19, right?), we can use the Count operator and the Simple Math module to work it out on the fly for improved robustness. A bit of cleverness with the Split module is needed, but it makes for some pleasingly bendy connectors in our pipe.
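In Python terms, the Count, Simple Math and Tail combination boils down to something like the sketch below. The tail function and the sample rows are my own stand-ins, not anything taken from Pipes itself.

```python
def tail(items, num):
    """Emit the last `num` items, which is how the Tail operator behaves here."""
    # Guard against num == 0: items[-0:] would return the whole list.
    return items[-num:] if num > 0 else []


rows = ["header row", "topic 1", "topic 2", "topic 3"]

count = len(rows)         # Count operator: total number of items
keep = count - 1          # Simple Math: count minus one
posts = tail(rows, keep)  # Tail: everything except the first item

print(posts)
```

Working out the number from the count means the pipe keeps working if the forum ever changes how many topics it shows per page.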

step08

Note that the input label for the Tail module is incorrect and misleading. It should say something like "Emit last [num] items" and not "Emit after position [num]". This bug has been reported to the Pipes team so should be fixed soon.

Extracting item attributes from each table row

Before we start extracting bits from the item.content, we can make things simpler by stripping out carriage returns and linefeeds from the HTML and removing any <img> tags. This can be done with a couple of simple regexes using the (surprise) Regex module. Simply replace [\r\n] with nothing and <img src=[^>]+> with nothing, not forgetting to set the appropriate regex flags by ticking the boxes at the end of the rules.

step09

Although this step isn't strictly required, I find that doing this makes subsequent regexes and debugging simpler, which is what Yahoo! Pipes is all about.
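For comparison, here are the same two replacements written with Python's re module. The patterns are the ones given above. The sample row is invented. Python's re.sub replaces every match by default, which is my reading of what the "global" tick box does in the Regex module.

```python
import re


def clean_row(html):
    """Strip line breaks and <img> tags from one table row."""
    html = re.sub(r"[\r\n]", "", html)           # remove CR and LF characters
    html = re.sub(r"<img src=[^>]+>", "", html)  # remove <img src=...> tags
    return html


# A made-up row for illustration.
row = '<td>\r\n<img src="icon.gif" alt="" /><a href="/topic/1">Hello</a>\n</td>'
print(clean_row(row))
```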

Now we have nice clean HTML for each item representing a post to the board. We need to extract

  • The title of the post
  • The URL of the post
  • The author
  • The publishing date

These values should be in the right format and stored as appropriately named attributes of the item so that when the pipe is run and rendered as RSS, we get the correct, valid elements in the XML. The appropriate names for the attributes are (respectively)

  • item.title
  • item.link
  • item.dc:creator (Not item.author as this must be an email address according to the RSS spec)
  • item.y:published (This is a pipes-ism. The item.y:* are special attributes)

To add all these new attributes, we can use the Rename module to create them as copies of the existing item.content attribute.

step10

Note how the debugger now shows the item.title value instead of the item index. It's a bit unwieldy at the moment, but we're about to sort that out because the next task is to apply appropriate regexes to the HTML that currently forms each attribute, leaving only the appropriate text.

step11

Now we have something that's starting to resemble a news feed! I won't bore you with the details of all the regexes used here (you can view the source of the published pipe to see them if you must). The interesting stuff happens with the item.y:published attribute. The extra steps are necessary because the web forum software refers to posts published today in a different format from all the other published dates.

step12

We need to substitute the literal "today" string with today's date in the same format as the other items. How do we get today's date? By using Date Builder to get the date "now", running that through the Date Formatter and hooking the resulting string up to the replacement part of the regex. Easy.
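Here's a rough Python equivalent of the "today" substitution. The date format and the sample strings are assumptions on my part, since the exact text the forum uses only appears in the screenshots. Swap in whatever format the other rows on the board actually use.

```python
import re
from datetime import datetime

# Assumed format for the forum's normal dates, e.g. "March 05, 2008".
DATE_FORMAT = "%B %d, %Y"


def replace_today(published, now):
    """Swap the literal word "today" for the given date, formatted like other rows."""
    today = now.strftime(DATE_FORMAT)
    # A function is used as the replacement so the date is inserted literally.
    return re.sub(r"today", lambda _match: today, published, flags=re.IGNORECASE)


# A fixed "now" keeps the example repeatable; the pipe uses the real current date.
now = datetime(2008, 3, 5, 12, 0, 0)
print(replace_today("Today at 10:15:42 AM", now))
print(replace_today("March 01, 2008, 09:30:00 AM", now))
```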

step13

The rest of the item.y:published manipulations are to prepare the string value for the next step, turning item.y:published into the datetime value that Pipes wants instead of a string. This is easily achieved with the Date Builder module again, this time inside a Loop module which applies the nested module to every item in the input feed.
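The Loop plus Date Builder step is roughly what this Python sketch does: walk over every item and turn the cleaned-up string into a real datetime. The format string is an assumption about what the string looks like once the earlier regexes have finished with it, and the items are made up.

```python
from datetime import datetime

# Assumed shape of the string after the earlier clean-up steps.
PUBLISHED_FORMAT = "%B %d, %Y, %I:%M:%S %p"


def build_dates(items):
    """Return copies of the items with "y:published" parsed into a datetime."""
    output = []
    for item in items:  # the Loop module: once per item
        parsed = dict(item)
        parsed["y:published"] = datetime.strptime(  # the nested Date Builder
            item["y:published"], PUBLISHED_FORMAT
        )
        output.append(parsed)
    return output


items = [
    {"title": "Upgrade", "y:published": "March 05, 2008, 10:15:42 AM"},
    {"title": "Patch", "y:published": "March 01, 2008, 09:30:00 PM"},
]
for item in build_dates(items):
    print(item["title"], item["y:published"].isoformat())
```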

step14

Getting the content of each post using a subPipe

So, all that's missing from our items now is the most important bit, the actual content of the post. To retrieve this we'll have to use the Fetch Page module again to get the HTML for each item.link URL. Then we'll need to use some by-now-familiar techniques to extract the content of the first post from the topic. This is done using a subPipe.

At the moment, adding a new pipe from the pipe editor itself doesn't quite work in the way we'd like, so for now, save the work so far with an appropriate name and go back to "My Pipes" to create a new one for the subPipe.

The subPipe uses the same techniques as earlier, with the URL Input module hooked up to the URL field of the Fetch Page module. The "cut" parameters for the Fetch Page module are very simple this time, and as we're only interested in the first post to the topic, there is no need for a "split" operation.

step15

Save this new pipe with an appropriate name, and now go back to where we left off with the original pipe.

Finishing off

To run every item in the pipe through our new subPipe, simply add another Loop module and drop the subPipe inside it. The new subPipe should be under "My pipes" in the Library pane of the pipe editor. Note how the URL input for the subPipe is prompted for using the text we supplied. Hook it up to item.link to pass in the URL we extracted earlier, and then assign the first result to a new attribute, item.description.

step16

Keen observers may note that the subPipe we created for use in this Loop is so simple that we didn't actually have to create it as a separate pipe. We could simply have placed the Fetch Page module into the Loop and hooked up its URL field to the item.link directly. This is true, but it doesn't give us as much flexibility. If we ever wanted to do more work when extracting the first post from a topic, such as adding a "Created with Yahoo! Pipes!!" footer to the item content, then we could do this simply by editing the subPipe. If we didn't use a subPipe we'd have to add another Loop module to the main pipe for each additional step because each instance of the Loop module can only contain a single module.
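The Loop and subPipe arrangement maps quite neatly onto a function called once per item. In this Python sketch the function names, the cut markers and the canned page are all invented for illustration. The fetch step is passed in as a parameter so the example runs without touching the network.

```python
def cut(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = text.find(end, begin)
    return text[begin:] if stop == -1 else text[begin:stop]


def first_post(topic_url, fetch):
    """Stand-in for the subPipe: fetch a topic page and cut out the first post."""
    html = fetch(topic_url)
    content = cut(html, '<div class="post">', "</div>")  # assumed markers
    return [{"content": content}] if content else []


def loop(items, fetch):
    """Stand-in for the Loop module: run the subPipe once per item."""
    output = []
    for item in items:
        results = first_post(item["link"], fetch)
        enriched = dict(item)
        # "Assign the first result" to a new attribute, item.description.
        enriched["description"] = results[0] if results else None
        output.append(enriched)
    return output


# Canned pages instead of real HTTP requests.
pages = {
    "http://example.com/topic/1":
        '<html><div class="post">Server upgrade tonight.</div></html>',
}


def fake_fetch(url):
    return pages.get(url, "")


items = [{"title": "Upgrade", "link": "http://example.com/topic/1"}]
result = loop(items, fake_fetch)
print(result[0]["description"])
```

Any extra processing of the post can go inside first_post without loop changing at all, which is the same flexibility argument as for the subPipe.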

So, we've got everything we need in our feed, but we're not quite done yet. The content of the post is stored in item.description.content but we need it in item.description. Also, it would be good to add a GUID for each post item, and for this we can just use the value of item.link (which uniquely identifies each topic) for the item.y:id.value attribute (another Pipes-ism which is used to produce GUIDs in the RSS when the pipe is run). We can use the Rename module again to accomplish both of these tasks.
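As a sketch of what those two Rename rules achieve, here is the same reshuffle done on a plain Python dictionary. The nested dictionaries are my approximation of how Pipes stores dotted attributes such as item.description.content, and the sample item is made up.

```python
def finish_item(item):
    """Flatten the description and add a GUID based on the topic link."""
    finished = dict(item)
    # item.description.content -> item.description
    finished["description"] = item["description"]["content"]
    # item.link copied to item.y:id.value, used for the GUID in the RSS output
    finished["y:id"] = {"value": item["link"]}
    return finished


item = {
    "title": "Upgrade",
    "link": "http://example.com/topic/1",
    "description": {"content": "Server upgrade tonight."},
}
print(finish_item(item))
```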

All that then remains is to hook up the end of the pipeline to the Pipe Output and we're done.

step17

One drag and drop web scraper completed.

In use

You can view and run the published pipe at http://pipes.yahoo.com/daybarr/lunarforums. Here's the output, served up as RSS and as viewed by Google Reader.

result-xml

result-reader

If you're a LunarPages customer, or just want to keep up to date with some of their forums, then I've pre-configured a FeedBurner feed of the pipe output for each of the major boards. You can get a list or an OPML file from this forum post.

More Pipes stuff to come

This post is the first (and probably the longest!) of a mini-series of posts on Yahoo! Pipes that I'm planning. I'm not sure when I'll have a chance to actually post these, so subscribe now if you're interested in catching them.

Credits: The original nutshell image is published on Flickr and licensed under a Creative Commons license by twinsparc. "Pipes in a nutshell" modifications by me.