I'm currently using Lunarpages to host daybarr.com and am very pleased with their service and support. To help keep on top of the latest news, I thought I'd try and track down their RSS feed. Alas, all I could find was a post in one of their forums from someone else (hi Malcolm) looking for the same thing.
The response from Lunarpages was that they don't intend to introduce feeds for their news. Strange, but fair enough. That was all the excuse I needed to start fiddling about with some PHP to try and "scratch the itch". The basic idea was to write a page that, when requested, causes the following to happen on the server:
- Download the first page of the forum that lists all the top level posts.
- For each post on that page, download the linked page that includes the content of the post and any follow ups.
- Extract the title and content of each post, the name of the author and the date of posting.
- Use all that information to create the appropriate XML for the feed and serve it up to my news reader.
I'm obviously not going to manage all that from scratch in one evening, so a quick google later and I had step 1 covered by class_http.php from Troy Wolf. This handy screen scraping utility makes it simple to download and cache any web page. The caching is important both for speed and for "being a good neighbour". I don't want every hit on my feed to result in a hit on lunarforums. More importantly, Lunarpages don't want that, and might get a bit cross with me. I set the cache TTL to 10 minutes. So even if I request the feed every 30 seconds from 6 different PCs, there will still only be one request to the lunarforums server from my server every 10 minutes. I'm sure it can cope with that.
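To make the caching idea concrete, here is a minimal sketch of a time-to-live cache on disk. It is not Troy's class_http API: `fetch_cached`, its parameters and the `.cache` file naming are all made up for illustration, and the download step is passed in as a callable so the caching logic stays separate from the HTTP code.

```php
<?php
// Hypothetical helper, not part of class_http.php: return the cached copy of
// $url if it is younger than $ttlSeconds, otherwise call $download and store
// the result on disk.
function fetch_cached(string $url, int $ttlSeconds, string $cacheDir, callable $download): string
{
    // One cache file per URL; md5 just gives a filesystem-safe name.
    $cacheFile = $cacheDir . '/' . md5($url) . '.cache';

    // Serve from disk while the cached copy is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }

    // Cache miss or stale copy: hit the remote server once and remember the result.
    $body = $download($url);
    file_put_contents($cacheFile, $body, LOCK_EX);
    return $body;
}
```

With a TTL of 600 seconds, something like `fetch_cached($url, 600, sys_get_temp_dir(), 'file_get_contents')` would reach the remote server at most once every 10 minutes however often the feed is requested (using `file_get_contents` on a URL assumes `allow_url_fopen` is enabled).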
Step 2 requires some regex work to extract the link for each post from the nested tables in the lunarforums HTML. Then I use Troy's class_http again to get the page for each posting. This time I set the cache TTL to be 365 days. This means that the page for the posting is requested only once. Ever. It's not going to change, so why download it again?
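The link extraction might look roughly like the sketch below. The markup is an assumption: I'm pretending each post link carries a `topic=<id>` query parameter, and `extract_topic_links` is a hypothetical name. The real lunarforums HTML may well need a different pattern.

```php
<?php
// Hypothetical markup: assumes each post link has a "topic=<digits>" query
// parameter. Adjust the pattern to whatever the real forum HTML uses.
function extract_topic_links(string $html): array
{
    // Capture the href of every anchor whose URL contains topic=<digits>.
    preg_match_all('/<a\s[^>]*href="([^"]*[?&;]topic=\d+[^"]*)"/i', $html, $matches);

    // Decode entities such as &amp; and drop duplicates, keeping page order.
    $links = array_map('html_entity_decode', $matches[1]);
    return array_values(array_unique($links));
}
```

A regex like this is brittle by nature. It ignores the table nesting and just looks at anchors, which is usually enough for a scraper aimed at one known page layout.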
Once I've got the page, some more regex work gets the title, content, author and date from the top level post. Part of this work involved removing any embedded images from the HTML for the post content. I replace them with the text from the alt attribute instead. Again, this means fewer hits on the lunarforums server when RSS readers try to render the images. Just being neighbourly. That's step 3 done.
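The image replacement is the easiest part to show. This is a sketch rather than the exact code from my script: `images_to_alt_text` is an illustrative name, and it assumes the alt value is quoted and that no attribute contains a literal `>`.

```php
<?php
// Replace every <img> tag with the text of its alt attribute.
function images_to_alt_text(string $html): string
{
    return preg_replace_callback(
        '/<img\b[^>]*>/i',
        function (array $m): string {
            // Use the quoted alt text if present; otherwise drop the image.
            if (preg_match('/\balt\s*=\s*(["\'])(.*?)\1/i', $m[0], $alt)) {
                return $alt[2];
            }
            return '';
        },
        $html
    );
}
```

A smiley such as `<img src="smile.gif" alt=":)" />` simply becomes `:)` in the feed, and images with no alt text disappear.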
For step 4, google brought me to FeedCreator by Kai Blankenhorn. Kai's script makes it simple to take all the bits I've extracted from the various HTML pages and turn them into an RSS (or Atom or OPML) feed. Thanks Kai.
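For anyone curious what the feed generation boils down to, here is a hand-rolled RSS 2.0 sketch. It is deliberately not FeedCreator's API. `build_rss` and the array keys (`title`, `link`, `content`, `author`, `timestamp`) are names I've picked for illustration, and the author goes in a Dublin Core `dc:creator` element because the RSS `author` element is meant to hold an email address.

```php
<?php
// Hand-rolled RSS 2.0 writer for illustration; this is not FeedCreator's API.
function build_rss(string $title, string $link, string $description, array $posts): string
{
    // Escape text so it is safe inside XML elements.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">' . "\n";
    $xml .= "<channel>\n";
    $xml .= '<title>' . $esc($title) . "</title>\n";
    $xml .= '<link>' . $esc($link) . "</link>\n";
    $xml .= '<description>' . $esc($description) . "</description>\n";

    foreach ($posts as $post) {
        $xml .= "<item>\n";
        $xml .= '<title>' . $esc($post['title']) . "</title>\n";
        $xml .= '<link>' . $esc($post['link']) . "</link>\n";
        // Post HTML is carried as escaped markup inside <description>.
        $xml .= '<description>' . $esc($post['content']) . "</description>\n";
        $xml .= '<dc:creator>' . $esc($post['author']) . "</dc:creator>\n";
        // DATE_RSS is PHP's built-in RFC 822 style format; gmdate keeps it in UTC.
        $xml .= '<pubDate>' . gmdate(DATE_RSS, $post['timestamp']) . "</pubDate>\n";
        $xml .= "</item>\n";
    }

    return $xml . "</channel>\n</rss>\n";
}
```

Serving it is then a matter of sending a `Content-Type: application/rss+xml; charset=UTF-8` header and echoing the string. A library like Kai's saves you from all this and gives you the other formats as well.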
And with that, it was working. Once I'd finished, I made a quick addition to allow my script to serve a feed for each forum on lunarforums. Notice how the first page for each forum has a link of the form
where the board parameter changes for each forum (23.0 is for the "Hosting plans and site information" forum). I simply use the same parameter in my script to get the feed for the required forum like this
You can see the end result here. Sorry, but only a screenshot of the feeds is available at the moment - Lunarpages haven't agreed to let me publish the URL for my feed generator. They haven't disagreed either - it's been quite quiet. Regardless, I've scratched my itch and I'm happy now that I have the feeds for myself. Hopefully, if you have that itch too, this article will at least point you in the right direction to scratch it yourself.