Cook Computing

Archiving Weblog Posts

November 26, 2002 Written by Charles Cook

John Robb points to some work-in-progress at UserLand regarding the archiving of weblog posts in RSS files. The original requirement seems to be backing up weblog posts to a server. However, the solution seems to confuse the description of a post with its content. A lot of people seem to put the whole content of a post into the RSS description element, but it's not necessary or always desirable. For example, the weblog you're reading puts only a brief introductory extract in the description element, and if the reader is interested they can click through in their browser to the full post. One of the advantages of this is that every time I update the RSS file and aggregators detect it has changed, they download only a relatively small RSS file, rather than downloading the full content of every recent entry again each time the file changes.
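To make the "brief introductory extract" idea concrete, here is a minimal sketch in Python of how a description might be derived from a post's HTML. It is only an illustration: the function name `make_extract`, the 200-character limit and the word-boundary truncation are assumptions, not a description of how this weblog actually generates its RSS file.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def make_extract(post_html, max_chars=200):
    """Return a short plain-text extract of a post (hypothetical helper).

    The markup is stripped, whitespace is collapsed, and the text is cut
    at the last word boundary that fits within max_chars.
    """
    collector = _TextCollector()
    collector.feed(post_html)
    collector.close()
    text = " ".join("".join(collector.parts).split())
    if len(text) <= max_chars:
        return text
    # Drop the trailing partial word, then mark the cut.
    return text[:max_chars].rsplit(" ", 1)[0] + "..."
```

The result would go into the item's description element, with the link element pointing at the full post, so the RSS file stays small however long the posts are.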

Archiving of weblog posts is important. I often read a post which I would like to archive locally and retrieve later, either by looking at the archived posts of a particular weblog or by applying a search to the whole archive. Conversely, when working on something I often think of a post that I'd like to refer to but can't remember where I read it. If I had access to the content of posts I could write an application to do this, but at present that is not possible. Unless the RSS description element contains the full content of a post, and we can't rely on that, the only access to the content is via the web page pointed to by the RSS entry. That puts you in a screen-scraping scenario, which I don't want to bother with.
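As a rough illustration of the kind of application described above, the following Python sketch archives the items of an RSS file in memory and searches them, either across the whole archive or within one weblog. It assumes the feed somehow makes the full content of each post available in the description element, which is exactly what cannot be relied on today. The sample feed, the example.com URLs and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; assumes <description> carries the full post content.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.com/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>Archiving Weblog Posts</title>
      <link>http://example.com/archiving.html</link>
      <description>Full text about archiving posts in RSS files.</description>
    </item>
    <item>
      <title>Browsers</title>
      <link>http://example.com/browsers.html</link>
      <description>Full text about embedded browsers.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Extract the posts from an RSS document as a list of dicts."""
    channel = ET.fromstring(rss_text).find("channel")
    weblog = channel.findtext("title", default="")
    return [
        {
            "weblog": weblog,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]


def archive_posts(archive, posts):
    """Add posts to the archive, keyed by link so repeats overwrite."""
    for post in posts:
        archive[post["link"]] = post
    return archive


def search(archive, term, weblog=None):
    """Case-insensitive substring search, optionally limited to one weblog."""
    term = term.lower()
    return [
        post
        for post in archive.values()
        if (weblog is None or post["weblog"] == weblog)
        and (term in post["title"].lower() or term in post["content"].lower())
    ]


archive = archive_posts({}, parse_items(SAMPLE_RSS))
for post in search(archive, "embedded"):
    print(post["title"], post["link"])
```

A real archive would persist to disk and use a proper index, but even this much is only possible when the content itself is exposed to clients.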

Google does provide some of this functionality, but you get a lot of noise in a Google search and it is not as snappy as searching a local archive which is guaranteed to contain items of interest. So let's start making content available so that clients can manipulate it in any way they want. This way we can start to develop much richer aggregators.

Speaking of new clients, Dave Winer also makes a comment that with new software to view weblogs "We will have routed around the Microsoft browser monopoly." Two points here. First, I wasn't aware that Microsoft has a browser monopoly. I frequently use Mozilla, even on Microsoft sites, and when I look around the office at work I see people using a variety of browsers. Second, I may be mistaken, but the screenshot of Brent Simmons' BlogBrowser appears to use an embedded browser, which is an unsurprising requirement given the amount of HTML markup that appears in some RSS files. This means a Windows version of the new generation of BlogBrowsers would have to either implement its own HTML renderer or embed a browser. The first case is unlikely, and in the second case it would be either IE or Mozilla. If IE, then we haven't routed around Microsoft very well; if Mozilla, then there can't really be a browser monopoly.