Searching Weblog Posts
Google is a cool search engine, but has a pretty glaring problem when indexing blog posts. Acts of Volition documents this flaw using a picturesque metaphor of Joe, Sam and hang gliding. Essentially the problem is: the page is not necessarily the finest unit of web content. In a blog’s case, the post is. Garrity went on to suggest BlogML, which would right these wrongs via XML. Others suggested proximity operators for Google.
While proximity operators are nice, they don’t necessarily solve the problem. BlogML does, but it’s overkill. A better solution would use what bloggers already use: HTML and RSS. Here’s my take. It’s a voluntary system, but it’s a win-win for searchers, blog authors and a search engine’s credibility.
Here’s how it’d work: any page that identifies itself as a weblog would include a meta tag like this in all its pages: and . Further, the block within the weblog where the posts begin and end would be bounded by the HTML comments and . Given markup like this, it would be trivial for a search engine to consult the ‘semantically enriched’ version, but only for the content region within the HTML that needs to be parsed in a semantically richer manner. Any public DTD could be used for this; RSS is perfectly adequate here, since it demarcates posts very unambiguously.
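To make the idea concrete, here is a rough sketch of what a crawler might do with a page marked up this way. It is only an illustration, not a spec: the comment strings `<!-- posts-begin -->` and `<!-- posts-end -->` and the function names `split_page` and `posts_from_rss` are placeholders I made up for this example, and the feed is assumed to be plain RSS with un-namespaced `item` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker comments; any agreed-upon pair of strings would do.
BEGIN = "<!-- posts-begin -->"
END = "<!-- posts-end -->"


def split_page(html):
    """Split a page into (outside, inside) relative to the marked post region.

    If either marker is missing, the whole page is returned as 'outside'
    and the region is empty, so unmarked pages are indexed as usual.
    """
    start = html.find(BEGIN)
    if start == -1:
        return html, ""
    end = html.find(END, start + len(BEGIN))
    if end == -1:
        return html, ""
    inside = html[start + len(BEGIN):end]
    outside = html[:start] + html[end + len(END):]
    return outside, inside


def posts_from_rss(rss_text):
    """Return one (title, link, description) tuple per RSS <item>."""
    root = ET.fromstring(rss_text)
    return [
        (
            item.findtext("title", ""),
            item.findtext("link", ""),
            item.findtext("description", ""),
        )
        for item in root.iter("item")
    ]
```

A search engine could then index the `outside` part as an ordinary page and, instead of indexing `inside` as one blob, index each tuple from `posts_from_rss` as its own unit. That way a query matches words within a single post rather than words scattered across unrelated posts on the same page.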

