22 September 2005

The long tail of Google News

when googlenews debuted in 2002 [wayback]
my two favorite things about it were
1 that its algorithm for picking top stories
took the whole world into account
and not just the usa or the wealthy, and
2 that for each top story
the top article might come from anywhere
in the world or in smalltown america

but people complained, i guess, that unwelcome perspectives
were getting too much prominence
and google tweaked the algorithm
so now the usual suspects: wapo, nyt, etc
dominate again, and i'm bored by their frontpage

but ggn also allows search of the full news database
which i'd use when i knew or suspected a story was out there
and wanted to find more-local coverage, say,
or the same story, but with a picture

but i recently found myself wishing i could access ggn's raw feed
so that every single story would scroll past as it was spidered

and i slapped my forehead to realise that
that capability has always been in ggn-search
if you search for a common word like "the"
and sort the results by date
('the' needs a plus in front: +the
because it's normally a stopword)
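
(a rough sketch of that search as a url, in python
where the base address and the 'sort' parameter are placeholders i made up
not anything google documents
so copy the real ones from your browser's address bar)

```python
from urllib.parse import urlencode

# assumption: this address and the parameter names are placeholders,
# not the real google news ones
SEARCH_BASE = "https://news.example.com/search"

def firehose_url(word="the", sort_param="sort", sort_value="date"):
    """Build a search url for one common word, newest results first."""
    # the leading plus asks for a stopword to be searched literally
    query = {"q": "+" + word, sort_param: sort_value}
    # urlencode escapes the plus sign so it survives the trip as %2B
    return SEARCH_BASE + "?" + urlencode(query)

print(firehose_url())
```

the one detail that matters is the escaping:
a bare plus in a url means a space, so it has to go out as %2B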

so i put an icon for this search
in my toolbar, and hit it when things got boring
and scanned thru tons of me-too dross
occasionally finding a new story or a new source

which drove home how many thousands of sources and stories
are hidden in ggn's 'long tail'

soon after i was playing with ggn's advanced search
and i realised that if i found the right source-keyword
i could create custom feeds for any source or set of sources
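
(a minimal sketch of gluing several source-keywords into one query
assuming the operator is spelled 'source:' and that OR combines them
both of which are my assumptions, and the keywords here are made up)

```python
def source_query(sources, operator="source:"):
    """Join one or more source keywords into a single OR query string."""
    # assumption: 'source:' and the keyword spellings are hypothetical
    terms = [operator + name for name in sources]
    return " OR ".join(terms)

print(source_query(["in_these_times", "new_yorker"]))
```

the resulting string goes in the search box
or in the query part of a feed url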

(but what i really wanted was a feed
that excluded every AP story
and everything from corporate media
but i haven't got that figured out yet)

now a few weeks ago ggn added rss-feeds
a format i'd reluctantly tried last spring
and quickly fallen in love with
because they strip away the webpages' egos
(along with the rest of their personalities)
and because they check semi-automatically for new content

so i started looking for source-keywords
for news sources i liked to check regularly
but that didn't have their own rss-feeds
and i quickly realised that even for sites that offered their own rss-feeds
the ggn feeds were often much better

(newspapers usually don't get
that the rss-feed is supposed to include everything
in reverse chronological order
and they try instead
to make it echo their front page
with only the top stories
and with the 'top' story on top)
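
(what i mean by reverse chronological, as a small python sketch
assuming each item carries an rss-style pubDate string
with made-up sample items)

```python
from email.utils import parsedate_to_datetime

def newest_first(items):
    """Return feed items sorted by pubDate, newest on top."""
    return sorted(
        items,
        key=lambda item: parsedate_to_datetime(item["pubDate"]),
        reverse=True,
    )

# made-up sample items, in 'front page' order rather than date order
items = [
    {"title": "older top story", "pubDate": "Wed, 21 Sep 2005 09:00:00 GMT"},
    {"title": "newer small story", "pubDate": "Thu, 22 Sep 2005 08:30:00 GMT"},
]

for item in newest_first(items):
    print(item["pubDate"], item["title"])
```

nothing gets dropped and nothing gets promoted
the date alone decides the order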

i suspect the only reason ggn doesn't advertise this capability
is that they still want some wiggle-room
because they seem to miss a lot of articles from some sources
and if they claimed they were offering feeds
people would have higher expectations
and complain about the missed bits

the feeds they offer are so 'clean'
with only the rarest bizarre visible glitches
that i think their parsing mechanism must be extremely conservative
and throws out anything it isn't sure it understands

sidebar on screenscraping:

remember that ggn works entirely by 'screen scraping'
which means their software loads, from the news sites
the same html pages that anyone else sees
and uses pattern-matching, probably in python
to suppress all the junk before, beside, and after
the news stories themselves
and to spot new content and identify the headline and the date
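
(to make this concrete, here's a toy scraper in python
which is only my guess at the general shape, not google's code
it assumes the headline sits in an h1 and the date is spelled out in the text
and, to stay conservative, it gives up unless it finds exactly one of each)

```python
import re
from html.parser import HTMLParser

# assumption: dates look like '22 September 2005'; real sites vary widely
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December) \d{4}\b"
)

class HeadlineParser(HTMLParser):
    """Collect the text of h1 elements, and all other text separately."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headlines = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headlines[-1] += data
        else:
            self.text.append(data)

def scrape(page):
    """Return the headline and date, or None if the page is ambiguous."""
    parser = HeadlineParser()
    parser.feed(page)
    parser.close()
    headlines = [h.strip() for h in parser.headlines if h.strip()]
    dates = DATE_PATTERN.findall(" ".join(parser.text))
    # conservative: throw the page out unless there is exactly one of each
    if len(headlines) != 1 or len(dates) != 1:
        return None
    return {"headline": headlines[0], "date": dates[0]}

# made-up sample page with junk before and after the story
page = (
    "<html><body><div>site menu</div>"
    "<h1>Town clock runs backwards</h1>"
    "<p>22 September 2005</p><p>story text</p>"
    "<div>footer links</div></body></html>"
)
print(scrape(page))
```

a page with no h1, or with two dates, comes back as None
which is the 'throw out anything you aren't sure of' behaviour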

which for my money is the right startingpoint for the semantic web

don't make people embalm their pages in xml

start with a science of screen-scraping
that's flexible enough to adjust to the quirky ways
that anyone might code their html

and i hope google releases their screen scraper eventually
because every serious news-hacker ought to run a customised copy

end of sidebar

so now i'm adding more and more ggn versions
of news feeds from different sources
as i discover their magic keywords:
Znet, In These Times, the New Yorker, Weekly World News

but i still hadn't recovered the serendipity factor
until i got the idea of adding feeds for abstract concepts
like: prank, mysterious, utopian, sacred
that scan all news sources
for every smalltown prank
and every thirdworld view of what's utopian or sacred

and this really works

(i'm also trying a google blogsearch feed for prank
but what this shows me is that the intellectual level
of the average blog
really is down the tubes, with the splogs)