04 March 2006

GoogleParse metadata

i predict
that sooner rather than later
google will have to create
(and will choose to publish)
a metadocument
for every webpage it indexes

that expands on the traditional semantics
of META headers
identifying author, date, topic, etc

by condensing
the most useful results
of a thorough ai-parsing
of the page

breaking out its subsections:
subtopics
author info
host info
site-navigation
site news
recommended followup readings
etc etc etc

providing guideposts
that will allow the page
to be re-parsed
as efficiently as possible

(eg if a small edit
changes the position
of previously parsed data)

perhaps gradually lengthening
as new categories of semantics
are added

and encouraging web publishers
to post their own version
of these metadocuments
that google can spider
to doublecheck its own parser
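
(a minimal sketch of the guidepost idea,
not anything google has published:
each parsed section keeps a label,
its last known offset,
and a hash of its text,
so a small edit elsewhere on the page
only costs a re-scan for that hash.
the names Guidepost, mark and relocate
are all my own assumptions)

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def sha1_of(text: str) -> str:
    """Hex SHA-1 of a piece of page text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


@dataclass
class Guidepost:
    """Hypothetical metadocument entry for one parsed section."""
    label: str    # e.g. "author info", "site-navigation"
    start: int    # character offset where the section was last found
    length: int   # length of the section text in characters
    digest: str   # hash of the section text at the last parse


def mark(page: str, label: str, text: str) -> Guidepost:
    """Record where `text` currently sits in `page`."""
    return Guidepost(label, page.index(text), len(text), sha1_of(text))


def relocate(page: str, post: Guidepost) -> Optional[int]:
    """Find the section again in a possibly edited page.

    Returns the new offset, or None if the section text itself changed
    (in which case that section needs a full re-parse).
    """
    # cheap case: nothing before the section moved
    end = post.start + post.length
    if sha1_of(page[post.start:end]) == post.digest:
        return post.start
    # otherwise slide a window over the page looking for the same hash
    for i in range(len(page) - post.length + 1):
        if sha1_of(page[i:i + post.length]) == post.digest:
            return i
    return None
```

the sliding scan is deliberately naive;
a real indexer would presumably use
a rolling hash or a diff
but the shape of the metadocument entry
would be much the same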



with site metadata
as well
that summarizes
sitewide
markup-style:

which tags-n-attributes are most used
how section-divisions are marked

maybe even a vocabulary of regexps
optimised for that site's style
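
(a rough sketch of the tag-n-attribute tally,
using only python's standard html.parser;
the names MarkupTally and site_style
are assumptions of mine
and the output shape is just one guess
at what a site metadocument might hold)

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupTally(HTMLParser):
    """Counts start tags and (tag, attribute) pairs on one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[(tag, name)] += 1


def site_style(pages):
    """Merge per-page tallies into one sitewide summary."""
    tags, attrs = Counter(), Counter()
    for html in pages:
        tally = MarkupTally()   # fresh parser per page
        tally.feed(html)
        tally.close()
        tags.update(tally.tags)
        attrs.update(tally.attrs)
    return {"tags": tags, "attrs": attrs}
```

calling site_style(pages)["tags"].most_common(10)
would then list the ten most-used tags
across whatever pages were sampled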




the xml/sgml people used to argue that
this metadata should be
embedded
in the document

under the illusion that
each semantic category
could indulge in its own stylesheet style

but i don't think the
page-creation tools
even support this yet




where it's probably already begun
is with Google News
and Google Blog Search

which need to know
how different news sites
and blog hosts
mark up their dates
and titles
and links
etc
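
(a hedged illustration of what
a per-site regexp vocabulary could look like:
the hostnames and the markup they match
are invented for the example,
not taken from any real news site or blog host,
and SITE_PATTERNS and extract
are hypothetical names)

```python
import re

# hypothetical per-host patterns: each host marks up
# its titles and dates in its own way
SITE_PATTERNS = {
    "news.example": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "date": re.compile(r'<span class="dateline">(.*?)</span>', re.S),
    },
    "blog.example": {
        "title": re.compile(r'<h3 class="post-title">(.*?)</h3>', re.S),
        "date": re.compile(r'<h2 class="date-header">(.*?)</h2>', re.S),
    },
}


def extract(host, html):
    """Pull out whichever fields the host's patterns can find."""
    found = {}
    for field, pattern in SITE_PATTERNS.get(host, {}).items():
        match = pattern.search(html)
        if match:
            found[field] = match.group(1).strip()
    return found
```

an unknown host simply yields an empty dict,
which is the cue to fall back
on a slower general-purpose parse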