that sooner rather than later
google will have to create
(and will choose to publish)
a metadocument
for every webpage it indexes
that expands on the traditional semantics
of META headers
identifying author, date, topic, etc
by condensing
the most useful results
of a thorough ai-parsing
of the page
breaking out its subsections:
subtopics
author info
host info
site-navigation
site news
recommended follow-up readings
etc etc etc
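
(for concreteness, a python sketch
of what one such metadocument might hold;
every field name here is my own invention,
not any real google format:)

    # hypothetical metadocument for one indexed page;
    # fields and values invented for illustration
    page_metadoc = {
        "url": "http://example.com/essay.html",
        "author": "jane doe",
        "date": "2005-09-28",
        "topics": ["metadata", "search"],
        "sections": [
            {"type": "subtopic",         "heading": "why metadata", "offset": 1024},
            {"type": "author-info",      "offset": 5120},
            {"type": "host-info",        "offset": 5400},
            {"type": "site-navigation",  "offset": 120},
            {"type": "site-news",        "offset": 5800},
            {"type": "followup-reading", "offset": 6100},
        ],
    }
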
providing guideposts
that will allow the page
to be re-parsed
as efficiently as possible
(eg if a small edit
changes the position
of previously parsed data)
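
(one way such guideposts could work,
assuming the metadocument keeps
each section's text from the last crawl:
sections found verbatim have merely shifted,
and only the rest need the expensive ai-parse again:)

    import hashlib

    def reparse_cheaply(new_html, old_sections):
        # old_sections: section texts saved at the last crawl.
        # a verbatim match means the section only moved; record its
        # new offset and skip it. anything missing gets re-parsed.
        shifted, dirty = {}, []
        for text in old_sections:
            fp = hashlib.md5(text.encode("utf-8")).hexdigest()
            pos = new_html.find(text)
            if pos >= 0:
                shifted[fp] = pos     # fingerprint -> new offset
            else:
                dirty.append(fp)      # needs a real re-parse
        return shifted, dirty
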
perhaps gradually lengthening
as new categories of semantics
are added
and encouraging web publishers
to post their own version
of these metadocuments
that google can spider
to double-check its own parser
along with site metadata
that summarizes
the sitewide markup style:
which tags-n-attributes are most used
how section-divisions are marked
maybe even a vocabulary of regexps
optimised for that site's style
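
(sketching both halves in python,
the markup summary and the double-check,
with invented markup and regexps throughout;
none of this is an actual published format:)

    import re
    from collections import Counter

    def markup_profile(html):
        # tally which tags and attributes a page actually uses;
        # crude regexps, but enough for a sitewide habit summary
        tags = Counter(re.findall(r"<([a-zA-Z][\w-]*)", html))
        attrs = Counter(re.findall(r'\s([a-zA-Z][\w-]*)=["\']', html))
        return {"tags": tags.most_common(10),
                "attrs": attrs.most_common(10)}

    # a vocabulary of regexps tuned to one (invented) site's style
    site_regexps = {
        "date":  re.compile(r'<span class="dateline">([^<]+)</span>'),
        "title": re.compile(r'<h2 class="entry">([^<]+)</h2>'),
    }

    def doublecheck(our_parse, publisher_metadoc):
        # compare google's parse against the publisher's posted version;
        # any disagreement flags a parser bug or a gaming attempt
        keys = set(our_parse) | set(publisher_metadoc)
        return {k: (our_parse.get(k), publisher_metadoc.get(k))
                for k in keys
                if our_parse.get(k) != publisher_metadoc.get(k)}
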
the xml/sgml people used to argue that
this metadata should be
embedded
in the document
under the illusion that
each semantic category
could indulge in its own stylesheet styling
but i don't even think the
page-creation tools
support this yet
where it's probably already begun
is with Google News
and Google Blog Search
which need to know
how different news sites
and blog hosts
mark up their dates
and titles
and links
etc
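
(the hosts and patterns below are invented,
but per-site rule tables of roughly this shape
are presumably what those services maintain:)

    import re

    RULES = {
        "news.example.com": {
            "date":  re.compile(r'<meta name="pubdate" content="([^"]+)"'),
            "title": re.compile(r'<h1 class="headline">([^<]+)</h1>'),
        },
        "blog.example.org": {
            "date":  re.compile(r'<abbr class="published" title="([^"]+)"'),
            "title": re.compile(r'<a class="entry-title"[^>]*>([^<]+)</a>'),
        },
    }

    def extract(host, html):
        # apply one host's rules, returning whatever fields matched
        out = {}
        for field, rx in RULES.get(host, {}).items():
            m = rx.search(html)
            if m:
                out[field] = m.group(1)
        return out
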