05 September 2006

Recap of Yahoo zodiac (plus novelists' styles)

we count the frequency
of the million most-frequent
words on the Web

based on these frequencies
we calculate the expected average distance
between every pair of words

and compare it to the
observed average distance

pairs closer than expected
get connected by 'short' elastics

pairs farther than expected by anti-elastics

the whole takes a semi-unique 3D shape

we find the frequency of each word
within each top-level Yahoo
topic category (14 total)

and color it one of 14 colors
to reflect which topic
uses that word most
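
A minimal sketch of the coloring step, assuming "uses that word most" means the highest relative frequency within a topic (raw counts would favor whichever topic has the most text). The function name, topic names, and counts below are hypothetical, for illustration only.

```python
from collections import Counter

def dominant_topic(word, topic_counts):
    """Return the topic whose relative frequency for `word` is highest.

    topic_counts maps a topic name to a Counter of word counts.
    Returns None if no topic uses the word at all.
    """
    best_topic, best_rate = None, 0.0
    for topic, counts in topic_counts.items():
        total = sum(counts.values())
        rate = counts[word] / total if total else 0.0
        if rate > best_rate:
            best_topic, best_rate = topic, rate
    return best_topic

# Hypothetical counts for two of the 14 topic categories.
topic_counts = {
    "Science": Counter({"star": 30, "orbit": 20, "the": 950}),
    "Entertainment": Counter({"star": 80, "film": 20, "the": 1900}),
}

star_color = dominant_topic("star", topic_counts)
orbit_color = dominant_topic("orbit", topic_counts)
```

Here "star" has a rate of 0.03 in Science and 0.04 in Entertainment, so it takes the Entertainment color, while "orbit" takes the Science color.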

we rotate the 3D shape
until the 'stars'
of each of the 14 colors
are optimally clustered

and we flatten the shape
along that axis

the radial order of the words
we call 'yahoobetical order'

we temporarily maintain
yahoobetical order
but reposition each word
with nearer orbits for common words
distant orbits for uncommon ones
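
One way the repositioning might look, assuming each word keeps its angle (its yahoobetical position) and gets a radius that grows with the log of its rarity. The log scale and the function names are assumptions, not something the scheme above specifies.

```python
import math

def orbit_radius(frequency, max_frequency):
    """Common words get small radii, rare words large ones (log scale)."""
    return 1.0 + math.log(max_frequency / frequency)

def orbit_position(angle, frequency, max_frequency):
    """Keep the word's angle, move it out to its frequency-based orbit."""
    radius = orbit_radius(frequency, max_frequency)
    return (radius * math.cos(angle), radius * math.sin(angle))

# The most common word sits on the innermost orbit (radius 1.0).
inner = orbit_radius(1000, 1000)
outer = orbit_radius(10, 1000)
```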

we now consider word-pairs
on the web
and give them stars
with orbits based on frequency
and positions halfway
between their components
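
A sketch of "halfway between", assuming a position is an angle in yahoobetical order. Averaging two angles directly fails across the wrap-around point, so this uses the circular mean. The function name is hypothetical, and the midpoint is undefined when the two words sit exactly opposite each other.

```python
import math

def angular_midpoint(a, b):
    """Angle halfway between angles a and b (radians), along the shorter arc."""
    return math.atan2(math.sin(a) + math.sin(b),
                      math.cos(a) + math.cos(b))

# Two words at 350 and 10 degrees: the pair lands near 0, not near 180.
mid = angular_midpoint(math.radians(350), math.radians(10))
```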

we add triplets, etc
up to phrases, sentences
paragraphs, pages, chapters, books

using hypothetical elastic links
to their components
to determine their yahoobetical position

we loosen the original
yahoobetical ordering
of the individual words
and let the elastics
re-sort things
(expecting a not-too-dramatic
change)

this is recalibrated
yahoobetical order

we consider the full oeuvre
of any author
and count the word-frequencies

repositioning words' orbits to match

and we compare those
individual-author orbits
to the whole-web orbits

and re-map based on differences

so the words leCarre (say) uses
more than average
get distinctive orbits
and ditto, those he uses less

(we may choose to filter out
words and phrases used mainly
in only one or two titles
e.g. characters' names)

these words' orbits are his
stylistic fingerprint
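
A rough sketch of the author-versus-web comparison, assuming the "difference" is the log of the ratio between the author's relative frequency and the web's. Positive scores mark words the author overuses, negative scores words he underuses. The filter here simply drops words that appear in fewer than `min_titles` titles, a cruder test than "used mainly in one or two titles". All names and counts are hypothetical.

```python
import math
from collections import Counter

def fingerprint(title_counts, web_counts, min_titles=3):
    """Score each word by log(author rate / web rate).

    title_counts maps a title to a Counter of word counts for that title.
    Words found in fewer than min_titles titles are skipped, as are
    words with no web count to compare against.
    """
    author = Counter()
    titles_using = Counter()
    for counts in title_counts.values():
        author.update(counts)
        titles_using.update(counts.keys())

    author_total = sum(author.values())
    web_total = sum(web_counts.values())

    scores = {}
    for word, n in author.items():
        web_n = web_counts.get(word, 0)
        if titles_using[word] < min_titles or web_n == 0:
            continue
        scores[word] = math.log((n / author_total) / (web_n / web_total))
    return scores

# Hypothetical oeuvre of three titles and a tiny stand-in for the web.
titles = {
    "A": Counter({"the": 50, "mole": 5, "hero_name": 20}),
    "B": Counter({"the": 50, "mole": 5}),
    "C": Counter({"the": 50, "mole": 5}),
}
web = Counter({"the": 900, "mole": 1, "hero_name": 1, "other": 98})

scores = fingerprint(titles, web)
```

In this toy data "mole" scores well above zero, "the" slightly below, and "hero_name" is filtered out because it occurs in only one title.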