It's been a while since my last update; my apologies. Here's a report on my work with Eostric.

As it turns out, Wordnet was a bit of a boon. I've put together a pile of routines that can take arbitrary text pasted in from the web (or print) and break the text up into chunks that a machine can digest.

Take for instance the text:

Edgar Allen Poe was born in Boston in 1809 but grew up in Richmond, Virginia and attended the University of Virginia.

The code will take that and break it up as follows:

Edgar Allen Poe              pos n proper_noun 1
was                          stem be pos v wordid 12254 morph 1
born                         stem bear pos v wordid 12366 morph 1
in                           pos preposition
Boston                       pos n proper_noun 1
1809                         pos n proper_noun 1
but                          pos preposition
grew                         stem grow pos v wordid 60126 morph 1
up                           pos preposition
Richmond , Virginia          pos n proper_noun 1
and                          pos conjunction
attended                     pos {s a} all_pos {s a} wordid 9501
the University of Virginia   pos n proper_noun 1
.                            pos {}

Essentially, the code tries to lump nouns into noun phrases and identify the part of speech for each word or phrase. It's not perfect yet, obviously, but it is already performing a lot of cleanup that I didn't even realize would be needed before I started the implementation.
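To give a feel for the noun-lumping step, here is a minimal sketch in Python. This is not Eostric's actual code (the function name and the connector list are my own assumptions); it just shows the greedy "merge adjacent capitalized tokens into one phrase" idea that produces groupings like "Richmond , Virginia" above.

```python
# Illustrative sketch only (not Eostric's implementation): merge runs
# of capitalized tokens, plus connecting words, into one proper-noun
# phrase, tagging everything else with no part of speech yet.

def lump_proper_nouns(tokens):
    """Greedily merge adjacent capitalized tokens (and connecting
    words such as 'of' or ',') into a single proper-noun phrase."""
    connectors = {"of", ","}          # words allowed inside a phrase
    phrases, run = [], []
    for tok in tokens:
        if tok[:1].isupper():
            run.append(tok)
        elif run and tok in connectors:
            run.append(tok)
        else:
            if run:
                # drop a connector left dangling at the end of a run
                while run and run[-1] in connectors:
                    run.pop()
                phrases.append((" ".join(run), "proper_noun"))
                run = []
            phrases.append((tok, None))
    if run:
        phrases.append((" ".join(run), "proper_noun"))
    return phrases
```

A real chunker needs more than capitalization (sentence-initial words are capitalized too), which is part of the cleanup work mentioned above.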

Of all things, apostrophes and quote marks have turned out to be a challenge. Because of some curious decisions in both ASCII and keyboard design, how quotes are actually typed and stored varies widely. And one can never be sure whether the ' character marks the end of a single-quoted phrase or an apostrophe without digging into the context.

You will see that each word or phrase is marked with a part of speech (pos) and, if the word could be located in Wordnet, the Wordnet id for that word and the stem word that Wordnet is tracking. Wordnet only tracks nouns, verbs, adverbs, and adjectives, so you will also see "prepositions" and "conjunctions" listed. Those are sorted out by Eostric itself. Eostric also has rules for detecting and lumping proper nouns.
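The stem lookup ("was" to "be", "grew" to "grow") follows the general Wordnet approach of combining an exception table for irregular forms with suffix-detachment rules for regular ones. A toy illustration, with a deliberately tiny table and rule list of my own (real Wordnet ships full exception files per part of speech):

```python
# Toy illustration of Wordnet-style stemming: irregular forms come
# from an exception table, regular forms from suffix-stripping rules.

VERB_EXCEPTIONS = {"was": "be", "born": "bear", "grew": "grow"}
SUFFIX_RULES = [("ies", "y"), ("ed", ""), ("ing", ""), ("s", "")]

def verb_stem(word):
    """Return the base (stem) form of a verb."""
    if word in VERB_EXCEPTIONS:          # irregular forms first
        return VERB_EXCEPTIONS[word]
    for suffix, repl in SUFFIX_RULES:    # then detachment rules
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word
```

In the real database the stem is what carries the word id, which is why the output above reports "stem bear ... wordid 12366" for "born".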

I didn't think I'd get this far this fast, and frankly I'm scratching my head on how to take this software to the next level. But that's what I've been up to the last few weeks. Stay tuned for updates.

I have posted the code so far on chiselapp if you would like to experiment with it. I don't have the Wordnet database included in the code just yet. It's about 500 MB of data, and I'm not entirely convinced I need all of it, or even most of it. In a lot of my sample exercises (including the one above), Wordnet is only really useful for a small chunk of the data processing I actually need, and usually for generic words with special rules. Wordnet contains a lot of encyclopedic entries of rather dubious usefulness, things like "Battle of Jericho" and "Albert Einstein." Nice that it recognizes them; not so nice that it doesn't do anything useful with them.

The link to the code is: