This post is continuing my thoughts from Seeking a Language Template for Eostric.

One problem with my little system of taking English and turning those ideas into some sort of neutral storage format is that English loves to modify words to get them to satisfy different parts of a sentence. We can see that Chinese does that too, as does Cuneiform. The difference is that Chinese and Cuneiform are a little more obvious about how they do it.

Let's take the sentence "I am going to the store." Now let us also place that sentence in the past and in the future, and instead of "I," change the actor to "you."

tense    | actor | English                    | Chinese
present  | I     | I am going to the store    | 我要去商店
past     | I     | I went to the store        | 我去了商店
future   | I     | I will go to the store     | 我要去商店
present  | you   | You are going to the store | 你要去商店
past     | you   | You went to the store      | 你去了商店
future   | you   | You will go to the store   | 你要去商店
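To make the pattern concrete, here is a minimal sketch (every name in it is hypothetical, not part of any real system) of how one neutral, tense-free record could expand into the six English surface forms in the table. The stored idea never changes; only the wrapping around the verb does.

```python
# Hypothetical sketch: the stored idea is just "actor goes to store".
# Tense and actor only change the English wrapper around the verb.

def render(actor, tense):
    """Expand the neutral idea into an English surface form."""
    if tense == "present":
        aux = "am" if actor == "I" else "are"
        verb = f"{aux} going to"
    elif tense == "past":
        verb = "went to"  # English swaps in an irregular past form
    else:  # future
        verb = "will go to"
    return f"{actor} {verb} the store"

print(render("I", "past"))      # I went to the store
print(render("You", "future"))  # You will go to the store
```

Six surface sentences, one underlying idea: that gap between the stored record and the rendered text is exactly what the rest of this post is circling around.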


In the world of indexing human thought, many engines use a process called stemming. That process takes a word and separates the stem of that word from the extra bits that define the word's role in the sentence. Search engines like Google take text and stem it, so that a page about a mouse wearing a cape will be picked up by a search for a mouse that wore a cape. A stemmer knows the rules for the particular language it is parsing. I use what is known as the Porter Stemmer. It takes a phrase in and returns all of the words, in order, reduced to their stems. If I run our little mouse through the Porter Stemmer:

mouse wearing a cape - MOUSE WEAR A CAPE
mouse that wore a cape - MOUSE THAT WORE A CAPE
the anticipation is killing me - THE ANTICIP I KILL ME
the anticipation killed me - THE ANTICIP KILL ME
nobody anticipated my being killed - NOBODY ANTICIP MY BE KILL
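The flavor of rule-based stemming is easy to sketch. What follows is a toy suffix-stripper, not the real Porter algorithm (which applies many ordered, measure-based rules); it exists only to show why this style of stemming misses irregular forms like wore.

```python
# Toy suffix-stripper, NOT the real Porter algorithm. It peels one
# known suffix off a word if enough of a stem would remain.

SUFFIXES = ["ation", "ing", "ed", "s"]

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_phrase(phrase):
    return " ".join(toy_stem(w) for w in phrase.lower().split()).upper()

print(stem_phrase("mouse wearing a cape"))    # MOUSE WEAR A CAPE
print(stem_phrase("mouse that wore a cape"))  # MOUSE THAT WORE A CAPE
```

Note that "wore" sails through untouched: no suffix rule can know it belongs to "wear." That kind of blindness is the point of the next paragraph.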

As we can see, this process of stemming isn't perfect [1]. Stripping words to their stems also changes the meaning of a sentence. Ideally a stemmer would catch that wear and wore share the same root word. It should also know that is is a word in its own right, not the plural of I. In fact, English is challenging enough that most search engines don't even try to parse a statement for grammar. They use statistical association to find patterns [2]. That won't be good enough for the purposes of the application I am describing.

The computer has to know what the complete thought actually is to be able to formulate that thought into proper English, and possibly into any other spoken or written language. I'm not looking for perfect prose. If the text sounds like it's coming from a caveman, or from an individual with a very thick accent, that could actually work for our purposes. Even better would be to discover that the algorithm could be tuned to compose everything from country bumpkin to William Shakespeare.

Data Structure For Representing Ideas

First we need to answer the question "what is a thought?" If we remember from grade school, every sentence should be a complete thought. Now this may not be the correct answer at a neurological or psychological level, but we aren't trying to model a complete brain. We are just trying to represent how such a brain would need to arrange data to formulate coherent human communication.

Virtually every English sentence needs a subject and a predicate. In Chinese, each sentence has a subject, a verb, and an object. And then things get a bit murky.

Both have different rules that allow us to implement what news folks like to call the 5 Ws: Who, What, When, Where, Why... and How. (But they still call it the 5 Ws). Not every sentence has all of these elements, but a complete story will have a sentence delving into each. English and Chinese just have different ways of going about that process.
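One way to picture those elements, before any real grammar gets involved, is as a record of named slots that a sentence may or may not fill. This is a hypothetical sketch of my own, not an established linguistic structure:

```python
# Hypothetical sketch: a "complete thought" as a record of slots,
# roughly the newsroom 5 Ws (plus How). Not every sentence fills
# every slot; empty slots stay None.

def make_thought(**slots):
    thought = {"who": None, "what": None, "when": None,
               "where": None, "why": None, "how": None}
    thought.update(slots)
    return thought

going = make_thought(who="I", what="go", when="past", where="the store")
filled = [k for k, v in going.items() if v is not None]
print(filled)  # ['who', 'what', 'when', 'where']
```

English and Chinese would then be two different sets of rules for reading the same slots back out as words.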

Long story short, I'm a software engineer, not a linguist. So I'm going to throw away the technical terms that linguists use to describe the process of communication, and make up terms that better describe what this project is trying to do. Namely, store thoughts in a form that can be expressed at a later time, and store those thoughts in a manner so simple and consistent that a computer can convert those ideas back into sentences.

What I've been finding is that many of the existing linguistic and grammatical terms that describe language bake in a lot of concepts that don't cross from one language to another, because each language organizes ideas in its own way, a way that makes sense for its particular way of doing things.

In a way this project is actually a bit like chemistry. You have compounds that break into atoms. Then you find that atoms themselves are made up of subatomic particles, and that those subatomic particles are themselves made of quarks. The sentence structure you are used to dealing with from grammar school is on the atom level, with a confusing soup of subatomic particles underneath. The computer is going to need to work at the quark level, with a different set of rules for assembling English atoms vs. Chinese atoms.

Okay, maybe chemistry isn't the best analogy for everyone. I promise there won't be equations.

What I will be working on for my next article is taking different sentence patterns and figuring out what sort of arrangement of blocks would be needed to accommodate all of those patterns without resorting to reading into the context, because computers can't. For our game purposes, the computer agent will have databases full of these ideas, and will use a grammar synthesizer to form those ideas into sentences. The system will also have a way to take a player's commands and turn those commands into one of these data structures. It will be clearer when I get some more examples together.
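The round trip just described can be sketched end to end. Everything below is deliberately naive and every name and rule is hypothetical; it only shows the shape of the pipeline, command in, slot record in the middle, crude sentence out.

```python
# Naive sketch of the round trip: a player command is parsed into a
# slot record, and a "grammar synthesizer" turns a record back into
# caveman-grade English. All rules here are hypothetical.

def parse_command(text):
    """Verb-first commands only: 'go store' -> slot record."""
    words = text.lower().split()
    return {"who": "you", "what": words[0],
            "where": words[1] if len(words) > 1 else None}

def synthesize(record):
    """Compose a crude sentence from a slot record."""
    parts = [record["who"], record["what"]]
    if record["where"]:
        parts += ["to", record["where"]]
    return " ".join(parts)

cmd = parse_command("go store")
print(synthesize(cmd))  # you go to store
```

A real synthesizer would need the tense and agreement machinery from the table at the top of this post, but the data structure in the middle is the part that has to stay language-neutral.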

Basically I'm trying to get a cluster of ideas that will snap together like Lego, with different bricks connecting in designated places to form a structure. As I said... more on this as I work out specifics.

[1] - As any casual search engine query will also demonstrate.

[2] - As any casual search engine query will also also demonstrate.