When the English tongue we speak,
Why is break not rhymed with freak?
Will you tell me why it’s true
We say sew but likewise few?
— Our Strange Lingo by Lord Crome
The English language is complex, and the way our brains interpret it makes things even more interesting. The same words and phrases can have entirely different meanings depending on the context in which they are used.
Consider a few examples:
- Did you remove the dust or were you covered in dust?
- Were you driving fast or were you stuck fast to the ground?
- Were you bound to the ground with manacles or were you bound for somewhere?
Add to this variables like what was said, when, how and to whom, and we have quite a rigmarole to deal with.
These are the kinds of questions I wrestle with every day as I design algorithms that discern buying intent from documents. While natural language processing (NLP) and associated analytics can help filter out some of the unknown variables from conversations and give us data to make decisions, the human mind and its machinations continue to be a challenge.
The blind side
To solve any problem efficiently in NLP, we need metadata for every word used in the English language. Basic metadata covers only the part of speech, which is not enough to draw any conclusions. To come up with a coherent information pattern, you need more detail, such as:
- Is it a verb? If yes, what type of verb?
- What is the emotional value of the word?
- If we are talking about a noun, is it a person, location or physical entity?
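The questions above can be pictured as fields on a per-word record. The following is only a minimal sketch: the `WordMetadata` class, its field names, the emotional-value scale and the three sample entries are all assumptions made for illustration, not an actual lexicon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordMetadata:
    """Hypothetical per-word record; every field name here is illustrative."""
    word: str
    part_of_speech: str                 # e.g. "verb", "noun", "adjective"
    verb_type: Optional[str] = None     # e.g. "action", "stative"; verbs only
    emotional_value: float = 0.0        # assumed scale: -1.0 (negative) to 1.0 (positive)
    entity_type: Optional[str] = None   # e.g. "person", "location"; nouns only


# Hand-written sample entries, not drawn from any real lexicon.
LEXICON = {
    "buy": WordMetadata("buy", "verb", verb_type="action", emotional_value=0.2),
    "london": WordMetadata("london", "noun", entity_type="location"),
    "terrible": WordMetadata("terrible", "adjective", emotional_value=-0.8),
}


def describe(word: str) -> Optional[WordMetadata]:
    """Look up a word case-insensitively; return None when it is unknown."""
    return LEXICON.get(word.lower())
```

Calling `describe("London")` returns the record tagged as a location, while an unknown word returns `None`, which is the honest answer when the metadata simply is not there.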
The chemistry of language
To explain this better, let me draw an analogy from chemistry. In chemistry we have 118 chemical elements in the periodic table. These elements are arranged in such a way that the elements in the same group (column) tend to have shared chemical attributes and exhibit a clear trend in properties with increasing atomic number. Just as an atom has properties such as its number of protons and electrons and its isotopes, a word can be described by a set of properties. If that information can be classified, it may not only explain the word’s behaviour in a given text but also help in drawing a pattern.
Having said that, it is extremely difficult to have a single table for all the words ever used in the English language. In fact, building even one such comprehensive table is almost impossible with today’s technology. Instead, we can choose to have multiple tables for different levels and use cases, for example intention analysis, question answering and text summarization.
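One way to picture multiple tables is a small word table per use case, each holding only the properties that use case cares about. This is a rough sketch; the table names, property names and entries below are invented for illustration.

```python
from typing import Dict

# Hypothetical tables: one small word table per use case rather than one
# universal table. All entries and property names are made up.
TABLES: Dict[str, Dict[str, Dict[str, object]]] = {
    "intention_analysis": {
        "buy": {"signals_intent": True, "strength": 0.9},
        "browse": {"signals_intent": True, "strength": 0.3},
    },
    "question_answering": {
        "who": {"expects": "person"},
        "where": {"expects": "location"},
    },
}


def lookup(use_case: str, word: str) -> Dict[str, object]:
    """Return the word's properties for one use case, or {} if absent."""
    return TABLES.get(use_case, {}).get(word.lower(), {})
```

The same word can carry properties in one table and be absent from another: `lookup("question_answering", "where")` finds an entry, while `lookup("intention_analysis", "where")` returns an empty dictionary.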
Arriving at a pattern
When you mix different chemicals, the final result depends on the order in which you mix them. The elements in the mixture and the surrounding conditions like the room temperature also matter. Sometimes while arriving at this chemical mixture, you end up with fixed stable compounds which do not change even if you add new chemicals as they are in a state of equilibrium.
We can apply the same principle to words, their order, and their position in the text. The time, location and meaning of the text will also change based on the source of the text. In a sentence, if you have sub-sentences (separated by a comma), adding a few more sub-sentences does not change the meaning as they are in a state of equilibrium (much like this very sentence!).
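The point about order can be shown with a toy comparison: two sentences built from exactly the same words can still differ once order is taken into account. This is a simplified sketch with hypothetical helper names, using plain whitespace splitting rather than a real tokenizer.

```python
from collections import Counter
from typing import List


def words(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def same_ingredients(a: str, b: str) -> bool:
    """True when both texts use the same words the same number of times."""
    return Counter(words(a)) == Counter(words(b))


def same_order(a: str, b: str) -> bool:
    """True only when the words also appear in the same sequence."""
    return words(a) == words(b)


first = "the dog bit the man"
second = "the man bit the dog"
```

Here `same_ingredients(first, second)` is true but `same_order(first, second)` is false, so any model that ignores order would treat these two very different events as identical.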
What’s the good word?
Interpreting language isn’t an easy problem to solve even for the systematic and logical brain of the machine. For instance, the meaning of a word can change based on its context.
- The group has achieved fair and equal representation for all its members.
- She is very fair with blue eyes.
While it is very easy for a human reader to know what the intent is, the computer will struggle to do so. In cases like this, we can borrow the idea of isotopes from chemistry. For a word like “fair”, we can store an extra property listing its isotopes (its distinct senses) together with the surrounding conditions under which each one occurs. The condition in this case is “followed by an adjective or a verb.” Using this property, the machine can predict a better meaning the next time it meets the word.
The brain and its workings will always be a mystery which even the most sophisticated machine cannot unravel. As John Horgan says in The Undiscovered Mind, “As researchers learn more about the brain, it becomes increasingly difficult to imagine how all the disparate data can be organized into a cohesive, coherent whole.”
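As a rough sketch of the isotope idea, each sense of a word can be stored with the surrounding conditions that favour it. Note the substitution: instead of a part-of-speech condition, this example uses nearby cue words as the "surrounding conditions", which is a much simpler stand-in. The `Isotope` class, the cue sets and the window size are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Isotope:
    """One sense of a word plus hypothetical cue words that favour it."""
    sense: str
    cues: FrozenSet[str]


# Hand-picked cue words; a real system would need far richer conditions.
ISOTOPES: Dict[str, List[Isotope]] = {
    "fair": [
        Isotope("just, impartial", frozenset({"equal", "representation", "treatment"})),
        Isotope("light in complexion", frozenset({"eyes", "hair", "skin"})),
    ],
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def pick_sense(word: str, sentence: str, window: int = 4) -> Optional[str]:
    """Pick the sense whose cues overlap most with the nearby words.

    Returns None when the word is unknown, absent, or no cue matches.
    """
    tokens = tokenize(sentence)
    if word not in ISOTOPES or word not in tokens:
        return None
    i = tokens.index(word)
    nearby = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    best = max(ISOTOPES[word], key=lambda iso: len(iso.cues & nearby))
    return best.sense if best.cues & nearby else None
```

With these cue sets, the first example sentence resolves to the "just, impartial" sense and the second to "light in complexion". A sentence with no matching cue returns `None` rather than a guess.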