How to Create a Natural Language Answering System (2012)

Conversation is hard.  Just ask anyone paired with a stranger in close proximity, say two riders sharing an elevator.  Far easier to stare fixedly at the elevator control panel and listen to the mechanical beeps as the floors pass by.  Pack ten people into the elevator and all ten stare at the floor number as if it were the most riveting sight in the world.

 

Why is it hard?  And what does it mean that we can now converse with IBM's Watson or Apple's Siri?  Is it time, as Jeopardy contestant Ken Jennings jested, "[to] welcome our new computer overlords," as he rapidly lost to Watson?  Or is it time to dig a little behind the scenes and find the trick behind the magic illusion?

 

Mechanically, conversation is complex.  The speaker needs to coordinate the diaphragm to expel air over the vocal cords, tightened just so for the right pitch, as the airflow passes over the tongue, whistles through the teeth, vibrates the skull, erupts past the lips pursed just so, propagates through the ambient air, diffuses with sufficient energy to vibrate the listener's eardrum, stimulates the hair cells in the basilar membrane, and activates the appropriate auditory nerves that trigger phoneme recognition.  There is a lot of hard, brilliant work focused on these very concepts.  But mechanically producing and recognizing speech is not what makes people nervous about conversations or Watson or Siri.

 

Decision-wise, conversation is hard.  The reason we are uncomfortable about conversation is that we may be judged by it.  Hearing may be impaired with age or lips may be numbed by novocaine, but our choice of words conveys magnitudes about our intelligence and our intent.  "Hello, sir" is much different from "Howdy, how y'all doing?" is much different from "Yo, sup?"  Similarly, the reply of "Hello, ma'am" is much different from stony silence or "Hush - the government is watching."  All options bespeak our intelligence, our attention, and our intention.  Speak well and cogently to display a high-ranking intelligence.  Speak erratically and ramblingly to display gross incompetence.

 

What is intelligence as it pertains to conversation?  Removing the mechanics of phoneme production highlights the decision making of word selection.  How does a person respond upon recognizing the conversational opener, "Hello?"  The person needs to somehow compile 26 letters, plus spaces and punctuation, into a return string of indeterminate length.  Major points lost if the response is grammatically incorrect.  More points lost if the response is socially incorrect or irrelevant.  Some points lost if the response does not conform to local jargon.  This is a hard challenge.

 

This is a normal conversation:

A: "Hello."

B: "Hi!  How are you doing?"

A: "I am fine, Thanks, and how are you?"

B: "I am doing well, Thank you!"

 

This is not a normal conversation:

A: "Hello?"

B: "I aime strawberries?"

A: "Wo bu am a computer:"

B: "I livre votre be hungry chair patio;"

 

To make it more readable, I added the constraint that all lines above contain real words and no misspellings, though not necessarily in English grammar or any grammar at all.  How can a Watson or a Siri or any AI conversation or question-answering system act like the former and not the latter?  We can experiment a little, right now, on the cheap.

 

According to the available Watson documentation on the IBM website, setting aside the complex hardware, power consumption, and physical components - which make up a large portion of the work - what remains is the basic brain algorithm.  Watson's brain is built as a combination search and entity recognition program.  Entity recognition flags the names, dates, places, and pronouns in a document.  Google does search.  Putting them together allows Watson to respond thus:

 

Question: "This poor Shakespeare jester is eulogized by the Prince of Denmark."

Watson: "Who is Yorick?"

 

Breaking it down, we can paste the question into Google.  Automatic stop-word scripts ignore the common words like "this," "the," "is," and "by," leaving only the distinguishing words.  The order does not matter.  In fact, try copying the phrase, "poor, Shakespeare, jester, eulogized, Prince, Denmark" without quotes into a Google search.  Feel free to mix up the order; it makes no difference.  At this initial step, the search has no grammatical construct.  We can do the same with a series of Microsoft Word searches.
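
To make that first step concrete, here is a minimal sketch of the stop-word filter in Python.  The tiny stop list and the function name are my own inventions for illustration; a production system would use a far larger list and a real search index behind it.

# A minimal sketch of the stop-word step, assuming a tiny hand-rolled stop list.
STOP_WORDS = {"this", "the", "is", "by", "of", "a", "an", "to"}

def distinguishing_words(question):
    """Drop punctuation and common words, keeping only the terms worth searching."""
    cleaned = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in question)
    return [word for word in cleaned.split() if word not in STOP_WORDS]

print(distinguishing_words(
    "This poor Shakespeare jester is eulogized by the Prince of Denmark."
))
# ['poor', 'shakespeare', 'jester', 'eulogized', 'prince', 'denmark']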

 

Entity recognition works in parallel, scanning the parts of speech (subject, predicate, object) and determining that we are looking for a reference to a person.  Specifically, a "poor" person in a work of "Shakespeare" based in "Denmark."  Running a word search on the text of "Hamlet, the Prince of Denmark," as picked out above, places the word "poor" in close proximity to a character named "Yorick."  Since we already know we are looking for a person, we place "Yorick" in the blank template, "Who is _____?"  Add some more formatting, a creative logo and avatar, and voila!  The poor person's Watson is a stockpile of Word documents and a Mad Libs booklet.
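
For the curious, the proximity-and-template trick can be caricatured in a few lines of Python as well.  The name heuristic below (any capitalized word near "poor" that is not on a short exclusion list) is a crude stand-in for real entity recognition, and the one-line Hamlet excerpt stands in for the full play text.

# A rough sketch: find a name-like word near the anchor word, then fill the template.
def answer_person(play_text, anchor="poor", window=5):
    words = play_text.split()
    for i, word in enumerate(words):
        if word.lower().strip(",.!?;:'\"") != anchor:
            continue
        # Look a few words to either side for something that looks like a proper name.
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            candidate = words[j].strip(",.!?;:'\"")
            if candidate.istitle() and candidate.lower() not in {"alas", "poor", "i", "the"}:
                return "Who is " + candidate + "?"
    return "No answer found."

hamlet_excerpt = "Alas, poor Yorick!  I knew him, Horatio: a fellow of infinite jest."
print(answer_person(hamlet_excerpt))  # Who is Yorick?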

 

Of course, the Watson and Siri on display are extremely challenging to engineer and require years of brilliant, hard work.  I abstracted away many of the low-level details.  Let us do the same for how a person might hold a conversation.  How does a person respond to, "Hello?"  Let us assume the response must be a short, four-word reply and also assume it must be English.  With about 300,000 words or more in English, that comes to 300,000^4 possible responses.  If the brain does not overheat and flame out running the gauntlet of these conversation options, it would burn volumes of food calories doing so.  Alternatively, if we take Chomsky to the extreme, the template, "<Greeting word> <Polite conversation continuing three word phrase>" must be stored somewhere on chromosome number 15.
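
For the skeptical, the arithmetic is quick to check.  The vocabulary size and the four-word reply length are just the assumptions from above, nothing more.

# Back-of-the-envelope count of candidate four-word replies.
vocabulary = 300000
reply_length = 4
candidates = vocabulary ** reply_length
print("{:.1e}".format(candidates))  # roughly 8.1e+21 possible replies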

 

Another alternative is watching how small children learn conversation.  We say, "Hello" and reward the child with a smile or hug when they respond to our satisfaction, as in, "Hi!  How are you?"  The burst of dopamine makes the child feel good and seals the conversation pattern in memory, linking the two phrases together in time via activity in the hippocampus, prefrontal cortex, and other regions.  "Hello" is followed by, "Hi!  How are you?" just as surely as "Four score and seven years ago" is followed by, "our fathers brought forth on this continent...."  The first phrase links to the next.  The phrases we link together are the phrases we experience every day.  They become linked and stored so seamlessly over time that the process becomes, in the parlance of Piagetian theory, automatized.
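
If we wanted to caricature that linking in code, it would be little more than a lookup table from a heard phrase to the rewarded reply.  The table below is invented for illustration; a person's version is written in over years of everyday exposure.

# A toy sketch of "one phrase links to the next" as a learned lookup.
learned_links = {
    "Hello": "Hi!  How are you?",
    "How are you?": "I am doing well, thank you!",
    "Four score and seven years ago": "our fathers brought forth on this continent...",
}

def respond(phrase):
    # Fall back to staring at the floor indicator when no link has been learned.
    return learned_links.get(phrase, "...")

print(respond("Hello"))  # Hi!  How are you?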

 

The phrases we hear every day are correlated with our background and environment.  The example I chose came from Lincoln's Gettysburg Address since I often read political history.  I could not have chosen lyrics from Lady Gaga, "We are the crowd" followed by "We're c-coming out," because I do not have a radio and have yet to hear them.

 

Conversation is not hard because of the speed and precision required to select the correct words from the correct facts, as Watson and Siri might do.  Rather, conversation may be hard because it reveals volumes about what we have been doing every day, where, and with whom.  In this regard, Watson and Siri are wonderfully useful and complex tools - exactly as their designers would attest.  They can search.  They can filter.  They can do so very ergonomically.  But they do not hold conversations.  Every conversation is a running display of our history.  The issue is not cleverness.  The issue is privacy.

 

Otherwise, someone could make a killing producing and directing a movie starring those blinking elevator floor indicators, complete with their pleasant beep soundtrack.