How to Understand Natural Language (2016)

 

Once upon a time, the world's communication in its raw, unstructured, and unformatted yet meaningful form was simply known as language.  Any person could understand their local language.  Nobody could hope to scale up and understand it all.

 

Any calculating computer can assist with scaling up known, step-by-step tasks.  The catch is that the designer must understand every single step.  No computer could hope to understand anything besides its delineated steps.

 

The first computer program communications took place on punch cards: stiff paper cards with precisely placed, structured hole patterns, a kind of physical Morse code.  Every computer communication since derives from this necessary legacy.  This makes for a distinction: computer languages are formatted and structured instruction set protocols.  Raw, unstructured, and unformatted yet meaningful forms are now called natural languages.  Bridging the gap between the two is called natural language processing.

 

Natural language processing is the attempt to understand natural language at industrial scale.  It basically entails segmentation, parsing/transformation (word sense disambiguation), and generation.
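
As a purely illustrative sketch, those three stages might be chained as below.  Every function here is a hypothetical toy placeholder, not any particular system's implementation.

    def segment(raw_text):
        # Toy segmentation: split on whitespace.
        return raw_text.split()

    def parse_and_transform(tokens):
        # Toy word sense disambiguation: look each token up in a tiny hand-made lexicon.
        lexicon = {"cow": "animal", "moon": "celestial body", "jumped": "action"}
        return [(token, lexicon.get(token, "unknown")) for token in tokens]

    def generate(parsed):
        # Toy generation: render the parsed structure back into a string.
        return "; ".join("%s=%s" % (token, sense) for token, sense in parsed)

    print(generate(parse_and_transform(segment("the cow jumped over the moon"))))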

 

The problem, as it were, is that disintegrating language into independent steps breaks the natural and shoehorns it into a computational, industrialized process.  And like any industrialized process, this scales very well, but only for suitable raw materials.  Everything else gets discarded.  For example, according to the engineers, sales and marketing staff, and chief scientist for a major commercial natural language processing system, multiple independent system instances are required to handle different subject corpora.  Feeding a medical advice system legal questions generates nonsensical medical advice.  Furthermore, if the system is adaptive, as is currently in vogue, we may also have corrupted it going forward, in exactly the same manner as putting diesel fuel into a non-diesel car engine or feeding scrap metal into a cranberry sorting machine.  Garbage in, garbage out.

 

The question almost nobody is asking is: where are the neuropsychologists in this discussion of natural language?  How do we understand natural language?  How do we understand language?

 

Chomsky (1972) attempted to view children's understanding holistically and essentially threw his hands up in despair.  It was inconceivable to him that people could formulate novel utterances based only on imitation and reinforcement.  There must therefore be a universal grammar built into some unspecified, genetically endowed language organ.

 

While Chomsky's proposed organ, synonymous with hard-coded linguistic grammar rules, is no longer strongly favored, human natural language does appear uniquely special.  Tomasello et al. (1993) revealed that only humans exhibit speech not important for survival.  Even chimpanzees that sign quite well do not communicate for the sake of conversation.  That is, no chimp has ever made small talk.

 

MacWhinney (2002) proposed that language evolved with the speaker.  Languages do not spring fully formed from the ether, but develop gradually with the assistance of cultural factors.  Snow (1986) showed that children learn full speech in cultures where adults encourage them with children's topics, in cultures where adults are neutral and speak of non-children's topics, and even in cultures where adults actively discourage children from speaking at all.  Yet other complex, non-speech cognitive skills still require more amenable adult assistance to develop.

 

Other researchers start with early learners in their first year to observe emergent understanding, beginning with segmentation.  Jusczyk, Cutler, and Redanz (1993) found that 7-month-old babies use stress patterns (i.e., accents) to find word boundaries.  English usually stresses the first syllable of a word (doc'-tor, can'-dle).  Babies in such environments exhibit difficulty segmenting more unusually stressed words (gui-tar', sur-prise').
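
A minimal sketch of that strong-first bias follows: hypothesize a word boundary before every stressed syllable.  The syllable and stress annotations are assumed to be given as input here; infants must infer them from the speech signal itself.

    def segment_by_stress(syllables):
        # syllables: list of (syllable, is_stressed) pairs -> list of hypothesized words.
        words, current = [], []
        for syllable, stressed in syllables:
            if stressed and current:
                # A stressed syllable is assumed to begin a new word.
                words.append("".join(current))
                current = []
            current.append(syllable)
        if current:
            words.append("".join(current))
        return words

    # Strong-first words segment cleanly...
    print(segment_by_stress([("doc", True), ("tor", False), ("can", True), ("dle", False)]))
    # ...but a word stressed on its second syllable gets wrongly split.
    print(segment_by_stress([("gui", False), ("tar", True)]))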

 

Saffran et al. (1996) explored the use of transitional probabilities.  Consider the phrase "pretty baby."  The sound "pre" alone implies an unfinished word and that "tty" will soon follow.  The sound "ba" likewise implies "by."  "Pre" and "ba" rarely appear alone.  "Pretty," however, can be followed by many different sounds or none at all, implying a complete, finished word.  "Pretty" and "baby" can thus be split into different words.  Eight-month-olds tested on artificial sound streams following this rule can identify such word boundaries.
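
A minimal sketch of this statistical idea, under the assumption that the input has already been broken into syllables, is to estimate the probability of each syllable given the one before it; low-probability transitions are hypothesized word boundaries.

    from collections import Counter

    def transitional_probabilities(syllables):
        # P(next syllable | current syllable), estimated from the stream.
        pair_counts = Counter(zip(syllables, syllables[1:]))
        first_counts = Counter(syllables[:-1])
        return {(a, b): count / first_counts[a] for (a, b), count in pair_counts.items()}

    stream = ["pre", "tty", "ba", "by", "pre", "tty", "dog", "gy", "pre", "tty", "ba", "by"]
    tp = transitional_probabilities(stream)

    print(tp[("pre", "tty")])   # 1.0: "pre" is always followed by "tty" (within a word)
    print(tp[("tty", "ba")])    # about 0.67: a cross-word transition, lower than 1.0
    print(tp[("tty", "dog")])   # about 0.33: another cross-word transition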

 

Similarly, Mattys and Jusczyk (2001) and Mattys et al. (1999) explored phonotactic information, the sound constraints of a language such as English.  For example, "ant" comes from "a-" and "-nt," both natural in-word sounds.  In contrast, "-mt" is not an English in-word sound.  Therefore, "come to" cannot be "comt oo" but must be "come" and "to."
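
A toy sketch of using such constraints for segmentation follows.  The tiny set of illegal word endings is an illustrative assumption, not a real phonotactic inventory of English.

    # Clusters assumed here never to end an English word.
    ILLEGAL_ENDINGS = {"mt", "pk", "zg"}

    def plausible_split(sounds, boundary):
        # Is splitting the sound string at this index phonotactically plausible?
        left = sounds[:boundary]
        return left[-2:] not in ILLEGAL_ENDINGS

    sounds = "comto"                   # a rough stand-in for the sounds of "come to"
    print(plausible_split(sounds, 3))  # "com" + "to" -> True ("-m" can end a word)
    print(plausible_split(sounds, 4))  # "comt" + "o" -> False ("-mt" cannot)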

 

The conclusion as regards segmentation is that it is natural yet not intuitive at first glance.  The intuitive approach embedded in computational methods is to demand articulated, structured pauses between words.  Unfortunately, natural language exhibits co-articulation.  Instead of a structured "the cow jumped over the moon," natural language is more like "thecowjumpedoverthemoon."  Or even worse, it may be "thec owju mpedov erthemoo n."  This has traditionally been a severe problem for natural language processing.  Infants under a year old apparently show no such complication.
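
By contrast, a typical computational attack on a co-articulated stream is a dictionary lookup with backtracking, as in the sketch below.  The vocabulary set is an illustrative assumption; infants come with no such pre-built dictionary.

    from functools import lru_cache

    VOCAB = {"the", "cow", "jumped", "over", "moon"}

    def segment(text):
        # Return one segmentation of text into VOCAB words, or None if none exists.
        @lru_cache(maxsize=None)
        def helper(rest):
            if not rest:
                return []
            for i in range(1, len(rest) + 1):
                if rest[:i] in VOCAB:
                    tail = helper(rest[i:])
                    if tail is not None:
                        return [rest[:i]] + tail
            return None
        return helper(text)

    print(segment("thecowjumpedoverthemoon"))
    # ['the', 'cow', 'jumped', 'over', 'the', 'moon']
    print(segment("thecowjumpedoverthemooing"))
    # None: anything outside the dictionary simply fails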

 

In a similar vein, children's language learning implies that parsing and transformation may be linked with output generation, at odds with computational divide-and-conquer approaches.  Neville (1995) and Neville, Mills, and Lawson (1992) presented intriguing research showing that grammar words (e.g., "the") activate frontal and temporal regions of the brain while object words (e.g., "dog") activate rear regions.  This is suggestive because rear areas of the brain are linked with object identification.  Other research shows frontal regions are more involved with executive and social functions.  Perhaps grammar words are not meant to convey meaning but socio-cultural style and standing.

 

Panagos and Prelock (1982) showed that until age 5-6, children often cannot properly pronounce full speech consistently.  The sounds sh, th, s, and r need precise control over the vocal cords, lips, teeth, and tongue.  Therefore, young children choose words carefully, especially favoring words they can pronounce (Leonard, 1995; Menn and Stoel-Gammon, 1995).  Kuczaj (1983) showed that 3-year-olds can identify the difference between actual and desired language utterances.  An especially poignant example from a 2.5-year-old follows (Smith, 1973).

 

Father: Say jump.

Child: Dup.

Father: No, jump.

Child: Dup.

Father: No, jummmp.

Child: Only Daddy can say Dup.

 

This phenomenon may reflect not a static division between parsing/transformation and output generation, but rather a dynamic division between the child's language and the adult's language.

 

Caselli et al. (1995) and Bonvillian et al. (1983) compiled the most common first 50 words across a variety of languages, which include words like more, mommy, daddy, car, cookie, juice, milk, dog, cat, and ball.  Not included were words like less, stove, and animal.  Echols (1993) and Johnson et al. (1995) studied toddlers from 12-18 months of age.  They theorized that the cognitive load of even single words or syllables strains young toddlers' capabilities, so toddlers choose wisely to convey their intent.  They rely heavily on the concept of a holophrase.  The single word "ball" may mean "give me the ball," or "that is a ball," or perhaps "the dog took the ball."  The word "banana" may indicate "I want a banana."  But upon being given a banana the toddler did not want, "no" means "I do not want this banana."  Saying "banana" at that point would be ambiguous.

 

Anyone around young children can understand this child-speak.  But they and the children are in essence mastering context-sensitive intent keywords, something no commercial natural language processor has yet mastered.  Imagine a computational natural language processor making sense of "ball."  Imagine Siri or Echo or Google Search or any bot understanding that highly unstructured holophrase.  Indeed, as of this writing, Google Maps' own system-defined category for parks, playgrounds, and museums includes an entry for "John Park, MD."  Clearly, the name is erroneously triggering the static "park" keyword.  We truly have a long way to go.
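
A toy illustration of that static keyword failure follows.  The category rule and the place names are made up for illustration; this is not Google Maps' actual logic.

    PARK_KEYWORDS = {"park", "playground", "museum"}

    def looks_like_park(place_name):
        # Naive category check: any keyword appearing anywhere in the name.
        name = place_name.lower()
        return any(keyword in name for keyword in PARK_KEYWORDS)

    print(looks_like_park("Riverside Park"))  # True, as intended
    print(looks_like_park("John Park, MD"))   # True, a false positive: a person, not a park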

 

The key to understanding natural language is to solve the riddle of induction.  How does one find and generate the intent of words and sentences?  How can we make sense of holophrases?

 

Research supports four pillars of understanding natural language intent:

 

  • Attentional constraints.
  • Working memory tracking.
  • Social learning.
  • General cognitive processes.

 

Human attentional limits are not a bug but a feature.  When an adult points out a novel, arbitrary object and says, "Oh look!  A wug!" the toddler focuses her senses on the object and nothing else.  First, there is a whole-object bias: "wug" must refer to the entire object, not a portion of it or the object plus some portion of the environment.  Second, there is novelty detection: "wug" being a novel word, it must refer to something novel and not the familiar ball next to it.  Third is the taxonomic constraint: "wug" labels this object and everything else like it, invariant of size and color.
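
A minimal sketch of how those three constraints might filter candidate referents follows.  The scene representation and its attributes are illustrative assumptions, not a model of toddler cognition.

    def referent_for_novel_word(candidates):
        # candidates: dicts with 'name' (None if not yet named), 'is_whole_object', 'kind'.
        # Whole-object bias: only whole objects are candidate referents.
        wholes = [c for c in candidates if c["is_whole_object"]]
        # Novelty detection: a novel word maps to a not-yet-named object.
        unnamed = [c for c in wholes if c["name"] is None]
        # Taxonomic constraint: the label extends to the whole kind, regardless
        # of size or color, so return the kind rather than the single item.
        return unnamed[0]["kind"] if unnamed else None

    scene = [
        {"name": "ball", "is_whole_object": True,  "kind": "ball"},
        {"name": None,   "is_whole_object": True,  "kind": "wug"},
        {"name": None,   "is_whole_object": False, "kind": "wug handle"},
    ]
    print(referent_for_novel_word(scene))  # "wug"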

 

Working memory is a form of active memory.  To review: long-term memory is statically stored for durations exceeding minutes to hours; short-term memory is statically stored for durations under minutes to hours; and working memory dynamically recombines multiple long- or short-term, spatial or temporal memories.  The image of a dog is in long-term memory.  The name of this dog is in short-term memory.  But a series of images of an approaching dog gets worked on to inform you it is charging fast.  Within natural language, the sequence of sounds gets worked on to say that "a wug" is an object, "some wug" is a quantity, and "wugging" is an activity.
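
A minimal sketch of that idea: the interpretation of "wug" depends on the few tokens held around it, not on the word in isolation.  The frame rules below are illustrative assumptions.

    def interpret(tokens):
        # Classify a short utterance about "wug" from its surrounding frame.
        if tokens and tokens[-1].endswith("ing"):
            return "activity"        # "wugging" -> something being done
        if tokens and tokens[0] == "a":
            return "object"          # "a wug" -> a countable thing
        if tokens and tokens[0] == "some":
            return "quantity"        # "some wug" -> an amount of stuff
        return "unknown"

    print(interpret(["a", "wug"]))     # object
    print(interpret(["some", "wug"]))  # quantity
    print(interpret(["wugging"]))      # activity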

 

Social learning enables us to extract knowledge from others' actions.  Baldwin (1991; 1993) studied 18-month-old toddlers paired with adults.  Both had novel objects.  The adult then said, to nobody in particular, "Oh wow, a modi!"  When the toddlers were later asked to get the "modi," they were more likely to get the object the adult had attended to.  The toddlers were able to infer the knowledge without being explicitly trained.

 

General cognitive processes use the same "code" not just for language vocabulary but also for general learning.  Markson and Bloom (1997) studied 3- to 4-year-olds with a series of novel objects, one of which was called a "koba" and one of which was presented as a gift.  After delays of a week and a month, the children could recall both the "koba" item and the gift item equally well.  The learning was not specialized for words alone.

 

This general cognitive process pillar is especially profound.  It implies that the path to understanding natural language segmentation, parsing/transformation, and generation lies not in breaking language down into component parts, or even in constraining the exploration to language alone.  Rather, it lies through complementary senses and behaviors.  It requires seeing the big picture as well as the disintegrated parts.  We would need to understand general executive learning in order to understand natural language.  We would need to understand natural language to understand general executive intelligence.