Understanding Why Classic Machine Learning is Machine Storage (2014)

 

Is Machine Learning Really Machine Learning?

 

Machine Learning is a hot topic for study and applications.  It is a subfield of Computer Science focused on generalizing from known data to make predictions about unknown data.  It is an enabling technology for Big Data.  A modern spam filter, for example, more often than not uses machine learning algorithms to predict whether incoming email messages are spam or not.  A modern hospital diagnosis tool to assist health care professionals likewise makes use of machine learning and big data mining.

 

It is natural to ask whether machine learning is on track to help “solve” the Artificial Intelligence problem – can computers ever become “smart” like a person?  Today, spam; tomorrow, the world?  Is machine learning about a machine that learns?  And if learning efficiently is what makes us human, then a machine that learns sufficiently efficiently should be, for all intents and purposes, like a human.  Let us explore this syllogism.

 

Phase 1.  The Expert System.

 

In the 1970s, expert systems proliferated in computer science.  An expert system is a set of rules or facts combined with a logic module that queries the rules and fills in gaps to reach a conclusion.  In a health care diagnosis system, it might work like this:

 

Input: Fever, stomachache, headache.

Output: Flu.

 

The expert system contains a list of input symptoms tied to outputs, such as “IF fever -> THEN flu…”  The system receives the inputs “Fever; Stomachache; Headache,” runs them against this list of input symptoms in the database, tallies up the outputs, and responds, “Flu.”  Maintaining the list of symptoms requires, say, a database administrator performing overnight batch updates.  Maintaining the logic and query system requires a programmer to change the code in the development area, try it out with a select group of end users in the test area, and finally release the code patch to production in an overnight batch update.  This setup might be marketed to the world with slogans like, “80% of doctors and nurses agree with the AutoDoc Health Care Expert System…”
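
 

For concreteness, the rule-matching and tallying described above might look like the following minimal Python sketch.  The rule table, symptom names, and “most votes wins” logic are hypothetical stand-ins for illustration, not the workings of any real expert system:

    # Hypothetical rule base: symptom -> possible conditions.
    RULES = {
        "fever": ["flu"],
        "stomachache": ["flu", "food poisoning"],
        "headache": ["flu", "migraine"],
    }

    def diagnose(symptoms):
        """Query the rule list for each symptom and tally the outputs."""
        tally = {}
        for symptom in symptoms:
            for condition in RULES.get(symptom, []):
                tally[condition] = tally.get(condition, 0) + 1
        # Respond with the condition matched by the most rules.
        return max(tally, key=tally.get) if tally else None

    print(diagnose(["fever", "stomachache", "headache"]))  # -> flu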

 

Except that might not be what makes a healthcare professional a healthcare professional.  The result is an internet self-diagnosis board of limited utility that makes everyone think they have the flu.

 

Phase 2.  The Big Data Expert System.

 

Today, there is a profusion of online data available.  New articles are posted and published online at ever-increasing rates.  A healthcare professional has about 15 minutes allotted per patient, which includes preparation time, research time, and interview time.  There are tons of background features and data available that must be taken into account before making a diagnosis.  Missing a critical piece of information when making a diagnosis can lead to fairly severe malpractice charges.  A machine learning system might work like this:

 

Input: Fever, Stomachache, Headache.

CrossCheck: Age, Weight, Height, Past Travels, Patient Recent History, Patient Long Term History, Patient Immunization Records, Patient Billing Records, Patient Dietary Records, Patient Skin Tone, Patient Gender, Patient Psychological Profile,…

Output: Flu, 80% probability.  Spotted Fever, 25% probability.

 

The procedure is similar to the basic expert system, with an emphasis on scanning and extracting more information from the tons of available online data.  This online data mining allows for double-checking against more rules.  The primary enabling technology is a more efficient logic module running on bigger and faster processors.  For example, a more efficient means of quickly searching a patient’s text history during a 15-minute visit might make use of text indexing.  Instead of linearly searching tons of patient history text notes during each visit for the words “flu shot,” the big data expert system might include something like an Aho-Corasick automaton or a derivative.

 

 

[Figure: a toy list of terms (top) and the one-pass word index built from it (bottom).]

In this simple example, a list of terms in the text (top) gets transformed and indexed using a one-pass word filter (bottom).  Searching the original text, “flufluorideflushot…”, for each word individually, ad hoc, takes time.  Searching the already processed text for the presence of the word “flushot” is much faster and easier.  Each patient document can be indexed and preprocessed in an overnight batch, ready for a very rapid term search during a patient visit.  This makes it possible to quickly perform a massive, big data crosscheck on any desired patient attribute in a massive, feature-rich database.  This setup might be marketed to the world with slogans like, “Experts agree that the thoroughness of the Systematic AutoDeepDoctor (SADD) improves healthcare practice by 90%…”
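
 

As a hedged sketch of how such an index might work, the following Python builds a small Aho-Corasick automaton over the toy terms from the figure and scans the text in a single pass.  The word list and the text are the figure’s made-up example, not real patient data:

    from collections import deque

    def build_automaton(words):
        """Build an Aho-Corasick automaton: a trie of the search words
        plus failure links that let the text be scanned in one pass."""
        goto, fail, output = [{}], [0], [set()]
        for word in words:
            state = 0
            for ch in word:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    output.append(set())
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            output[state].add(word)
        queue = deque(goto[0].values())       # breadth-first failure links
        while queue:
            state = queue.popleft()
            for ch, nxt in goto[state].items():
                queue.append(nxt)
                f = fail[state]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[nxt] = goto[f].get(ch, 0)
                output[nxt] |= output[fail[nxt]]
        return goto, fail, output

    def search(text, goto, fail, output):
        """Scan the text once, reporting (end_index, word) matches."""
        state, matches = 0, []
        for i, ch in enumerate(text):
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for word in output[state]:
                matches.append((i, word))
        return matches

    goto, fail, output = build_automaton(["flu", "flushot", "fluoride"])
    print(search("flufluorideflushot", goto, fail, output))

One pass over the text finds every occurrence of every stored term at once, which is what makes the overnight-indexing, rapid-lookup workflow described above plausible.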

 

Except that might not be what makes a healthcare professional a healthcare professional either.  The result is a highly linked internet self-diagnosis board of limited utility that makes everyone think they might have cancer and spotted fever instead of the flu.

 

Phase 3.  The Professional Guide.

 

Computers do what they are programmed to do.  Massive amounts of data are available for computers to work with.  Together, they produce big data solutions.  But a healthcare professional is first and foremost a guide.  They do not only heal the sick; they help us live.  A doctor is a teacher.  A professional guide might work like this:

 

Input: Fever, Stomachache, Headache.

InitialResponse: There is a flu going around.  How long have you had these?  How is everyone else in your household?  Take plenty of fluids… Get rest… wait X days and let me know if the symptoms remain… I am here for you…

FollowUpResponse: Still have persistent fever and stomachache... Let us together check for X, confirm or deny Y, try out solution Z...  

 

The professional guide empathizes with the patient, steps into the patient’s metaphorical shoes, sketches out a course of action, and reassures the patient to build trust, confidence, and rapport.  Along the way, they qualitatively crosscheck both within the patient’s history and across other patients.

 

This presents a stark contrast with the machine learning approaches.  Schematically, a machine learning system takes in historical data, stores it in such a way as to maximize generalization, and then extrapolates from this stored data to make predictions.

[Figure: a segregated data structure (left) contrasted with a distributed, connectionist data structure (right).]

On the left is a segregated machine learning data structure.  A segregated data structure works like a database of rules.  It forms the basis for databases, k-nearest neighbors, radial basis functions, Bayesian belief networks, decision trees, linear discriminant analyses, and types of support vector machines using appropriate kernels.  An Aho-Corasick tree works using such a data structure.  The theory and practice behind these forms are clearly understood and mapped.
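
 

As a small illustration of the segregated style, here is a hedged k-nearest-neighbors sketch in Python; note that “training” amounts to storing the labeled examples verbatim, and prediction is a lookup over that store.  The 0/1 symptom encoding and labels are made up for this example:

    import math

    class KNN:
        def __init__(self, k=3):
            self.k = k
            self.examples = []       # the whole "model" is this stored list

        def fit(self, features, labels):
            self.examples = list(zip(features, labels))  # learning == storage

        def predict(self, x):
            nearest = sorted(self.examples,
                             key=lambda ex: math.dist(ex[0], x))[: self.k]
            votes = {}
            for _, label in nearest:
                votes[label] = votes.get(label, 0) + 1
            return max(votes, key=votes.get)

    # Features are (fever, stomachache, headache) as 0/1 indicators.
    model = KNN(k=3)
    model.fit([(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 1)],
              ["flu", "flu", "food poisoning", "migraine"])
    print(model.predict((1, 1, 0)))  # -> flu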

 

On the right is a distributed machine learning structure.  A distributed data structure is modeled (very) loosely on the biological brain.  It forms the basis for types of support vector machines using specialized kernels and for artificial neural networks, machine learning-style.  All rules are shared, combined, and re-combined across the network, which yields damage resistance (losing any one node does not completely remove any crucial rule), fixed storage size (there is never any need to expand the physical storage), and rule re-combination.  Rule re-combination attracts the most attention because it theoretically allows new rules to be formed in unknown, unknowable, and unexpected ways.  It is a high-risk, high-reward, “Hail Mary” approach to striving towards artificial intelligence.

 

In all forms, the goal of the machine learning is to copy and store the known data into the selected data structure.  In a segregated structure, linking a fever to a flu involves simply writing, “IF fever -> THEN flu.”  In a distributed structure, it gets more complex: the network weights are woven such that the input node for fever increases the energy output of the output node for flu and decreases the energy output of all other output nodes.
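
 

A hedged sketch of the distributed version, assuming NumPy and a made-up three-symptom, three-diagnosis encoding: a single weight matrix is trained by gradient descent so that the “rule” linking fever to flu exists only as a pattern spread across shared weights:

    import numpy as np

    rng = np.random.default_rng(0)
    diagnoses = ["flu", "food poisoning", "migraine"]

    # Rows are (fever, stomachache, headache); targets index into diagnoses.
    X = np.array([[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)
    T = np.eye(3)[np.array([0, 0, 1, 2])]     # one-hot targets

    W = rng.normal(scale=0.1, size=(3, 3))    # every "rule" lives in here

    for _ in range(500):                      # softmax cross-entropy descent
        logits = X @ W
        P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        W -= 0.5 * X.T @ (P - T) / len(X)

    # A fever-only input now raises the flu output's energy relative to
    # the others, yet no single weight IS the fever -> flu rule.
    probe = np.array([[1.0, 0.0, 0.0]])
    print(dict(zip(diagnoses, np.round(probe @ W, 2).flatten())))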

 

The key aspect here is that the machine learning algorithms’ goal is essentially to STORE data in a specific data structure.  A medical textbook stores rules.  It can store them ordered by an index or ordered by a stream-of-consciousness narrative, but its job is to store.  A healthcare professional makes decisions and treats.  A medical textbook is no more a physician replacement than is any current instance of a machine-learning algorithm.  A half-written textbook is not in training or learning.  A half-written rule base is not in training or learning.  A half-stored rule base, regardless of data structure, is not in training or learning.  Perhaps “Machine Learning” would be more aptly renamed “Machine Storage.”

 

Getting a computer to Phase 3, then, might not be related to the underlying data structure.  An artificial neural network, machine learning-style, does not provide significantly more ability than an Aho-Corasick type look-up approach.  Clearly, whatever the reason our biological brains use a connectionist data structure, the structure alone is not what makes a person smart like a person.  The structure type may be necessary, but alone it is not sufficient.

 

As Phase 3 may point out, being a human healthcare professional guide is not about storing knowledge.  While storage is important, it is not the defining characteristic.  If learning is part of being human, then it is not simply learning facts and figures.  It might be learning about someone else.  A professional guide learns about the patient and gains their trust and rapport.  A professional guide grows towards the patient and allows the patient to grow towards the guide.

 

Perhaps, on this note, a biologically modeled, distributed, connectionist-like data structure is necessary but not sufficient.  It may be an enabling basis in ways that a segregated structure is not.  By simple definition, the connections in a distributed, connectionist structure can grow.  By simple definition, this is drastically different from the artificial neural network, machine-learning style.  The connections must be allowed to grow within layers, rather than just between layers as in the above figure on the right.  And the connections should allow two-way, antidromic responses.  In this way, learning is not just learning; learning is growing.  When there is a “Machine Growing” algorithm, then we might at last be on the path.
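
 

To close with a purely speculative toy of that idea, here is what the barest “machine growing” step might look like, assuming a made-up Hebbian-style rule: two nodes in the same layer grow a new two-way connection when they co-activate often enough, so the wiring itself changes, not just the weights.  The node names, threshold, and growth rule are all invented for illustration:

    # Speculative toy only; nodes, threshold, and growth rule are invented.
    edges = {}                                 # (a, b) -> weight, two-way

    def co_activate(a, b, strength=0.1, threshold=0.5):
        """Strengthen (or create) a symmetric within-layer connection."""
        key = tuple(sorted((a, b)))
        edges[key] = edges.get(key, 0.0) + strength
        if edges[key] >= threshold:
            print(f"grown link: {key[0]} <-> {key[1]}")

    # Both nodes live in the same layer; repeated co-activation grows a link.
    for _ in range(5):
        co_activate("fever", "rest_advice")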