Identifying Computational Bias Effects in Machine Learning (2014)

 

“Computers do not make mistakes.” 

A computer program traditionally follows a “fetch-execute” cycle.  That is, from the hardware up to the high-level language instructions of Matlab or C or Java or Python or R, everything is geared toward “fetching” the next instruction and “executing,” or carrying out, that instruction.  Whether the program directs a radar tracking beam toward a certain point in space, monitors heartbeats from an EKG, or renders alien plasma cannon blasts in accordance with a gaming console controller, all computer programs carry out arbitrary instructions in a repeating “fetch-execute” cycle.  Ever wonder what those gigahertz numbers mean on a given smartphone, laptop, or ye olde desktop of yesteryear?  They describe how quickly the hardware can run through that cycle.  Adding more cycle speed and more workspace (i.e., Hertz and Bytes) lets the computer carry out arbitrary instructions with arbitrary speed and precision.  Computers do not make mistakes.  Computers compute and nothing more.  The question of what to compute is deferred.
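
For the curious, the whole cycle can be caricatured in a few lines of Python.  This is a toy sketch, not any real instruction set; the opcode names and the three-instruction sample program are invented purely for illustration:

    # A toy fetch-execute loop. The instruction set (LOAD, ADD, PRINT)
    # and the sample program are invented for this illustration only.
    def run(program):
        accumulator = 0
        pc = 0                                # program counter
        while pc < len(program):
            opcode, operand = program[pc]     # "fetch" the next instruction
            pc += 1
            if opcode == "LOAD":              # "execute" it
                accumulator = operand
            elif opcode == "ADD":
                accumulator += operand
            elif opcode == "PRINT":
                print(accumulator)

    run([("LOAD", 2), ("ADD", 3), ("PRINT", None)])   # prints 5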

 

In 1989, Hornik, Stinchcombe, and White mathematically proved that an artificial neural network with sufficient time and appropriate structural complexity can approximate (i.e. compute) essentially any mathematical mapping function with arbitrary fidelity.  In essence, with enough “Hertz” and “Bytes” equivalents, an artificial neural network has the full capability of any computer.  This was a hugely, if indirectly, influential finding in support of artificial neural networks.  Yet it may still miss the point.  But first, some back story.
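
The flavor of the result can be sketched in a few lines, assuming NumPy is available.  A single hidden layer of tanh units with randomly fixed input weights, plus a least-squares fit of only the output weights, already drives the approximation error on a smooth target like sin(x) toward zero as the hidden layer widens.  (The target function, the unit counts, and the random-feature shortcut are choices made for this illustration; they are not Hornik et al.'s actual construction.)

    # Sketch: a one-hidden-layer tanh network approximating sin(x).
    # Hidden weights are random and fixed; only the linear output layer
    # is fit, by least squares. Watch the error fall as the layer widens.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(x)

    for hidden_units in (2, 10, 100):
        W = rng.normal(size=(1, hidden_units))        # input-to-hidden weights
        b = rng.normal(size=hidden_units)             # hidden biases
        H = np.tanh(x @ W + b)                        # hidden activations
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights
        error = np.max(np.abs(H @ beta - y))
        print(f"{hidden_units:4d} hidden units: max error {error:.4f}")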

 

What is an artificial neural network?  The answer differs depending on the perspective.  To the popular imagination, these are the Terminator or Matrix neural nets that allow an evil artificial intelligence to take over the world.  To computer science and mathematics, these are interesting but highly specific fringe models, somewhat inspired by neurophysiological findings.  To econometrics, these are like extended multi-layer regression models.  To a neuroscientist, this is anything and everything that concretely or abstractly explains either the form or the function of the nervous system.

 

It started in the modern era with Hodgkin and Huxley (1952).  They isolated and prodded a real-world, biological, macro-scale neuron in a squid.  Rosenblatt (1958) followed and captured this phenomenon-turned-abstraction with an algorithm: the perceptron.  Competing for finite intelligence research funding, Minsky and Papert (1969) argued that these upstart artificial neural network algorithms could not fit even a simple XOR function, let alone more complex arbitrary functions.  Rumelhart, Hinton, and Williams (1986) answered with a multi-layer extension of Rosenblatt’s work that could.  Hornik et al. (1989) then mathematically showed that such networks can fit even arbitrarily complex functions, provided sufficient structural depth and complexity.  Hornik et al. and others thus demonstrated – or rather, re-demonstrated – the viability of artificial neural networks.
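
For reference, Rosenblatt’s perceptron boils down to a weighted sum, a threshold, and an error-driven weight update.  The sketch below is a standard modern formulation rather than Rosenblatt’s original hardware, and the AND-gate data set is chosen only because it is linearly separable, so the rule provably converges on it:

    # Rosenblatt-style perceptron: weighted sum, threshold, and an
    # error-driven update. The AND-gate data is an example; any linearly
    # separable set would do.
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])                 # logical AND

    w = np.zeros(2)
    bias = 0.0
    for _ in range(10):                        # a few passes over the data
        for inputs, target in zip(X, y):
            prediction = int(w @ inputs + bias > 0)
            w += (target - prediction) * inputs     # Rosenblatt's update
            bias += target - prediction

    print([int(w @ inputs + bias > 0) for inputs in X])   # [0, 0, 0, 1]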

 

But they demonstrated this viability within the bounds as they understood them.  They showed that an artificial neural network could compute like a computer.  The best computers make no mistakes.  To compete, an artificial neural network would need a structure deep and complex enough to handle anticipated computations with no discernible mistakes.  This implies two requirements.  First, the artificial neural network needs to know in advance what the computations are going to be.  Second, the network needs to be at least complex enough to handle them.  Taken together, this makes an artificial neural network both harder to use, because it cannot handle arbitrary computations ad hoc, and harder to comprehend, because it is potentially far more complex.  This is in comparison to purpose-built computational tools such as decision trees, support vector machines, Bayesian networks, linear discriminant analyses, and others.

 

But Hodgkin and Huxley started in the modern era with a squid nerve.  They were not looking for a new form of computer.  They were simply exploring how a sample nerve functions.  They worked on a squid because it was accessible – because there are few ethics considerations when dissecting uncooked calamari, and because the central nerve is easy to see and manipulate with the naked eye.  Adding on mathematical proofs of its ability to approximate arbitrary mapping functions like a computer is interesting, but patently not the point.  A neuroscientist’s point of view is that neurons correlate with core behavioral intelligence.  Any model incorporating neurons in any concrete or abstract manner is a neural network.  This causes literal semantic confusion: a neuroscience neural network à la Hodgkin and Huxley is not the same thing as a computational neural network à la Minsky and Papert, Hornik et al., or Rumelhart et al.  It also causes confusion in goals.  A computational model intends to compute, with high fidelity, any arbitrary future unknown function.  A neuroscience neural model intends to understand how a mere collection of physical neurons can be responsible for known everyday behaviors in vivo.  Arbitrary future data has little to do with it.  A natural neural network evolved with natural physical environments in mind, and natural physics is anything but arbitrary.

 

An XOR function is a simple abstract arbitrary function.  It follows the rule, “IF only A or only B, then True; IF both or neither, then False.”  It is a simple, abstract, non-linear, and non-additive relationship.  It serves as a “Hello World” test for computational discrimination models and as a harbinger of the future functions such a model must handle.  On a computer scale, the more functions a model can handle, and the fewer mistakes it makes, the better.  On a neuroscience scale, XOR is irrelevant to survival.  The test is neither indicative nor useful in that context.
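
Concretely, XOR is what a single linear threshold unit cannot express – Minsky and Papert’s point – and what one hidden layer already can – Hornik et al.’s point in miniature.  A minimal hand-wired sketch, with the weights chosen by inspection for this illustration rather than learned:

    # XOR from threshold units. No single weighted sum can separate
    # XOR's outputs, but an OR detector and a NAND detector feeding an
    # AND output unit compute it exactly. Weights are hand-picked here.
    def step(x):
        return 1 if x > 0 else 0

    def xor_net(a, b):
        h_or = step(a + b - 0.5)          # fires unless both inputs are 0
        h_nand = step(-a - b + 1.5)       # fires unless both inputs are 1
        return step(h_or + h_nand - 1.5)  # fires only if both hidden fire

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", xor_net(a, b))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0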

 

Perhaps the consensus that computers have not – and might never – achieve that which makes a human a human has nothing to do with the number of mistakes made in mapping an arbitrary function.  It might have nothing to do with more Hertz and Bytes.  Adult humans have about 1400 grams of neurons in the skull.  Young humans have about 500 grams.  Both are human.  Whales have about 9000 grams; a dog has about 70.  Neither is human.  None are computers.

 

More neurons are not smarter.  Fewer neurons are not smarter.  Perhaps the mistake in understanding neural networks lies in attempting to have them make no mistakes.  Computers attempt to make no mistakes.  Perhaps neural networks – the original, neuroscience kind that explores in vivo behavior – attempt to determine what is and is not a mistake.

 

In practical terms, this means researchers should not focus so heavily on getting artificial neural networks to fit arbitrary mathematical goals and data sets quickly and accurately.  The focus should instead be on figuring out what the goals and the data sets represent in vivo – on what they are and what to do with them.  To fit or not to fit, that is not the question.  The mistake was in trying to fit at all.