How to Write a Truly Effective Business Data Report (2015)

 

Question: There is a list or data table in the following format:

 

[Name, Items Bought]

 

with the following sample snippet:

 

[Jim, 15]

[Peter, 25]

[Samantha, 23]

[Jim, 43]

[Peter, 12]

[Jessica, 12]

 

Efficiently put together a report totaling the items by unique name. 

 

Solution 1: Quickly use a pre-built SQL function since SQL is specialized for exactly this question.  The SQL code would be this:

 

SELECT Name, SUM(Items) AS TotalItems

FROM Purchases

GROUP BY Name

 

This should take all of 30 seconds to set up.
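
For readers who want to try the query on the sample snippet, here is a minimal sketch using Python's built-in sqlite3 module. The table name purchases and the column names name and items are assumptions made for illustration; the original question does not name them.

```python
import sqlite3

# Sample rows from the question. The table and column names below are
# hypothetical; the question only gives the [Name, Items Bought] layout.
rows = [
    ("Jim", 15), ("Peter", 25), ("Samantha", 23),
    ("Jim", 43), ("Peter", 12), ("Jessica", 12),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE purchases (name TEXT, items INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# GROUP BY collapses the rows to one per name; SUM totals each group.
totals = conn.execute(
    "SELECT name, SUM(items) AS total_items "
    "FROM purchases GROUP BY name ORDER BY name"
).fetchall()
conn.close()

print(totals)
# [('Jessica', 12), ('Jim', 58), ('Peter', 37), ('Samantha', 23)]
```

The ORDER BY clause is added only to make the output order predictable; without it, the row order is up to the database.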

 

Solution 2: Quickly use another pre-built SQL-like function in any supporting programming language of choice, say a dictionary in Python or a Map (a kind of associative array) in Java.  The code, here in Python, would be like this:

 

import collections

# Missing names start at 0, so no existence check is needed.
report = collections.defaultdict(int)

for name, count in table:
    report[name] += count

 

This should take all of 45 seconds to set up.
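
Applied to the sample snippet, the dictionary approach might look like the sketch below. The variable names table and report are placeholders, and the data is just the six sample rows from the question.

```python
from collections import defaultdict

# The six sample rows from the question, as (name, items bought) pairs.
table = [
    ("Jim", 15), ("Peter", 25), ("Samantha", 23),
    ("Jim", 43), ("Peter", 12), ("Jessica", 12),
]

# defaultdict(int) supplies 0 for a name seen for the first time.
report = defaultdict(int)
for name, count in table:
    report[name] += count

# Dictionaries keep insertion order, so names print in first-seen order.
for name, total in report.items():
    print(f"{name}: {total}")
# Jim: 58
# Peter: 37
# Samantha: 23
# Jessica: 12
```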

 

Simple.  Effective.  Fast.  Anyone at all involved in this process – from report requesting to report generation – should know the basic idea within their first year working.  It is a transformation and filtering of the data into… more data.

 

Analogy: humans visually process the scene in front of them.  Light reflected from the environment enters the eye, through the lens, through the vitreous humor, and onto the retina. 

Then the light stimulates specific retinal photoreceptors (e.g. rods, cones) and their associated retinal neurons (e.g. horizontal cells, amacrine cells), whose output leaves the eye through the ganglion cells of the optic nerve and reaches the lateral geniculate nucleus (e.g. parvocellular – small, colored, steady detectors; and magnocellular – large, motion, fast detectors) before projection to the visual cortex at the back of our heads.  At the end of this process, the image (i.e. as projected to the striate cortex, V1) is a roughly upside-down version of the outside environment, twisted and magnified as if in a convoluted mirror. 

 

It is a transformation and filtering of the data into… more data.

 

There are thus two issues with this question and the answers. 

 

1. Is the set up investment worth the expected use? 

The trained, yet unseasoned analyst will know about the built-in shortcuts and functions in SQL, Java, Python, visual cortex, and the like.  Using these shortcuts is a textbook exercise taking about 30-45 seconds to set up.  

 

The really trained analyst will know how to recreate these functions from scratch – especially since not all languages will have these pre-built functions.  Creating the function requires extensive indexing (e.g. Aho-Corasick dictionary matching complete with index links) which may take a one-time setup investment of days before each individual 30-45 second setup.  The alternative is to use a nested pass to meticulously compare, group, count, and sort the names and items bought.  There is no setup investment, but each individual run may take hours.
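
To make the "nested pass" alternative concrete, here is a rough sketch in Python that avoids any pre-built dictionary. The function name total_by_name_nested is hypothetical, and this is only one way to write such a pass. Every incoming row is compared against the list of names seen so far, which is why the running time grows roughly with the square of the number of unique names on large inputs.

```python
def total_by_name_nested(table):
    """Group and total rows without a hash-based dictionary."""
    names = []   # unique names, in first-seen order
    totals = []  # totals[i] is the running total for names[i]
    for name, count in table:
        # Inner pass: linear scan over the names seen so far.
        for i, seen in enumerate(names):
            if seen == name:
                totals[i] += count
                break
        else:
            # The scan found no match, so start a new group.
            names.append(name)
            totals.append(count)
    return list(zip(names, totals))


table = [
    ("Jim", 15), ("Peter", 25), ("Samantha", 23),
    ("Jim", 43), ("Peter", 12), ("Jessica", 12),
]
result = total_by_name_nested(table)
print(result)
# [('Jim', 58), ('Peter', 37), ('Samantha', 23), ('Jessica', 12)]
```

On six rows the difference from the dictionary version is invisible; the cost of the inner scan only shows up as the number of distinct names grows.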

 

The seasoned professional needs to ask and project how many times this function may be used BEFORE being anchored and biased towards using it just because it exists.  This is a variation on the sunk cost / moral hazard concepts – just because it exists and cost so much investment does not mean it is necessarily appropriate to use.  The pre-built function exists to do what it was built to do.  Using it as a choice forces the analyst to frame the question in terms of multiple choices on the terms of the pre-built function.  Just because we spent so much to get a car does not mean we should drive 50 meters to the grocery store.  This is a serious bias.  

 

2. The ultimate and ultimately more important question is:

What is the point of the data reporting?  In other words, what was the original question before it got to the one listed at the beginning?  Report writing has hidden (latent) variables.  Real life has hidden questions.

 

Working backwards, Professor Kahneman in Thinking, Fast and Slow would say the common question listed at the beginning is the easier-to-answer heuristic question in response to a hidden, more difficult target question.  In more comical terms, Scott Adams in Dilbert often has the joke along the lines that people’s questions are those they are best at answering – such that the marketing executive wants more marketing research, the engineer wants to build an engineering model, and the hot-tempered impatient folks want to “kick some hiney.”  They all ignore the original question and reframe it in terms that they each individually know.

 

The original question – in one form or another a key one for all businesses – is what do the customers want and how do we give it to them?  The original question – in one form or another a key one for all living animals – is how do I fit and adapt to this environment?  Focusing on the reporting scale might serve as a distraction from the key scale.  

 

Eyeballs, surprisingly enough to us humans biased towards making everything visual, are technically optional for seeing.

 

One final note to belabor the connection between data structures and their intended use and nomenclature:  the SQL-friendly dictionaries above are commonly called “associative arrays” (Java's version is the Map interface), as if they form associations much like associative learning.  This is a misnomer of an egregious order.  Here is why:

 

An array data structure is a numeric-indexed set of data entries.

data[1] = “a”

data[2] = “b”

data[3] = “c”

To access the 3rd data entry, we do not need to check the first, then second, then third data entry.  We simply access data record [3].  Think of a physical library.  We use the Dewey Decimal System to locate a text by its numeric index code. 

 

An associative array data structure is a non-numeric-indexed set of data entries. 

data[“gray”] = “a”

data[“blue”] = “b”

data[“pink”] = “c”

To access what letter “associates” with “blue”, we simply access data record [“blue”].  This is like saying the data array “associates” the letter “c” with the number 3.  Technically this is true, but is seriously misleading.
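
The two lookups can be put side by side in a short Python sketch. The variable names are placeholders, and note one difference from the listing above: Python lists count from 0, so the third entry is at index 2.

```python
# A numeric-indexed array (a Python list). Indexing starts at 0,
# so the 3rd entry lives at index 2.
data = ["a", "b", "c"]
third = data[2]

# A non-numeric-indexed "associative array" (a Python dict).
colors = {"gray": "a", "blue": "b", "pink": "c"}
blue = colors["blue"]

# The plain array can be rewritten as a mapping from position to entry,
# which is the sense in which it "associates" a number with a letter.
as_mapping = dict(enumerate(data))

print(third, blue, as_mapping)
# c b {0: 'a', 1: 'b', 2: 'c'}
```

In both cases the entry is fetched directly by its key, without walking through the earlier entries.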

 

Human associative memory refers to more of a self-organizing, top-down selective attention-mediated bottom-up feature clustering coincident detector.  Clearly, this is for neither here nor now.  But suffice it to say, it is the opinion of this scientist that the terms “associative array” and “machine learning” should be understood as differing from neuro-biological association and learning.  Perhaps at least internally, they would be better served as “non-numeric-indexed data” and “machine storage.”  This may help alleviate the inherent bias to use this pre-built functionality for purposes other than those for which it was intended.