Oct 31, 2016 | By

Why Big Data Can’t Be Trusted (Part One) – and What to Do About It!

Imagine 100 monkeys typing (presumably randomly) on 100 typewriters for a limitless period of time: Eventually, hidden somewhere in the seemingly endless streams of nonsense, they would produce all of the works of Shakespeare. This popular thought experiment has been around for more than a century (longer than typewriters!) and demonstrates interesting features of both randomness and infinity. It is a useful starting point for discussing unique problems now being encountered with large data sets.


If one were to come across an anonymous digitized copy of the manuscript from the 100 monkeys experiment, it would be an extremely large data set. What if one began to interrogate the data, looking for patterns that suggested a correlation, some meaningful conclusion to be drawn from the data? Among the random variations, one would discover a few rare subsets of the data that appeared to create meaningful phrases in the English language. Probing those subsets, one would, eventually, discover the hidden gem of Shakespeare’s Romeo and Juliet. Would you conclude that this is a really weird but totally random coincidence? Or would you think that there was a hidden intelligence that caused the production of the remarkable work of literature?

Now suppose the data set contains all of the information about stock trades that took place across the world over the past year, along with detailed information about the phases of the moon at the time those trades occurred. If you interrogate that data set, looking for patterns, you will be able to find some subsets of data that suggest a correlation. Probing those subsets, you might be able to draw out a sample showing that a particular limited category of trades was strongly correlated (with an apparently high statistical confidence) with the phases of the moon. Would you conclude that this is a really weird but totally random coincidence? Or would you think that there is some cause, some hidden intelligence, directing those trades? Would you bet money on it?

Unfortunately, people often get caught in the fallacy of spurious correlations. We are a species of “pattern-finders”, after all, and we tend to be attracted to unusual or surprising patterns. Just look at the prevalence of conspiracy theories, some built on the flimsiest of (or no!) evidentiary claims. We also have only a limited appreciation for probabilities and statistics. For example, as cited a number of times by Russ Roberts on EconTalk, if an analyst is using linear regressions to look for some factor that influences stock price, he/she may try and then throw out 19 models that show no correlation, but find that the 20th gives a positive correlation with a statistical confidence of 95%. That confidence level means that, even when no real relationship exists, any single test has a 1 in 20 chance of producing a spurious positive result – a false positive. Across 20 independent attempts, the chance of at least one false positive is 1 − 0.95^20, or about 64%. Looking at the entire process, there can be NO confidence that the result reflects a real effect.
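The arithmetic behind the 20-model example can be checked with a few lines of Python. This is only a minimal sketch under simplifying assumptions: the tests are treated as independent, no real effect exists in any of them, and the function names (`family_wise_false_positive_rate`, `simulate_search`) are made up for illustration rather than taken from any library.

```python
import random


def family_wise_false_positive_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent tests comes out
    'significant' at level alpha when no real effect exists in any of them."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_search(num_tests: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Estimate the same quantity by simulation.

    Assumption: under a true null hypothesis, each test's p-value is
    uniformly distributed on [0, 1], so a draw below alpha is a false positive.
    """
    rng = random.Random(seed)
    searches_with_a_hit = 0
    for _ in range(trials):
        # One "search" = trying num_tests models and keeping any that look significant.
        if any(rng.random() < alpha for _ in range(num_tests)):
            searches_with_a_hit += 1
    return searches_with_a_hit / trials


if __name__ == "__main__":
    print("exact:    ", family_wise_false_positive_rate(20))
    print("simulated:", simulate_search(20, 0.05, trials=20_000, seed=1))
```

Both numbers should land close to 0.64: an analyst who tries 20 unrelated models is more likely than not to find at least one "significant" result by chance alone. Real regressions on the same data are rarely independent, so the exact figure will differ, but the direction of the problem is the same.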


Big data sets are inherently subject to this risk: If you interrogate the data in the hope of finding something, some correlation hidden in the apparent noise (an approach referred to as cherry-picking), you will almost always be able to find a subset of the data that supports what you hoped to find. Sometimes, that one-off conclusion will directly contradict the overall trends overwhelmingly demonstrated by the data. The climate change debate is a good example. When I recently posted an article about the proposed Anthropocene Epoch (see the comments on this post) to an online forum, a few individuals soon overwhelmed the comment section, arguing that there were very specific subsets of data that contradicted the broad consensus of the scientific community.

So, in the debates about climate change, economic policy, medical breakthroughs, or any claim based on purported evidence, whom do you trust? Drawing from my essays on rationality, I recommend that we should:

1 – Recognize the possibility that any given claim may be wrong. The error may be inadvertent, a sincere mistake – even when no real effect exists, a test run at 95% confidence will report one about one time in twenty. Or it may be the result of a misguided analysis that cherry-picks the data or misinterprets the statistics. Or it may be that the claim is pre-conceived and the analysis is just post-hoc rationalization.

2 – Be aware of the potential biases of those making a given claim, as well as your own. Bias comes in many forms, some intentional and many subliminal or subconscious (see e.g.: BIAS). Big data offers big opportunities for bias to leak into the conclusions, particularly since any given pre-conceived position will be able to find support, somewhere, in the data set.

3 – Be open to the possibilities. Specifically, do not assume that evidence is false simply because the validation is weak, and do not assume that the lack of falsification is proof that something is true. Both may be in error. Rather, conclusions should be conditional (they could be wrong) and reflect a balanced view of the evidence and its strength. Broad consensus and many confirmatory findings are trustworthy, while one-off claims and the hyperbole of most media headlines and Internet postings are not.

4 – Try to understand the motivations of those making a claim and to be clear about your own. It does not help in the pursuit of truth to judge others’ claims in light of one’s own predispositions. It is also helpful to consider what drives a writer or researcher – are they motivated by the search for truth, or money, or fame, or influence, or all of the above? Everyone (including me) is influenced, to some degree, by each of these motivations.


The monkey analogy is a cautionary tale for some of the problems of big data. However, there is another side to the analogy. Clearly, we can never interrogate an infinite data set, and the chances of actually finding Shakespeare in a random finite data set are vanishingly small (but theoretically not zero). But let’s suppose we encounter a very large data set, one that appears to be random, and within that data there are what appear to be anomalies, a quirkiness in the randomness. An analysis of these quirks may show them to be well below the threshold of statistical significance when applied to the entire data set. But can we be sure that the data set is truly random? Not according to Carl Sagan, who said, “The absence of evidence is not evidence of absence.”
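To get a rough feel for "vanishingly small but not zero", one can estimate how often a short phrase would be expected to turn up in a finite stream of random keystrokes. The sketch below rests on assumptions that are not in the original thought experiment: a 27-key typewriter (26 letters plus a space bar), every key equally likely, and each keystroke independent. The function name `log10_expected_matches` is hypothetical.

```python
import math


def log10_expected_matches(phrase: str, stream_length: int, alphabet_size: int = 27) -> float:
    """Base-10 logarithm of the expected number of times `phrase` appears
    in a uniformly random stream of `stream_length` characters.

    Assumption: each character is drawn independently and uniformly from
    an alphabet of `alphabet_size` symbols.
    """
    positions = stream_length - len(phrase) + 1  # possible starting points
    if positions <= 0:
        return float("-inf")  # the stream is too short to contain the phrase
    # Expected count = positions * (1 / alphabet_size) ** len(phrase);
    # logarithms avoid underflow for long phrases.
    return math.log10(positions) - len(phrase) * math.log10(alphabet_size)


if __name__ == "__main__":
    # An 18-character phrase in a trillion random keystrokes.
    print(log10_expected_matches("to be or not to be", 10**12))
```

For this 18-character phrase in a trillion keystrokes, the result is close to -14, meaning an expected count of roughly one in a hundred trillion. Each additional character divides that by another 27, which is why a full play is effectively out of reach for any finite data set even though the probability never reaches zero.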

Let’s consider, then, the very large data set of human experience, and the claim that there are no miracles. Clearly, there have been some events in individual human lives that the individual might perceive as miraculous – surviving a plane crash; falling in love; hearing the voice of a deceased loved one; spontaneous remission of disease; experiencing a transcendent presence. There are also many texts that report such occurrences.

It may be true that some, and potentially all, such experiences are simply the result of random coincidence misinterpreted by the human mind as being purposeful acts of an intelligent agency. But it may also be true that some of those experiences, and potentially many others that were never observed or reported, were the result of a divine agency. Statistical ex-post analyses of this data set are not going to resolve the question. (See also: Miracles, More Data…)

