Analysis: How Medium Data Can Address Big Data's Woes

May 1, 2013


We live in a world that is being transformed by Big Data. The ability to gather and process vast amounts of information, and the development of analytical tools to find meaning in it all, promises to change business, science and social practices.

It's already changing the way companies like Target and Walmart do business and how the CDC plans for 'flu season. It's a hot topic for good reason. But it also has its limitations and problems, which are becoming more apparent as Big Data becomes more widely used. Ultimately, the solution to these problems will lead us to a new way of looking at analytical work, a more focused and rigorous approach I call Medium Data.

To illustrate what I mean by Medium Data, I'm going to look at an example of a Big Data model that has sometimes gone awry, and what its errors tell us about what we need to do to improve Big Data models.

Google Flu Trends is a service provided by google.org that estimates current 'flu intensity by analyzing what search terms people in different parts of the world are using on Google's search engines. To create the model, scientists at Google and the CDC analyzed the 50 million most commonly-searched-for terms on Google between 2003 and 2008, and looked at what terms correlated most closely with the number of people reported to have 'flu by the CDC. After some numerical analysis, they determined that the prevalence of 45 specific search terms correlated with the CDC 'flu tracking numbers closely enough that they could be used for tracking purposes going forwards.
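
To make that procedure concrete, here is a minimal sketch of the kind of correlation ranking described above, written in Python. The file names, column names, and the use of simple Pearson correlation are assumptions for illustration only; they are not details of Google's actual pipeline.

    import pandas as pd

    # Hypothetical inputs: weekly CDC 'flu ("ILI") rates and weekly search volume
    # per term. The file and column names are illustrative, not Google's data.
    cdc = pd.read_csv("cdc_ili_weekly.csv", index_col="week")          # column: ili_rate
    queries = pd.read_csv("query_volume_weekly.csv", index_col="week") # one column per term

    # Align the two datasets on the same weeks.
    queries = queries.loc[cdc.index]

    # Rank every candidate term by how closely its volume tracks reported 'flu activity.
    correlations = queries.corrwith(cdc["ili_rate"]).sort_values(ascending=False)

    # Keep a shortlist of the best-correlated terms for further screening.
    top_100 = correlations.head(100)
    print(top_100.head(10))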

Google Flu Trends is one of the poster children for Big Data. It is a great example of how a massive amount of data can be analyzed in creative ways to produce unexpected new results.

But Google Flu Trends also shows the limits of the Big Data model. At times, it has gone off the rails. In 2009 it underestimated 'flu cases at the start of the H1N1 pandemic, and last winter it badly overestimated 'flu levels in the United States, at one point estimating that 11% of the US population was down with 'flu when the actual level was 6%. These misses, both of which occurred when the model was most needed, illustrate one pitfall of Big Data: it can be very helpful, but it can also go wrong for reasons that are hard to analyze.

This points to the need for a better approach to Big Data analysis, and the central question is: why does a model like Google Flu Trends sometimes fail?

There are at least three main problems with Big Data analysis as it is currently practiced. Two of them have received attention recently, but the third has largely been overlooked, or even considered a "feature" of Big Data rather than a problem.

The first issue, raised by Nassim Nicholas Taleb in his Wired article Beware the Big Errors of 'Big Data', is the fact that big data causes big random signals. As Taleb puts it, "big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal)." The issue is that when you have a big enough dataset, and you examine it along enough different dimensions, you will inevitably find "signals" -- i.e., things that stand out statistically -- that are illusions.
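
A quick simulation illustrates Taleb's point. Every number below is arbitrary; the only claim is that, given enough unrelated series, some will correlate strongly with any target purely by chance.

    import numpy as np

    rng = np.random.default_rng(0)

    # A "target" signal (say, weekly 'flu counts) and many unrelated random series
    # (say, search volumes for terms that have nothing to do with 'flu).
    weeks, n_terms = 52, 50_000
    target = rng.normal(size=weeks)
    noise = rng.normal(size=(n_terms, weeks))

    # Correlation of each pure-noise series with the target.
    corr = np.array([np.corrcoef(series, target)[0, 1] for series in noise])

    # Every series is noise, yet the best few look impressively "predictive".
    print("strongest spurious correlation:", round(float(np.abs(corr).max()), 2))
    print("series with |r| > 0.4:", int((np.abs(corr) > 0.4).sum()))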

The second criticism, well-articulated by Kate Crawford in her recent piece in the Harvard Business Review, Hidden Biases in Big Data, is that Big Data does not equal unbiased data: "Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves."

She uses the example of analysis of Twitter activity during Hurricane Sandy. Twitter provided interesting information about people who were still online or able to use their cellphones as the storm passed over, but no information at all about people who had lost power and cellphone signal -- i.e., the people who were really in the thick of things. Conclusions drawn from the data might therefore prove untrustworthy to agencies planning for future similar events.

To these two criticisms, I would add a third, which is that Big Data analysis lacks a systematic process for testing accuracy, robustness and the costs of errors, and traditional methods of testing these things are no longer sufficient. In other words, we currently have no way of knowing whether we can trust any given Big Data model.

There are two main causes of this problem.

First, one of the central ideas behind Big Data is that sufficient data allows us to relax the requirement to establish causation and rely on correlation alone.

In a traditional analytical project, one starts with a hypothesis and uses a set of statistical tools to determine whether the hypothesis holds. For example, a hypothesis might be that people with the 'flu are more likely to use the search term "flu symptoms." Testing it might consist of finding 1,000 people with the 'flu and 1,000 people without it, and seeing which group has searched for "flu symptoms" on Google over the past 24 hours. If significantly more people with the 'flu have searched for "flu symptoms", then we have supported our hypothesis (or at least shown a correlation). As I discussed earlier, Google Flu Trends works at the problem from the other end by examining all search terms and determining which ones correlate most closely with 'flu outbreaks. The idea is that if we study a big enough dataset, we can create a predictive model that, even if we don't understand all the underlying details, nevertheless produces useful predictions. The problem is that correlation cannot stand in for causation as much as we might hope.
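
As a sketch of what that traditional test might look like in practice, here is the comparison run as a simple chi-square test. The counts are invented purely for illustration.

    from scipy.stats import chi2_contingency

    # Hypothetical counts: how many people in each group of 1,000 searched for
    # "flu symptoms" in the past 24 hours. The numbers are invented.
    searched_with_flu = 180
    searched_without_flu = 40

    table = [
        [searched_with_flu, 1000 - searched_with_flu],        # 'flu group
        [searched_without_flu, 1000 - searched_without_flu],  # non-'flu group
    ]

    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}")
    # A very small p-value supports the hypothesis -- but it still only
    # demonstrates a correlation, not a cause.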

Second, Big Data models can be inherently fragile in ways we have not seen before.

Taking a look at Google Flu Trends again, the original researchers did everything by the book. They took a huge dataset of 50 million search terms and filtered it down to the top 100 that showed the greatest correlation with 'flu prevalence. From that list of 100, they then found that using the top 45 produced the best model. Furthermore, they looked closely at all of the search terms in the top 100 and confirmed that the top 45 were all 'flu-related in some way. The fact that their automated algorithm discarded terms like "Oscar nominations", which are correlated with 'flu outbreaks because the Oscar nominations come out during 'flu season, gave further credence to the model. Finally, they tested the model on historical data excluded from the original study and verified that it produced accurate predictions.
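
To give a rough sense of those last steps in code, the sketch below selects the 45 best-correlated terms, fits a simple linear model, and checks it against held-out weeks. This is a deliberate simplification of the published Flu Trends methodology, with hypothetical file names and an assumed train/test split.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical data as in the earlier sketch: weekly search volume per term
    # and weekly CDC 'flu rates, aligned on the same weeks.
    cdc = pd.read_csv("cdc_ili_weekly.csv", index_col="week", parse_dates=True)
    queries = pd.read_csv("query_volume_weekly.csv", index_col="week", parse_dates=True)
    queries = queries.loc[cdc.index]

    # Keep the 45 terms whose volume correlates most strongly with 'flu activity,
    # and combine them into a single aggregate "flu search" signal.
    top_45 = queries.corrwith(cdc["ili_rate"]).nlargest(45).index
    X = queries[top_45].sum(axis=1).to_frame("flu_search_volume")
    y = cdc["ili_rate"]

    # Fit on earlier seasons and test on weeks the model has never seen.
    train = X.index < "2008-01-01"   # assumed cut-off, for illustration only
    model = LinearRegression().fit(X[train], y[train])

    errors = (model.predict(X[~train]) - y[~train]).abs()
    print("mean absolute error on held-out weeks:", round(float(errors.mean()), 3))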

To most observers, the strategy followed by the Flu Trends researchers would seem to be free from the common errors identified by Taleb and Crawford. The correlations don't seem spurious, and there's little sign that the data set or analysis might be biased. So what went wrong? How could we tell that the model was vulnerable to error?

Google hasn't provided any official word about what went wrong with the model, but there seem to have been different causes for the two failures. When the H1N1 outbreak occurred, a new search term that the model didn't consider significant, "H1N1", suddenly became common. In the winter of 2012-13, there seems to have been a feedback loop: media coverage of the 'flu season, partly caused by Flu Trends itself, caused a spike in searches for the terms Flu Trends monitors.

In the first case, the problem was temporal: a set of search terms that had worked before was supplanted when something new came along -- people stopped searching for "flu symptoms" and started searching for "h1n1 symptoms". In the second, the problem was a feedback loop: Flu Trends itself caused a change in search behavior that was unrelated to people actually getting the 'flu.

Identifying these kinds of problems ahead of time is the challenge that needs to be met. I've been thinking of this for a while as "Medium Data", because it requires distilling the problem down to a size and scope that a human being can understand. The goal of Medium Data is to identify the vulnerabilities in Big Data models, and to make the models more robust by addressing those vulnerabilities.

How might a Medium Data approach help with a system like Flu Trends?

First, it would identify the vulnerabilities. Flu Trends tracks 'flu outbreaks over time, so it is subject to temporal risks: people changing their search terms, changing their search habits (e.g., searching from a phone rather than a PC), and so on. Flu Trends is also based on search traffic, which is influenced by, among other things, reports in the news, and Flu Trends has itself become newsworthy. So we have a feedback loop vulnerability too.

Having identified these vulnerabilities, we then need to see how they can be addressed.

First, the temporal issues: when people search for "flu symptoms", they are really searching for the symptoms of whatever we are currently calling "the flu". If a new strain appears, say H1N1, then a robust tracking system will add the new terminology to its tracked search terms, so that "h1n1 symptoms" becomes one of them.
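
One way to catch that kind of shift, sketched below with assumed data, window length, and threshold, is to periodically scan recent query volumes for terms that have suddenly started tracking the model's target and flag them for review.

    import pandas as pd

    def find_emerging_terms(queries: pd.DataFrame, ili: pd.Series,
                            tracked: set, weeks: int = 8,
                            min_corr: float = 0.7) -> list:
        """Flag untracked search terms whose recent volume tracks 'flu activity.

        `queries` holds weekly volume per term and `ili` the CDC 'flu rate.
        The window length and correlation threshold are illustrative guesses.
        """
        recent = queries.tail(weeks)
        corr = recent.corrwith(ili.loc[recent.index])
        candidates = corr[(corr > min_corr) & ~corr.index.isin(tracked)]
        return candidates.sort_values(ascending=False).index.tolist()

    # During the 2009 outbreak, a scan like this might have surfaced "h1n1 symptoms"
    # as a candidate to add alongside the existing tracked terms, e.g.:
    # new_terms = find_emerging_terms(queries, cdc["ili_rate"], tracked=set(top_45))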

Second, the feedback loop problem could be addressed by measuring the correlation between news articles about 'flu and searches for 'flu-related terms. This might act as a "damping mechanism" that adjusts estimates downward, because people are more likely to search for 'flu-related terms when they see news articles about the topic.
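
As a crude sketch of that damping idea, one could scale the raw estimate down by the share of recent 'flu-related search volume that appears to be news-driven rather than illness-driven. The linear form and the coefficient below are assumptions made purely for illustration, not a validated correction.

    def damped_estimate(raw_estimate: float,
                        news_driven_share: float,
                        damping_strength: float = 0.8) -> float:
        """Reduce a raw 'flu estimate when searches appear to be news-driven.

        `news_driven_share` is the estimated fraction of recent 'flu-related
        search volume attributable to media coverage (0.0 to 1.0), e.g. derived
        from regressing search volume on counts of 'flu news articles. The
        linear form and the default damping_strength are illustrative assumptions.
        """
        return raw_estimate * (1.0 - damping_strength * news_driven_share)

    # Example: an 11% raw estimate, with roughly half of the recent search spike
    # judged to be news-driven, would be damped to about 6.6%.
    print(damped_estimate(11.0, news_driven_share=0.5))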

The goal of making these adjustments is to make the model more robust. In particular, we want it to hold up under all conditions, especially when its results are potentially most valuable (such as during a major 'flu outbreak). Note that this doesn't necessarily mean that the model is more accurate mathematically. In fact, one might trade off some accuracy under normal conditions for accuracy under the extreme conditions when the information it provides might save lives.

Working out these trade-offs, and building the tools to help us improve models, is the central task for Medium Data.

In my next post, I will look in more detail at what makes a Big Data model vulnerable, and how we can measure the severity of the vulnerability, by looking at two well-known services: Google's search term spell checker and the airline ticket price predictor Farecast. In subsequent posts, I will examine ways we can address those vulnerabilities in the models themselves.

Bruce Nash bruce.nash@the-numbers.com

Filed under: Analysis