Predicting the future is hard, but predicting the present is easier.
The second main strategy used by researchers with observational data is forecasting. Predicting the future is notoriously difficult, but it can be incredibly important for decision makers, whether they work in companies or governments.
Kleinberg et al. (2015) offer two stories that clarify the importance of forecasting for certain policy problems. Imagine one policy maker, I’ll call her Anna, who is facing a drought and must decide whether to hire a shaman to do a rain dance to increase the chance of rain. Another policy maker, I’ll call him Bob, must decide whether to take an umbrella to work to avoid getting wet on the way home. Both Anna and Bob can make a better decision if they understand weather, but they need to know different things. Anna needs to understand whether the rain dance causes rain. Bob, on the other hand, does not need to understand anything about causality; he just needs an accurate forecast. Social researchers often focus on what Kleinberg et al. (2015) call “rain dance–like” policy problems, those that focus on causality, and ignore “umbrella-like” policy problems that are focused on forecasting.
I’d like to focus, however, on a special kind of forecasting called nowcasting, a term derived from combining “now” and “forecasting.” Rather than predicting the future, nowcasting attempts to predict the present (Choi and Varian 2012). In other words, nowcasting uses forecasting methods for problems of measurement. As such, it should be especially useful to governments that require timely and accurate measures about their countries. Nowcasting can be illustrated most clearly with the example of Google Flu Trends.
Imagine that you are feeling a bit under the weather, so you type “flu remedies” into a search engine, receive a page of links in response, and then follow one of them to a helpful webpage. Now imagine this activity being played out from the perspective of the search engine. Every moment, millions of queries are arriving from around the world, and this stream of queries, what Battelle (2006) has called the “database of intentions,” provides a constantly updated window into the collective global consciousness. However, turning this stream of information into a measurement of the prevalence of the flu is difficult. Simply counting up the number of queries for “flu remedies” might not work well. Not everyone who has the flu searches for flu remedies, and not everyone who searches for flu remedies has the flu.
The important and clever trick behind Google Flu Trends was to turn a measurement problem into a forecasting problem. The U.S. Centers for Disease Control and Prevention (CDC) has an influenza monitoring system that collects information from doctors around the country. However, one problem with this CDC system is that there is a two-week reporting lag: the time it takes for the data arriving from doctors to be cleaned, processed, and published. But when handling an emerging epidemic, public health offices don’t want to know how much influenza there was two weeks ago; they want to know how much influenza there is right now. In fact, many other traditional sources of social data also have gaps between waves of data collection and reporting lags. Most big data sources, on the other hand, are always-on (Section 2.3.1.2).
Therefore, Jeremy Ginsberg and colleagues (2009) tried to predict the CDC flu data from the Google search data. This is an example of “predicting the present” because the researchers were trying to measure how much flu there is now by predicting future data from the CDC, future data that is measuring the present. Using machine learning, they searched through 50 million different search terms to see which were most predictive of the CDC flu data. Ultimately, they found a set of 45 different queries that seemed to be most predictive, and the results were quite good: they could use the search data to predict the CDC data. Based in part on this paper, which was published in Nature, Google Flu Trends became an often-repeated success story about the power of big data.
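To make the term-selection step concrete, here is a minimal sketch in Python. It is not the procedure that Ginsberg and colleagues used, whose details are not described here; it simply assumes that "most predictive" can be approximated by ranking each query's weekly volume by its Pearson correlation with the CDC series. The function names (`pearson`, `top_queries`) and the tiny data set are hypothetical and made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information about the other
    return cov / sqrt(var_x * var_y)


def top_queries(query_volumes, cdc_series, k):
    """Return the k query strings whose weekly volumes correlate most
    strongly (and positively) with the official flu series."""
    ranked = sorted(
        query_volumes,
        key=lambda q: pearson(query_volumes[q], cdc_series),
        reverse=True,
    )
    return ranked[:k]


# Made-up weekly numbers, for illustration only.
cdc_series = [1, 2, 3, 4, 5]
query_volumes = {
    "flu remedies": [2, 4, 6, 8, 10],
    "weather": [3, 3, 3, 3, 3],
    "basketball scores": [5, 3, 4, 1, 2],
}

selected = top_queries(query_volumes, cdc_series, k=2)
```

With real data, a ranking like this would need to be checked on weeks that were not used for the selection, because among millions of candidate queries some will track the flu series by coincidence.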
There are two important caveats to this apparent success, however, and understanding these caveats will help you evaluate and do forecasting and nowcasting. First, the performance of Google Flu Trends was actually not much better than a simple model that estimates the amount of flu based on a linear extrapolation from the two most recent measurements of flu prevalence (Goel et al. 2010). And over some time periods, Google Flu Trends was actually worse than this simple approach (Lazer et al. 2014). In other words, Google Flu Trends, with all its data, machine learning, and powerful computing, did not dramatically outperform a simpler, easier-to-understand heuristic. This suggests that when evaluating any forecast or nowcast, it is important to compare against a baseline.
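The kind of baseline comparison described here can be sketched in a few lines of Python. This is an illustrative version only: it assumes weekly measurements and a reporting lag of two weeks, so that the nowcast for week t may use official data only through week t - 2. The function names and all of the numbers are hypothetical; they are not the data or the exact model from Goel et al. (2010).

```python
def extrapolation_baseline(official, lag=2):
    """Nowcast week t by extending the straight line through the two most
    recent official measurements available at week t (weeks t-lag-1 and t-lag)."""
    nowcasts = {}
    for t in range(lag + 1, len(official)):
        latest = official[t - lag]
        previous = official[t - lag - 1]
        weekly_change = latest - previous
        # Carry the most recent weekly change forward across the lag.
        nowcasts[t] = latest + weekly_change * lag
    return nowcasts


def mean_absolute_error(nowcasts, official):
    """Average absolute gap between nowcasts and the later-published values."""
    return sum(abs(nowcasts[t] - official[t]) for t in nowcasts) / len(nowcasts)


# Made-up weekly flu prevalence, for illustration only.
official = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
# Made-up output of some more elaborate nowcasting model for weeks 3, 4, and 5.
model_nowcasts = {3: 2.7, 4: 2.9, 5: 3.8}

baseline_nowcasts = extrapolation_baseline(official)
baseline_error = mean_absolute_error(baseline_nowcasts, official)
model_error = mean_absolute_error(model_nowcasts, official)
model_beats_baseline = model_error < baseline_error
```

In this toy series the trend is perfectly linear, so the baseline is exact and the more elaborate model loses. The point is the shape of the comparison: any nowcast should be scored on the same weeks, with the same error measure, as a simple alternative.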
The second important caveat about Google Flu Trends is that its ability to predict the CDC flu data was prone to short-term failure and long-term decay because of drift and algorithmic confounding. For example, during the 2009 Swine Flu outbreak, Google Flu Trends dramatically overestimated the amount of influenza, probably because people tend to change their search behavior in response to widespread fear of a global pandemic (Cook et al. 2011; Olson et al. 2013). In addition to these short-term problems, the performance gradually decayed over time. Diagnosing the reasons for this long-term decay is difficult because the Google search algorithms are proprietary, but it appears that in 2011 Google made changes that would suggest related search terms when people search for symptoms like “fever” and “cough” (it also seems that this feature is no longer active). Adding this feature is a totally reasonable thing to do if you are running a search engine business, and it had the effect of generating more health-related searches. This was probably a success for the business, but it caused Google Flu Trends to overestimate flu prevalence (Lazer et al. 2014).
Fortunately, these problems with Google Flu Trends are fixable. In fact, using more careful methods, Lazer et al. (2014) and Yang, Santillana, and Kou (2015) were able to get better results. Going forward, I expect that nowcasting studies that combine big data with researcher-collected data (that is, studies that combine Duchamp-style Readymades with Michelangelo-style Custommades) will enable policy makers to produce faster and more accurate measurements of the present and predictions of the future.