Large datasets are a means to an end; they are not an end in themselves.
The first of the three good characteristics of big data is the most discussed: these are big data. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research—measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.
First, size is particularly useful for moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But, because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
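To see why size matters for subgroup estimates, consider how the precision of a category-level estimate depends on the number of posts in that category. The sketch below uses made-up numbers (an even split across the 85 categories and a hypothetical 10% censorship rate), not figures from King and colleagues' data.

```python
# Rough illustration of why subgroup estimates require big datasets.
# The even split across categories and the 10% censorship rate are
# hypothetical numbers, not figures from King, Pan, and Roberts (2013).
import math

def se_of_proportion(p, n):
    """Standard error of an estimated proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

category_share = 1 / 85    # suppose posts were spread evenly over 85 categories
censorship_rate = 0.10     # hypothetical deletion probability in one category

for total_posts in (11_000, 11_000_000):
    n_in_category = int(total_posts * category_share)
    se = se_of_proportion(censorship_rate, n_in_category)
    print(f"{total_posts:>12,} posts -> ~{n_in_category:,} per category, "
          f"standard error ~ {se:.3f}")
```

With roughly 130 posts per category, the estimate for any single category would be uncertain by several percentage points; with more than a hundred thousand posts per category, it becomes precise enough for the kind of comparisons described above.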
Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare—about one in 3,000—they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
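A quick back-of-the-envelope calculation shows why: at a rate of roughly one in 3,000, the expected number of large cascades in a dataset is simply that rate times the number of tweets. The dataset sizes below are round numbers chosen only for illustration.

```python
# Expected number of large retweet cascades at a rate of about 1 in 3,000.
# The cascade rate comes from the text; the dataset sizes are round numbers
# chosen only for illustration.
cascade_rate = 1 / 3000

for n_tweets in (10_000, 1_000_000, 1_000_000_000):
    expected = n_tweets * cascade_rate
    print(f"{n_tweets:>13,} tweets -> about {expected:,.0f} large cascades expected")
```

A sample of a million tweets would yield only a few hundred large cascades; a billion tweets yields hundreds of thousands, enough to study how they differ.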
Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But, in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
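To get a sense of the scale involved, the sketch below uses the standard normal approximation for comparing two proportions to ask how many impressions per group would be needed to reliably detect a 1.0% versus 1.1% click-through rate. The 5% significance level and 80% power are conventional assumptions, not values taken from the text.

```python
# Approximate sample size per group needed to detect a difference between
# two click-through rates, using the usual two-proportion normal approximation.
# The 5% significance level and 80% power are conventional assumptions.
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate observations per group to detect p1 vs. p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

print(n_per_group(0.010, 0.011))  # on the order of 160,000 impressions per group
```

Detecting a 0.1 percentage point difference at such low base rates requires hundreds of thousands of observations in total, which is routine for a large website but far beyond the reach of a typical survey.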
Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don’t fundamentally change the problems with making causal inferences from observational data, matching and natural experiments—two techniques that researchers have developed for making causal claims from observational data—both greatly benefit from large datasets. I’ll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.
Although bigness is generally a good property when used correctly, I’ve noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors that I’ll describe in more detail below, which arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don’t think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
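A small simulation makes the point concrete: as the sample grows, the random error in an estimate shrinks, but a fixed systematic error in how the data are generated does not budge. The true value and the bias below are arbitrary illustrative numbers.

```python
# "Precisely inaccurate": more data shrinks random error but leaves
# systematic error untouched. The true value and bias are arbitrary
# illustrative numbers.
import random
import statistics

random.seed(42)
true_value = 0.50   # the quantity we would like to estimate
bias = 0.05         # systematic error baked into how the data are generated

for n in (100, 10_000, 1_000_000):
    # each observation is centered on the biased value, plus random noise
    sample = [true_value + bias + random.gauss(0, 1) for _ in range(n)]
    estimate = statistics.fmean(sample)
    print(f"n = {n:>9,}: estimate = {estimate:.3f} (truth = {true_value})")
```

As the sample size grows, the estimate settles ever more tightly around 0.55 rather than 0.50; the uncertainty shrinks around the wrong answer.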