2.3.6 Nonrepresentative

Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.

Some social scientists are accustomed to working with data that comes from a probabilistic random sample from a well-defined population, such as all adults in a particular country. This kind of data is called representative data because the sample “represents” the larger population. Many researchers prize representative data, and to some, representative data is synonymous with rigorous science whereas nonrepresentative data is synonymous with sloppiness. At the most extreme, some skeptics seem to believe that nothing can be learned from nonrepresentative data. If true, this would seem to severely limit what can be learned from big data sources because many of them are nonrepresentative. Fortunately, these skeptics are only partially right. There are certain research goals for which nonrepresentative data is clearly not well suited, but there are others for which it might actually be quite useful.

To understand this distinction, let’s consider a scientific classic: John Snow’s study of the 1853-54 cholera outbreak in London. At the time, many doctors believed that cholera was caused by “bad air,” but Snow believed that it was an infectious disease, perhaps spread by sewage-laced drinking water. To test this idea, Snow took advantage of what we might now call a natural experiment. He compared the cholera rates of households served by two different water companies: Lambeth and Southwark & Vauxhall. These companies served similar households, but they differed in one important way: in 1849—a few years before the epidemic began—Lambeth moved its intake point upstream from the main sewage discharge in London, whereas Southwark & Vauxhall left their intake pipe downstream from the sewage discharge. When Snow compared the death rates from cholera in households served by the two companies, he found that customers of Southwark & Vauxhall—the company that was providing customers sewage-tainted water—were 10 times more likely to die from cholera. This result provides strong scientific evidence for Snow’s argument about the cause of cholera, even though it is not based on a representative sample of people in London.

The data from these two companies, however, would not be ideal for answering a different question: what was the prevalence of cholera in London during the outbreak? For that second question, which is also important, it would be much better to have a representative sample of people from London.

As Snow’s work illustrates, there are some scientific questions for which nonrepresentative data can be quite effective and there are others for which it is not well suited. One crude way to distinguish these two kinds of questions is that some questions are about within-sample comparisons and some are about out-of-sample generalizations. This distinction can be further illustrated by another classic study in epidemiology: the British Doctors Study, which played an important role in demonstrating that smoking causes cancer. In this study, Richard Doll and A. Bradford Hill followed approximately 25,000 male doctors for several years and compared their death rates based on the amount that they smoked when the study began. Doll and Hill (1954) found a strong exposure-response relationship: the more heavily people smoked, the more likely they were to die from lung cancer. Of course, it would be unwise to estimate the prevalence of lung cancer among all British people based on this group of male doctors, but the within-sample comparison still provides evidence that smoking causes lung cancer.
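The difference between the two kinds of questions can be made concrete with a small calculation. The sketch below is only an illustration: the `Group` class, its field names, and every number in it are hypothetical and are not taken from Snow or from Doll and Hill. It assumes that exposure multiplies risk by the same factor in every group, which is exactly the kind of transportability assumption that needs separate justification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """A hypothetical subpopulation; all values are made up for illustration."""
    share: float       # fraction of the full population that belongs to this group
    exposed: float     # fraction of the group that is exposed (e.g., smokes)
    base_risk: float   # risk of the outcome among the unexposed
    risk_ratio: float  # assumed multiplicative effect of exposure on risk

    def risk_exposed(self) -> float:
        return self.base_risk * self.risk_ratio

    def prevalence(self) -> float:
        # Weight the exposed and unexposed risks by how common exposure is.
        return self.exposed * self.risk_exposed() + (1 - self.exposed) * self.base_risk


def population_prevalence(groups) -> float:
    """Prevalence in the full population: a share-weighted average over groups."""
    return sum(g.share * g.prevalence() for g in groups)


# Hypothetical: the group we can observe is small, healthier, and less exposed.
sampled = Group(share=0.01, exposed=0.2, base_risk=0.01, risk_ratio=3.0)
rest = Group(share=0.99, exposed=0.5, base_risk=0.02, risk_ratio=3.0)

# Within-sample comparison: exposed versus unexposed inside the sampled group.
within_sample_ratio = sampled.risk_exposed() / sampled.base_risk

# Out-of-sample generalization: using the sampled group's prevalence
# as if it were the prevalence in the whole population.
sample_prev = sampled.prevalence()
true_prev = population_prevalence([sampled, rest])

print(f"within-sample risk ratio: {within_sample_ratio:.1f}")
print(f"prevalence in sampled group: {sample_prev:.3f}")
print(f"prevalence in full population: {true_prev:.3f}")
```

Under these made-up numbers, the comparison inside the sampled group recovers the assumed risk ratio of 3, while the sampled group's prevalence (1.4%) is far from the population's (about 4.0%). The first quantity depends only on what happens inside the sample; the second depends on how the sample relates to everyone else.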

Now that I’ve illustrated the difference between within-sample comparisons and out-of-sample generalizations, two caveats are in order. First, there are naturally questions about the extent to which a relationship that holds within a sample of male British doctors will also hold within a sample of female British doctors or male British factory workers or female German factory workers or many other groups. These questions are interesting and important, but they are different from questions about the extent to which we can generalize from a sample to a population. Notice, for example, that you probably suspect that the relationship between smoking and cancer that was found in male British doctors will be similar in these other groups. Your ability to do this extrapolation does not come from the fact that male British doctors are a probabilistic random sample from any population; rather, it comes from an understanding of the mechanism that links smoking and cancer. Thus, the generalization from a sample to the population from which it is drawn is largely a statistical issue, but questions about the transportability of a pattern found in one group to another group are largely a nonstatistical issue (Pearl and Bareinboim 2014; Pearl 2015).

At this point, a skeptic might point out that most social patterns are probably less transportable across groups than the relationship between smoking and cancer. And I agree. The extent to which we should expect patterns to be transportable is ultimately a scientific question that has to be decided based on theory and evidence. It should not automatically be assumed that patterns will be transportable, nor should it be assumed that they won’t be transportable. These somewhat abstract questions about transportability will be familiar to you if you have followed the debates about how much researchers can learn about human behavior by studying undergraduate students (Sears 1986; Henrich, Heine, and Norenzayan 2010). Despite these debates, however, it would be unreasonable to say that researchers can’t learn anything from studying undergraduate students.

The second caveat is that most researchers with nonrepresentative data are not as careful as Snow or Doll and Hill. So, to illustrate what can go wrong when researchers try to make an out-of-sample generalization from nonrepresentative data, I’d like to tell you about a study of the 2009 German parliamentary election by Andranik Tumasjan and colleagues (2010). By analyzing more than 100,000 tweets, they found that the proportion of tweets mentioning a political party matched the proportion of votes that party received in the parliamentary election (figure 2.3). In other words, it appeared that Twitter data, which was essentially free, could replace traditional public opinion surveys, which are expensive because of their emphasis on representative data.

Given what you probably already know about Twitter, you should immediately be skeptical of this result. Germans on Twitter in 2009 were not a probabilistic random sample of German voters, and supporters of some parties might tweet about politics much more often than supporters of other parties. Thus, it seems surprising that all of the possible biases that you could imagine would somehow cancel out so that this data would be directly reflective of German voters. In fact, the results in Tumasjan et al. (2010) turned out to be too good to be true. A follow-up paper by Andreas Jungherr, Pascal Jürgens, and Harald Schoen (2012) pointed out that the original analysis had excluded the political party that had actually received the most mentions on Twitter: the Pirate Party, a small party that fights government regulation of the Internet. When the Pirate Party was included in the analysis, Twitter mentions become a terrible predictor of election results (figure 2.3). As this example illustrates, using nonrepresentative big data sources to do out-of-sample generalizations can go very wrong. Also, you should notice that the fact that there were 100,000 tweets was basically irrelevant: lots of nonrepresentative data is still nonrepresentative, a theme that I’ll return to in chapter 3 when I discuss surveys.
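The point that more nonrepresentative data does not fix nonrepresentativeness can be checked with a short simulation. This is a minimal sketch under stated assumptions, not a reconstruction of either paper's analysis: the function `tweet_share`, the 30% support level, and the factor-of-four tweeting rate are all hypothetical values chosen only to make the pattern visible.

```python
import random


def tweet_share(true_support: float, tweet_rate_ratio: float,
                n_tweets: int, rng: random.Random) -> float:
    """Simulate n_tweets and return the fraction that mention party X.

    Assumes (hypothetically) that supporters of party X tweet
    tweet_rate_ratio times as often as supporters of other parties,
    and that every tweet mentions its author's own party.
    """
    weight_x = true_support * tweet_rate_ratio
    weight_other = 1.0 - true_support
    p_tweet_is_x = weight_x / (weight_x + weight_other)
    hits = sum(rng.random() < p_tweet_is_x for _ in range(n_tweets))
    return hits / n_tweets


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SUPPORT = 0.30      # hypothetical vote share of party X
TWEET_RATE_RATIO = 4.0   # hypothetical: X supporters tweet 4 times as often

estimates = {
    n: tweet_share(TRUE_SUPPORT, TWEET_RATE_RATIO, n, rng)
    for n in (100, 10_000, 100_000)
}

for n, estimate in estimates.items():
    print(f"{n:>7} tweets: estimated share {estimate:.3f} "
          f"(true support {TRUE_SUPPORT:.3f})")
```

As the number of tweets grows, the estimate becomes more stable, but it settles near the biased value implied by the assumptions (about 0.63) and not near the true support of 0.30. A larger sample reduces random noise; it does nothing about the gap between who tweets and who votes.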

Figure 2.3: Twitter mentions appear to predict the results of the 2009 German election (Tumasjan et al. 2010), but this excludes the party with the most mentions: the Pirate Party (Jungherr, Jürgens, and Schoen 2012). See Tumasjan et al. (2012) for an argument in favor of excluding the Pirate Party. Adapted from Tumasjan et al. (2010), table 4 and Jungherr, Jürgens, and Schoen (2012), table 2.

To conclude, many big data sources are not representative samples from some well-defined population. For questions that require generalizing results from the sample to the population from which it was drawn, this is a serious problem. But for questions about within-sample comparisons, nonrepresentative data can be powerful, so long as researchers are clear about the characteristics of their sample and support claims about transportability with theoretical or empirical evidence. In fact, my hope is that big data sources will enable researchers to make more within-sample comparisons in many nonrepresentative groups, and my guess is that estimates from many different groups will do more to advance social research than a single estimate from a probabilistic random sample.