Representation is about making inferences from your respondents to your target population.
In order to understand the kinds of errors that can happen when inferring from respondents to the larger population, let’s consider the Literary Digest straw poll that tried to predict the outcome of the 1936 US presidential election. Although it happened more than 75 years ago, this debacle still has an important lesson to teach researchers today.
Literary Digest was a popular general-interest magazine, and starting in 1920 they began running straw polls to predict the outcomes of presidential elections. To make these predictions, they would send ballots to lots of people and then simply tally up the ballots that were returned; Literary Digest proudly reported that the ballots they received were neither “weighted, adjusted, nor interpreted.” This procedure correctly predicted the winners of the elections in 1920, 1924, 1928 and 1932. In 1936, in the midst of the Great Depression, Literary Digest sent out ballots to 10 million people, whose names came predominantly from telephone directories and automobile registration records. Here’s how they described their methodology:
“THE DIGEST’s smooth-running machine moves with the swift precision of thirty years’ experience to reduce guesswork to hard facts … This week 500 pens scratched out more than a quarter of a million addresses a day. Every day, in a great room high above motor-ribboned Fourth Avenue, in New York, 400 workers deftly slide a million pieces of printed matter—enough to pave forty city blocks—into the addressed envelops [sic]. Every hour, in THE DIGEST’S own Post Office Substation, three chattering postage metering machines sealed and stamped the white oblongs; skilled postal employees flipped them into bulging mailsacks; fleet DIGEST trucks sped them to express mail-trains . . . Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totaled. When the last figure has been totted and checked, if past experience is a criterion, the country will know to within a fraction of 1 percent the actual popular vote of forty million [voters].” (August 22, 1936)
Literary Digest’s fetishization of size is instantly recognizable to any “big data” researcher today. Of the 10 million ballots distributed, an amazing 2.4 million were returned—that’s roughly 1,000 times larger than modern political polls. From these 2.4 million respondents, the verdict was clear: Alf Landon was going to defeat the incumbent Franklin Roosevelt. But, in fact, Roosevelt defeated Landon in a landslide. How could Literary Digest go wrong with so much data? Our modern understanding of sampling makes Literary Digest’s mistakes clear and helps us avoid making similar mistakes in the future.
Thinking clearly about sampling requires us to consider four different groups of people (figure 3.2). The first group is the target population; this is the group that the researcher defines as the population of interest. In the case of Literary Digest, the target population was voters in the 1936 presidential election.
After deciding on a target population, a researcher needs to develop a list of people that can be used for sampling. This list is called a sampling frame and the people on it are called the frame population. Ideally, the target population and the frame population would be exactly the same, but in practice this is often not the case. For example, in the case of Literary Digest, the frame population was the 10 million people whose names came predominantly from telephone directories and automobile registration records. Differences between the target population and the frame population are called coverage error. Coverage error does not, by itself, guarantee problems. However, it can lead to coverage bias if people in the frame population are systematically different from people in the target population who are not in the frame population. This is, in fact, exactly what happened in the Literary Digest poll. The people in their frame population tended to be more likely to support Alf Landon, in part because they were wealthier (recall that both telephones and automobiles were relatively new and expensive in 1936). So, in the Literary Digest poll, coverage error led to coverage bias.
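To make this mechanism concrete, here is a small simulation sketch. All of the numbers in it are invented for illustration (assumed support rates, the assumed share of wealthy voters, and assumed frame-inclusion probabilities), not the actual 1936 figures; the point is only that a frame that over-represents one group produces a biased estimate even when the entire frame is surveyed.

```python
# A minimal sketch of coverage bias; all numbers are invented for
# illustration, not the actual 1936 figures.
import random

random.seed(0)

# Hypothetical target population: support for Roosevelt differs by wealth.
target = []
for _ in range(100_000):
    wealthy = random.random() < 0.3            # assumed: 30% of voters are wealthy
    p_roosevelt = 0.45 if wealthy else 0.75    # assumed support rates by wealth
    target.append((wealthy, random.random() < p_roosevelt))

# Frame population: drawn from telephone and automobile lists, so wealthy
# voters are much more likely to be included (assumed inclusion rates).
frame = [(w, v) for (w, v) in target if random.random() < (0.9 if w else 0.2)]

def roosevelt_share(pop):
    return sum(vote for _, vote in pop) / len(pop)

print(f"Roosevelt share, target population: {roosevelt_share(target):.2f}")
print(f"Roosevelt share, frame population:  {roosevelt_share(frame):.2f}")
```

Collecting more responses from the same frame would not close this gap; it would only make the biased estimate more precise.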
After defining the frame population, the next step is for a researcher to select the sample population; these are the people whom the researcher will attempt to interview. If the sample has different characteristics from the frame population, then sampling can introduce sampling error. In the case of the Literary Digest fiasco, however, there actually was no sampling—the magazine attempted to contact everyone in the frame population—and therefore there was no sampling error. Many researchers tend to focus on sampling error—this is typically the only kind of error captured by the margin of error reported in surveys—but the Literary Digest fiasco reminds us that we need to consider all sources of error, both random and systematic.
Finally, after selecting a sample population, a researcher attempts to interview all its members. Those people who are successfully interviewed are called respondents. Ideally, the sample population and the respondents would be exactly the same, but in practice there is nonresponse. That is, people who are selected in the sample sometimes do not participate. If the people who respond are different from those who don’t respond, then there can be nonresponse bias. Nonresponse bias was the second main problem with the Literary Digest poll. Only 24% of the people who received a ballot responded, and it turned out that people who supported Landon were more likely to respond.
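A similarly stripped-down sketch illustrates nonresponse bias. The support level and response rates below are invented, but they are chosen so that roughly 24% of ballots come back and Landon supporters return theirs more often, which is enough to turn a Roosevelt majority in the frame into a Landon majority among respondents.

```python
# A minimal sketch of nonresponse bias; the support level and response
# rates are invented, not the actual 1936 figures.
import random

random.seed(1)

# Toy frame population: True = supports Roosevelt, False = supports Landon.
frame = [random.random() < 0.55 for _ in range(1_000_000)]

def responds(supports_roosevelt):
    # Assumed response rates: Landon supporters mail their ballots back more often.
    return random.random() < (0.18 if supports_roosevelt else 0.32)

respondents = [v for v in frame if responds(v)]

print(f"Roosevelt share in the frame:      {sum(frame) / len(frame):.2f}")
print(f"Roosevelt share among respondents: {sum(respondents) / len(respondents):.2f}")
print(f"Response rate:                     {len(respondents) / len(frame):.2f}")
```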
Beyond just being an example to introduce the ideas of representation, the Literary Digest poll is an oft-repeated parable, cautioning researchers about the dangers of haphazard sampling. Unfortunately, I think that the lesson that many people draw from this story is the wrong one. The most common moral of the story is that researchers can’t learn anything from non-probability samples (i.e., samples without strict probability-based rules for selecting participants). But, as I’ll show later in this chapter, that’s not quite right. Instead, I think there are really two morals to this story, morals that are as true today as they were in 1936. First, a large amount of haphazardly collected data will not guarantee a good estimate. In general, having a large number of respondents decreases the variance of estimates, but it does not necessarily decrease the bias. With lots of data, researchers can sometimes get a precise estimate of the wrong thing; they can be precisely inaccurate (McFarland and McFarland 2015). The second main lesson from the Literary Digest fiasco is that researchers need to account for how their sample was collected when making estimates. In other words, because the sampling process in the Literary Digest poll was systematically skewed toward some respondents, researchers needed to use a more complex estimation process that weighted some respondents more than others. Later in this chapter, I’ll show you one such weighting procedure—post-stratification—that can enable you to make better estimates from haphazard samples.
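As a preview of that idea, here is a highly simplified sketch of post-stratification. The strata, within-stratum support shares, respondent counts, and population shares are all hypothetical: the sketch only shows the mechanics of grouping respondents into strata, estimating within each stratum, and then combining the stratum estimates using each stratum’s known share of the target population rather than its share of the respondents.

```python
# A minimal sketch of post-stratification; strata, support shares, and
# population shares are hypothetical, not the actual 1936 figures.

# (stratum, Roosevelt share among respondents in that stratum, number of respondents)
respondents = [
    ("phone_or_car_owner", 0.44, 2_000_000),
    ("neither",            0.70,   400_000),
]

# Assumed share of each stratum in the target population.
population_share = {"phone_or_car_owner": 0.35, "neither": 0.65}

# Unweighted estimate: every returned ballot counts equally.
n_total = sum(n for _, _, n in respondents)
unweighted = sum(share * n for _, share, n in respondents) / n_total

# Post-stratified estimate: weight each stratum by its population share.
post_stratified = sum(population_share[s] * share for s, share, _ in respondents)

print(f"Unweighted estimate of Roosevelt's share:      {unweighted:.2f}")
print(f"Post-stratified estimate of Roosevelt's share: {post_stratified:.2f}")
```

In this toy example, the unweighted estimate understates Roosevelt’s support because phone and car owners are heavily over-represented among respondents; re-weighting by the strata’s population shares moves the estimate back toward the target-population value, provided that respondents within each stratum resemble the nonrespondents in that stratum.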