Representation is about making inferences from your respondents to your target population.
To understand the kinds of errors that can happen when inferring from respondents to the larger population, let's consider the Literary Digest straw poll that tried to predict the outcome of the 1936 US Presidential election. Although it took place more than 75 years ago, this debacle still has an important lesson to teach researchers today.
Literary Digest was a popular general-interest magazine, and starting in 1920 they ran straw polls to predict the outcomes of Presidential elections. To make these predictions they would send ballots to lots of people and then simply tally up the ballots that were returned; Literary Digest proudly reported that the ballots they received were neither "weighted, adjusted, nor interpreted." This procedure correctly predicted the winners of the elections in 1920, 1924, 1928, and 1932. In 1936, in the midst of the Great Depression, Literary Digest sent out ballots to 10 million people, whose names came predominantly from telephone directories and automobile registration records. Here's how they described their methodology:
“THE DIGEST’s smooth-running machine moves with the swift precision of thirty years’ experience to reduce guesswork to hard facts . . . .This week 500 pens scratched out more than a quarter of a million addresses a day. Every day, in a great room high above motor-ribboned Fourth Avenue, in New York, 400 workers deftly slide a million pieces of printed matter—enough to pave forty city blocks—into the addressed envelops [sic]. Every hour, in THE DIGEST’S own Post Office Substation, three chattering postage metering machines sealed and stamped the white oblongs; skilled postal employees flipped them into bulging mailsacks; fleet DIGEST trucks sped them to express mail-trains . . . Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totaled. When the last figure has been totted and checked, if past experience is a criterion, the country will know to within a fraction of 1 percent the actual popular vote of forty million [voters].” (August 22, 1936)
The Digest’s fetishization of size is instantly recognizable to any “big data” researcher today. Of the 10 million ballots distributed, an amazing 2.4 million ballots were returned—that’s roughly 1,000 times larger than modern political polls. From these 2.4 million respondents the verdict was clear: Literary Digest predicted that the challenger Alf Landon was going to defeat the incumbent Franklin Roosevelt. But, in fact, the exact opposite happened. Roosevelt defeated Landon in a landslide. How could Literary Digest go wrong with so much data? Our modern understanding of sampling makes Literary Digest’s errors clear and helps us avoid making similar errors in the future.
Thinking clearly about sampling requires us to consider four different groups of people (Figure 3.1). The first group is the target population; this is the group that the researcher defines as the population of interest. In the case of Literary Digest, the target population was voters in the 1936 Presidential election. After deciding on a target population, a researcher next needs to develop a list of people that can be used for sampling. This list is called a sampling frame, and the population on the sampling frame is called the frame population. In the case of Literary Digest, the frame population was the 10 million people whose names came predominantly from telephone directories and automobile registration records. Ideally the target population and the frame population would be exactly the same, but in practice this is often not the case. Differences between the target population and the frame population are called coverage error. Coverage error does not, by itself, guarantee problems. But if the people in the frame population are systematically different from the people not in the frame population, there will be coverage bias. Coverage error was the first of the major flaws in the Literary Digest poll. They wanted to learn about voters, their target population, but they constructed a sampling frame predominantly from telephone directories and automobile registries, sources that over-represented wealthier Americans, who were more likely to support Alf Landon (recall that both of these technologies, which are common today, were relatively new at the time and that the US was in the midst of the Great Depression).
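To make the difference between coverage error and coverage bias concrete, here is a minimal sketch in Python. The group shares and support rates below are invented for illustration, not historical estimates; the point is only that a frame that systematically over-represents one group mis-states the population quantity even if everyone in the frame is measured perfectly.

```python
# Hypothetical illustration of coverage bias; the shares below are invented,
# not historical estimates. Suppose 30% of the target population appears in a
# frame built from telephone and automobile records, and that this wealthier
# group supports Landon at a higher rate than everyone else.

share_in_frame = 0.30           # assumed fraction of the target population in the frame
landon_support_in_frame = 0.60  # assumed Landon support inside the frame
landon_support_outside = 0.30   # assumed Landon support outside the frame

# True support in the target population: a weighted average over both groups.
true_support = (share_in_frame * landon_support_in_frame
                + (1 - share_in_frame) * landon_support_outside)

# What a perfect census of the frame population would report.
frame_estimate = landon_support_in_frame

print(f"True Landon support in target population: {true_support:.2f}")    # 0.39
print(f"Estimate from the frame alone:            {frame_estimate:.2f}")  # 0.60
print(f"Coverage bias:                            {frame_estimate - true_support:+.2f}")
```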
After defining the frame population, the next step is for the researcher to select the sample population; these are the people that the researcher will attempt to interview. If the sample population differs from the frame population, then sampling error is introduced. This is the kind of error quantified in the margin of error that usually accompanies estimates. In the case of the Literary Digest fiasco, there actually was no sample; they attempted to contact everyone in the frame population. Even though there was no sampling error, there was obviously still error. This clarifies that the margins of error typically reported with estimates from surveys are usually misleadingly small; they don't include all sources of error.
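To see how small that margin of error would have been, here is a back-of-the-envelope calculation using the standard 95% margin-of-error formula for a proportion under simple random sampling, an assumption that clearly did not hold for the Digest:

```python
import math

# 95% margin of error for an estimated proportion under simple random sampling:
# MOE = 1.96 * sqrt(p * (1 - p) / n); p = 0.5 gives the worst case.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# A typical modern poll versus the Literary Digest's 2.4 million returns.
for n in (1_000, 2_400_000):
    print(f"n = {n:>9,}: +/- {100 * margin_of_error(n):.2f} percentage points")
# n =     1,000: +/- 3.10 percentage points
# n = 2,400,000: +/- 0.06 percentage points
```

Taken at face value, 2.4 million responses would imply an uncertainty of well under a tenth of a percentage point, yet the poll called the wrong winner; the formula quantifies sampling error only and is silent about coverage error and non-response bias.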
Finally, a researcher attempts to interview everyone in the sample population. Those people who are successfully interviewed are called respondents. Ideally, the sample population and the respondents would be exactly the same, but in practice there is non-response; that is, some of the people selected into the sample do not participate. If the people who respond are different from those who don't, then there can be non-response bias. Non-response bias was the second main problem with the Literary Digest poll. Only 24% of the people who received a ballot responded, and it turned out that people who supported Landon were more likely to respond.
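Non-response bias can be sketched in the same hypothetical style; the response rates below are invented and serve only to show how differential response pushes the tally of returned ballots away from the pool of recipients it came from.

```python
# Hypothetical illustration of non-response bias; the response rates below are
# invented for illustration. If Landon supporters return their ballots at a
# higher rate, the pile of returned ballots over-states Landon's support even
# though millions of ballots were mailed.

ballots_mailed = 10_000_000
landon_share_of_recipients = 0.45  # assumed Landon support among everyone mailed a ballot
response_rate_landon = 0.33        # assumed response rate of Landon supporters
response_rate_roosevelt = 0.17     # assumed response rate of Roosevelt supporters

landon_returns = ballots_mailed * landon_share_of_recipients * response_rate_landon
roosevelt_returns = ballots_mailed * (1 - landon_share_of_recipients) * response_rate_roosevelt

landon_share_of_respondents = landon_returns / (landon_returns + roosevelt_returns)
print(f"Landon share among recipients:  {landon_share_of_recipients:.2f}")   # 0.45
print(f"Landon share among respondents: {landon_share_of_respondents:.2f}")  # 0.61
```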
Beyond just being an example to introduce the ideas of representation, the Literary Digest poll is an oft-repeated parable cautioning researchers about the dangers of haphazard sampling. Unfortunately, I think the lesson that many people draw from this story is the wrong one. The most common moral of the story is that researchers can't learn anything from non-probability samples (i.e., samples without strict probability-based rules for selecting participants). But, as I'll show later in this chapter, that's not quite right. Instead, I think there are really two morals to this story, morals that are as true today as they were in 1936. First, a large amount of haphazardly collected data will not guarantee a good estimate. Second, researchers need to account for how their data were collected when making estimates from them. In other words, because the data collection process in the Literary Digest poll was systematically skewed toward some respondents, researchers need to use a more complex estimation process that weights some respondents more than others. Later in this chapter, I'll show you one such weighting procedure, post-stratification, that can enable you to make better estimates with non-probability samples.
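As a preview, here is a minimal post-stratification sketch in Python with invented numbers: estimate the outcome within groups whose population shares are known, then weight those group-level estimates by the population shares rather than the sample shares.

```python
# Minimal post-stratification sketch with invented numbers. We know (or assume
# we know) the population share of two groups, but the non-probability sample
# over-represents group "a". We estimate the outcome within each group and then
# weight each group's estimate by its population share, not its sample share.

population_share  = {"a": 0.30, "b": 0.70}  # assumed known from, e.g., a census
sample_size       = {"a": 800,  "b": 200}   # group "a" is over-represented in the sample
support_in_sample = {"a": 0.60, "b": 0.30}  # estimated support within each group

# Naive estimate: pool the respondents as if they were representative.
n_total = sum(sample_size.values())
naive = sum(sample_size[g] * support_in_sample[g] for g in sample_size) / n_total

# Post-stratified estimate: re-weight group estimates by population shares.
post_stratified = sum(population_share[g] * support_in_sample[g] for g in population_share)

print(f"Naive estimate:           {naive:.2f}")            # 0.54
print(f"Post-stratified estimate: {post_stratified:.2f}")  # 0.39
```

The sketch hides the key assumption, which I will return to: within each group, the people who respond need to resemble the people who do not.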