Many of the themes in this chapter have also been echoed in recent presidential addresses at the American Association for Public Opinion Research (AAPOR), such as those by Dillman (2002), Newport (2011), Santos (2014), and Link (2015).
For more on the differences between survey research and in-depth interviews, see Small (2009). Related to in-depth interviews is a family of approaches called ethnography. In ethnographic research, researchers generally spend much more time with participants in their natural environment. For more on the differences between ethnography and in-depth interviews, see Jerolmack and Khan (2014). For more on digital ethnography, see Pink et al. (2015).
My description of the history of survey research is far too brief to include many of the exciting developments that have taken place. For more historical background, see Smith (1976), Converse (1987), and Igo (2008). For more on the idea of three eras of survey research, see Groves (2011) and Dillman, Smyth, and Christian (2008) (which breaks up the three eras slightly differently).
Groves and Kahn (1979) offer a peek inside the transition from the first to the second era in survey research by doing a detailed head-to-head comparison between a face-to-face and telephone survey. (???) look back at the historical development of random-digit-dialing sampling methods.
For more on how survey research has changed in the past in response to changes in society, see Tourangeau (2004), (???), and Couper (2011).
The strengths and weaknesses of asking and observing have been debated by psychologists (e.g., Baumeister, Vohs, and Funder (2007)) and sociologists (e.g., Jerolmack and Khan (2014); Maynard (2014); Cerulo (2014); Vaisey (2014)). The difference between asking and observing also arises in economics, where researchers talk about stated and revealed preferences. For example, a researcher could ask respondents whether they prefer eating ice cream or going to the gym (stated preferences), or could observe how often people eat ice cream and go to the gym (revealed preferences). There is deep skepticism about certain types of stated preference data in economics, as described in Hausman (2012).
A main theme from these debates is that reported behavior is not always accurate. But, as was described in chapter 2, big data sources may not be accurate, they may not be collected on a sample of interest, and they may not be accessible to researchers. Thus, I think that, in some situations, reported behavior can be useful. Further, a second main theme from these debates is that reports about emotions, knowledge, expectations, and opinions are not always accurate. But, if information about these internal states is needed by researchers—either to help explain some behavior or as the thing to be explained—then asking may be appropriate. Of course, learning about internal states by asking questions can be problematic because sometimes the respondents themselves are not aware of their internal states (Nisbett and Wilson 1977).
Chapter 1 of Groves (2004) does an excellent job reconciling the occasionally inconsistent terminology used by survey researchers to describe the total survey error framework. For a book-length treatment of the total survey error framework, see Groves et al. (2009), and for a historical overview, see Groves and Lyberg (2010).
The idea of decomposing errors into bias and variance also comes up in machine learning; see, for example, section 7.3 of Hastie, Tibshirani, and Friedman (2009). This often leads researchers to talk about a “bias-variance” trade-off.
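In case a reminder is helpful, the standard decomposition that these sources describe can be written, for a generic estimator $\hat{\theta}$ of a target quantity $\theta$, as

$$
\underbrace{\mathbb{E}\big[(\hat{\theta} - \theta)^2\big]}_{\text{mean squared error}}
= \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{\theta} - \mathbb{E}[\hat{\theta}]\big)^2\Big]}_{\text{variance}}
$$

so reducing total error can sometimes mean accepting a bit more bias in exchange for much less variance, or vice versa.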
In terms of representation, a great introduction to the issues of nonresponse and nonresponse bias is the National Research Council report Nonresponse in Social Science Surveys: A Research Agenda (2013). Another useful overview is provided by Groves (2006). Also, entire special issues of the Journal of Official Statistics, Public Opinion Quarterly, and the Annals of the American Academy of Political and Social Science have been published on the topic of nonresponse. Finally, there are actually many different ways of calculating the response rate; these approaches are described in detail in a report by the American Association for Public Opinion Research (AAPOR) (???).
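To give a flavor of what these calculations involve, one of the AAPOR definitions, Response Rate 1 (RR1), counts only complete interviews in the numerator and treats all cases of unknown eligibility as eligible:

$$
\text{RR1} = \frac{I}{(I + P) + (R + NC + O) + (UH + UO)}
$$

where $I$ is the number of complete interviews, $P$ partial interviews, $R$ refusals and break-offs, $NC$ non-contacts, $O$ other non-interviews, and $UH$ and $UO$ cases of unknown eligibility. The other AAPOR response rates differ mainly in how they treat partial interviews and cases of unknown eligibility.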
For more on the 1936 Literary Digest poll, see Bryson (1976), Squire (1988), Cahalan (1989), and Lusinchi (2012). For another discussion of this poll as a parable warning against haphazard data collection, see Gayo-Avello (2011). In 1936, George Gallup used a more sophisticated form of sampling and was able to produce more accurate estimates with a much smaller sample. Gallup’s success over the Literary Digest was a milestone in the development of survey research, as described in chapter 3 of Converse (1987), chapter 4 of Ohmer (2006), and chapter 3 of Igo (2008).
In terms of measurement, a great first resource for designing questionnaires is Bradburn, Sudman, and Wansink (2004). For more advanced treatments, see Schuman and Presser (1996), which is specifically focused on attitude questions, and Saris and Gallhofer (2014), which is more general. A slightly different approach to measurement is taken in psychometrics, as described in (???). More on pretesting is available in Presser and Blair (1994), Presser et al. (2004), and chapter 8 of Groves et al. (2009). For more on survey experiments, see Mutz (2011).
In terms of cost, the classic, book-length treatment of the trade-off between survey costs and survey errors is Groves (2004).
Two classic book-length treatments of standard probability sampling and estimation are Lohr (2009) (more introductory) and Särndal, Swensson, and Wretman (2003) (more advanced). A classic book-length treatment of post-stratification and related methods is Särndal and Lundström (2005). In some digital-age settings, researchers know quite a bit about nonrespondents, which was not often true in the past. Different forms of nonresponse adjustment are possible when researchers have information about nonrespondents, as described by Kalton and Flores-Cervantes (2003) and Smith (2011).
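To make the basic idea concrete, here is a minimal sketch in Python of simple post-stratification weighting (not the specific estimators in these books): respondents are reweighted so that the weighted sample matches known population shares for each group. The group labels, shares, and outcomes are made up for illustration.

```python
import pandas as pd

# Hypothetical respondent data: each row is one respondent,
# with a group label (e.g., an age group) and a survey outcome.
sample = pd.DataFrame({
    "group":   ["young", "young", "old", "old", "old"],
    "outcome": [1, 0, 1, 1, 0],
})

# Hypothetical population shares for the same groups,
# taken from a census or other auxiliary source.
population_share = {"young": 0.6, "old": 0.4}

# Post-stratification weight: population share of the group
# divided by the sample share of the group.
sample_share = sample["group"].value_counts(normalize=True)
sample["weight"] = sample["group"].map(
    lambda g: population_share[g] / sample_share[g]
)

# Weighted mean of the outcome: a simple post-stratified estimate.
estimate = (sample["weight"] * sample["outcome"]).sum() / sample["weight"].sum()
print(estimate)
```

In practice, the cells are usually defined by several variables at once, which is exactly the setting where the more sophisticated methods in these references become important.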
The Xbox study by W. Wang et al. (2015) uses a technique called multilevel regression and post-stratification (“Mr. P.”) that allows researchers to estimate group means even when there are many, many groups. Although there is some debate about the quality of the estimates from this technique, it seems like a promising area to explore. The technique was first used in Park, Gelman, and Bafumi (2004), and there has been subsequent use and debate (Gelman 2007; Lax and Phillips 2009; Pacheco 2011; Buttice and Highton 2013; Toshkov 2015). For more on the connection between individual weights and group weights, see Gelman (2007).
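At a high level, the post-stratification step in this approach takes model-based estimates $\hat{\theta}_j$ for each demographic cell $j$ (in Mr. P., these come from a multilevel regression that partially pools information across cells) and combines them using known population cell counts $N_j$:

$$
\hat{\theta}^{\text{post}} = \frac{\sum_{j=1}^{J} N_j \, \hat{\theta}_j}{\sum_{j=1}^{J} N_j}
$$

The partial pooling is what makes it possible to form reasonably stable cell estimates even when there are thousands of cells, many of which contain few or no respondents.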
For other approaches to weighting web surveys, see Schonlau et al. (2009), Bethlehem (2010), and Valliant and Dever (2011). Online panels can use either probability sampling or non-probability sampling. For more on online panels, see Callegaro et al. (2014).
Sometimes, researchers have found that probability samples and non-probability samples yield estimates of similar quality (Ansolabehere and Schaffner 2014), but other comparisons have found that non-probability samples do worse (Malhotra and Krosnick 2007; Yeager et al. 2011). One possible reason for these differences is that non-probability samples have improved over time. For a more pessimistic view of non-probability sampling methods, see the AAPOR Task Force on Non-Probability Sampling (Baker et al. 2013), and I also recommend reading the commentary that follows the summary report.
Conrad and Schober (2008) is an edited volume titled Envisioning the Survey Interview of the Future, and it offers a variety of viewpoints about the future of asking questions. Couper (2011) addresses similar themes, and Schober et al. (2015) offer a nice example of how data collection methods that are tailored to a new setting can result in higher quality data. Schober and Conrad (2015) offer a more general argument about continuing to adjust the process of survey research to match changes in society.
Tourangeau and Yan (2007) review issues of social desirability bias in sensitive questions, and Lind et al. (2013) offer some possible reasons why people might disclose more sensitive information in a computer-administered interview. For more on the role of human interviewers in increasing participation rates in surveys, see Maynard and Schaeffer (1997), Maynard, Freese, and Schaeffer (2010), Conrad et al. (2013), and Schaeffer et al. (2013). For more on mixed-mode surveys, see Dillman, Smyth, and Christian (2014).
Stone et al. (2007) offer a book-length treatment of ecological momentary assessment and related methods.
For more advice on making surveys an enjoyable and valuable experience for participants, see work on the Tailored Design Method (Dillman, Smyth, and Christian 2014). For another interesting example of using Facebook apps for social science surveys, see Bail (2015).
Judson (2007) describes the process of combining surveys and administrative data as “information integration” and discusses some advantages of this approach, as well as offering some examples.
Regarding enriched asking, there have been many previous attempts to validate voting. For an overview of that literature, see Belli et al. (1999), Ansolabehere and Hersh (2012), Hanmer, Banks, and White (2014), and Berent, Krosnick, and Lupia (2016). See Berent, Krosnick, and Lupia (2016) for a more skeptical view of the results presented in Ansolabehere and Hersh (2012).
It is important to note that although Ansolabehere and Hersh were encouraged by the quality of data from Catalist, other evaluations of commercial vendors have been less enthusiastic. Pasek et al. (2014) found poor quality when data from a survey was compared with a consumer file from Marketing Systems Group (which itself merged together data from three providers: Acxiom, Experian, and InfoUSA). That is, the data file did not match survey responses that researchers expected to be correct, the consumer file had missing data for a large number of questions, and the missing data pattern was correlated with the reported survey value (in other words, the missing data was systematic, not random).
For more on record linkage between surveys and administrative data, see Sakshaug and Kreuter (2012) and Schnell (2013). For more on record linkage in general, see Dunn (1946) and Fellegi and Sunter (1969) (historical) and Larsen and Winkler (2014) (modern). Similar approaches have also been developed in computer science under names such as data deduplication, instance identification, name matching, duplicate detection, and duplicate record detection (Elmagarmid, Ipeirotis, and Verykios 2007). There are also privacy-preserving approaches to record linkage that do not require the transmission of personally identifying information (Schnell 2013). Researchers at Facebook developed a procedure to probabilistically link their records to voting behavior (Jones et al. 2013); this linkage was done to evaluate an experiment that I’ll tell you about in chapter 4 (Bond et al. 2012). For more on obtaining consent for record linkage, see Sakshaug et al. (2012).
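To make the general idea concrete, here is a toy sketch in Python of probabilistic linkage (not any of the specific systems or algorithms cited above): candidate pairs of records are scored by approximate agreement on name and exact agreement on birth year, and pairs above a threshold are accepted as links. All names, fields, and thresholds are made up for illustration.

```python
from difflib import SequenceMatcher

# Two hypothetical record files: e.g., survey respondents and an
# administrative list. Fields and values are made up for illustration.
survey_records = [
    {"id": "s1", "name": "Maria Lopez",   "birth_year": 1980},
    {"id": "s2", "name": "John A. Smith", "birth_year": 1975},
]
admin_records = [
    {"id": "a1", "name": "Maria Lopes", "birth_year": 1980},
    {"id": "a2", "name": "Jon Smith",   "birth_year": 1975},
    {"id": "a3", "name": "Ana Chen",    "birth_year": 1991},
]

def name_similarity(a, b):
    """Crude string similarity between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_score(rec1, rec2):
    """Combine evidence from the two fields into a single score."""
    score = name_similarity(rec1["name"], rec2["name"])
    if rec1["birth_year"] == rec2["birth_year"]:
        score += 0.5  # exact agreement on birth year adds weight
    return score

# Accept the best-scoring administrative record for each survey record,
# but only if the score clears an (arbitrary) threshold.
THRESHOLD = 1.2
for s in survey_records:
    scored = [(link_score(s, a), a) for a in admin_records]
    score, best = max(scored, key=lambda pair: pair[0])
    if score >= THRESHOLD:
        print(s["id"], "->", best["id"], round(score, 2))
```

Real systems use much more careful blocking, field weighting, and error modeling, but the basic logic of scoring and thresholding candidate pairs is the same.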
Another example of linking a large-scale social survey to government administrative records comes from the Health and Retirement Survey and the Social Security Administration. For more on that study, including information about the consent procedure, see Olson (1996, 1999).
The process of combining many sources of administrative records into a master datafile—the process that Catalist employs—is common in the statistical offices of some national governments. Two researchers from Statistics Sweden have written a detailed book on the topic (Wallgren and Wallgren 2007). For an example of this approach in a single county in the United States (Olmstead County, Minnesota; home of the Mayo Clinic), see Sauver et al. (2011). For more on errors that can appear in administrative records, see Groen (2012).
Another way in which researchers can use big data sources in survey research is as a sampling frame for people with specific characteristics. Unfortunately, this approach can raise questions related to privacy (Beskow, Sandler, and Weinberger 2006).
Regarding amplified asking, this approach is not as new as it might appear from how I’ve described it. It has deep connections to three large areas in statistics: model-based post-stratification (Little 1993), imputation (Rubin 2004), and small area estimation (Rao and Molina 2015). It is also related to the use of surrogate variables in medical research (Pepe 1992).
The cost and time estimates in Blumenstock, Cadamuro, and On (2015) refer more to variable cost—the cost of one additional survey—and do not include fixed costs such as the cost of cleaning and processing the call data. In general, amplified asking will probably have high fixed costs and low variable costs similar to those of digital experiments (see chapter 4). For more on mobile phone-based surveys in developing countries, see Dabalen et al. (2016).
For ideas about how to do amplified asking better, I’d recommend learning more about multiple imputation (Rubin 2004). Also, if researchers doing amplified asking care about aggregate counts, rather than individual-level traits, then the approaches in King and Lu (2008) and Hopkins and King (2010) may be useful. Finally, for more about the machine learning approaches in Blumenstock, Cadamuro, and On (2015), see James et al. (2013) (more introductory) or Hastie, Tibshirani, and Friedman (2009) (more advanced).
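To make the machine-learning step a bit more concrete, here is a minimal sketch in Python of the general recipe: fit a supervised model on the surveyed subsample, predict the survey outcome for everyone in the big data source, then aggregate the predictions. It uses scikit-learn with made-up features and a toy data-generating process, and it is a sketch of the general idea rather than the actual pipeline in Blumenstock, Cadamuro, and On (2015).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features derived from a big data source (e.g., call records)
# for a large population; the columns are made up for illustration.
n_population = 10_000
features = rng.normal(size=(n_population, 3))  # e.g., calls/day, contacts, top-ups

# Suppose a small random subsample was surveyed and we observed an
# outcome of interest (e.g., a wealth index) for those respondents.
surveyed = rng.choice(n_population, size=500, replace=False)
true_signal = features @ np.array([0.5, -0.2, 0.3])  # toy data-generating process
wealth_surveyed = true_signal[surveyed] + rng.normal(scale=0.5, size=500)

# Step 1: fit a predictive model on the surveyed subsample.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features[surveyed], wealth_surveyed)

# Step 2: predict the outcome for everyone in the big data source.
predicted_wealth = model.predict(features)

# Step 3: aggregate the predictions (here, a simple population mean;
# in practice one might aggregate by region or demographic cell).
print("Estimated population mean:", predicted_wealth.mean())
```

Note that a naive aggregation like this ignores the uncertainty introduced by the prediction step, which is one reason the multiple imputation framework mentioned above is worth learning.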
One ethical issue regarding amplified asking is that it can be used to infer sensitive traits that people might not choose to reveal in a survey, as described in Kosinski, Stillwell, and Graepel (2013).