One kind of observing that is not included in this chapter is ethnography. For more on ethnography in digital spaces, see Boellstorff et al. (2012), and for more on ethnography in mixed digital and physical spaces, see Lane (2016).
There is no single consensus definition of “big data,” but many definitions seem to focus on the “3 Vs”: volume, variety, and velocity (e.g., Japec et al. (2015)). See De Mauro et al. (2015) for a review of definitions.
My inclusion of government administrative data in the category of big data is a bit unusual, although others have also made this case, including Legewie (2015), Connelly et al. (2016), and Einav and Levin (2014). For more about the value of government administrative data for research, see Card et al. (2010), Administrative Data Taskforce (2012), and Grusky, Smeeding, and Snipp (2015).
For a view of administrative research from inside the government statistical system, particularly the US Census Bureau, see Jarmin and O’Hara (2016). For a book-length treatment of the administrative records research at Statistics Sweden, see Wallgren and Wallgren (2007).
In the chapter, I briefly compared a traditional survey such as the General Social Survey (GSS) with a social media data source such as Twitter. For a thorough and careful comparison between traditional surveys and social media data, see Schober et al. (2016).
These 10 characteristics of big data have been described in a variety of different ways by a variety of different authors. Writing that influenced my thinking on these issues includes Lazer et al. (2009), Groves (2011), Howison, Wiggins, and Crowston (2011), boyd and Crawford (2012), S. J. Taylor (2013), Mayer-Schönberger and Cukier (2013), Golder and Macy (2014), Ruths and Pfeffer (2014), Tufekci (2014), Sampson and Small (2015), K. Lewis (2015b), Lazer (2015), Horton and Tambe (2015), Japec et al. (2015), and Goldstone and Lupyan (2016).
Throughout this chapter, I’ve used the term digital traces, which I think is relatively neutral. Another popular term for digital traces is digital footprints (Golder and Macy 2014), but as Hal Abelson, Ken Ledeen, and Harry Lewis (2008) point out, a more appropriate term is probably digital fingerprints. When you create footprints, you are aware of what is happening and your footprints cannot generally be traced to you personally. The same is not true for your digital traces. In fact, you are leaving traces all the time about which you have very little knowledge. And, although these traces don’t have your name on them, they can often be linked back to you. In other words, they are more like fingerprints: invisible and personally identifying.
For more on why large datasets render statistical tests problematic, see M. Lin, Lucas, and Shmueli (2013) and McFarland and McFarland (2015). These issues should lead researchers to focus on practical significance rather than statistical significance.
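To see why this happens, consider a small simulation: with millions of observations, a difference of one hundredth of a standard deviation, which is practically negligible, still produces a vanishingly small p-value. The sketch below, in Python with invented data, is only meant to illustrate this general point; it is not drawn from any of the cited papers.

```python
# A minimal sketch (invented data) of why p-values become uninformative with
# very large samples: a practically trivial difference in means is "highly
# significant" once n is in the millions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000_000

# Two groups that differ by a practically negligible 0.01 standard deviations.
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
effect_size = (group_b.mean() - group_a.mean()) / group_a.std()

print(f"p-value: {p_value:.2e}")                       # tiny p-value despite ...
print(f"effect size (Cohen's d): {effect_size:.3f}")   # ... a negligible effect
```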
For more about how Raj Chetty and colleagues obtained access to the tax records, see Mervis (2014).
Large datasets can also create computational problems that are generally beyond the capabilities of a single computer. Therefore, researchers making computations on large datasets often spread the work over many computers, a process sometimes called parallel programming. For an introduction to parallel programming, in particular a framework called Hadoop, see Vo and Silvia (2016).
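Hadoop itself is a Java-based framework rather than something one would sketch in a few lines, but the underlying map/reduce pattern is simple. The toy illustration below, in Python with made-up data and a single machine standing in for a cluster, conveys the pattern only; it is not how Hadoop is actually used in practice.

```python
# A minimal sketch of the map/reduce pattern that frameworks like Hadoop
# implement at scale. Here the "cluster" is just Python's multiprocessing
# pool on one machine, and the text chunks are invented.
from collections import Counter
from multiprocessing import Pool

def map_count(text_chunk):
    """Map step: count words in one chunk of the data."""
    return Counter(text_chunk.lower().split())

def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

if __name__ == "__main__":
    chunks = ["big data big claims", "big data small insights"]  # stand-in data
    with Pool(processes=2) as pool:
        partial = pool.map(map_count, chunks)   # map runs in parallel
    print(reduce_counts(partial).most_common(3))
```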
When considering always-on data, it is important to consider whether you are comparing the exact same people over time or whether you are comparing some changing group of people; see, for example, Diaz et al. (2016).
A classic book on nonreactive measures is Webb et al. (1966). The examples in that book predate the digital age, but they are still illuminating. For examples of people changing their behavior because of the presence of mass surveillance, see Penney (2016) and Brayne (2014).
Reactivity is closely related to what researchers call demand effects (Orne 1962; Zizzo 2010) and the Hawthorne effect (Adair 1984; Levitt and List 2011).
For more on record linkage, see Dunn (1946) and Fellegi and Sunter (1969) (historical) and Larsen and Winkler (2014) (modern). Similar approaches have also been developed in computer science under names such as data deduplication, instance identification, name matching, duplicate detection, and duplicate record detection (Elmagarmid, Ipeirotis, and Verykios 2007). There are also privacy-preserving approaches to record linkage that do not require the transmission of personally identifying information (Schnell 2013). Facebook also has developed a process to link their records to voting behavior; this was done to evaluate an experiment that I’ll tell you about in chapter 4 (Bond et al. 2012; Jones et al. 2013).
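To give a flavor of what record linkage involves in practice, here is a toy Python sketch that blocks on an exact field and then compares names with a string-similarity threshold. The records and the threshold are invented, and this is far simpler than the probabilistic approach of Fellegi and Sunter (1969) or the privacy-preserving methods cited above.

```python
# A minimal sketch of blocking-plus-fuzzy record linkage (invented records).
from difflib import SequenceMatcher

survey = [{"id": 1, "name": "Maria Garcia", "birth_year": 1980}]
admin  = [{"id": "A7", "name": "Maria Garcia-Lopez", "birth_year": 1980},
          {"id": "B2", "name": "Mario Garza", "birth_year": 1981}]

def name_similarity(a, b):
    """Crude string similarity between two names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

links = []
for s in survey:
    for a in admin:
        # Block on birth year, then compare names against a similarity threshold.
        if s["birth_year"] == a["birth_year"] and name_similarity(s["name"], a["name"]) > 0.75:
            links.append((s["id"], a["id"]))

print(links)  # [(1, 'A7')]
```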
For more on construct validity, see chapter 3 of Shadish, Cook, and Campbell (2001).
For more on the AOL search log debacle, see Ohm (2010). I offer advice about partnering with companies and governments in chapter 4 when I describe experiments. A number of authors have expressed concerns about research that relies on inaccessible data; see Huberman (2012) and boyd and Crawford (2012).
One good way for university researchers to acquire data access is to work at a company as an intern or visiting researcher. In addition to enabling data access, this process will also help the researcher learn more about how the data was created, which is important for analysis.
In terms of gaining access to government data, Mervis (2014) discusses how Raj Chetty and colleagues obtained access to the tax records used in their research on social mobility.
For more on the history of “representativeness” as a concept, see Kruskal and Mosteller (1979a), Kruskal and Mosteller (1979b), Kruskal and Mosteller (1979c), and Kruskal and Mosteller (1980).
My summaries of the work of Snow and the work of Doll and Hill were brief. For more on Snow’s work on cholera, see Freedman (1991). For more on the British Doctors Study, see Doll et al. (2004) and Keating (2014).
Many researchers will be surprised to learn that although Doll and Hill had collected data from female doctors and from doctors under 35, they intentionally did not use this data in their first analysis. As they argued: “Since lung cancer is relatively rare in women and men under 35, useful figures are unlikely to be obtained in these groups for some years to come. In this preliminary report we have therefore confined our attention to men aged 35 and above.” Rothman, Gallacher, and Hatch (2013), which has the provocative title “Why representativeness should be avoided,” make a more general argument for the value of intentionally creating nonrepresentative data.
Nonrepresentativeness is a major problem for researchers and governments who wish to make statements about an entire population. This is less of a concern for companies, which are typically focused on their users. For more on how Statistics Netherlands considers the issue of nonrepresentativeness of business big data, see Buelens et al. (2014).
For examples of researchers expressing concern about the nonrepresentative nature of big data sources, see boyd and Crawford (2012), K. Lewis (2015b), and Hargittai (2015).
For a more detailed comparison of the goals of social surveys and epidemiological research, see Keiding and Louis (2016).
For more on attempts to use Twitter to make out-of-sample generalizations about voters, especially the case from the 2009 German election, see Jungherr (2013) and Jungherr (2015). Subsequent to the work of Tumasjan et al. (2010), researchers around the world have used fancier methods—such as using sentiment analysis to distinguish between positive and negative mentions of the parties—in order to improve the ability of Twitter data to predict a variety of different types of elections (Gayo-Avello 2013; Jungherr 2015, chap. 7). Here’s how Huberty (2015) summarized the results of these attempts to predict elections:
“All known forecasting methods based on social media have failed when subjected to the demands of true forward-looking electoral forecasting. These failures appear to be due to fundamental properties of social media, rather than to methodological or algorithmic difficulties. In short, social media do not, and probably never will, offer a stable, unbiased, representative picture of the electorate; and convenience samples of social media lack sufficient data to fix these problems post hoc.”
In chapter 3, I’ll describe sampling and estimation in much greater detail. Even if data are nonrepresentative, under certain conditions, they can be weighted to produce good estimates.
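As a preview of that material, here is a minimal post-stratification example in Python. The group means and population shares are invented; the point is only that reweighting group-level estimates by known population shares can correct for a skewed sample, under the strong assumption that the people in each group who are in the sample resemble the people in that group who are not.

```python
# A minimal sketch of post-stratification with invented numbers.

# Within-group sample means of some outcome, estimated from a skewed sample.
group_means = {"18-29": 0.62, "30-49": 0.48, "50+": 0.35}

# Share of each group in the sample versus in the target population.
sample_share     = {"18-29": 0.60, "30-49": 0.30, "50+": 0.10}  # young-heavy sample
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

naive_estimate = sum(group_means[g] * sample_share[g] for g in group_means)
weighted_estimate = sum(group_means[g] * population_share[g] for g in group_means)

print(f"unweighted estimate: {naive_estimate:.3f}")        # reflects the skewed sample
print(f"post-stratified estimate: {weighted_estimate:.3f}")  # reweighted to population shares
```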
System drift is very hard to see from the outside. However, the MovieLens project (discussed more in chapter 4) has been run for more than 15 years by an academic research group. Thus, they have been able to document and share information about the way that the system has evolved over time and how this might impact analysis (Harper and Konstan 2015).
A number of scholars have focused on drift in Twitter: Liu, Kliman-Silver, and Mislove (2014) and Tufekci (2014).
One approach to dealing with population drift is to create a panel of users, which allows researchers to study the same people over time, see Diaz et al. (2016).
I first heard the term “algorithmically confounded” used by Jon Kleinberg in a talk, but unfortunately I don’t remember when or where the talk was given. The first time that I saw the term in print was in Anderson et al. (2015), which is an interesting discussion of how the algorithms used by dating sites might complicate researchers’ ability to use data from these websites to study social preferences. This concern was raised by K. Lewis (2015a) in response to Anderson et al. (2014).
In addition to Facebook, Twitter also recommends people for users to follow based on the idea of triadic closure; see Su, Sharma, and Goel (2016). So the level of triadic closure in Twitter is a combination of some human tendency toward triadic closure and some algorithmic tendency to promote triadic closure.
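To make the idea of measuring triadic closure concrete, here is a toy sketch using the third-party networkx package in Python; the graph is invented. As the discussion above suggests, in real platform data any such measurement already mixes human tendencies with the platform’s recommendation algorithm.

```python
# A minimal sketch of quantifying triadic closure in a toy friendship graph.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # one closed triangle
                  ("c", "d"), ("d", "e")])              # open paths

# Transitivity = 3 * (number of triangles) / (number of connected triples).
print(f"transitivity: {nx.transitivity(g):.2f}")  # 0.50 for this toy graph
```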
For more on performativity—in particular the idea that some social science theories are “engines not cameras” (i.e., they shape the world rather than just describing it)—see Mackenzie (2008).
Governmental statistical agencies call data cleaning statistical data editing. De Waal, Puts, and Daas (2014) describe statistical data editing techniques developed for survey data and examine the extent to which they are applicable to big data sources, and Puts, Daas, and Waal (2015) present some of the same ideas for a more general audience.
For an overview of social bots, see Ferrara et al. (2016). For some examples of studies focused on finding spam in Twitter, see Clark et al. (2016) and Chu et al. (2012). Finally, Subrahmanian et al. (2016) describe the results of the DARPA Twitter Bot Challenge, a mass collaboration designed to compare approaches for detecting bots on Twitter.
Ohm (2015) reviews earlier research on the idea of sensitive information and offers a multi-factor test. The four factors he proposes are the magnitude of harm, the probability of harm, the presence of a confidential relationship, and whether the risk reflects majoritarian concerns.
Farber’s study of taxis in New York was based on an earlier study by Camerer et al. (1997) that used three different convenience samples of paper trip sheets. This earlier study found that drivers seemed to be target earners: they worked less on days when their wages were higher.
In subsequent work, King and colleagues have further explored online censorship in China (King, Pan, and Roberts 2014, 2016). For a related approach to measuring online censorship in China, see Bamman, O’Connor, and Smith (2012). For more on statistical methods like the one used in King, Pan, and Roberts (2013) to estimate the sentiment of the 11 million posts, see Hopkins and King (2010). For more on supervised learning, see James et al. (2013) (less technical) and Hastie, Tibshirani, and Friedman (2009) (more technical).
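For readers new to supervised learning, the sketch below shows the generic workflow: hand-label a small set of posts, fit a model, and apply it to unlabeled posts. It uses scikit-learn and invented posts, and it is not the method of Hopkins and King (2010), which estimates category proportions directly rather than classifying individual documents.

```python
# A generic supervised-learning sketch with invented posts and labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_posts = ["love this policy", "great news today",
                 "terrible decision", "this is awful news"]
labels = ["positive", "positive", "negative", "negative"]

# Turn the hand-labeled posts into word-count features and fit a classifier.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(labeled_posts)
model = MultinomialNB().fit(X_train, labels)

# Apply the fitted model to posts that have not been hand-labeled.
unlabeled_posts = ["this is terrible", "love this news"]
print(model.predict(vectorizer.transform(unlabeled_posts)))
```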
Forecasting is a big part of industrial data science (Mayer-Schönberger and Cukier 2013; Provost and Fawcett 2013). One type of forecasting that is commonly done by social researchers is demographic forecasting; see, for example, Raftery et al. (2012).
Google Flu Trends was not the first project to use search data to nowcast influenza prevalence. In fact, researchers in the United States (Polgreen et al. 2008; Ginsberg et al. 2009) and Sweden (Hulth, Rydevik, and Linde 2009) had found that certain search terms (e.g., “flu”) predicted national public health surveillance data before it was released. Subsequently, many, many other projects have tried to use digital trace data for disease surveillance; see Althouse et al. (2015) for a review.
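The statistical core of these nowcasting projects can be quite simple. The sketch below, in Python with invented numbers, regresses official surveillance figures on the frequency of a flu-related search term and then uses current search volume to estimate prevalence before the official numbers arrive; real systems such as Google Flu Trends used many terms and more elaborate models.

```python
# A minimal nowcasting sketch with invented weekly data.
import numpy as np

search_volume = np.array([1.2, 1.8, 2.5, 3.9, 5.1, 4.2])   # weekly flu-query share (%)
reported_ili  = np.array([0.9, 1.3, 1.9, 2.8, 3.6, 3.1])   # official influenza-like illness (%)

# Fit a simple linear model: reported_ili ~ a + b * search_volume.
b, a = np.polyfit(search_volume, reported_ili, deg=1)

current_search_volume = 4.7            # this week's search data, available now
nowcast = a + b * current_search_volume
print(f"nowcast of ILI this week: {nowcast:.2f}%")
```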
In addition to using digital trace data to predict health outcomes, there has also been a huge amount of work using Twitter data to predict election outcomes; for reviews see Gayo-Avello (2011), Gayo-Avello (2013), Jungherr (2015) (chapter 7), and Huberty (2015). Nowcasting of economic indicators, such as gross domestic product (GDP), is also common in central banks; see Bańbura et al. (2013). Table 2.8 includes a few examples of studies that use some kind of digital trace to predict some kind of event in the world.
| Digital trace | Outcome | Citation |
|---|---|---|
| Twitter | Box office revenue of movies in the US | Asur and Huberman (2010) |
| Search logs | Sales of movies, music, books, and video games in the US | Goel et al. (2010) |
| Twitter | Dow Jones Industrial Average (US stock market) | Bollen, Mao, and Zeng (2011) |
| Social media and search logs | Surveys of investor sentiment and stock markets in the United States, United Kingdom, Canada, and China | Mao et al. (2015) |
| Search logs | Prevalence of Dengue Fever in Singapore and Bangkok | Althouse, Ng, and Cummings (2011) |
Finally, Jon Kleinberg and colleagues (2015) have pointed out that forecasting problems fall into two subtly different categories and that social scientists have tended to focus on one and ignore the other. Imagine one policy maker, I’ll call her Anna, who is facing a drought and must decide whether to hire a shaman to do a rain dance to increase the chance of rain. Another policy maker, I’ll call her Betty, must decide whether to take an umbrella to work to avoid getting wet on the way home. Both Anna and Betty can make a better decision if they understand weather, but they need to know different things. Anna needs to understand whether the rain dance causes rain. Betty, on the other hand, does not need to understand anything about causality; she just needs an accurate forecast. Social researchers often focus on problems like the one faced by Anna—which Kleinberg and colleagues call “rain dance–like” policy problems—because they involve questions of causality. Questions like the one faced by Betty—which Kleinberg and colleagues call “umbrella-like” policy problems—can be quite important too, but have received much less attention from social researchers.
The journal PS: Political Science &amp; Politics had a symposium on big data, causal inference, and formal theory, and Clark and Golder (2015) summarize each contribution. The journal Proceedings of the National Academy of Sciences of the United States of America had a symposium on causal inference and big data, and Shiffrin (2016) summarizes each contribution. For machine learning approaches that attempt to automatically discover natural experiments inside of big data sources, see Jensen et al. (2008), Sharma, Hofman, and Watts (2015), and Sharma, Hofman, and Watts (2016).
In terms of natural experiments, Dunning (2012) provides an introductory, book-length treatment with many examples. For a skeptical view of natural experiments, see Rosenzweig and Wolpin (2000) (economics) or Sekhon and Titiunik (2012) (political science). Deaton (2010) and Heckman and Urzúa (2010) argue that focusing on natural experiments can lead researchers to focus on estimating unimportant causal effects; Imbens (2010) counters these arguments with a more optimistic view of the value of natural experiments.
When describing how a researcher could go from estimating the effect of being drafted to the effect of serving, I was describing a technique called instrumental variables. Imbens and Rubin (2015), in their chapters 23 and 24, provide an introduction and use the draft lottery as an example. The effect of military service on compliers is sometimes called the complier average causal effect (CACE) and sometimes the local average treatment effect (LATE). Sovey and Green (2011), Angrist and Krueger (2001), and Bollen (2012) offer reviews of the usage of instrumental variables in political science, economics, and sociology, and Sovey and Green (2011) provide a “reader’s checklist” for evaluating studies using instrumental variables.
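For readers who want to see the mechanics, here is a small simulation in Python of the Wald version of the instrumental variables estimator, with draft eligibility as the instrument, military service as the treatment, and earnings as the outcome. All numbers are invented, and the simulation builds in the assumptions (exclusion and monotonicity) that make the estimator work.

```python
# A minimal Wald/IV sketch with a binary instrument and invented data.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.integers(0, 2, size=n)        # instrument: draft-eligible or not
ability = rng.normal(size=n)          # unobserved confounder
# Eligibility raises the chance of serving; ability also affects who serves.
d = (rng.random(n) < 0.2 + 0.4 * z + 0.1 * (ability > 0)).astype(int)
true_effect = -2.0                    # effect of service on earnings (by construction)
y = 10 + true_effect * d + 3 * ability + rng.normal(size=n)

# Wald estimator: intent-to-treat effect divided by the change in service rates.
itt = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
late = itt / first_stage

print(f"naive comparison: {y[d == 1].mean() - y[d == 0].mean():.2f}")  # biased by ability
print(f"IV (LATE) estimate: {late:.2f}")                               # close to the true -2
```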
It turns out that the 1970 draft lottery was not, in fact, properly randomized; there were small deviations from pure randomness (Fienberg 1971). Berinsky and Chatfield (2015) argue that this small deviation is not substantively important and discuss the importance of properly conducted randomization.
In terms of matching, see Stuart (2010) for an optimistic review, and Sekhon (2009) for a pessimistic review. For more on matching as a kind of pruning, see Ho et al. (2007). Finding a single perfect match for each person is often difficult, and this introduces a number of complexities. First, when exact matches are not available, researchers need to decide how to measure the distance between two units and whether a given distance is close enough. A second complexity arises if researchers want to use multiple matches for each case in the treatment group, since this can lead to more precise estimates. Both of these issues, as well as others, are described in detail in chapter 18 of Imbens and Rubin (2015). See also Part II of (???).
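The following toy sketch in Python illustrates the two decisions just described: choosing a distance metric (here, Euclidean distance on two invented covariates) and choosing a caliper that defines “close enough.” Treated units without an acceptable match are pruned from the analysis.

```python
# A minimal sketch of one-to-one nearest-neighbor matching with a caliper,
# using invented covariates and matching with replacement for simplicity.
import numpy as np

rng = np.random.default_rng(2)
treated_x = rng.normal(1.0, 1.0, size=(50, 2))    # covariates of treated units
control_x = rng.normal(0.0, 1.0, size=(500, 2))   # covariates of control units

caliper = 0.2   # maximum allowed Euclidean distance for an acceptable match

matches = []
for i, x in enumerate(treated_x):
    distances = np.linalg.norm(control_x - x, axis=1)  # distance to every control
    j = int(np.argmin(distances))                      # nearest control
    if distances[j] <= caliper:                        # "close enough" rule
        matches.append((i, j))
    # otherwise the treated unit is pruned from the analysis

print(f"matched {len(matches)} of {len(treated_x)} treated units")
```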
See Dehejia and Wahba (1999) for an example where matching methods were able to produce estimates similar to those from a randomized controlled experiment. But, see Arceneaux, Gerber, and Green (2006) and Arceneaux, Gerber, and Green (2010) for examples where matching methods failed to reproduce an experimental benchmark.
Rosenbaum (2015) and Hernán and Robins (2016) offer other advice for discovering useful comparisons within big data sources.