Measurement in big data sources is much less likely to change behavior.
One challenge of social research is that people can change their behavior when they know that they are being observed by researchers. Social scientists generally call this reactivity (Webb et al. 1966). For example, people can be more generous in laboratory studies than in field studies because in the former they are very aware that they are being observed (Levitt and List 2007a). One aspect of big data that many researchers find promising is that participants are generally either not aware that their data are being captured or have become so accustomed to this data collection that it no longer changes their behavior. Because participants are nonreactive, many sources of big data can be used to study behavior that has not previously been amenable to accurate measurement. For example, Stephens-Davidowitz (2014) used the prevalence of racist terms in search engine queries to measure racial animus in different regions of the United States. The nonreactive and big (see section 2.3.1) nature of the search data enabled measurements that would be difficult using other methods, such as surveys.
Nonreactivity, however, does not ensure that these data are somehow a direct reflection of people’s behavior or attitudes. For example, as one respondent in an interview-based study said, “It’s not that I don’t have problems, I’m just not putting them on Facebook” (Newman et al. 2011). In other words, even though some big data sources are nonreactive, they are not always free of social desirability bias, the tendency for people to want to present themselves in the best possible way. Further, as I’ll describe later in the chapter, the behavior captured in big data sources is sometimes impacted by the goals of platform owners, an issue I’ll call algorithmic confounding. Finally, although nonreactivity is advantageous for research, tracking people’s behavior without their consent and awareness raises ethical concerns that I’ll describe in detail in chapter 6.
The three properties that I just described—big, always-on, and nonreactive—are generally, but not always, advantageous for social research. Next, I’ll turn to the seven properties of big data sources—incomplete, inaccessible, non-representative, drifting, algorithmically confounded, dirty, and sensitive—that generally, but not always, create problems for research.