No matter how big your big data, it probably doesn’t have the information you want.
Most big data sources are incomplete, in the sense that they don’t have the information that you will want for your research. This is a common feature of data that were created for purposes other than research. Many social scientists have already had to deal with incompleteness, such as an existing survey that didn’t ask the question they needed. Unfortunately, the problems of incompleteness tend to be more extreme in big data. In my experience, big data sources tend to be missing three types of information useful for social research: demographic information about participants, behavior on other platforms, and data to operationalize theoretical constructs.
Of the three kinds of incompleteness, the problem of incomplete data to operationalize theoretical constructs is the hardest to solve. In my experience, it is also the one most often overlooked. Roughly, theoretical constructs are abstract ideas that social scientists study, and operationalizing a theoretical construct means proposing some way to capture that construct with observable data. Unfortunately, this simple-sounding process often turns out to be quite difficult. For example, let’s imagine trying to empirically test the apparently simple claim that people who are more intelligent earn more money. In order to test this claim, you would need to measure “intelligence.” But what is intelligence? Gardner (2011) argued that there are actually eight different forms of intelligence. And are there procedures that could accurately measure any of these forms? Despite enormous amounts of work by psychologists, these questions still don’t have unambiguous answers.
Thus, even a relatively simple claim—people who are more intelligent earn more money—can be hard to assess empirically because it can be hard to operationalize theoretical constructs in data. Other examples of theoretical constructs that are important but hard to operationalize include “norms,” “social capital,” and “democracy.” Social scientists call the match between theoretical constructs and data construct validity (Cronbach and Meehl 1955). As this short list of constructs suggests, construct validity is a problem that social scientists have struggled with for a very long time. But in my experience, the problems of construct validity are even greater when working with data that were not created for the purposes of research (Lazer 2015).
When you are assessing a research result, one quick and useful way to check construct validity is to take the result, which is usually expressed in terms of constructs, and re-express it in terms of the data actually used. For example, consider two hypothetical studies that claim to show that people who are more intelligent earn more money. In the first study, the researcher found that people who score well on the Raven Progressive Matrices Test—a well-studied test of analytic intelligence (Carpenter, Just, and Shell 1990)—have higher reported incomes on their tax returns. In the second study, the researcher found that people on Twitter who use longer words are more likely to mention luxury brands. In both cases, the researchers could claim to have shown that people who are more intelligent earn more money. However, in the first study the theoretical constructs are well operationalized by the data, while in the second they are not. Further, as this example illustrates, more data does not automatically solve problems with construct validity: you should doubt the results of the second study whether it involves a million tweets, a billion tweets, or a trillion tweets. For researchers not familiar with the idea of construct validity, table 2.2 provides some examples of studies that have operationalized theoretical constructs using digital trace data.
*Table 2.2: Examples of studies that have operationalized theoretical constructs using digital trace data*

| Data source | Theoretical construct | References |
|---|---|---|
| Email logs from a university (metadata only) | Social relationships | Kossinets and Watts (2006); Kossinets and Watts (2009); De Choudhury et al. (2010) |
| Social media posts on Weibo | Civic engagement | Zhang (2016) |
| Email logs from a firm (metadata and complete text) | Cultural fit in an organization | Srivastava et al. (2017) |
Although the problem of incomplete data for operationalizing theoretical constructs is quite hard to solve, there are workable solutions to the other two common types of incompleteness: incomplete demographic information and incomplete information about behavior on other platforms. The first solution is to actually collect the data you need; I’ll tell you about that in chapter 3, on surveys. The second is to do what data scientists call user-attribute inference and what social scientists call imputation: researchers use the information that they have about some people to infer attributes of other people. The third is to combine multiple data sources, a process sometimes called record linkage (minimal sketches of imputation and record linkage appear at the end of this section). My favorite metaphor for record linkage was written by Dunn (1946) in the very first paragraph of the very first paper ever written on the topic:
> Each person in the world creates a Book of Life. This Book starts with birth and ends with death. Its pages are made up of records of the principal events in life. Record linkage is the name given to the process of assembling the pages of this book into a volume.
When Dunn wrote that passage, he was imagining that the Book of Life would include major life events like birth, marriage, divorce, and death. Now that so much information about people is recorded, however, the Book of Life could become an incredibly detailed portrait, if those different pages (i.e., our digital traces) can be bound together. This Book of Life could be a great resource for researchers. But it could also become a database of ruin (Ohm 2010), one that could be used for all kinds of unethical purposes, as I’ll describe in chapter 6 (Ethics).
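To make the second solution more concrete, here is a minimal sketch of user-attribute inference: train a model on the subset of people whose attributes are known (for example, from a linked survey), check it out of sample, and then impute the attribute for everyone else. The data file, the column names, and the choice of model here are all hypothetical; they stand in for whatever behavioral signals and demographic labels a real platform would provide.

```python
# A minimal sketch of user-attribute inference (imputation), assuming a
# hypothetical users.csv with behavioral features observed for everyone
# and a demographic label known only for some people.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

users = pd.read_csv("users.csv")  # hypothetical: one row per user

# Behavioral features observed for every user on the platform
features = ["posts_per_week", "median_post_length", "share_night_activity"]

labeled = users[users["gender"].notna()]    # demographics known
unlabeled = users[users["gender"].isna()]   # demographics to infer

model = LogisticRegression(max_iter=1000)

# Always check out-of-sample accuracy before trusting the imputations
scores = cross_val_score(model, labeled[features], labeled["gender"], cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")

model.fit(labeled[features], labeled["gender"])
users.loc[users["gender"].isna(), "gender_imputed"] = model.predict(
    unlabeled[features]
)
```

However well the model scores, imputed attributes carry its errors into every downstream analysis, so they should be treated as estimates rather than observations.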
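And here is an equally minimal sketch of the third solution, record linkage: binding two “pages” of the Book of Life by matching records on shared fields. Because names rarely agree exactly across sources, this sketch blocks on date of birth and compares names fuzzily. The datasets, field names, and similarity threshold are all hypothetical.

```python
# A minimal sketch of record linkage between two hypothetical datasets:
# tax_records.csv (name, dob, income) and survey.csv (name, dob, answers).
import pandas as pd
from difflib import SequenceMatcher

tax_records = pd.read_csv("tax_records.csv")
survey = pd.read_csv("survey.csv")

def similar(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

links = []
# Block on date of birth so we only compare plausible pairs,
# instead of every survey row against every tax record
for dob, block in survey.groupby("dob"):
    candidates = tax_records[tax_records["dob"] == dob]
    for _, s in block.iterrows():
        for _, t in candidates.iterrows():
            if similar(s["name"], t["name"]) > 0.9:  # hypothetical threshold
                # respondent_id and taxpayer_id are hypothetical ID columns
                links.append((s["respondent_id"], t["taxpayer_id"]))

linked = pd.DataFrame(links, columns=["respondent_id", "taxpayer_id"])
```

Blocking on date of birth keeps the number of comparisons manageable; real linkage systems replace the crude string ratio and fixed threshold with probabilistic matching and more robust similarity measures.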