In the analog age, collecting data about behavior (who does what, and when) was expensive and therefore relatively rare. Now, in the digital age, the behaviors of billions of people are recorded, stored, and analyzable. For example, every time you click on a website, make a call on your cell phone, or pay for something with your credit card, a digital record of your behavior is created and stored by a business. Because these data are a by-product of people’s everyday actions, they are often called digital traces. In addition to these traces held by businesses, governments also have incredibly rich data about both people and businesses, data that are often digitized and analyzable. Together these business and government records are often called big data.
The ever-rising flood of big data means that we have moved from a world where behavioral data was scarce to a world where behavioral data is plentiful. But, because these types of data are relatively new, an unfortunate amount of research using them looks like scientists blindly chasing available data. This chapter, instead, offers a principled approach to understanding the different sources of data and how they can be used. This richer understanding should help you better match your research questions to appropriate sources of data or, if such existing sources are lacking, convince you to collect your own data using the ideas in future chapters.
A first step to learning from big data is to realize that it is part of a broader category of data that has been used for social research for many years: observational data. Roughly, observational data is any data that results from observing a social system without intervening in any way. A crude way to think about it is that observational data is everything that does not involve talking with people (e.g., surveys, the topic of Chapter 3) or changing people’s environments (e.g., experiments, the topic of Chapter 4). Thus, in addition to business and government records, observational data also includes things like the text of newspaper articles and satellite photos.
This chapter has three parts. First, in Section 2.2, I describe big data in more detail and clarify a fundamental difference between it and the data that have generally been used for social research in the past. Then, in Section 2.3, I describe ten common characteristics of big data sources. Understanding these characteristics enables us to quickly recognize the strengths and weaknesses of existing sources and will help us harness the new sources that will be created in the future. Finally, in Section 2.4, I describe three main research strategies that you can use to learn from observational data: counting things, forecasting things, and approximating an experiment.