We can approximate experiments that we can’t do. Two approaches that especially benefit from the digital age are natural experiments and matching.
Many important scientific and policy questions are causal. Consider, for example, the following question: what is the effect of a job training program on wages? One way to answer this question would be a randomized controlled experiment in which workers were randomly assigned either to receive training or not to receive it. Then, researchers could estimate the effect of the training for these participants by simply comparing the wages of people who received the training to the wages of those who did not.
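To make the logic of this comparison concrete, here is a minimal sketch in Python. Everything in it is simulated, and the specific dollar figures are assumptions of the toy example, not real results; the point is only that, under random assignment, a simple difference in group means estimates the average effect of the training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated wages (in dollars) for 500 workers randomly assigned to
# training and 500 randomly assigned to control. All numbers are
# made up for illustration.
wages_trained = rng.normal(loc=52_000, scale=8_000, size=500)
wages_control = rng.normal(loc=48_000, scale=8_000, size=500)

# Because assignment was random, the two groups are comparable in
# expectation, so the difference in mean wages estimates the average
# effect of the training.
effect = wages_trained.mean() - wages_control.mean()
print(f"Estimated average effect of training: ${effect:,.0f}")
```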
This simple comparison is valid because of something that happened before the data were even collected: the randomization. Without randomization, the problem is much trickier. A researcher could compare the wages of people who voluntarily signed up for training to the wages of those who didn’t sign up. That comparison would probably show that people who received training earned more, but how much of this difference is because of the training, and how much is because people who sign up for training are different from those who don’t? In other words, is it fair to compare the wages of these two groups of people?
This concern about fair comparisons leads some researchers to believe that it is impossible to make causal estimates without running an experiment. This claim goes too far. While it is true that experiments provide the strongest evidence for causal effects, there are other strategies that can provide valuable causal estimates. Instead of thinking that causal estimates are either easy (in the case of experiments) or impossible (in the case of passively observed data), it is better to think of the strategies for making causal estimates as lying along a continuum from strongest to weakest (Figure 2.4). At the strongest end of the continuum are randomized controlled experiments. But these are often difficult to do in social research because many treatments require unrealistic amounts of cooperation from governments or companies; quite simply, there are many experiments that we cannot do. I will devote all of Chapter 4 to both the strengths and weaknesses of randomized controlled experiments, and I’ll argue that in some cases there are strong ethical reasons to prefer observational methods to experimental ones.
Moving along the continuum, there are situations where researchers have not explicitly randomized anything. In these settings, researchers attempt to learn experiment-like knowledge without actually doing an experiment. Naturally, this is tricky, but big data sources greatly improve our ability to make causal estimates in these situations.
Sometimes there are settings where randomness in the world happens to create something like an experiment for researchers. These designs are called natural experiments, and they will be considered in detail in Section 2.4.3.1. Two features of big data sources—their always-on nature and their size—greatly enhance our ability to learn from natural experiments when they occur.
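As a toy illustration of the idea, consider a hypothetical lottery that researchers did not run but that happens to nudge winners toward training. The sketch below is simulated from start to finish (the lottery, the compliance rates, and the assumed true effect of $3,000 are all inventions of the example), but it shows how naturally occurring randomness can be analyzed much like an experiment, even when the lottery only changes the chance of receiving training rather than determining it outright.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# A hypothetical lottery, outside the researcher's control, encourages
# training: winners take the training 80% of the time, losers 20%.
win = rng.random(n) < 0.5
trained = rng.random(n) < np.where(win, 0.8, 0.2)

# Simulated wages with an assumed true training effect of $3,000.
wages = rng.normal(48_000, 8_000, n) + 3_000 * trained

# The researcher did not randomize anything, but can compare lottery
# winners to losers and scale by how much winning moved take-up (the
# Wald estimator), recovering the effect of training for those whose
# participation the lottery changed.
effect_of_winning = wages[win].mean() - wages[~win].mean()
takeup_difference = trained[win].mean() - trained[~win].mean()
print(f"Estimated effect of training: ${effect_of_winning / takeup_difference:,.0f}")
```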
Moving further away from randomized controlled experiments, sometimes there is not even an event in nature that we can use to approximate a natural experiment. In these settings, we can carefully construct comparisons within non-experimental data in an attempt to approximate an experiment. These designs are called matching, and they will be considered in detail in Section 2.4.3.2. Like natural experiments, matching is a design that also benefits from big data sources. In particular, the massive size—both in terms of the number of cases and the type of information per case—greatly facilitates matching. The key difference between natural experiments and matching is that in natural experiments the researcher knows the process through which treatment was assigned and believes it to be random, whereas in matching the researcher must construct fair comparisons without that assurance.
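The details come in Section 2.4.3.2, but a small simulated sketch conveys the intuition. Every number below is an assumption of the example: more-educated workers both sign up for training more often and earn more regardless of training, so the naive comparison overstates the effect, while matching each trained worker to an untrained worker with similar education gets much closer to the assumed true effect of $3,000. The catch, which the sketch hides, is that matching can only adjust for differences the researcher actually observes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated non-experimental data: workers with more education are
# more likely to sign up for training AND earn more regardless of
# training, so trained and untrained workers are not comparable.
education = rng.normal(14, 2, n)                  # years of schooling
signup_prob = 1 / (1 + np.exp(-(education - 14)))
trained = rng.random(n) < signup_prob
wages = 2_000 * education + 3_000 * trained + rng.normal(0, 5_000, n)

# Matching: pair each trained worker with the untrained worker whose
# education is closest, then compare wages within pairs.
controls = np.flatnonzero(~trained)
diffs = []
for i in np.flatnonzero(trained):
    j = controls[np.argmin(np.abs(education[controls] - education[i]))]
    diffs.append(wages[i] - wages[j])

naive = wages[trained].mean() - wages[~trained].mean()
print(f"Naive estimate:    ${naive:,.0f}")            # biased upward by education
print(f"Matching estimate: ${np.mean(diffs):,.0f}")   # close to $3,000
```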
The concept of fair comparisons that motivated the desire to do experiments also underlies the two alternative approaches: natural experiments and matching. These approaches will enable you to estimate causal effects from passively observed data by discovering fair comparisons sitting inside the data that you already have.