I think the best way to understand experiments is the potential outcomes framework (which I discussed in the mathematical notes in chapter 2). The potential outcomes framework has a close relationship to the ideas from design-based sampling that I described in chapter 3 (Aronow and Middleton 2013; Imbens and Rubin 2015, chap. 6). This appendix has been written to emphasize that connection. This emphasis is a bit non-traditional, but I think that the connection between sampling and experiments is helpful: it means that if you know something about sampling, then you know something about experiments, and vice versa. As I’ll show in these notes, the potential outcomes framework reveals the strength of randomized controlled experiments for estimating causal effects, and it shows the limitations of what can be done with even perfectly executed experiments.
In this appendix, I’ll describe the potential outcomes framework, duplicating some of the material from the mathematical notes in chapter 2 in order to make these notes more self-contained. Then I’ll describe some helpful results about the precision of estimates of the average treatment effect, including a discussion of optimal allocation and difference-in-differences estimators. This appendix draws heavily on Gerber and Green (2012).
Potential outcomes framework
In order to illustrate the potential outcomes framework, let’s return to Restivo and van de Rijt’s experiment to estimate the effect of receiving a barnstar on future contributions to Wikipedia. The potential outcomes framework has three main elements: units, treatments, and potential outcomes. In the case of Restivo and van de Rijt, the units were deserving editors—those in the top 1% of contributors—who had not yet received a barnstar. We can index these editors by \(i = 1 \ldots N\). The treatments in their experiment were “barnstar” or “no barnstar,” and I’ll write \(W_i = 1\) if person \(i\) is in the treatment condition and \(W_i = 0\) otherwise. The third element of the potential outcomes framework is the most important: the potential outcomes. These are a bit more conceptually difficult because they involve “potential” outcomes—things that could happen. For each Wikipedia editor, one can imagine the number of edits that she would make in the treatment condition (\(Y_i(1)\)) and the number that she would make in the control condition (\(Y_i(0)\)).
Note that this choice of units, treatments, and outcomes defines what can be learned from this experiment. For example, without any additional assumptions, Restivo and van de Rijt cannot say anything about the effects of barnstars on all Wikipedia editors or on outcomes such as edit quality. In general, the choice of units, treatments, and outcomes must be based on the goals of the study.
Given these potential outcomes—which are summarized in table 4.5—one can define the causal effect of the treatment for person \(i\) as
\[ \tau_i = Y_i(1) - Y_i(0) \qquad(4.1)\]
To me, this equation is the clearest way to define a causal effect, and, although extremely simple, this framework turns out to be generalizable in many important and interesting ways (Imbens and Rubin 2015).
Table 4.5: Table of potential outcomes

| Person | Edits in treatment condition | Edits in control condition | Treatment effect |
|---|---|---|---|
| 1 | \(Y_1(1)\) | \(Y_1(0)\) | \(\tau_1\) |
| 2 | \(Y_2(1)\) | \(Y_2(0)\) | \(\tau_2\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N\) | \(Y_N(1)\) | \(Y_N(0)\) | \(\tau_N\) |
| Mean | \(\bar{Y}(1)\) | \(\bar{Y}(0)\) | \(\bar{\tau}\) |
If we define causality in this way, however, we run into a problem. In almost all cases, we don’t get to observe both potential outcomes. That is, a specific Wikipedia editor either received a barnstar or not. Therefore, we observe one of the potential outcomes—\(Y_i(1)\) or \(Y_i(0)\)—but not both. The inability to observe both potential outcomes is such a major problem that Holland (1986) called it the Fundamental Problem of Causal Inference.
Fortunately, when we are doing research, we don’t just have one person; we have many people, and this offers a way around the Fundamental Problem of Causal Inference. Rather than attempting to estimate the individual-level treatment effect, we can estimate the average treatment effect:
\[ \text{ATE} = \frac{1}{N} \sum_{i=1}^N \tau_i \qquad(4.2)\]
This is still expressed in terms of the \(\tau_i\), which are unobservable, but with some algebra (eq. 2.8 of Gerber and Green (2012)) we get
\[ \text{ATE} = \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0) \qquad(4.3)\]
Equation 4.3 shows that if we can estimate the population average outcome under treatment (\(N^{-1} \sum_{i=1}^N Y_i(1)\)) and the population average outcome under control (\(N^{-1} \sum_{i=1}^N Y_i(0)\)), then we can estimate the average treatment effect, even without estimating the treatment effect for any particular person.
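To make table 4.5 and eqs. 4.1–4.3 concrete, here is a minimal sketch in Python; the five editors and their edit counts are entirely made up for illustration, not taken from Restivo and van de Rijt:

```python
import numpy as np

# Hypothetical potential outcomes for N = 5 editors (illustrative numbers only).
Y1 = np.array([10, 5, 12, 3, 8])  # Y_i(1): edits if editor i gets a barnstar
Y0 = np.array([7, 5, 9, 1, 8])    # Y_i(0): edits if editor i does not

tau = Y1 - Y0     # individual treatment effects (eq. 4.1)
ate = tau.mean()  # average treatment effect (eq. 4.2)

# Eq. 4.3: the ATE equals the difference between the two average outcomes.
assert np.isclose(ate, Y1.mean() - Y0.mean())
print(tau, ate)   # [3 0 3 2 0] 1.6
```

Of course, this sketch cheats in exactly the way the Fundamental Problem of Causal Inference forbids: it uses both potential outcomes for every person. Real data never look like this.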
Now that I’ve defined our estimand—the thing we are trying to estimate—I’ll turn to how we can actually estimate it with data. I like to think about this estimation challenge as a sampling problem (think back to the mathematical notes in chapter 3). If we randomly pick some people to observe in the treatment condition and randomly pick others to observe in the control condition, then we can estimate the average outcome in each condition:
\[ \widehat{\text{ATE}} = \underbrace{\frac{1}{N_t} \sum_{i:W_i=1} Y_i(1)}_{\text{average edits, treatment}} - \underbrace{\frac{1}{N_c} \sum_{i:W_i=0} Y_i(0)}_{\text{average edits, control}} \qquad(4.4)\]
where \(N_t\) and \(N_c\) are the numbers of people in the treatment and control conditions. Equation 4.4 is a difference-of-means estimator. Because of the sampling design, we know that the first term is an unbiased estimator for the average outcome under treatment and the second term is an unbiased estimator for the average outcome under control.
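Here is a minimal simulation sketch of eq. 4.4 in Python (the distributions and effect size are my own illustrative assumptions, not Restivo and van de Rijt’s data). It generates potential outcomes, randomly assigns editors to conditions, and shows that the difference-of-means estimator is centered on the true ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000
Y0 = rng.poisson(lam=5, size=N)       # Y_i(0): potential edits under control
Y1 = Y0 + rng.poisson(lam=2, size=N)  # Y_i(1): the barnstar adds some edits
true_ate = (Y1 - Y0).mean()

def one_experiment():
    # Randomly assign exactly half of the editors to treatment.
    W = rng.permutation(np.repeat([1, 0], N // 2))
    # Fundamental Problem: we see Y_i(1) only for treated, Y_i(0) only for control.
    return Y1[W == 1].mean() - Y0[W == 0].mean()  # eq. 4.4

estimates = [one_experiment() for _ in range(5_000)]
print(true_ate, np.mean(estimates))  # the two nearly agree: unbiasedness
```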
Another way to think about what randomization enables is that it ensures that the comparison between treatment and control groups is fair because randomization ensures that the two groups will resemble each other. This resemblance holds for things we have measured (say the number of edits in the 30 days before the experiment) and the things we have not measured (say gender). This ability to ensure balance on both observed and unobserved factors is critical. To see the power of automatic balancing on unobserved factors, let’s imagine that future research finds that men are more responsive to awards than women. Would that invalidate the results of Restivo and van de Rijt’s experiment? No. By randomizing, they ensured that all unobservables would be balanced, in expectation. This protection against the unknown is very powerful, and it is an important way that experiments are different from the non-experimental techniques described in chapter 2.
In addition to defining the treatment effect for an entire population, it is possible to define a treatment effect for a subset of people. This is typically called a conditional average treatment effect (CATE). For example, in the study by Restivo and van de Rijt, let’s imagine that \(X_i\) is whether the editor was above or below the median number of edits during the 90 days before the experiment. One could calculate the treatment effect separately for these light and heavy editors.
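A sketch of how such a subgroup estimate might be computed, again with made-up data and a hypothetical covariate for prior editing activity:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000
x = rng.poisson(lam=20, size=N)   # hypothetical edits in the prior 90 days
heavy = x > np.median(x)          # X_i: above-median ("heavy") editors
Y0 = rng.poisson(lam=5, size=N)
Y1 = Y0 + np.where(heavy, 3, 1)   # assume the award helps heavy editors more
W = rng.permutation(np.repeat([1, 0], N // 2))  # random assignment
Yobs = np.where(W == 1, Y1, Y0)   # only one potential outcome is observed

# Estimate the CATE separately within each subgroup (difference of means).
for name, g in [("heavy", heavy), ("light", ~heavy)]:
    cate = Yobs[g & (W == 1)].mean() - Yobs[g & (W == 0)].mean()
    print(name, round(cate, 2))
```

Because assignment is random within each subgroup as well, the same difference-of-means logic applies; only the set of people being averaged over changes.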
The potential outcomes framework is a powerful way to think about causal inference and experiments. However, there are two additional complexities that you should keep in mind. These two complexities are often lumped together under the term Stable Unit Treatment Value Assumption (SUTVA). The first part of SUTVA is the assumption that the only thing that matters for person \(i\)’s outcome is whether that person was in the treatment or control condition. In other words, it is assumed that person \(i\) is not impacted by the treatment given to other people. This is sometimes called “no interference” or “no spillovers”, and can be written as:
\[ Y_i(W_i, \mathbf{W_{-i}}) = Y_i(W_i) \quad \forall \quad \mathbf{W_{-i}} \qquad(4.5)\]
where \(\mathbf{W_{-i}}\) is a vector of treatment statuses for everyone except person \(i\). One way that this can be violated is if the treatment from one person spills over onto another person, either positively or negatively. Returning to Restivo and van de Rijt’s experiment, imagine two friends, \(i\) and \(j\), where person \(i\) receives a barnstar and \(j\) does not. If \(i\) receiving the barnstar causes \(j\) to edit more (out of a sense of competition) or edit less (out of a sense of despair), then SUTVA has been violated. It can also be violated if the impact of the treatment depends on the total number of other people receiving the treatment. For example, if Restivo and van de Rijt had given out 1,000 or 10,000 barnstars instead of 100, this might have impacted the effect of receiving a barnstar.
The second issue lumped into SUTVA is the assumption that the only relevant treatment is the one that the researcher delivers; this assumption is sometimes called no hidden treatments or excludability. For example, in Restivo and van de Rijt, it might have been the case that by giving a barnstar the researchers caused editors to be featured on a popular editors page and that it was being on the popular editors page—rather than receiving a barnstar—that caused the change in editing behavior. If this is true, then the effect of the barnstar is not distinguishable from the effect of being on the popular editors page. Of course, it is not clear whether, from a scientific perspective, this should be considered attractive or unattractive. That is, you could imagine a researcher saying that the effect of receiving a barnstar includes all the subsequent treatments that the barnstar triggers. Or you could imagine a situation where a researcher would want to isolate the effect of barnstars from all these other things. One way to think about it is to ask whether there is anything that leads to what Gerber and Green (2012, p. 41) call a “breakdown in symmetry.” In other words, is there anything other than the treatment that causes people in the treatment and control conditions to be treated differently? Concerns about symmetry breaking are what lead patients in the control group in medical trials to take a placebo pill. That way, researchers can be sure that the only difference between the two conditions is the actual medicine and not the experience of taking the pill.
For more on SUTVA, see section 2.7 of Gerber and Green (2012), section 2.5 of Morgan and Winship (2014), and section 1.6 of Imbens and Rubin (2015).
Precision
In the previous section, I described how to estimate the average treatment effect. In this section, I’ll provide some results about the variability of those estimates.
If you think about estimating the average treatment effect as estimating the difference between two sample means, then it is possible to show that the standard error of the average treatment effect is:
\[ SE(\widehat{\text{ATE}}) = \sqrt{\frac{1}{N-1} \left(\frac{m \text{Var}(Y_i(0))}{N-m} + \frac{(N-m) \text{Var}(Y_i(1))}{m} + 2\text{Cov}(Y_i(0), Y_i(1)) \right)} \qquad(4.6)\]
where \(m\) people are assigned to treatment and \(N-m\) to control (see Gerber and Green (2012), eq. 3.4). Thus, when thinking about how many people to assign to treatment and how many to assign to control, you can see that if \(\text{Var}(Y_i(0)) \approx \text{Var}(Y_i(1))\), then you want \(m \approx N / 2\), as long as the costs of treatment and control are the same. Equation 4.6 clarifies why the design of Bond and colleagues’ (2012) experiment about the effects of social information on voting (figure 4.18) was statistically inefficient. Recall that it had 98% of participants in the treatment condition. This meant that the mean behavior in the control condition was not estimated as accurately as it could have been, which in turn meant that the estimated difference between the treatment and control conditions was not estimated as accurately as it could have been. For more on optimal allocation of participants to conditions, including when costs differ between conditions, see List, Sadoff, and Wagner (2011).
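To make the allocation intuition concrete, here is a sketch that evaluates eq. 4.6 across choices of \(m\); the variances and covariance are made-up values, chosen so that \(\text{Var}(Y_i(0)) = \text{Var}(Y_i(1))\):

```python
import numpy as np

def se_ate(m, N, var0, var1, cov01):
    # Standard error of the estimated ATE (eq. 4.6),
    # with m people in treatment and N - m in control.
    return np.sqrt((m * var0 / (N - m) + (N - m) * var1 / m + 2 * cov01) / (N - 1))

N = 1_000
for m in [50, 250, 500, 750, 950]:
    print(m, round(se_ate(m, N, var0=4.0, var1=4.0, cov01=2.0), 4))
# The SE is smallest at m = 500 and grows sharply for lopsided allocations,
# which is why a design with 98% of participants in treatment is inefficient.
```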
Finally, in the main text, I described how a difference-in-differences estimator, which is typically used in a mixed design, can lead to smaller variance than a difference-in-means estimator, which is typically used in a between-subjects design. If \(X_i\) is the value of the outcome before treatment, then the quantity that we are trying to estimate with the difference-in-differences approach is:
\[ \text{ATE}' = \frac{1}{N} \sum_{i=1}^N ((Y_i(1) - X_i) - (Y_i(0) - X_i)) \qquad(4.7)\]
The standard error of that quantity is (see Gerber and Green (2012), eq. 4.4)
\[ SE(\widehat{\text{ATE}'}) = \sqrt{\frac{1}{N-1} \left( \text{Var}(Y_i(0) - X_i) + \text{Var}(Y_i(1) - X_i) + 2\text{Cov}(Y_i(0) - X_i, Y_i(1) - X_i) \right)} \qquad(4.8)\]
A comparison of eq. 4.6 and eq. 4.8 reveals that the difference-in-differences approach will have a smaller standard error when (see Gerber and Green (2012), eq. 4.6)
\[ \frac{\text{Cov}(Y_i(0), X_i)}{\text{Var}(X_i)} + \frac{\text{Cov}(Y_i(1), X_i)}{\text{Var}(X_i)} > 1\qquad(4.9)\]
Roughly, when \(X_i\) is very predictive of \(Y_i(1)\) and \(Y_i(0)\), then you can get more precise estimates from a difference-in-differences approach than from a difference-of-means one. One way to think about this in the context of Restivo and van de Rijt’s experiment is that there is a lot of natural variation in the amount that people edit, and this makes comparing the treatment and control conditions difficult: it is hard to detect a relatively small effect in noisy outcome data. But if you difference out this naturally occurring variability, then there is much less variability, and that makes it easier to detect a small effect.
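Here is a sketch of that intuition in simulation (purely illustrative numbers): when a pre-treatment measure \(X_i\) is highly predictive of both potential outcomes, the condition in eq. 4.9 holds and the difference-in-differences estimates are far less variable than the difference-of-means estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
N, true_ate, reps = 1_000, 2.0, 2_000
X = rng.normal(50, 10, size=N)     # pre-treatment edit counts (lots of variation)
Y0 = X + rng.normal(0, 2, size=N)  # X is very predictive of both outcomes,
Y1 = Y0 + true_ate                 # so eq. 4.9's left side is about 2 > 1

dim, did = [], []
for _ in range(reps):
    W = rng.permutation(np.repeat([1, 0], N // 2))
    Yobs = np.where(W == 1, Y1, Y0)
    dim.append(Yobs[W == 1].mean() - Yobs[W == 0].mean())  # difference of means
    D = Yobs - X                                           # difference out X_i
    did.append(D[W == 1].mean() - D[W == 0].mean())        # difference in differences

print(np.std(dim), np.std(did))  # the DiD estimates vary much less
```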
See Frison and Pocock (1992) for a precise comparison of difference-of-means, difference-in-differences, and ANCOVA-based approaches in the more general setting where there are multiple measurements pre-treatment and post-treatment. In particular, they strongly recommend ANCOVA, which I have not covered here. Further, see McKenzie (2012) for a discussion of the importance of multiple post-treatment outcome measures.