Randomized controlled experiments have four main ingredients: recruitment of participants, randomization of treatment, delivery of treatment, and measurement of outcomes. The digital age does not change the fundamental nature of experimentation, but it does make it easier logistically. For example, in the past, it might have been difficult to measure the behavior of millions of people, but that is now routinely happening in many digital systems. Researchers who can figure out how to harness these new opportunities will be able to run experiments that were impossible previously.
To make this all a bit more concrete—both what has stayed the same and what has changed—let’s consider an experiment by Michael Restivo and Arnout van de Rijt (2012). They wanted to understand the effect of informal peer rewards on editorial contributions to Wikipedia. In particular, they studied the effects of barnstars, an award that any Wikipedian can give to any other Wikipedian to acknowledge hard work and due diligence. Restivo and van de Rijt gave barnstars to 100 deserving Wikipedians. Then, they tracked the recipients’ subsequent contributions to Wikipedia over the next 90 days. Much to their surprise, the people to whom they awarded barnstars tended to make fewer edits after receiving one. In other words, the barnstars seemed to be discouraging rather than encouraging contribution.
Fortunately, Restivo and van de Rijt were not running a “perturb and observe” experiment; they were running a randomized controlled experiment. So, in addition to choosing 100 top contributors to receive a barnstar, they also picked 100 top contributors to whom they did not give one. These 100 served as a control group. And, critically, who was in the treatment group and who was in the control group was determined randomly.
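To make the mechanics of random assignment concrete, here is a minimal sketch in Python. The editor names, the fixed seed, and the group size of 100 are illustrative assumptions; the paper describes the actual selection procedure.

```python
import random

# Hypothetical list of the top 200 contributors (names are made up).
top_contributors = [f"editor_{i}" for i in range(200)]

# Shuffle with a fixed seed so the assignment is reproducible.
rng = random.Random(42)
rng.shuffle(top_contributors)

# The first 100 receive a barnstar (treatment); the rest are controls.
treatment_group = top_contributors[:100]
control_group = top_contributors[100:]
```

Because membership in each group is determined purely by chance, the two groups are comparable before any treatment is delivered.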
When Restivo and van de Rijt looked at the behavior of people in the control group, they found that their contributions were decreasing too. Further, when Restivo and van de Rijt compared people in the treatment group (i.e., those who received barnstars) with people in the control group, they found that people in the treatment group contributed about 60% more. In other words, the contributions of both groups were decreasing, but those of the control group were decreasing much faster.
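The comparison itself is just a difference in group means. The snippet below illustrates it with invented edit counts; the numbers are not the study's data and are chosen only to show the calculation.

```python
import statistics

# Made-up 90-day edit counts for illustration only (not the study's data).
treatment_edits = [120, 85, 200, 150, 90]
control_edits = [75, 60, 110, 95, 55]

mean_t = statistics.mean(treatment_edits)
mean_c = statistics.mean(control_edits)

# Because assignment was random, the gap between the group means can be
# attributed to the barnstar rather than to pre-existing differences.
print(f"treatment mean: {mean_t:.1f}, control mean: {mean_c:.1f}")
print(f"relative difference: {(mean_t - mean_c) / mean_c:.0%}")
```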
As this study illustrates, the control group in experiments is critical in a way that is somewhat paradoxical. In order to precisely measure the effect of barnstars, Restivo and van de Rijt needed to observe people who did not receive barnstars. Researchers who are not familiar with experiments often fail to appreciate the value of the control group. If Restivo and van de Rijt had not had a control group, they would have drawn exactly the wrong conclusion. Control groups are so important that the CEO of a major casino company has said that there are only three ways that employees can be fired from his company: for theft, for sexual harassment, or for running an experiment without a control group (Schrage 2011).
Restivo and van de Rijt’s study illustrates the four main ingredients of an experiment: recruitment, randomization, intervention, and outcomes. Together, these four ingredients allow scientists to move beyond correlations and measure the causal effect of treatments. Specifically, randomization means that, on average, people in the treatment and control groups will be similar before the intervention. This is important because it means that any difference in outcomes between the two groups can be attributed to the treatment and not to a confounder.
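This logic can be stated compactly in potential-outcomes notation, a standard framework that I am adding here as a gloss (it is not notation the authors use). Let $Y_i$ be editor $i$'s observed edit count over the 90 days. With random assignment, the difference in group means is an unbiased estimate of the average treatment effect:

\[
\widehat{\mathrm{ATE}} \;=\; \frac{1}{n_T}\sum_{i \in T} Y_i \;-\; \frac{1}{n_C}\sum_{i \in C} Y_i ,
\]

where $T$ and $C$ are the treatment and control groups and $n_T = n_C = 100$ in this design.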
In addition to being a nice illustration of the mechanics of experiments, Restivo and van de Rijt’s study also shows that the logistics of digital experiments can be completely different from those of analog experiments. In Restivo and van de Rijt’s experiment, it was easy to give the barnstar to anyone, and it was easy to track the outcome—number of edits—over an extended period of time (because edit history is automatically recorded by Wikipedia). This ability to deliver treatments and measure outcomes at essentially no cost is qualitatively unlike experiments in the past. Although this experiment involved 200 people, it could have been run with 2,000 or even 20,000 people. The main thing preventing the researchers from scaling up their experiment by a factor of 100 was not cost; it was ethics. That is, Restivo and van de Rijt didn’t want to give barnstars to undeserving editors, and they didn’t want their experiment to disrupt the Wikipedia community (Restivo and van de Rijt 2012, 2014). I’ll return to some of the ethical considerations raised by experiments later in this chapter and in chapter 6.
In conclusion, Restivo and van de Rijt’s experiment clearly shows that while the basic logic of experimentation has not changed, the logistics of digital-age experiments can be dramatically different. Next, in order to more clearly isolate the opportunities created by these changes, I’ll compare the experiments that researchers can run now with the kinds of experiments that have been done in the past.