Validity refers to how much the results of an experiment support a more general conclusion.
No experiment is perfect, and researchers have developed an extensive vocabulary to describe possible problems. Validity refers to the extent to which the results of a particular experiment support some more general conclusion. Social scientists have found it helpful to split validity into four main types: statistical conclusion validity, internal validity, construct validity, and external validity (Shadish, Cook, and Campbell 2001, Ch. 2). Mastering these concepts will provide you with a mental checklist for critiquing and improving the design and analysis of an experiment, and it will help you communicate with other researchers.
Statistical conclusion validity centers around whether the statistical analysis of the experiment was done correctly. In the context of Schultz et al. (2007), such a question might center on whether they computed their p-values correctly. Statistical analysis is beyond the scope of this book, but I can say that the statistical principles needed to design and analyze experiments have not changed in the digital age. However, the different data environment in digital experiments does create new statistical opportunities (e.g., using machine learning methods to estimate heterogeneity of treatment effects (Imai and Ratkovic 2013)) and new computational challenges (e.g., blocking in massive experiments (Higgins, Sävje, and Sekhon 2016)).
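To give a flavor of the kinds of calculations at stake, here is a minimal sketch using simulated data; the variable names, effect sizes, and subgroup split are all illustrative assumptions, not the analysis in Schultz et al. (2007). It estimates an average treatment effect with a difference in means and a two-sample t-test, and then compares the effect across subgroups as a crude stand-in for the more sophisticated heterogeneity-of-treatment-effects methods cited above.

```python
# A minimal, simulated sketch of the kinds of calculations that statistical
# conclusion validity is about: an average treatment effect with a p-value,
# plus a crude subgroup comparison standing in for heterogeneity estimation.
# All numbers and variable names are illustrative assumptions, not data from
# Schultz et al. (2007).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000

# Simulated baseline electricity use (kWh per day) and a baseline subgroup
baseline = rng.normal(30, 5, size=n)
high_user = baseline > 30

# Random assignment to treatment (e.g., receiving a normative message)
treated = rng.integers(0, 2, size=n).astype(bool)

# Simulated outcome: the treatment lowers use, more so for high baseline users
outcome = baseline - treated * (1.0 + 0.5 * high_user) + rng.normal(0, 2, size=n)

# Average treatment effect: difference in means, with a two-sample t-test
ate = outcome[treated].mean() - outcome[~treated].mean()
result = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"Estimated average treatment effect: {ate:.2f} kWh (p = {result.pvalue:.3f})")

# Crude heterogeneity check: estimate the effect separately within subgroups
for label, group in [("high baseline users", high_user),
                     ("low baseline users", ~high_user)]:
    effect = outcome[group & treated].mean() - outcome[group & ~treated].mean()
    print(f"Effect among {label}: {effect:.2f} kWh")
```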
Internal validity centers around whether the experimental procedures were performed correctly. Returning to the experiment of Schultz et al. (2007), questions about internal validity could center around the randomization, delivery of the treatment, and measurement of outcomes. For example, you might be concerned that the research assistants did not read the electric meters reliably. In fact, Schultz and colleagues were worried about this problem and they had a sample of meters read twice; fortunately, the results were essentially identical. In general, Schultz and colleagues’ experiment appears to have high internal validity, but this is not always the case; complex field and online experiments often run into problems actually delivering the right treatment to the right people and measuring the outcomes for everyone. Fortunately, the digital age can help reduce concerns about internal validity because it makes it easier to ensure that the treatment is delivered as designed to those who are supposed to receive it and to measure outcomes for all participants.
Construct validity centers around the match between the data and the theoretical constructs. As discussed in Chapter 2, constructs are abstract concepts that social scientists reason about. Unfortunately, these abstract concepts don’t always have clear definitions and measurements. Returning to Schultz et al. (2007), the claim that injunctive social norms can lower electricity use requires researchers to design a treatment that would manipulate “injunctive social norms” (e.g., an emoticon) and to measure “electricity use”. In analog experiments, many researchers designed their own treatments and measured their own outcomes. This approach ensures that, as much as possible, the experiments match the abstract constructs being studied. In digital experiments where researchers partner with companies or governments to deliver treatments and use always-on data systems to measure outcomes, the match between the experiment and the theoretical constructs may be less tight. Thus, I expect that construct validity will tend to be a bigger concern in digital experiments than in analog experiments.
Finally, external validity centers around whether the results of this experiment would generalize to other situations. Returning to Schultz et al. (2007), one could ask, will this same idea—providing people information about their energy usage relative to their peers and a signal of injunctive norms (e.g., an emoticon)—reduce energy usage if it were done in a different way in a different setting? For most well-designed and well-run experiments, concerns about external validity are the hardest to address. In the past, these debates about external validity were frequently just a bunch of people sitting in a room trying to imagine what would have happened if the procedures were done in a different way, or in a different place, or with different people. Fortunately, the digital age enables researchers to move beyond these data-free speculations and assess external validity empirically.
Because the results from Schultz et al. (2007) were so exciting, a company named Opower partnered with utilities in the United States to deploy the treatment more widely. Based on the design of Schultz et al. (2007), Opower created customized Home Energy Reports that had two main modules, one showing a household’s electricity usage relative to its neighbors with an emoticon and one providing tips for lowering energy usage (Figure 4.6). Then, in partnership with researchers, Opower ran randomized controlled experiments to assess the impact of the Home Energy Reports. Even though the treatments in these experiments were typically delivered physically—usually through old-fashioned snail mail—the outcome was measured using digital devices in the physical world (e.g., power meters). Rather than manually collecting this information with research assistants visiting each house, the Opower experiments were all done in partnership with power companies, enabling the researchers to access the power readings. Thus, these partially digital field experiments were run at a massive scale at low variable cost.
In a first set of experiments involving 600,000 households served by 10 utility companies around the United States, Allcott (2011) found the Home Energy Report lowered electricity consumption by 1.7%. In other words, the results from the much larger, more geographically diverse study were qualitatively similar to the results from Schultz et al. (2007). But the effect size was smaller: in Schultz et al. (2007) the households in the descriptive and injunctive norms condition (the one with the emoticon) reduced their electricity usage by 5%. The precise reason for this difference is unknown, but Allcott (2011) speculated that receiving a handwritten emoticon as part of a study sponsored by a university might have a larger effect on behavior than receiving a printed emoticon as part of a mass-produced report from a power company.
Further, in subsequent research, Allcott (2015) reported on an additional 101 experiments involving an additional 8 million households. In these next 101 experiments, the Home Energy Report continued to cause people to lower their electricity consumption, but the effects were even smaller. The precise reason for this decline is not known, but Allcott (2015) speculated that the effectiveness of the report appeared to be declining over time because it was actually being applied to different types of participants. More specifically, utilities in more environmentalist areas were more likely to adopt the program earlier, and their customers were more responsive to the treatment. As utilities with less environmentally inclined customers adopted the program, its effectiveness appeared to decline. Thus, just as randomization in experiments ensures that the treatment and control groups are similar, random sampling of research sites ensures that the estimates can be generalized from one group of participants to a more general population (think back to Chapter 3 about sampling). If research sites are not sampled randomly, then generalization—even from a perfectly designed and conducted experiment—can be problematic.
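A small simulation with made-up numbers can illustrate this generalization problem: if sites differ in how responsive their customers are and the most responsive sites adopt first, an estimate based on the early adopters overstates the average effect across all sites, while an estimate based on randomly sampled sites does not. The numbers below are purely hypothetical, not Allcott’s data.

```python
# A hypothetical simulation (not Allcott's data) of why non-random adoption of
# research sites can bias generalization: if the most environmentally inclined
# sites adopt first, estimates from early adopters overstate the effect that
# would be seen across all sites, while randomly sampled sites do not.
import numpy as np

rng = np.random.default_rng(1)
n_sites = 100

# Each site has its own true treatment effect (% reduction in electricity use),
# assumed here to be larger where customers are more environmentally inclined.
environmentalism = rng.uniform(0, 1, size=n_sites)
site_effect = 0.5 + 2.0 * environmentalism  # roughly 0.5% to 2.5%

# Non-random roll-out: the 10 most environmentally inclined sites adopt first
early_adopters = np.argsort(environmentalism)[-10:]

# Random sampling of 10 sites, for comparison
random_sites = rng.choice(n_sites, size=10, replace=False)

print(f"True average effect across all sites:  {site_effect.mean():.2f}%")
print(f"Estimate from early-adopting sites:    {site_effect[early_adopters].mean():.2f}%")
print(f"Estimate from randomly sampled sites:  {site_effect[random_sites].mean():.2f}%")
```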
Together, these 111 experiments—10 in Allcott (2011) and 101 in Allcott (2015)—involved about 8.5 million households from all over the United States. They consistently show that Home Energy Reports reduce average electricity consumption, a result that supports the original findings of Schultz and colleagues from 300 homes in California. Beyond just replicating these original results, the follow-up experiments also show that the size of the effect varies by location. This set of experiments also illustrates two more general points about partially digital field experiments. First, researchers will be able to address concerns about external validity empirically when the cost of running experiments is low, and this can occur if the outcome is already being measured by an always-on data system. This suggests that researchers should be on the lookout for other interesting and important behaviors that are already being recorded, and then design experiments on top of this existing measurement infrastructure. Second, this set of experiments reminds us that digital field experiments are not just online; increasingly, I expect that they will be everywhere, with many outcomes measured by sensors in the built environment.
The four types of validity—statistical conclusion validity, internal validity, construct validity, and external validity—provide a mental checklist to help researchers assess whether the results from a particular experiment support a more general conclusion. Compared with experiments in the analog age, experiments in the digital age should make it easier to address external validity empirically and to ensure internal validity. On the other hand, issues of construct validity will probably be more challenging in digital age experiments (although that was not the case with the Opower experiments).