Lab experiments offer control, field experiments offer realism, and digital field experiments combine control and realism at scale.
Experiments come in many different shapes and sizes. But, despite these differences, researchers have found it helpful to organize experiments along a continuum between lab experiments and field experiments. Now, however, researchers should also organize experiments along a continuum between analog experiments and digital experiments. This two-dimensional design space will help you understand the strengths and weaknesses of different approaches and suggest areas of greatest opportunity (Figure 4.1).
In the past, the main way that researchers organized experiments was along the lab-field dimension. The majority of experiments in the social sciences are lab experiments where undergraduate students perform strange tasks in a lab for course credit. This type of experiment dominates research in psychology because it enables researchers to create very specific treatments designed to test very specific theories about social behavior. For certain problems, however, something feels a bit strange about drawing strong conclusions about human behavior from such unusual people performing such unusual tasks in such an unusual setting. These concerns have led to a movement toward field experiments. Field experiments combine the strong design of randomized controlled experiments with more representative groups of participants, performing more common tasks, in more natural settings.
Although some people think of lab and field experiments as competing methods, it is best to think of them as complementary methods with different strengths and weaknesses. For example, Correll, Benard, and Paik (2007) used both a lab experiment and a field experiment in an attempt to find the sources of the “motherhood penalty.” In the United States, mothers earn less money than childless women, even when comparing women with similar skills working in similar jobs. There are many possible explanations for this pattern, and one is that employers are biased against mothers. (Interestingly, the opposite seems to be true for fathers: they tend to earn more than comparable childless men.) In order to assess possible bias against mothers, Correll and colleagues ran two experiments: one in the lab and one in the field.
First, in a lab experiment Correll and colleagues told participants, who were college undergraduates, that a California-based start-up communications company was conducting an employment search for a person to lead its new East Coast marketing department. Students were told that the company wanted their help in the hiring process, and they were asked to review resumes of several potential candidates and to rate the candidates on a number of dimensions such as their intelligence, warmth, and commitment to work. Further, the students were asked if they would recommend hiring the applicant and what they would recommend as a starting salary. Unbeknownst to the students, however, the resumes were specifically constructed to be similar except for one thing: some of the resumes signaled motherhood (by listing involvement in a parent-teacher association) and some did not. Correll found that students were less likely to recommend hiring the mothers and offered them a lower starting salary. Further, through a statistical analysis of both the ratings and the hiring-related decisions, Correll found that mothers’ disadvantages were largely explained by the fact that mothers were rated lower in terms of competence and commitment. In other words, Correll argues that these traits are the mechanism through which mothers are disadvantaged. Thus, this lab experiment allowed Correll and colleagues to measure a causal effect and provide a possible explanation for that effect.
Of course, one might be skeptical about drawing conclusions about the entire US labor market based on the decisions of a few hundred undergraduates who have probably never had a full-time job, let alone hired people. Therefore, Correll and colleagues also conducted a complementary field experiment. The researchers responded to hundreds of advertised job openings by sending in fake cover letters and resumes. Similar to the materials shown to the undergraduates, some resumes signaled motherhood and some did not. Correll and colleagues found that mothers were less likely to get called back for interviews than equally qualified childless women. In other words, real employers making consequential decisions in a natural setting behaved much like the undergraduates. Did they make similar decisions for the same reason? Unfortunately, we don’t know. The researchers were not able to ask the employers to rate the candidates or explain their decisions.
This pair of experiments reveals a lot about lab and field experiments in general. Lab experiments offer researchers near-total control of the environment in which participants are making decisions. So, for example, in the lab experiment, Correll was able to ensure that all the resumes were read in a quiet setting; in the field experiment, some of the resumes might not have even been read. Further, because participants in the lab setting know that they are being studied, researchers are often able to collect additional data that can help them understand why participants are making their decisions. For example, Correll asked participants in the lab experiment to rate the candidates on different dimensions. This kind of process data could help researchers understand the mechanisms behind differences in how participants treat the resumes.
On the other hand, these exact same characteristics that I just described as advantages are also sometimes considered disadvantages. Researchers who prefer field experiments argue that participants in lab experiments could act very differently when they are being closely observed. For example, in the lab experiment participants might have guessed the goal of the research and altered their behavior so as not to appear biased. Further, researchers who prefer field experiments might argue that small differences on resumes can only stand out in a very clean, sterile lab environment, and thus the lab experiment will over-estimate the effect of motherhood on real hiring decisions. Finally, many proponents of field experiments criticize lab experiments’ reliance on WEIRD participants: mainly students from Western, Educated, Industrialized, Rich, and Democratic countries (Henrich, Heine, and Norenzayan 2010). The experiments by Correll and colleagues (2007) illustrate the two extremes on the lab-field continuum. In between these two extremes there are a variety of hybrid designs, including approaches such as bringing non-students into a lab or going into the field but still having participants perform an unusual task.
In addition to the lab-field dimension that has existed in the past, the digital age means that researchers now have a second major dimension along which experiments can vary: analog-digital. Just as there are pure lab experiments, pure field experiments, and a variety of hybrids in between, there are pure analog experiments, pure digital experiments, and a variety of hybrids. It is tricky to offer a formal definition of this dimension, but a useful working definition is that fully digital experiments are experiments that make use of digital infrastructure to recruit participants, randomize, deliver treatments, and measure outcomes. For example, Restivo and van de Rijt’s (2012) study of barnstars and Wikipedia was a fully digital experiment because it used digital systems for all four of these steps. Likewise, fully analog experiments are experiments that do not make use of digital infrastructure for any of these four steps. Many of the classic experiments in psychology are analog experiments. In between these two extremes there are partially digital experiments that use a combination of analog and digital systems for the four steps.
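To make this working definition more concrete, here is a minimal sketch of the four steps expressed as code. Everything in it is hypothetical: the participant IDs, the coin-flip randomization, and the three functions are stand-ins for whatever digital systems a real experiment would use to recruit, randomize, treat, and measure.

```python
import random

def run_digital_experiment(participants, deliver_treatment, deliver_control,
                           measure_outcome):
    """Sketch of the four steps of a fully digital experiment.

    All names here are hypothetical stand-ins: `participants` represents
    people recruited through a digital system, and the three callables
    represent digital systems that deliver a condition and log an outcome.
    """
    results = []
    for person in participants:                        # step 1: recruit digitally
        arm = random.choice(["treatment", "control"])  # step 2: randomize
        if arm == "treatment":
            deliver_treatment(person)                  # step 3: deliver the treatment
        else:
            deliver_control(person)
        outcome = measure_outcome(person)              # step 4: measure the outcome
        results.append({"id": person, "arm": arm, "outcome": outcome})
    return results

# Toy usage with do-nothing treatments and a random outcome, just to show the flow.
random.seed(0)
data = run_digital_experiment(
    participants=range(10),
    deliver_treatment=lambda p: None,
    deliver_control=lambda p: None,
    measure_outcome=lambda p: random.random(),
)
```

A partially digital experiment would swap analog procedures into one or more of these four steps, which is all the analog-digital continuum amounts to in practice.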
Critically, the opportunities to run digital experiments are not just online. Researchers can run partially digital experiments by using digital devices in the physical world in order to deliver treatments or measure outcomes. For example, researchers could use smart phones to deliver treatments or sensors in the built environment to measure outcomes. In fact, as we will see later in this chapter, researchers have already used home power meters to measure outcomes in experiments about social norms and energy consumption involving 8.5 million households (Allcott 2015). As digital devices become increasingly integrated into people’s lives and sensors become integrated into the built environment, these opportunities to run partially digital experiments in the physical world will increase dramatically. In other words, digital experiments are not just online experiments.
Digital systems create new possibilities for experiments everywhere along the lab-field continuum. In pure lab experiments, for example, researchers can use digital systems for finer measurement of participants’ behavior; one example of this type of improved measurement is eye-tracking equipment, which provides precise and continuous measures of gaze location. The digital age also creates the possibility to run lab-like experiments online. For example, researchers have rapidly adopted Amazon Mechanical Turk (MTurk) to recruit participants for online experiments (Figure 4.2). MTurk matches “employers” who have tasks that need to be completed with “workers” who wish to complete those tasks for money. Unlike traditional labor markets, however, the tasks involved usually only require a few minutes to complete and the entire interaction between employer and worker is virtual. Because MTurk mimics aspects of traditional lab experiments—paying people to complete tasks that they would not do for free—it is naturally suited for certain types of experiments. Essentially, MTurk has created the infrastructure for managing a pool of participants—recruiting and paying people—and researchers have taken advantage of that infrastructure to tap into an always-available pool of participants.
Digital systems create even more possibilities for field-like experiments. Digital field experiments can offer tight control and process data to understand possible mechanisms (like lab experiments) and more diverse participants making real decisions in a natural environment (like field experiments). In addition to this combination of good characteristics of earlier experiments, digital field experiments also offer three opportunities that were difficult to realize in analog lab and field experiments.
First, whereas most analog lab and field experiments have hundreds of participants, digital field experiments can have millions of participants. This change in scale is because some digital experiments can produce data at zero variable cost. That is, once researchers have created an experimental infrastructure, increasing the number of participants typically does not increase the cost. Increasing the number of participants by a factor of 100 or more is not just a quantitative change; it is a qualitative change, because it enables researchers to learn different things from experiments (e.g., heterogeneity of treatment effects) and run entirely different experimental designs (e.g., large group experiments). This point is so important that I’ll return to it towards the end of the chapter when I offer advice about creating digital experiments.
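One way to see why a 100-fold increase in participants is more than a quantitative change is that the precision of a difference-in-means estimate grows roughly with the square root of the sample size, so 100 times more participants buys about 10 times more precision, which is what makes subgroup analyses feasible at all. Here is a small simulation illustrating that scaling; the effect size and noise level are invented for the example:

```python
import random
import statistics

def difference_in_means(n, effect=0.2, sd=1.0):
    """One simulated two-arm experiment with n participants (invented numbers)."""
    treated = [random.gauss(effect, sd) for _ in range(n // 2)]
    control = [random.gauss(0.0, sd) for _ in range(n // 2)]
    return statistics.mean(treated) - statistics.mean(control)

random.seed(0)
for n in (200, 20_000):  # a lab-scale and a digital-scale experiment
    estimates = [difference_in_means(n) for _ in range(500)]
    # The spread of the estimate shrinks like 1/sqrt(n): 100 times the
    # participants yields roughly 10 times the precision.
    print(f"n = {n:>6}: standard error of the estimate is about "
          f"{statistics.stdev(estimates):.3f}")
```

With these made-up numbers, the standard error falls from roughly 0.14 at n = 200 to roughly 0.014 at n = 20,000, small enough to estimate the effect separately within many subgroups.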
Second, whereas most analog lab and field experiments treat participants as indistinguishable widgets, digital field experiments often use background information about participants in the design and analysis stages of the research. This background information, which is called pre-treatment information, is often available in digital experiments because they take place in fully measured environments. For example, a researcher at Facebook has much more pre-treatment information than a researcher designing a standard lab experiment with undergraduates. This pre-treatment information enables more efficient experimental designs—such as blocking (Higgins, Sävje, and Sekhon 2016) and targeted recruitment of participants (Eckles, Kizilcec, and Bakshy 2016)—and more insightful analysis—such as estimation of heterogeneity of treatment effects (Athey and Imbens 2016a) and covariate adjustment for improved precision (Bloniarz et al. 2016).
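As a concrete illustration of one of these designs, here is a sketch of block randomization on a single pre-treatment covariate. The participants and the covariate (a prior activity level) are invented for the example; the point is simply that randomizing within blocks guarantees the two arms are balanced on that covariate:

```python
import random
from collections import defaultdict

def block_randomize(participants, block_of):
    """Assign treatment/control separately within each block so that the
    two arms are balanced on the blocking covariate."""
    blocks = defaultdict(list)
    for p in participants:
        blocks[block_of(p)].append(p)
    assignment = {}
    for members in blocks.values():
        random.shuffle(members)
        half = len(members) // 2
        for p in members[:half]:
            assignment[p] = "treatment"
        for p in members[half:]:
            assignment[p] = "control"
    return assignment

# Invented pre-treatment data: each participant's activity level before the study.
random.seed(1)
prior_activity = {p: random.choice(["low", "high"]) for p in range(20)}
assignment = block_randomize(prior_activity, block_of=lambda p: prior_activity[p])
```

The same pre-treatment information used here for design can be reused at the analysis stage, for example by estimating effects separately within each block.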
Third, whereas many analog lab and field experiments deliver treatments and measure outcomes in a relatively compressed amount of time, some digital field experiments involve treatments that can be delivered over time and the effects can also be measured over time. For example, in Restivo and van de Rijt’s experiment the outcome was measured daily for 90 days, and one of the experiments I’ll tell you about later in the chapter (Ferraro, Miranda, and Price 2011) tracked outcomes over three years at basically no cost. These three opportunities—size, pre-treatment information, and longitudinal treatment and outcome data—are most common when experiments are run on top of always-on measurement systems (see Chapter 2 for more on always-on measurement systems).
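When outcomes arrive from an always-on measurement system, the treatment effect naturally becomes a trajectory rather than a single number. Here is a sketch with simulated data standing in for 90 days of automatically logged outcomes; the daily outcomes and effect size are invented for the example:

```python
import random
import statistics

random.seed(2)
DAYS, N = 90, 100  # e.g., one measurement per participant per day for 90 days

# Simulated stand-in for automatically logged data: outcomes[arm][day] is
# the list of that day's per-participant outcomes.
outcomes = {
    arm: [
        [random.gauss(0.1 if arm == "treatment" else 0.0, 1.0) for _ in range(N)]
        for _ in range(DAYS)
    ]
    for arm in ("treatment", "control")
}

# One difference-in-means per day: the effect estimated as a trajectory,
# which can reveal effects that grow, decay, or persist over time.
daily_effect = [
    statistics.mean(outcomes["treatment"][day])
    - statistics.mean(outcomes["control"][day])
    for day in range(DAYS)
]
```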
While digital field experiments offer many possibilities, they also share some weaknesses with both analog lab and field experiments. For example, experiments cannot be used to study the past, and they can only estimate the effects of treatments that can be manipulated. Also, although experiments are undoubtedly useful to guide policy, the exact guidance they can offer is somewhat limited because of complications such as environmental dependence, compliance problems, and equilibrium effects (Banerjee and Duflo 2009; Deaton 2010). Finally, digital field experiments magnify the ethical concerns created by field experiments. Proponents of field experiments trumpet their ability to unobtrusively and randomly intervene in consequential decisions made by millions of people. These features offer certain scientific advantages, but they can also make field experiments ethically complex (think about it as researchers treating people like “lab rats” on a massive scale). Further, in addition to possible harms to participants, digital field experiments, because of their scale, can also raise concerns about the disruption of working social systems (e.g., concerns about disrupting Wikipedia’s reward system if Restivo and van de Rijt gave too many barnstars).