Mathematical notes

In this appendix, I will describe some of the ideas from the chapter in a slightly more mathematical form. The goal here is to help you get comfortable with the notation and mathematical framework used by survey researchers so that you can transition to some of the more technical material written on these topics. I will start by introducing probability sampling, then move to probability sampling with nonresponse, and finally, non-probability sampling.

Probability sampling

As a running example, let’s consider the goal of estimating the unemployment rate in the United States. Let $U = \{1, \ldots, k, \ldots, N\}$ be the target population and let $y_k$ be the value of the outcome variable for person $k$. In this example, $y_k$ is whether person $k$ is unemployed. Finally, let $F = \{1, \ldots, k, \ldots, N\}$ be the frame population, which for the sake of simplicity is assumed to be the same as the target population.

A basic sampling design is simple random sampling without replacement. In this case, each person is equally likely to be included in the sample $s = \{1, \ldots, i, \ldots, n\}$. When the data are collected with this sampling design, a researcher can estimate the population unemployment rate with the sample mean:

$\hat{\bar{y}} = \frac{\sum_{i \in s} y_i}{n} \qquad(3.1)$

where $\bar{y}$ is the unemployment rate in the population and $\hat{\bar{y}}$ is the estimate of the unemployment rate (the $\hat{ }$ is commonly used to indicate an estimator).

In reality, researchers rarely use simple random sampling without replacement. For a variety of reasons (one of which I’ll describe in a moment), researchers often create samples with unequal probabilities of inclusion. For example, researchers might select people in Florida with higher probability of inclusion than people in California. In this case, the sample mean (eq. 3.1) might not be a good estimator. Instead, when there are unequal probabilities of inclusion, researchers use

$\hat{\bar{y}} = \frac{1}{N} \sum_{i \in s} \frac{y_i}{\pi_i} \qquad(3.2)$

where $\hat{\bar{y}}$ is the estimate of the unemployment rate and $\pi_i$ is person $i$’s probability of inclusion. Following standard practice, I’ll call the estimator in eq. 3.2 the Horvitz-Thompson estimator. The Horvitz-Thompson estimator is extremely useful because it leads to unbiased estimates for any probability sampling design (Horvitz and Thompson 1952). Because the Horvitz-Thompson estimator comes up so frequently, it is helpful to notice that it can be rewritten as

$\hat{\bar{y}} = \frac{1}{N} \sum_{i \in s} w_i y_i \qquad(3.3)$

where $w_i = 1 / \pi_i$. As eq. 3.3 reveals, the Horvitz-Thompson estimator is a weighted sample mean where the weights are inversely related to the probability of selection. In other words, the less likely a person is to be included in the sample, the more weight that person should get in the estimate.
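To make eqs. 3.2 and 3.3 concrete, here is a minimal Python sketch. The function names and the tiny four-person sample are hypothetical and chosen only for illustration; the population size $N = 100$ and the inclusion probabilities are assumed to be known from the sampling design.

```python
def horvitz_thompson_mean(y, pi, N):
    """Eq. 3.2: (1/N) * sum over the sample of y_i / pi_i."""
    return sum(y_i / pi_i for y_i, pi_i in zip(y, pi)) / N


def weighted_mean(y, w, N):
    """Eq. 3.3: the same estimator written with weights w_i = 1 / pi_i."""
    return sum(w_i * y_i for w_i, y_i in zip(w, y)) / N


# Hypothetical sample of four people from a population of N = 100.
N = 100
y = [1, 1, 0, 1]                # 1 = unemployed, 0 = not unemployed
pi = [0.1, 0.1, 0.025, 0.025]   # inclusion probabilities (assumed known)
w = [1 / pi_i for pi_i in pi]   # weights of about 10, 10, 40, 40

ht = horvitz_thompson_mean(y, pi, N)   # about 0.6
same = weighted_mean(y, w, N)          # same value as ht, up to rounding
unweighted = sum(y) / len(y)           # 0.75, the plain sample mean
```

In this made-up sample, each low-probability respondent stands in for about 40 people in the population and each high-probability respondent for about 10, which is why the weighted estimate differs from the plain sample mean.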

As described earlier, researchers often sample people with unequal probabilities of inclusion. One example of a design that can lead to unequal probabilities of inclusion is stratified sampling, which is important to understand because it is closely related to the estimation procedure called post-stratification. In stratified sampling, a researcher splits the target population into $H$ mutually exclusive and exhaustive groups. These groups are called strata and are indicated as $U_1, \ldots, U_h, \ldots, U_H$. In this example, the strata are states. The sizes of the groups are indicated as $N_1, \ldots, N_h, \ldots, N_H$. A researcher might want to use stratified sampling in order to make sure that she has enough people in each state to make state-level estimates of unemployment.

Once the population has been split up into strata, assume that the researcher selects a simple random sample without replacement of size $n_h$, independently from each stratum. Further, assume that everyone selected in the sample becomes a respondent (I’ll handle nonresponse in the next section). In this case, the probability of inclusion is

$\pi_i = \frac{n_h}{N_h} \mbox{ for all } i \in h \qquad(3.4)$

Because these probabilities can vary from person to person, when making an estimate from this sampling design, researchers need to weight each respondent by the inverse of their probability of inclusion using the Horvitz-Thompson estimator (eq. 3.2).
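The following Python sketch illustrates why the weighting matters under stratified sampling. It uses a made-up population of ten people in two strata (all values and names are hypothetical), enumerates every possible stratified sample, and averages the estimates over those samples. Because every such sample is equally likely under this design, the averages are the exact expected values of the two estimators.

```python
from itertools import combinations, product

# Hypothetical tiny population split into two strata (1 = unemployed).
strata = {
    "A": [1, 1, 0, 0],         # N_A = 4
    "B": [0, 0, 0, 0, 0, 1],   # N_B = 6
}
sample_sizes = {"A": 2, "B": 1}  # n_h drawn without replacement in each stratum
N = sum(len(values) for values in strata.values())


def ht_estimate(samples):
    """Eq. 3.2 with pi_i = n_h / N_h for everyone in stratum h (eq. 3.4)."""
    total = 0.0
    for h, y_sample in samples.items():
        pi_h = sample_sizes[h] / len(strata[h])
        total += sum(y_i / pi_h for y_i in y_sample)
    return total / N


def unweighted_mean(samples):
    """Plain sample mean (eq. 3.1), ignoring the unequal probabilities."""
    values = [y_i for y_sample in samples.values() for y_i in y_sample]
    return sum(values) / len(values)


# Enumerate every possible stratified sample; each one is equally likely.
names = list(strata)
per_stratum = [list(combinations(strata[h], sample_sizes[h])) for h in names]
all_samples = [dict(zip(names, combo)) for combo in product(*per_stratum)]

true_mean = sum(sum(values) for values in strata.values()) / N
expected_ht = sum(ht_estimate(s) for s in all_samples) / len(all_samples)
expected_unweighted = sum(unweighted_mean(s) for s in all_samples) / len(all_samples)
```

In this toy population the Horvitz-Thompson estimates average out to the true rate of 0.3, whereas the unweighted sample means average to about 0.39, because stratum A (where unemployment is higher) is sampled at a higher rate.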

Even though the Horvitz-Thompson estimator is unbiased, researchers can produce more accurate (i.e., lower variance) estimates by combining the sample with auxiliary information. Some people find it surprising that this is true even when there is perfectly executed probability sampling. These techniques using auxiliary information are particularly important because, as I will show later, auxiliary information is critical for making estimates from probability samples with nonresponse and from non-probability samples.

One common technique for utilizing auxiliary information is post-stratification. Imagine, for example, that a researcher knows the number of men and women in each of the 50 states; we can denote these group sizes as $N_1, N_2, \ldots, N_{100}$. To combine this auxiliary information with the sample, the researcher can split the sample into $H$ groups (in this case 100), make an estimate for each group, and then create a weighted average of these group means:

$\hat{\bar{y}}_{post} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{\bar{y}}_h \qquad(3.5)$

Roughly, the estimator in eq. 3.5 is likely to be more accurate because it uses the known population information (the $N_h$) to correct estimates if an unbalanced sample happens to be selected. One way to think about it is that post-stratification is like approximating stratification after the data have already been collected.
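As a rough illustration of eq. 3.5, the sketch below uses two groups rather than 100. The group sizes, the sample, and the function name are all hypothetical, and the code assumes that every group has at least one respondent.

```python
def post_stratified_mean(sample_by_group, group_sizes):
    """Eq. 3.5: sum over groups of (N_h / N) * (sample mean in group h)."""
    N = sum(group_sizes.values())
    estimate = 0.0
    for h, y_sample in sample_by_group.items():
        group_mean = sum(y_sample) / len(y_sample)  # assumes a non-empty group
        estimate += group_sizes[h] / N * group_mean
    return estimate


# Known population counts (the auxiliary information); hypothetical values.
group_sizes = {"men": 500, "women": 500}

# An unbalanced sample: 30 men (6 unemployed) and 70 women (7 unemployed).
sample_by_group = {
    "men": [1] * 6 + [0] * 24,
    "women": [1] * 7 + [0] * 63,
}

all_y = [y_i for y_sample in sample_by_group.values() for y_i in y_sample]
plain_mean = sum(all_y) / len(all_y)                             # 13 / 100
post_mean = post_stratified_mean(sample_by_group, group_sizes)   # 0.5 * 0.2 + 0.5 * 0.1
```

Here the sample over-represents women, who have the lower unemployment rate in this made-up example, so the plain mean (0.13) sits below the post-stratified estimate (about 0.15), which reweights the two group means to their known population shares.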

In conclusion, this section has described a few sampling designs: simple random sampling without replacement, sampling with unequal probability, and stratified sampling. It has also described two main ideas about estimation: the Horvitz-Thompson estimator and post-stratification. For a more formal definition of probability sampling designs, see chapter 2 of Särndal, Swensson, and Wretman (2003). For a more formal and complete treatment of stratified sampling, see section 3.7 of Särndal, Swensson, and Wretman (2003). For a technical description of the properties of the Horvitz-Thompson estimator, see Horvitz and Thompson (1952), Overton and Stehman (1995), or section 2.8 of Särndal, Swensson, and Wretman (2003). For a more formal treatment of post-stratification, see Holt and Smith (1979), Smith (1991), Little (1993), or section 7.6 of Särndal, Swensson, and Wretman (2003).

Probability sampling with nonresponse

Almost all real surveys have nonresponse; that is, not everyone in the sample population answers every question. There are two main kinds of nonresponse: item nonresponse and unit nonresponse. In item nonresponse, some respondents don’t answer some items (e.g., sometimes respondents don’t want to answer questions that they consider sensitive). In unit nonresponse, some people who are selected for the sample population don’t respond to the survey at all. The two most common reasons for unit nonresponse are that the sampled person cannot be contacted and that the sampled person is contacted but refuses to participate. In this section, I will focus on unit nonresponse; readers interested in item nonresponse should see Little and Rubin (2002).

Researchers often think about surveys with unit nonresponse as a two-stage sampling process. In the first stage, the researcher selects a sample $s$ such that each person has a probability of inclusion $\pi_i$ (where $0 < \pi_i \leq 1$). Then, in the second stage, people who are selected into the sample respond with probability $\phi_i$ (where $0 < \phi_i \leq 1$). This two-stage process results in the final set of respondents $r$. An important difference between these two stages is that researchers control the process of selecting the sample, but they don’t control which of those sampled people become respondents. Putting these two processes together, the probability that someone will be a respondent is

$pr(i \in r) = \pi_i \phi_i \qquad(3.6)$

For the sake of simplicity, I’ll consider the case where the original sample design is simple random sampling without replacement. If a researcher selects a sample of size $n_s$ that yields $n_r$ respondents, and if the researcher ignores nonresponse and uses the mean of the respondents, then the bias of the estimate will be:

$\mbox{bias of sample mean} = \frac{cor(\phi, y) S(y) S(\phi)}{\bar{\phi}} \qquad(3.7)$

where $cor(\phi, y)$ is the population correlation between the response propensity and the outcome (e.g., unemployment status), $S(y)$ is the population standard deviation of the outcome (e.g., unemployment status), $S(\phi)$ is the population standard deviation of the response propensity, and $\bar{\phi}$ is the population mean response propensity (Bethlehem, Cobben, and Schouten 2011, sec. 2.2.4).
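The sketch below evaluates eq. 3.7 for a made-up population in which unemployed people are twice as likely to respond. Two things here are my assumptions rather than part of the text: the moments are computed as population (divide-by-$N$) quantities, and the comparison value is the propensity-weighted mean $\sum_k \phi_k y_k / \sum_k \phi_k$, a large-sample approximation to the expected mean of the respondents. The function name and all numbers are hypothetical.

```python
from statistics import fmean, pstdev


def nonresponse_bias(phi, y):
    """Eq. 3.7: cor(phi, y) * S(y) * S(phi) / mean(phi), using divide-by-N moments."""
    phi_bar, y_bar = fmean(phi), fmean(y)
    s_phi, s_y = pstdev(phi), pstdev(y)
    if s_phi == 0 or s_y == 0:
        return 0.0  # no variation in phi or in y: the bias term vanishes
    cov = fmean([(p - phi_bar) * (v - y_bar) for p, v in zip(phi, y)])
    cor = cov / (s_phi * s_y)
    return cor * s_y * s_phi / phi_bar


# Hypothetical population: 100 unemployed people who respond with probability
# 0.6 and 900 other people who respond with probability 0.3.
y = [1] * 100 + [0] * 900
phi = [0.6] * 100 + [0.3] * 900

bias_formula = nonresponse_bias(phi, y)

# Approximate expected respondent mean (assumed): propensity-weighted mean of y.
approx_respondent_mean = sum(p * v for p, v in zip(phi, y)) / sum(phi)
bias_direct = approx_respondent_mean - fmean(y)
```

In this example the true unemployment rate is 0.10, the respondents' rate is about 0.18, and both ways of computing the bias give roughly 0.08.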

Eq. 3.7 shows that nonresponse will not introduce bias if any of the following conditions are met:

There is no variation in unemployment status $(S(y) = 0)$.
There is no variation in response propensities $(S(\phi) = 0)$.
There is no correlation between response propensity and unemployment status $(cor(\phi, y) = 0)$.

Unfortunately, none of these conditions seems likely. It seems implausible that there will be no variation in unemployment status or that there will be no variation in response propensities. Thus, the key term in eq. 3.7 is the correlation: $cor(\phi, y)$. For example, if people who are unemployed are more likely to respond, then the estimated unemployment rate will be biased upward.

The trick to making estimates when there is nonresponse is to use auxiliary information. For example, one way in which you can use auxiliary information is post-stratification (recall eq. 3.5 from above). It turns out that the bias of the post-stratification estimator is:

$bias(\hat{\bar{y}}_{post}) = \frac{1}{N} \sum_{h=1}^H \frac{N_h cor(\phi, y)^{(h)} S(y)^{(h)} S(\phi)^{(h)}}{\bar{\phi}^{(h)}} \qquad(3.8)$

where $cor(\phi, y)^{(h)}$, $S(y)^{(h)}$, $S(\phi)^{(h)}$, and $\bar{\phi}^{(h)}$ are defined as above but restricted to people in group $h$ (Bethlehem, Cobben, and Schouten 2011, sec. 8.2.1). Thus, the overall bias will be small if the bias in each post-stratification group is small. There are two ways that I like to think about making the bias small in each post-stratification group. First, you want to try to form homogeneous groups where there is little variation in response propensity ($S(\phi)^{(h)} \approx 0$) and the outcome ($S(y)^{(h)} \approx 0$). Second, you want to form groups where the people that you see are like the people that you don’t see ($cor(\phi, y)^{(h)} \approx 0$). Comparing eq. 3.7 and eq. 3.8 helps clarify when post-stratification can reduce the bias caused by nonresponse.
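To see how eqs. 3.7 and 3.8 compare, consider the following sketch with two made-up groups. Response propensity differs between the groups but is constant within each one, so the within-group bias terms vanish even though the unadjusted bias does not. The group labels, the numbers, and the function names are hypothetical, and the code uses the identity $cor(\phi, y) S(y) S(\phi) = cov(\phi, y)$ with divide-by-$N$ moments, which is my simplification.

```python
from statistics import fmean


def bias_term(phi, y):
    """cov(phi, y) / mean(phi), i.e. cor * S(y) * S(phi) / mean(phi) as in eq. 3.7."""
    phi_bar, y_bar = fmean(phi), fmean(y)
    cov = fmean([(p - phi_bar) * (v - y_bar) for p, v in zip(phi, y)])
    return cov / phi_bar


def post_stratified_bias(groups):
    """Eq. 3.8: (1/N) * sum over groups of N_h * (within-group bias term)."""
    N = sum(len(y) for _, y in groups.values())
    return sum(len(y) * bias_term(phi, y) for phi, y in groups.values()) / N


# Hypothetical population: two groups of 500, each stored as (phi, y).
groups = {
    "younger": ([0.2] * 500, [1] * 80 + [0] * 420),  # 16% unemployed, rarely respond
    "older": ([0.6] * 500, [1] * 20 + [0] * 480),    # 4% unemployed, often respond
}

all_phi = [p for phi, _ in groups.values() for p in phi]
all_y = [v for _, y in groups.values() for v in y]

unadjusted_bias = bias_term(all_phi, all_y)     # eq. 3.7 for the pooled population
adjusted_bias = post_stratified_bias(groups)    # eq. 3.8 after post-stratification
```

In this constructed case the unadjusted bias is about -0.03 (the estimate would be roughly 0.07 against a true rate of 0.10), while the post-stratified bias is zero up to rounding, because within each group the people who respond are like the people who don't.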

In conclusion, this section has provided a model for probability sampling with nonresponse and shown the bias that nonresponse can introduce both without and with post-stratification adjustments. Bethlehem (1988) offers a derivation of the bias caused by nonresponse for more general sampling designs. For more on using post-stratification to adjust for nonresponse, see Smith (1991) and Gelman and Carlin (2002). Post-stratification is part of a more general family of techniques called calibration estimators; see Zhang (2000) for an article-length treatment and Särndal and Lundström (2005) for a book-length treatment. For more on other weighting methods for adjusting for nonresponse, see Kalton and Flores-Cervantes (2003), Brick (2013), and Särndal and Lundström (2005).

Non-probability sampling

Non-probability sampling includes a huge variety of designs (Baker et al. 2013). Focusing specifically on the sample of Xbox users by Wang and colleagues (W. Wang et al. 2015), you can think of that kind of sample as one where the key part of the sampling design is not the $\pi_i$ (the researcher-driven probability of inclusion) but the $\phi_i$ (the respondent-driven response propensities). Naturally, this is not ideal because the $\phi_i$ are unknown. But, as Wang and colleagues showed, this kind of opt-in sample—even from a sampling frame with enormous coverage error—need not be catastrophic if the researcher has good auxiliary information and a good statistical model to account for these problems.

Bethlehem (2010) extends many of the above derivations about post-stratification to include both nonresponse and coverage errors. In addition to post-stratification, other techniques for working with non-probability samples (and with probability samples that have coverage errors and nonresponse) include sample matching (Ansolabehere and Rivers 2013; ???), propensity score weighting (Lee 2006; Schonlau et al. 2009), and calibration (Lee and Valliant 2009). One common theme among these techniques is the use of auxiliary information.