Behavior in found data is not natural; it is driven by the engineering goals of the systems.
Although many found data sources are non-reactive because people are not aware their data are being recorded (Section 2.3.1.3), researchers should not consider behavior in these online systems to be “naturally occurring” or “pure.” In reality, the digital systems that record behavior are highly engineered to induce specific behaviors such as clicking on ads or posting content. The way that the goals of system designers introduce patterns into data is called algorithmic confounding. Algorithmic confounding is relatively unknown to social scientists, but it is a major concern among careful data scientists. And, unlike some of the other problems with digital traces, algorithmic confounding is largely invisible.
A relatively simple example of algorithmic confounding is the fact that on Facebook there are an anomalously high number of users with approximately 20 friends (Ugander et al. 2011). Scientists analyzing these data without any understanding of how Facebook works could doubtless generate many stories about how 20 is some kind of magical social number. However, Ugander and his colleagues had a substantial understanding of the process that generated the data, and they knew that Facebook encouraged people with few connections to make more friends until they reached 20 friends. Although Ugander and colleagues don’t say this in the paper, this policy was presumably created by Facebook in order to encourage new users to become more active. Without knowing about the existence of this policy, however, it is easy to draw the wrong conclusion from the data. In other words, the surprisingly high number of people with about 20 friends tells us more about Facebook than about human behavior.
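To make this concrete, here is a minimal sketch in Python of how such an anomaly would show up in a friend-count distribution. The data are simulated under an assumed nudge rule, not drawn from Facebook, and the threshold and probabilities are invented for illustration; the point is only that an engineered policy leaves a visible spike that a naive analysis might mistake for a natural pattern.

```python
# Minimal sketch: spotting an anomalous spike in a friend-count distribution.
# The data here are simulated, not real Facebook data; the "nudge" rule is an
# assumption made for illustration.
import collections
import random

random.seed(42)

# Simulate friend counts: a heavy-tailed "organic" distribution, plus a
# hypothetical system nudge that pushes many low-degree users to exactly 20.
friend_counts = []
for _ in range(10_000):
    organic = min(int(random.expovariate(1 / 80)), 1000)
    if organic < 20 and random.random() < 0.5:
        organic = 20  # the engineered nudge: "make friends until you have 20"
    friend_counts.append(organic)

histogram = collections.Counter(friend_counts)

# Compare each count with the average of its neighbors; a large ratio flags a spike.
for k in range(2, 60):
    neighbors = (histogram[k - 1] + histogram[k + 1]) / 2 or 1
    if histogram[k] / neighbors > 3:
        print(f"Anomalous spike at {k} friends: "
              f"{histogram[k]} users vs ~{neighbors:.0f} at neighboring counts")
```

Running this flags only the spike at 20 friends, which exists solely because of the simulated policy, not because of anything special about the number 20 in social life.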
More pernicious than the previous example, where algorithmic confounding produced a quirky result that a careful researcher might investigate further, is a trickier version of algorithmic confounding that occurs when the designers of online systems are aware of social theories and then bake these theories into the workings of their systems. Social scientists call this performativity: when theories change the world in such a way that they bring the world more into line with the theory. In cases of performative algorithmic confounding, the confounded nature of the data is likely invisible.
One example of a pattern created by performativity is transitivity in online social networks. In the 1970s and 1980s, researchers repeatedly found that if you are friends with Alice and you are friends with Bob, then Alice and Bob are more likely to be friends with each other than two randomly chosen people. This very same pattern was found in the social graph on Facebook (Ugander et al. 2011). Thus, one might conclude that patterns of friendship on Facebook replicate patterns of offline friendships, at least in terms of transitivity. However, the magnitude of transitivity in the Facebook social graph is partially driven by algorithmic confounding. That is, data scientists at Facebook knew of the empirical and theoretical research about transitivity and then baked it into how Facebook works. Facebook has a “People You May Know” feature that suggests new friends, and one way that Facebook decides who to suggest to you is transitivity. That is, Facebook is more likely to suggest that you become friends with the friends of your friends. This feature thus has the effect of increasing transitivity in the Facebook social graph; in other words, the theory of transitivity brings the world into line with the predictions of the theory (Healy 2015). Thus, when big data sources appear to reproduce predictions of social theory, we must be sure that the theory itself was not baked into how the system works.
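To see the mechanism, here is a minimal sketch in Python (using networkx) under an assumed, simplified recommendation rule; it is not Facebook's actual "People You May Know" algorithm. It starts from a random network with almost no clustering and shows that repeatedly suggesting friends of friends drives transitivity up.

```python
# Minimal sketch: a friend-of-friend recommendation rule mechanically raises
# transitivity. The recommendation logic is a stand-in for illustration only.
import random
import networkx as nx

random.seed(0)

# Start from a random network, which has very little clustering.
G = nx.gnp_random_graph(n=500, p=0.02, seed=0)
print(f"Transitivity before recommendations: {nx.transitivity(G):.3f}")

# Each round, every node is offered a friend of one of its friends and
# "accepts" the suggestion with some probability.
for _ in range(5):
    new_edges = []
    for u in G.nodes():
        friends = list(G.neighbors(u))
        if not friends:
            continue
        friend = random.choice(friends)
        suggestions = [w for w in G.neighbors(friend)
                       if w != u and not G.has_edge(u, w)]
        if suggestions and random.random() < 0.5:
            new_edges.append((u, random.choice(suggestions)))
    G.add_edges_from(new_edges)

print(f"Transitivity after recommendations:  {nx.transitivity(G):.3f}")
```

In this simulation the increase in transitivity comes entirely from the recommendation rule, not from any social preference of the simulated people, which is exactly the kind of pattern that performative algorithmic confounding can produce in real systems.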
Rather than thinking of big data sources as observing people in a natural setting, a more apt metaphor is observing people in a casino. Casinos are highly engineered environments designed to induce certain behaviors, and a researcher would never expect that behavior in a casino would provide an unfettered window into human behavior. Of course, we could learn something about human behavior by studying people in casinos—in fact, a casino might be an ideal setting for studying the relationship between alcohol consumption and risk preferences—but if we ignored the fact that the data were being created in a casino, we might draw some bad conclusions.
Unfortunately, dealing with algorithmic confounding is particularly difficult because many features of online systems are proprietary, poorly documented, and constantly changing. For example, as I’ll explain later in this chapter, algorithmic confounding was one possible explanation for the gradual breakdown of Google Flu Trends (Section 2.4.2), but this claim was hard to assess because the inner workings of Google’s search algorithm are proprietary. The dynamic nature of algorithmic confounding is one form of system drift. Algorithmic confounding means that we should be cautious about any claim about human behavior that comes from a single digital system, no matter how big.