Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.
Although many big data sources are nonreactive because people are not aware their data are being recorded (section 2.3.3), researchers should not consider behavior in these online systems to be “naturally occurring.” In reality, the digital systems that record behavior are highly engineered to induce specific behaviors, such as clicking on ads or posting content. The way that the goals of system designers introduce patterns into data is called algorithmic confounding. Algorithmic confounding is relatively unknown to social scientists, but it is a major concern among careful data scientists. And, unlike some of the other problems with digital traces, algorithmic confounding is largely invisible.
A relatively simple example of algorithmic confounding is the fact that on Facebook there are an anomalously high number of users with approximately 20 friends, as was discovered by Johan Ugander and colleagues (2011). Scientists analyzing these data without any understanding of how Facebook works could doubtless generate many stories about how 20 is some kind of magical social number. Fortunately, Ugander and his colleagues had a substantial understanding of the process that generated the data, and they knew that Facebook encouraged people with few connections to make more friends until they reached 20. Although Ugander and colleagues don’t say this in their paper, this policy was presumably created by Facebook in order to encourage new users to become more active. Without knowing about the existence of this policy, however, it is easy to draw the wrong conclusion from the data. In other words, the surprisingly high number of people with about 20 friends tells us more about Facebook than about human behavior.
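The quirk Ugander and colleagues noticed is, at its core, a spike in a degree distribution. The toy sketch below, using simulated data rather than anything from Facebook, shows how such a design-induced pile-up might be flagged: a smooth heavy-tailed distribution of friend counts is contaminated with users nudged to stop at exactly 20 friends, and counts far above their neighbors are reported.

```python
import random
from collections import Counter

random.seed(0)

# Simulated friend counts: a smooth heavy-tailed distribution, plus an
# artificial pile-up at 20 created by a (hypothetical) design nudge
# that pushes new users toward 20 friends and then stops.
friend_counts = [min(int(random.expovariate(1 / 50)) + 1, 500) for _ in range(10000)]
friend_counts += [20] * 800  # users who stopped right at the encouraged target

hist = Counter(friend_counts)

# Flag counts far above the average of their neighbors -- the kind of
# quirk a careful researcher would investigate rather than theorize about.
anomalies = []
for k in range(2, 100):
    neighbors = (hist[k - 1] + hist[k + 1]) / 2
    if neighbors > 0 and hist[k] > 3 * neighbors:
        anomalies.append(k)

print(anomalies)  # the spike at 20 stands out
```

Without knowledge of the system's design, nothing in the data itself explains the spike; the detection step only tells the researcher where to start asking questions.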
In this previous example, algorithmic confounding produced a quirky result that a careful researcher might detect and investigate further. However, there is an even trickier version of algorithmic confounding that occurs when designers of online systems are aware of social theories and then bake these theories into the working of their systems. Social scientists call this performativity: when a theory changes the world in such a way that it brings the world more into line with the theory. In the case of performative algorithmic confounding, the confounded nature of the data is very difficult to detect.
One example of a pattern created by performativity is transitivity in online social networks. In the 1970s and 1980s, researchers repeatedly found that if you are friends with both Alice and Bob, then Alice and Bob are more likely to be friends with each other than two randomly chosen people are. This very same pattern was found in the social graph on Facebook (Ugander et al. 2011). Thus, one might conclude that patterns of friendship on Facebook replicate patterns of offline friendship, at least in terms of transitivity. However, the magnitude of transitivity in the Facebook social graph is partially driven by algorithmic confounding. That is, data scientists at Facebook knew of the empirical and theoretical research about transitivity and then baked it into how Facebook works. Facebook has a “People You May Know” feature that suggests new friends, and one way that Facebook decides whom to suggest to you is transitivity: Facebook is more likely to suggest that you become friends with the friends of your friends. This feature thus has the effect of increasing transitivity in the Facebook social graph; in other words, the theory of transitivity brings the world into line with its own predictions (Zignani et al. 2014; Healy 2015). Thus, when big data sources appear to reproduce predictions of social theory, we must be sure that the theory itself was not baked into how the system worked.
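The mechanics of this feedback loop can be sketched in a few lines. The sketch below defines a transitivity measure (the fraction of two-paths a–b–c that close into triangles) and a friends-of-friends suggestion rule that ranks non-friends by mutual friends, in the spirit of “People You May Know.” The names and graph are invented, and the actual Facebook algorithm is proprietary; this is only an illustration of how such a rule manufactures the very pattern it is based on.

```python
from itertools import combinations

# A toy undirected friendship graph (invented names).
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol", "eve"},
    "eve":   {"dave"},
}

def transitivity(graph):
    """Fraction of two-paths a-b-c that are closed into triangles."""
    closed = total = 0
    for b in graph:
        for a, c in combinations(graph[b], 2):
            total += 1
            if c in graph[a]:
                closed += 1
    return closed / total if total else 0.0

def suggest_friends(graph, user):
    """Rank non-friends by number of mutual friends: a transitivity-based
    suggestion rule in the spirit of 'People You May Know'."""
    scores = {}
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user and fof not in graph[user]:
                scores[fof] = scores.get(fof, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(transitivity(friends))              # → 0.0: no triangles yet
print(suggest_friends(friends, "alice"))  # → ['dave'] (two mutual friends)

# If alice accepts the suggestion, a triangle closes and measured
# transitivity rises: the suggestion rule manufactures the pattern.
friends["alice"].add("dave")
friends["dave"].add("alice")
print(round(transitivity(friends), 3))    # → 0.545
```

A researcher who measured only the final graph would see high transitivity and could not tell, from the data alone, how much of it came from offline social processes and how much from the suggestion rule.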
Rather than thinking of big data sources as observing people in a natural setting, a more apt metaphor is observing people in a casino. Casinos are highly engineered environments designed to induce certain behaviors, and a researcher would never expect behavior in a casino to provide an unfettered window into human behavior. Of course, you could learn something about human behavior by studying people in casinos, but if you ignored the fact that the data were being created in a casino, you might draw some bad conclusions.
Unfortunately, dealing with algorithmic confounding is particularly difficult because many features of online systems are proprietary, poorly documented, and constantly changing. For example, as I’ll explain later in this chapter, algorithmic confounding was one possible explanation for the gradual breakdown of Google Flu Trends (section 2.4.2), but this claim was hard to assess because the inner workings of Google’s search algorithm are proprietary. The dynamic nature of algorithmic confounding is one form of system drift. Algorithmic confounding means that we should be cautious about any claim regarding human behavior that comes from a single digital system, no matter how big.