Population drift, behavioral drift, and system drift make it hard to use big data sources to study long-term trends.
One of the great advantages of many big data sources is that they collect data over time. Social scientists call this kind of over-time data longitudinal data. And, naturally, longitudinal data are very important for studying change. In order to reliably measure change, however, the measurement system itself must be stable. In the words of sociologist Otis Dudley Duncan, “if you want to measure change, don’t change the measure” (Fischer 2011).
Unfortunately, many big data systems—especially business systems—are changing all the time, a process that I’ll call drift. In particular, these systems change in three main ways: population drift (change in who is using them), behavioral drift (change in how people are using them), and system drift (change in the system itself). The three sources of drift mean that any pattern in a big data source could be caused by an important change in the world, or it could be caused by some form of drift.
The first source of drift—population drift—is caused by changes in who is using the system, and these changes can happen on both short and long timescales. For example, during the US Presidential election of 2012 the proportion of tweets about politics that were written by women fluctuated from day to day (Diaz et al. 2016). Thus, what might appear to be a change in the mood of the Twitter-verse might actually just be a change in who is talking at any moment. In addition to these short-term fluctuations, there has also been a long-term trend of certain demographic groups adopting and abandoning Twitter.
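The composition effect described above can be made concrete with a small sketch. It assumes two hypothetical groups whose average sentiment never changes, and it varies only how many tweets each group writes per day. All names and numbers here are invented for illustration and are not taken from Diaz et al. (2016). The raw daily average moves even though nobody's mood has changed, while an average reweighted to a fixed reference composition stays flat.

```python
# Toy illustration of population drift. All values are invented for this sketch.

# Assumption: each group's average sentiment is the same on every day.
daily_data = [
    {"counts": {"group_a": 300, "group_b": 700}, "means": {"group_a": 0.6, "group_b": 0.2}},
    {"counts": {"group_a": 500, "group_b": 500}, "means": {"group_a": 0.6, "group_b": 0.2}},
    {"counts": {"group_a": 400, "group_b": 600}, "means": {"group_a": 0.6, "group_b": 0.2}},
]

# Hypothetical fixed composition used to make days comparable.
REFERENCE_SHARE = {"group_a": 0.4, "group_b": 0.6}


def raw_mean(day):
    """Average sentiment over all tweets posted that day."""
    total = sum(day["counts"].values())
    weighted = sum(day["means"][g] * n for g, n in day["counts"].items())
    return weighted / total


def standardized_mean(day, reference_share):
    """Average sentiment if the day's tweets had the reference composition."""
    return sum(day["means"][g] * share for g, share in reference_share.items())


raw_series = [raw_mean(d) for d in daily_data]
adjusted_series = [standardized_mean(d, REFERENCE_SHARE) for d in daily_data]

for i, (raw, adj) in enumerate(zip(raw_series, adjusted_series), start=1):
    print(f"day {i}: raw={raw:.2f} adjusted={adj:.2f}")
```

Reweighting of this kind only helps when the relevant group memberships are observed, which is often not the case in big data sources.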
In addition to changes in who is using a system, there are also changes in how the system is used, which I call behavioral drift. For example, during the 2013 Occupy Gezi protests in Turkey, protesters changed their use of hashtags as the protest evolved. Here’s how Zeynep Tufekci (2014) described the behavioral drift, which she was able to detect because she was observing behavior on Twitter and in person:
“What had happened was that as soon as the protest became the dominant story, large numbers of people … stopped using the hashtags except to draw attention to a new phenomenon … While the protests continued, and even intensified, the hashtags died down. Interviews revealed two reasons for this. First, once everyone knew the topic, the hashtag was at once superfluous and wasteful on the character-limited Twitter platform. Second, hashtags were seen only as useful for attracting attention to a particular topic, not for talking about it.”
Thus, researchers who were studying the protests by analyzing tweets with protest-related hashtags would have a distorted sense of what was happening because of this behavioral drift. For example, they might believe that the discussion of the protest decreased long before it actually decreased.
The third kind of drift is system drift. In this case, it is not the people changing or their behavior changing, but the system itself changing. For example, over time Facebook has increased the limit on the length of status updates. Thus, any longitudinal study of status updates will be vulnerable to artifacts caused by this change. System drift is closely related to a problem called algorithmic confounding, which I’ll cover in section 2.3.8.
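To see how a change in a length limit can create an artifact, consider a minimal sketch. It assumes that users want to write posts of the same lengths in both periods and that anything over the limit is simply cut off at the limit. The limits and lengths are hypothetical and are not Facebook's actual values. The average observed length rises only because the cap moved, and the share of posts sitting exactly at the cap is one rough sign that the cap is shaping the data.

```python
# Toy illustration of system drift. All values are invented for this sketch.

# Assumption: what users want to write is identical in both periods.
intended_lengths = [50, 120, 200, 400, 900]

# Hypothetical character limits before and after a system change.
limits = {"period_1": 160, "period_2": 420}


def observed_lengths(lengths, limit):
    """Lengths as recorded by the system, assuming posts are cut at the limit."""
    return [min(n, limit) for n in lengths]


def mean(values):
    return sum(values) / len(values)


def share_at_limit(lengths, limit):
    """Fraction of recorded posts whose length equals the limit."""
    return sum(1 for n in lengths if n == limit) / len(lengths)


mean_by_period = {}
capped_share_by_period = {}
for period, limit in limits.items():
    recorded = observed_lengths(intended_lengths, limit)
    mean_by_period[period] = mean(recorded)
    capped_share_by_period[period] = share_at_limit(recorded, limit)

print(mean_by_period)
print(capped_share_by_period)
```

In practice people may also split or shorten posts in response to a limit, so this sketch understates how many ways a system change can reshape what is recorded.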
To conclude, many big data sources are drifting because of changes in who is using them, in how they are being used, and in how the systems work. These sources of change are sometimes interesting research questions in their own right, but they make it harder to use big data sources to track long-term change.