In addition to the big data used in the two previous examples, researchers can also collect their own observational data, as was wonderfully illustrated by Gary King, Jennifer Pan, and Molly Roberts’ (2013) research on censorship by the Chinese government.
Social media posts in China are censored by an enormous state apparatus that is thought to include tens of thousands of people. Researchers and citizens, however, have little sense of how these censors decide what content should be deleted from social media. Scholars of China actually have conflicting expectations about which kinds of posts are most likely to get deleted. Some think that censors focus on posts that are critical of the state, while others think they focus on posts that encourage collective behavior, such as protests. Figuring out which of these expectations is correct has implications for how researchers understand China and other authoritarian governments that engage in censorship. Therefore, King and colleagues wanted to compare posts that were published and subsequently deleted with posts that were published and never deleted.
Collecting these posts involved the amazing engineering feat of crawling more than 1,000 Chinese social media websites—each with different page layouts—finding relevant posts, and then revisiting these posts to see which were subsequently deleted. In addition to the normal engineering problems associated with large-scale web crawling, this project had the added challenge that it needed to be extremely fast because many censored posts are taken down in less than 24 hours. In other words, a slow crawler would miss lots of posts that were censored. Further, the crawlers had to do all this data collection while evading detection lest the social media websites block access or otherwise change their policies in response to the study.
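To make the revisiting step concrete, here is a minimal sketch of how a crawler might recheck previously collected posts to see whether they have disappeared. The function names, the HTTP 404/410 heuristic, and the throttling delay are illustrative assumptions, not details from the original study.

```python
# Hypothetical sketch: revisit collected posts and flag the ones that have
# disappeared. The URL handling and "deleted" heuristic are assumptions.
import time
import requests

def looks_deleted(url):
    """Treat a missing page (HTTP 404/410) as a deleted post. This is a
    simplifying assumption; real sites may return a placeholder page instead."""
    try:
        response = requests.get(url, timeout=10)
        return response.status_code in (404, 410)
    except requests.RequestException:
        return None  # unreachable: status unknown, recheck later

def recheck_posts(post_urls, delay_seconds=1.0):
    """Revisit each previously collected post and record whether it survived."""
    status = {}
    for url in post_urls:
        status[url] = looks_deleted(url)
        time.sleep(delay_seconds)  # throttle requests so the crawler is less conspicuous
    return status
```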
Once this massive engineering task was completed, King and colleagues had obtained about 11 million posts on 85 different topics that were pre-specified based on their expected level of sensitivity. For example, a topic of high sensitivity is Ai Weiwei, the dissident artist; a topic of middle sensitivity is appreciation and devaluation of the Chinese currency; and a topic of low sensitivity is the World Cup. Of these 11 million posts, about 2 million had been censored, but posts on highly sensitive topics were censored only slightly more often than posts on middle- and low-sensitivity topics. In other words, Chinese censors are about as likely to censor a post that mentions Ai Weiwei as a post that mentions the World Cup. These findings did not match the simplistic idea that the government censors all posts on sensitive topics.
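As a rough illustration of the counting involved, here is a minimal sketch of a censorship-rate-by-topic calculation, assuming the posts have been assembled into a pandas DataFrame with a topic label and a censored flag for each post. The column names and toy values are invented for the example, not taken from the study.

```python
import pandas as pd

# Toy stand-in for the real corpus: one row per post, with its topic and
# whether it was later deleted. Values are illustrative only.
posts = pd.DataFrame({
    "topic":    ["Ai Weiwei", "Ai Weiwei", "currency", "World Cup", "World Cup"],
    "censored": [True, False, False, True, False],
})

# Share of posts deleted within each topic
censorship_rate = posts.groupby("topic")["censored"].mean()
print(censorship_rate)
```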
This simple calculation of censorship rate by topic could be misleading, however. For example, the government might censor posts that are supportive of Ai Weiwei but leave posts that are critical of him. In order to distinguish between posts more carefully, the researchers needed to measure the sentiment of each post. In other words, the sentiment of each post is an important latent feature. Unfortunately, despite much work, fully automated methods of sentiment detection using pre-existing dictionaries are still not very good in many situations (think back to the problems creating an emotional timeline of September 11, 2001 from Section 2.3.2.6). Therefore, King and colleagues needed a way to label their 11 million social media posts as to whether they were (1) critical of the state, (2) supportive of the state, or (3) irrelevant or factual reports about the events. This sounds like a massive job, but they solved it using a powerful trick, one that is common in data science but currently relatively rare in social science.
First, in a step typically called pre-processing, the researchers converted the social media posts into a document-term matrix, with one row for each document and one column for each word, recording whether the post contained that word (e.g., protest, traffic, etc.). Next, a group of research assistants hand-labeled the sentiment of a sample of posts. Then, King and colleagues used these hand-labeled data to estimate a machine learning model that could infer the sentiment of a post based on its characteristics. Finally, they used this machine learning model to estimate the sentiment of all 11 million posts. Thus, rather than manually reading and labeling 11 million posts (which would have been logistically impossible), they manually labeled a small number of posts and then used what data scientists would call supervised learning to estimate the categories of all the posts. After completing this analysis, King and colleagues were able to conclude that, somewhat surprisingly, the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.
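The pipeline just described can be sketched in a few lines of scikit-learn, assuming the posts have already been segmented into words (Chinese text requires a word-segmentation step not shown here) and that a small hand-labeled sample is available. The binary CountVectorizer stands in for the document-term matrix and logistic regression stands in for the classifier; the original authors' exact estimator may well have differed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hand-labeled sample: 1 = critical of the state, 2 = supportive, 3 = irrelevant/factual.
# The texts and labels below are placeholders for the research assistants' work.
labeled_texts = ["... hand-labeled post one ...",
                 "... hand-labeled post two ...",
                 "... hand-labeled post three ..."]
labels = [1, 2, 3]

# The full corpus whose sentiment is to be inferred (placeholder texts).
all_texts = ["... unlabeled post a ...", "... unlabeled post b ..."]

# Pre-processing: document-term matrix with one row per post and one column per word.
vectorizer = CountVectorizer(binary=True)
X_labeled = vectorizer.fit_transform(labeled_texts)

# Supervised learning: fit a classifier on the hand-labeled sample.
model = LogisticRegression(max_iter=1000)
model.fit(X_labeled, labels)

# Apply the fitted model to every remaining post.
X_all = vectorizer.transform(all_texts)
predicted_categories = model.predict(X_all)
```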
In the end, King and colleagues discovered that only three types of posts were regularly censored: pornography, criticism of censors, and those that had collective action potential (i.e., the possibility of leading to large-scale protests). By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. In subsequent research, they directly intervened in the Chinese social media ecosystem by creating posts with systematically different content and measuring which got censored (King, Pan, and Roberts 2014). We will learn more about experimental approaches in Chapter 4. Further, foreshadowing a theme that will occur throughout the book, these latent-attribute inference problems—which can sometimes be solved with supervised learning—turn out to be very common in social research in the digital age. You will see pictures very similar to Figure 2.3 in Chapters 3 (Asking questions) and 5 (Creating mass collaboration); it is one of the few ideas that appears in multiple chapters.
All three of these examples—the working behavior of taxi drivers in New York, friendship formation by students, and the social media censorship behavior of the Chinese government—show that relatively simple counting of observational data can enable researchers to test theoretical predictions. In some cases, big data enables you to do this counting relatively directly (as in the case of the New York taxis). In other cases, researchers will need to collect their own observational data (as in the case of Chinese censorship); deal with incompleteness by merging data together (as in the case of network evolution); or perform some form of latent-trait inference (as, again, in the case of Chinese censorship). As I hope these examples show, for researchers who are able to ask interesting questions, big data holds great promise.