Algorithmic confounding was a problem with Google Flu Trends. Read the paper by Lazer et al. (2014), and write a short, clear email to an engineer at Google explaining the problem and offering an idea of how to fix it.
Bollen, Mao, and Zeng (2011) claim that data from Twitter can be used to predict the stock market. This finding led to the creation of a hedge fund, Derwent Capital Markets, to invest in the stock market based on data collected from Twitter (Jordan 2010). What evidence would you want to see before putting your money in that fund?
While some public health advocates hail e-cigarettes as an effective aid for smoking cessation, others warn about potential risks, such as the high levels of nicotine. Imagine that a researcher decides to study public opinion toward e-cigarettes by collecting e-cigarette-related Twitter posts and conducting sentiment analysis.
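If you decide to try this yourself, here is a minimal sketch of the sentiment-analysis step. It assumes you have already collected the tweets into a local file (the filename ecig_tweets.txt is a hypothetical placeholder) and uses the VADER analyzer that ships with NLTK; the ±0.05 cutoffs are the conventional thresholds for labeling a compound score positive or negative.

```python
# A rough sketch, not part of the activity: score already-collected tweets with
# the VADER sentiment analyzer that ships with NLTK. The input file name is a
# hypothetical placeholder for wherever you stored the collected tweets.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()

with open("ecig_tweets.txt", encoding="utf-8") as f:
    tweets = [line.strip() for line in f if line.strip()]

# VADER's compound score runs from -1 (most negative) to +1 (most positive);
# +/- 0.05 are the conventional cutoffs for calling a text positive or negative.
scores = [analyzer.polarity_scores(t)["compound"] for t in tweets]
share_positive = sum(s > 0.05 for s in scores) / len(scores)
share_negative = sum(s < -0.05 for s in scores) / len(scores)

print(f"Tweets analyzed: {len(scores)}")
print(f"Share positive: {share_positive:.2%}, share negative: {share_negative:.2%}")
```

VADER is used here only because it requires no labeled training data; any other off-the-shelf classifier could be swapped in at the same point.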
In November 2009, Twitter changed the question in the tweet box from “What are you doing?” to “What’s happening?” (https://blog.twitter.com/2009/whats-happening).
Kwak et al. (2010) analyzed 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets between June 6th and June 31st, 2009. Based on this analysis, they concluded that Twitter serves more as a new medium of information sharing than as a social network.
“Retweets” are often used to measure influence and the spread of influence on Twitter. Initially, users had to copy and paste the tweet they liked, tag the original author with his/her handle, and manually type “RT” before the tweet to indicate that it was a retweet. Then, in 2009, Twitter added a “retweet” button. In June 2016, Twitter made it possible for users to retweet their own tweets (https://twitter.com/twitter/status/742749353689780224). Do you think these changes should affect how you use “retweets” in your research? Why or why not?
Michel et al. (2011) constructed a corpus emerging from Google’s effort to digitize books. Using the first version of the corpus, which was published in 2009 and contained over 5 million digitized books, the authors analyzed word usage frequency to investigate linguistic changes and cultural trends. Soon the Google Books Corpus became a popular data source for researchers, and a second version of the database was released in 2012.
However, Pechenick, Danforth, and Dodds (2015) warned that researchers need to fully characterize the sampling process of the corpus before using it to draw broad conclusions. The main issue is that the corpus is library-like, containing one copy of each book. As a result, a single prolific author is able to noticeably insert new phrases into the Google Books lexicon. Moreover, scientific texts constitute an increasingly substantial portion of the corpus throughout the 1900s. In addition, by comparing two versions of the English Fiction datasets, Pechenick et al. found evidence that insufficient filtering was used in producing the first version. All of the data needed for this activity is available here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
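For a sense of what working with these files involves, here is a minimal sketch. It assumes you have downloaded one of the version 2 (20120701) English one-gram files linked from the page above, and that each line is tab-separated as ngram, year, match_count, volume_count; check the format notes on that page before relying on this. The word and filename are illustrative choices, not part of the activity.

```python
# A rough sketch: count how often one word appears per year in a single
# downloaded one-gram file. Filename and word are illustrative assumptions.
import gzip
from collections import defaultdict

word = "awesome"                       # hypothetical word of interest
counts_by_year = defaultdict(int)      # year -> occurrences of `word`

with gzip.open("googlebooks-eng-all-1gram-20120701-a.gz", "rt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue                   # skip any line that does not match the assumed layout
        ngram, year, match_count, _volume_count = fields
        if ngram == word:
            counts_by_year[int(year)] += int(match_count)

for year in sorted(counts_by_year):
    print(year, counts_by_year[year])
```

To get usage frequencies comparable to those analyzed by Michel et al. (2011), you would divide each year’s count by the total number of words for that year, which is available in the total-counts file on the same page.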
Penney (2016) explores whether the widespread publicity about NSA/PRISM surveillance (i.e., the Snowden revelations) in June 2013 is associated with a sharp and sudden decrease in traffic to Wikipedia articles on topics that raise privacy concerns. If so, this change in behavior would be consistent with a chilling effect resulting from mass surveillance. The approach of Penney (2016) is sometimes called an interrupted time series design and is related to the approaches in the chapter about approximating experiments from observational data (Section 2.4.3).
To choose the topic keywords, Penney referred to the list used by the U.S. Department of Homeland Security for tracking and monitoring social media. The DHS list categorizes certain search terms into a range of issues, e.g., “Health Concern,” “Infrastructure Security,” and “Terrorism.” For the study group, Penney used the forty-eight keywords related to “Terrorism” (see Table 8 in the appendix). He then aggregated Wikipedia article view counts on a monthly basis for the corresponding forty-eight Wikipedia articles over a thirty-two-month period, from the beginning of January 2012 to the end of August 2014. To strengthen his argument, he also created several comparison groups by tracking article views on other topics.
Now, you are going to replicate and extend Penney (2016). All the raw data that you will need for this activity is available from Wikipedia (https://dumps.wikimedia.org/other/pagecounts-raw/). Or you can get it from the R package wikipediatrend (Meissner and Team 2016). When you write up your responses, please note which data source you used. (Note: this same activity also appears in Chapter 6.)
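As one concrete way to start, here is a minimal sketch of the monthly aggregation step, assuming you have downloaded a set of hourly pagecounts-raw files (names like pagecounts-20130601-000000.gz) into a local pagecounts/ directory, and that each line is space-separated as project, page title, view count, and bytes transferred. The two article titles are placeholders standing in for the forty-eight “Terrorism” articles.

```python
# A rough sketch: sum hourly English Wikipedia page views into monthly totals
# for a small set of articles. Directory layout and article titles are
# placeholder assumptions, not Penney's actual data or keyword list.
import glob
import gzip
import os
from collections import defaultdict

articles = {"Al-Qaeda", "Dirty_bomb"}     # placeholder: substitute the DHS-derived list
monthly_views = defaultdict(int)          # (YYYY-MM, article title) -> summed views

for path in sorted(glob.glob("pagecounts/pagecounts-*.gz")):
    # The month is encoded in the filename: pagecounts-YYYYMMDD-HHMMSS.gz
    stamp = os.path.basename(path)[len("pagecounts-"):]
    month = f"{stamp[0:4]}-{stamp[4:6]}"
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split(" ")
            if len(fields) != 4:
                continue                  # skip malformed lines
            project, title, views, _size = fields
            if project == "en" and title in articles:
                monthly_views[(month, title)] += int(views)

for (month, title) in sorted(monthly_views):
    print(month, title, monthly_views[(month, title)])
```

From there, building the study-group and comparison-group time series and looking for a level shift around June 2013 follows the interrupted time series logic described above.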
Efrati (2016) reports, based on confidential information, that “total sharing” on Facebook had declined by about 5.5% year over year, while “original broadcast sharing” was down 21% year over year. This decline was particularly acute among Facebook users under 30 years of age. The report attributed the decline to two factors. One is the growth in the number of “friends” people have on Facebook. The other is that some sharing activity has shifted to messaging and to competitors such as Snapchat. The report also revealed several tactics Facebook had tried in order to boost sharing, including News Feed algorithm tweaks that make original posts more prominent, as well as periodic “On This Day” reminders of posts users made several years ago. What implications, if any, do these findings have for researchers who want to use Facebook as a data source?
Tumasjan et al. (2010) reported that the proportion of tweets mentioning a political party matched the proportion of votes that party received in the German parliamentary election in 2009 (Figure 2.9). In other words, it appeared that you could use Twitter to predict the election. At the time this study was published, it was considered extremely exciting because it seemed to suggest a valuable use for a common source of big data.
Given the bad features of big data, however, you should immediately be skeptical of this result. Germans on Twitter in 2009 were quite a non-representative group, and supporters of one party might tweet about politics more often. Thus, it seems surprising that all the possible biases that you could imagine would somehow cancel out. In fact, the results in Tumasjan et al. (2010) turned out to be too good to be true. In their paper, Tumasjan et al. (2010) considered six political parties: the Christian Democrats (CDU), the Christian Social Union (CSU), the Social Democrats (SPD), the Liberals (FDP), The Left (Die Linke), and the Green Party (Grüne). However, the most mentioned German political party on Twitter at that time was the Pirate Party (Piraten), a party that fights government regulation of the Internet. When the Pirate Party was included in the analysis, Twitter mentions became a terrible predictor of election results (Figure 2.9) (Jungherr, Jürgens, and Schoen 2012).
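To see the mechanics behind this reversal, here is a minimal sketch of the mention-share calculation, written from the description above rather than from the authors’ own code; the party names and counts are left for you to fill in from the papers.

```python
# A rough sketch of the "mention share" forecast described above, not the
# authors' implementation. Inputs are dictionaries mapping party names to
# mention counts and to actual vote shares.
def mention_shares(mention_counts):
    """Map each party to its share of all mentions."""
    total = sum(mention_counts.values())
    return {party: count / total for party, count in mention_counts.items()}


def mean_absolute_error(predicted_shares, vote_shares):
    """Average absolute gap between predicted and actual shares, over common parties."""
    parties = predicted_shares.keys() & vote_shares.keys()
    return sum(abs(predicted_shares[p] - vote_shares[p]) for p in parties) / len(parties)
```

Because the shares must sum to one, adding a heavily mentioned party such as the Pirate Party necessarily shrinks every other party’s share, which is part of why including it makes the prediction so much worse.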
Subsequently, other researchers around the world have used fancier methods—such as using sentiment analysis to distinguish between positive and negative mentions of the parties—in order to improve the ability of Twitter data to predict a variety of different types of elections (Gayo-Avello 2013; Jungherr 2015, Ch. 7.). Here’s how Huberty (2015) summarized the results of these attempts to predict elections:
“All known forecasting methods based on social media have failed when subjected to the demands of true forward-looking electoral forecasting. These failures appear to be due to fundamental properties of social media, rather than to methodological or algorithmic difficulties. In short, social media do not, and probably never will, offer a stable, unbiased, representative picture of the electorate; and convenience samples of social media lack sufficient data to fix these problems post hoc.”
Read some of the research that led Huberty (2015) to that conclusion, and write a one-page memo to a political candidate describing if and how Twitter should be used to forecast elections.
What is the difference between a sociologist and a historian? According to Goldthorpe (1991), the main difference is control over data collection. Historians are forced to use relics, whereas sociologists can tailor their data collection to specific purposes. Read Goldthorpe (1991). How is the difference between sociology and history related to the idea of Custommades and Readymades?
Building on the previous question, Goldthorpe (1991) drew a number of critical responses, including one from Nicky Hart (1994), who challenged Goldthorpe’s devotion to tailor-made data. To clarify the potential limitations of tailor-made data, Hart described the Affluent Worker Project, a large survey to measure the relationship between social class and voting that was conducted by Goldthorpe and colleagues in the mid-1960s. As one might expect from a scholar who favored designed data over found data, the Affluent Worker Project collected data that was tailored to address a recently proposed theory about the future of social class in an era of increasing living standards. But Goldthorpe and colleagues somehow “forgot” to collect information about the voting behavior of women. Here’s how Nicky Hart (1994) summarized the whole episode:
“. . . it [is] difficult to avoid the conclusion that women were omitted because this ‘tailor made’ dataset was confined by a paradigmatic logic which excluded female experience. Driven by a theoretical vision of class consciousness and action as male preoccupations . . . , Goldthorpe and his colleagues constructed a set of empirical proofs which fed and nurtured their own theoretical assumptions instead of exposing them to a valid test of adequacy.”
Hart continued:
“The empirical findings of the Affluent Worker Project tell us more about the masculinist values of mid-century sociology than they inform the processes of stratification, politics and material life.”
Can you think of other examples where tailor-made data collection has the biases of the data collector built into it? How does this compare to algorithmic confounding? What implications might this have for when researchers should use Readymades and when they should use Custommades?
In this chapter, I contrasted data collected by researchers for researchers with administrative records created by companies and governments. Some people call these administrative records “found data,” which they contrast with “designed data.” It is true that administrative records are found by researchers, but they are also highly designed. For example, modern tech companies spend enormous amounts of time and resources to collect and curate their data. Thus, these administrative records are both found and designed; it just depends on your perspective (Figure 2.10).
Provide an example of a data source where seeing it as both found and designed is helpful when using that source for research.
In a thoughtful essay, Christian Sandvig and Eszter Hargittai (2015) describe two kinds of digital research, in which the digital system is either an “instrument” or an “object of study.” An example of the first kind is Bengtsson and colleagues (2011), who used mobile phone data to track migration after the earthquake in Haiti in 2010. An example of the second kind is Jensen (2007), who studied how the introduction of mobile phones throughout Kerala, India, affected the functioning of the market for fish. I find this distinction helpful because it clarifies that studies using digital data sources can have quite different goals, even if they use the same kind of data source. To further clarify this distinction, describe four studies that you’ve seen: two that use a digital system as an instrument and two that use a digital system as an object of study. You can use examples from this chapter if you want.