Activities


[, ] Algorithmic confounding was a problem with Google Flu Trends. Read the paper by Lazer et al. (2014), and write a short, clear email to an engineer at Google explaining the problem and offering an idea of how to fix it.
[] Bollen, Mao, and Zeng (2011) claim that data from Twitter can be used to predict the stock market. This finding led to the creation of a hedge fund, Derwent Capital Markets, to invest in the stock market based on data collected from Twitter (Jordan 2010). What evidence would you want to see before putting your money in that fund?
[] While some public health advocates consider e-cigarettes an effective aid for smoking cessation, others warn about the potential risks, such as the high levels of nicotine. Imagine that a researcher decides to study public opinion toward e-cigarettes by collecting e-cigarette-related Twitter posts and conducting sentiment analysis.
1. What are the three possible biases that you are most worried about in this study?
2. Clark et al. (2016) ran just such a study. First, they collected 850,000 tweets that used e-cigarette-related keywords from January 2012 through December 2014. Upon closer inspection, they realized that many of these tweets were automated (i.e., not produced by humans) and that many of these automated tweets were essentially commercials. They developed a human detection algorithm to separate automated tweets from organic tweets. Using this human detection algorithm, they found that 80% of tweets were automated. Does this finding change your answer to part 1?
3. When they compared the sentiment in organic and automated tweets, they found that the automated tweets were more positive than organic tweets (6.17 versus 5.84). Does this finding change your answer to part 2?
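To see why the share of automated tweets matters, it can help to work out what a naive analysis that pools all tweets would report. The sketch below is only an arithmetic illustration: it assumes that the 80% share and the two mean sentiment scores quoted above describe the same set of tweets, which the activity text does not state explicitly. The function name `pooled_mean` is hypothetical.

```python
# Figures quoted in the activity text (Clark et al. 2016).
AUTOMATED_SHARE = 0.80   # fraction of tweets flagged as automated
AUTOMATED_MEAN = 6.17    # mean sentiment of automated tweets
ORGANIC_MEAN = 5.84      # mean sentiment of organic tweets


def pooled_mean(share_a: float, mean_a: float, mean_b: float) -> float:
    """Mean of a mixture of two groups, weighted by group share.

    ASSUMPTION: both group means and the share refer to the same sample.
    """
    return share_a * mean_a + (1.0 - share_a) * mean_b


pooled = pooled_mean(AUTOMATED_SHARE, AUTOMATED_MEAN, ORGANIC_MEAN)
gap = pooled - ORGANIC_MEAN

print(f"pooled mean sentiment: {pooled:.3f}")
print(f"difference from organic-only mean: {gap:.3f}")
```

Under these assumptions the pooled mean is about 6.10, so an analysis that ignored automation would report sentiment much closer to the automated tweets than to the human-written ones.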
[] In November 2009, Twitter changed the question in the tweet box from “What are you doing?” to “What’s happening?” (https://blog.twitter.com/2009/whats-happening).
1. How do you think the change of prompts will affect who tweets and/or what they tweet?
2. Name one research project for which you would prefer the prompt “What are you doing?” Explain why.
3. Name one research project for which you would prefer the prompt “What’s happening?” Explain why.
[] “Retweets” are often used to measure influence and spread of influence on Twitter. Initially, users had to copy and paste the tweet they liked, tag the original author with his/her handle, and manually type “RT” before the tweet to indicate that it was a retweet. Then, in 2009, Twitter added a “retweet” button. In June 2016, Twitter made it possible for users to retweet their own tweets (https://twitter.com/twitter/status/742749353689780224). Do you think these changes should affect how you use “retweets” in your research? Why or why not?
[, , , ] In a widely discussed paper, Michel and colleagues (2011) analyzed the content of more than five million digitized books in an attempt to identify long-term cultural trends. The data that they used has now been released as the Google NGrams dataset, and so we can use the data to replicate and extend some of their work.

In one of the many results in the paper, Michel and colleagues argued that we are forgetting faster and faster. For a particular year, say “1883,” they calculated the proportion of 1-grams published in each year between 1875 and 1975 that were “1883”. They reasoned that this proportion is a measure of the interest in events that happened in that year. In their figure 3a, they plotted the usage trajectories for three years: 1883, 1910, and 1950. These three years share a common pattern: little use before that year, then a spike, then decay. Next, to quantify the rate of decay for each year, Michel and colleagues calculated the “half-life” of each year for all years between 1875 and 1975. In their figure 3a (inset), they showed that the half-life of each year is decreasing, and they argued that this means that we are forgetting the past faster and faster. They used Version 1 of the English language corpus, but subsequently Google has released a second version of the corpus. Please read all the parts of the question before you begin coding.

This activity will give you practice writing reusable code, interpreting results, and data wrangling (such as working with awkward files and handling missing data). This activity will also help you get up and running with a rich and interesting dataset.
1. Get the raw data from the Google Books NGram Viewer website. In particular, you should use version 2 of the English language corpus, which was released on July 1, 2012. Uncompressed, this file is 1.4GB.
2. Recreate the main part of figure 3a of Michel et al. (2011). To recreate this figure, you will need two files: the one you downloaded in part 1 and the “total counts” file, which you can use to convert the raw counts into proportions. Note that the total counts file has a structure that may make it a bit hard to read in. Does version 2 of the NGram data produce similar results to those presented in Michel et al. (2011), which are based on version 1 data?
3. Now check your graph against the graph created by the NGram Viewer.
4. Recreate figure 3a (main figure), but change the $y$-axis to be the raw mention count (not the rate of mentions).
5. Does the difference between parts 2 and 4 lead you to reevaluate any of the results of Michel et al. (2011)? Why or why not?
6. Now, using the proportion of mentions, replicate the inset of figure 3a. That is, for each year between 1875 and 1975, calculate the half-life of that year. The half-life is defined to be the number of years that pass before the proportion of mentions reaches half its peak value. Note that Michel et al. (2011) do something more complicated to estimate the half-life—see section III.6 of the Supporting Online Information—but they claim that both approaches produce similar results. Does version 2 of the NGram data produce similar results to those presented in Michel et al. (2011), which are based on version 1 data? (Hint: Don’t be surprised if it doesn’t.)
7. Were there any years that were outliers, such as years that were forgotten particularly quickly or particularly slowly? Briefly speculate about possible reasons for that pattern and explain how you identified the outliers.
8. Now replicate this result for version 2 of the NGrams data in Chinese, French, German, Hebrew, Italian, Russian and Spanish.
9. Comparing across all languages, were there any years that were outliers, such as years that were forgotten particularly quickly or particularly slowly? Briefly speculate about possible reasons for that pattern.
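The steps in this activity (convert raw counts to proportions, then compute a half-life) can be sketched in a few small reusable functions. This is a minimal sketch, not the method of Michel et al. (2011): it uses the simple half-life definition given above, and it assumes two file layouts that you should verify against the files you download: 1-gram rows of the form `ngram<TAB>year<TAB>match_count<TAB>volume_count`, and a total counts file that is one long line of tab-separated `year,match_count,page_count,volume_count` records. All function names are hypothetical, and the toy numbers are made up for illustration only.

```python
from typing import Dict, Iterable, Optional


def parse_total_counts(text: str) -> Dict[int, int]:
    """Map year -> total 1-gram count.

    ASSUMPTION: records are tab-separated, each record being
    'year,match_count,page_count,volume_count'.
    """
    totals: Dict[int, int] = {}
    for record in text.split("\t"):
        record = record.strip()
        if not record:  # skip empty leading/trailing records
            continue
        year, match_count = record.split(",")[:2]
        totals[int(year)] = int(match_count)
    return totals


def parse_ngram_counts(lines: Iterable[str], target: str) -> Dict[int, int]:
    """Map year -> raw count for one 1-gram.

    ASSUMPTION: rows are 'ngram<TAB>year<TAB>match_count<TAB>volume_count'.
    """
    counts: Dict[int, int] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != target:
            continue
        counts[int(fields[1])] = int(fields[2])
    return counts


def to_proportions(counts: Dict[int, int], totals: Dict[int, int]) -> Dict[int, float]:
    """Divide raw counts by yearly totals; years with no total are dropped."""
    return {y: counts[y] / totals[y] for y in sorted(counts) if totals.get(y)}


def half_life(props: Dict[int, float]) -> Optional[int]:
    """Years from the peak until the proportion first falls to half the peak.

    Returns None if the series is empty or never falls that far.
    """
    if not props:
        return None
    peak_year = max(props, key=props.get)
    half = props[peak_year] / 2
    for year in sorted(props):
        if year > peak_year and props[year] <= half:
            return year - peak_year
    return None


# Toy data, invented purely to exercise the functions (not real NGram counts).
TOY_TOTALS = " \t1883,1000,10,1\t1884,1000,10,1\t1885,1000,10,1\t1886,1000,10,1\t"
TOY_ROWS = [
    "1883\t1883\t40\t1",
    "1883\t1884\t30\t1",
    "1883\t1885\t15\t1",
    "1883\t1886\t10\t1",
    "1910\t1883\t5\t1",
]

toy_props = to_proportions(parse_ngram_counts(TOY_ROWS, "1883"),
                           parse_total_counts(TOY_TOTALS))
toy_half_life = half_life(toy_props)
print(toy_half_life)
```

Keeping parsing, normalization, and the half-life calculation in separate functions makes it easier to rerun the same analysis on the raw counts and on the other language corpora. The real 1-gram file is large, so in practice you would stream it line by line rather than load it into memory.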
[, , , ] Penney (2016) explored whether the widespread publicity about NSA/PRISM surveillance (i.e., the Snowden revelations) in June 2013 was associated with a sharp and sudden decrease in traffic to Wikipedia articles on topics that raise privacy concerns. If so, this change in behavior would be consistent with a chilling effect resulting from mass surveillance. The approach of Penney (2016) is sometimes called an interrupted time series design, and it is related to the approaches described in section 2.4.3.

To choose the topic keywords, Penney referred to the list used by the US Department of Homeland Security for tracking and monitoring social media. The DHS list categorizes certain search terms into a range of issues, such as “Health Concern,” “Infrastructure Security,” and “Terrorism.” For the study group, Penney used the 48 keywords related to “Terrorism” (see appendix table 8). He then aggregated Wikipedia article view counts on a monthly basis for the corresponding 48 Wikipedia articles over a 32-month period, from the beginning of January 2012 to the end of August 2014. To strengthen his argument, he also created several comparison groups by tracking article views on other topics.

Now, you are going to replicate and extend Penney (2016). All the raw data that you will need for this activity is available from Wikipedia. Alternatively, you can get it from the R package wikipediatrend (Meissner and R Core Team 2016). When you write up your responses, please note which data source you used. (Note that this same activity also appears in chapter 6.) This activity will give you practice in data wrangling and thinking about natural experiments in big data sources. It will also get you up and running with a potentially interesting data source for future projects.
1. Read Penney (2016) and replicate his figure 2, which shows the page views for “Terrorism”-related pages before and after the Snowden revelations. Interpret the findings.
2. Next, replicate figure 4A, which compares the study group (“Terrorism”-related articles) with a comparator group using keywords categorized under “DHS & Other Agencies” from the DHS list (see appendix table 10 and footnote 139). Interpret the findings.
3. In part 2 you compared the study group with one comparator group. Penney also compared with two other comparator groups: “Infrastructure Security”-related articles (appendix table 11) and popular Wikipedia pages (appendix table 12). Come up with an alternative comparator group, and test whether the findings from part 2 are sensitive to your choice of comparator group. Which choice of comparator group makes the most sense? Why?
4. Penney stated that keywords relating to “Terrorism” were used to select the Wikipedia articles because the US government cited terrorism as a key justification for its online surveillance practices. As a check of these 48 “Terrorism”-related keywords, Penney (2016) also conducted a survey on MTurk, asking respondents to rate each of the keywords in terms of Government Trouble, Privacy-Sensitive, and Avoidance (appendix tables 7 and 8). Replicate the survey on MTurk and compare your results.
5. Based on the results in part 4 and your reading of the article, do you agree with Penney’s choice of topic keywords in the study group? Why or why not? If not, what would you suggest instead?
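One common way to summarize an interrupted time series is to fit a separate linear trend before and after the interruption and then compare the two fits at the moment of the interruption. The sketch below does this with ordinary least squares on monthly totals. It is an illustration of the general design, not necessarily the exact specification used by Penney (2016). The function names, the choice of which month starts the "after" period, and the noise-free toy series are all assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interrupted_time_series(views: Sequence[float], break_index: int) -> Dict[str, float]:
    """Compare linear trends before and after `break_index`.

    `views[t]` is the total page views in month t (0-based), and months
    with t >= break_index are treated as "after" the interruption.
    """
    months = list(range(len(views)))
    pre_b0, pre_b1 = fit_line(months[:break_index], views[:break_index])
    post_b0, post_b1 = fit_line(months[break_index:], views[break_index:])
    # What the pre-period trend predicts for the first "after" month,
    # versus what the post-period trend says about that same month.
    expected = pre_b0 + pre_b1 * break_index
    observed = post_b0 + post_b1 * break_index
    return {
        "level_change": observed - expected,
        "slope_change": post_b1 - pre_b1,
    }


# Toy series, invented for illustration (not Wikipedia data): 32 months,
# rising by 2 per month, then dropping and declining by 1 per month.
# ASSUMPTION: month index 17 (June 2013 if month 0 is January 2012)
# is the first "after" month.
BREAK = 17
toy_views = [100 + 2 * t for t in range(BREAK)] + \
            [110 - (t - BREAK) for t in range(BREAK, 32)]

result = interrupted_time_series(toy_views, BREAK)
print(result)
```

For the toy series, the level change is about -24 and the slope change is about -3. With real page-view data you would also want standard errors and the same calculation on each comparator group, since a drop that also appears in the comparison articles is weaker evidence of a chilling effect.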
[] Efrati (2016) reported, based on confidential information, that “total sharing” on Facebook had declined by about 5.5% year over year while “original broadcast sharing” was down 21% year over year. This decline was particularly acute with Facebook users under 30 years of age. The report attributed the decline to two factors. One is the growth in the number of “friends” people have on Facebook. The other is that some sharing activity has shifted to messaging and to competitors such as Snapchat. The report also revealed several tactics that Facebook had tried to boost sharing, including News Feed algorithm tweaks that make original posts more prominent, as well as periodic reminders of the original posts with the “On This Day” feature. What implications, if any, do these findings have for researchers who want to use Facebook as a data source?
[] What is the difference between a sociologist and a historian? According to Goldthorpe (1991), the main difference is control over data collection. Historians are forced to use relics, whereas sociologists can tailor their data collection to specific purposes. Read Goldthorpe (1991). How is the difference between sociology and history related to the idea of custommades and readymades?
[] This builds on the previous question. Goldthorpe (1991) drew a number of critical responses, including one from Nicky Hart (1994) that challenged Goldthorpe’s devotion to tailor-made data. To clarify the potential limitations of tailor-made data, Hart described the Affluent Worker Project, a large survey to measure the relationship between social class and voting that was conducted by Goldthorpe and colleagues in the mid-1960s. As one might expect from a scholar who favored designed data over found data, the Affluent Worker Project collected data that were tailored to address a recently proposed theory about the future of social class in an era of increasing living standards. But Goldthorpe and colleagues somehow “forgot” to collect information about the voting behavior of women. Here’s how Nicky Hart (1994) summarized the whole episode:

“… it [is] difficult to avoid the conclusion that women were omitted because this ‘tailor made’ dataset was confined by a paradigmatic logic which excluded female experience. Driven by a theoretical vision of class consciousness and action as male preoccupations … , Goldthorpe and his colleagues constructed a set of empirical proofs which fed and nurtured their own theoretical assumptions instead of exposing them to a valid test of adequacy.”

Hart continued:

“The empirical findings of the Affluent Worker Project tell us more about the masculinist values of mid-century sociology than they inform the processes of stratification, politics and material life.”

Can you think of other examples where tailor-made data collection has the biases of the data collector built into it? How does this compare to algorithmic confounding? What implications might this have for when researchers should use readymades and when they should use custommades?
[] In this chapter, I have contrasted data collected by researchers for researchers with administrative records created by companies and governments. Some people call these administrative records “found data,” which they contrast with “designed data.” It is true that administrative records are found by researchers, but they are also highly designed. For example, modern tech companies work very hard to collect and curate their data. Thus, these administrative records are both found and designed; it just depends on your perspective (figure 2.12).

Figure 2.12: The picture is both a duck and a rabbit; what you see depends on your perspective. Big data sources are both found and designed; again, what you see depends on your perspective. For example, the call data records collected by a mobile-phone company are found data from the perspective of a researcher. But, these exact same records are designed data from the perspective of someone working in the billing department of the phone company. Source: Popular Science Monthly (1899)/Wikimedia Commons.

Provide an example of a data source where seeing it as both found and designed is helpful when using that data source for research.
[] In a thoughtful essay, Christian Sandvig and Eszter Hargittai (2015) split digital research into two broad categories depending on whether the digital system is an “instrument” or “object of study.” An example of the first kind—where the system is an instrument—is the research by Bengtsson and colleagues (2011) on using mobile-phone data to track migration after the earthquake in Haiti in 2010. An example of the second kind—where the system is an object of study—is research by Jensen (2007) on how the introduction of mobile phones throughout Kerala, India impacted the functioning of the market for fish. I find this distinction helpful because it clarifies that studies using digital data sources can have quite different goals even if they are using the same kind of data source. In order to further clarify this distinction, describe four studies that you’ve seen: two that use a digital system as an instrument and two that use a digital system as an object of study. You can use examples from this chapter if you want.