Some of the information that companies and governments have is sensitive.
Health insurance companies have detailed information about the medical care received by their customers. This information could be used for important research about health, but if it became public, it could potentially lead to emotional harm (e.g., embarrassment) or economic harm (e.g., loss of employment). Many other big data sources also have information that is sensitive, which is part of the reason why they are often inaccessible.
Unfortunately, it turns out to be quite tricky to decide what information is actually sensitive (Ohm 2015), as was illustrated by the Netflix Prize. As I will describe in chapter 5, in 2006 Netflix released 100 million movie ratings provided by almost 500,000 members and had an open call where people from all over the world submitted algorithms that could improve Netflix’s ability to recommend movies. Before releasing the data, Netflix removed any obvious personally identifying information, such as names. But just two weeks after the data was released, Arvind Narayanan and Vitaly Shmatikov (2008) showed that it was possible to learn about specific people’s movie ratings using a trick that I’ll show you in chapter 6. Even if an attacker could discover a person’s movie ratings, it might seem that there is still nothing sensitive here. While that might be true in general, for at least some of the 500,000 people in the dataset, movie ratings were sensitive. In fact, in response to the release and re-identification of the data, a closeted lesbian woman joined a class-action suit against Netflix. Here’s how the problem was expressed in this lawsuit (Singel 2009):
“[M]ovie and rating data contains information of a … highly personal and sensitive nature. The member’s movie data exposes a Netflix member’s personal interest and/or struggles with various highly personal issues, including sexuality, mental illness, recovery from alcoholism, and victimization from incest, physical abuse, domestic violence, adultery, and rape.”
This example shows that there can be information that some people consider sensitive inside of what might appear to be a benign database. Further, it shows that a main defense that researchers employ to protect sensitive data—de-identification—can fail in surprising ways. These two ideas are developed in greater detail in chapter 6.
The final thing to keep in mind about sensitive data is that collecting it without people’s consent raises ethical questions, even if no specific harm is caused. Much like watching someone taking a shower without their consent might be considered a violation of that person’s privacy, collecting sensitive information—and remember how hard it can be to decide what is sensitive—without consent creates potential privacy concerns. I’ll return to questions about privacy in chapter 6.
In conclusion, big data sources, such as government and business administrative records, are generally not created for the purpose of social research. The big data sources of today, and likely tomorrow, tend to have 10 characteristics. Many of the properties that are generally considered to be good for research (big, always-on, and nonreactive) come from the fact that in the digital age companies and governments are able to collect data at a scale that was not possible previously. And many of the properties that are generally considered to be bad for research (incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive) come from the fact that these data were not collected by researchers for researchers. So far, I’ve talked about government and business data together, but there are some differences between the two. In my experience, government data tends to be less nonrepresentative, less algorithmically confounded, and less drifting. On the other hand, business administrative records tend to be more always-on. Understanding these 10 general characteristics is a helpful first step toward learning from big data sources. And now we turn to research strategies we can use with this data.