Data held by companies and governments are difficult for researchers to access.
In May 2014, the US National Security Agency opened a data center in rural Utah with an awkward name: the Intelligence Community Comprehensive National Cybersecurity Initiative Data Center. This facility, which has come to be known as the Utah Data Center, is reported to have astounding capabilities. One report alleges that it is able to store and process all forms of communication including “the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails—parking receipts, travel itineraries, bookstore purchases, and other digital ‘pocket litter’” (Bamford 2012). In addition to raising concerns about the sensitive nature of much of the information captured in big data, which will be described further below, the Utah Data Center is an extreme example of a rich data source that is inaccessible to researchers. More generally, many sources of big data that would be useful are controlled and restricted by governments (e.g., tax data and educational data) or companies (e.g., queries to search engines and phone call metadata). Therefore, even though these data sources exist, they are effectively useless for social research because they are inaccessible.
In my experience, many researchers based at universities misunderstand the source of this inaccessibility. These data are inaccessible not because people at companies and governments are stupid, lazy, or uncaring. Rather, there are serious legal, business, and ethical barriers that prevent data access. For example, some terms-of-service agreements for websites allow data to be used only by employees or only to improve the service, so certain forms of data sharing could expose companies to legitimate lawsuits from customers. There are also substantial business risks to companies involved in sharing data. Try to imagine how the public would respond if personal search data accidentally leaked out from Google as part of a university research project. Such a data breach, if extreme, might even be an existential risk for the company. So Google, like most large companies, is very risk-averse about sharing data with researchers.
In fact, almost everyone who is in a position to provide access to large amounts of data knows the story of Abdur Chowdhury. In 2006, when he was the head of research at AOL, he intentionally released to the research community what he thought were anonymized search queries from 650,000 AOL users. As far as I can tell, Chowdhury and the researchers at AOL had good intentions, and they thought that they had anonymized the data. But they were wrong. It was quickly discovered that the data were not as anonymous as the researchers thought, and reporters from the New York Times were able to identify someone in the dataset with ease (Barbaro and Zeller 2006). Once these problems were discovered, Chowdhury removed the data from AOL’s website, but it was too late. The data had been reposted on other websites, and they will probably still be available when you are reading this book. Chowdhury was fired, and AOL’s chief technology officer resigned (Hafner 2006). As this example shows, the benefits to the specific individuals inside a company who facilitate data access are pretty small, and the worst-case scenario is terrible.
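To see why “anonymized” search logs re-identify so easily, consider a minimal sketch in Python. The log entries below are invented, modeled loosely on the kinds of queries that Barbaro and Zeller (2006) reported; the point is that replacing a name with a numeric ID does nothing to hide what the queries themselves reveal once they are grouped by that ID.

```python
from collections import defaultdict

# Hypothetical "de-identified" search log: names replaced by numeric IDs,
# stored as (pseudonymous_id, query) pairs. The entries are invented,
# echoing the style of those reported in press coverage of the AOL case.
log = [
    (4417749, "landscapers in lilburn ga"),
    (4417749, "homes sold in shadow lake subdivision gwinnett county"),
    (4417749, "numb fingers"),
    (999001, "weather today"),
]

# Group queries by pseudonymous ID, exactly as a curious reader could.
profiles = defaultdict(list)
for user_id, query in log:
    profiles[user_id].append(query)

# Each profile narrows its user down: a town, a subdivision, personal
# details. In combination, a handful of queries can point to one person.
for user_id, queries in profiles.items():
    print(user_id, queries)
```

The lesson is that the queries act as quasi-identifiers: stripping explicit identifiers does not anonymize data whose content is itself identifying.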
Researchers can, however, sometimes gain access to data that are inaccessible to the general public. Some governments have procedures that researchers can follow to apply for access, and, as the examples later in this chapter show, researchers can occasionally gain access to corporate data. For example, Einav et al. (2015) partnered with a researcher at eBay to study online auctions. I’ll talk more about the research that came from this collaboration later in the chapter, but I mention it now because it had all four of the ingredients that I see in successful partnerships: researcher interest, researcher capability, company interest, and company capability. I’ve seen many potential collaborations fail because either the researcher or the partner, be it a company or a government, lacked one of these ingredients.
Even if you are able to develop a partnership with a business or gain access to restricted government data, however, there are some downsides for you. First, you will probably not be able to share your data with other researchers, which means that other researchers will not be able to verify and extend your results. Second, the questions that you can ask may be limited; companies are unlikely to allow research that could make them look bad. Finally, these partnerships can create at least the appearance of a conflict of interest, where people might think that your results were influenced by your partnership. All of these downsides can be addressed, but it is important to be clear that working with data that are not accessible to everyone has both upsides and downsides.
In summary, many sources of big data are inaccessible to researchers. There are serious legal, business, and ethical barriers that prevent data access, and these barriers will not go away as technology improves because they are not technical barriers. Some national governments have established procedures for enabling access to some datasets, but the process is especially ad hoc at the state and local levels. Also, in some cases researchers can partner with companies to obtain data access, but this can create a variety of problems for both researchers and companies.