Some of the information that companies and governments have is sensitive.
Health insurance companies have detailed information about the medical care received by their customers. This information could be used for important research about health, but if it became public it could lead to emotional harm (e.g., embarrassment) and economic harm (e.g., loss of employment). Health insurance records are far from distinctive in this respect: many big data sources contain sensitive information. The sensitive nature of this information is part of the reason that big data sources are often inaccessible (described above).
One way that researchers attempt to deal with this situation is to de-identify datasets that have sensitive information. But, as I will show in detail in Chapter 6 (Ethics), this approach is seriously limited in ways that are not widely appreciated by either social scientists or data scientists.
In conclusion, the big data sources of today (and tomorrow) generally have ten characteristics. Many of the good properties (big, always-on, and nonreactive) come from the fact that in the digital age companies and governments are able to collect data at a scale that was not possible previously. And many of the bad properties (incomplete, inaccessible, non-representative, drifting, algorithmically confounded, dirty, and sensitive) come from the fact that the data is not collected by researchers for researchers. Understanding these characteristics is a necessary first step to learning from big data. And now we turn to research strategies we can use with this data.