Researchers used email logs and administrative records to understand friendship formation. This research requires dealing with the incompleteness of big data.
In many situations, researchers are not lucky enough to have everything that they want automatically collected in one place. Two common problems are incomplete information about the people and a mismatch between theoretical constructs and data. Both of these problems were addressed by Kossinets and Watts (2009) as part of their efforts to understand how social networks evolve.
Roughly speaking, researchers think that social network evolution is driven by three features: 1) the structure of existing relationships, 2) shared activities (e.g., dorms, classes), and 3) demographics. Understanding the interrelationships between these three factors requires longitudinal network data combined with information about individuals’ demographics and activities. Earlier studies had some of these features, but none had all three.
Kossinets and Watts started their research by acquiring the email logs from a large university. However, these email logs alone were incomplete: they did not include everything needed to understand the various factors driving network evolution. Therefore, Kossinets and Watts merged the email logs with two other sources of information: demographic information collected by the university and information about shared activities (e.g., student residence information and a complete list of enrollment in courses). Once these three sources of information, each of which was incomplete on its own, were merged together, Kossinets and Watts had a powerful data structure for understanding network evolution.
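To make the merge concrete, here is a minimal sketch of combining three incomplete sources into one record per person. It is not the authors' actual pipeline: the record layouts, field names (`sender`, `recipient`, `day`, `status`), identifiers, and the helpers `build_profiles` and `shared_courses` are all hypothetical, chosen only for illustration.

```python
# Hypothetical records; every field name and value here is an illustrative assumption.
email_log = [
    {"sender": "u1", "recipient": "u2", "day": 3},
    {"sender": "u2", "recipient": "u3", "day": 9},
]
demographics = {"u1": {"status": "undergrad"}, "u2": {"status": "grad"}}
enrollment = {"u1": {"SOC101"}, "u2": {"SOC101", "STA200"}, "u3": {"STA200"}}


def build_profiles(email_log, demographics, enrollment):
    """Combine the three sources into one record per person."""
    # Anyone who appears in any source gets a profile.
    people = set(demographics) | set(enrollment)
    for msg in email_log:
        people.update((msg["sender"], msg["recipient"]))
    return {
        p: {
            # None marks a person with no demographic record.
            "demographics": demographics.get(p),
            # An empty set means no enrollment record was found.
            "courses": enrollment.get(p, set()),
        }
        for p in sorted(people)
    }


def shared_courses(profiles, a, b):
    """Return the courses that persons a and b have in common."""
    return profiles[a]["courses"] & profiles[b]["courses"]


profiles = build_profiles(email_log, demographics, enrollment)
```

In this toy data, `u3` appears in the email log and the enrollment list but has no demographic record, so the merged profile keeps that gap visible rather than silently dropping the person.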
But there was one final challenge that they had to overcome. Kossinets and Watts wanted to study how the social network in this university evolved, so they needed a way to turn the email logs into an estimate of who was connected to whom at which time. As discussed previously (Section 2.3.2.1), this kind of operationalization of theoretical constructs is a big challenge when using digital traces for social research. In the end, Kossinets and Watts decided that two people were considered connected at time \(t\) if and only if they had exchanged emails (\(i\) emailed \(j\) and \(j\) emailed \(i\)) in the previous 60 days. These choices were not arbitrary; they were based on careful consideration of this empirical setting, and Kossinets and Watts checked that their results were robust to these choices. In general, if your operationalization involves choosing some specific cutoffs (say, 60 days instead of 30 days or 90 days), it is a good idea to make sure that your results are not sensitive to this choice.
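The tie definition and the cutoff check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the toy log `EMAILS`, the function names `ties_at` and `window_sensitivity`, and the treatment of the window as the half-open interval from `t` minus the window length up to (but not including) `t` are all hypothetical choices.

```python
from datetime import datetime, timedelta

# Hypothetical email log of (sender, recipient, timestamp); illustrative data only.
EMAILS = [
    ("ana", "ben", datetime(2024, 1, 5)),
    ("ben", "ana", datetime(2024, 1, 20)),
    ("ana", "cy", datetime(2024, 2, 1)),
    ("cy", "dee", datetime(2024, 1, 2)),
    ("dee", "cy", datetime(2024, 2, 25)),
]


def ties_at(emails, t, window_days=60):
    """Return the unordered pairs with reciprocated email in the window before t."""
    start = t - timedelta(days=window_days)
    # Directed (sender, recipient) pairs seen in [start, t); the boundary
    # handling is an assumption, since the text only says "previous 60 days".
    directed = {(s, r) for s, r, ts in emails if start <= ts < t and s != r}
    # A tie exists only if messages went in both directions.
    return {frozenset((s, r)) for s, r in directed if (r, s) in directed}


def window_sensitivity(emails, t, windows=(30, 60, 90)):
    """Count ties at time t under each candidate window length."""
    return {w: len(ties_at(emails, t, w)) for w in windows}
```

Calling `window_sensitivity(EMAILS, datetime(2024, 3, 1))` on this toy log shows how the number of ties changes with the cutoff; in a real analysis one would rerun the substantive results, not just the tie counts, under each window.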
Once Kossinets and Watts addressed the problems caused by incompleteness (e.g., missing demographic information, missing information about shared activity, and missing theoretical constructs), they had data that enabled them to understand the three main forces that can drive network evolution: 1) the structure of existing relationships, 2) shared activities (e.g., dorms, classes), and 3) demographics. Consistent with earlier research, they found that people with similar demographics are more likely to form relationships. However, unlike earlier studies, they found that this pattern was strongly mitigated by the existing network structure and shared activities. In other words, the pattern that earlier researchers had seen was partially explained by data that earlier researchers did not have. Thus, by successfully dealing with the incompleteness of their data, Kossinets and Watts were able to clarify the interaction of a variety of different factors that drive social network evolution.