Large datasets are a means to an end; they are not an end in themselves.
The most widely discussed feature of big data sources is that they are BIG. Many papers, for example, start by discussing, and sometimes bragging about, how much data they analyzed. For example, a paper published in Science studying word-use trends in the Google Books corpus included the following (Michel et al. 2011):
"[Our] corpus contains over 500 billion words, in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion), and Hebrew (2 billion). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 98 million words per year; by 1900, 1.8 billion; and by 2000, 11 billion. The corpus cannot be read by a human. If you tried to read only English-language entries from the year 2000 alone, at the reasonable pace of 200 words/min, without interruptions for food or sleep, it would take 80 years. The sequence of letters is 1000 times longer than the human genome: If you wrote it out in a straight line, it would reach to the Moon and back 10 times over."
The scale of this data is undoubtedly impressive, and we are all fortunate that the Google Books team has released these data to the public (in fact, some of the activities at the end of this chapter make use of this data). But whenever you see something like this, you should ask: is all that data really doing anything? Could they have done the same research if the data could reach to the Moon and back only once? What if the data could only reach to the top of Mount Everest or the top of the Eiffel Tower?
In this case, their research does, in fact, have some findings that require a huge corpus of words over a long time period. For example, one thing they explore is the evolution of grammar, particularly changes in the rate of irregular verb conjugation. Since some irregular verbs are quite rare, a large amount of data is needed to detect changes over time. Too often, however, researchers seem to treat the size of a big data source as an end in itself ("look how much data I can crunch") rather than as a means to some more important scientific objective.
In my experience, the study of rare events is one of the three specific scientific ends that large datasets tend to enable. The second is the study of heterogeneity, as can be illustrated by a study by Raj Chetty and colleagues (2014) on social mobility in the United States. In the past, many researchers have studied social mobility by comparing the life outcomes of parents and children. A consistent finding from this literature is that advantaged parents tend to have advantaged children, but the strength of this relationship varies over time and across countries (Hout and DiPrete 2006). More recently, however, Chetty and colleagues were able to use the tax records of 40 million people to estimate the heterogeneity in intergenerational mobility across regions in the United States (figure 2.1). They found, for example, that the probability that a child reaches the top quintile of the national income distribution starting from a family in the bottom quintile is 13% in San Jose, California, but only 4% in Charlotte, North Carolina. If you look at figure 2.1 for a moment, you might begin to wonder why intergenerational mobility is higher in some places than in others. Chetty and colleagues had exactly the same question, and they found that high-mobility areas have less residential segregation, less income inequality, better primary schools, greater social capital, and greater family stability. Of course, these correlations alone do not show that these factors cause higher mobility, but they do suggest possible mechanisms that can be explored in further work, which is exactly what Chetty and colleagues have done in subsequent work. Notice how the size of the data was really important in this project.
If Chetty and colleagues had used the tax records of 40 thousand people rather than 40 million, they would not have been able to estimate regional heterogeneity, and they never would have been able to do subsequent research to try to identify the mechanisms that create this variation.
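To make the 40 thousand versus 40 million comparison concrete, here is a rough back-of-the-envelope sketch of how precisely a regional mobility rate could be estimated at each sample size. It is not the method used by Chetty and colleagues. The number of regions (700), the typical mobility rate (8%), the equal-sized regions, and the simple binomial model are all assumptions introduced purely for illustration.

```python
import math

def se_per_region(total_people, n_regions, p, bottom_share=0.2):
    """Binomial standard error of one region's mobility estimate.

    Hypothetical simplifications: regions are equally sized, and one fifth
    of each region's children start in the bottom income quintile.
    """
    n = total_people / n_regions * bottom_share  # children usable per region
    return math.sqrt(p * (1 - p) / n)

N_REGIONS = 700  # hypothetical round number, not taken from the study
P = 0.08         # hypothetical typical chance of reaching the top quintile

se_small = se_per_region(40_000, N_REGIONS, P)
se_large = se_per_region(40_000_000, N_REGIONS, P)

print(f"40 thousand people: standard error per region ~ {se_small:.3f}")
print(f"40 million people:  standard error per region ~ {se_large:.4f}")
```

Under these assumptions, the smaller sample leaves only about a dozen relevant children per region, so the uncertainty in each regional estimate is comparable to the gap between San Jose (13%) and Charlotte (4%). With the larger sample, the standard error shrinks by a factor of about 32 (the square root of 1,000), which is what makes a regional map meaningful.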
Finally, in addition to studying rare events and studying heterogeneity, large datasets also enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue. In some scientific settings, however, such small differences might not be particularly important, even if they are statistically significant (Prentice and Miller 1992). But in some policy settings, they can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then picking the more effective intervention could end up saving thousands of additional lives.
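A quick calculation shows why small differences call for large datasets. The sketch below uses the standard normal-approximation formula for comparing two proportions to estimate how many ad impressions per ad would be needed to tell a 1% click-through rate from a 1.1% one. The 5% significance level and 80% power are conventional defaults that I am assuming here; they do not come from any particular study.

```python
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

n_small_gap = n_per_group(0.010, 0.011)  # 1% vs 1.1%, as in the text
n_large_gap = n_per_group(0.010, 0.020)  # 1% vs 2%, for contrast

print(f"1% vs 1.1%: about {n_small_gap:,.0f} impressions per ad")
print(f"1% vs 2%:   about {n_large_gap:,.0f} impressions per ad")
```

Because the required sample size grows with the inverse square of the difference, shrinking the gap from one percentage point to a tenth of a point raises the data requirement from thousands of impressions per ad to well over a hundred thousand.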
Although bigness is generally a good property when used correctly, I've noticed that it can sometimes lead to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data was generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors that I'll describe below that arise from biases in how data are created. For example, in a project I'll describe later in this chapter, researchers used messages generated on September 11, 2001 to produce a high-resolution emotional timeline of the reaction to the terrorist attack (Back, Küfner, and Egloff 2010). Because the researchers had a large number of messages, they didn't really need to worry about whether the pattern they observed (increasing anger over the course of the day) could be explained by random variation. There was so much data and the pattern was so clear that all the statistical tests suggested that this was a real pattern. But these statistical tests were ignorant of how the data was created. In fact, it turned out that many of the patterns were attributable to a single bot that generated more and more meaningless messages throughout the day. Removing this one bot completely destroyed some of the key findings in the paper (Pury 2011; Back, Küfner, and Egloff 2011). Quite simply, researchers who don't think about systematic error face the risk of using their large datasets to get a precise estimate of an unimportant quantity, such as the emotional content of meaningless messages produced by an automated bot.
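The difference between random and systematic error can be seen in a small simulation. This is a toy sketch, not a reconstruction of the September 11 analysis: the 10% share of angry human messages, the 5% share of bot messages, and the rule that every bot message is scored as angry are all invented numbers chosen only to illustrate the point.

```python
import math
import random

def estimate_anger_share(n_messages, true_share=0.10, bot_fraction=0.05, seed=0):
    """Share of messages scored as angry when a bot contaminates the sample.

    All parameters are hypothetical: human messages are angry with
    probability `true_share`, and every bot message is scored as angry.
    """
    rng = random.Random(seed)
    angry = 0
    for _ in range(n_messages):
        if rng.random() < bot_fraction:
            angry += 1  # bot message: always scored as angry
        elif rng.random() < true_share:
            angry += 1  # human message that really is angry
    return angry / n_messages

for n in (100, 10_000, 1_000_000):
    est = estimate_anger_share(n)
    se = math.sqrt(est * (1 - est) / n)  # random error shrinks as n grows
    print(f"n={n:>9,}  estimate={est:.4f}  standard error={se:.4f}")
```

As the number of messages grows, the standard error falls toward zero, but the estimate settles near 0.145 (0.05 + 0.95 × 0.10) rather than the true human share of 0.10. More data makes the answer more precise without making it any less wrong.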
In conclusion, big datasets are not an end in themselves, but they can enable certain kinds of research, including the study of rare events, the estimation of heterogeneity, and the detection of small differences. Big datasets also seem to lead some researchers to ignore how their data was created, which can lead them to get a precise estimate of an unimportant quantity.