Large datasets are a means to an end; they are not an end in themselves.
The most widely discussed feature of big data sources is that they are BIG. Many papers, for example, start by discussing, and sometimes bragging about, how much data they analyzed. For example, a paper published in Science studying word-use trends in the Google Books corpus included the following (Michel et al. 2011):
"[Our] corpus contains over 500 billion words, in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion), and Hebrew (2 billion). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 98 million words per year; by 1900, 1.8 billion; and by 2000, 11 billion. The corpus cannot be read by a human. If you tried to read only English-language entries from the year 2000 alone, at the reasonable pace of 200 words/min, without interruptions for food or sleep, it would take 80 years. The sequence of letters is 1000 times longer than the human genome: If you wrote it out in a straight line, it would reach to the Moon and back 10 times over."
The scale of these data is undoubtedly impressive, and we are all fortunate that the Google Books team has released them to the public (in fact, some of the activities at the end of this chapter make use of these data). But whenever you see something like this, you should ask: is all that data really doing anything? Could they have done the same research if the data could reach to the Moon and back only once? What if the data could reach only to the top of Mount Everest or the top of the Eiffel Tower?
In this case, their research does, in fact, have some findings that require a huge corpus of words over a long time period. For example, one thing they explore is the evolution of grammar, particularly changes in the rate of irregular verb conjugation. Since some irregular verbs are quite rare, a large amount of data is needed to detect changes over time. Too often, however, researchers seem to treat the size of a big data source as an end in itself ("look how much data I can crunch") rather than as a means to some more important scientific objective.
In my experience, the study of rare events is one of three specific scientific ends that large datasets tend to enable. The second is the study of heterogeneity, as illustrated by a study by Raj Chetty and colleagues (2014) on social mobility in the United States. In the past, many researchers have studied social mobility by comparing the life outcomes of parents and children. A consistent finding from this literature is that advantaged parents tend to have advantaged children, but the strength of this relationship varies over time and across countries (Hout and DiPrete 2006). More recently, however, Chetty and colleagues were able to use the tax records of 40 million people to estimate the heterogeneity in intergenerational mobility across regions in the United States (figure 2.1). They found, for example, that the probability that a child reaches the top quintile of the income distribution starting from a family in the bottom quintile is about 13% in San Jose, California, but only about 4% in Charlotte, North Carolina. If you look at figure 2.1 for a moment, you might begin to wonder why intergenerational mobility is higher in some places than in others. Chetty and colleagues had exactly the same question, and they found that high-mobility areas have less residential segregation, less income inequality, better schools, greater social capital, and greater family stability. Of course, these correlations alone do not show that these factors cause higher mobility, but they do suggest possible mechanisms that can be explored in further work, which is exactly what Chetty and colleagues have done in subsequent work. Notice how the size of the data was really important in this project. If Chetty and colleagues had used the tax records of 40 thousand people rather than 40 million, they would not have been able to estimate regional heterogeneity, and they never would have been able to do the subsequent research to try to identify the mechanisms that create this variation.
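To see why 40 thousand records would not have been enough, here is a rough back-of-the-envelope sketch. The numbers in it are assumptions chosen for illustration, not figures from the study: it supposes about 700 equally sized regions, that a fifth of the people in each region start in the bottom income quintile, and that the true probability of reaching the top quintile is 8%. It then applies the usual standard-error formula for a proportion.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)

def per_region_standard_error(total_people, n_regions=700, bottom_share=0.2, p=0.08):
    # Hypothetical simplification: people are spread evenly over the regions,
    # and a fifth of them start in the bottom income quintile.
    n = total_people / n_regions * bottom_share
    return n, standard_error(p, n)

for total in (40_000, 40_000_000):
    n, se = per_region_standard_error(total)
    print(f"{total:>10,} people -> about {n:,.0f} per region, standard error {se:.3f}")
```

Under these assumptions, the smaller sample leaves roughly a dozen relevant people per region and a standard error of about 8 percentage points, which is nearly as large as the gap between 13% and 4%. The larger sample brings the standard error down to about a quarter of a percentage point.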
Finally, in addition to studying rare events and studying heterogeneity, large datasets also enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions in additional revenue. In some scientific settings, however, such small differences might not be particularly important, even if they are statistically significant (Prentice and Miller 1992). But in some policy settings, they can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then picking the more effective intervention could end up saving thousands of additional lives.
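As a rough illustration of how much data "reliably detecting" such a difference takes, the sketch below uses the textbook normal-approximation formula for the sample size needed to compare two proportions. The 5% significance level and 80% power are conventional choices assumed here, not values taken from the text, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect p1 versus p2
    with a two-sided test at level alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

print(sample_size_per_group(0.01, 0.011))  # 1% versus 1.1% click-through
print(sample_size_per_group(0.01, 0.02))   # 1% versus 2%, for comparison
```

Under these assumptions, telling 1% from 1.1% takes on the order of 160,000 ad impressions per variant, whereas telling 1% from 2% takes only a couple of thousand. The required sample size grows with the inverse square of the difference, which is why small differences call for large datasets.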
Although bigness is generally a good property when used correctly, I've noticed that it can sometimes lead to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors that I'll describe below and that arise from biases in how data are created. For example, in a project I'll describe later in this chapter, researchers used messages generated on September 11, 2001 to produce a high-resolution emotional timeline of the reaction to the terrorist attack (Back, Küfner, and Egloff 2010). Because the researchers had a large number of messages, they didn't really need to worry about whether the patterns they observed (rising anger over the course of the day) could be explained by random variation. There was so much data, and the pattern was so clear, that all the statistical tests suggested that this was a real pattern. But these statistical tests were ignorant of how the data were created. In fact, it turned out that many of the patterns were attributable to a single bot that generated more and more meaningless messages throughout the day. Removing this one bot completely destroyed some of the key findings in the paper (Pury 2011; Back, Küfner, and Egloff 2011). Quite simply, researchers who don't think about systematic error face the risk of using their large datasets to get a precise estimate of an unimportant quantity, such as the emotional content of meaningless messages produced by an automated bot.
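The distinction between random and systematic error can be made concrete with a small simulation. Everything in it is invented for illustration and is not taken from the study: human messages are given an "anger" score with a true mean of 0.1, and a hypothetical bot that always scores 1.0 produces 20% of the messages.

```python
import random
import statistics

def simulate_mean_anger(n_messages, bot_share, seed=0):
    """Mean 'anger' score and its standard error over n_messages, a bot_share
    fraction of which come from a bot that always scores 1.0 (hypothetical values)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_messages):
        if rng.random() < bot_share:
            scores.append(1.0)                   # bot message: always scored as angry
        else:
            scores.append(rng.gauss(0.1, 0.2))   # human message: true mean 0.1
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / n_messages ** 0.5
    return mean, stderr

for n in (100, 10_000, 100_000):
    mean, stderr = simulate_mean_anger(n, bot_share=0.2)
    print(f"n={n:>7,}  estimate={mean:.3f}  standard error={stderr:.4f}")
```

As the number of messages grows, the standard error shrinks toward zero, but the estimate settles near 0.28 rather than the true human mean of 0.1. More data makes the answer more precise without making it any less wrong; only setting `bot_share=0`, the analogue of removing the bot, fixes the bias.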
In conclusion, big datasets are not an end in themselves, but they can enable certain kinds of research, including the study of rare events, the estimation of heterogeneity, and the detection of small differences. Big datasets also seem to lead some researchers to ignore how their data were created, which can lead them to get a precise estimate of an unimportant quantity.