Large datasets are a means to an end; they are not an end in themselves.
The first of the three good characteristics of big data is the most discussed: these data are big. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.
The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
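To make the contrast between 11 thousand and 11 million posts concrete, here is a minimal sketch of how the precision of a category-level estimate changes with dataset size. It is not King and colleagues' analysis: it assumes, purely for illustration, that posts are spread evenly across the 85 categories, that each category is a simple random sample, and that the censorship probability is a hypothetical 10%.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


CATEGORIES = 85       # number of categories mentioned in the text
ASSUMED_RATE = 0.10   # hypothetical censorship probability, illustration only

for total_posts in (11_000, 11_000_000):
    # Assumption: categories are equally sized, which real data would not be.
    posts_per_category = total_posts // CATEGORIES
    se = proportion_standard_error(ASSUMED_RATE, posts_per_category)
    print(
        f"{total_posts:>10,} posts -> {posts_per_category:>7,} per category, "
        f"standard error ~ {se:.4f}"
    )
```

With a thousand times more posts per category, the standard error shrinks by a factor of about the square root of a thousand, which is what makes comparisons between categories feasible.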
Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
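The arithmetic behind this point can be sketched in a few lines. The rate of one in 3,000 comes from the text; the function name and the smaller comparison dataset are hypothetical, and the calculation treats the rate as exact, which it is not.

```python
RATE = 1 / 3000  # approximate share of tweets that start a large cascade (from the text)


def expected_large_cascades(n_tweets: int, rate: float = RATE) -> float:
    """Expected number of large cascades among n_tweets at the given rate."""
    return n_tweets * rate


for n in (1_000_000, 1_000_000_000):
    print(f"{n:>13,} tweets -> about {expected_large_cascades(n):,.0f} large cascades")
```

A dataset of a million tweets would yield only a few hundred large cascades, whereas a billion tweets yields hundreds of thousands.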
Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between click-through rates of 1% and 1.1% can translate into millions of rand in additional revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
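One way to see why detecting a difference between 1% and 1.1% requires a large dataset is a standard sample-size calculation for comparing two proportions. The sketch below uses the usual normal-approximation formula; the significance level (5%, two-sided) and power (80%) are conventional defaults chosen here as assumptions, not values from the text, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_group(
    p1: float, p2: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate observations needed per group for a two-sided
    two-proportion z-test to distinguish p1 from p2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)              # quantile for the desired power
    p_bar = (p1 + p2) / 2.0                            # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_power * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


n = required_n_per_group(0.01, 0.011)
print(f"About {n:,} observations per group")
```

Under these assumptions the answer comes out well above one hundred thousand observations per group, which is out of reach for most small studies but routine for large online platforms.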
Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don't fundamentally change the problems with making causal inferences from observational data, matching and natural experiments (two techniques that researchers have developed for making causal claims from observational data) both benefit greatly from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.
Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors, described in more detail below, that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
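A small simulation can illustrate what "precisely inaccurate" means. This is only a sketch under invented assumptions: the true value, the size of the systematic bias, and the amount of random noise are all hypothetical numbers chosen for illustration, and the data-collection process is reduced to adding a constant bias to every observation.

```python
import random
import statistics

TRUE_MEAN = 50.0   # hypothetical quantity the researcher wants to estimate
BIAS = 2.0         # hypothetical systematic error from how the data are collected
NOISE_SD = 10.0    # hypothetical random error in each observation


def biased_sample_mean(n: int, rng: random.Random) -> float:
    """Mean of n observations that each carry the same systematic bias
    plus independent random noise."""
    return statistics.fmean(
        TRUE_MEAN + BIAS + rng.gauss(0.0, NOISE_SD) for _ in range(n)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 10_000, 200_000):
    estimate = biased_sample_mean(n, rng)
    print(f"n = {n:>7,}: estimate = {estimate:.2f}, error = {estimate - TRUE_MEAN:+.2f}")
```

As the sample grows, the estimate settles ever more tightly around the biased value rather than the true one: the random error averages away, and the systematic error is what remains.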