2.3.1.1 Big

Large datasets are a means to an end; they are not an end in themselves.

The first of the three good characteristics of big data is the most discussed: these data are big. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.

The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
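To see why the number of posts matters for category-specific estimates, here is a minimal sketch of the standard error of an estimated proportion within one category. It assumes, purely for illustration, that posts are spread evenly across the 85 categories and that the true censorship probability in a category is 0.13; both numbers are hypothetical and are not taken from the study, and the helper names are my own.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


def per_category_se(total_posts, n_categories=85, p=0.13):
    """Standard error for one category.

    Assumes posts are split evenly across categories (a simplification)
    and that p is the true proportion in that category (hypothetical).
    """
    n_per_category = total_posts / n_categories
    return se_proportion(p, n_per_category)


se_small = per_category_se(11_000)       # 11 thousand posts in total
se_large = per_category_se(11_000_000)   # 11 million posts in total

print(f"11 thousand posts: SE per category = {se_small:.4f}")
print(f"11 million posts:  SE per category = {se_large:.4f}")
print(f"ratio = {se_small / se_large:.1f}")
```

Under these assumptions, having 1,000 times more posts shrinks the per-category standard error by a factor of the square root of 1,000, which is the difference between estimates that are too noisy to compare across categories and estimates that can be compared.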

Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
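The arithmetic behind this point can be sketched in a few lines. The one-in-3,000 rate comes from the text above; the dataset sizes in the loop and the function name are illustrative assumptions.

```python
def expected_rare_events(n_observations, rate):
    """Expected number of rare events among n_observations."""
    return n_observations * rate


RATE = 1 / 3000  # "about one in 3,000", as stated in the text

# Illustrative dataset sizes (assumptions, not from the study)
for n in (10_000, 1_000_000, 1_000_000_000):
    expected = expected_rare_events(n, RATE)
    print(f"{n:>13,} observations -> about {expected:,.0f} rare events")
```

A dataset of ten thousand observations would be expected to contain only a handful of such events, which is too few to compare different kinds of cascades with one another.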

Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between click-through rates of 1% and 1.1% on an ad can translate into millions of rand in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
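To give a rough sense of how much data "reliably detecting" such a difference requires, the sketch below uses the usual normal-approximation sample-size formula for comparing two proportions. The significance level (0.05, two-sided) and the power (0.80) are conventional choices that I am assuming; they do not come from the text, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided test of p1 vs p2.

    Uses the standard normal-approximation formula; alpha and power are
    assumed conventional values, not values from the text.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


n_small_diff = n_per_group(0.010, 0.011)  # 1% vs 1.1%, as in the text
n_large_diff = n_per_group(0.010, 0.020)  # 1% vs 2%, for comparison

print(f"1% vs 1.1%: about {n_small_diff:,} observations per group")
print(f"1% vs 2%:   about {n_large_diff:,} observations per group")
```

Under these assumptions, telling 1% from 1.1% takes well over a hundred thousand observations in each group, while telling 1% from 2% takes only a few thousand.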

Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don't fundamentally change the problems with making causal inferences from observational data, matching and natural experiments, two techniques that researchers have developed for making causal claims from observational data, both benefit greatly from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.

Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors that I'll describe in more detail below and that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
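The contrast between random and systematic error can be made concrete with a small calculation. The sketch below assumes a simple setting that is not in the text: we estimate a mean from n independent observations with standard deviation 1, and the data-generating process shifts every observation by a fixed bias of 0.05. All of these numbers and names are hypothetical.

```python
import math


def error_components(n, bias, sigma=1.0):
    """Return (random_error, total_error) for a sample mean.

    random_error is the standard error sigma / sqrt(n), which shrinks
    as n grows. total_error is the root mean squared error, which
    combines random error with the fixed systematic bias.
    """
    random_error = sigma / math.sqrt(n)
    total_error = math.sqrt(bias ** 2 + random_error ** 2)
    return random_error, total_error


BIAS = 0.05  # hypothetical systematic error from how the data were collected

for n in (100, 10_000, 1_000_000):
    random_error, total_error = error_components(n, BIAS)
    print(
        f"n = {n:>9,}: random error = {random_error:.4f}, "
        f"total error = {total_error:.4f}"
    )
```

In this setting, the random error keeps falling as n grows, but the total error levels off at the size of the bias. Beyond that point, more data makes the estimate more precise without making it more accurate.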