Large datasets are a means to an end; they are not an end in themselves.
The first of the three good characteristics of big data is the most discussed: these data are big. Data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.
The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., films, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
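To make the arithmetic behind this point concrete, here is a minimal sketch of how the precision of a per-category estimate depends on the total number of posts. It is not the method King and colleagues used. It assumes, purely for illustration, that posts are spread evenly over the 85 categories and that the deletion rate is 10%; both the even split and the 10% rate are hypothetical, as is the function name.

```python
import math

def category_std_error(total_posts, n_categories=85, rate=0.10):
    """Approximate standard error of a deletion-rate estimate within one category.

    Assumptions (hypothetical, for illustration only): posts are split evenly
    across categories and the deletion rate in the category is `rate`.
    """
    posts_per_category = total_posts / n_categories
    # Standard error of a sample proportion: sqrt(p * (1 - p) / n)
    return math.sqrt(rate * (1 - rate) / posts_per_category)

for total in (11_000, 11_000_000):
    se = category_std_error(total)
    print(f"{total:>10,} posts: std. error per category = {se:.4f}")
```

Under these assumptions, the per-category standard error with 11 thousand posts is a few percentage points, which is large relative to plausible differences between categories; with 11 million posts it falls below a tenth of a percentage point.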
Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
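The scale involved follows from simple arithmetic, sketched below. The only number taken from the text is the rate of about one in 3,000; the dataset sizes and the function name are hypothetical, and the sketch assumes the rate is constant across the dataset.

```python
def expected_rare_events(n_items, rate):
    """Expected count of rare events among n_items, assuming a constant rate."""
    return n_items * rate

LARGE_CASCADE_RATE = 1 / 3000  # "about one in 3,000", as stated in the text

for n_items in (10_000, 1_000_000, 1_000_000_000):
    expected = expected_rare_events(n_items, LARGE_CASCADE_RATE)
    print(f"{n_items:>13,} items: about {round(expected):,} large cascades expected")
```

At ten thousand items only a handful of large cascades would be expected, which is too few to compare different kinds of cascade; at a billion items the expected count is in the hundreds of thousands.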
Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between click-through rates of 1% and 1.1% on an ad can translate into millions of dollars in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
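A rough sense of how much data "reliably detecting" the difference between 1% and 1.1% requires can be had from a textbook sample-size approximation for comparing two proportions. The sketch below is one such approximation, not a calculation from the source: the two-sided 5% significance level, the 80% power, and the function name are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect p1 vs p2.

    Uses the normal approximation for a two-sided test of two proportions.
    alpha and power are illustrative assumptions, not values from the text.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two Bernoulli variances
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_group(0.010, 0.011)
print(f"About {n:,} ad impressions per group")
```

Under these assumptions the answer is well over one hundred thousand impressions per group, which is why differences of this size are out of reach for small studies.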
Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets do not fundamentally change the problems with making causal inferences from observational data, matching and natural experiments, two techniques that researchers have developed for making causal claims from observational data, both benefit greatly from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.
Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors, described in more detail below, that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
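The difference between the two kinds of error can be seen in a small simulation. This is a minimal sketch with made-up numbers: the true value (50), the systematic bias (2), the noise level, and the sample sizes are all hypothetical, and the data-generating process is an assumption chosen only to illustrate the idea.

```python
import random
import statistics

def simulate_estimate(n, true_value=50.0, bias=2.0, noise_sd=10.0, seed=0):
    """Return (sample mean, standard error) for n biased, noisy measurements.

    Each measurement is true_value + bias + Gaussian noise. All parameter
    values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    sample = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / n ** 0.5
    return estimate, std_error

for n in (100, 200_000):
    estimate, std_error = simulate_estimate(n)
    print(f"n={n:>7,}: estimate={estimate:.2f}, "
          f"std. error={std_error:.3f}, error vs. truth={estimate - 50.0:+.2f}")
```

As the sample grows, the standard error shrinks toward zero, but the estimate settles near 52 rather than 50: the random error averages away and the systematic error remains, so the larger sample yields a more precise estimate of the wrong quantity.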