Nonrepresentative data are bad for out-of-sample generalizations, but they can be quite useful for within-sample comparisons.
Some social scientists are accustomed to working with data that come from a probabilistic random sample of a well-defined population, such as all adults in a particular country. This kind of data is called representative data because the sample "represents" the larger population. Many researchers prize representative data, and to some, representative data are synonymous with rigorous science whereas nonrepresentative data are synonymous with sloppiness. At the most extreme, some skeptics seem to believe that nothing can be learned from nonrepresentative data. If true, this would severely limit what can be learned from big data sources, because many of them are nonrepresentative. Fortunately, these skeptics are only partially right. There are certain research goals for which nonrepresentative data are clearly not well suited, but there are others for which they can be quite useful.
To understand this distinction, let's consider a scientific classic: John Snow's study of the 1853-54 cholera outbreak in London. At the time, many doctors believed that cholera was caused by "bad air," but Snow believed that it was an infectious disease, perhaps spread by sewage-laced drinking water. To test this idea, Snow took advantage of what we might now call a natural experiment. He compared the cholera rates of households served by two different water companies: Lambeth and Southwark & Vauxhall. These companies served similar households, but they differed in one important way: in 1849, a few years before the epidemic began, Lambeth moved its intake point upstream from the main sewage discharge in London, whereas Southwark & Vauxhall left its intake pipe downstream from the sewage discharge. When Snow compared the death rates from cholera in households served by the two companies, he found that customers of Southwark & Vauxhall, the company that was supplying sewage-tainted water, were 10 times more likely to die from cholera. This result provides strong scientific evidence for Snow's argument about the cause of cholera, even though it is not based on a representative sample of people in London.
The data from these two companies, however, would not be ideal for answering a different question: what was the prevalence of cholera in London during the outbreak? For that second question, which is also important, it would be much better to have a representative sample of people from London.
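The contrast between the two questions can be made concrete with a small calculation. The sketch below is illustrative only: every number in it is invented rather than taken from Snow's records, and the names `Group`, `rate_ratio`, and `pooled_rate` are hypothetical helpers. It assumes a city in which a minority of households receive tainted water, and a sample that over-represents those households.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Group:
    households: int  # households observed in this group
    deaths: int      # cholera deaths among those households


def death_rate(group: Group) -> float:
    """Deaths per household within one group."""
    return group.deaths / group.households


def rate_ratio(exposed: Group, unexposed: Group) -> float:
    """Within-sample comparison: how many times higher the exposed rate is."""
    return death_rate(exposed) / death_rate(unexposed)


def pooled_rate(groups: Iterable[Group]) -> float:
    """Overall death rate when all groups are lumped together."""
    groups = list(groups)
    return sum(g.deaths for g in groups) / sum(g.households for g in groups)


# Hypothetical city: 10% of households receive tainted water (invented numbers).
city_exposed = Group(households=10_000, deaths=300)
city_unexposed = Group(households=90_000, deaths=270)

# Hypothetical nonrepresentative sample: exposed households are heavily
# over-represented (half of the sample instead of a tenth).
sample_exposed = Group(households=5_000, deaths=150)
sample_unexposed = Group(households=5_000, deaths=15)

sample_ratio = rate_ratio(sample_exposed, sample_unexposed)
city_ratio = rate_ratio(city_exposed, city_unexposed)
sample_prevalence = pooled_rate([sample_exposed, sample_unexposed])
city_prevalence = pooled_rate([city_exposed, city_unexposed])

print(f"rate ratio:  sample {sample_ratio:.1f}, city {city_ratio:.1f}")
print(f"death rate:  sample {sample_prevalence:.4f}, city {city_prevalence:.4f}")
```

Under these invented numbers, the within-sample rate ratio matches the ratio in the whole city, while the pooled death rate in the sample is nearly three times the city-wide rate. The comparison survives the skewed sampling only because the rates within each group are assumed to be the same in the sample and in the city; that assumption is part of the sketch, not something the data guarantee.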
As Snow's work illustrates, there are some scientific questions for which nonrepresentative data can be quite effective, and there are others for which they are not well suited. One crude way to distinguish these two kinds of questions is that some are about within-sample comparisons and some are about out-of-sample generalizations. This distinction can be further illustrated by another classic study in epidemiology: the British Doctors Study, which played an important role in demonstrating that smoking causes cancer. In this study, Richard Doll and A. Bradford Hill followed 25,000 male doctors for several years and compared their death rates based on the amount that they smoked when the study began. Doll and Hill (1954) found a strong exposure-response relationship: the more heavily people smoked, the more likely they were to die from lung cancer. Of course, it would be unwise to estimate the prevalence of lung cancer among all British people based on this group of male doctors, but the within-sample comparison still provides evidence that smoking causes lung cancer.
Now that I've illustrated the difference between within-sample comparisons and out-of-sample generalizations, two caveats are in order. First, there are naturally questions about the extent to which a relationship that holds within a sample of male British doctors will also hold within a sample of female British doctors, or male British factory workers, or female German factory workers, or many other groups. These questions are interesting and important, but they are different from questions about the extent to which we can generalize from a sample to a population. Notice, for example, that you probably suspect that the relationship between smoking and cancer found in male British doctors will be similar in these other groups. Your ability to make this extrapolation does not come from the fact that male British doctors are a probabilistic random sample of any population; rather, it comes from an understanding of the mechanism that links smoking and cancer. Thus, generalizing from a sample to the population from which it was drawn is largely a statistical issue, but questions about the transportability of a pattern found in one group to another group are largely a nonstatistical issue (Pearl and Bareinboim 2014; Pearl 2015).
At this point, a skeptic might point out that most social patterns are probably less transportable across groups than the relationship between smoking and cancer. And I agree. The extent to which we should expect patterns to be transportable is ultimately a scientific question that has to be decided based on theory and evidence. It should not automatically be assumed that patterns will be transportable, but nor should it be assumed that they won't be. These somewhat abstract questions about transportability will be familiar to you if you have followed the debates about how much researchers can learn about human behavior by studying undergraduate students (Sears 1986, [@henrich_most_2010]). Despite these debates, however, it would be unreasonable to say that researchers can't learn anything from studying undergraduate students.
The second caveat is that most researchers with nonrepresentative data are not as careful as Snow or Doll and Hill. So, to illustrate what can go wrong when researchers try to make an out-of-sample generalization from nonrepresentative data, I'd like to tell you about a study of the 2009 German parliamentary election by Andranik Tumasjan and colleagues (2010). By analyzing more than 100,000 tweets, they found that the proportion of tweets mentioning a political party matched the proportion of votes that party received in the parliamentary election (figure 2.3). In other words, it appeared that Twitter data, which were essentially free, could replace traditional public opinion surveys, which are expensive because of their emphasis on representative data.
Given what you probably already know about Twitter, you should immediately be skeptical of this result. Germans on Twitter in 2009 were not a probabilistic random sample of German voters, and supporters of some parties might tweet about politics much more often than supporters of other parties. Thus, it seems surprising that all of the possible biases that you could imagine would somehow cancel out so that these data would directly reflect German voters. In fact, the results in Tumasjan et al. (2010) turned out to be too good to be true. A follow-up paper by Andreas Jungherr, Pascal Jürgens, and Harald Schoen (2012) pointed out that the original analysis had excluded the political party that had received the most mentions on Twitter: the Pirate Party, a small party that fights government regulation of the Internet. When the Pirate Party was included in the analysis, Twitter mentions became a terrible predictor of election results (figure 2.3). As this example illustrates, using nonrepresentative big data sources to make out-of-sample generalizations can go very wrong. Also, you should notice that the fact that there were 100,000 tweets was basically irrelevant: lots of nonrepresentative data are still nonrepresentative, a theme that I'll return to in chapter 3 when I discuss surveys.
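To see how much a single excluded party can matter, here is a minimal sketch with entirely made-up counts. The parties "A" through "E", their mention counts, and their vote percentages are hypothetical and do not reproduce the data in either paper. Using mean absolute error between mention shares and vote shares as the measure of fit is also an assumption made for illustration.

```python
from typing import Dict, Iterable


def to_shares(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert raw counts into shares that sum to 1."""
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()}


def mean_abs_error(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Average absolute gap between predicted and actual shares."""
    return sum(abs(predicted[p] - actual[p]) for p in actual) / len(actual)


def mention_error(mentions: Dict[str, float], votes: Dict[str, float],
                  excluded: Iterable[str] = ()) -> float:
    """Error of mention shares as a predictor of vote shares,
    after dropping the excluded parties from both sides."""
    dropped = set(excluded)
    kept_mentions = {p: n for p, n in mentions.items() if p not in dropped}
    kept_votes = {p: n for p, n in votes.items() if p not in dropped}
    return mean_abs_error(to_shares(kept_mentions), to_shares(kept_votes))


# Hypothetical data: party "E" is small at the ballot box but dominates mentions.
mentions = {"A": 300, "B": 250, "C": 150, "D": 100, "E": 1200}
votes = {"A": 35, "B": 30, "C": 20, "D": 13, "E": 2}  # percent of votes

error_without_e = mention_error(mentions, votes, excluded=["E"])
error_with_e = mention_error(mentions, votes)

print(f"mean absolute error, E excluded: {error_without_e:.3f}")
print(f"mean absolute error, E included: {error_with_e:.3f}")
```

With these invented counts, mention shares track vote shares closely as long as party "E" is left out, and the fit collapses once it is included. Adding more tweets with the same skew would not change either number, because the shares depend only on the proportions.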
To conclude, many big data sources are not representative samples from some well-defined population. For questions that require generalizing results from the sample to the population from which it was drawn, this is a serious problem. But for questions about within-sample comparisons, nonrepresentative data can be powerful, so long as researchers are clear about the characteristics of their sample and support claims about transportability with theoretical or empirical evidence. In fact, my hope is that big data sources will enable researchers to make more within-sample comparisons in many nonrepresentative groups, and my guess is that estimates from many different groups will do more to advance social research than a single estimate from a probabilistic random sample.