Large datasets are a means to an end; they are not an end in themselves.
The first of the three good characteristics of big data is the most discussed: these data are big. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.
The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But, because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
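To see why category-specific estimates need so much data, here is a minimal sketch of the standard error of an estimated proportion within one category. It assumes, purely for illustration, that posts are spread evenly across the 85 categories and that the censorship rate is 10%; neither figure comes from the study, and the function names are hypothetical.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent posts."""
    return math.sqrt(p * (1 - p) / n)


def per_category_precision(total_posts: int, n_categories: int, p: float):
    """Return (posts per category, standard error of the category estimate).

    Assumption: posts are split evenly across categories, which real data
    would not satisfy exactly.
    """
    n_per_category = total_posts // n_categories
    return n_per_category, proportion_standard_error(p, n_per_category)


ASSUMED_RATE = 0.10  # hypothetical censorship rate, for illustration only
N_CATEGORIES = 85    # number of categories mentioned in the text

for total in (11_000, 11_000_000):
    n, se = per_category_precision(total, N_CATEGORIES, ASSUMED_RATE)
    # An approximate 95% interval is the estimate plus or minus 1.96 * se.
    print(f"{total:>10,} posts: {n:>7,} per category, "
          f"95% margin ~ +/-{1.96 * se:.4f}")
```

Under these assumptions, 11 thousand posts leave only about 129 posts per category, so each category estimate is uncertain by roughly five percentage points in either direction, which is too coarse to compare categories reliably. With 11 million posts the margin shrinks by a factor of about 30.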
Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
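The arithmetic behind this requirement can be sketched in a few lines. The sketch treats the one-in-3,000 figure as a simple per-observation rate, which is a simplification of how the study actually counted cascades; the target of 100,000 events and the function names are hypothetical.

```python
def expected_rare_events(n_observations: int, one_in: int) -> float:
    """Expected count of events that occur about once per `one_in` observations."""
    return n_observations / one_in


def observations_needed(target_events: int, one_in: int) -> int:
    """Observations needed so that the expected count reaches `target_events`."""
    return target_events * one_in


ONE_IN = 3_000  # rarity quoted in the text

for n in (1_000_000, 1_000_000_000):
    count = expected_rare_events(n, ONE_IN)
    print(f"{n:>13,} observations -> about {count:,.0f} rare events expected")

TARGET = 100_000  # hypothetical number of large cascades wanted for analysis
print(f"{TARGET:,} rare events require about "
      f"{observations_needed(TARGET, ONE_IN):,} observations")
```

A million observations would be expected to yield only a few hundred large cascades, which leaves little room for splitting them into types; a billion yields hundreds of thousands.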
Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But, in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
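As a rough illustration of how much data "reliably detecting" such a difference takes, the following sketch applies the textbook normal-approximation sample size formula for comparing two proportions. The significance level (5%, two-sided) and power (80%) are conventional defaults assumed here, not values from the text, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05,
                                power: float = 0.80) -> int:
    """Approximate observations needed per group to detect p1 vs p2
    with a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2
    # Spread under the null hypothesis (both groups share the rate p_bar).
    null_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    # Spread under the alternative (each group has its own rate).
    alt_term = z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((null_term + alt_term) ** 2 / (p1 - p2) ** 2)


n = sample_size_two_proportions(0.010, 0.011)
print(f"about {n:,} impressions per ad variant")
```

Under these assumptions the answer is well over a hundred thousand impressions for each ad variant, which is why differences of this size are only visible in large datasets.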
Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don't fundamentally change the problems with making causal inferences from observational data, matching and natural experiments (two techniques that researchers have developed for making causal claims from observational data) both greatly benefit from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.
Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors, described in more detail below, that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
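The trade-off between random and systematic error can be made concrete with a small sketch. It assumes the simplest possible setting: a sample mean of independent observations with standard deviation `sd`, drawn from a data-generating process that is off from the truth by a fixed `bias`. The numbers used (a bias of 0.05 and a standard deviation of 1) and the function names are hypothetical.

```python
import math


def rmse(n: int, bias: float, sd: float) -> float:
    """Root mean squared error of a sample mean of n observations.

    Squared error splits into systematic error (bias ** 2, unaffected by n)
    plus random error (sd ** 2 / n, which shrinks as n grows).
    """
    return math.sqrt(bias ** 2 + sd ** 2 / n)


def systematic_share(n: int, bias: float, sd: float) -> float:
    """Fraction of the mean squared error that is due to bias."""
    return bias ** 2 / (bias ** 2 + sd ** 2 / n)


BIAS, SD = 0.05, 1.0  # hypothetical values, for illustration only

for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  total error={rmse(n, BIAS, SD):.4f}  "
          f"share from bias={systematic_share(n, BIAS, SD):.1%}")
```

In this sketch the total error never falls below the bias, however large `n` becomes, while the share of the error attributable to bias climbs toward 100%. That is the sense in which a very large dataset yields a precise estimate of the wrong thing.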