Researchers scraped Chinese social media sites to study censorship. They dealt with incompleteness using latent-trait inference.
In addition to the big data used in the two previous examples, researchers can also collect their own observational data, as was wonderfully illustrated by Gary King, Jennifer Pan, and Molly Roberts's (2013) research on censorship by the Chinese government.
Social media posts in China are censored by an enormous state apparatus that is thought to include thousands of people. Researchers and citizens, however, have little sense of how these censors decide what content should be deleted from social media. Scholars of China actually have conflicting expectations about which kinds of posts are most likely to be deleted. Some think that censors focus on posts that are critical of the state, while others think they focus on posts that encourage collective behavior, such as protests. Figuring out which of these expectations is correct has implications for how researchers understand China and other authoritarian governments that engage in censorship. Therefore, King and colleagues wanted to compare posts that were published and subsequently deleted with posts that were published and never deleted.
Collecting these posts involved the amazing engineering feat of crawling more than 1,000 Chinese social media websites, each with a different page layout, finding relevant posts, and then revisiting those posts to see which were subsequently deleted. In addition to the normal engineering problems associated with large-scale web crawling, this project had the added challenge that it needed to be extremely fast, because many censored posts are taken down in less than 24 hours. In other words, a slow crawler would miss lots of posts that were censored. Further, the crawlers had to do all this data collection while evading detection, lest the social media websites block access or otherwise change their policies in response to the study.
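The comparison at the heart of this design (posts still online at a revisit versus posts that have disappeared) can be sketched in a few lines. This is only an illustrative sketch: the `Post` record, the `label_deletions` helper, and the sample identifiers are hypothetical names introduced here, and the actual crawling, timing, and detection-evasion work is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """A post captured on the first crawl (hypothetical record layout)."""
    post_id: str
    site: str
    text: str


def label_deletions(first_crawl, still_visible_ids):
    """Map each post_id to True if the post was gone when revisited."""
    return {p.post_id: p.post_id not in still_visible_ids for p in first_crawl}


# Invented toy data: three posts seen on the first crawl.
first_crawl = [
    Post("a1", "site-x", "hypothetical post one"),
    Post("a2", "site-x", "hypothetical post two"),
    Post("b1", "site-y", "hypothetical post three"),
]

# Invented result of the revisit: only these ids could still be fetched.
still_visible_ids = {"a1", "b1"}

deleted = label_deletions(first_crawl, still_visible_ids)
print(deleted)
```

Because many censored posts vanish within a day, the interval between the first crawl and the revisit determines how many deletions such a comparison can observe.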
Once this massive engineering task was completed, King and colleagues had obtained about 11 million posts on 85 different topics that were pre-specified based on their expected level of sensitivity. For example, a topic of high sensitivity is Ai Weiwei, the dissident artist; a topic of middle sensitivity is the appreciation and devaluation of the Chinese currency; and a topic of low sensitivity is the World Cup. Of these 11 million posts, about 2 million had been censored, but posts on highly sensitive topics were censored only slightly more often than posts on middle- and low-sensitivity topics. In other words, Chinese censors are about as likely to censor a post that mentions Ai Weiwei as a post that mentions the World Cup. These findings did not match the simplistic idea that the government censors all posts on sensitive topics.
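The calculation described here is just counting: the share of censored posts overall, and the same share within each sensitivity level. A minimal sketch follows; the function name `censorship_rate_by_group` and the toy records are invented for illustration and are not the study's data. Only the two rounded totals (about 2 million of about 11 million) come from the text.

```python
from collections import defaultdict


def censorship_rate_by_group(posts):
    """posts: iterable of (group, was_censored) pairs. Returns {group: rate}."""
    totals = defaultdict(int)
    censored = defaultdict(int)
    for group, was_censored in posts:
        totals[group] += 1
        censored[group] += int(was_censored)
    return {group: censored[group] / totals[group] for group in totals}


# Overall rate implied by the rounded figures in the text.
overall_rate = 2_000_000 / 11_000_000
print(f"overall censorship rate: {overall_rate:.0%}")

# Invented toy records, NOT the study's data: (sensitivity level, censored?).
toy_posts = (
    [("high", True)] * 2 + [("high", False)] * 8
    + [("middle", True)] * 2 + [("middle", False)] * 8
    + [("low", True)] * 1 + [("low", False)] * 9
)

rates = censorship_rate_by_group(toy_posts)
for level, rate in rates.items():
    print(f"{level}: {rate:.0%}")
```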
This simple calculation of the censorship rate by topic could be misleading, however. For example, the government might censor posts that are supportive of Ai Weiwei but leave posts that are critical of him. In order to distinguish between posts more carefully, the researchers needed to measure the sentiment of each post. One way to think about this is that the sentiment of a post is an important latent feature of that post. Unfortunately, despite much work, fully automated methods of sentiment detection using pre-existing dictionaries are still not very good in many situations (think back to the problems of creating an emotional timeline of September 11, 2001, from Section 2.3.2.6). Therefore, King and colleagues needed a way to label their 11 million social media posts as to whether they were 1) critical of the state, 2) supportive of the state, or 3) irrelevant or factual reports about events. This sounds like a massive job, but they solved it using a powerful trick, one that is common in data science but currently relatively rare in social science.
First, in a step typically called pre-processing, the researchers converted the social media posts into a document-term matrix, with one row for each document and one column for each term, recording whether the post contained that specific word (e.g., protest, traffic, etc.). Next, a group of research assistants hand-labeled the sentiment of a sample of posts. Then, King and colleagues used this hand-labeled data to estimate a machine learning model that could infer the sentiment of a post based on its characteristics. Finally, they used this machine learning model to estimate the sentiment of all 11 million posts. Thus, rather than manually reading and labeling 11 million posts (which would be logistically impossible), they manually labeled a small number of posts and then used what data scientists would call supervised learning to estimate the categories of all the posts. After completing this analysis, King and colleagues were able to conclude that, somewhat surprisingly, the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.
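The four steps above (build a document-term matrix, hand-label a sample, fit a model, label everything else) can be sketched in plain Python. This is a toy illustration under stated assumptions: the Bernoulli naive Bayes classifier is a simple stand-in chosen for brevity, not the model used in the study, and every function name, label name, and example sentence below is invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_vocabulary(texts):
    return sorted({word for text in texts for word in tokenize(text)})


def document_term_matrix(texts, vocab):
    """One row per document, one 0/1 column per vocabulary word."""
    rows = []
    for text in texts:
        words = set(tokenize(text))
        rows.append([1 if word in words else 0 for word in vocab])
    return rows


def train_bernoulli_nb(rows, labels):
    """Estimate log priors and Laplace-smoothed word-presence probabilities."""
    n_docs, n_words = len(rows), len(rows[0])
    class_counts = Counter(labels)
    word_counts = {label: [0] * n_words for label in class_counts}
    for row, label in zip(rows, labels):
        for j, present in enumerate(row):
            word_counts[label][j] += present
    model = {}
    for label, n_label in class_counts.items():
        log_prior = math.log(n_label / n_docs)
        probs = [(word_counts[label][j] + 1) / (n_label + 2) for j in range(n_words)]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the label with the highest log posterior for one matrix row."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for present, p in zip(row, probs):
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Step 2 stand-in: a tiny invented hand-labeled sample.
labeled_texts = [
    "officials are corrupt and failing",
    "the policy is failing badly",
    "officials did great work",
    "great policy and great leadership",
    "traffic was heavy today",
    "the match starts today",
]
labels = ["critical", "critical", "supportive", "supportive", "factual", "factual"]

# Steps 1 and 3: document-term matrix, then fit the stand-in model.
vocab = build_vocabulary(labeled_texts)
model = train_bernoulli_nb(document_term_matrix(labeled_texts, vocab), labels)

# Step 4: label posts that nobody read by hand (also invented).
unlabeled_texts = ["corrupt officials are failing", "great leadership", "heavy traffic today"]
predictions = [predict(model, row) for row in document_term_matrix(unlabeled_texts, vocab)]
print(predictions)
```

The key design point is that human effort scales with the size of the labeled sample rather than with the full corpus; words in unlabeled posts that never appeared in the labeled sample are simply ignored by this sketch.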
In the end, King and colleagues discovered that only three types of posts were regularly censored: pornography, criticism of censors, and posts with collective action potential (i.e., the possibility of leading to large-scale protests). By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. In subsequent research, they actually intervened directly in the Chinese social media ecosystem by creating posts with systematically different content and measuring which were censored (King, Pan, and Roberts 2014). We will learn more about experimental approaches in Chapter 4. Further, foreshadowing a theme that will occur throughout the book, these latent-attribute inference problems, which can sometimes be solved with supervised learning, turn out to be very common in social research in the digital age. You will see pictures very similar to Figure 2.3 in Chapters 3 (Asking questions) and 5 (Creating mass collaboration); it is one of the few ideas that appears in multiple chapters.
All three of these examples (the working behavior of taxi drivers in New York, friendship formation by students, and the social media censorship behavior of the Chinese government) show that relatively simple counting of observational data can enable researchers to test theoretical predictions. In some cases, big data enables you to do this counting relatively directly (as in the case of New York taxis). In other cases, researchers will need to collect their own observational data (as in the case of Chinese censorship), deal with incompleteness by merging data together (as in the case of network evolution), or perform some form of latent-trait inference (as in the case of Chinese censorship). As I hope these examples show, for researchers who are able to ask interesting questions, big data holds great promise.