Researchers scraped Chinese social media sites to study censorship. They dealt with incompleteness using latent-trait inference.
In addition to the big data used in the two previous examples, researchers can also collect their own observational data, as was wonderfully illustrated by Gary King, Jennifer Pan, and Molly Roberts' (2013) research on censorship by the Chinese government.
Social media posts in China are censored by an enormous state apparatus that is thought to include tens of thousands of people. Researchers and citizens, however, have little sense of how these censors decide what content should be deleted from social media. Scholars of China actually have conflicting expectations about which kinds of posts are most likely to be deleted. Some think that censors focus on posts that are critical of the state, while others think that they focus on posts that encourage collective behavior, such as protests. Figuring out which of these expectations is correct has implications for how researchers understand China and other authoritarian governments that engage in censorship. Therefore, King and colleagues wanted to compare posts that were published and subsequently deleted with posts that were published and never deleted.
Collecting these posts involved the amazing engineering feat of crawling more than 1,000 Chinese social media websites (each with a different page layout), finding relevant posts, and then revisiting these posts to see which were subsequently deleted. In addition to the normal engineering problems associated with large-scale web crawling, this project had the added challenge that it needed to be extremely fast, because many censored posts are taken down in less than 24 hours. In other words, a slow crawler would miss many of the posts that were censored. Further, the crawlers had to do all this data collection while evading detection, lest the social media websites block access or otherwise change their policies in response to the study.
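The revisiting step can be sketched in a few lines. This is only an illustration under stated assumptions: the post identifiers, the revisit schedule, and the function name `first_missing_hour` are all hypothetical, and the sketch works on already-collected snapshots rather than on live websites. It shows why revisit frequency matters: a deletion can only be dated to the first revisit at which the post was found missing.

```python
def first_missing_hour(first_crawl_ids, revisits):
    """Map each post id to the hour of the first revisit where it was gone.

    revisits is a list of (hours_since_first_crawl, set_of_visible_ids),
    assumed to be sorted by time. Posts never found missing are left out.
    """
    deleted_at = {}
    for hours, visible_ids in revisits:
        for post_id in first_crawl_ids:
            if post_id not in visible_ids and post_id not in deleted_at:
                deleted_at[post_id] = hours
    return deleted_at


# Hypothetical snapshots: four posts seen on the first crawl,
# then three revisits at 6, 24, and 72 hours.
first_crawl_ids = {"p1", "p2", "p3", "p4"}
revisits = [
    (6, {"p1", "p2", "p4"}),
    (24, {"p1", "p4"}),
    (72, {"p1", "p4"}),
]

deleted_at = first_missing_hour(first_crawl_ids, revisits)
never_deleted = sorted(first_crawl_ids - set(deleted_at))

print(deleted_at)
print(never_deleted)
```

With only the 72-hour revisit, both deleted posts would still be detected as deleted, but a crawler whose first visit came too late would never have seen them at all.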
Once this massive engineering task was completed, King and colleagues had obtained about 11 million posts on 85 different topics that were pre-specified based on their expected level of sensitivity. For example, a topic of high sensitivity is Ai Weiwei, the dissident artist; a topic of middle sensitivity is the appreciation and devaluation of the Chinese currency; and a topic of low sensitivity is the World Cup. Of these 11 million posts, about 2 million had been censored, but posts on highly sensitive topics were censored only slightly more often than posts on middle- and low-sensitivity topics. In other words, Chinese censors are about as likely to censor a post that mentions Ai Weiwei as a post that mentions the World Cup. These findings did not match the simplistic idea that the government censors all posts on sensitive topics.
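The calculation described here is simple counting: group posts by sensitivity level and divide the number censored by the total. The sketch below shows that arithmetic on a handful of made-up records. The records, the tuple layout, and the function name `censorship_rate_by` are hypothetical and are not the study's data; only the overall figure at the end uses the approximate numbers given in the text.

```python
from collections import defaultdict

# Hypothetical toy records: (topic, sensitivity level, was_censored).
# These are NOT the study's data; they only illustrate the calculation.
posts = [
    ("Ai Weiwei", "high", True),
    ("Ai Weiwei", "high", False),
    ("Ai Weiwei", "high", False),
    ("currency", "medium", False),
    ("currency", "medium", True),
    ("currency", "medium", False),
    ("World Cup", "low", False),
    ("World Cup", "low", False),
    ("World Cup", "low", True),
    ("World Cup", "low", False),
]


def censorship_rate_by(records, key_index):
    """Return {group: share of posts censored}, grouped by one tuple field."""
    totals = defaultdict(int)
    censored = defaultdict(int)
    for record in records:
        group = record[key_index]
        totals[group] += 1
        censored[group] += int(record[2])
    return {group: censored[group] / totals[group] for group in totals}


rates = censorship_rate_by(posts, key_index=1)
for level in ("high", "medium", "low"):
    print(f"{level:>6}: {rates[level]:.2f}")

# Overall rate implied by the approximate figures in the text:
# about 2 million censored out of about 11 million posts.
overall = 2_000_000 / 11_000_000
print(f"overall: {overall:.2f}")
```

Passing `key_index=0` instead groups by topic rather than by sensitivity level.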
This simple calculation of the censorship rate by topic could be misleading, however. For example, the government might censor posts that are supportive of Ai Weiwei but leave posts that are critical of him. In order to distinguish between posts more carefully, the researchers needed to measure the sentiment of each post. One way to think about this is that the sentiment of each post is an important latent feature of that post. Unfortunately, despite much work, fully automated methods of sentiment detection using pre-existing dictionaries are still not very good in many situations (think back to the problems in creating an emotional timeline of September 11, 2001, from Section 2.3.2.6). Therefore, King and colleagues needed a way to label their 11 million social media posts as to whether they were 1) critical of the state, 2) supportive of the state, or 3) irrelevant or factual reports about events. This sounds like a massive job, but they solved it using a powerful trick, one that is common in data science but currently relatively rare in social science.
First, in a step typically called pre-processing, the researchers converted the social media posts into a document-term matrix, where there was one row for each document and one column for each specific word, recording whether the post contained that word (e.g., protest, traffic, etc.). Next, a group of research assistants hand-labeled the sentiment of a sample of posts. Then, King and colleagues used this hand-labeled data to estimate a machine learning model that could infer the sentiment of a post based on its characteristics. Finally, they used this machine learning model to estimate the sentiment of all 11 million posts. Thus, rather than manually reading and labeling 11 million posts (which would be logistically impossible), they manually labeled a small number of posts and then used what data scientists would call supervised learning to estimate the categories of all the posts. After completing this analysis, King and colleagues were able to conclude that, somewhat surprisingly, the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.
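The pipeline described above (build a document-term matrix, hand-label a sample, fit a model, label everything else) can be sketched as follows. This is a minimal illustration, not the method used in the study: the text does not say which model King and colleagues estimated, so a small Bernoulli naive Bayes classifier is swapped in here as a stand-in. The example posts, the labels, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_vocabulary(documents):
    """Sorted list of every distinct lowercase word across the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def document_term_matrix(documents, vocabulary):
    """One row per document, one 0/1 column per vocabulary word."""
    rows = []
    for doc in documents:
        words = set(doc.lower().split())
        rows.append([1 if term in words else 0 for term in vocabulary])
    return rows


def train_naive_bayes(matrix, labels):
    """Estimate Bernoulli naive Bayes parameters with add-one smoothing."""
    n_terms = len(matrix[0])
    class_counts = Counter(labels)
    term_counts = defaultdict(lambda: [0] * n_terms)
    for row, label in zip(matrix, labels):
        for j, present in enumerate(row):
            term_counts[label][j] += present
    model = {}
    for label, n_docs in class_counts.items():
        prior = math.log(n_docs / len(labels))
        probs = [(term_counts[label][j] + 1) / (n_docs + 2) for j in range(n_terms)]
        model[label] = (prior, probs)
    return model


def predict(model, row):
    """Return the label with the highest log posterior for one matrix row."""
    best_label, best_score = None, -math.inf
    for label, (prior, probs) in model.items():
        score = prior
        for present, p in zip(row, probs):
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled sample (the research assistants' step).
labeled = [
    ("officials are corrupt and the policy is a failure", "critical"),
    ("the corrupt policy hurts ordinary people", "critical"),
    ("the government did a great job with the new policy", "supportive"),
    ("great leadership and a great new plan", "supportive"),
    ("the match starts at eight tonight", "irrelevant"),
    ("traffic was heavy before the match", "irrelevant"),
]
# Hypothetical unlabeled posts standing in for the remaining millions.
unlabeled = [
    "this corrupt plan is a failure",
    "great job by the government",
    "heavy traffic tonight",
]

texts = [text for text, _ in labeled]
labels = [label for _, label in labeled]
vocabulary = build_vocabulary(texts)
train_matrix = document_term_matrix(texts, vocabulary)
model = train_naive_bayes(train_matrix, labels)

predictions = [predict(model, row) for row in document_term_matrix(unlabeled, vocabulary)]
for text, label in zip(unlabeled, predictions):
    print(f"{label:>10}: {text}")
```

Words in the unlabeled posts that never appeared in the hand-labeled sample (such as "this" or "by") have no column in the matrix and are simply ignored, which is one reason the hand-labeled sample needs to be reasonably representative of the full collection.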
In the end, King and colleagues discovered that only three types of posts were regularly censored: pornography, criticism of censors, and those that had collective action potential (i.e., the possibility of leading to large-scale protests). By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. In subsequent research, they directly intervened in the Chinese social media ecosystem by creating posts with systematically different content and measuring which were censored (King, Pan, and Roberts 2014). We will learn more about experimental approaches in Chapter 4. Further, foreshadowing a theme that will recur throughout the book, these latent-attribute inference problems, which can sometimes be solved with supervised learning, turn out to be very common in social research in the digital age. You will see pictures very similar to Figure 2.3 in Chapters 3 (Asking questions) and 5 (Creating mass collaboration); it is one of the few ideas that appears in multiple chapters.
All three of these examples (the working behavior of taxi drivers in New York, friendship formation by students, and the social media censorship behavior of the Chinese government) show that relatively simple counting of observational data can enable researchers to test theoretical predictions. In some cases, big data enables you to do this counting relatively directly (as in the case of New York taxis). In other cases, researchers will need to collect their own observational data (as in the case of Chinese censorship), deal with incompleteness by merging data together (as in the case of network evolution), or perform some form of latent-trait inference (as in the case of Chinese censorship). As I hope these examples show, for researchers who are able to ask interesting questions, big data holds great promise.