Large datasets are a means to an end; they are not an end in themselves.
The first of the three good characteristics of big data is the most discussed: these data are big. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.
The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
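To see why the number of posts matters for category-specific estimates, here is a minimal sketch of the arithmetic. It is not the analysis King and colleagues ran. It assumes, purely for illustration, that posts are spread evenly across the 85 categories and that the censorship probability in a category is 10%; both numbers are hypothetical, as is the helper name `standard_error`.

```python
import math

def standard_error(p: float, n: float) -> float:
    """Standard error of an estimated proportion p from n independent observations."""
    return math.sqrt(p * (1 - p) / n)

CATEGORIES = 85        # number of categories, from the text
ASSUMED_RATE = 0.10    # hypothetical censorship probability, for illustration only

for total_posts in (11_000, 11_000_000):
    # Assumption: posts are spread evenly across categories.
    per_category = total_posts / CATEGORIES
    margin = 1.96 * standard_error(ASSUMED_RATE, per_category)
    print(f"{total_posts:>10,} posts -> {per_category:>9,.0f} per category, "
          f"95% margin ~ +/-{margin:.4f}")
```

Under these assumptions, 11 thousand posts leave only about 130 posts per category and a margin of roughly 5 percentage points, which is too coarse to compare categories. With 11 million posts the margin shrinks to under 0.2 percentage points.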
Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of re-tweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.
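The arithmetic behind this point can be sketched in a few lines. The rate of one in 3,000 comes from the text; the dataset sizes in the loop and the function name `expected_rare_events` are hypothetical choices made for illustration.

```python
RARE_RATE = 1 / 3000  # "about one in 3,000", from the text

def expected_rare_events(n_items: int, rate: float = RARE_RATE) -> float:
    """Expected number of rare events among n_items, each rare with probability rate."""
    return n_items * rate

# Hypothetical dataset sizes, chosen only to show how the count scales.
for n_tweets in (1_000_000, 100_000_000, 1_000_000_000):
    count = expected_rare_events(n_tweets)
    print(f"{n_tweets:>13,} tweets -> ~{count:,.0f} large cascades expected")
```

A million tweets would be expected to contain only a few hundred large cascades, while a billion would contain hundreds of thousands.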
Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two public health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
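How much data does it take to reliably tell 1% from 1.1%? The sketch below uses the standard normal-approximation formula for comparing two proportions. The 5% two-sided significance level and 80% power are conventional defaults that I am assuming here; they are not from the text, and `sample_size_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate observations needed per group to detect p1 vs p2
    with a two-sided test (normal approximation, unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Click-through rates from the text; alpha and power are assumed defaults.
n = sample_size_per_group(0.010, 0.011)
print(f"Impressions needed per ad: about {n:,}")
```

Under these assumptions the answer is on the order of 160,000 impressions per ad, which is why differences this small are only detectable with large datasets.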
Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don't fundamentally change the problems with making causal inferences from observational data, matching and natural experiments (two techniques that researchers have developed for making causal claims from observational data) both benefit greatly from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter when I describe research strategies.
Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors, described in more detail below, that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
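A small simulation can make the idea of being "precisely inaccurate" concrete. This is only a sketch under invented assumptions: the true rate of 50%, the biased rate of 55%, and both sample sizes are hypothetical values chosen for illustration, and `estimate_with_margin` is a hypothetical helper.

```python
import math
import random

TRUE_RATE = 0.50    # hypothetical true value in the population
BIASED_RATE = 0.55  # hypothetical rate in a data source that over-represents positives

def estimate_with_margin(rate: float, n: int, rng: random.Random) -> tuple:
    """Draw n Bernoulli(rate) observations; return (estimate, 95% margin of error)."""
    successes = sum(rng.random() < rate for _ in range(n))
    p_hat = successes / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

rng = random.Random(0)  # fixed seed so the run is reproducible
small_unbiased = estimate_with_margin(TRUE_RATE, 1_000, rng)
large_biased = estimate_with_margin(BIASED_RATE, 200_000, rng)

for label, (p_hat, margin) in (("small, unbiased", small_unbiased),
                               ("large, biased", large_biased)):
    covers = abs(p_hat - TRUE_RATE) <= margin
    print(f"{label:>16}: {p_hat:.4f} +/- {margin:.4f}  covers true value: {covers}")
```

The large biased sample produces a much narrower interval than the small unbiased one, but that interval is centered on the wrong value: more data shrinks the random error and leaves the systematic error untouched.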