# Rescuing Notes from Needs More Ratings: Behavioral Clustering in Community Notes on X

This repository contains the full analytical workflow for the paper.

This study was prepared as a final project for the University of Konstanz course *AI, Society, and Human Behavior: Research Methods in Context* (Winter Semester 2025/26).
The project is organized as a notebook-first pipeline with reusable logic extracted into `src/`. The workflow moves from data sampling through user clustering, note scoring, rescue simulation, topic modeling, LLM-based validation, and figure generation to paper compilation.

The main goal of the project is to test whether a meaningful subset of Community Notes stuck in the Needs More Ratings state can be rescued by a representative cross-cluster selection rule rather than a simple pooled-majority rule.
## What This Repository Contains

- `Paper file/main.tex`: LaTeX source of the paper.
- `Paper file/references.bib`: BibTeX references.
- `src/`: reusable Python modules for clustering, scoring, topic analysis, plotting helpers, and path/config handling.
- `notebooks/`: ordered analysis notebooks (see Workflow Overview below for execution order) and one standalone tutorial (`Gabriel_tutorial_for_LLMs_in_social_science_research.ipynb`) from OpenAI's Gabriel library on using LLMs in social-science research.
- `data/interim/`: intermediate outputs created after clustering; distributed through the external data bundle rather than Git.
- `data/processed/`: processed outputs used by downstream analysis and figures; distributed through the external data bundle rather than Git.
- `data/gabriel/`: cached and saved outputs from the Gabriel LLM workflow; distributed through the external data bundle rather than Git.
- `figures/paper/`: paper-ready PDF and PNG figures.
- `master_sample.csv`: the large sampled ratings table used as the starting point for the modular pipeline; distributed separately from the Git repository.
## Data Availability

To keep the Git repository lightweight and pushable, the large analytical data artifacts are not stored in this repository.

This project is built on a sampled analytical dataset, not on the full raw Community Notes production dataset. The shareable replication bundle starts from `master_sample.csv` and the downstream parquet outputs. Readers who want the full upstream raw files should use the official Community Notes download sources linked below.

In particular, a fresh clone should not be expected to contain:

- `master_sample.csv`
- `data/interim/*.parquet`
- `data/processed/*.parquet`
- `data/gabriel/` caches and saved outputs

The code, notebooks, and paper sources are versioned here, but the large generated datasets are distributed separately, for example through an external data archive, a release asset, or a cloud storage link documented in this README.
## External Data Download and Setup

If you want to reproduce the sampled analytical pipeline used in the paper without rebuilding the raw upstream files, download the external data bundle from this Google Drive link:

https://drive.google.com/file/d/18fKA8sHn4ULetghboSboNBnWm-BcrO_O/view?usp=sharing

The download is a single archive, `release-assets-bundle.zip`.

Important notes:

- The bundle is based on the sampled analytical dataset used in the paper.
- It is not the full raw Community Notes production dataset.
- If you want to reconstruct the pipeline from raw upstream exports, use the official Community Notes links in the next section instead.

Recommended setup steps after downloading:

1. Move `release-assets-bundle.zip` into the repository root and extract it there:

   ```sh
   unzip release-assets-bundle.zip
   ```

2. Move the extracted bundle files from `release-assets/data-v1/` into the repository root:

   ```sh
   mv release-assets/data-v1/community-notes-x-data-bundle.tar.gz .
   mv release-assets/data-v1/master_sample.csv.gz .
   mv release-assets/data-v1/SHA256SUMS.txt .
   ```

3. Optionally verify file integrity:

   ```sh
   shasum -a 256 -c SHA256SUMS.txt
   ```

4. Decompress the sampled analytical table:

   ```sh
   gunzip master_sample.csv.gz
   ```

5. Extract the downstream parquet and Gabriel outputs:

   ```sh
   tar -xzf community-notes-x-data-bundle.tar.gz
   ```

6. Confirm that the repository root now contains `master_sample.csv` and that the `data/interim/`, `data/processed/`, and `data/gabriel/` folders are populated.

After that, you can start directly from `notebooks/01_clustering.ipynb`.
## Reproducibility Scope

There are two distinct reproducibility levels in this project:

1. **Full upstream reconstruction from raw platform files.** This starts from the original raw Community Notes exports and rebuilds `master_sample.csv`.
2. **Downstream reconstruction from the sampled analytical table.** This starts from `master_sample.csv` and rebuilds the clustering, rescue, topic, Gabriel, and plotting outputs.

For most readers, the practical reproducibility boundary is `master_sample.csv` onward. The sampling step exists and is documented in `notebooks/master sampling.ipynb`, but it requires access to raw upstream files that are not part of the modular notebook sequence.
## Expected Raw Inputs for the Sampling Stage

The notebook `notebooks/master sampling.ipynb` expects the following files in the project root:

- `notes-00000.tsv`
- `ratings-*.tsv`
- `noteStatusHistory-00000.tsv`

These raw public snapshots are not bundled in this repository. They must be downloaded separately from X's public Community Notes data sources.

Official starting points:

- Public data download entry point: https://x.com/i/communitynotes/download-data
- Community Notes Guide download page: https://communitynotes.x.com/guide/en/under-the-hood/download-data
- Official public repository and documentation hub: https://github.com/twitter/communitynotes

That notebook creates `master_sample.csv`. This file is large and serves as the analytical entry point for the rest of the project.
## Workflow Overview

The intended execution order is:

1. `notebooks/master sampling.ipynb`
2. `notebooks/01_clustering.ipynb`
3. `notebooks/02_scoring.ipynb`
4. `notebooks/03_topics.ipynb`
5. `notebooks/05_gabriel_check.ipynb` (optional but used in the paper)
6. `notebooks/06_plots.ipynb`
7. Compile `Paper file/main.tex`

The notebook `notebooks/04_plots.ipynb` is exploratory and not required for the final paper outputs.
## Step-by-Step Pipeline

### 1. Master Sampling

Notebook: `notebooks/master sampling.ipynb`

Purpose:

- restrict the universe to tweets with at least 3 distinct notes
- build a lightweight tweet-level political proxy using note summaries
- classify tweets into `political` and `non_political`
- allocate the sample with a 65% political / 35% non-political balance
- balance the political share across political subtopics
- draw a 30% sample of ratings
- merge note metadata and status history into the sampled table

Key output: `master_sample.csv`

Important notes:

- This notebook reconstructs the sampled analytical universe.
- If you do not have the raw upstream TSV files, start from the existing `master_sample.csv`.
- Raw snapshots are obtained from X's public Community Notes downloads rather than from this repository.
- Topic modeling is not used to build the sample itself; it is applied later to the sampled analytical table, after clustering and scoring.
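The allocation logic can be sketched with pandas. This is a toy illustration, not the notebook's actual code: the column names (`tweet_id`, `topic_class`) and the sample total are assumptions.

```python
import pandas as pd

# Hypothetical tweet-level table; the real notebook derives topic_class
# from a political proxy built on note summaries.
tweets = pd.DataFrame({
    "tweet_id": range(100),
    "topic_class": ["political"] * 50 + ["non_political"] * 50,
})

TARGET = 40  # hypothetical total number of tweets to sample
quota = {
    "political": round(TARGET * 0.65),               # 65% political
    "non_political": TARGET - round(TARGET * 0.65),  # 35% non-political
}

# Draw each stratum separately so the 65/35 balance holds exactly.
sampled = pd.concat(
    [grp.sample(n=quota[name], random_state=42)
     for name, grp in tweets.groupby("topic_class")],
    ignore_index=True,
)

# Downstream, the attached ratings would then be thinned to 30%, e.g.:
# ratings.sample(frac=0.30, random_state=42)
```

Sampling each stratum separately guarantees the target mix even when the underlying topic shares are unbalanced.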
### 2. Clustering

Notebook: `notebooks/01_clustering.ipynb`

Core idea: cluster users, not notes, and recover latent rating blocs from co-voting patterns.

What the notebook does:

- loads `master_sample.csv`
- keeps only `HELPFUL` and `NOT_HELPFUL` ratings and maps them to a binary vote
- applies the timeliness filter
- targets roughly 5,000 of the most-rated notes and 10,000 of the most-active raters
- builds the user-note matrix
- mean-centers observed votes by user, zero-filling only for similarity construction
- computes cosine similarity on the centered matrix
- sets affinity to zero when user pairs share no co-rated note
- runs spectral clustering
- evaluates candidate `K` values with silhouette and stability diagnostics
- saves cluster assignments and intermediate tables for downstream use

Outputs:

- `data/interim/ratings_filtered.parquet`
- `data/interim/ratings_clustered.parquet`
- `data/interim/user_clusters.parquet`
- `data/interim/silhouette_over_k.parquet`
- `data/interim/stability_over_k.parquet`
- `data/interim/user_stats.parquet`
- `data/interim/cluster_summary.parquet`
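The similarity construction can be sketched end to end on toy data. This is an illustration of the listed steps under stated assumptions (hypothetical column names, clipping negative similarities to zero), not the notebook's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings table: one row per (user, note) with a binary vote.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "user_id": rng.integers(0, 20, size=400),
    "note_id": rng.integers(0, 30, size=400),
    "vote": rng.integers(0, 2, size=400),
}).drop_duplicates(["user_id", "note_id"])

# Mean-center observed votes by user; zero-fill only for similarity.
ratings["vote_c"] = ratings["vote"] - ratings.groupby("user_id")["vote"].transform("mean")
centered = ratings.pivot(index="user_id", columns="note_id", values="vote_c").fillna(0.0)

# Cosine similarity on the centered matrix.
sim = cosine_similarity(centered.values)

# Zero the affinity for user pairs that share no co-rated note.
observed = (~ratings.pivot(index="user_id", columns="note_id", values="vote").isna()).astype(int)
co_rated = observed.values @ observed.values.T
sim[co_rated == 0] = 0.0

# Spectral clustering needs a nonnegative affinity; clipping negative
# similarities is a simplification made for this sketch.
affinity = np.clip(sim, 0.0, None)
np.fill_diagonal(affinity, 1.0)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
```

Mean-centering before zero-filling matters: a missing vote then contributes nothing to the dot product instead of acting like a real neutral rating.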
### 3. Scoring and Rescue Simulation

Notebook: `notebooks/02_scoring.ipynb`

What the notebook does:

- loads clustered ratings from `data/interim/ratings_clustered.parquet`
- computes note-level pooled approval
- computes cluster-specific approval, applying the 0.5 fallback when a cluster has no observed rating on a note
- enforces the minimum per-cluster rater threshold for bridge scoring
- computes the geometric-mean bridge score
- computes disagreement measures
- simulates tweet-level selection rules: Simple Majoritarian, Representative, and Pluralistic-K
- creates summary tables used by the paper and downstream notebooks
- runs robustness checks for alternative bridge aggregators and epsilon sensitivity

Main outputs:

- `data/processed/scores.parquet`
- `data/processed/final_table.parquet`
- `data/processed/rescue_summary.parquet`
- `data/processed/pluralistic_breakdown.parquet`
- `data/processed/selection_log.parquet`
- `data/processed/selection_status_summary.parquet`
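A minimal sketch of the geometric-mean bridge score with the 0.5 fallback, on a toy schema (`note_id`, `cluster`, binary `rating`); the real notebook additionally enforces the per-cluster rater threshold.

```python
import numpy as np
import pandas as pd

# Toy clustered ratings: rating is 1 for HELPFUL, 0 for NOT_HELPFUL.
ratings = pd.DataFrame({
    "note_id": ["n1"] * 5 + ["n2"] * 3,
    "cluster": [0, 0, 1, 1, 1, 0, 0, 0],
    "rating":  [1, 1, 1, 0, 1, 1, 1, 0],
})

N_CLUSTERS = 2
FALLBACK = 0.5  # assumed approval when a cluster never rated the note

def bridge_score(note: pd.DataFrame) -> float:
    """Geometric mean of per-cluster approval rates."""
    approvals = [
        note.loc[note["cluster"] == c, "rating"].mean()
        if (note["cluster"] == c).any() else FALLBACK
        for c in range(N_CLUSTERS)
    ]
    return float(np.prod(approvals) ** (1.0 / N_CLUSTERS))

scores = ratings.groupby("note_id")[["cluster", "rating"]].apply(bridge_score)
```

Because a zero approval in any cluster drives the geometric mean to exactly zero, the epsilon-sensitivity robustness checks are most relevant for notes with one-sided cluster opinion.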
Authoritative note: for exact downstream reproduction of the paper's representative selections, use `selection_log.parquet` as the authoritative selection output. Helper calculations that sort only on `bridge_score` can behave differently under exact ties unless the same deterministic tie-breaking rule is used everywhere.

Tie note:

- In practice, the important edge case is when two notes attached to the same tweet receive the exact same `bridge_score`.
- If helper logic uses a different secondary ordering rule, the winning note can differ even though the bridge score is identical.
- This matters because one tied note can already be `CURRENTLY_RATED_HELPFUL` while the other remains `NEEDS_MORE_RATINGS`.
- The repository therefore treats `selection_log.parquet` as the authoritative downstream record of representative picks.
### 4. Topic Modeling

Notebook: `notebooks/03_topics.ipynb`

What the notebook does:

- loads `scores.parquet` and selection outputs
- prepares note text for topic analysis
- fits BERTopic on note summaries
- attaches topic names and shortened labels
- computes topic-level disagreement structure
- computes topic salience by cluster
- computes topic-level rescue and failure summaries
- compares which topics are surfaced by different selection rules

Outputs:

- `data/processed/topic_notes.parquet`
- `data/processed/topic_cluster_stats.parquet`
- `data/processed/topic_exemplars.parquet`
- `data/processed/topic_salience.parquet`
- `data/processed/topic_salience_pivot.parquet`
- `data/processed/topic_rescue_stats.parquet`
- `data/processed/topic_strategy_summary.parquet`
- `data/processed/topic_strategy_pivot.parquet`
- `data/processed/topic_selection_overlap.parquet`
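Topic salience by cluster, i.e. the share of each cluster's rating activity that falls into a topic, can be sketched as a count-then-normalize step. Toy data with hypothetical column names:

```python
import pandas as pd

# Toy inputs: a note-to-topic map and cluster-attributed ratings.
note_topics = pd.DataFrame({
    "note_id": ["n1", "n2", "n3", "n4"],
    "topic":   ["elections", "health", "elections", "health"],
})
cluster_votes = pd.DataFrame({
    "note_id": ["n1", "n1", "n2", "n3", "n4", "n4"],
    "cluster": [0, 1, 0, 1, 0, 1],
})

# Count each cluster's ratings per topic, then normalize within cluster
# so each row of the salience table sums to 1.
counts = (
    cluster_votes.merge(note_topics, on="note_id")
    .groupby(["cluster", "topic"]).size()
    .unstack(fill_value=0)
)
salience = counts.div(counts.sum(axis=1), axis=0)
```

Normalizing within cluster makes salience comparable across clusters of very different sizes, which pooled counts would not be.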
### 5. Gabriel LLM Validation

Notebook: `notebooks/05_gabriel_check.ipynb`

Purpose: evaluate whether representative-rescued notes look substantively worth rescuing.

What the notebook does:

- loads processed rescue outputs
- isolates representative-rescued notes
- builds `llm_context` strings combining note text and selected metadata
- runs Gabriel classification for political topic class and troll vs non-troll
- runs Gabriel rating for rescue worthiness, informational value, evidentiary specificity, troll likelihood, and clarity
- runs blind validation by removing score metadata from the context
- compares representative-rescued notes with majoritarian-only rescued notes
- writes reusable caches so repeated runs can resume rather than starting over

Main save locations:

- `data/gabriel/rescue_notes_100pct_politics_classify/`
- `data/gabriel/rescue_notes_100pct_troll_classify/`
- `data/gabriel/rescue_notes_100pct_rescue_rate/`
- `data/gabriel/rescue_notes_100pct_blind_rescue_rate/`
- `data/gabriel/rescue_notes_100pct_maj_rescue_rate/`
- `data/gabriel/cache/`
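Building the `llm_context` strings and their blind variants might look like the following sketch. The columns and field layout are hypothetical; the actual prompt fields live in the notebook.

```python
import pandas as pd

# Toy rescued-note table with hypothetical columns.
notes = pd.DataFrame({
    "note_id": ["n1"],
    "summary": ["The post omits key context; the linked source says otherwise."],
    "topic": ["health"],
    "bridge_score": [0.81],
})

def build_context(row: pd.Series, blind: bool = False) -> str:
    """Combine note text with metadata; the blind variant drops score
    metadata so the LLM cannot anchor on the pipeline's own numbers."""
    parts = [f"Note text: {row['summary']}", f"Topic: {row['topic']}"]
    if not blind:
        parts.append(f"Bridge score: {row['bridge_score']:.2f}")
    return "\n".join(parts)

notes["llm_context"] = notes.apply(build_context, axis=1)
notes["llm_context_blind"] = notes.apply(build_context, axis=1, blind=True)
```

Keeping the blind and full contexts as parallel columns makes it easy to run the same Gabriel rating task twice and compare the two score distributions.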
Important notes:

- This notebook requires an OpenAI API key and network access when run fresh.
- Cached outputs in `data/gabriel/` make later reruns cheaper and faster.
- Repository-level documentation for this stage is collected in `LLM_APPENDIX.md`.
### 6. Final Paper Figures

Notebook: `notebooks/06_plots.ipynb`

What the notebook does:

- loads prepared outputs only
- generates the paper-ready figures
- saves each figure to `figures/paper/` as both PDF and PNG

Expected figure targets:

- `figures/paper/figure_01_cluster_diagnostics.pdf`
- `figures/paper/figure_03_strategy_status_counts.pdf`
- `figures/paper/figure_04_high_disagreement_topics.pdf`
- `figures/paper/figure_05_topic_cluster_polarity.pdf`
- `figures/paper/figure_06_gabriel_politics.pdf`
- `figures/paper/figure_07_gabriel_ratings.pdf`
- `figures/paper/figure_08_troll_topic_pockets.pdf`
- `figures/paper/figure_A1_cluster_tfidf.pdf`
- `figures/paper/figure_A2a_disagreement_direction_bar.pdf`
- `figures/paper/figure_A2b_disagreement_direction_scatter.pdf`
- `figures/paper/figure_A3_user_profiling.pdf`
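A small helper in the spirit of the dual-format save; the notebook's own helper names and directory handling may differ. The demo writes to a temporary directory standing in for `figures/paper/`.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def save_figure(fig, name: str, outdir: str) -> None:
    """Save one figure in both formats the paper pipeline expects."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for ext in ("pdf", "png"):
        fig.savefig(out / f"{name}.{ext}", bbox_inches="tight", dpi=300)

# Demo with a temporary directory standing in for figures/paper/.
outdir = tempfile.mkdtemp()
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
save_figure(fig, "figure_demo", outdir)
plt.close(fig)
```

Saving PDF and PNG in one call keeps the LaTeX source (PDF) and any slide or README usage (PNG) in sync by construction.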
### 7. Paper Compilation

Paper source: `Paper file/main.tex`

Compiling the paper requires the figure files and bibliography to already exist. Typical local compile sequence:

```sh
cd "Paper file"
latexmk -pdf main.tex
```

This should produce `Paper file/main.pdf`.
## Suggested Runtime Order for a New User

If you downloaded the external sampled data bundle:

1. Place `release-assets-bundle.zip` in the repository root and extract it.
2. Move `community-notes-x-data-bundle.tar.gz`, `master_sample.csv.gz`, and `SHA256SUMS.txt` from `release-assets/data-v1/` into the repository root.
3. Run `shasum -a 256 -c SHA256SUMS.txt` if you want an integrity check.
4. Run `gunzip master_sample.csv.gz`.
5. Extract `community-notes-x-data-bundle.tar.gz`.
6. Open `notebooks/01_clustering.ipynb` and run all cells.
7. Run `notebooks/02_scoring.ipynb`.
8. Run `notebooks/03_topics.ipynb`.
9. If you want the full paper replication, run `notebooks/05_gabriel_check.ipynb`.
10. Run `notebooks/06_plots.ipynb`.
11. Compile `Paper file/main.tex`.

If you want to rebuild everything from the raw platform exports:

1. Place `notes-00000.tsv`, `ratings-*.tsv`, and `noteStatusHistory-00000.tsv` in the repository root.
2. Run `notebooks/master sampling.ipynb`.
3. Confirm that `master_sample.csv` was created.
4. Continue with the modular notebooks in numeric order.
## Python Dependencies

This repository includes:

- `requirements.txt`: pinned package versions for the notebook workflow
- `.env.example`: local runtime template for environment variables, plus a commented record of the reference environment used during development

At minimum, the workflow depends on:

- `python`
- `pandas`
- `numpy`
- `scikit-learn`
- `matplotlib`
- `seaborn`
- `pyarrow` (or another parquet backend)
- `bertopic`
- `sentence-transformers`
- `umap-learn`
- `openai-gabriel`
- `jupyter`

Additional system dependencies may be needed for:

- LaTeX compilation (`latexmk` plus a TeX distribution)
- transformer model downloads
- OpenAI/Gabriel API calls

Recommended setup:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

If you want to use environment variables instead of hard-coding API keys in notebooks:

1. Copy `.env.example` to `.env`.
2. Fill in `OPENAI_API_KEY`.
3. Load it locally however you normally manage environment variables.
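For the last step, a minimal `.env` loader sketch follows, in case you prefer not to add a dependency; the `python-dotenv` package's `load_dotenv()` does the same job. The demo uses a temporary file standing in for `.env`.

```python
import os
import tempfile

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments, and
    export them without overwriting variables that are already set."""
    values = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
            os.environ.setdefault(key.strip(), value.strip())
    return values

# Demo with a temporary file standing in for .env.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("# reference environment\nOPENAI_API_KEY=sk-example\n")
    env_path = fh.name

env = load_env(env_path)
```

Using `os.environ.setdefault` means a key exported in your shell always wins over the `.env` file, which avoids surprises in CI or shared machines.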
## Limitations

- The analytical sample is intentionally structured rather than platform-representative.
- The full raw platform dataset is larger than the modular pipeline input and may not be available to all readers.
- The sampling stage depends on raw TSV files outside the modular notebook chain.
- The workflow is notebook-based, so execution order matters.
- Some parts of the LLM workflow rely on cached outputs and an external API.
- Model aliases such as `gpt-5-mini` may evolve over time, so exact LLM reruns may drift unless versions and run dates are documented.
- Exact ties in note selection can change outcomes unless tie-breaking is explicitly standardized.
- A pinned `requirements.txt` is provided, but OS-level and model-level differences can still affect exact reruns.