Introduction

This report replicates findings from the study “Community-Based Fact-Checking on Twitter’s Birdwatch Platform”, which explores how users collaboratively address misinformation on social media. Birdwatch, launched by Twitter in early 2021, invites users to add notes to tweets they consider misleading, offering corrections, context, or clarification. These notes are then rated by other users for helpfulness, enabling a community-driven approach to content moderation.

Our replication uses two key datasets released by Twitter:

  • Notes dataset with 9382 community-contributed notes, each providing a textual explanation and a classification of whether the tweet is misleading.
  • Ratings dataset with 45959 helpfulness ratings submitted by other users in response to these notes.

Both datasets span the period from January 23, 2021 to July 2021, matching the timeframe analyzed in the original Birdwatch paper.

This replication focuses on reproducing major figures and descriptive statistics from the original study using R. By doing so, we assess the robustness of the study’s insights and explore how design features like sourcing, clarity, and tone relate to perceived helpfulness in the context of misinformation reporting.
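
All figures were produced in R. As a minimal setup sketch (the TSV file names below are assumptions based on Twitter’s public Birdwatch data release and may differ from the exact files used), the code in the following sections assumes roughly the following packages and data:

# Packages used throughout this report
library(tidyverse)   # dplyr, ggplot2, stringr, readr
library(scales)      # percent() formatting for axis labels
library(recipes)     # standardization and dummy encoding for the regression
library(broom)       # tidy() extraction of model coefficients

# Load the two Birdwatch data exports (file names are assumed; adjust to the downloaded snapshot)
notes   <- read_tsv("notes-00000.tsv")
ratings <- read_tsv("ratings-00000.tsv")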



Figure 2. Trustworthy Sources by Classification

This figure shows how often notes marked as Misleading or Not Misleading included sources the writer believed to be trustworthy. The majority of Misleading notes linked to trustworthy sources, consistent with findings in the original Birdwatch paper that even notes flagging misinformation often cite credible material to support their claims.

# Recode values to readable labels for trustworthy sources and classification
notes %>%
  mutate(
    trustworthySources = factor(
      trustworthySources,
      levels = c(0, 1), # actual values
      labels = c("Not Trustworthy", "Trustworthy")
    ),
    classification = factor(
      classification,
      levels = c("NOT_MISLEADING", "MISINFORMED_OR_POTENTIALLY_MISLEADING"), # aactual values
      labels = c("Not Misleading", "Misleading")
    )) %>%

    # Count number of notes in each group
    group_by(classification,trustworthySources) %>% 
    summarise(total_num = n(), .groups ="drop") %>% 

    # Create bar chart showing number of notes by classification and source trustworthiness
    ggplot(aes(total_num, classification, fill = trustworthySources)) +
    geom_bar(stat="identity") +
    labs(
      title = "Notes by Classification and Trustworthy Sources",
      y = "Classification",
      x = "Number of Notes"
      ) +
    scale_fill_manual(values = c("Not Trustworthy" = "#ffc70e", "Trustworthy" = "#1b58e7"))



Figure 3. Reasons Users Flagged Tweets as Misleading

This figure shows how frequently each predefined reason was selected in Birdwatch notes explaining why a tweet might be misleading. The top three reasons — Factual error, Missing important context, and Unverified claim as fact — were selected far more often than others. This aligns with the Birdwatch paper’s findings that users primarily focus on correcting factual inaccuracy and contextual gaps rather than less common concerns like Satire or Manipulated media.

library(ggplot2)

# Select relevant columns
misleading_cols <- c(
  "misleadingFactualError",
  "misleadingMissingImportantContext",
  "misleadingUnverifiedClaimAsFact",
  "misleadingOutdatedInformation",
  "misleadingOther",
  "misleadingSatire",
  "misleadingManipulatedMedia"
)

# Tally up the 1s
misleading_counts <- colSums(notes[misleading_cols] == 1, na.rm = TRUE)

# Create a data frame for plotting
misleading_counts_df <- data.frame(
  Category = c(
    "Factual error",
    "Missing important context",
    "Unverified claim as fact",
    "Outdated information",
    "Other",
    "Satire",
    "Manipulated media"
  ),
  Count = misleading_counts
)

# Plot
ggplot(misleading_counts_df, aes(x = Count, y = reorder(Category, Count))) +
  geom_bar(stat = "identity", fill = "firebrick") +
  labs(
    x = "Number of Birdwatch Notes",
    y = NULL
  ) 



Figure 4. Reasons Users Believed Tweets Were Not Misleading

This figure shows the distribution of reasons users selected when explaining why a tweet was not misleading. Most notes cited the tweet as Factually correct, with much smaller numbers citing reasons like Personal opinion or Clearly satire. This pattern suggests users are more confident marking tweets as accurate when they align with objective facts, echoing the Birdwatch paper’s observation that factual accuracy plays a major role in users’ trust assessments.

# Select relevant columns
not_misleading_cols <- c(
  "notMisleadingFactuallyCorrect",
  "notMisleadingPersonalOpinion",
  "notMisleadingClearlySatire",
  "notMisleadingOther",
  "notMisleadingOutdatedButNotWhenWritten"
)

# Tally up the 1s
not_misleading_counts <- colSums(notes[not_misleading_cols] == 1, na.rm = TRUE)

# Create a data frame for plotting
not_misleading_counts_df <- data.frame(
  Category = c(
    "Factually correct",
    "Personal opinion",
    "Clearly satire",
    "Other",
    "Outdated but not wen written"
  ),
  Count = not_misleading_counts
)

# Plot
ggplot(not_misleading_counts_df, aes(x = Count, y = reorder(Category, Count))) +
  geom_bar(stat = "identity", fill = "navy") +
  labs(
    x = "Number of Birdwatch Notes",
    y = NULL
  ) 



Figure 5. CCDFs of Word Count in Explanations of Birdwatch Notes

This figure compares the complementary cumulative distribution functions (CCDFs) for word counts in Misleading and Not Misleading notes. Notes marked as Misleading tend to be longer, with heavier tails in the distribution, suggesting that users invest more effort when explaining why content is problematic. This mirrors the Birdwatch paper’s finding that critical notes often include detailed justifications, while supportive notes are more concise.
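
For reference, the CCDF plotted here (and again in Figure 7) is the share of notes whose value exceeds a given threshold, $\bar{F}(x) = \Pr(X > x) = 1 - F(x)$, where $X$ is the word count of a note’s explanation and $F$ is its empirical CDF. The code below computes this as 1 - cumsum(count) / sum(count) within each classification.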

notes %>%
  mutate(
    # Recode classification labels
    classification = factor(
      classification, 
      levels = c("NOT_MISLEADING", "MISINFORMED_OR_POTENTIALLY_MISLEADING"),
      labels = c("Not Misleading", "Misleading")
    ),

    # Replace URLs with placeholder so they count as one word
    summary_clean = str_replace_all(summary, "https?://[^\\s]+", "URL"),

    # Count number of words in each summary
    word_count = str_count(summary_clean, "\\w+")
  ) %>%
  
  # Count how many notes have each word count per classification
  group_by(classification, word_count) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(classification) %>%
  arrange(word_count) %>% 

  # Compute Complementary CDF (CCDF)
  mutate(ccdf = 1 - cumsum(count)/sum(count))%>% 

  # Plot the CCDF on log scale for y-axis
  ggplot(aes(word_count,ccdf, color = classification)) +
  geom_line() + 
  scale_y_log10() +
  
  labs(
      x = "Word count",
      y = "CCDF"
  ) +
  scale_color_manual(
  values = c("Not Misleading" = "blue", "Misleading" = "red")
)



Figure 7a. CCDF of Helpfulness Ratio for Birdwatch Notes

This figure shows the complementary cumulative distribution function (CCDF) of the helpfulness ratio (the proportion of raters marking a note helpful) for notes classified as Misleading (red) and Not Misleading (blue). Notes flagged as Misleading tend to have higher helpfulness ratios, with their CCDF curve lying to the right of that for Not Misleading notes. This aligns with the original paper’s finding that notes reporting misleading tweets tend to have a higher helpfulness ratio.

# Convert noteId to character to match formats between datasets
notes_clean <- notes %>% mutate(noteId = as.character(noteId))

# Merge notes and ratings data
notes_ratings <- left_join(notes_clean, ratings, by = "noteId")

# Compute CCDF of helpfulness ratio by classification
notes_ratings %>%  
  mutate(
    # recode classification labels for clarity
    classification = factor(
    classification, 
    levels = c("NOT_MISLEADING", "MISINFORMED_OR_POTENTIALLY_MISLEADING"),
    labels = c("Not Misleading", "Misleading")
    ),
    # Create binary helpfulness variable that accounts for both versions of helpfulness
    helpful_num = as.numeric(helpfulnessLevel %in% c("HELPFUL", "SOMEWHAT_HELPFUL") | helpful %in% c("1"))) %>% 
  select(noteId,helpful_num, classification) %>% 

  # Group by note and classification to calculate the helpfulness ratio
  group_by(noteId, classification) %>%
  summarise(helpfulness_ratio = mean(helpful_num, na.rm=TRUE),.groups = "drop") %>%

  # Count how many notes have each helpfulness ratio, by classification
  group_by(classification, helpfulness_ratio) %>%
  summarise(count = n(), .groups = "drop") %>%

  # Calculate Complementary CDF (CCDF) for each classification
  group_by(classification) %>%
  arrange(helpfulness_ratio) %>%
  mutate(ccdf = 1 - cumsum(count) / sum(count)) %>%

  # Plot CCDF of helpfulness ratio
  ggplot(aes(helpfulness_ratio, ccdf, color = classification)) +
  geom_line() + scale_y_continuous(labels = percent) +
  scale_color_manual(values = c("Not Misleading" = "blue", "Misleading" = "red")
)



Figure 7b. CCDF of Total Votes for Birdwatch Notes

This figure shows the complementary cumulative distribution function (CCDF) of the total number of helpfulness votes received by notes classified as Misleading (red) and Not Misleading (blue). Notes marked Not Misleading received more votes overall, likely because these notes are shorter and easier to rate quickly. This supports the Birdwatch paper’s suggestion that less effortful notes tend to accumulate more ratings, even if they’re less detailed.

# Compute CCDF of total votes per note by classification
notes_ratings %>%  
  # Recode classification labels for clarity
  mutate(
    classification = factor(
    classification, 
    levels = c("NOT_MISLEADING", "MISINFORMED_OR_POTENTIALLY_MISLEADING"),
    labels = c("Not Misleading", "Misleading")
    ),
    # Convert vote counts to numeric
    helpful = as.numeric(helpful),
    notHelpful = as.numeric(notHelpful)) %>%

    # Sum total votes (helpful + notHelpful) for each note
    group_by(noteId,classification)%>%
    summarise(total_votes = sum(helpful, na.rm = TRUE)+sum(notHelpful, na.rm = TRUE), .groups = "drop") %>%
    
    # Count how many notes have each total vote count, by classification
    group_by(classification, total_votes) %>%
    summarise(count = n(), .groups = "drop") %>%
    
    # Calculate CCDF, weighting each vote total by the number of notes that have it
    group_by(classification) %>%
    arrange(total_votes) %>%
    mutate(ccdf = 1 - cumsum(count) / sum(count)) %>%

    # Plot CCDF of total votes
    ggplot(aes(total_votes, ccdf, color = classification)) +
    geom_line() + scale_y_continuous(labels = percent) +
    scale_color_manual(values = c("Not Misleading" = "blue", "Misleading" = "red")
)



Figure 8. Helpful Attributes Cited in Birdwatch Note Ratings

This figure shows how often each checkbox option was selected by users when rating a note as helpful. The most commonly cited qualities were that the note was Clear, provided Good sources, and was Informative. Less frequently, users appreciated Empathy, Addressing the claim, and Unique context. These results highlight that clarity and sourcing are especially valued by raters, reflecting the Birdwatch paper’s emphasis on transparency and factual grounding in helpful notes.

# Select relevant columns
helpful_cols <- c(
  "helpfulInformative",
  "helpfulClear",
  "helpfulGoodSources",
  "helpfulEmpathetic",
  "helpfulUniqueContext",
  "helpfulAddressesClaim",
  "helpfulImportantContext",
  "helpfulOther"
)

# Tally up the 1s
helpful_counts <- colSums(ratings[helpful_cols] == 1, na.rm = TRUE)

# Create a data frame for plotting
helpful_counts_df <- data.frame(
  Category = c(
    "Informative",
    "Clear",
    "Good sources",
    "Empathetic",
    "Unique context",
    "Addresses claim",
    "Important context",
    "Other"
  ),
  Count = helpful_counts
)

# Plot
ggplot(helpful_counts_df, aes(x = Count, y = reorder(Category, Count))) +
  geom_bar(stat = "identity", fill = "navy") +
  labs(
    x = "Number of Ratings",
    y = NULL
  ) 



Figure 9. Unhelpful Attributes Cited in Birdwatch Note Ratings

This figure shows how often each checkbox option was selected by users when explaining why a note was not helpful. The top reasons were that the note was Missing key points, had Sources missing or unreliable, or contained Opinion, speculation, or bias. These responses suggest that raters are especially critical when notes lack evidence or completeness. The findings reinforce the Birdwatch paper’s emphasis on the importance of sourcing and neutrality in promoting perceived helpfulness.

# Select relevant columns
not_helpful_cols <- c(
    "notHelpfulSourcesMissingOrUnreliable",
    "notHelpfulOpinionSpeculationOrBias",
    "notHelpfulMissingKeyPoints",
    "notHelpfulArgumentativeOrBiased",  #inflammatory
    "notHelpfulIncorrect",
    "notHelpfulOffTopic",
    "notHelpfulOther",
    "notHelpfulHardToUnderstand",
    "notHelpfulSpamHarassmentOrAbuse",
    "notHelpfulOutdated", 
    "notHelpfulIrrelevantSources"
)

# Tally up the 1s
not_helpful_counts <- colSums(ratings[not_helpful_cols] == 1, na.rm = TRUE)

# Create a data frame for plotting
not_helpful_counts_df <- data.frame(
  Category = c(
    "Sources missing or unreliable",
    "Opinion speculation or bias",
    "Missing key points",
    "Argumentative or inflammatory",
    "Incorrect",
    "Off topic",
    "Other",
    "Hard to understand",
    "Spam harassment or abuse",
    "Outdated",
    "Irrelevant sources"
  ),
  Count = not_helpful_counts
)

# Plot
ggplot(not_helpful_counts_df, aes(x = Count, y = reorder(Category, Count))) +
  geom_bar(stat = "identity", fill = "firebrick") +
  labs(
    x = "Number of Ratings",
    y = NULL
  ) 



Figure 10. Regression Predictors of Helpfulness Ratio

This figure shows standardized regression coefficients predicting the helpfulness ratio of Birdwatch notes, with 99% confidence intervals. The results suggest that notes marked as Misleading, those citing Trustworthy sources, and longer notes (higher word count) are associated with higher helpfulness ratings. In contrast, verified accounts and follower/friend counts appear to have minimal predictive value in this context.

# Load the source tweet data; load() returns the name of the stored object
source_tweet_object <- load("source_tweets.Rdata")
source_tweet <- get(source_tweet_object)
#summary(source_tweet)
#formula(source_tweet)


# Clean and prepare the note/rating features to be merged with the source tweets
notes_ratings_w_factors <- notes_ratings %>%
    mutate(
      classification = factor(
        classification,
        levels = c("NOT_MISLEADING", "MISINFORMED_OR_POTENTIALLY_MISLEADING"),
        labels = c("Not Misleading", "Misleading")
      ),
      trustworthySources = as.factor(trustworthySources),
      noteId = as.numeric(noteId),
      # Unlike Figure 5, URLs are not collapsed to a single token before counting words
      word_count = str_count(summary, "\\w+"),
      helpful = as.numeric(helpful),        # using the v1 helpful/notHelpful rating columns
      notHelpful = as.numeric(notHelpful)) %>%
    group_by(noteId) %>%
    mutate(total_votes = sum(helpful, na.rm = TRUE) + sum(notHelpful, na.rm = TRUE)) %>%
    # Note-level helpfulness ratio: share of helpful votes among all votes for the note
    mutate(helpfulness_ratio = sum(helpful, na.rm = TRUE) / total_votes) %>%
    ungroup() %>%
    filter(!is.na(helpfulness_ratio)) %>%
    select(noteId, classification, helpfulness_ratio, word_count, trustworthySources, helpful)


#str(notes_ratings_w_factors)
source_w_notes_ratings <- left_join(notes_ratings_w_factors, source_tweet, by= "noteId")
#str(source_w_notes_ratings) 



# Set cutoff date for calculating account age
cutoff_date <- as.Date("2021-07-01")

# Additional preprocessing: account age and verified status of the source account
source_w_notes_ratings <- source_w_notes_ratings %>%
  mutate(
    account_age_days = as.numeric(difftime(cutoff_date, source_account_created_at, units = "days")),
    source_verified = factor(source_verified)
  )



# STEP 2: Standardization and encoding (DV is the rating-level binary 'helpful' vote)
rec <- recipe(helpful ~ classification + trustworthySources + word_count +
                source_followers_count + source_friends_count + source_verified + account_age_days,
              data = source_w_notes_ratings) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep()

df_std <- bake(rec, new_data = NULL)

# STEP 3: GLM
model <- glm(helpful ~ ., data = df_std, family = gaussian())

# STEP 4: Coefficients with 99% Confidence Intervals
model_summary <- tidy(model, conf.int = TRUE, conf.level = 0.99)


# STEP 5: Clean Labels
label_map <- c(
  "trustworthySources_X1" = "Trustworthy Sources",
  "word_count" = "Word Count",
  "source_followers_count" = "Followers Count",
  "source_friends_count" = "Followees Count",
  "source_verified_TRUE." = "Verified Account",
  "account_age_days" = "Account Age (Days)",
  "classification_Misleading" = "Classification: Misleading"
)
model_summary <- model_summary %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term_clean = recode(term, !!!label_map)
  )


# STEP 6: Visualization
model_summary %>%
  ggplot(aes(x = reorder(term_clean, estimate), y = estimate)) +
  geom_point(size = 3, color = "black") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3, color = "steelblue") +
  labs(
    title = "Standardized GLM Coefficients for DV: Helpfulness"
  ) +
  theme_minimal(base_size = 14)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

#table(source_w_notes_ratings$source_verified)



Conclusion

This replication study confirms key findings from the original Birdwatch paper, demonstrating that community-driven fact-checking is a viable strategy for addressing misinformation on social media. By analyzing notes and ratings from early 2021, the original paper showed that helpfulness ratings tend to surface high-quality community notes, that consensus is stronger on non-political topics, and that a small number of highly active contributors dominate the platform’s activity.

Since the publication of the original paper, Birdwatch has evolved into Community Notes on X (formerly Twitter). While the core principles remain—crowdsourced fact-checking and cross-perspective agreement—the platform now uses additional algorithms and visibility rules (e.g., requiring diverse perspectives for notes to be shown publicly). This evolution reflects both the promise and the challenge of decentralized fact-checking systems.

As the author originally noted, and our replication confirms, key limitations persist:

  1. Contributor imbalance: A few users still produce most notes, raising questions about representativeness and potential groupthink.

  2. Ideological polarization: Notes on political or controversial topics often fail to achieve consensus, especially when users rate them as “opinionated” or “argumentative.”

  3. Trust and perception: Community Notes, like Birdwatch, runs the risk of being misunderstood or distrusted by users who see them as biased or inconsistent.

Recommendations

Based on our replication and the evolution of Birdwatch into Community Notes on X, we suggest the following improvements:

  • Benchmark with Experts: Compare Community Notes with professional fact-checkers (e.g., PolitiFact, Snopes) to assess alignment in accuracy, speed, and scope.

  • Improve Contributor Diversity: Analyze ideological and demographic patterns in contributors to ensure a broader range of perspectives in note writing and rating.

  • Balance Participation: Reduce reliance on “super-contributors” by introducing fair incentive systems to encourage wider, more representative engagement.

  • Handle Political Content Transparently: Create clearer guidelines and consider showing multiple viewpoints for controversial topics rather than hiding notes that lack consensus.