Goal

The objective of this memo is to show summary statistics of the Inoculation against misinformation (IAM) experiment data (see Wiki; paper).

Sample

The experiment had 3 phases:

  • pre-survey

  • post-survey

  • follow-up survey

As reported in Table 2 of the paper (page 18), of the 25,287 participants who started the pre-survey, 22,526 completed the pre-survey, 8,684 completed the post-survey, and 5,316 completed the follow-up survey.

The sample considered below consists of 8,684 participants who completed the post-survey. For the primary outcome, “Misinfo Sharing rate”, 7,688 participants are considered, as they completed the post-survey and shared at least one non-misinformation post in the pre-survey.

Information from the follow-up survey is not currently considered.


Outcomes

Each participant was exposed to 9 posts in the pre-survey and 9 posts in the post-survey. In each survey, 3 of the 9 posts were non-misinformation and the remaining 6 were misinformation.

We consider 12 different outcomes. The equally weighted outcomes (7, 8, 11, and 12) are not currently included in the paper. For six of the outcomes we report values separately for the pre-survey and the post-survey: Misinfo posts, Non-misinfo posts, Sharing discernment score, Sharing discernment score with equal weighting, Accuracy discernment score, and Accuracy discernment score with equal weighting.

1. Primary outcome: Misinfo Sharing rate

\[ \textit{Misinfo Sharing rate}_i = \frac{\sum^3_{j = 1} [S^{M,post}_i(j)|S^{N,pre}_i(j) = 1]}{\sum^3_{j = 1} S^{N,pre}_i(j)} \]

The denominator is the number of non-misinformation posts that participant i shares in the pre-survey, out of the three they are shown. The numerator is the number of those posts' corresponding misinformation posts that participant i then shares in the post-survey.
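As a rough sketch, this rate could be computed from the post-level data loaded in the Data section below. The snippet assumes a hypothetical pair_id column linking each non-misinformation post to its corresponding misinformation post; the actual identifiers are documented in the data-dictionary memo.

# Sketch only: pair_id is a hypothetical identifier linking each pre-survey
# non-misinformation post to its corresponding post-survey misinformation post.
library(dplyr)

misinfo_sharing_rate <- long_treatments %>%
  # Non-misinformation posts shared in the pre-survey (denominator)
  filter(pre_post == "pre", post_is_misinfo == 0) %>%
  select(user, pair_id, shared_pre = share) %>%
  # Corresponding misinformation posts in the post-survey (numerator)
  inner_join(
    long_treatments %>%
      filter(pre_post == "post", post_is_misinfo == 1) %>%
      select(user, pair_id, shared_post = share),
    by = c("user", "pair_id")
  ) %>%
  filter(shared_pre == 1) %>%
  group_by(user) %>%
  summarise(misinfo_sharing_rate = mean(shared_post))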

2. Sharing rate

Proportion of posts shared out of the 9 posts shown. Values range from 0 to 1.

3. Misinfo Posts

Proportion of misinformation posts shared out of the 6 posts shown. Values range from 0 to 1.

4. Non-misinfo posts

Proportion of non-misinformation posts shared out of the 3 posts shown. Values range from 0 to 1.

5. Sharing Discernment Score

Difference between the number of non-misinformation posts and the number of misinformation posts a participant decides to share. Values range from -6 to 3.
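In the notation of the primary outcome above (dropping the pre/post superscript), where \(S^{N}_i(j)\) and \(S^{M}_i(k)\) indicate whether participant i shares the j-th non-misinformation and k-th misinformation post in a given survey, the score can be written as:

\[ \textit{Sharing Discernment}_i = \sum^{3}_{j = 1} S^{N}_i(j) - \sum^{6}_{k = 1} S^{M}_i(k) \]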

6. Post - Pre difference in Sharing Discernment Score

Difference in the Sharing Discernment Score between the post- and pre-survey. Values range from -9 to 9.

7. Sharing Discernment Score, equal weighting

Difference between the number of non-misinformation posts and the number of misinformation posts a participant decides to share, weighting non-misinformation posts double. Values range from -6 to 6.
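In the same notation, the equally weighted version doubles the non-misinformation term:

\[ \textit{Sharing Discernment (equal weighting)}_i = 2 \sum^{3}_{j = 1} S^{N}_i(j) - \sum^{6}_{k = 1} S^{M}_i(k) \]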

8. Post - Pre difference in Sharing Discernment Score, equal weighting

Difference in the Sharing Discernment Score with equal weighting between the post- and pre-survey. Values range from -12 to 12.

9. Accuracy Discernment Score

The score is based on answers to the question “To the best of your knowledge, how accurate is the claim in the above post?”. It is a weighted sum across the 9 posts shown to the participant, where the answers take the following weights:

\[ \text{If post is misinformation:} \begin{cases} -3 & \text{if responded "Very accurate"} \\ -1 & \text{if responded "Somewhat accurate"} \\ 1 & \text{if responded "Not very accurate"} \\ 3 & \text{if responded "Not at all accurate"} \end{cases} \]

\[ \text{If post is non-misinformation:} \begin{cases} 3 & \text{if responded "Very accurate"} \\ 1 & \text{if responded "Somewhat accurate"} \\ -1 & \text{if responded "Not very accurate"} \\ -3 & \text{if responded "Not at all accurate"} \end{cases} \]

Values range from -27 to 27.
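(For example, a participant who rates all 6 misinformation posts “Not at all accurate” and all 3 non-misinformation posts “Very accurate” scores 6 × 3 + 3 × 3 = 27, the maximum.)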

10. Post - Pre difference in Accuracy Discernment Score

Difference in the Accuracy Discernment Score between the post- and pre-survey. Values range from -54 to 54.

11. Accuracy Discernment Score, equal weighting

The score is defined analogously to the Accuracy Discernment Score, based on answers to the same question, but with the weights for non-misinformation posts doubled. It is a weighted sum across the 9 posts shown to the participant, where the answers take the following weights:

\[ \text{If post is misinformation:} \begin{cases} -3 & \text{if responded "Very accurate"} \\ -1 & \text{if responded "Somewhat accurate"} \\ 1 & \text{if responded "Not very accurate"} \\ 3 & \text{if responded "Not at all accurate"} \end{cases} \]

\[ \text{If post is non-misinformation:} \begin{cases} 6 & \text{if responded "Very accurate"} \\ 2 & \text{if responded "Somewhat accurate"} \\ -2 & \text{if responded "Not very accurate"} \\ -6 & \text{if responded "Not at all accurate"} \end{cases} \]

Values range from -36 to 36.
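(The maximum of 36 corresponds to 18 from the misinformation posts (6 × 3) plus 18 from the non-misinformation posts (3 × 6), so the two post types contribute equally to the score.)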

12. Post - Pre difference in Accuracy Discernment Score, equal weighting

Difference in the Accuracy Discernment Score with equal weighting between the post- and pre-survey. Values range from -72 to 72.


Data

In this script, we use two datasets.

intermediate_data_wide.csv: Contains sharing and perceived accuracy aggregated at the individual level. Includes treatment conditions and covariates.

intermediate_data_long_treatments.csv: Contains sharing and perceived accuracy at the individual-post level. Includes treatment conditions, covariates, and individual-aggregated outcomes.

For more information about how these datasets are generated and a data dictionary, see memo.


Summary statistics
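The code chunks below rely on the following packages (some are only called with the :: prefix). A minimal setup sketch; the packages may already be attached in a setup chunk elsewhere:

library(dplyr)      # piping, filter(), summarise(), ...
library(ggplot2)    # histograms and density plots
library(shiny)      # tabsetPanel() / tabPanel()
library(htmltools)  # tagList() to render the tabsets
library(plm)        # pdata.frame() for the variance decomposition
library(DT)         # interactive tables (DT::datatable)
library(xtsum)      # within/between standard deviations
library(scales)     # percent axis labels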


Standard error functions

# Standard error of the mean for a continuous variable: sqrt(var(x) / n)
se_cont = function(x, na.rm = FALSE) {
  if (na.rm) x = na.omit(x)
  sqrt(var(x) / length(x))
}

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_binary = function(x, na.rm = FALSE) {
  if (na.rm) x = na.omit(x)
  sqrt(mean(x) * (1 - mean(x)) / length(x))
}
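As a usage sketch (applicable once the wide data below are loaded, and assuming the columns referenced are numeric or can be coerced):

se_cont(wide$disc_pre, na.rm = TRUE)             # SE of the mean pre-survey Sharing Discernment score
se_binary(wide$att_check_pre == 1, na.rm = TRUE) # SE of the attention-check pass rate (coerced to logical)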

Load data

wide <- read.csv("intermediate_data_wide.csv")
long <- read.csv("intermediate_data_long.csv") %>% filter(pre_post != "followup")
long_treatments <- read.csv("../../main_analysis/intermediate_outcomes/intermediate_data_long_treatments.csv") %>% filter(pre_post != "followup")

Outcomes

Wide
# Summary table
summary_table <- function(data, continuous_vars, var_names, possible_range) {
  
  # Empty dataframe to store summary data frames
  summary_dataframe <- data.frame(Variable = character(),
                                   n = numeric(),
                                   Mean = numeric(),
                                   SD = numeric(), Min = numeric(),
                                   Q1 = numeric(), Median = numeric(),
                                   Q3 = numeric(), Max = numeric(),
                                   Possible_Range = character(),
                                   stringsAsFactors = FALSE)
  
  # Loop through each continuous variable
  for (i in seq_along(continuous_vars)) {
    var <- continuous_vars[i]
    var_name <- var_names[i]
    range <- possible_range[i]
    # Summary data frame for continuous variable
    summary_data <- data %>%
      summarise(
        Variable = var_name,
        n = n(),
        Mean = formatC(mean(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        SD = formatC(sd(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Min = formatC(min(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Q1 = formatC(quantile(data[[var]], 0.25, na.rm = TRUE), format = "f", digits = 3),
        Median = formatC(median(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Q3 = formatC(quantile(data[[var]], 0.75, na.rm = TRUE), format = "f", digits = 3),
        Max = formatC(max(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Possible_Range = range
      )
    
    # Append the summary data frame to the main dataframe
    summary_dataframe <- rbind(summary_dataframe, summary_data)
  }
  
  # Return the main dataframe with all summary data frames
  return(summary_dataframe)
}
summary_table_categorical <- function(data, categorical_vars, var_names, possible_range) {
  
  # Empty dataframe to store summary data frames
  summary_dataframe <- data.frame(Variable = character(),
                                   n = numeric(),
                                   Share = numeric(),
                                   Possible_Range = character(),
                                   stringsAsFactors = FALSE)
  
  # Loop through each categorical variable
  for (i in seq_along(categorical_vars)) {
    var <- categorical_vars[i]
    var_name <- var_names[i]
    range <- possible_range[i]
    # Summary data frame for categorical variable
    summary_data <- data %>%
      filter(as.character(.data[[var]]) == "1") %>%
      summarise(
        Variable = var_name,
        n = n(),
        Share = formatC(n() / nrow(data), format = "f", digits = 3),
        Possible_Range = range
      )
    
    # Append the summary data frame to the main dataframe
    summary_dataframe <- rbind(summary_dataframe, summary_data)
  }
  
  # Return the main dataframe with all summary data frames
  return(summary_dataframe)
}

# Specify the continuous variables
{
continuous_vars_pre_post <- c("share_diff",
                              "misinfo_diff",
                              "base_rate_diff",
                              "disc_diff",
                              "disc_rates_diff_eq",
                              "acc_disc_diff",
                              "acc_disc_diff_eq")
continuous_vars_names_pre_post <- c("Post - Pre Sharing rate",
                                    "Post - Pre Misinfo sharing rate",
                                    "Post - Pre Non-Misinfo sharing rate",
                                    "Post - Pre Sharing Discernment score",
                                    "Post - Pre Sharing Discernment score, equal weighting",
                                    "Post - Pre Accuracy Discernment score",
                                    "Post - Pre Accuracy Discernment score, equal weighting")
possible_range_prepost <- c("[-1,1]","[-1,1]","[-1,1]","[-9,9]","[-12,12]","[-54,54]","[-72,72]")

continuous_vars_pre <- c("share_pre",
                              "misinfo_pre",
                              "base_rate_pre",
                              "disc_pre",
                              "disc_score_pre_eq",
                              "acc_disc_pre",
                              "acc_disc_pre_eq")
continuous_vars_names_pre <- c("Pre Sharing rate",
                                    "Pre Misinfo Sharing rate",
                                    "Pre Non-Misinfo sharing rate",
                                    "Pre Sharing Discernment score",
                                    "Pre Sharing Discernment score, equal weighting",
                                    "Pre Accuracy Discernment score",
                                    "Pre Accuracy Discernment score, equal weighting")
possible_range_pre <- c("[0,1]","[0,1]","[0,1]","[-6,3]","[-6,6]","[-27,27]","[-36,36]")

continuous_vars_post <- c("share_post",
                              "misinfo_post",
                              "base_rate_post",
                              "disc_post",
                              "disc_score_post_eq",
                              "acc_disc_post",
                              "acc_disc_post_eq")
continuous_vars_names_post <- c("Post Sharing rate",
                                    "Post Misinfo Sharing rate",
                                    "Post Non-Misinfo sharing rate",
                                    "Post Sharing Discernment score",
                                    "Post Sharing Discernment score, equal weighting",
                                    "Post Accuracy Discernment score",
                                    "Post Accuracy Discernment score, equal weighting")
possible_range_post <- c("[0,1]","[0,1]","[0,1]","[-6,3]","[-6,6]","[-27,27]","[-36,36]")
}

# Create tables for the pre-survey, post-survey, and post - pre differences
pre_post <- summary_table(wide, continuous_vars_pre_post,continuous_vars_names_pre_post,possible_range_prepost)
pre <- summary_table(wide, continuous_vars_pre,continuous_vars_names_pre,possible_range_pre)
post <- summary_table(wide, continuous_vars_post,continuous_vars_names_post,possible_range_post)

# Create the tabset with the tables as tab content
tabset <- tabsetPanel(
    tabPanel("Pre", DT::datatable(pre, options = list(pageLength = 7), rownames = FALSE)),
    tabPanel("Post", DT::datatable(post, options = list(pageLength = 7), rownames = FALSE)),
    tabPanel("Pre - Post", DT::datatable(pre_post, options = list(pageLength = 7), rownames = FALSE))
)

# Render the tabset
htmltools::tagList(tabset)

Histograms and density plots


Pre-Survey

# Plot histograms and density plots
variables <- c("share_pre", "misinfo_pre", "base_rate_pre", 
               "disc_pre", "disc_score_pre_eq", 
               "acc_disc_pre", "acc_disc_pre_eq")

labels <- c("Pre Sharing rate",
            "Pre Misinfo sharing rate",
            "Pre Non-Misinfo sharing rate",
            "Pre Sharing Discernment score",
            "Pre Sharing Discernment score, equal weighting",
            "Pre Accuracy Discernment score",
            "Pre Accuracy Discernment score, equal weighting")

# Create a list to store the plots
plots <- list()

# Loop through each variable and create histograms and density plots
for (i in seq_along(variables)) {
  # Histogram plot
  hist_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram\n", labels[i]),
         x = labels[i], y = "Count") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Density plot
  density_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_density(aes(y = ..scaled..), alpha = 0.5, fill = "orange", adjust = 2) +
    labs(title = paste("Density Plot\n", labels[i]),
         x = labels[i], y = "Density") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- hist_plot
  plots[[length(plots) + 1]] <- density_plot
}

# Print each plot
invisible(lapply(plots, print))


Post-Survey

# Plot histograms and density plots
variables <- c("share_post", "misinfo_post", "base_rate_post", 
               "disc_post", "disc_score_post_eq", 
               "acc_disc_post", "acc_disc_post_eq")

labels <- c("Post Sharing rate",
            "Post Misinfo sharing rate",
            "Post Non-Misinfo sharing rate",
            "Post Sharing Discernment score",
            "Post Sharing Discernment score, equal weighting",
            "Post Accuracy Discernment score",
            "Post Accuracy Discernment score, equal weighting")

# Create a list to store the plots
plots <- list()

# Loop through each variable and create histograms and density plots
for (i in seq_along(variables)) {
  # Histogram plot
  hist_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram\n", labels[i]),
         x = labels[i], y = "Count") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Density plot
  density_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_density(aes(y = ..scaled..), alpha = 0.5, fill = "orange", adjust = 2) +
    labs(title = paste("Density Plot\n", labels[i]),
         x = labels[i], y = "Density") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- hist_plot
  plots[[length(plots) + 1]] <- density_plot
}

# Print each plot
invisible(lapply(plots, print))


Pre-Post

# Plot histograms and density plots
variables <- c("share_diff", "misinfo_diff", "base_rate_diff", 
               "disc_diff", "disc_rates_diff_eq", 
               "acc_disc_diff", "acc_disc_diff_eq")

labels <- c("Post - Pre Sharing rate",
            "Post - Pre Misinfo sharing rate",
            "Post - Pre Non-Misinfo sharing rate",
            "Post - Pre Sharing Discernment score",
            "Post - Pre Sharing Discernment score, equal weighting",
            "Post - Pre Accuracy Discernment score",
            "Post - Pre Accuracy Discernment score, equal weighting")

# Create a list to store the plots
plots <- list()

# Loop through each variable and create histograms and density plots
for (i in seq_along(variables)) {
  # Histogram plot
  hist_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram\n", labels[i]),
         x = labels[i], y = "Count") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Density plot
  density_plot <- ggplot(wide, aes(x = .data[[variables[i]]])) +
    geom_density(aes(y = ..scaled..), alpha = 0.5, fill = "orange", adjust = 2) +
    labs(title = paste("Density Plot\n", labels[i]),
         x = labels[i], y = "Density") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- hist_plot
  plots[[length(plots) + 1]] <- density_plot
}

# Print each plot
invisible(lapply(plots, print))


Summary statistics by treatment condition

# Function to create a summary table
summary_table_treatment <- function(data, continuous_vars, var_names, possible_range, period) {
  
  # Empty dataframe to store summary data frames
  summary_dataframe <- data.frame(Variable = character(),
                                   Period = character(),
                                   n = numeric(),
                                   Mean = numeric(),
                                   SD = numeric(), Min = numeric(),
                                   Q1 = numeric(), Median = numeric(),
                                   Q3 = numeric(), Max = numeric(),
                                   Possible_Range = character(),
                                   stringsAsFactors = FALSE)
  
  # Loop through each continuous variable
  for (i in seq_along(continuous_vars)) {
    var <- continuous_vars[i]
    var_name <- var_names[i]
    range <- possible_range[i]
    # Summary data frame for continuous variable
    summary_data <- data %>%
      summarise(
        Variable = var_name,
        Period = period,
        n = n(),
        Mean = formatC(mean(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        SD = formatC(sd(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Min = formatC(min(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Q1 = formatC(quantile(data[[var]], 0.25, na.rm = TRUE), format = "f", digits = 3),
        Median = formatC(median(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Q3 = formatC(quantile(data[[var]], 0.75, na.rm = TRUE), format = "f", digits = 3),
        Max = formatC(max(data[[var]], na.rm = TRUE), format = "f", digits = 3),
        Possible_Range = range
      )
    
    # Append the summary data frame to the main dataframe
    summary_dataframe <- rbind(summary_dataframe, summary_data)
  }
  
  # Return the main dataframe with all summary data frames
  return(summary_dataframe)
}

continuous_vars_names <- c("Sharing rate",
                           "Misinfo sharing rate",
                           "Non-Misinfo sharing rate",
                           "Sharing Discernment score",
                           "Sharing Discernment score, equal weighting",
                           "Accuracy Discernment score",
                           "Accuracy Discernment score, equal weighting")

summary_tables <- list()

# Loop over each treatment value
for (t in c(unique(wide$treatment),"All")) {
  
  # Filter the data for the current treatment
  if (t != "All") {
  filtered_data <- wide %>% dplyr::filter(treatment == t)
  } else {
  filtered_data <- wide
  }
  
  # Create a combined summary table for pre, post, and pre_post
  combined_summary <- rbind(
    summary_table_treatment(filtered_data, continuous_vars_pre, continuous_vars_names, possible_range_pre, "Pre"),
    summary_table_treatment(filtered_data, continuous_vars_post, continuous_vars_names, possible_range_post, "Post"),
    summary_table_treatment(filtered_data, continuous_vars_pre_post, continuous_vars_names, possible_range_prepost, "Pre-Post")
  )
  
  # Store the combined summary table in the list
  summary_tables[[t]] <- combined_summary
}

# Initialize an empty list to store tab panels
tab_panels <- list()

# Loop through each treatment in summary_tables
for (treatment in names(summary_tables)) {
    tab_name <- treatment
    tab_content <- DT::datatable(summary_tables[[treatment]], options = list(pageLength = 7), rownames = FALSE)
    tab_panels[[length(tab_panels) + 1]] <- tabPanel(tab_name, tab_content)
}

# Create the tabset with the tab panels
tabset <- tabsetPanel(!!!tab_panels)

# Render the tabset
htmltools::tagList(tabset)

Histograms and density plots

No-course Baseline

# Plot histograms and density plots
variables <- c("share_diff", "misinfo_diff", "base_rate_diff", 
               "disc_diff", "disc_rates_diff_eq", 
               "acc_disc_diff", "acc_disc_diff_eq")

labels <- c("Post - Pre Sharing rate",
            "Post - Pre Misinfo sharing rate",
            "Post - Pre Non-Misinfo sharing rate",
            "Post - Pre Sharing Discernment score",
            "Post - Pre Sharing Discernment score, equal weighting",
            "Post - Pre Accuracy Discernment score",
            "Post - Pre Accuracy Discernment score, equal weighting")

# Create a list to store the plots
plots <- list()

wide_control <- wide %>% filter(treatment == "No-course Baseline")

# Loop through each variable and create histograms and density plots
for (i in seq_along(variables)) {
  # Histogram plot
  hist_plot <- ggplot(wide_control, aes(x = .data[[variables[i]]])) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram\n", labels[i]),
         x = labels[i], y = "Count") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Density plot
  density_plot <- ggplot(wide_control, aes(x = .data[[variables[i]]])) +
    geom_density(aes(y = ..scaled..), alpha = 0.5, fill = "orange", adjust = 2) +
    labs(title = paste("Density Plot\n", labels[i]),
         x = labels[i], y = "Density") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- hist_plot
  plots[[length(plots) + 1]] <- density_plot
}

# Print each plot
invisible(lapply(plots, print))


Emotions

# Plot histograms and density plots
variables <- c("share_diff", "misinfo_diff", "base_rate_diff", 
               "disc_diff", "disc_rates_diff_eq", 
               "acc_disc_diff", "acc_disc_diff_eq")

labels <- c("Post - Pre Sharing rate",
            "Post - Pre Misinfo sharing rate",
            "Post - Pre Non-Misinfo sharing rate",
            "Post - Pre Sharing Discernment score",
            "Post - Pre Sharing Discernment score, equal weighting",
            "Post - Pre Accuracy Discernment score",
            "Post - Pre Accuracy Discernment score, equal weighting")

# Create a list to store the plots
plots <- list()

wide_emotions <- wide %>% filter(treatment == "Emotions")

# Loop through each variable and create histograms and density plots
for (i in seq_along(variables)) {
  # Histogram plot
  hist_plot <- ggplot(wide_emotions, aes(x = .data[[variables[i]]])) +
    geom_histogram(aes(y = ..count..), bins = 30, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram\n", labels[i]),
         x = labels[i], y = "Count") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Density plot
  density_plot <- ggplot(wide_emotions, aes(x = .data[[variables[i]]])) +
    geom_density(aes(y = ..scaled..), alpha = 0.5, fill = "orange", adjust = 2) +
    labs(title = paste("Density Plot\n", labels[i]),
         x = labels[i], y = "Density") +
    theme_minimal() + 
    theme(plot.title = element_text(size = 9.5),
          axis.text.x = element_text(size = 6))
  
  # Add plots to the list
  plots[[length(plots) + 1]] <- hist_plot
  plots[[length(plots) + 1]] <- density_plot
}

# Print each plot
invisible(lapply(plots, print))



Covariates

Covariate descriptions:

  • Age: Integer denoting age of participants
  • Man: Complement to ‘woman’ and ‘other’
  • High school or less: Including ‘Less than a high school diploma’ and ‘High school degree or equivalent’ (reference)
  • Some college: Including ‘Some college, no degree’ and ‘Associate degree’
  • Bachelor’s degree: Bachelor’s degree
  • Graduate degree: Including ‘Master’s degree’ and ‘Doctorate or professional degree’
  • Married: Complement to ‘Single,’ ‘Widowed,’ ‘Divorced,’ and ‘Separated’
  • Employed: Including ‘Employed full time,’ ‘Employed part time,’ and ‘Self-employed’
  • Unemployed: Including ‘Unemployed and currently looking for work,’ ‘Unemployed not currently looking for work,’ ‘Retired,’ ‘Homemaker,’ and ‘Unable to work’ (reference)
  • Student: Student
  • Mostly urban: Lives in a mostly urban area
  • Suburban: Lives in a suburban area
  • Mostly rural: Lives in a mostly rural area (reference)
  • Christian: Complement to ‘None,’ ‘Hinduism,’ ‘Muslim,’ ‘Traditionalist,’ and ‘Other’
  • Attends religious services: Frequency of attending religious services, including ‘Less than once a month,’ ‘One to three times per month,’ ‘Once a week,’ ‘More than once a week but less than daily,’ and ‘Daily’; complement to ‘Never’ (reference)
  • Uses social media: 1 if ‘Yes’ and 0 if ‘No’ (reference)
  • Passed att. check Pre: 1 if the user passed the attention check and 0 otherwise
  • Hours per day spent on social media: Integer denoting how many hours are spent on average each day on social media
  • Social media share 0-20%: Proportion of the content the participant sees on social media that they choose to share is between 0 and 20% (reference)
  • Social media share 20-40%: Proportion shared is between 20 and 40%
  • Social media share 40-60%: Proportion shared is between 40 and 60%
  • Social media share 60-80%: Proportion shared is between 60 and 80%
  • Social media share 80-100%: Proportion shared is between 80 and 100%
  • All posts: Pre-survey sharing, proportion of all posts shared
  • Non-misinformation posts: Pre-survey sharing, proportion of non-misinformation posts shared
  • Misinformation posts: Pre-survey sharing, proportion of misinformation posts shared

# Specify the continuous variables
{
continuous_vars <- c("age",
                     "social_media_hours")
continuous_vars_names <- c("Age",
                           "Hours spent on social media")
continuous_possible_range <- c("[18,99]","[0,24]")

categorical_vars <- c("gender_Man",
                     "education_High_school_or_less","education_Some_college","education_Bachelor_degree","education_Graduate_degree",
                     "marital_Married_or_in_a_domestic_partnership",
                     "employment_Employed","employment_Unemployed","employment_Student",
                     "location_Mostly_urban","location_Suburban","location_Mostly_rural",
                     "religion_Christian","religiosity_Attends",
                     "social_media_bin_Yes",
                     "att_check_pre",
                     "social_media_share_80_100","social_media_share_60_80","social_media_share_40_60","social_media_share_20_40","social_media_share_0_20")
categorical_vars_names <- c("Man",
                     "High school or less","Some college","Bachelor degree","Graduate degree",
                     "Married",
                     "Employed","Unemployed","Student",
                     "Mostly urban","Suburban","Mostly rural",
                     "Christian","Attends religious services",
                     "Uses social media",
                     "Passed attention check in pre",
                     "Shares 80-100% posts on social media","Shares 60-80% posts on social media","Shares 40-60% posts on social media","Shares 20-40% posts on social media","Shares 0-20% posts on social media")
categorical_possible_range <- c("[0,1]",
                                "[0,1]","[0,1]","[0,1]","[0,1]",
                                "[0,1]",
                                "[0,1]","[0,1]","[0,1]",
                                "[0,1]","[0,1]","[0,1]",
                                "[0,1]","[0,1]",
                                "[0,1]",
                                "[0,1]",
                                "[0,1]","[0,1]","[0,1]","[0,1]","[0,1]")
}

# Create tables for categorical and continuous
continuous <- summary_table(wide, continuous_vars,continuous_vars_names,continuous_possible_range)
categorical <- summary_table_categorical(wide, categorical_vars, categorical_vars_names, categorical_possible_range)
# Create the tabset with the tables as tab content
tabset <- tabsetPanel(
  tabPanel("Categorical", DT::datatable(categorical, options = list(pageLength = 7), rownames = FALSE)),
    tabPanel("Continuous", DT::datatable(continuous, options = list(pageLength = 2), rownames = FALSE))
)

# Render the tabset
htmltools::tagList(tabset)
# Calculate PDFs
age_density <- density(as.numeric(wide$age))
hours_density <- density(as.numeric(wide$social_media_hours))

# Create a data frame for plotting
age_pdf <- data.frame(x = age_density$x, y = age_density$y)
hours_pdf <- data.frame(x = hours_density$x, y = hours_density$y)

# Plot PDFs
# Age
pdf_plot <- ggplot() +
  geom_line(data = age_pdf, aes(x = x, y = y), color = "blue", linetype = "solid") +
  labs(x = "Value", y = "Density", title = "Age") +
  theme_minimal() +
  scale_color_manual(values = c("Age" = "blue")) +
  scale_linetype_manual(values = c("Age" = "solid"))

# Display the plot
print(pdf_plot)

# Hours spent on social media
pdf_plot <- ggplot() +
  geom_line(data = hours_pdf, aes(x = x, y = y), color = "red", linetype = "solid") +
  labs(x = "Value", y = "Density", title = "Hours spent on social media") +
  theme_minimal() +
  scale_color_manual(values = c("Hours spent on social media" = "red")) +
  scale_linetype_manual(values = c("Hours spent on social media" = "solid"))

# render the plot
print(pdf_plot)

# Simulated sample: for each binary covariate defined above, draw 1,000 Bernoulli
# variables with success probability equal to the empirical share of 1s in the wide data
# (same computation as the original column-by-column version)
data <- as.data.frame(
  lapply(wide[categorical_vars], function(x) rbinom(1000, 1, length(x[x == 1]) / length(x)))
)

# Loop through each categorical variable
for (i in seq_along(categorical_vars)) {
  var <- categorical_vars[i]
  var_name <- categorical_vars_names[i]
  
  # Calculate PMF
  pmf_data <- data %>%
    count(!!sym(var)) %>%
    mutate(proportion = n / sum(n))
  
  # Plot PMF
  pmf_plot <- ggplot(pmf_data, aes(x = !!sym(var), y = proportion)) +
    geom_bar(stat = "identity", fill = "navy") +
    labs(title = paste("", var_name), x = var_name, y = "Proportion") +
    theme_minimal() +
    theme(axis.title = element_text(size = 10)) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = c(0,1))
  
  print(pmf_plot)
}


Correlation matrix of covariates

The heatmap below shows that the correlation between covariates is not extreme. The largest non-trivial correlation is between Age and Married (0.40).
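The heatmap itself is produced elsewhere in the pipeline. A minimal sketch of how it could be reproduced from the covariates specified above (pairwise-complete correlations plotted as ggplot2 tiles):

# Sketch: correlation matrix of the continuous and categorical covariates defined above
cov_vars <- c(continuous_vars, categorical_vars)
# Coerce to numeric in case indicators are stored as factors/characters
cor_mat <- cor(sapply(wide[cov_vars], as.numeric), use = "pairwise.complete.obs")

# Long format for plotting
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6),
        axis.text.y = element_text(size = 6)) +
  labs(x = NULL, y = NULL, title = "Correlation matrix of covariates")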


Pre-Post Correlation

Wide
prepost_correlation <- function(data, pre_vars, post_vars, var_names, possible_range) {
  
  # Empty dataframe to store summary data frames
  prepost_correlation_dataframe <- data.frame(Variable = character(),
                                   Correlation = numeric(),
                                   stringsAsFactors = FALSE)
  
  # Loop through each pair of pre and post variables
  for (i in seq_along(pre_vars)) {
    pre_var <- pre_vars[i]
    post_var <- post_vars[i]
    var_name <- var_names[i]
    range <- possible_range[i]
    
    # Calculate correlation between pre and post variables
    correlation <- formatC(cor(data[[pre_var]], data[[post_var]], use = "complete.obs"), format = "f", digits = 3)
    
    # Data frame for the pre-post correlation
    prepost_correlation_data <- data.frame(Variable = var_name, Correlation = correlation)
    
    # Append the pre-post correlation data frame to the main dataframe
    prepost_correlation_dataframe <- rbind(prepost_correlation_dataframe, prepost_correlation_data)
  }
  
  # Return the main dataframe with all pre-post correlation data frames
  return(prepost_correlation_dataframe)
}

continuous_vars_pre <- c("share_pre",
                              "misinfo_pre",
                              "base_rate_pre",
                              "disc_pre",
                              "disc_score_pre_eq",
                              "acc_disc_pre",
                              "acc_disc_pre_eq")
continuous_vars_post <- c("share_post",
                              "misinfo_post",
                              "base_rate_post",
                              "disc_post",
                              "disc_score_post_eq",
                              "acc_disc_post",
                              "acc_disc_post_eq")
var_names <- c("Sharing rate",
               "Misinfo Sharing rate",
               "Non-Misinfo sharing rate",
               "Sharing Discernment score",
               "Sharing Discernment score, equal weighting",
               "Accuracy Discernment score",
               "Accuracy Discernment score, equal weighting")

possible_range <- rep("[-1,1]", 7)

correlation_prepost_correlation_control <- prepost_correlation(wide %>% filter(treatment == "No-course Baseline"), continuous_vars_pre, continuous_vars_post, var_names, possible_range)
correlation_prepost_correlation_facts <- prepost_correlation(wide %>% filter(treatment == "Facts Baseline"), continuous_vars_pre, continuous_vars_post, var_names, possible_range)
correlation_prepost_correlation_emotions <- prepost_correlation(wide %>% filter(treatment == "Emotions"), continuous_vars_pre, continuous_vars_post, var_names, possible_range)
correlation_prepost_correlation_reasoning <- prepost_correlation(wide %>% filter(treatment == "Reasoning"), continuous_vars_pre, continuous_vars_post, var_names, possible_range)
correlation_prepost_correlation_combo <- prepost_correlation(wide %>% filter(treatment == "Combo"), continuous_vars_pre, continuous_vars_post, var_names, possible_range)
correlation_prepost_correlation <- prepost_correlation(wide, continuous_vars_pre, continuous_vars_post, var_names, possible_range)

# Create the tabset with the tables as tab content
tabset <- tabsetPanel(
  tabPanel("No-Course Baseline", DT::datatable(correlation_prepost_correlation_control, options = list(pageLength = 7), rownames = FALSE)),
  tabPanel("Facts Baseline", DT::datatable(correlation_prepost_correlation_facts, options = list(pageLength = 7), rownames = FALSE)),
  tabPanel("Emotions", DT::datatable(correlation_prepost_correlation_emotions, options = list(pageLength = 7), rownames = FALSE)),
  tabPanel("Reasoning", DT::datatable(correlation_prepost_correlation_reasoning, options = list(pageLength = 7), rownames = FALSE)),
  tabPanel("Combo", DT::datatable(correlation_prepost_correlation_combo, options = list(pageLength = 7), rownames = FALSE)),
  tabPanel("All", DT::datatable(correlation_prepost_correlation, options = list(pageLength = 7), rownames = FALSE))
)

# Render the tabset
htmltools::tagList(tabset)


Variance decomposition

Within and between user standard deviation and variance.

Within: standard deviation and variance of \(x_{ip} - \bar{x}_i + \bar{\bar{x}}\)

Between: standard deviation and variance of \(\bar{x}_i\)

where \(i\) indexes individuals, \(p\) indexes posts, \(\bar{x}_i\) is the mean for individual \(i\), and \(\bar{\bar{x}}\) is the global mean of the variable (share).
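The tables below use the xtsum package. As a sanity-check sketch, the definitions above can also be computed directly with dplyr for the pooled sample (degrees-of-freedom conventions may differ slightly from xtsum):

# Manual within/between SDs for `share` in the pooled sample (sketch)
grand_mean <- mean(long_treatments$share, na.rm = TRUE)

# Between: SD of the user-level means
between_sd_manual <- long_treatments %>%
  group_by(user) %>%
  summarise(user_mean = mean(share, na.rm = TRUE)) %>%
  summarise(between_sd = sd(user_mean)) %>%
  pull(between_sd)

# Within: SD of the observations demeaned by user (plus the global mean)
within_sd_manual <- long_treatments %>%
  group_by(user) %>%
  mutate(within_dev = share - mean(share, na.rm = TRUE) + grand_mean) %>%
  ungroup() %>%
  summarise(within_sd = sd(within_dev, na.rm = TRUE)) %>%
  pull(within_sd)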


# Function to create a variance summary table
variance_summary_table <- function(data, treatment, period, share_type) {


  # Filter data based on the period and share type
  if (period != "pre and post") {
    data <- data %>% filter(pre_post == period)
  }
  
  if (share_type == "misinfo share") {
    data <- data %>% filter(post_is_misinfo == 1)
  } else if (share_type == "non-misinfo share") {
    data <- data %>% filter(post_is_misinfo == 0)
  }
  
  # Define data as panel
  data <- pdata.frame(data, index = c('user'))
  
  within_sd <- formatC(xtsum::within_sd(data, 'share', 'user'), format = "f", digits = 4)
  within_variance <- formatC(xtsum::within_sd(data, 'share', 'user')^2, format = "f", digits = 4)
  between_sd <- formatC(xtsum::between_sd(data, 'share', 'user'), format = "f", digits = 4)
  between_variance <- formatC(xtsum::between_sd(data, 'share', 'user')^2, format = "f", digits = 4)
  
  summary_table <- data.frame(
    Outcome = share_type,
    Treatment = treatment,
    Period = period,
    Within_user_Standard_Deviation = within_sd,
    Within_user_Variance = within_variance,
    Between_user_Standard_Deviation = between_sd,
    Between_user_Variance = between_variance
  )
  
  return(summary_table)
}

# Initialize an empty list to store the variance summary tables
variance_summary_tables <- list()

treatments <- c("No-course Baseline", "Facts Baseline", "Emotions", "Reasoning", "Combo", "All")
periods <- c("pre", "post", "pre and post")
share_types <- c("share", "misinfo share", "non-misinfo share")

# Loop over each combination of treatment, period, and share type
for (t in treatments) {
  
  # Empty dataframe to store all combinations for the current treatment
  all_combinations <- NULL
  
  for (p in periods) {
    for (s in share_types) {
      if (t != "All") {
        # Filter the data for the current treatment
        filtered_data <- long_treatments %>% filter(treatment == t)
      } else {
        filtered_data <- long_treatments
      }
      
      # Create a variance summary table for the current combination
      variance_summary <- variance_summary_table(filtered_data, t, p, s)
      
      # Store the variance summary table in all_combinations dataframe
      all_combinations <- rbind(all_combinations, variance_summary)
    }
  }
# Store the all_combinations dataframe in the list with treatment name as key
  variance_summary_tables[[t]] <- all_combinations
}

# Initialize an empty list to store tab panels
variance_tab_panels <- list()

# Loop through each key in variance_summary_tables
for (treatment in names(variance_summary_tables)) {
    tab_name <- treatment
    tab_content <- DT::datatable(variance_summary_tables[[treatment]], options = list(pageLength = 9), rownames = FALSE)
    variance_tab_panels[[length(variance_tab_panels) + 1]] <- tabPanel(tab_name, tab_content)
}
# Create the tabset with the variance tab panels
variance_tabset <- tabsetPanel(!!!variance_tab_panels)

# Render the tabset
htmltools::tagList(variance_tabset)