Purpose

The purpose of this script is to clean the raw FB Misinformation Chatfuel data, document key decisions made, and generate two clean datasets for further analysis.

The key outputs are as follows:

  1. A long-form dataset, misinfo_clean_wide.csv
  • This dataset contains one row per respondent. It includes all the responents who consented to the study and passed data-quality checks (such as being from one of the eligible countries), regardless of whether they completed the intervention or the quiz. For the users who did not complete the intervention or the quiz, the relevant variables are set to missing.
  1. A wide-form dataset, misinfo_clean_long.csv
  • This dataset contains 5 rows per respondent, one for each post (excluding attention check) that the user saw. It includes only the respondents who completed the intervention and the quiz, and passed data-quality checks (such as there being no missing values for the quiz questions).

Summary

The raw Chatfuel dataset was last pulled on 02/23/2024 and processed by the anonymization script. The anonymized version of dataset contains the variable analytic_id as the unique identifier for each respondent. We filtered respondents based on the following criteria:

  • The eligibility criteria for respondents were determined by their participation in two iterations of Phase 1 and three iterations of Phase 2. During Phase 1, participants interacted with one of two versions of the chatbot, labeled as ExperimentLaunchV1 and ExperimentLaunchV2. In Phase 2, participants were exposed to one of three versions of the chatbot, named Phase2, Phase2_V2, and Phase2_V3. Chatbot users who were deemed ineligible were marked as a missing value under the NewVersion variable and subsequently removed from the dataset. Additionally, respondents who took part in technical pilots were excluded from the dataset.

  • ExperimentLaunchV1 corresponds to the time in Phase 1 before we reduced incentives in South Africa and Nigeria. ExperimentLaunchV2 corresponds to the time in Phase 1 after we reduced incentives in South Africa and Nigeria.

  • Phase2 corresponds to the time in Phase 2 before we fixed the issue with randomization (which was causing correct marginal distributions of arm and payment but incorrect joint distribution). Phase2_V2 corresponds to the time in Phase 2 after we fixed the issue with randomization. Phase2_V3 then corresponds to the time after we removed the $1 and $3 payment options in Phase 2.

  • We excluded participants who started before the experiment (11/16/2023) and after the end of experiment (12/31/2023)

  • The eligibility of respondents is based on the four countries: Nigeria, Kenya, Ghana, and South Africa. Respondents who are not from these countries were dropped from the raw dataset.

  • Early in Phase 1 of the experiment, before we implemented a fix, some users were able to do the chatbot multiple times. After the fix many users continued trying to enter the chatbot for the second time but were turned away. We made a clear distinction between two types of respondents: attempted_repeaters and actual_repeaters. attempted_repeaters refers to those respondents who attempted to interact with the chatbot more than once but were automatically prevented from doing so. On the other hand, actual_repeaters are those respondents who managed to participate in more than one treatment arm before being identified and removed from the dataset.

  • We ensure uniqueness of observations on several places. First, in the anonymization script we ensure each row corresponds to a unique chatfuel user id. Second, we checked for the number of unique analytic_id and the number of observations in the dataset. We confirm that the number of unique analytic_id is equal to the number of observations in the dataset, so each observation is indeed unique.

  • We checked for a series of anomalies in the raw dataset and found the following:

    1. Phase 1 participants having Phase 2 specific attributes and vice versa: We noticed some participants in Phase 1 being assigned to the long baseline group and some participants in Phase 2 being assigned to the video intervention group. We dropped these participants from the dataset.

    2. Negative time duration: We noticed about 82 respondents (<0.10 %) having a negative time duration for the amount of time completing the treatment intervention or the amount of time completing the quiz. Upon inspection, it is likely due to a bug in Chatfuel. We dropped these completers from the dataset.

    3. Missing values: quiz completers are defined as those who have a timestamp in the misinfoQuiz_end_time. Logically, these participants must have recorded entries for all of the misinfo quiz attributes, including reliable_score, manipulative_score, and share_score. However, we noticed, in some instances, that there missing values for these variables. These are concentrated in Phase 1 and caused by the fact that we initially did not enforce that users select from one of the pre-defined options for the quiz questions. This affected ~1000 observations. A second, much smaller, set of missing values is due to a bug in Chatfuel. This affects less than 100 users in total. For both sets of users we set all of their quiz-related outcomes to missing.

    4. Participants with wrong or missing posts: we drop ~100 partitipants with missing post_id in any post and two partiticpant with post_id’s corresponding to pilot version of the quiz.

Loading the anonymized dataset for cleaning

Data

  • The raw dataset was last pulled on 02/23/2024 and stored at ~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/

  • The anonymized dataset is stored in Github directory ~fb_misinfo_interventions/data/chatfuel/raw

# file path `~fb_misinfo_interventions/data/chatfuel/raw`
data <- fread("./data/chatfuel/raw/fbmisinfo_anon.csv.gz")

Cleaning phases

  • Looking at respondents across different phases and generating a phase identifier (phase_coded)
    1. Respondents who were in Version ExperimentLaunchV1 or ExperimentLaunchV2 as Phase 1
    2. Respondents who were in Version Phase2, Phase2_V2 or Phase2_V3 as Phase 2
  • Blank values represent participants who accessed the chatbot but did not meet the eligibility criteria to proceed further. These individuals engaged with the Messenger popup, which resulted in their time being logged. However, these participants were subsequently removed from the dataset.
total_count <- data %>% 
  count(NewVersion) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(NewVersion) %>%
  rbind(data.frame(NewVersion = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%") 
Phase Count
68462
ExperimentLaunchV1 9904
ExperimentLaunchV2 30604
Final_Technical_Pilot 181
Misinfo_Pilot 287
Phase2 8057
Phase2_V2 27276
Phase2_V3 15512
Total 160283
# generating coded phase 1 and 2 variables and dropping respondents not in Phase 1 or Phase 2
data$phase_coded <-data$NewVersion 
data$phase_coded[data$phase_coded=="ExperimentLaunchV1" | data$phase_coded=="ExperimentLaunchV2"] = "Phase 1"
data$phase_coded[data$phase_coded=="Phase2" | data$phase_coded=="Phase2_V2" | data$phase_coded=="Phase2_V3"] = "Phase 2"

data = data |> mutate(
  subphase = case_when(
    phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) <= as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1A",
    phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) > as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1B",
    NewVersion == "Phase2" ~ "2A",
    NewVersion == "Phase2_V2" ~ "2B",
    NewVersion == "Phase2_V3" ~ "2C"
  )
)

total_count <- data %>% 
  count(phase_coded) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(phase_coded) %>%
  rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%") 
Phase Count
68462
Final_Technical_Pilot 181
Misinfo_Pilot 287
Phase 1 40508
Phase 2 50845
Total 160283
a= as.character(length(unique(data$`analytic_id`)))


if (nrow(data) != length(unique(data$analytic_id))) {
  stop("The number of rows does not match the number of unique analytic_id values.")
} else {
  # Continue 
  print("Unique observations check passed.")
}
## [1] "Unique observations check passed."

Checking for uniqueness

  • The count of unique analytic_id is 160283.

  • The number of unique analytic_id is equal to the number of total observation, so there is no repetition of analytic_id. Each observation is indeed unique.

Dropping respondents not in Phase 1 or Phase 2

data <- data %>% 
  filter(phase_coded == "Phase 1" | phase_coded == "Phase 2")

b <- as.character(length(unique(data$`analytic_id`)))
total_count <- data %>% 
  count(phase_coded) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(phase_coded) %>%
  rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%") 
Phase Count
Phase 1 40508
Phase 2 50845
Total 91353
  • The number of observations before dropping is 160283, and after dropping the number of observations is 91353

Dropping participants started before the experiment started and started after the experiment ended

before_and_after <- data %>% 
  select(analytic_id, MisinfoChat_start_time) %>%
  mutate(start_day = as.Date(MisinfoChat_start_time)) %>%
  filter(start_day < as.Date("2023-11-16") | start_day > as.Date("2024-01-01"))

before_drop_starttime <- as.character(nrow(data))

data <- data %>% 
  filter(!(analytic_id %in% before_and_after$analytic_id))

after_drop_starttime <- as.character(nrow(data))
  • We noticed that there are 173 participants who started before the experiment started (11/16/2023) and after the end of experiment (12/31/2023). We will drop these participants from the dataset.
  • Before dropping, the number of observations is 91353, and after dropping the number of observations is 91180

Identifying attempted and actual repeaters

#check for entries with multiple treatment arms
data_repeaters <- data %>%
    select(analytic_id, phase_coded,
           repeater,
           arm, 
           InterArms_start_time, InterArmsVideo_end_time, # video start and end time
           InterArmsLongBase_start_time, InterArmsLongBase_end_time, #long baseline start,end time
           InterArmsBase_start_time, InterArmsBase_end_time, #short baseline start,end time
           InterArmsSMS_start_time, InterArmsSMS_end_time, #sms start,end time
           InterArmsGame_start_time, interArmsGame_end_time, # game start and end time
           MisinfoQuiz_start_time) %>%
    mutate(has_longbase_time = if_else(!is.na(InterArmsLongBase_start_time) | !is.na(InterArmsLongBase_end_time), TRUE, FALSE),
           has_video_time = if_else(!is.na(InterArms_start_time) | !is.na(InterArmsVideo_end_time), TRUE, FALSE),
           has_game_time = if_else(!is.na(InterArmsGame_start_time) | !is.na(interArmsGame_end_time), TRUE, FALSE),
           has_shortbase_time = if_else(!is.na(InterArmsBase_start_time) | !is.na(InterArmsBase_end_time), TRUE, FALSE),
           has_sms_time = if_else(!is.na(InterArmsSMS_start_time) | !is.na(InterArmsSMS_end_time), TRUE, FALSE),
           repeated = has_longbase_time + has_shortbase_time + has_sms_time + has_video_time +has_game_time > 1) %>%
    select(analytic_id, arm, repeater, repeated, phase_coded, MisinfoQuiz_start_time)

data_repeaters = data_repeaters |> 
    mutate(
      actual_repeater = repeated,
      attempted_repeater = case_when(
        repeater == 1 & repeated == 0 ~ TRUE,
        TRUE ~ FALSE
      ),
      repeater = actual_repeater | attempted_repeater
    )

# Count repeaters and attempted repeaters by phase
data_repeaters |> group_by(phase_coded) |> 
  summarize(
    `Actual Repeaters` = sum(actual_repeater),
    `Attempted Repeaters` = sum(attempted_repeater)
  ) |>
  mutate(
    `Total` = `Actual Repeaters` + `Attempted Repeaters`
  ) |>
  kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Phase Actual Repeaters Attempted Repeaters Total
Phase 1 3429 9640 13069
Phase 2 29 9982 10011
  • Repeaters are classified as attempted repeaters or actual repeaters.

  • We made a clear distinction between two types of respondents: attempted_repeaters and actual_repeaters. attempted_repeaters refers to those respondents who attempted to interact with the chatbot more than once but were automatically prevented from doing so. On the other hand, actual_repeaters are those respondents who managed to participate in more than one treatment arm before being identified and removed from the dataset.

  • The actual repeaters are concentrated in Phase 1. When we noticed this was an issue we implemented a fix where the chatbot would prevent users from entering the chatbot for the second time.

  • We identified actual_repeaters by checking the time stamps of the different treatment arms. If a respondent had a time stamp for more than one treatment arm, they were classified as actual_repeaters.

Excluding repeaters

data <- left_join(
  data,
  data_repeaters %>% select(analytic_id, actual_repeater, attempted_repeater),
  by = "analytic_id",
  relationship = "one-to-one",
  unmatched = "error"
  ) %>%
    filter(!actual_repeater)
data |> group_by(phase_coded) |> 
  summarize(
    `Actual Repeaters` = sum(actual_repeater),
    `Attempted Repeaters` = sum(attempted_repeater)
  ) |>
  mutate(
    `Total` = `Actual Repeaters` + `Attempted Repeaters`
  ) |>
  kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Phase Actual Repeaters Attempted Repeaters Total
Phase 1 0 9640 9640
Phase 2 0 9982 9982
c = as.character(length(unique(data$`analytic_id`)))
  • Before dropping the actual repeaters, we had 91353 observations. After dropping, we have 87722 observations.

Cleaning treatment arms variable

  • In Phase 1, there are 4 groups: baseline, sms, video, and game, while in Phase 2, there are 3 groups: short baseline, long baseline, and sms.

  • Variable arm_coded is the cleaned treatment arm variable.

    1. For Phase 1, the treatment variable arm_coded includes Original Baseline, SMS, Video, and Game.
    2. For Phase 2, the treatment variable arm_coded includes Original Baseline, Long Baseline, and SMS.

Checking the raw treatment arm variable

  • Checking the raw arm variable, we have the following entries: baseline, game, longbaseline, sms, video

  • The empty cell indicates participants who have missing values in the treatment variables. These respondents dropped off before and at the treatment arm assignment.

# checking possible treatment responses 
data %>% 
  count(arm) %>%
  kable("html", col.names=(c("Treatment Arm", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%") 
Treatment Arm Count
16998
baseline 13705
game 7480
longbaseline 14159
sms 27841
video 7539
data %>% group_by(phase_coded) %>%
  count(arm)%>%
  kable("html", col.names=(c("Phase", "Treatment Arm", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%") 
Phase Treatment Arm Count
Phase 1 6915
Phase 1 baseline 7452
Phase 1 game 7480
Phase 1 longbaseline 2
Phase 1 sms 7553
Phase 1 video 7538
Phase 2 10083
Phase 2 baseline 6253
Phase 2 longbaseline 14157
Phase 2 sms 20288
Phase 2 video 1
  • We noticed one participant in Phase 2 who had arm == video and two participants in Phase 1 having arm == longbaseline. We’ll drop these participants from the dataset.

Dropping participants with incorrect treatment arm

data <- data %>% filter(!(arm == "video" & phase_coded == "Phase 2"))
data <- data %>% filter(!(arm == "longbaseline" & phase_coded == "Phase 1"))
c1 <- as.character(length(unique(data$`analytic_id`)))
  • Before dropping we had 87722 observations. After dropping, we have 87719 observations.

Recoding treatment arm variable

# Recoding treatment arm variable 
data <- data %>%
  mutate(
    arm_coded = case_when(
      arm == "baseline" & phase_coded == "Phase 1" ~ "Original Baseline",
      arm %in% c("SMS", "sms") ~ "SMS",
      arm == "video" & phase_coded == "Phase 1" ~ "Video",
      arm == "game" & phase_coded == "Phase 1" ~ "Game",
      arm == "baseline" & phase_coded == "Phase 2" ~ "Original Baseline",
      arm == "longbaseline" & phase_coded == "Phase 2" ~ "Long Baseline",
      TRUE ~ "None"
    )
  )

Count of Participants in Treatment Arm across Phase 1 and Phase 2

data %>%
  group_by(phase_coded, arm_coded) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = arm_coded, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded,
  `Original Baseline`, `Game`, `SMS`, `Video`, `Long Baseline`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase Original Baseline Game SMS Video Long Baseline Total
Phase 1 7452 7480 7553 7538 0 36938
Phase 2 6253 0 20288 0 14157 50781

Cleaning user country

  • Variable user_country indicates where the respondents come from

  • The coded version of user_country is country_coded

  • Responses are messy with some random links and emojis which are treated as missing values

  • To proceed, the chatbot required that users select from a pre-defined list of countries or enter one of the options manually. In the later case, the capitalization of the country name was not enforced.

  • Participants who are not from either Kenya, South Africa, Nigeria, or Ghana are dropped from the dataset

Checking possible country answers

data$user_country <- data$user_country
data %>%
  count(user_country)
##         user_country     n
##               <char> <int>
##   1:                  8888
##   2:        ,nigeria     1
##   3: .\nSouth Africa     1
##   4:  . south Africa     1
##   5:              ..     1
##  ---                      
## 216:  you are stupid     1
## 217:           𝒢𝒽𝒶𝓃𝒶     1
## 218:               😊     1
## 219:            😡😡😡😡     1
## 220:               🤔     1

Cleaning and renaming country responses

# rewriting responses

data = data |> mutate(
  country_coded = case_when(
    tolower(user_country) == "nigeria" ~ "Nigeria",
    tolower(user_country) == "kenya" ~ "Kenya",
    tolower(user_country) == "south africa" ~ "South Africa",
    tolower(user_country) == "ghana" ~ "Ghana",
    TRUE ~ "Other"
  )
)

data %>%
  group_by(phase_coded, country_coded) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = country_coded, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `Ghana`, `Kenya`, `Nigeria`, `South Africa`, `Other`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase Ghana Kenya Nigeria South Africa Other Total
Phase 1 4404 4278 14168 9691 4397 36938
Phase 2 4668 6117 20461 14107 5428 50781

Dropping participants who are not from either Kenya, South Africa, Nigeria, or Ghana

  • Respondents who are dropped passed the consent checkpoint but did not come from either four of our countries of interest

  • ineligible_country is a variable that indicates whether the respondent is from a country other than Kenya, South Africa, Nigeria, or Ghana

data = data |> mutate(
  is_ineligible_country = country_coded == "Other"
)

# Calculate the number of unique participants before and after dropping ineligible countries
c2_before = as.character(length(unique(data$`analytic_id`)))

n_other = (data$is_ineligible_country) |> sum()

data = data |> filter(!is_ineligible_country)

c2 = as.character(length(unique(data$`analytic_id`)))
  • We dropped 9825 participants who are not from either Kenya, South Africa, Nigeria, or Ghana.

  • Before dropping, we had 87719 observations. After dropping we have 77894 observations.

Checking payment amount for each country

  • The actual payment amount for participation was only recorded for Phase 2 participants by variable user_payment_amount

  • In Version 1 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 20R in South Africa, and 800N in Nigeria. In Version 2 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 10R in South Africa, and 400N in Nigeria

  • The variable payment_amount_coded is manually created to contain the payment amount in both Phase 1 and Phase 2

# manually entered payment amount for each country in Phase 1 and Phase 2 

data <- data |>
  mutate(
    payment_amount_coded = case_when(
      subphase == "1A" & country_coded == "Ghana" ~ "₵12.5",
      subphase == "1A" & country_coded == "Kenya" ~ "150Ksh",
      subphase == "1A" & country_coded == "South Africa" ~ "R20",
      subphase == "1A" & country_coded == "Nigeria" ~ "₦800",
      subphase == "1B" & country_coded == "Ghana" ~ "₵12.5",
      subphase == "1B" & country_coded == "Kenya" ~ "150Ksh",
      subphase == "1B" & country_coded == "South Africa" ~ "R10",
      subphase == "1B" & country_coded == "Nigeria" ~ "₦400",
      TRUE ~ user_payment_amount),
    payment_condition_dollar = case_when(
      user_payment_condition == "high" ~ 3,
      user_payment_condition == "low" ~ .5,
      user_payment_condition == "medium" ~ 1,
      subphase == "1A"  ~ 1,
      subphase == "1B" & country_coded == "Ghana" ~ 1,
      subphase == "1B" & country_coded == "Kenya" ~ 1,
      subphase == "1B" & country_coded == "South Africa" ~ .5,
      subphase == "1B" & country_coded == "Nigeria" ~ .5,
      TRUE ~ NA)
    )
  
# assert that there are no NAs in those variables

if(any(is.na(data$payment_amount_coded)) | any(is.na(data$payment_condition_dollar))) {
  stop("There are NAs in payment_amount_coded or payment_condition_dollar")
} else {
  print("Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar")
}
## [1] "Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar"
data |>
  group_by(phase_coded, payment_condition_dollar) |>
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = payment_condition_dollar, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `High` = `3`, `Medium` = `1`, `Low` = `0.5`, `Total`) |>
  kable("html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  scroll_box(width = "100%")
Phase High Medium Low Total
Phase 1 0 15849 14058 29907
Phase 2 9333 9771 21542 40646

Stratas

data = data |> mutate(
  strata = paste(subphase, country_coded, payment_condition_dollar, sep = "_")
)

Checking manipulative / reliable order

  • In the quiz, for each post, every user first saw the question about their intention to share. Then, they were asked two questions regarding their assessment of the manipulativeness and reliability of these posts. The order in which these questions were asked was randomized at the user level when they started the quiz. Users who never participated in the quiz will therefore have a missing value
data %>%
  group_by(phase_coded, manipulative_reliable_order) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = manipulative_reliable_order, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `Manipulative First` = `1`, `Reliable First` = `2` , `Missing` = `NA`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase Manipulative First Reliable First Missing Total
Phase 1 10378 10194 9335 29907
Phase 2 15453 15559 9634 40646

Checking time variables

  • MisinfoChat_start_time : time stamp for starting the chatbot

  • InterArms_start_time and InterArmsVideo_end_time: time stamp for starting and ending the video intervention

  • InterArmsGame_start_time and interArmsGame_end_time: time stamp for starting and ending the game intervention

  • InterArmsSMS_start_time and InterArmsSMS_end_time: time stamp for starting and ending the SMS intervention

  • InterArmsBase_start_time and InterArmsBase_end_time: time stamp for starting and ending the baseline intervention

  • InterArmsLongBase_start_time and InterArmsLongBase_end_time: time stamp for starting and ending the long baseline intervention

  • MisinfoQuiz_start_time and misinfoQuiz_end_time: time stamp for starting and ending the Misinfo quiz

  • baseline_duration: time duration between the start and the end of the orginal baseline in minutes

  • video_duration: time duration between the start and the end of the video intervention in minutes

  • game_duration: time duration between the start and the end of the game intervention in minutes

  • sms_duration: time duration between the start and the end of the SMS intervention in minutes

  • longbase_duration: time duration between the start and the end of the long baseline in minutes

  • intervention_duration: time duration between the start and the end of the intervention in minutes

  • quiz_duration: time duration between the start and the end of the misinformation quiz in minutes

  • total_duration: time duration between the start of the chatbot and the end of the misinformation quiz in minutes

# calculating duration 
data <- data %>%
  mutate(
    baseline_duration = as.numeric(difftime(InterArmsBase_end_time, InterArmsBase_start_time, units = "mins")),
    video_duration = as.numeric(difftime(InterArmsVideo_end_time, InterArms_start_time, units = "mins")),
    game_duration = as.numeric(difftime(interArmsGame_end_time, InterArmsGame_start_time, units = "mins")),
    sms_duration = as.numeric(difftime(InterArmsSMS_end_time, InterArmsSMS_start_time, units = "mins")),
    longbase_duration = as.numeric(difftime(InterArmsLongBase_end_time, InterArmsLongBase_start_time, units = "mins")),
    misinfo_quiz_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoQuiz_start_time, units = "mins")),
    total_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoChat_start_time, units = "mins"))
    ) %>%
  mutate(intervention_duration = case_when(
    arm_coded == "Video" ~ video_duration,
    arm_coded == "Game" ~ game_duration,
    arm_coded == "SMS" ~ sms_duration,
    arm_coded == "Original Baseline" ~ baseline_duration,
    arm_coded == "Long Baseline" ~ longbase_duration,
    TRUE ~ NA
  )
  )
data <- data |> mutate(
  is_abnormal_time = ((intervention_duration < 0) & !is.na(intervention_duration) ) | ((misinfo_quiz_duration < 0)) & !is.na(misinfo_quiz_duration) | ((total_duration < 0)) & !is.na(total_duration)) 

# removing missing values from duration (for those who did not finish the intervention)
is_abnormal_time = as.character(sum(data$is_abnormal_time, na.rm = TRUE))  
data <- data |> filter(!is_abnormal_time)
  • Here we checked for negative time taken to complete the intervention and the quiz. We noticed 198 time values and droped these observations from the dataset. After dropping, we have 70355 observations.

Cleaning and creating dropoff variable

Here we create flags for each stage of the chatbot and quiz to track the progress of the participants. We then create a dropoff variable to track the stage at which participants dropped off.

  • started_coded_num: 1 if the participant started the chatbot, 0 otherwise

  • consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise

  • started_arm_coded_num: 1 if the participant started any of the interventions, 0 otherwise

  • completed_arm_coded_num: 1 if the participant completed any of the interventions, 0 otherwise

  • quiz_started_coded_num: 1 if the participant started the quiz, 0 otherwise

  • quiz_consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise

  • quiz_completed_coded_num: 1 if the participant completed the quiz, 0 otherwise

  • completed_all_coded_num: 1 if the participant completed all, 0 otherwise

data <- data %>%
  mutate(started_coded_num = ifelse(!is.na(MisinfoChat_start_time), 1, 0),
         started_arm_coded_num = ifelse(!is.na(InterArms_start_time) | !is.na(InterArmsSMS_start_time) | !is.na(InterArmsGame_start_time) | !is.na(InterArmsBase_start_time) | !is.na(InterArmsLongBase_start_time), 1, 0),
         completed_arm_coded_num = ifelse(!is.na(intervention_duration), 1, 0),
         quiz_started_coded_num = ifelse(!is.na(MisinfoQuiz_start_time), 1, 0),
         quiz_consent_coded_num = case_when(
          quiz_consent == "I acknowledge" ~ 1,
          TRUE ~ 0),
         quiz_completed_coded_num = ifelse(!is.na(misinfoQuiz_end_time), 1, 0),
         completed_all_coded_num = ifelse(!is.na(completed), 1, 0))

Creating dropoff variable

  • stage provides the location where participants dropped off from the chatbot
data <- data |>
  rowwise() |>
  mutate(
    stage_reached_num = sum(
      started_coded_num,
      started_arm_coded_num,
      completed_arm_coded_num,
      quiz_consent_coded_num,
      quiz_started_coded_num,
      quiz_completed_coded_num,
      completed_all_coded_num),
    stage_reached_char = case_when(
      stage_reached_num == 7 ~ "7-Completed all",
      stage_reached_num == 6 ~ "6-Completed quiz",
      stage_reached_num == 5 ~ "5-Consented to quiz",
      stage_reached_num == 4 ~ "4-Reached quiz section",
      stage_reached_num == 3 ~ "3-Completed intervention",
      stage_reached_num == 2 ~ "2-Started intervention",
      stage_reached_num == 1 ~ "1-Started chatbot",
      TRUE ~ "Did not start"
  ))


data |> group_by(phase_coded, stage_reached_num, stage_reached_char) |>
  summarize(N = n()) |> 
  arrange(phase_coded, desc(stage_reached_num)) |>
  group_by(phase_coded) |>
  mutate(Total = cumsum(N)) |>
  arrange(phase_coded, stage_reached_num) |>
  mutate(
    `Reached here (% Total)` = round(Total / Total[1] * 100,2),
    `Reached here (% Previous)` = round(Total / lag(Total) * 100,2),
    `Finished here (% Total)` = round(N / Total[1] * 100,2),
    `Finished here (% Previous)` = round(N / lag(Total) * 100,2)
  ) |>
  select(
    `Phase` = phase_coded,
    `Stage Reached` = stage_reached_char,
    `Reached here (n)` = Total,
    `Reached here (% Total)`,
    `Reached here (% Previous)`,
    `Finished here` = N,
    `Finished here (% Total)`,
    `Finished here (% Previous)`
  ) |>
  kable("html") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Phase Stage Reached Reached here (n) Reached here (% Total) Reached here (% Previous) Finished here Finished here (% Total) Finished here (% Previous)
Phase 1 2-Started intervention 29727 100.00 NA 9029 30.37 NA
Phase 1 3-Completed intervention 20698 69.63 69.63 2 0.01 0.01
Phase 1 4-Reached quiz section 20696 69.62 99.99 306 1.03 1.48
Phase 1 5-Consented to quiz 20390 68.59 98.52 3569 12.01 17.24
Phase 1 6-Completed quiz 16821 56.58 82.50 289 0.97 1.42
Phase 1 7-Completed all 16532 55.61 98.28 16532 55.61 98.28
Phase 2 1-Started chatbot 40628 100.00 NA 365 0.90 NA
Phase 2 2-Started intervention 40263 99.10 99.10 8880 21.86 21.86
Phase 2 4-Reached quiz section 31383 77.24 77.95 390 0.96 0.97
Phase 2 5-Consented to quiz 30993 76.28 98.76 3837 9.44 12.23
Phase 2 6-Completed quiz 27156 66.84 87.62 317 0.78 1.02
Phase 2 7-Completed all 26839 66.06 98.83 26839 66.06 98.83

Checking quiz completers and treatment groups to see if NA entrants

c4 = as.character(length(unique(data$`analytic_id`)))
abnormal_quiz_completers <- data %>%
  filter(quiz_completed_coded_num == 1) %>%
  select(arm_coded, analytic_id, phase_coded, quiz_completed_coded_num, intervention_duration) %>%
  filter(is.na(arm_coded))

is_abnormal_quiz_completers <- as.character(nrow(abnormal_quiz_completers))

data <- data %>%
  filter(!analytic_id %in% abnormal_quiz_completers$analytic_id)

d <- as.character(length(unique(data$`analytic_id`)))
  • Noticed 0 participants who completed the quiz but did not have an recorded entry for treatment assignment. They did not receive any intervention, so we dropped them from the our dataset.

  • Before dropping, we had 70355 number of observations. After droppping, we have 70355 number of observations

Cleaning post-level variables and outcomes

  • Three variables: sharing, manipulative, and reliable scores are recorded for each post. We shall reshape them individually. While merging we shall impose a one-to-one relationship to ensure that we do not lose any data.
# Reshape post ids
data_long <- data %>% 
  select(starts_with("post_id"), analytic_id, phase_coded, arm_coded, attention_check_passed, quiz_completed_coded_num) %>%
  select(-ends_with("_followup")) %>%
  pivot_longer(
    cols=starts_with("post_id"),
    names_to="post_number",
    values_to="post_id",
    names_prefix="post_id_"
    )

# Reshape manipulative scores
manipulative_score <- data %>% 
  select(starts_with("post_id"), contains(c("_manipulative")), analytic_id) %>%
  select(-ends_with(c("_followup", "_reliable_order"))) %>%
  pivot_longer(
    cols = contains("_manipulative"),
    names_to = "post_number",
    values_to = "manipulative_score",
    names_pattern = "post_(.*)_manipulative"
  ) %>%
  select(post_number, analytic_id, manipulative_score)

# Reshape reliable scores
reliable_score <- data %>% 
  select(starts_with("post_id"), contains(c("_reliable")), analytic_id) %>%
  select(-ends_with(c("_followup", "_reliable_order"))) %>%
  pivot_longer(
    cols = contains("_reliable"),
    names_to = "post_number",
    values_to = "reliable_score",
    names_pattern = "post_(.*)_reliable"
  ) %>%
  select(post_number, analytic_id, reliable_score)

# Reshape share scores
share_score <- data %>%
  select(starts_with("post_id"), ends_with("_share"), analytic_id) %>%
  pivot_longer(
    cols = contains("_share"),
    names_to = "post_number",
    values_to = "share_score",
    names_pattern = "post_(.*)_share"
  ) %>%
  select(post_number, analytic_id, share_score)


# Merging reshaped variables

data_long = data_long |>
  left_join(manipulative_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  ) |>
  left_join(reliable_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  ) |>
  left_join(share_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  )

# converting to numeric responses
    # for manipulative and reliable scores: 1 = not at all, 5 = very
    # for share score: 1 = yes, 0 = no


data_long = data_long |> 
  mutate(
    manipulative_score = parse_number(manipulative_score),
    reliable_score = parse_number(reliable_score),
    share_score = case_when(
      tolower(share_score) == "yes" ~ 1,
      tolower(share_score) == "no" ~ 0,
      TRUE ~ NA_real_
    )
  ) 

Checking abnormal reliable and manipulative scores

  • abnormal_manipulativescore or abnormal_reliablescore responses for manipulative and reliable scores are not in {1,2,3,4,5}.
data_long <- data_long |>
  mutate(
    abnormal_manipulativescore = ifelse(!(manipulative_score %in% 1:5) & !is.na(manipulative_score), 1, 0),
    abnormal_reliablescore = ifelse(!(reliable_score %in% 1:5) & !is.na(reliable_score), 1, 0),
    manipulative_score = ifelse(abnormal_manipulativescore == 1, NA_real_, manipulative_score) |> as.integer(),
    reliable_score = ifelse(abnormal_reliablescore == 1, NA_real_, reliable_score) |> as.integer()
  )

abnormal_scores = data_long |> 
  filter(abnormal_manipulativescore == 1 | abnormal_reliablescore == 1) |> 
  distinct(analytic_id)


data_long = data_long |> filter(!analytic_id %in% abnormal_scores$analytic_id)
  • We dropped quiz answers of 73 participants who had abnormal manipulative or reliable scores in any of the questions. After dropping, we have 421692 user-post observations.

Checking attention check pass

Here we are checking attention check pass across two phases. - In Phase 1, the attention check was confusingly (arguably, incorrectly) worded and required users to respond in a specific way to three questions in a row based on a single prompt. This resulted in a very low pass rate. - In Phase 2 we used a simpler attention check that simply required users to select one option (either 2 or 4, randomized) from a set of {1, 2, 3, 4, 5}.

# attention check pass variable

# Attention check in Phase 2 was built in the Chatbot and was a single question with 5 options.
data_attention_2 = data |> filter(
  phase_coded == "Phase 2") |>
  mutate(
    attention_check_passed = case_when(
      attention_check_passed == "TRUE" ~ 1,
      attention_check_passed == "FALSE" ~ 0,
      TRUE ~ NA_real_
    )
  ) |>
  select(analytic_id, attention_check_passed)

# Attention check in Phase 1 required users to respond in a specific way to three questions in a row based on a single prompt. Passing was not automatically recorded in the data, so we manually coded it.
data_attantion_1 = data_long |>
  filter(phase_coded == "Phase 1") |>
  filter(post_id %in% c("attention_check_1", "attention_check_2")) |>
  mutate(
    attention_check_passed = case_when(
      post_id == "attention_check_1" & manipulative_score == 4 & reliable_score == 4 & share_score == 1 ~ 1,
      post_id == "attention_check_2" & manipulative_score == 2 & reliable_score == 2 & share_score == 0 ~ 1,
      post_id == "attention_check_1" & !(manipulative_score == 4 & reliable_score == 4 & share_score == 1) ~ 0, 
      post_id == "attention_check_2" & !(manipulative_score == 2 & reliable_score == 2 & share_score == 0) ~ 0,
      TRUE ~ NA_real_
    )
  ) |>
  select(analytic_id, attention_check_passed)

data_attention <- bind_rows(data_attention_2, data_attantion_1)


data_attention |> left_join(
    data |> select(analytic_id, phase_coded),
    by = "analytic_id",
    relationship = "one-to-one"
  ) |>
  group_by(phase_coded) |>
  summarize(
    `Attention Check Passed` = sum(attention_check_passed, na.rm = TRUE),
    `Attention Check Non-missing` = sum(!is.na(attention_check_passed)),
    `Pass Rate (%)` = round(`Attention Check Passed` / `Attention Check Non-missing` * 100, 2)
  ) |>
  rename(`Phase` = phase_coded) |>
  kable("html") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Phase Attention Check Passed Attention Check Non-missing Pass Rate (%)
Phase 1 493 18363 2.68
Phase 2 23742 29073 81.66

Inspecting missing values

  • We are cautious of missing values in the quiz among quiz completers. We will now evaluate the issue of missingness

  • We will start by dropping the attention check posts from the long dataset as they were coded differently in Phase 2 (and we analyzed them above); we will also drop those that did not complete the quiz (158040 out of 421692 user-posts`).

# Creating post_type for each post
data_long <- data_long |>
  mutate(post_type = case_when(
    str_detect(post_id, "non_misinfo") ~ "non_misinfo",
    str_detect(post_id, "discrediting") ~ "discrediting",
    str_detect(post_id, "false") ~ "false_dichotomy",
    str_detect(post_id, "attention_check") ~ "attention_check",
    str_detect(post_id, "emotional") ~ "emotional",
    TRUE ~ NA
  ))

data_long = data_long |> filter(!post_type == "attention_check") |> filter(quiz_completed_coded_num == 1)

Checking missing values in post id

data_long |> group_by(post_id) |> summarize(n = n()) |> arrange(desc(n)) |> kable("html") |> kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(width = "100%")
post_id n
emotional_28 22051
non_misinfo_10 22039
false_dichotomy_21 21964
non_misinfo_6 21944
discrediting_17 21938
discrediting_14 21902
false_dichotomy_23 21876
non_misinfo_8 21876
non_misinfo_2 21821
emotional_27 21789
valid_posts = c(
  "emotional_28", "non_misinfo_10", "false_dichotomy_21",
  "non_misinfo_6", "discrediting_17", "discrediting_14",
  "non_misinfo_8", "false_dichotomy_23", "non_misinfo_2", "emotional_27"
  )


abnormal_posts = data_long |>
  filter(is.na(post_id) | !(post_id %in% valid_posts)) |>
  distinct(analytic_id) |>
  mutate(abnormal_posts = 1)

data_long = data_long |> filter(!analytic_id %in% abnormal_posts$analytic_id)

data = data |> left_join(
  abnormal_posts,
  by = "analytic_id"
  ) |>
  mutate(abnormal_posts = ifelse(is.na(abnormal_posts), 0, abnormal_posts)) |>
  filter(abnormal_posts == 0) 

The valid post ids, which correspond to stimuli used in the survey are: emotional_28, non_misinfo_10, false_dichotomy_21, non_misinfo_6, discrediting_17, discrediting_14, non_misinfo_8, false_dichotomy_23, non_misinfo_2, emotional_27. We shall drop all users who have at least one missing value in the post_id variable. We shall also drop all users who have at least one invalid post_id value. A total of 0 users are dropped.

missing_treatment_arm <- data_long |>
  filter(arm_coded == "None") |>
  select(analytic_id) |>
  distinct() |>
  mutate(missing_treatment_arm = 1)

data_long = data_long |> filter(!analytic_id %in% missing_treatment_arm$analytic_id)

data = data |> left_join(
  missing_treatment_arm,
  by = "analytic_id"
  ) |>
  mutate(missing_treatment_arm = ifelse(is.na(missing_treatment_arm), 0, missing_treatment_arm)) |>
  filter(missing_treatment_arm == 0)

Finally, we shall drop all users who have missing treatment arm values, a total of 0 users.

Missing values per phase

Note, in Phase 1 we initially did not enforce that the users select from one of the given options. This results in the relatively higher number of missing values in Phase 1.

By users

data_long |> 
  group_by(phase_coded, analytic_id) |>
  summarize(
    `User has any missing values` = any(is.na(share_score)) | any(is.na(reliable_score)) | any(is.na(manipulative_score)),
    `User has all missing values` = all(is.na(share_score)) & all(is.na(reliable_score)) & all(is.na(manipulative_score)),
    `User has no missing values` = !any(is.na(share_score)) & !any(is.na(reliable_score)) & !any(is.na(manipulative_score))
    ) |>
  group_by(phase_coded) |>
  summarize(
    `Total Users` = n(),
    `Users with any missing values` = sum(`User has any missing values`),
    `Users with all missing values` = sum(`User has all missing values`),
    `Users with no missing values` = sum(`User has no missing values`)
  ) |>
  kable("html", caption = "Users with missing values, per phase") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Users with missing values, per phase
phase_coded Total Users Users with any missing values Users with all missing values Users with no missing values
Phase 1 16680 987 0 15693
Phase 2 27160 46 0 27114

By user-posts

data_long |>
  group_by(phase_coded) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per phase") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |> 
  scroll_box(width = "100%")
Missing values per phase
phase_coded Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
Phase 1 83400 463 558 510 34 82107
Phase 2 135800 35 32 37 22 135745

Missing values per post order

We now look into whether the missing values are concentrated in a particular post order.

data_long |>
  group_by(post_number) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per post order") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Missing values per post order
post_number Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
1 36466 88 108 93 5 36215
2 36568 69 84 83 8 36374
3 36555 76 87 83 8 36349
4 36461 63 86 86 9 36269
5 36495 87 99 100 14 36270
6 36655 115 126 102 12 36375

Missing values per post id

  • The missing values are equally distributed across the post ids.
data_long |>
  group_by(post_id) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per post id") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Missing values per post id
post_id Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
discrediting_14 21902 46 62 56 6 21773
discrediting_17 21938 57 53 53 9 21806
emotional_27 21789 52 65 54 7 21653
emotional_28 22051 49 65 67 6 21905
false_dichotomy_21 21964 44 62 58 6 21828
false_dichotomy_23 21876 51 61 67 7 21729
non_misinfo_10 22039 36 56 53 5 21922
non_misinfo_2 21821 42 45 44 0 21698
non_misinfo_6 21944 69 56 45 3 21799
non_misinfo_8 21876 52 65 50 7 21739

Missing values per phase and treatment arm

  • The missing values are equally distributed across the treatment arms within each phase.
data_long |>
  group_by(phase_coded, arm_coded) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per phase and treatment arm") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Missing values per phase and treatment arm
phase_coded arm_coded Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
Phase 1 Game 17265 99 132 125 5 16972
Phase 1 Original Baseline 24225 128 153 159 11 23864
Phase 1 SMS 21415 115 150 121 8 21078
Phase 1 Video 20495 121 123 105 10 20193
Phase 2 Long Baseline 45865 8 7 10 6 45853
Phase 2 Original Baseline 21355 5 5 6 4 21347
Phase 2 SMS 68580 22 20 21 12 68545

Missing values per attention check

data_long |>
  group_by(attention_check_passed) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per attention check") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Missing values per attention check
attention_check_passed Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
FALSE 23080 4 5 5 2 23072
TRUE 112705 28 25 30 18 112661
NA 83415 466 560 512 36 82119

Calculating mean scores for each participant

Note that we will use a restrictive sample definition, where we only consider the users with no missing values. As a result we lose 1033 users out of 43840 users who completed the quiz. Vast majority of these users are from Phase 1. Concretely, for any user with one or more missing values in the quiz questions, we drop all their answers. These users will continue to be tracked in the dataset, but we will not use their answers in the survey analysis. In other words, for users who have at least one missing value in the main survey quiz questions, we will treat them as not completing the main survey quiz, but not dropping them.

Variable definitions:

  • mean_share_misinfo: mean (across posts) intention to share misinformation posts for a given user; range [0, 1]
  • mean_reliable_misinfo: mean (across posts) reliable score for misinformation posts for a given user; range [1, 5]
  • mean_manipulative_misinfo: mean (across posts) manipulative score for misinformation posts for a given user; range [1, 5]
  • mean_share_nonmisinfo: mean (across posts) intention to share non-misinformation post for a given user; range [0, 1]
  • mean_reliable_nonmisinfo: mean (across posts) reliable score for non-misinformation posts for a given user; range [1, 5]
  • mean_manipulative_nonmisinfo: mean (across posts) manipulative score for non-misinformation posts for a given user; range [1, 5]
  • sharing_discernment: difference in mean share score between non-misinformation and misinformation posts for a given user; range [-1, 1]
  • reliability_discernment: difference in mean reliable score between non-misinformation and misinformation posts for a given user; range [-4, 4]
  • manipulation_discernment: difference in mean manipulative score between misinformation and non-misinformation posts for a given user; range [-4, 4]

Note that the discernment scores are constructed in such a way that a positive value indicates better discernment. For example, positive value for reliability discernment indicates that user scored non-misinformation posts as more reliable than misinformation posts. On the other hand, positive monipulative discernment indicates that user scored non-misinformation posts as less manipulative than misinformation posts.

# Find users with missing answers
missing_scores = data_long |>
  group_by(analytic_id) |>
  summarize(
    any_missing_share = any(is.na(share_score)),
    any_missing_reliable = any(is.na(reliable_score)),
    any_missing_manipulative = any(is.na(manipulative_score)),
    any_missing = any(is.na(share_score) | is.na(reliable_score) | is.na(manipulative_score))
  ) |>
  filter(any_missing)

# Drop answers of users with missing scores
data_long = data_long |> filter(!(analytic_id %in% missing_scores$analytic_id))


# Compute outcomes
outcomes_misinfo = data_long |>
  filter(post_type %in% c("discrediting", "false_dichotomy", "emotional")) |>
  group_by(analytic_id) |>
  summarize(
    mean_share_misinfo = mean(share_score),
    mean_reliable_misinfo = mean(reliable_score),
    mean_manipulative_misinfo = mean(manipulative_score)
  )

outcomes_nonmisinfo = data_long |>
  filter(post_type == "non_misinfo") |>
  group_by(analytic_id) |>
  summarize(
    mean_share_nonmisinfo = mean(share_score),
    mean_reliable_nonmisinfo = mean(reliable_score),
    mean_manipulative_nonmisinfo = mean(manipulative_score),
  )

outcomes = outcomes_misinfo |>
  left_join(outcomes_nonmisinfo,
  by = "analytic_id",
  relationship = "one-to-one",
  unmatched = "error"
  )


outcomes = outcomes |>
  mutate(
    sharing_discernment = mean_share_nonmisinfo - mean_share_misinfo,
    reliability_discernment = mean_reliable_nonmisinfo - mean_reliable_misinfo,
    manipulation_discernment = mean_manipulative_misinfo - mean_manipulative_nonmisinfo
  )

# Check that there are no missing values in the outcomes

if (any(is.na(outcomes))) {
  stop("There are missing values in the outcomes")
}
  • For the purpose of our analysis, we dropped participants with at least one missing value in share_score, manipulative_score, and reliability_scores.

  • We dropped 1033 participants who had at least one missing value in the scores. After dropping, we have 214035 user-post observations.

Checking missing values in attention check passed

data_long |>
  group_by(attention_check_passed) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per attention check") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")
Missing values per attention check
attention_check_passed Total Missing Share Score Missing Reliable Score Missing Manipulative Score Missing All Scores All Scores Non-missing
FALSE 23045 0 0 0 0 23045
TRUE 112515 0 0 0 0 112515
NA 78475 0 0 0 0 78475

Generating long dataset

data_long_final <- data_long %>%
  select(analytic_id, post_id, post_number, post_type, share_score, reliable_score, manipulative_score, phase_coded, quiz_completed_coded_num, arm_coded)

# labeling   
library(Hmisc)


label(data_long_final$analytic_id) <- "Analytic ID"
label(data_long_final$post_id) <- "Post ID"
label(data_long_final$post_number) <- "Post Order Number"
label(data_long_final$post_type) <- "Post Type"
label(data_long_final$share_score) <- "Share Score"
label(data_long_final$reliable_score) <- "Reliable Score"
label(data_long_final$manipulative_score) <- "Manipulative Score"
label(data_long_final$phase_coded) <- "Phase"
label(data_long_final$quiz_completed_coded_num) <- "Quiz Completion"
label(data_long_final$arm_coded) <- "Treatment arm"

var_names <- names(data_long_final)

# Extracting labels
var_labels <- sapply(data_long_final, label)

# Combine into a data frame
label_info_long <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)

write.csv(label_info_long, "./data/chatfuel/processed/long_data_variable.csv", row.names = FALSE)

fwrite(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.csv.gz")

saveRDS(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.rds")
  • Long dataset is exported as misinfo_clean_long.csv

  • The dataset contains the following variables:

    • analytic_id: unique identifier for each participant
    • post_id: unique identifier for each post
    • post_number: order of the post
    • post_type: type of the post
    • share_score: score for sharing discernment
    • reliable_score: score for reliability discernment
    • manipulative_score: score for manipulation discernment
    • phase_coded: phase of the study
    • quiz_completed_coded_num: 1 if the participant completed the quiz, 0 otherwise
    • arm_coded: treatment arm
  • The dataset is also saved as misinfo_clean_long.rds for future use.

  • The dataset contains 214035 user-post observations corresponding to 42807 participants.

Generating wide dataset

data_wide = data |>
  select(
    analytic_id, phase_coded, country_coded, payment_amount_coded, payment_condition_dollar, arm_coded,
    MisinfoChat_start_time, total_duration, intervention_duration, 
    misinfo_quiz_duration, misinfoQuiz_end_time, MisinfoQuiz_start_time,
    missing_treatment_arm, abnormal_posts, stage_reached_num,
    stage_reached_char, consent_coded_num, manipulative_reliable_order,
    subphase, strata
  ) |> left_join(
    outcomes,
    by = "analytic_id",
    relationship = "one-to-one"
  ) |> left_join(
    data_attention,
    by = "analytic_id",
    relationship = "one-to-one"
  )

data_wide = data_wide |> 
  mutate(
    quiz_completed = !any(is.na(manipulative_reliable_order), is.na(mean_share_misinfo), is.na(mean_reliable_misinfo), is.na(mean_manipulative_misinfo), is.na(mean_share_nonmisinfo), is.na(mean_reliable_nonmisinfo), is.na(mean_manipulative_nonmisinfo))
  )

# Recode attention check to NA for those who did not complete the quiz
data_wide = data_wide |> 
  mutate(
    attention_check_passed = ifelse(quiz_completed, attention_check_passed, NA_real_)
  )

# Assert number of those who completed the quiz corresponds to the number of users in long dataset

if (sum(data_wide$quiz_completed) != nrow(data_long_final |> distinct(analytic_id))) {
  stop("Number of users in wide dataset does not correspond to the number of users in long dataset")
}

# Rename to avoid confusion 

data_wide$quiz_completed_coded <- data_wide$quiz_completed

#labeling 

label(data_wide$analytic_id) <- "Analytic ID"
label(data_wide$phase_coded) <- "Phase"
label(data_wide$country_coded) <- "Country"
label(data_wide$stage_reached_num) <- "Stage Reached (Numeric, 1-7; 7 = Completed initial chatbot)"
label(data_wide$stage_reached_char) <- "Stage Reached"
label(data_wide$payment_amount_coded) <- "Payment Amount"
label(data_wide$arm_coded) <- "Treatment Arm"
label(data_wide$intervention_duration) <- "time duration between the start and the end of the intervention in minutes"
label(data_wide$total_duration) <- "time duration between the start of the chatbot and the end of the misinformation quiz in minutes"
label(data_wide$misinfo_quiz_duration) <- "time duration between the start and the end of the misinformation quiz in minutes"
label(data_wide$misinfoQuiz_end_time) <- "Misinfo Quiz End Time"
label(data_wide$MisinfoQuiz_start_time) <- "Misinfo Quiz Start Time"
label(data_wide$consent_coded_num) <- "Consent: 1 for I consent, start now, 0 for No let's stop here"
label(data_wide$attention_check_passed) <- "Attention Check Passed; 1 = PASS; 0 = FAIL"
label(data_wide$manipulative_reliable_order) <- "Order of Manipulative and Reliable Questions"
label(data_wide$mean_share_misinfo) <- "Mean Share Score for Misinformation Posts"
label(data_wide$mean_reliable_misinfo) <- "Mean Reliable Score for Misinformation Posts"
label(data_wide$mean_manipulative_misinfo) <- "Mean Manipulative Score for Misinformation Posts"
label(data_wide$mean_share_nonmisinfo) <- "Mean Share Score for Non-Misinformation Posts"
label(data_wide$mean_reliable_nonmisinfo) <- "Mean Reliable Score for Non-Misinformation Posts"
label(data_wide$mean_manipulative_nonmisinfo) <- "Mean Manipulative Score for Non-Misinformation Posts"
label(data_wide$sharing_discernment) <- "Discernment in Sharing Misinformation and Non-Misinformation Posts"
label(data_wide$reliability_discernment) <- "Discernment in Reliability of Misinformation and Non-Misinformation Posts"
label(data_wide$manipulation_discernment) <- "Discernment in Manipulation of Misinformation and Non-Misinformation Posts"
label(data_wide$MisinfoChat_start_time) <- "Misinformation Chatbot Start Time"
label(data_wide$subphase) <- "Sub-Phase (for Stratification)"
label(data_wide$strata) <- "Subphase-Country-Payment Strata"

label(data_wide$quiz_completed_coded) <- "Quiz Completion: 1 = Completed, 0 = Did not complete"

var_names <- names(data_wide)

# Extracting labels
var_labels <- sapply(data_wide, label)

# Combine into a data frame
label_info_wide <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)

write.csv(label_info_wide, "./data/chatfuel/processed/wide_data_variable.csv", row.names = FALSE)

# Save data to csv.gz
fwrite(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.csv.gz")

# Save data to .rds
saveRDS(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.rds")

#### Save to stata format
library(haven)

# shorten variable names to first 30 characters to fit stata limit
data_wide_stata = data_wide |>
  rename_all(~str_replace_all(., "followup", "fllwp")) |>
  rename_all(~str_replace_all(., "reflection", "refl")) |>
  rename_all(~str_replace_all(., "manipulative", "mani")) |>
  rename_all(~str_replace_all(., "reliable", "reli")) |>
  rename_all(~str_replace_all(., "reflection", "refl")) |>
  rename_all(~str_replace_all(., "discernment", "disc")) |>
  rename_all(~str_replace_all(., "misinfo", "mis")) |>
  rename_all(~str_replace_all(., "nonmisinfo", "nonmis")) |>
  rename_all(~str_replace_all(., "share", "shre")) |>
  rename_all(~str_replace_all(., "abnormal", "abn")) |>
  rename_all(~str_replace_all(., "completed", "compl")) |>
  rename_all(~str_replace_all(., "inter", "int"))

write_dta(data_wide_stata, "./data/chatfuel/processed/misinfo_clean_wide.dta")
  • Wide dataset is exported as misinfo_clean_wide.csv

  • The dataset contains the following variables:

    • analytic_id: unique identifier for each participant
    • phase_coded: phase of the study
    • country_coded: country of the participant
    • payment_amount_coded: payment amount
    • arm_coded: treatment arm
    • MisinfoChat_start_time: time stamp for starting the chatbot
    • total_duration: time taken to complete the chatbot
    • intervention_duration: time taken to complete the intervention
    • misinfo_quiz_duration: time taken to complete the quiz
    • misinfoQuiz_end_time: time stamp for ending the Misinfo quiz
    • MisinfoQuiz_start_time: time stamp for starting the Misinfo quiz
    • missing_treatment_arm: 1 if the participant did not have a treatment arm, 0 otherwise
    • abnormal_posts: 1 if the participant had abnormal posts, 0 otherwise
    • stage_reached_num: stage reached by the participant (numeric)
    • stage_reached_char: stage reached by the participant
    • consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise
    • manipulative_reliable_order: order of manipulative and reliable questions
    • mean_share_misinfo: mean share score for misinformation posts
    • mean_reliable_misinfo: mean reliable score for misinformation posts
    • mean_manipulative_misinfo: mean manipulative score for misinformation posts
    • mean_share_nonmisinfo: mean share score for non-misinformation posts
    • mean_reliable_nonmisinfo: mean reliable score for non-misinformation posts
    • mean_manipulative_nonmisinfo: mean manipulative score for non-misinformation posts
    • sharing_discernment: discernment in sharing misinformation and non-misinformation posts
    • reliability_discernment: discernment in reliability of misinformation and non-misinformation posts
    • manipulation_discernment: discernment in manipulation of misinformation and non-misinformation posts
    • attention_check_passed: 1 if the participant passed the attention check, 0 otherwise, NA if did not complete the quiz
  • The dataset is also saved as misinfo_clean_wide.rds for future use.

  • The dataset contains 70355 observations corresponding to 70355 participants who consented.

  • Among the participants, 42807 (60.84%) completed the quiz.

  • Among the participants who completed the quiz, 22946 (53.6%) passed the attention check.

Inpsecting the wide dataset

  • In the summary statistics script, we discovered a few issues regarding missing values within the wide dataset. We expected quiz_completers to not have any missing values for the intervention_duration, quiz_duration, and attention_check_passed variable.
df_wide_check <- data_wide %>%  
  filter(quiz_completed == 1) %>% 
  select(analytic_id, intervention_duration, misinfo_quiz_duration, total_duration, attention_check_passed) %>% 
  filter(is.na(intervention_duration) | is.na(misinfo_quiz_duration) | is.na(total_duration) | is.na(attention_check_passed))

n_manual_check <- as.character(nrow(df_wide_check))

# Save to csv

write.csv(df_wide_check , "./data/chatfuel/processed/need_manual_check.csv", row.names = FALSE)
  • We found 21 participants with missing values for the intervention_duration, misinfo_quiz_duration, total_duration, and attention_check_passed variables. We will manually check these participants on Chatfuel. The list of participants is saved as need_manual_check.csv in the chatfuel folder.