Purpose

The purpose of this script is to clean the raw FB Misinformation Chatfuel data, document key decisions made, and generate two clean datasets for further analysis.

The key outputs are as follows:

A long-form dataset, misinfo_clean_wide.csv

This dataset contains one row per respondent. It includes all the responents who consented to the study and passed data-quality checks (such as being from one of the eligible countries), regardless of whether they completed the intervention or the quiz. For the users who did not complete the intervention or the quiz, the relevant variables are set to missing.

A wide-form dataset, misinfo_clean_long.csv

This dataset contains 5 rows per respondent, one for each post (excluding attention check) that the user saw. It includes only the respondents who completed the intervention and the quiz, and passed data-quality checks (such as there being no missing values for the quiz questions).

Summary

The raw Chatfuel dataset was last pulled on 02/23/2024 and processed by the anonymization script. The anonymized version of dataset contains the variable analytic_id as the unique identifier for each respondent. We filtered respondents based on the following criteria:

The eligibility criteria for respondents were determined by their participation in two iterations of Phase 1 and three iterations of Phase 2. During Phase 1, participants interacted with one of two versions of the chatbot, labeled as ExperimentLaunchV1 and ExperimentLaunchV2. In Phase 2, participants were exposed to one of three versions of the chatbot, named Phase2, Phase2_V2, and Phase2_V3. Chatbot users who were deemed ineligible were marked as a missing value under the NewVersion variable and subsequently removed from the dataset. Additionally, respondents who took part in technical pilots were excluded from the dataset.
ExperimentLaunchV1 corresponds to the time in Phase 1 before we reduced incentives in South Africa and Nigeria. ExperimentLaunchV2 corresponds to the time in Phase 1 after we reduced incentives in South Africa and Nigeria.
Phase2 corresponds to the time in Phase 2 before we fixed the issue with randomization (which was causing correct marginal distributions of arm and payment but incorrect joint distribution). Phase2_V2 corresponds to the time in Phase 2 after we fixed the issue with randomization. Phase2_V3 then corresponds to the time after we removed the $1 and $3 payment options in Phase 2.
We excluded participants who started before the experiment (11/16/2023) and after the end of experiment (12/31/2023)
The eligibility of respondents is based on the four countries: Nigeria, Kenya, Ghana, and South Africa. Respondents who are not from these countries were dropped from the raw dataset.
Early in Phase 1 of the experiment, before we implemented a fix, some users were able to do the chatbot multiple times. After the fix many users continued trying to enter the chatbot for the second time but were turned away. We made a clear distinction between two types of respondents: attempted_repeaters and actual_repeaters. attempted_repeaters refers to those respondents who attempted to interact with the chatbot more than once but were automatically prevented from doing so. On the other hand, actual_repeaters are those respondents who managed to participate in more than one treatment arm before being identified and removed from the dataset.
We ensure uniqueness of observations on several places. First, in the anonymization script we ensure each row corresponds to a unique chatfuel user id. Second, we checked for the number of unique analytic_id and the number of observations in the dataset. We confirm that the number of unique analytic_id is equal to the number of observations in the dataset, so each observation is indeed unique.
We checked for a series of anomalies in the raw dataset and found the following:
1. Phase 1 participants having Phase 2 specific attributes and vice versa: We noticed some participants in Phase 1 being assigned to the long baseline group and some participants in Phase 2 being assigned to the video intervention group. We dropped these participants from the dataset.
2. Negative time duration: We noticed about 82 respondents (<0.10 %) having a negative time duration for the amount of time completing the treatment intervention or the amount of time completing the quiz. Upon inspection, it is likely due to a bug in Chatfuel. We dropped these completers from the dataset.
3. Missing values: quiz completers are defined as those who have a timestamp in the misinfoQuiz_end_time. Logically, these participants must have recorded entries for all of the misinfo quiz attributes, including reliable_score, manipulative_score, and share_score. However, we noticed, in some instances, that there missing values for these variables. These are concentrated in Phase 1 and caused by the fact that we initially did not enforce that users select from one of the pre-defined options for the quiz questions. This affected ~1000 observations. A second, much smaller, set of missing values is due to a bug in Chatfuel. This affects less than 100 users in total. For both sets of users we set all of their quiz-related outcomes to missing.
4. Participants with wrong or missing posts: we drop ~100 partitipants with missing post_id in any post and two partiticpant with post_id’s corresponding to pilot version of the quiz.

Loading the anonymized dataset for cleaning

Data

The raw dataset was last pulled on 02/23/2024 and stored at ~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/
The anonymized dataset is stored in Github directory ~fb_misinfo_interventions/data/chatfuel/raw

# file path `~fb_misinfo_interventions/data/chatfuel/raw`
data <- fread("./data/chatfuel/raw/fbmisinfo_anon.csv.gz")

Cleaning phases

Looking at respondents across different phases and generating a phase identifier (phase_coded)
1. Respondents who were in Version ExperimentLaunchV1 or ExperimentLaunchV2 as Phase 1
2. Respondents who were in Version Phase2, Phase2_V2 or Phase2_V3 as Phase 2
Blank values represent participants who accessed the chatbot but did not meet the eligibility criteria to proceed further. These individuals engaged with the Messenger popup, which resulted in their time being logged. However, these participants were subsequently removed from the dataset.

total_count <- data %>% 
  count(NewVersion) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(NewVersion) %>%
  rbind(data.frame(NewVersion = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Count
	68462
ExperimentLaunchV1	9904
ExperimentLaunchV2	30604
Final_Technical_Pilot	181
Misinfo_Pilot	287
Phase2	8057
Phase2_V2	27276
Phase2_V3	15512
Total	160283

# generating coded phase 1 and 2 variables and dropping respondents not in Phase 1 or Phase 2
data$phase_coded <-data$NewVersion 
data$phase_coded[data$phase_coded=="ExperimentLaunchV1" | data$phase_coded=="ExperimentLaunchV2"] = "Phase 1"
data$phase_coded[data$phase_coded=="Phase2" | data$phase_coded=="Phase2_V2" | data$phase_coded=="Phase2_V3"] = "Phase 2"

data = data |> mutate(
  subphase = case_when(
    phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) <= as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1A",
    phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) > as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1B",
    NewVersion == "Phase2" ~ "2A",
    NewVersion == "Phase2_V2" ~ "2B",
    NewVersion == "Phase2_V3" ~ "2C"
  )
)

total_count <- data %>% 
  count(phase_coded) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(phase_coded) %>%
  rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Count
	68462
Final_Technical_Pilot	181
Misinfo_Pilot	287
Phase 1	40508
Phase 2	50845
Total	160283

a= as.character(length(unique(data$`analytic_id`)))


if (nrow(data) != length(unique(data$analytic_id))) {
  stop("The number of rows does not match the number of unique analytic_id values.")
} else {
  # Continue 
  print("Unique observations check passed.")
}

## [1] "Unique observations check passed."

Checking for uniqueness

The count of unique analytic_id is 160283.
The number of unique analytic_id is equal to the number of total observation, so there is no repetition of analytic_id. Each observation is indeed unique.

Dropping respondents not in Phase 1 or Phase 2

data <- data %>% 
  filter(phase_coded == "Phase 1" | phase_coded == "Phase 2")

b <- as.character(length(unique(data$`analytic_id`)))

total_count <- data %>% 
  count(phase_coded) %>% 
  summarise(total = sum(n)) %>%
  pull(total)

data %>% 
  count(phase_coded) %>%
  rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
  kable("html", col.names=(c("Phase", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Count
Phase 1	40508
Phase 2	50845
Total	91353

The number of observations before dropping is 160283, and after dropping the number of observations is 91353

Dropping participants started before the experiment started and started after the experiment ended

before_and_after <- data %>% 
  select(analytic_id, MisinfoChat_start_time) %>%
  mutate(start_day = as.Date(MisinfoChat_start_time)) %>%
  filter(start_day < as.Date("2023-11-16") | start_day > as.Date("2024-01-01"))

before_drop_starttime <- as.character(nrow(data))

data <- data %>% 
  filter(!(analytic_id %in% before_and_after$analytic_id))

after_drop_starttime <- as.character(nrow(data))

We noticed that there are 173 participants who started before the experiment started (11/16/2023) and after the end of experiment (12/31/2023). We will drop these participants from the dataset.
Before dropping, the number of observations is 91353, and after dropping the number of observations is 91180

Identifying attempted and actual repeaters

#check for entries with multiple treatment arms
data_repeaters <- data %>%
    select(analytic_id, phase_coded,
           repeater,
           arm, 
           InterArms_start_time, InterArmsVideo_end_time, # video start and end time
           InterArmsLongBase_start_time, InterArmsLongBase_end_time, #long baseline start,end time
           InterArmsBase_start_time, InterArmsBase_end_time, #short baseline start,end time
           InterArmsSMS_start_time, InterArmsSMS_end_time, #sms start,end time
           InterArmsGame_start_time, interArmsGame_end_time, # game start and end time
           MisinfoQuiz_start_time) %>%
    mutate(has_longbase_time = if_else(!is.na(InterArmsLongBase_start_time) | !is.na(InterArmsLongBase_end_time), TRUE, FALSE),
           has_video_time = if_else(!is.na(InterArms_start_time) | !is.na(InterArmsVideo_end_time), TRUE, FALSE),
           has_game_time = if_else(!is.na(InterArmsGame_start_time) | !is.na(interArmsGame_end_time), TRUE, FALSE),
           has_shortbase_time = if_else(!is.na(InterArmsBase_start_time) | !is.na(InterArmsBase_end_time), TRUE, FALSE),
           has_sms_time = if_else(!is.na(InterArmsSMS_start_time) | !is.na(InterArmsSMS_end_time), TRUE, FALSE),
           repeated = has_longbase_time + has_shortbase_time + has_sms_time + has_video_time +has_game_time > 1) %>%
    select(analytic_id, arm, repeater, repeated, phase_coded, MisinfoQuiz_start_time)

data_repeaters = data_repeaters |> 
    mutate(
      actual_repeater = repeated,
      attempted_repeater = case_when(
        repeater == 1 & repeated == 0 ~ TRUE,
        TRUE ~ FALSE
      ),
      repeater = actual_repeater | attempted_repeater
    )

# Count repeaters and attempted repeaters by phase
data_repeaters |> group_by(phase_coded) |> 
  summarize(
    `Actual Repeaters` = sum(actual_repeater),
    `Attempted Repeaters` = sum(attempted_repeater)
  ) |>
  mutate(
    `Total` = `Actual Repeaters` + `Attempted Repeaters`
  ) |>
  kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Phase	Actual Repeaters	Attempted Repeaters	Total
Phase 1	3429	9640	13069
Phase 2	29	9982	10011

Repeaters are classified as attempted repeaters or actual repeaters.
We made a clear distinction between two types of respondents: attempted_repeaters and actual_repeaters. attempted_repeaters refers to those respondents who attempted to interact with the chatbot more than once but were automatically prevented from doing so. On the other hand, actual_repeaters are those respondents who managed to participate in more than one treatment arm before being identified and removed from the dataset.
The actual repeaters are concentrated in Phase 1. When we noticed this was an issue we implemented a fix where the chatbot would prevent users from entering the chatbot for the second time.
We identified actual_repeaters by checking the time stamps of the different treatment arms. If a respondent had a time stamp for more than one treatment arm, they were classified as actual_repeaters.

Excluding repeaters

data <- left_join(
  data,
  data_repeaters %>% select(analytic_id, actual_repeater, attempted_repeater),
  by = "analytic_id",
  relationship = "one-to-one",
  unmatched = "error"
  ) %>%
    filter(!actual_repeater)

data |> group_by(phase_coded) |> 
  summarize(
    `Actual Repeaters` = sum(actual_repeater),
    `Attempted Repeaters` = sum(attempted_repeater)
  ) |>
  mutate(
    `Total` = `Actual Repeaters` + `Attempted Repeaters`
  ) |>
  kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Phase	Actual Repeaters	Attempted Repeaters	Total
Phase 1	0	9640	9640
Phase 2	0	9982	9982

c = as.character(length(unique(data$`analytic_id`)))

Before dropping the actual repeaters, we had 91353 observations. After dropping, we have 87722 observations.

Cleaning treatment arms variable

In Phase 1, there are 4 groups: baseline, sms, video, and game, while in Phase 2, there are 3 groups: short baseline, long baseline, and sms.
Variable arm_coded is the cleaned treatment arm variable.
1. For Phase 1, the treatment variable arm_coded includes Original Baseline, SMS, Video, and Game.
2. For Phase 2, the treatment variable arm_coded includes Original Baseline, Long Baseline, and SMS.

Checking the raw treatment arm variable

Checking the raw arm variable, we have the following entries: baseline, game, longbaseline, sms, video
The empty cell indicates participants who have missing values in the treatment variables. These respondents dropped off before and at the treatment arm assignment.

# checking possible treatment responses 
data %>% 
  count(arm) %>%
  kable("html", col.names=(c("Treatment Arm", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Treatment Arm	Count
	16998
baseline	13705
game	7480
longbaseline	14159
sms	27841
video	7539

data %>% group_by(phase_coded) %>%
  count(arm)%>%
  kable("html", col.names=(c("Phase", "Treatment Arm", "Count"))) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Treatment Arm	Count
Phase 1		6915
Phase 1	baseline	7452
Phase 1	game	7480
Phase 1	longbaseline	2
Phase 1	sms	7553
Phase 1	video	7538
Phase 2		10083
Phase 2	baseline	6253
Phase 2	longbaseline	14157
Phase 2	sms	20288
Phase 2	video	1

We noticed one participant in Phase 2 who had arm == video and two participants in Phase 1 having arm == longbaseline. We’ll drop these participants from the dataset.

Dropping participants with incorrect treatment arm

data <- data %>% filter(!(arm == "video" & phase_coded == "Phase 2"))
data <- data %>% filter(!(arm == "longbaseline" & phase_coded == "Phase 1"))
c1 <- as.character(length(unique(data$`analytic_id`)))

Before dropping we had 87722 observations. After dropping, we have 87719 observations.

Recoding treatment arm variable

# Recoding treatment arm variable 
data <- data %>%
  mutate(
    arm_coded = case_when(
      arm == "baseline" & phase_coded == "Phase 1" ~ "Original Baseline",
      arm %in% c("SMS", "sms") ~ "SMS",
      arm == "video" & phase_coded == "Phase 1" ~ "Video",
      arm == "game" & phase_coded == "Phase 1" ~ "Game",
      arm == "baseline" & phase_coded == "Phase 2" ~ "Original Baseline",
      arm == "longbaseline" & phase_coded == "Phase 2" ~ "Long Baseline",
      TRUE ~ "None"
    )
  )

Count of Participants in Treatment Arm across Phase 1 and Phase 2

data %>%
  group_by(phase_coded, arm_coded) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = arm_coded, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded,
  `Original Baseline`, `Game`, `SMS`, `Video`, `Long Baseline`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Original Baseline	Game	SMS	Video	Long Baseline	Total
Phase 1	7452	7480	7553	7538	0	36938
Phase 2	6253	0	20288	0	14157	50781

Cleaning user country

Variable user_country indicates where the respondents come from
The coded version of user_country is country_coded
Responses are messy with some random links and emojis which are treated as missing values
To proceed, the chatbot required that users select from a pre-defined list of countries or enter one of the options manually. In the later case, the capitalization of the country name was not enforced.
Participants who are not from either Kenya, South Africa, Nigeria, or Ghana are dropped from the dataset

Checking possible country answers

data$user_country <- data$user_country
data %>%
  count(user_country)

##         user_country     n
##               <char> <int>
##   1:                  8888
##   2:        ,nigeria     1
##   3: .\nSouth Africa     1
##   4:  . south Africa     1
##   5:              ..     1
##  ---                      
## 216:  you are stupid     1
## 217:           𝒢𝒽𝒶𝓃𝒶     1
## 218:               😊     1
## 219:            😡😡😡😡     1
## 220:               🤔     1

Cleaning and renaming country responses

# rewriting responses

data = data |> mutate(
  country_coded = case_when(
    tolower(user_country) == "nigeria" ~ "Nigeria",
    tolower(user_country) == "kenya" ~ "Kenya",
    tolower(user_country) == "south africa" ~ "South Africa",
    tolower(user_country) == "ghana" ~ "Ghana",
    TRUE ~ "Other"
  )
)

data %>%
  group_by(phase_coded, country_coded) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = country_coded, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `Ghana`, `Kenya`, `Nigeria`, `South Africa`, `Other`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Ghana	Kenya	Nigeria	South Africa	Other	Total
Phase 1	4404	4278	14168	9691	4397	36938
Phase 2	4668	6117	20461	14107	5428	50781

Dropping participants who are not from either Kenya, South Africa, Nigeria, or Ghana

Respondents who are dropped passed the consent checkpoint but did not come from either four of our countries of interest
ineligible_country is a variable that indicates whether the respondent is from a country other than Kenya, South Africa, Nigeria, or Ghana

data = data |> mutate(
  is_ineligible_country = country_coded == "Other"
)

# Calculate the number of unique participants before and after dropping ineligible countries
c2_before = as.character(length(unique(data$`analytic_id`)))

n_other = (data$is_ineligible_country) |> sum()

data = data |> filter(!is_ineligible_country)

c2 = as.character(length(unique(data$`analytic_id`)))

We dropped 9825 participants who are not from either Kenya, South Africa, Nigeria, or Ghana.
Before dropping, we had 87719 observations. After dropping we have 77894 observations.

Cleaning consent responses

The forced responses in the chatbot are “I consent, start now” for consenting and “No, let’s stop here” for not providing consent
Checking the consent record responses, we noticed that some responses are not in the format of “I consent, start now” (in any case, with or without comma)” and “No, let’s stop here” (in any case, with or without comma). To align to IRB as much as possible, we only considered those who answered “I consent, start now” as consented and those who answered “No, let’s stop here” or other responses as not consented
consent_coded is the cleaned variable for consent responses

Checking possible consent responses

data$consent_coded <- data$consent
data %>% count(consent_coded)

##                                                                                                                                                                                                                                                                      consent_coded
##                                                                                                                                                                                                                                                                             <char>
##   1:                                                                                                                                                                                                                                                                              
##   2:                                                                                                                                                                                                                                                                       . When?
##   3:                                                                                                                                                                                                                                                                    0533521502
##   4:                                                                                                                                                                                                                                                                    0745432220
##   5:                                                                                                                                                                                                                                                                   09130184439
##  ---                                                                                                                                                                                                                                                                              
## 142: https://scontent.xx.fbcdn.net/v/t39.1997-6/851582_369239386556143_1497813874_n.png?stp=dst-png_p100x100&_nc_cat=1&ccb=1-7&_nc_sid=fc3f23&_nc_ohc=U7sp_UFFoegAX99GmqW&_nc_ad=z-m&_nc_cid=0&_nc_ht=scontent.xx&oh=00_AfCAi_Arf-558PO_y_N3FoExzrS1jQz-kRj0A5oKuGKLlA&oe=656E2711
## 143:                                                                                                                                  https://www.facebook.com/permalink.php?story_fbid=pfbid0cYWUZP6pVvFz1tw25oNREeQPSYCurhdJRp4tzvXjATFC7vqoHcb7wRgBsVn3X7xSl&id=100086763992371
## 144:                                                                                                                                                                                                                                    please send in this number\n\n079 446 7436
## 145:                                                                                                                                                                                                                                                                   sure let go
## 146:                                                                                                                                                                                                                                                           yes sure why asking
##          n
##      <int>
##   1:  4731
##   2:     1
##   3:     1
##   4:     1
##   5:     1
##  ---      
## 142:     1
## 143:     1
## 144:     1
## 145:     1
## 146:     1

Cleaning and renaming consent responses

data = data |> mutate(
  consent_coded_num = case_when(
    tolower(consent_coded) %in% c(
      "i consent, start now",
      "i consent start now"
      ) ~ 1,
    tolower(consent_coded) %in% c(
      "no, let's stop here",
      "no let's stop here "
      ) ~ 0,
    TRUE ~ NA_real_
  )
)

n_not_consent = data |> filter(is.na(consent_coded_num) | consent_coded_num == 0) |> nrow()

data %>%
  group_by(phase_coded, consent_coded_num) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = consent_coded_num, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `I consent, start now` = `1`, `No, let's stop here` = `0`, `Other/NA` = `NA`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	I consent, start now	No, let’s stop here	Other/NA	Total
Phase 1	29907	1000	1634	32541
Phase 2	40646	1419	3288	45353

data <- data %>% filter(consent_coded_num == 1 & !is.na(consent_coded_num))

We now drop the 7341 participants who did not consent to the study. Having dropped the participants who did not explicitely consent to the study, we now have 70553 observations.

Checking payment amount for each country

The actual payment amount for participation was only recorded for Phase 2 participants by variable user_payment_amount
In Version 1 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 20R in South Africa, and 800N in Nigeria. In Version 2 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 10R in South Africa, and 400N in Nigeria
The variable payment_amount_coded is manually created to contain the payment amount in both Phase 1 and Phase 2

# manually entered payment amount for each country in Phase 1 and Phase 2 

data <- data |>
  mutate(
    payment_amount_coded = case_when(
      subphase == "1A" & country_coded == "Ghana" ~ "₵12.5",
      subphase == "1A" & country_coded == "Kenya" ~ "150Ksh",
      subphase == "1A" & country_coded == "South Africa" ~ "R20",
      subphase == "1A" & country_coded == "Nigeria" ~ "₦800",
      subphase == "1B" & country_coded == "Ghana" ~ "₵12.5",
      subphase == "1B" & country_coded == "Kenya" ~ "150Ksh",
      subphase == "1B" & country_coded == "South Africa" ~ "R10",
      subphase == "1B" & country_coded == "Nigeria" ~ "₦400",
      TRUE ~ user_payment_amount),
    payment_condition_dollar = case_when(
      user_payment_condition == "high" ~ 3,
      user_payment_condition == "low" ~ .5,
      user_payment_condition == "medium" ~ 1,
      subphase == "1A"  ~ 1,
      subphase == "1B" & country_coded == "Ghana" ~ 1,
      subphase == "1B" & country_coded == "Kenya" ~ 1,
      subphase == "1B" & country_coded == "South Africa" ~ .5,
      subphase == "1B" & country_coded == "Nigeria" ~ .5,
      TRUE ~ NA)
    )
  
# assert that there are no NAs in those variables

if(any(is.na(data$payment_amount_coded)) | any(is.na(data$payment_condition_dollar))) {
  stop("There are NAs in payment_amount_coded or payment_condition_dollar")
} else {
  print("Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar")
}

## [1] "Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar"

data |>
  group_by(phase_coded, payment_condition_dollar) |>
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = payment_condition_dollar, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `High` = `3`, `Medium` = `1`, `Low` = `0.5`, `Total`) |>
  kable("html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  scroll_box(width = "100%")

Phase	High	Medium	Low	Total
Phase 1	0	15849	14058	29907
Phase 2	9333	9771	21542	40646

Stratas

data = data |> mutate(
  strata = paste(subphase, country_coded, payment_condition_dollar, sep = "_")
)

Checking manipulative / reliable order

In the quiz, for each post, every user first saw the question about their intention to share. Then, they were asked two questions regarding their assessment of the manipulativeness and reliability of these posts. The order in which these questions were asked was randomized at the user level when they started the quiz. Users who never participated in the quiz will therefore have a missing value

data %>%
  group_by(phase_coded, manipulative_reliable_order) %>%
  summarise(n = n()) |>
  group_by(phase_coded) |>
  mutate(Total = sum(n)) |>
  pivot_wider(names_from = manipulative_reliable_order, values_from = n, values_fill = 0) |>
  select(
  `Phase` = phase_coded, `Manipulative First` = `1`, `Reliable First` = `2` , `Missing` = `NA`, `Total`) |>
  kable("html") %>% 
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")

Phase	Manipulative First	Reliable First	Missing	Total
Phase 1	10378	10194	9335	29907
Phase 2	15453	15559	9634	40646

Checking time variables

MisinfoChat_start_time : time stamp for starting the chatbot
InterArms_start_time and InterArmsVideo_end_time: time stamp for starting and ending the video intervention
InterArmsGame_start_time and interArmsGame_end_time: time stamp for starting and ending the game intervention
InterArmsSMS_start_time and InterArmsSMS_end_time: time stamp for starting and ending the SMS intervention
InterArmsBase_start_time and InterArmsBase_end_time: time stamp for starting and ending the baseline intervention
InterArmsLongBase_start_time and InterArmsLongBase_end_time: time stamp for starting and ending the long baseline intervention
MisinfoQuiz_start_time and misinfoQuiz_end_time: time stamp for starting and ending the Misinfo quiz
baseline_duration: time duration between the start and the end of the orginal baseline in minutes
video_duration: time duration between the start and the end of the video intervention in minutes
game_duration: time duration between the start and the end of the game intervention in minutes
sms_duration: time duration between the start and the end of the SMS intervention in minutes
longbase_duration: time duration between the start and the end of the long baseline in minutes
intervention_duration: time duration between the start and the end of the intervention in minutes
quiz_duration: time duration between the start and the end of the misinformation quiz in minutes
total_duration: time duration between the start of the chatbot and the end of the misinformation quiz in minutes

# calculating duration 
data <- data %>%
  mutate(
    baseline_duration = as.numeric(difftime(InterArmsBase_end_time, InterArmsBase_start_time, units = "mins")),
    video_duration = as.numeric(difftime(InterArmsVideo_end_time, InterArms_start_time, units = "mins")),
    game_duration = as.numeric(difftime(interArmsGame_end_time, InterArmsGame_start_time, units = "mins")),
    sms_duration = as.numeric(difftime(InterArmsSMS_end_time, InterArmsSMS_start_time, units = "mins")),
    longbase_duration = as.numeric(difftime(InterArmsLongBase_end_time, InterArmsLongBase_start_time, units = "mins")),
    misinfo_quiz_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoQuiz_start_time, units = "mins")),
    total_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoChat_start_time, units = "mins"))
    ) %>%
  mutate(intervention_duration = case_when(
    arm_coded == "Video" ~ video_duration,
    arm_coded == "Game" ~ game_duration,
    arm_coded == "SMS" ~ sms_duration,
    arm_coded == "Original Baseline" ~ baseline_duration,
    arm_coded == "Long Baseline" ~ longbase_duration,
    TRUE ~ NA
  )
  )

data <- data |> mutate(
  is_abnormal_time = ((intervention_duration < 0) & !is.na(intervention_duration) ) | ((misinfo_quiz_duration < 0)) & !is.na(misinfo_quiz_duration) | ((total_duration < 0)) & !is.na(total_duration)) 

# removing missing values from duration (for those who did not finish the intervention)
is_abnormal_time = as.character(sum(data$is_abnormal_time, na.rm = TRUE))

data <- data |> filter(!is_abnormal_time)

Here we checked for negative time taken to complete the intervention and the quiz. We noticed 198 time values and droped these observations from the dataset. After dropping, we have 70355 observations.

Cleaning and creating dropoff variable

Here we create flags for each stage of the chatbot and quiz to track the progress of the participants. We then create a dropoff variable to track the stage at which participants dropped off.

started_coded_num: 1 if the participant started the chatbot, 0 otherwise
consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise
started_arm_coded_num: 1 if the participant started any of the interventions, 0 otherwise
completed_arm_coded_num: 1 if the participant completed any of the interventions, 0 otherwise
quiz_started_coded_num: 1 if the participant started the quiz, 0 otherwise
quiz_consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise
quiz_completed_coded_num: 1 if the participant completed the quiz, 0 otherwise
completed_all_coded_num: 1 if the participant completed all, 0 otherwise

data <- data %>%
  mutate(started_coded_num = ifelse(!is.na(MisinfoChat_start_time), 1, 0),
         started_arm_coded_num = ifelse(!is.na(InterArms_start_time) | !is.na(InterArmsSMS_start_time) | !is.na(InterArmsGame_start_time) | !is.na(InterArmsBase_start_time) | !is.na(InterArmsLongBase_start_time), 1, 0),
         completed_arm_coded_num = ifelse(!is.na(intervention_duration), 1, 0),
         quiz_started_coded_num = ifelse(!is.na(MisinfoQuiz_start_time), 1, 0),
         quiz_consent_coded_num = case_when(
          quiz_consent == "I acknowledge" ~ 1,
          TRUE ~ 0),
         quiz_completed_coded_num = ifelse(!is.na(misinfoQuiz_end_time), 1, 0),
         completed_all_coded_num = ifelse(!is.na(completed), 1, 0))

Creating dropoff variable

stage provides the location where participants dropped off from the chatbot

data <- data |>
  rowwise() |>
  mutate(
    stage_reached_num = sum(
      started_coded_num,
      started_arm_coded_num,
      completed_arm_coded_num,
      quiz_consent_coded_num,
      quiz_started_coded_num,
      quiz_completed_coded_num,
      completed_all_coded_num),
    stage_reached_char = case_when(
      stage_reached_num == 7 ~ "7-Completed all",
      stage_reached_num == 6 ~ "6-Completed quiz",
      stage_reached_num == 5 ~ "5-Consented to quiz",
      stage_reached_num == 4 ~ "4-Reached quiz section",
      stage_reached_num == 3 ~ "3-Completed intervention",
      stage_reached_num == 2 ~ "2-Started intervention",
      stage_reached_num == 1 ~ "1-Started chatbot",
      TRUE ~ "Did not start"
  ))


data |> group_by(phase_coded, stage_reached_num, stage_reached_char) |>
  summarize(N = n()) |> 
  arrange(phase_coded, desc(stage_reached_num)) |>
  group_by(phase_coded) |>
  mutate(Total = cumsum(N)) |>
  arrange(phase_coded, stage_reached_num) |>
  mutate(
    `Reached here (% Total)` = round(Total / Total[1] * 100,2),
    `Reached here (% Previous)` = round(Total / lag(Total) * 100,2),
    `Finished here (% Total)` = round(N / Total[1] * 100,2),
    `Finished here (% Previous)` = round(N / lag(Total) * 100,2)
  ) |>
  select(
    `Phase` = phase_coded,
    `Stage Reached` = stage_reached_char,
    `Reached here (n)` = Total,
    `Reached here (% Total)`,
    `Reached here (% Previous)`,
    `Finished here` = N,
    `Finished here (% Total)`,
    `Finished here (% Previous)`
  ) |>
  kable("html") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Phase	Stage Reached	Reached here (n)	Reached here (% Total)	Reached here (% Previous)	Finished here	Finished here (% Total)	Finished here (% Previous)
Phase 1	2-Started intervention	29727	100.00	NA	9029	30.37	NA
Phase 1	3-Completed intervention	20698	69.63	69.63	2	0.01	0.01
Phase 1	4-Reached quiz section	20696	69.62	99.99	306	1.03	1.48
Phase 1	5-Consented to quiz	20390	68.59	98.52	3569	12.01	17.24
Phase 1	6-Completed quiz	16821	56.58	82.50	289	0.97	1.42
Phase 1	7-Completed all	16532	55.61	98.28	16532	55.61	98.28
Phase 2	1-Started chatbot	40628	100.00	NA	365	0.90	NA
Phase 2	2-Started intervention	40263	99.10	99.10	8880	21.86	21.86
Phase 2	4-Reached quiz section	31383	77.24	77.95	390	0.96	0.97
Phase 2	5-Consented to quiz	30993	76.28	98.76	3837	9.44	12.23
Phase 2	6-Completed quiz	27156	66.84	87.62	317	0.78	1.02
Phase 2	7-Completed all	26839	66.06	98.83	26839	66.06	98.83

Checking quiz completers and treatment groups to see if NA entrants

c4 = as.character(length(unique(data$`analytic_id`)))

abnormal_quiz_completers <- data %>%
  filter(quiz_completed_coded_num == 1) %>%
  select(arm_coded, analytic_id, phase_coded, quiz_completed_coded_num, intervention_duration) %>%
  filter(is.na(arm_coded))

is_abnormal_quiz_completers <- as.character(nrow(abnormal_quiz_completers))

data <- data %>%
  filter(!analytic_id %in% abnormal_quiz_completers$analytic_id)

d <- as.character(length(unique(data$`analytic_id`)))

Noticed 0 participants who completed the quiz but did not have an recorded entry for treatment assignment. They did not receive any intervention, so we dropped them from the our dataset.
Before dropping, we had 70355 number of observations. After droppping, we have 70355 number of observations

Cleaning post-level variables and outcomes

Three variables: sharing, manipulative, and reliable scores are recorded for each post. We shall reshape them individually. While merging we shall impose a one-to-one relationship to ensure that we do not lose any data.

# Reshape post ids
data_long <- data %>% 
  select(starts_with("post_id"), analytic_id, phase_coded, arm_coded, attention_check_passed, quiz_completed_coded_num) %>%
  select(-ends_with("_followup")) %>%
  pivot_longer(
    cols=starts_with("post_id"),
    names_to="post_number",
    values_to="post_id",
    names_prefix="post_id_"
    )

# Reshape manipulative scores
manipulative_score <- data %>% 
  select(starts_with("post_id"), contains(c("_manipulative")), analytic_id) %>%
  select(-ends_with(c("_followup", "_reliable_order"))) %>%
  pivot_longer(
    cols = contains("_manipulative"),
    names_to = "post_number",
    values_to = "manipulative_score",
    names_pattern = "post_(.*)_manipulative"
  ) %>%
  select(post_number, analytic_id, manipulative_score)

# Reshape reliable scores
reliable_score <- data %>% 
  select(starts_with("post_id"), contains(c("_reliable")), analytic_id) %>%
  select(-ends_with(c("_followup", "_reliable_order"))) %>%
  pivot_longer(
    cols = contains("_reliable"),
    names_to = "post_number",
    values_to = "reliable_score",
    names_pattern = "post_(.*)_reliable"
  ) %>%
  select(post_number, analytic_id, reliable_score)

# Reshape share scores
share_score <- data %>%
  select(starts_with("post_id"), ends_with("_share"), analytic_id) %>%
  pivot_longer(
    cols = contains("_share"),
    names_to = "post_number",
    values_to = "share_score",
    names_pattern = "post_(.*)_share"
  ) %>%
  select(post_number, analytic_id, share_score)


# Merging reshaped variables

data_long = data_long |>
  left_join(manipulative_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  ) |>
  left_join(reliable_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  ) |>
  left_join(share_score,
    by = c("analytic_id", "post_number"),
    relationship = "one-to-one",
    unmatched = "error"
  )

# converting to numeric responses
    # for manipulative and reliable scores: 1 = not at all, 5 = very
    # for share score: 1 = yes, 0 = no


data_long = data_long |> 
  mutate(
    manipulative_score = parse_number(manipulative_score),
    reliable_score = parse_number(reliable_score),
    share_score = case_when(
      tolower(share_score) == "yes" ~ 1,
      tolower(share_score) == "no" ~ 0,
      TRUE ~ NA_real_
    )
  )

Checking abnormal reliable and manipulative scores

abnormal_manipulativescore or abnormal_reliablescore responses for manipulative and reliable scores are not in {1,2,3,4,5}.

data_long <- data_long |>
  mutate(
    abnormal_manipulativescore = ifelse(!(manipulative_score %in% 1:5) & !is.na(manipulative_score), 1, 0),
    abnormal_reliablescore = ifelse(!(reliable_score %in% 1:5) & !is.na(reliable_score), 1, 0),
    manipulative_score = ifelse(abnormal_manipulativescore == 1, NA_real_, manipulative_score) |> as.integer(),
    reliable_score = ifelse(abnormal_reliablescore == 1, NA_real_, reliable_score) |> as.integer()
  )

abnormal_scores = data_long |> 
  filter(abnormal_manipulativescore == 1 | abnormal_reliablescore == 1) |> 
  distinct(analytic_id)


data_long = data_long |> filter(!analytic_id %in% abnormal_scores$analytic_id)

We dropped quiz answers of 73 participants who had abnormal manipulative or reliable scores in any of the questions. After dropping, we have 421692 user-post observations.

Checking attention check pass

Here we are checking attention check pass across two phases. - In Phase 1, the attention check was confusingly (arguably, incorrectly) worded and required users to respond in a specific way to three questions in a row based on a single prompt. This resulted in a very low pass rate. - In Phase 2 we used a simpler attention check that simply required users to select one option (either 2 or 4, randomized) from a set of {1, 2, 3, 4, 5}.

# attention check pass variable

# Attention check in Phase 2 was built in the Chatbot and was a single question with 5 options.
data_attention_2 = data |> filter(
  phase_coded == "Phase 2") |>
  mutate(
    attention_check_passed = case_when(
      attention_check_passed == "TRUE" ~ 1,
      attention_check_passed == "FALSE" ~ 0,
      TRUE ~ NA_real_
    )
  ) |>
  select(analytic_id, attention_check_passed)

# Attention check in Phase 1 required users to respond in a specific way to three questions in a row based on a single prompt. Passing was not automatically recorded in the data, so we manually coded it.
data_attantion_1 = data_long |>
  filter(phase_coded == "Phase 1") |>
  filter(post_id %in% c("attention_check_1", "attention_check_2")) |>
  mutate(
    attention_check_passed = case_when(
      post_id == "attention_check_1" & manipulative_score == 4 & reliable_score == 4 & share_score == 1 ~ 1,
      post_id == "attention_check_2" & manipulative_score == 2 & reliable_score == 2 & share_score == 0 ~ 1,
      post_id == "attention_check_1" & !(manipulative_score == 4 & reliable_score == 4 & share_score == 1) ~ 0, 
      post_id == "attention_check_2" & !(manipulative_score == 2 & reliable_score == 2 & share_score == 0) ~ 0,
      TRUE ~ NA_real_
    )
  ) |>
  select(analytic_id, attention_check_passed)

data_attention <- bind_rows(data_attention_2, data_attantion_1)


data_attention |> left_join(
    data |> select(analytic_id, phase_coded),
    by = "analytic_id",
    relationship = "one-to-one"
  ) |>
  group_by(phase_coded) |>
  summarize(
    `Attention Check Passed` = sum(attention_check_passed, na.rm = TRUE),
    `Attention Check Non-missing` = sum(!is.na(attention_check_passed)),
    `Pass Rate (%)` = round(`Attention Check Passed` / `Attention Check Non-missing` * 100, 2)
  ) |>
  rename(`Phase` = phase_coded) |>
  kable("html") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Phase	Attention Check Passed	Attention Check Non-missing	Pass Rate (%)
Phase 1	493	18363	2.68
Phase 2	23742	29073	81.66

Inspecting missing values

We are cautious of missing values in the quiz among quiz completers. We will now evaluate the issue of missingness
We will start by dropping the attention check posts from the long dataset as they were coded differently in Phase 2 (and we analyzed them above); we will also drop those that did not complete the quiz (158040 out of 421692 user-posts`).

# Creating post_type for each post
data_long <- data_long |>
  mutate(post_type = case_when(
    str_detect(post_id, "non_misinfo") ~ "non_misinfo",
    str_detect(post_id, "discrediting") ~ "discrediting",
    str_detect(post_id, "false") ~ "false_dichotomy",
    str_detect(post_id, "attention_check") ~ "attention_check",
    str_detect(post_id, "emotional") ~ "emotional",
    TRUE ~ NA
  ))

data_long = data_long |> filter(!post_type == "attention_check") |> filter(quiz_completed_coded_num == 1)

Checking missing values in post id

data_long |> group_by(post_id) |> summarize(n = n()) |> arrange(desc(n)) |> kable("html") |> kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(width = "100%")

post_id	n
emotional_28	22051
non_misinfo_10	22039
false_dichotomy_21	21964
non_misinfo_6	21944
discrediting_17	21938
discrediting_14	21902
false_dichotomy_23	21876
non_misinfo_8	21876
non_misinfo_2	21821
emotional_27	21789

valid_posts = c(
  "emotional_28", "non_misinfo_10", "false_dichotomy_21",
  "non_misinfo_6", "discrediting_17", "discrediting_14",
  "non_misinfo_8", "false_dichotomy_23", "non_misinfo_2", "emotional_27"
  )


abnormal_posts = data_long |>
  filter(is.na(post_id) | !(post_id %in% valid_posts)) |>
  distinct(analytic_id) |>
  mutate(abnormal_posts = 1)

data_long = data_long |> filter(!analytic_id %in% abnormal_posts$analytic_id)

data = data |> left_join(
  abnormal_posts,
  by = "analytic_id"
  ) |>
  mutate(abnormal_posts = ifelse(is.na(abnormal_posts), 0, abnormal_posts)) |>
  filter(abnormal_posts == 0)

The valid post ids, which correspond to stimuli used in the survey are: emotional_28, non_misinfo_10, false_dichotomy_21, non_misinfo_6, discrediting_17, discrediting_14, non_misinfo_8, false_dichotomy_23, non_misinfo_2, emotional_27. We shall drop all users who have at least one missing value in the post_id variable. We shall also drop all users who have at least one invalid post_id value. A total of 0 users are dropped.

missing_treatment_arm <- data_long |>
  filter(arm_coded == "None") |>
  select(analytic_id) |>
  distinct() |>
  mutate(missing_treatment_arm = 1)

data_long = data_long |> filter(!analytic_id %in% missing_treatment_arm$analytic_id)

data = data |> left_join(
  missing_treatment_arm,
  by = "analytic_id"
  ) |>
  mutate(missing_treatment_arm = ifelse(is.na(missing_treatment_arm), 0, missing_treatment_arm)) |>
  filter(missing_treatment_arm == 0)

Finally, we shall drop all users who have missing treatment arm values, a total of 0 users.

Missing values per phase

Note, in Phase 1 we initially did not enforce that the users select from one of the given options. This results in the relatively higher number of missing values in Phase 1.

By users

data_long |> 
  group_by(phase_coded, analytic_id) |>
  summarize(
    `User has any missing values` = any(is.na(share_score)) | any(is.na(reliable_score)) | any(is.na(manipulative_score)),
    `User has all missing values` = all(is.na(share_score)) & all(is.na(reliable_score)) & all(is.na(manipulative_score)),
    `User has no missing values` = !any(is.na(share_score)) & !any(is.na(reliable_score)) & !any(is.na(manipulative_score))
    ) |>
  group_by(phase_coded) |>
  summarize(
    `Total Users` = n(),
    `Users with any missing values` = sum(`User has any missing values`),
    `Users with all missing values` = sum(`User has all missing values`),
    `Users with no missing values` = sum(`User has no missing values`)
  ) |>
  kable("html", caption = "Users with missing values, per phase") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Users with missing values, per phase
phase_coded	Total Users	Users with any missing values	Users with all missing values	Users with no missing values
Phase 1	16680	987	0	15693
Phase 2	27160	46	0	27114

By user-posts

data_long |>
  group_by(phase_coded) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per phase") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |> 
  scroll_box(width = "100%")

Missing values per phase
phase_coded	Total	Missing Share Score	Missing Reliable Score	Missing Manipulative Score	Missing All Scores	All Scores Non-missing
Phase 1	83400	463	558	510	34	82107
Phase 2	135800	35	32	37	22	135745

Missing values per post order

We now look into whether the missing values are concentrated in a particular post order.

data_long |>
  group_by(post_number) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per post order") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Missing values per post order
post_number	Total	Missing Share Score	Missing Reliable Score	Missing Manipulative Score	Missing All Scores	All Scores Non-missing
1	36466	88	108	93	5	36215
2	36568	69	84	83	8	36374
3	36555	76	87	83	8	36349
4	36461	63	86	86	9	36269
5	36495	87	99	100	14	36270
6	36655	115	126	102	12	36375

Missing values per post id

The missing values are equally distributed across the post ids.

data_long |>
  group_by(post_id) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per post id") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Missing values per post id
post_id	Total	Missing Share Score	Missing Reliable Score	Missing Manipulative Score	Missing All Scores	All Scores Non-missing
discrediting_14	21902	46	62	56	6	21773
discrediting_17	21938	57	53	53	9	21806
emotional_27	21789	52	65	54	7	21653
emotional_28	22051	49	65	67	6	21905
false_dichotomy_21	21964	44	62	58	6	21828
false_dichotomy_23	21876	51	61	67	7	21729
non_misinfo_10	22039	36	56	53	5	21922
non_misinfo_2	21821	42	45	44	0	21698
non_misinfo_6	21944	69	56	45	3	21799
non_misinfo_8	21876	52	65	50	7	21739

Missing values per phase and treatment arm

The missing values are equally distributed across the treatment arms within each phase.

data_long |>
  group_by(phase_coded, arm_coded) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per phase and treatment arm") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Missing values per phase and treatment arm
phase_coded	arm_coded	Total	Missing Share Score	Missing Reliable Score	Missing Manipulative Score	Missing All Scores	All Scores Non-missing
Phase 1	Game	17265	99	132	125	5	16972
Phase 1	Original Baseline	24225	128	153	159	11	23864
Phase 1	SMS	21415	115	150	121	8	21078
Phase 1	Video	20495	121	123	105	10	20193
Phase 2	Long Baseline	45865	8	7	10	6	45853
Phase 2	Original Baseline	21355	5	5	6	4	21347
Phase 2	SMS	68580	22	20	21	12	68545

Missing values per attention check

data_long |>
  group_by(attention_check_passed) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per attention check") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Missing values per attention check
attention_check_passed	Total	Missing Share Score	Missing Reliable Score	Missing Manipulative Score	Missing All Scores	All Scores Non-missing
FALSE	23080	4	5	5	2	23072
TRUE	112705	28	25	30	18	112661
NA	83415	466	560	512	36	82119

Calculating mean scores for each participant

Note that we will use a restrictive sample definition, where we only consider the users with no missing values. As a result we lose 1033 users out of 43840 users who completed the quiz. Vast majority of these users are from Phase 1. Concretely, for any user with one or more missing values in the quiz questions, we drop all their answers. These users will continue to be tracked in the dataset, but we will not use their answers in the survey analysis. In other words, for users who have at least one missing value in the main survey quiz questions, we will treat them as not completing the main survey quiz, but not dropping them.

Variable definitions:

mean_share_misinfo: mean (across posts) intention to share misinformation posts for a given user; range [0, 1]
mean_reliable_misinfo: mean (across posts) reliable score for misinformation posts for a given user; range [1, 5]
mean_manipulative_misinfo: mean (across posts) manipulative score for misinformation posts for a given user; range [1, 5]
mean_share_nonmisinfo: mean (across posts) intention to share non-misinformation post for a given user; range [0, 1]
mean_reliable_nonmisinfo: mean (across posts) reliable score for non-misinformation posts for a given user; range [1, 5]
mean_manipulative_nonmisinfo: mean (across posts) manipulative score for non-misinformation posts for a given user; range [1, 5]
sharing_discernment: difference in mean share score between non-misinformation and misinformation posts for a given user; range [-1, 1]
reliability_discernment: difference in mean reliable score between non-misinformation and misinformation posts for a given user; range [-4, 4]
manipulation_discernment: difference in mean manipulative score between misinformation and non-misinformation posts for a given user; range [-4, 4]

Note that the discernment scores are constructed in such a way that a positive value indicates better discernment. For example, positive value for reliability discernment indicates that user scored non-misinformation posts as more reliable than misinformation posts. On the other hand, positive monipulative discernment indicates that user scored non-misinformation posts as less manipulative than misinformation posts.

# Find users with missing answers
missing_scores = data_long |>
  group_by(analytic_id) |>
  summarize(
    any_missing_share = any(is.na(share_score)),
    any_missing_reliable = any(is.na(reliable_score)),
    any_missing_manipulative = any(is.na(manipulative_score)),
    any_missing = any(is.na(share_score) | is.na(reliable_score) | is.na(manipulative_score))
  ) |>
  filter(any_missing)

# Drop answers of users with missing scores
data_long = data_long |> filter(!(analytic_id %in% missing_scores$analytic_id))


# Compute outcomes
outcomes_misinfo = data_long |>
  filter(post_type %in% c("discrediting", "false_dichotomy", "emotional")) |>
  group_by(analytic_id) |>
  summarize(
    mean_share_misinfo = mean(share_score),
    mean_reliable_misinfo = mean(reliable_score),
    mean_manipulative_misinfo = mean(manipulative_score)
  )

outcomes_nonmisinfo = data_long |>
  filter(post_type == "non_misinfo") |>
  group_by(analytic_id) |>
  summarize(
    mean_share_nonmisinfo = mean(share_score),
    mean_reliable_nonmisinfo = mean(reliable_score),
    mean_manipulative_nonmisinfo = mean(manipulative_score),
  )

outcomes = outcomes_misinfo |>
  left_join(outcomes_nonmisinfo,
  by = "analytic_id",
  relationship = "one-to-one",
  unmatched = "error"
  )


outcomes = outcomes |>
  mutate(
    sharing_discernment = mean_share_nonmisinfo - mean_share_misinfo,
    reliability_discernment = mean_reliable_nonmisinfo - mean_reliable_misinfo,
    manipulation_discernment = mean_manipulative_misinfo - mean_manipulative_nonmisinfo
  )

# Check that there are no missing values in the outcomes

if (any(is.na(outcomes))) {
  stop("There are missing values in the outcomes")
}

For the purpose of our analysis, we dropped participants with at least one missing value in share_score, manipulative_score, and reliability_scores.
We dropped 1033 participants who had at least one missing value in the scores. After dropping, we have 214035 user-post observations.

Checking missing values in attention check passed

data_long |>
  group_by(attention_check_passed) |>
  summarize(
    `Total` = n(),
    `Missing Share Score` = sum(is.na(share_score)),
    `Missing Reliable Score` = sum(is.na(reliable_score)),
    `Missing Manipulative Score` = sum(is.na(manipulative_score)),
    `Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
    `All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
  ) |>
  kable("html", caption = "Missing values per attention check") |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
  scroll_box(width = "100%")

Missing values per attention check
attention_check_passed	Total	All Scores Non-missing
FALSE	23045	23045
TRUE	112515	112515
NA	78475	78475

Generating long dataset

data_long_final <- data_long %>%
  select(analytic_id, post_id, post_number, post_type, share_score, reliable_score, manipulative_score, phase_coded, quiz_completed_coded_num, arm_coded)

# labeling   
library(Hmisc)


label(data_long_final$analytic_id) <- "Analytic ID"
label(data_long_final$post_id) <- "Post ID"
label(data_long_final$post_number) <- "Post Order Number"
label(data_long_final$post_type) <- "Post Type"
label(data_long_final$share_score) <- "Share Score"
label(data_long_final$reliable_score) <- "Reliable Score"
label(data_long_final$manipulative_score) <- "Manipulative Score"
label(data_long_final$phase_coded) <- "Phase"
label(data_long_final$quiz_completed_coded_num) <- "Quiz Completion"
label(data_long_final$arm_coded) <- "Treatment arm"

var_names <- names(data_long_final)

# Extracting labels
var_labels <- sapply(data_long_final, label)

# Combine into a data frame
label_info_long <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)

write.csv(label_info_long, "./data/chatfuel/processed/long_data_variable.csv", row.names = FALSE)

fwrite(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.csv.gz")

saveRDS(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.rds")

Long dataset is exported as misinfo_clean_long.csv
The dataset contains the following variables:
- analytic_id: unique identifier for each participant
- post_id: unique identifier for each post
- post_number: order of the post
- post_type: type of the post
- share_score: score for sharing discernment
- reliable_score: score for reliability discernment
- manipulative_score: score for manipulation discernment
- phase_coded: phase of the study
- quiz_completed_coded_num: 1 if the participant completed the quiz, 0 otherwise
- arm_coded: treatment arm
The dataset is also saved as misinfo_clean_long.rds for future use.
The dataset contains 214035 user-post observations corresponding to 42807 participants.

Generating wide dataset

data_wide = data |>
  select(
    analytic_id, phase_coded, country_coded, payment_amount_coded, payment_condition_dollar, arm_coded,
    MisinfoChat_start_time, total_duration, intervention_duration, 
    misinfo_quiz_duration, misinfoQuiz_end_time, MisinfoQuiz_start_time,
    missing_treatment_arm, abnormal_posts, stage_reached_num,
    stage_reached_char, consent_coded_num, manipulative_reliable_order,
    subphase, strata
  ) |> left_join(
    outcomes,
    by = "analytic_id",
    relationship = "one-to-one"
  ) |> left_join(
    data_attention,
    by = "analytic_id",
    relationship = "one-to-one"
  )

data_wide = data_wide |> 
  mutate(
    quiz_completed = !any(is.na(manipulative_reliable_order), is.na(mean_share_misinfo), is.na(mean_reliable_misinfo), is.na(mean_manipulative_misinfo), is.na(mean_share_nonmisinfo), is.na(mean_reliable_nonmisinfo), is.na(mean_manipulative_nonmisinfo))
  )

# Recode attention check to NA for those who did not complete the quiz
data_wide = data_wide |> 
  mutate(
    attention_check_passed = ifelse(quiz_completed, attention_check_passed, NA_real_)
  )

# Assert number of those who completed the quiz corresponds to the number of users in long dataset

if (sum(data_wide$quiz_completed) != nrow(data_long_final |> distinct(analytic_id))) {
  stop("Number of users in wide dataset does not correspond to the number of users in long dataset")
}

# Rename to avoid confusion 

data_wide$quiz_completed_coded <- data_wide$quiz_completed

#labeling 

label(data_wide$analytic_id) <- "Analytic ID"
label(data_wide$phase_coded) <- "Phase"
label(data_wide$country_coded) <- "Country"
label(data_wide$stage_reached_num) <- "Stage Reached (Numeric, 1-7; 7 = Completed initial chatbot)"
label(data_wide$stage_reached_char) <- "Stage Reached"
label(data_wide$payment_amount_coded) <- "Payment Amount"
label(data_wide$arm_coded) <- "Treatment Arm"
label(data_wide$intervention_duration) <- "time duration between the start and the end of the intervention in minutes"
label(data_wide$total_duration) <- "time duration between the start of the chatbot and the end of the misinformation quiz in minutes"
label(data_wide$misinfo_quiz_duration) <- "time duration between the start and the end of the misinformation quiz in minutes"
label(data_wide$misinfoQuiz_end_time) <- "Misinfo Quiz End Time"
label(data_wide$MisinfoQuiz_start_time) <- "Misinfo Quiz Start Time"
label(data_wide$consent_coded_num) <- "Consent: 1 for I consent, start now, 0 for No let's stop here"
label(data_wide$attention_check_passed) <- "Attention Check Passed; 1 = PASS; 0 = FAIL"
label(data_wide$manipulative_reliable_order) <- "Order of Manipulative and Reliable Questions"
label(data_wide$mean_share_misinfo) <- "Mean Share Score for Misinformation Posts"
label(data_wide$mean_reliable_misinfo) <- "Mean Reliable Score for Misinformation Posts"
label(data_wide$mean_manipulative_misinfo) <- "Mean Manipulative Score for Misinformation Posts"
label(data_wide$mean_share_nonmisinfo) <- "Mean Share Score for Non-Misinformation Posts"
label(data_wide$mean_reliable_nonmisinfo) <- "Mean Reliable Score for Non-Misinformation Posts"
label(data_wide$mean_manipulative_nonmisinfo) <- "Mean Manipulative Score for Non-Misinformation Posts"
label(data_wide$sharing_discernment) <- "Discernment in Sharing Misinformation and Non-Misinformation Posts"
label(data_wide$reliability_discernment) <- "Discernment in Reliability of Misinformation and Non-Misinformation Posts"
label(data_wide$manipulation_discernment) <- "Discernment in Manipulation of Misinformation and Non-Misinformation Posts"
label(data_wide$MisinfoChat_start_time) <- "Misinformation Chatbot Start Time"
label(data_wide$subphase) <- "Sub-Phase (for Stratification)"
label(data_wide$strata) <- "Subphase-Country-Payment Strata"

label(data_wide$quiz_completed_coded) <- "Quiz Completion: 1 = Completed, 0 = Did not complete"

var_names <- names(data_wide)

# Extracting labels
var_labels <- sapply(data_wide, label)

# Combine into a data frame
label_info_wide <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)

write.csv(label_info_wide, "./data/chatfuel/processed/wide_data_variable.csv", row.names = FALSE)

# Save data to csv.gz
fwrite(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.csv.gz")

# Save data to .rds
saveRDS(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.rds")

#### Save to stata format
library(haven)

# shorten variable names to first 30 characters to fit stata limit
data_wide_stata = data_wide |>
  rename_all(~str_replace_all(., "followup", "fllwp")) |>
  rename_all(~str_replace_all(., "reflection", "refl")) |>
  rename_all(~str_replace_all(., "manipulative", "mani")) |>
  rename_all(~str_replace_all(., "reliable", "reli")) |>
  rename_all(~str_replace_all(., "reflection", "refl")) |>
  rename_all(~str_replace_all(., "discernment", "disc")) |>
  rename_all(~str_replace_all(., "misinfo", "mis")) |>
  rename_all(~str_replace_all(., "nonmisinfo", "nonmis")) |>
  rename_all(~str_replace_all(., "share", "shre")) |>
  rename_all(~str_replace_all(., "abnormal", "abn")) |>
  rename_all(~str_replace_all(., "completed", "compl")) |>
  rename_all(~str_replace_all(., "inter", "int"))

write_dta(data_wide_stata, "./data/chatfuel/processed/misinfo_clean_wide.dta")

Wide dataset is exported as misinfo_clean_wide.csv
The dataset contains the following variables:
- analytic_id: unique identifier for each participant
- phase_coded: phase of the study
- country_coded: country of the participant
- payment_amount_coded: payment amount
- arm_coded: treatment arm
- MisinfoChat_start_time: time stamp for starting the chatbot
- total_duration: time taken to complete the chatbot
- intervention_duration: time taken to complete the intervention
- misinfo_quiz_duration: time taken to complete the quiz
- misinfoQuiz_end_time: time stamp for ending the Misinfo quiz
- MisinfoQuiz_start_time: time stamp for starting the Misinfo quiz
- missing_treatment_arm: 1 if the participant did not have a treatment arm, 0 otherwise
- abnormal_posts: 1 if the participant had abnormal posts, 0 otherwise
- stage_reached_num: stage reached by the participant (numeric)
- stage_reached_char: stage reached by the participant
- consent_coded_num: 1 if the participant consented to the quiz, 0 otherwise
- manipulative_reliable_order: order of manipulative and reliable questions
- mean_share_misinfo: mean share score for misinformation posts
- mean_reliable_misinfo: mean reliable score for misinformation posts
- mean_manipulative_misinfo: mean manipulative score for misinformation posts
- mean_share_nonmisinfo: mean share score for non-misinformation posts
- mean_reliable_nonmisinfo: mean reliable score for non-misinformation posts
- mean_manipulative_nonmisinfo: mean manipulative score for non-misinformation posts
- sharing_discernment: discernment in sharing misinformation and non-misinformation posts
- reliability_discernment: discernment in reliability of misinformation and non-misinformation posts
- manipulation_discernment: discernment in manipulation of misinformation and non-misinformation posts
- attention_check_passed: 1 if the participant passed the attention check, 0 otherwise, NA if did not complete the quiz
The dataset is also saved as misinfo_clean_wide.rds for future use.
The dataset contains 70355 observations corresponding to 70355 participants who consented.
Among the participants, 42807 (60.84%) completed the quiz.
Among the participants who completed the quiz, 22946 (53.6%) passed the attention check.

Inpsecting the wide dataset

In the summary statistics script, we discovered a few issues regarding missing values within the wide dataset. We expected quiz_completers to not have any missing values for the intervention_duration, quiz_duration, and attention_check_passed variable.

df_wide_check <- data_wide %>%  
  filter(quiz_completed == 1) %>% 
  select(analytic_id, intervention_duration, misinfo_quiz_duration, total_duration, attention_check_passed) %>% 
  filter(is.na(intervention_duration) | is.na(misinfo_quiz_duration) | is.na(total_duration) | is.na(attention_check_passed))

n_manual_check <- as.character(nrow(df_wide_check))

# Save to csv

write.csv(df_wide_check , "./data/chatfuel/processed/need_manual_check.csv", row.names = FALSE)

We found 21 participants with missing values for the intervention_duration, misinfo_quiz_duration, total_duration, and attention_check_passed variables. We will manually check these participants on Chatfuel. The list of participants is saved as need_manual_check.csv in the chatfuel folder.

FB-Misinformation Data Cleaning

Kiet Le & Szymon Sacher

2024-04-10