The purpose of this script is to clean the raw FB Misinformation Chatfuel data, document key decisions made, and generate two clean datasets for further analysis.
The key outputs are as follows:
misinfo_clean_wide.csv
misinfo_clean_long.csv
The raw Chatfuel dataset was last pulled on 02/23/2024 and processed
by the anonymization
script. The anonymized version of dataset contains the variable
analytic_id
as the unique identifier for each respondent.
We filtered respondents based on the following criteria:
The eligibility criteria for respondents were determined by their
participation in two iterations of Phase 1 and three iterations of Phase
2. During Phase 1
, participants interacted with one of two
versions of the chatbot, labeled as ExperimentLaunchV1
and
ExperimentLaunchV2
. In Phase 2
, participants
were exposed to one of three versions of the chatbot, named
Phase2
, Phase2_V2
, and Phase2_V3
.
Chatbot users who were deemed ineligible were marked as a missing value
under the NewVersion variable and subsequently removed from the dataset.
Additionally, respondents who took part in technical pilots were
excluded from the dataset.
ExperimentLaunchV1
corresponds to the time in
Phase 1
before we reduced incentives in South Africa and
Nigeria. ExperimentLaunchV2
corresponds to the time in
Phase 1
after we reduced incentives in South Africa and
Nigeria.
Phase2
corresponds to the time in
Phase 2
before we fixed the issue with randomization (which
was causing correct marginal distributions of arm and payment but
incorrect joint distribution). Phase2_V2
corresponds to the
time in Phase 2
after we fixed the issue with
randomization. Phase2_V3
then corresponds to the time after
we removed the $1 and $3 payment options in
Phase 2
.
We excluded participants who started before the experiment (11/16/2023) and after the end of experiment (12/31/2023)
The eligibility of respondents is based on the four countries: Nigeria, Kenya, Ghana, and South Africa. Respondents who are not from these countries were dropped from the raw dataset.
Early in Phase 1
of the experiment, before we
implemented a fix, some users were able to do the chatbot multiple
times. After the fix many users continued trying to enter the chatbot
for the second time but were turned away. We made a clear distinction
between two types of respondents: attempted_repeaters
and
actual_repeaters
. attempted_repeaters
refers
to those respondents who attempted to interact with the chatbot more
than once but were automatically prevented from doing so. On the other
hand, actual_repeaters
are those respondents who managed to
participate in more than one treatment arm before being identified and
removed from the dataset.
We ensure uniqueness of observations on several places. First, in
the anonymization
script we ensure each row corresponds to a unique
chatfuel user id
. Second, we checked for the number of
unique analytic_id
and the number of observations in the
dataset. We confirm that the number of unique analytic_id
is equal to the number of observations in the dataset, so each
observation is indeed unique.
We checked for a series of anomalies in the raw dataset and found the following:
Phase 1 participants having Phase 2 specific attributes and vice versa: We noticed some participants in Phase 1 being assigned to the long baseline group and some participants in Phase 2 being assigned to the video intervention group. We dropped these participants from the dataset.
Negative time duration: We noticed about 82 respondents (<0.10 %) having a negative time duration for the amount of time completing the treatment intervention or the amount of time completing the quiz. Upon inspection, it is likely due to a bug in Chatfuel. We dropped these completers from the dataset.
Missing values: quiz completers
are defined as those
who have a timestamp in the misinfoQuiz_end_time
.
Logically, these participants must have recorded entries for all of the
misinfo quiz attributes, including reliable_score
,
manipulative_score
, and share_score
. However,
we noticed, in some instances, that there missing values for these
variables. These are concentrated in Phase 1
and caused by
the fact that we initially did not enforce that users select from one of
the pre-defined options for the quiz questions. This affected ~1000
observations. A second, much smaller, set of missing values is due to a
bug in Chatfuel. This affects less than 100 users in total. For both
sets of users we set all of their quiz-related outcomes to
missing.
Participants with wrong or missing posts: we drop ~100
partitipants with missing post_id
in any post and two
partiticpant with post_id
’s corresponding to pilot version
of the quiz.
The raw dataset was last pulled on 02/23/2024 and stored at
~ssh://sherlock/oak/stanford/groups/athey/fb_misinfo_interventions/data/
The anonymized dataset is stored in Github directory
~fb_misinfo_interventions/data/chatfuel/raw
phase_coded
)
ExperimentLaunchV1
or
ExperimentLaunchV2
as Phase 1
Phase2
,
Phase2_V2
or Phase2_V3
as
Phase 2
total_count <- data %>%
count(NewVersion) %>%
summarise(total = sum(n)) %>%
pull(total)
data %>%
count(NewVersion) %>%
rbind(data.frame(NewVersion = "Total", n = total_count)) %>%
kable("html", col.names=(c("Phase", "Count"))) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Count |
---|---|
68462 | |
ExperimentLaunchV1 | 9904 |
ExperimentLaunchV2 | 30604 |
Final_Technical_Pilot | 181 |
Misinfo_Pilot | 287 |
Phase2 | 8057 |
Phase2_V2 | 27276 |
Phase2_V3 | 15512 |
Total | 160283 |
# generating coded phase 1 and 2 variables and dropping respondents not in Phase 1 or Phase 2
data$phase_coded <-data$NewVersion
data$phase_coded[data$phase_coded=="ExperimentLaunchV1" | data$phase_coded=="ExperimentLaunchV2"] = "Phase 1"
data$phase_coded[data$phase_coded=="Phase2" | data$phase_coded=="Phase2_V2" | data$phase_coded=="Phase2_V3"] = "Phase 2"
data = data |> mutate(
subphase = case_when(
phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) <= as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1A",
phase_coded == "Phase 1" & (as.POSIXct(entry_end_time) > as.POSIXct("2023-11-21 01:00:00 UTC") ) ~ "1B",
NewVersion == "Phase2" ~ "2A",
NewVersion == "Phase2_V2" ~ "2B",
NewVersion == "Phase2_V3" ~ "2C"
)
)
total_count <- data %>%
count(phase_coded) %>%
summarise(total = sum(n)) %>%
pull(total)
data %>%
count(phase_coded) %>%
rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
kable("html", col.names=(c("Phase", "Count"))) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Count |
---|---|
68462 | |
Final_Technical_Pilot | 181 |
Misinfo_Pilot | 287 |
Phase 1 | 40508 |
Phase 2 | 50845 |
Total | 160283 |
a= as.character(length(unique(data$`analytic_id`)))
if (nrow(data) != length(unique(data$analytic_id))) {
stop("The number of rows does not match the number of unique analytic_id values.")
} else {
# Continue
print("Unique observations check passed.")
}
## [1] "Unique observations check passed."
The count of unique analytic_id
is 160283.
The number of unique analytic_id
is equal to the
number of total observation, so there is no repetition of
analytic_id
. Each observation is indeed unique.
data <- data %>%
filter(phase_coded == "Phase 1" | phase_coded == "Phase 2")
b <- as.character(length(unique(data$`analytic_id`)))
total_count <- data %>%
count(phase_coded) %>%
summarise(total = sum(n)) %>%
pull(total)
data %>%
count(phase_coded) %>%
rbind(data.frame(phase_coded = "Total", n = total_count)) %>%
kable("html", col.names=(c("Phase", "Count"))) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Count |
---|---|
Phase 1 | 40508 |
Phase 2 | 50845 |
Total | 91353 |
before_and_after <- data %>%
select(analytic_id, MisinfoChat_start_time) %>%
mutate(start_day = as.Date(MisinfoChat_start_time)) %>%
filter(start_day < as.Date("2023-11-16") | start_day > as.Date("2024-01-01"))
before_drop_starttime <- as.character(nrow(data))
data <- data %>%
filter(!(analytic_id %in% before_and_after$analytic_id))
after_drop_starttime <- as.character(nrow(data))
#check for entries with multiple treatment arms
data_repeaters <- data %>%
select(analytic_id, phase_coded,
repeater,
arm,
InterArms_start_time, InterArmsVideo_end_time, # video start and end time
InterArmsLongBase_start_time, InterArmsLongBase_end_time, #long baseline start,end time
InterArmsBase_start_time, InterArmsBase_end_time, #short baseline start,end time
InterArmsSMS_start_time, InterArmsSMS_end_time, #sms start,end time
InterArmsGame_start_time, interArmsGame_end_time, # game start and end time
MisinfoQuiz_start_time) %>%
mutate(has_longbase_time = if_else(!is.na(InterArmsLongBase_start_time) | !is.na(InterArmsLongBase_end_time), TRUE, FALSE),
has_video_time = if_else(!is.na(InterArms_start_time) | !is.na(InterArmsVideo_end_time), TRUE, FALSE),
has_game_time = if_else(!is.na(InterArmsGame_start_time) | !is.na(interArmsGame_end_time), TRUE, FALSE),
has_shortbase_time = if_else(!is.na(InterArmsBase_start_time) | !is.na(InterArmsBase_end_time), TRUE, FALSE),
has_sms_time = if_else(!is.na(InterArmsSMS_start_time) | !is.na(InterArmsSMS_end_time), TRUE, FALSE),
repeated = has_longbase_time + has_shortbase_time + has_sms_time + has_video_time +has_game_time > 1) %>%
select(analytic_id, arm, repeater, repeated, phase_coded, MisinfoQuiz_start_time)
data_repeaters = data_repeaters |>
mutate(
actual_repeater = repeated,
attempted_repeater = case_when(
repeater == 1 & repeated == 0 ~ TRUE,
TRUE ~ FALSE
),
repeater = actual_repeater | attempted_repeater
)
# Count repeaters and attempted repeaters by phase
data_repeaters |> group_by(phase_coded) |>
summarize(
`Actual Repeaters` = sum(actual_repeater),
`Attempted Repeaters` = sum(attempted_repeater)
) |>
mutate(
`Total` = `Actual Repeaters` + `Attempted Repeaters`
) |>
kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
Phase | Actual Repeaters | Attempted Repeaters | Total |
---|---|---|---|
Phase 1 | 3429 | 9640 | 13069 |
Phase 2 | 29 | 9982 | 10011 |
Repeaters are classified as attempted repeaters
or
actual repeaters
.
We made a clear distinction between two types of respondents:
attempted_repeaters
and actual_repeaters
.
attempted_repeaters
refers to those respondents who
attempted to interact with the chatbot more than once but were
automatically prevented from doing so. On the other hand,
actual_repeaters
are those respondents who managed to
participate in more than one treatment arm before being identified and
removed from the dataset.
The actual repeaters are concentrated in Phase 1
.
When we noticed this was an issue we implemented a fix where the chatbot
would prevent users from entering the chatbot for the second
time.
We identified actual_repeaters
by checking the time
stamps of the different treatment arms. If a respondent had a time stamp
for more than one treatment arm, they were classified as
actual_repeaters
.
data <- left_join(
data,
data_repeaters %>% select(analytic_id, actual_repeater, attempted_repeater),
by = "analytic_id",
relationship = "one-to-one",
unmatched = "error"
) %>%
filter(!actual_repeater)
data |> group_by(phase_coded) |>
summarize(
`Actual Repeaters` = sum(actual_repeater),
`Attempted Repeaters` = sum(attempted_repeater)
) |>
mutate(
`Total` = `Actual Repeaters` + `Attempted Repeaters`
) |>
kable("html", col.names = c("Phase", "Actual Repeaters", "Attempted Repeaters", "Total")) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
Phase | Actual Repeaters | Attempted Repeaters | Total |
---|---|---|---|
Phase 1 | 0 | 9640 | 9640 |
Phase 2 | 0 | 9982 | 9982 |
In Phase 1, there are 4 groups: baseline
,
sms
, video
, and game
, while in
Phase 2, there are 3 groups: short baseline
,
long baseline
, and sms
.
Variable arm_coded
is the cleaned treatment arm
variable.
arm_coded
includes
Original Baseline
, SMS
, Video
,
and Game
.arm_coded
includes
Original Baseline
, Long Baseline
, and
SMS
.Checking the raw arm
variable, we have the following
entries: baseline
, game
,
longbaseline
, sms
, video
The empty cell indicates participants who have missing values in the treatment variables. These respondents dropped off before and at the treatment arm assignment.
# checking possible treatment responses
data %>%
count(arm) %>%
kable("html", col.names=(c("Treatment Arm", "Count"))) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Treatment Arm | Count |
---|---|
16998 | |
baseline | 13705 |
game | 7480 |
longbaseline | 14159 |
sms | 27841 |
video | 7539 |
data %>% group_by(phase_coded) %>%
count(arm)%>%
kable("html", col.names=(c("Phase", "Treatment Arm", "Count"))) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Treatment Arm | Count |
---|---|---|
Phase 1 | 6915 | |
Phase 1 | baseline | 7452 |
Phase 1 | game | 7480 |
Phase 1 | longbaseline | 2 |
Phase 1 | sms | 7553 |
Phase 1 | video | 7538 |
Phase 2 | 10083 | |
Phase 2 | baseline | 6253 |
Phase 2 | longbaseline | 14157 |
Phase 2 | sms | 20288 |
Phase 2 | video | 1 |
arm
==
video
and two participants in Phase 1 having
arm
== longbaseline
. We’ll drop these
participants from the dataset.data <- data %>% filter(!(arm == "video" & phase_coded == "Phase 2"))
data <- data %>% filter(!(arm == "longbaseline" & phase_coded == "Phase 1"))
c1 <- as.character(length(unique(data$`analytic_id`)))
# Recoding treatment arm variable
data <- data %>%
mutate(
arm_coded = case_when(
arm == "baseline" & phase_coded == "Phase 1" ~ "Original Baseline",
arm %in% c("SMS", "sms") ~ "SMS",
arm == "video" & phase_coded == "Phase 1" ~ "Video",
arm == "game" & phase_coded == "Phase 1" ~ "Game",
arm == "baseline" & phase_coded == "Phase 2" ~ "Original Baseline",
arm == "longbaseline" & phase_coded == "Phase 2" ~ "Long Baseline",
TRUE ~ "None"
)
)
data %>%
group_by(phase_coded, arm_coded) %>%
summarise(n = n()) |>
group_by(phase_coded) |>
mutate(Total = sum(n)) |>
pivot_wider(names_from = arm_coded, values_from = n, values_fill = 0) |>
select(
`Phase` = phase_coded,
`Original Baseline`, `Game`, `SMS`, `Video`, `Long Baseline`, `Total`) |>
kable("html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Original Baseline | Game | SMS | Video | Long Baseline | Total |
---|---|---|---|---|---|---|
Phase 1 | 7452 | 7480 | 7553 | 7538 | 0 | 36938 |
Phase 2 | 6253 | 0 | 20288 | 0 | 14157 | 50781 |
Variable user_country
indicates where the
respondents come from
The coded version of user_country
is
country_coded
Responses are messy with some random links and emojis which are treated as missing values
To proceed, the chatbot required that users select from a pre-defined list of countries or enter one of the options manually. In the later case, the capitalization of the country name was not enforced.
Participants who are not from either Kenya, South Africa, Nigeria, or Ghana are dropped from the dataset
## user_country n
## <char> <int>
## 1: 8888
## 2: ,nigeria 1
## 3: .\nSouth Africa 1
## 4: . south Africa 1
## 5: .. 1
## ---
## 216: you are stupid 1
## 217: 𝒢𝒽𝒶𝓃𝒶 1
## 218: 😊 1
## 219: 😡😡😡😡 1
## 220: 🤔 1
# rewriting responses
data = data |> mutate(
country_coded = case_when(
tolower(user_country) == "nigeria" ~ "Nigeria",
tolower(user_country) == "kenya" ~ "Kenya",
tolower(user_country) == "south africa" ~ "South Africa",
tolower(user_country) == "ghana" ~ "Ghana",
TRUE ~ "Other"
)
)
data %>%
group_by(phase_coded, country_coded) %>%
summarise(n = n()) |>
group_by(phase_coded) |>
mutate(Total = sum(n)) |>
pivot_wider(names_from = country_coded, values_from = n, values_fill = 0) |>
select(
`Phase` = phase_coded, `Ghana`, `Kenya`, `Nigeria`, `South Africa`, `Other`, `Total`) |>
kable("html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Ghana | Kenya | Nigeria | South Africa | Other | Total |
---|---|---|---|---|---|---|
Phase 1 | 4404 | 4278 | 14168 | 9691 | 4397 | 36938 |
Phase 2 | 4668 | 6117 | 20461 | 14107 | 5428 | 50781 |
Respondents who are dropped passed the consent
checkpoint but did not come from either four of our countries of
interest
ineligible_country
is a variable that indicates
whether the respondent is from a country other than Kenya, South Africa,
Nigeria, or Ghana
data = data |> mutate(
is_ineligible_country = country_coded == "Other"
)
# Calculate the number of unique participants before and after dropping ineligible countries
c2_before = as.character(length(unique(data$`analytic_id`)))
n_other = (data$is_ineligible_country) |> sum()
data = data |> filter(!is_ineligible_country)
c2 = as.character(length(unique(data$`analytic_id`)))
We dropped 9825 participants who are not from either Kenya, South Africa, Nigeria, or Ghana.
Before dropping, we had 87719 observations. After dropping we have 77894 observations.
The forced responses in the chatbot are “I consent, start now” for consenting and “No, let’s stop here” for not providing consent
Checking the consent record responses, we noticed that some responses are not in the format of “I consent, start now” (in any case, with or without comma)” and “No, let’s stop here” (in any case, with or without comma). To align to IRB as much as possible, we only considered those who answered “I consent, start now” as consented and those who answered “No, let’s stop here” or other responses as not consented
consent_coded
is the cleaned variable for consent
responses
## consent_coded
## <char>
## 1:
## 2: . When?
## 3: 0533521502
## 4: 0745432220
## 5: 09130184439
## ---
## 142: https://scontent.xx.fbcdn.net/v/t39.1997-6/851582_369239386556143_1497813874_n.png?stp=dst-png_p100x100&_nc_cat=1&ccb=1-7&_nc_sid=fc3f23&_nc_ohc=U7sp_UFFoegAX99GmqW&_nc_ad=z-m&_nc_cid=0&_nc_ht=scontent.xx&oh=00_AfCAi_Arf-558PO_y_N3FoExzrS1jQz-kRj0A5oKuGKLlA&oe=656E2711
## 143: https://www.facebook.com/permalink.php?story_fbid=pfbid0cYWUZP6pVvFz1tw25oNREeQPSYCurhdJRp4tzvXjATFC7vqoHcb7wRgBsVn3X7xSl&id=100086763992371
## 144: please send in this number\n\n079 446 7436
## 145: sure let go
## 146: yes sure why asking
## n
## <int>
## 1: 4731
## 2: 1
## 3: 1
## 4: 1
## 5: 1
## ---
## 142: 1
## 143: 1
## 144: 1
## 145: 1
## 146: 1
data = data |> mutate(
consent_coded_num = case_when(
tolower(consent_coded) %in% c(
"i consent, start now",
"i consent start now"
) ~ 1,
tolower(consent_coded) %in% c(
"no, let's stop here",
"no let's stop here "
) ~ 0,
TRUE ~ NA_real_
)
)
n_not_consent = data |> filter(is.na(consent_coded_num) | consent_coded_num == 0) |> nrow()
data %>%
group_by(phase_coded, consent_coded_num) %>%
summarise(n = n()) |>
group_by(phase_coded) |>
mutate(Total = sum(n)) |>
pivot_wider(names_from = consent_coded_num, values_from = n, values_fill = 0) |>
select(
`Phase` = phase_coded, `I consent, start now` = `1`, `No, let's stop here` = `0`, `Other/NA` = `NA`, `Total`) |>
kable("html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | I consent, start now | No, let’s stop here | Other/NA | Total |
---|---|---|---|---|
Phase 1 | 29907 | 1000 | 1634 | 32541 |
Phase 2 | 40646 | 1419 | 3288 | 45353 |
We now drop the 7341 participants who did not consent to the study. Having dropped the participants who did not explicitely consent to the study, we now have 70553 observations.
The actual payment amount for participation was only recorded for
Phase 2 participants by variable
user_payment_amount
In Version 1 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 20R in South Africa, and 800N in Nigeria. In Version 2 of Phase 1, participants were paid 12.5GHS in Ghana, 150Ksh in Kenya, 10R in South Africa, and 400N in Nigeria
The variable payment_amount_coded
is manually
created to contain the payment amount in both Phase 1 and Phase
2
# manually entered payment amount for each country in Phase 1 and Phase 2
data <- data |>
mutate(
payment_amount_coded = case_when(
subphase == "1A" & country_coded == "Ghana" ~ "₵12.5",
subphase == "1A" & country_coded == "Kenya" ~ "150Ksh",
subphase == "1A" & country_coded == "South Africa" ~ "R20",
subphase == "1A" & country_coded == "Nigeria" ~ "₦800",
subphase == "1B" & country_coded == "Ghana" ~ "₵12.5",
subphase == "1B" & country_coded == "Kenya" ~ "150Ksh",
subphase == "1B" & country_coded == "South Africa" ~ "R10",
subphase == "1B" & country_coded == "Nigeria" ~ "₦400",
TRUE ~ user_payment_amount),
payment_condition_dollar = case_when(
user_payment_condition == "high" ~ 3,
user_payment_condition == "low" ~ .5,
user_payment_condition == "medium" ~ 1,
subphase == "1A" ~ 1,
subphase == "1B" & country_coded == "Ghana" ~ 1,
subphase == "1B" & country_coded == "Kenya" ~ 1,
subphase == "1B" & country_coded == "South Africa" ~ .5,
subphase == "1B" & country_coded == "Nigeria" ~ .5,
TRUE ~ NA)
)
# assert that there are no NAs in those variables
if(any(is.na(data$payment_amount_coded)) | any(is.na(data$payment_condition_dollar))) {
stop("There are NAs in payment_amount_coded or payment_condition_dollar")
} else {
print("Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar")
}
## [1] "Asserted that there are no NAs in payment_amount_coded or payment_condition_dollar"
data |>
group_by(phase_coded, payment_condition_dollar) |>
summarise(n = n()) |>
group_by(phase_coded) |>
mutate(Total = sum(n)) |>
pivot_wider(names_from = payment_condition_dollar, values_from = n, values_fill = 0) |>
select(
`Phase` = phase_coded, `High` = `3`, `Medium` = `1`, `Low` = `0.5`, `Total`) |>
kable("html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "100%")
Phase | High | Medium | Low | Total |
---|---|---|---|---|
Phase 1 | 0 | 15849 | 14058 | 29907 |
Phase 2 | 9333 | 9771 | 21542 | 40646 |
data %>%
group_by(phase_coded, manipulative_reliable_order) %>%
summarise(n = n()) |>
group_by(phase_coded) |>
mutate(Total = sum(n)) |>
pivot_wider(names_from = manipulative_reliable_order, values_from = n, values_fill = 0) |>
select(
`Phase` = phase_coded, `Manipulative First` = `1`, `Reliable First` = `2` , `Missing` = `NA`, `Total`) |>
kable("html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>% scroll_box(width = "100%")
Phase | Manipulative First | Reliable First | Missing | Total |
---|---|---|---|---|
Phase 1 | 10378 | 10194 | 9335 | 29907 |
Phase 2 | 15453 | 15559 | 9634 | 40646 |
MisinfoChat_start_time
: time stamp for starting the
chatbot
InterArms_start_time
and
InterArmsVideo_end_time
: time stamp for starting and ending
the video intervention
InterArmsGame_start_time
and
interArmsGame_end_time
: time stamp for starting and ending
the game intervention
InterArmsSMS_start_time
and
InterArmsSMS_end_time
: time stamp for starting and ending
the SMS intervention
InterArmsBase_start_time
and
InterArmsBase_end_time
: time stamp for starting and ending
the baseline intervention
InterArmsLongBase_start_time
and
InterArmsLongBase_end_time
: time stamp for starting and
ending the long baseline intervention
MisinfoQuiz_start_time
and
misinfoQuiz_end_time
: time stamp for starting and ending
the Misinfo quiz
baseline_duration
: time duration between the start
and the end of the orginal baseline in minutes
video_duration
: time duration between the start and
the end of the video intervention in minutes
game_duration
: time duration between the start and
the end of the game intervention in minutes
sms_duration
: time duration between the start and
the end of the SMS intervention in minutes
longbase_duration
: time duration between the start
and the end of the long baseline in minutes
intervention_duration
: time duration between the
start and the end of the intervention in minutes
quiz_duration
: time duration between the start and
the end of the misinformation quiz in minutes
total_duration
: time duration between the start of
the chatbot and the end of the misinformation quiz in minutes
# calculating duration
data <- data %>%
mutate(
baseline_duration = as.numeric(difftime(InterArmsBase_end_time, InterArmsBase_start_time, units = "mins")),
video_duration = as.numeric(difftime(InterArmsVideo_end_time, InterArms_start_time, units = "mins")),
game_duration = as.numeric(difftime(interArmsGame_end_time, InterArmsGame_start_time, units = "mins")),
sms_duration = as.numeric(difftime(InterArmsSMS_end_time, InterArmsSMS_start_time, units = "mins")),
longbase_duration = as.numeric(difftime(InterArmsLongBase_end_time, InterArmsLongBase_start_time, units = "mins")),
misinfo_quiz_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoQuiz_start_time, units = "mins")),
total_duration = as.numeric(difftime(misinfoQuiz_end_time, MisinfoChat_start_time, units = "mins"))
) %>%
mutate(intervention_duration = case_when(
arm_coded == "Video" ~ video_duration,
arm_coded == "Game" ~ game_duration,
arm_coded == "SMS" ~ sms_duration,
arm_coded == "Original Baseline" ~ baseline_duration,
arm_coded == "Long Baseline" ~ longbase_duration,
TRUE ~ NA
)
)
data <- data |> mutate(
is_abnormal_time = ((intervention_duration < 0) & !is.na(intervention_duration) ) | ((misinfo_quiz_duration < 0)) & !is.na(misinfo_quiz_duration) | ((total_duration < 0)) & !is.na(total_duration))
# removing missing values from duration (for those who did not finish the intervention)
is_abnormal_time = as.character(sum(data$is_abnormal_time, na.rm = TRUE))
Here we create flags for each stage of the chatbot and quiz to track the progress of the participants. We then create a dropoff variable to track the stage at which participants dropped off.
started_coded_num
: 1 if the participant started the
chatbot, 0 otherwise
consent_coded_num
: 1 if the participant consented to
the quiz, 0 otherwise
started_arm_coded_num
: 1 if the participant started
any of the interventions, 0 otherwise
completed_arm_coded_num
: 1 if the participant
completed any of the interventions, 0 otherwise
quiz_started_coded_num
: 1 if the participant started
the quiz, 0 otherwise
quiz_consent_coded_num
: 1 if the participant
consented to the quiz, 0 otherwise
quiz_completed_coded_num
: 1 if the participant
completed the quiz, 0 otherwise
completed_all_coded_num
: 1 if the participant
completed all, 0 otherwise
data <- data %>%
mutate(started_coded_num = ifelse(!is.na(MisinfoChat_start_time), 1, 0),
started_arm_coded_num = ifelse(!is.na(InterArms_start_time) | !is.na(InterArmsSMS_start_time) | !is.na(InterArmsGame_start_time) | !is.na(InterArmsBase_start_time) | !is.na(InterArmsLongBase_start_time), 1, 0),
completed_arm_coded_num = ifelse(!is.na(intervention_duration), 1, 0),
quiz_started_coded_num = ifelse(!is.na(MisinfoQuiz_start_time), 1, 0),
quiz_consent_coded_num = case_when(
quiz_consent == "I acknowledge" ~ 1,
TRUE ~ 0),
quiz_completed_coded_num = ifelse(!is.na(misinfoQuiz_end_time), 1, 0),
completed_all_coded_num = ifelse(!is.na(completed), 1, 0))
stage
provides the location where participants dropped
off from the chatbotdata <- data |>
rowwise() |>
mutate(
stage_reached_num = sum(
started_coded_num,
started_arm_coded_num,
completed_arm_coded_num,
quiz_consent_coded_num,
quiz_started_coded_num,
quiz_completed_coded_num,
completed_all_coded_num),
stage_reached_char = case_when(
stage_reached_num == 7 ~ "7-Completed all",
stage_reached_num == 6 ~ "6-Completed quiz",
stage_reached_num == 5 ~ "5-Consented to quiz",
stage_reached_num == 4 ~ "4-Reached quiz section",
stage_reached_num == 3 ~ "3-Completed intervention",
stage_reached_num == 2 ~ "2-Started intervention",
stage_reached_num == 1 ~ "1-Started chatbot",
TRUE ~ "Did not start"
))
data |> group_by(phase_coded, stage_reached_num, stage_reached_char) |>
summarize(N = n()) |>
arrange(phase_coded, desc(stage_reached_num)) |>
group_by(phase_coded) |>
mutate(Total = cumsum(N)) |>
arrange(phase_coded, stage_reached_num) |>
mutate(
`Reached here (% Total)` = round(Total / Total[1] * 100,2),
`Reached here (% Previous)` = round(Total / lag(Total) * 100,2),
`Finished here (% Total)` = round(N / Total[1] * 100,2),
`Finished here (% Previous)` = round(N / lag(Total) * 100,2)
) |>
select(
`Phase` = phase_coded,
`Stage Reached` = stage_reached_char,
`Reached here (n)` = Total,
`Reached here (% Total)`,
`Reached here (% Previous)`,
`Finished here` = N,
`Finished here (% Total)`,
`Finished here (% Previous)`
) |>
kable("html") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
Phase | Stage Reached | Reached here (n) | Reached here (% Total) | Reached here (% Previous) | Finished here | Finished here (% Total) | Finished here (% Previous) |
---|---|---|---|---|---|---|---|
Phase 1 | 2-Started intervention | 29727 | 100.00 | NA | 9029 | 30.37 | NA |
Phase 1 | 3-Completed intervention | 20698 | 69.63 | 69.63 | 2 | 0.01 | 0.01 |
Phase 1 | 4-Reached quiz section | 20696 | 69.62 | 99.99 | 306 | 1.03 | 1.48 |
Phase 1 | 5-Consented to quiz | 20390 | 68.59 | 98.52 | 3569 | 12.01 | 17.24 |
Phase 1 | 6-Completed quiz | 16821 | 56.58 | 82.50 | 289 | 0.97 | 1.42 |
Phase 1 | 7-Completed all | 16532 | 55.61 | 98.28 | 16532 | 55.61 | 98.28 |
Phase 2 | 1-Started chatbot | 40628 | 100.00 | NA | 365 | 0.90 | NA |
Phase 2 | 2-Started intervention | 40263 | 99.10 | 99.10 | 8880 | 21.86 | 21.86 |
Phase 2 | 4-Reached quiz section | 31383 | 77.24 | 77.95 | 390 | 0.96 | 0.97 |
Phase 2 | 5-Consented to quiz | 30993 | 76.28 | 98.76 | 3837 | 9.44 | 12.23 |
Phase 2 | 6-Completed quiz | 27156 | 66.84 | 87.62 | 317 | 0.78 | 1.02 |
Phase 2 | 7-Completed all | 26839 | 66.06 | 98.83 | 26839 | 66.06 | 98.83 |
abnormal_quiz_completers <- data %>%
filter(quiz_completed_coded_num == 1) %>%
select(arm_coded, analytic_id, phase_coded, quiz_completed_coded_num, intervention_duration) %>%
filter(is.na(arm_coded))
is_abnormal_quiz_completers <- as.character(nrow(abnormal_quiz_completers))
data <- data %>%
filter(!analytic_id %in% abnormal_quiz_completers$analytic_id)
d <- as.character(length(unique(data$`analytic_id`)))
Noticed 0 participants who completed the quiz but did not have an recorded entry for treatment assignment. They did not receive any intervention, so we dropped them from the our dataset.
Before dropping, we had 70355 number of observations. After droppping, we have 70355 number of observations
# Reshape post ids
data_long <- data %>%
select(starts_with("post_id"), analytic_id, phase_coded, arm_coded, attention_check_passed, quiz_completed_coded_num) %>%
select(-ends_with("_followup")) %>%
pivot_longer(
cols=starts_with("post_id"),
names_to="post_number",
values_to="post_id",
names_prefix="post_id_"
)
# Reshape manipulative scores
manipulative_score <- data %>%
select(starts_with("post_id"), contains(c("_manipulative")), analytic_id) %>%
select(-ends_with(c("_followup", "_reliable_order"))) %>%
pivot_longer(
cols = contains("_manipulative"),
names_to = "post_number",
values_to = "manipulative_score",
names_pattern = "post_(.*)_manipulative"
) %>%
select(post_number, analytic_id, manipulative_score)
# Reshape reliable scores
reliable_score <- data %>%
select(starts_with("post_id"), contains(c("_reliable")), analytic_id) %>%
select(-ends_with(c("_followup", "_reliable_order"))) %>%
pivot_longer(
cols = contains("_reliable"),
names_to = "post_number",
values_to = "reliable_score",
names_pattern = "post_(.*)_reliable"
) %>%
select(post_number, analytic_id, reliable_score)
# Reshape share scores
share_score <- data %>%
select(starts_with("post_id"), ends_with("_share"), analytic_id) %>%
pivot_longer(
cols = contains("_share"),
names_to = "post_number",
values_to = "share_score",
names_pattern = "post_(.*)_share"
) %>%
select(post_number, analytic_id, share_score)
# Merging reshaped variables
data_long = data_long |>
left_join(manipulative_score,
by = c("analytic_id", "post_number"),
relationship = "one-to-one",
unmatched = "error"
) |>
left_join(reliable_score,
by = c("analytic_id", "post_number"),
relationship = "one-to-one",
unmatched = "error"
) |>
left_join(share_score,
by = c("analytic_id", "post_number"),
relationship = "one-to-one",
unmatched = "error"
)
# converting to numeric responses
# for manipulative and reliable scores: 1 = not at all, 5 = very
# for share score: 1 = yes, 0 = no
data_long = data_long |>
mutate(
manipulative_score = parse_number(manipulative_score),
reliable_score = parse_number(reliable_score),
share_score = case_when(
tolower(share_score) == "yes" ~ 1,
tolower(share_score) == "no" ~ 0,
TRUE ~ NA_real_
)
)
abnormal_manipulativescore
or
abnormal_reliablescore
responses for manipulative and
reliable scores are not in {1,2,3,4,5}.data_long <- data_long |>
mutate(
abnormal_manipulativescore = ifelse(!(manipulative_score %in% 1:5) & !is.na(manipulative_score), 1, 0),
abnormal_reliablescore = ifelse(!(reliable_score %in% 1:5) & !is.na(reliable_score), 1, 0),
manipulative_score = ifelse(abnormal_manipulativescore == 1, NA_real_, manipulative_score) |> as.integer(),
reliable_score = ifelse(abnormal_reliablescore == 1, NA_real_, reliable_score) |> as.integer()
)
abnormal_scores = data_long |>
filter(abnormal_manipulativescore == 1 | abnormal_reliablescore == 1) |>
distinct(analytic_id)
data_long = data_long |> filter(!analytic_id %in% abnormal_scores$analytic_id)
Here we are checking attention check pass across two phases. - In
Phase 1, the attention check was confusingly (arguably, incorrectly)
worded and required users to respond in a specific way to three
questions in a row based on a single prompt. This resulted in a very low
pass rate. - In Phase 2 we used a simpler attention check that simply
required users to select one option (either 2
or
4
, randomized) from a set of {1
,
2
, 3
, 4
, 5
}.
# attention check pass variable
# Attention check in Phase 2 was built in the Chatbot and was a single question with 5 options.
data_attention_2 = data |> filter(
phase_coded == "Phase 2") |>
mutate(
attention_check_passed = case_when(
attention_check_passed == "TRUE" ~ 1,
attention_check_passed == "FALSE" ~ 0,
TRUE ~ NA_real_
)
) |>
select(analytic_id, attention_check_passed)
# Attention check in Phase 1 required users to respond in a specific way to three questions in a row based on a single prompt. Passing was not automatically recorded in the data, so we manually coded it.
data_attantion_1 = data_long |>
filter(phase_coded == "Phase 1") |>
filter(post_id %in% c("attention_check_1", "attention_check_2")) |>
mutate(
attention_check_passed = case_when(
post_id == "attention_check_1" & manipulative_score == 4 & reliable_score == 4 & share_score == 1 ~ 1,
post_id == "attention_check_2" & manipulative_score == 2 & reliable_score == 2 & share_score == 0 ~ 1,
post_id == "attention_check_1" & !(manipulative_score == 4 & reliable_score == 4 & share_score == 1) ~ 0,
post_id == "attention_check_2" & !(manipulative_score == 2 & reliable_score == 2 & share_score == 0) ~ 0,
TRUE ~ NA_real_
)
) |>
select(analytic_id, attention_check_passed)
data_attention <- bind_rows(data_attention_2, data_attantion_1)
data_attention |> left_join(
data |> select(analytic_id, phase_coded),
by = "analytic_id",
relationship = "one-to-one"
) |>
group_by(phase_coded) |>
summarize(
`Attention Check Passed` = sum(attention_check_passed, na.rm = TRUE),
`Attention Check Non-missing` = sum(!is.na(attention_check_passed)),
`Pass Rate (%)` = round(`Attention Check Passed` / `Attention Check Non-missing` * 100, 2)
) |>
rename(`Phase` = phase_coded) |>
kable("html") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
Phase | Attention Check Passed | Attention Check Non-missing | Pass Rate (%) |
---|---|---|---|
Phase 1 | 493 | 18363 | 2.68 |
Phase 2 | 23742 | 29073 | 81.66 |
We are cautious of missing values in the quiz among
quiz completers
. We will now evaluate the issue of
missingness
We will start by dropping the attention check posts from the long dataset as they were coded differently in Phase 2 (and we analyzed them above); we will also drop those that did not complete the quiz (158040 out of 421692 user-posts`).
# Creating post_type for each post
data_long <- data_long |>
mutate(post_type = case_when(
str_detect(post_id, "non_misinfo") ~ "non_misinfo",
str_detect(post_id, "discrediting") ~ "discrediting",
str_detect(post_id, "false") ~ "false_dichotomy",
str_detect(post_id, "attention_check") ~ "attention_check",
str_detect(post_id, "emotional") ~ "emotional",
TRUE ~ NA
))
data_long = data_long |> filter(!post_type == "attention_check") |> filter(quiz_completed_coded_num == 1)
data_long |> group_by(post_id) |> summarize(n = n()) |> arrange(desc(n)) |> kable("html") |> kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(width = "100%")
post_id | n |
---|---|
emotional_28 | 22051 |
non_misinfo_10 | 22039 |
false_dichotomy_21 | 21964 |
non_misinfo_6 | 21944 |
discrediting_17 | 21938 |
discrediting_14 | 21902 |
false_dichotomy_23 | 21876 |
non_misinfo_8 | 21876 |
non_misinfo_2 | 21821 |
emotional_27 | 21789 |
valid_posts = c(
"emotional_28", "non_misinfo_10", "false_dichotomy_21",
"non_misinfo_6", "discrediting_17", "discrediting_14",
"non_misinfo_8", "false_dichotomy_23", "non_misinfo_2", "emotional_27"
)
abnormal_posts = data_long |>
filter(is.na(post_id) | !(post_id %in% valid_posts)) |>
distinct(analytic_id) |>
mutate(abnormal_posts = 1)
data_long = data_long |> filter(!analytic_id %in% abnormal_posts$analytic_id)
data = data |> left_join(
abnormal_posts,
by = "analytic_id"
) |>
mutate(abnormal_posts = ifelse(is.na(abnormal_posts), 0, abnormal_posts)) |>
filter(abnormal_posts == 0)
The valid post ids, which correspond to stimuli used in the survey are: emotional_28, non_misinfo_10, false_dichotomy_21, non_misinfo_6, discrediting_17, discrediting_14, non_misinfo_8, false_dichotomy_23, non_misinfo_2, emotional_27. We shall drop all users who have at least one missing value in the post_id variable. We shall also drop all users who have at least one invalid post_id value. A total of 0 users are dropped.
missing_treatment_arm <- data_long |>
filter(arm_coded == "None") |>
select(analytic_id) |>
distinct() |>
mutate(missing_treatment_arm = 1)
data_long = data_long |> filter(!analytic_id %in% missing_treatment_arm$analytic_id)
data = data |> left_join(
missing_treatment_arm,
by = "analytic_id"
) |>
mutate(missing_treatment_arm = ifelse(is.na(missing_treatment_arm), 0, missing_treatment_arm)) |>
filter(missing_treatment_arm == 0)
Finally, we shall drop all users who have missing treatment arm values, a total of 0 users.
Note, in Phase 1 we initially did not enforce that the users select from one of the given options. This results in the relatively higher number of missing values in Phase 1.
data_long |>
group_by(phase_coded, analytic_id) |>
summarize(
`User has any missing values` = any(is.na(share_score)) | any(is.na(reliable_score)) | any(is.na(manipulative_score)),
`User has all missing values` = all(is.na(share_score)) & all(is.na(reliable_score)) & all(is.na(manipulative_score)),
`User has no missing values` = !any(is.na(share_score)) & !any(is.na(reliable_score)) & !any(is.na(manipulative_score))
) |>
group_by(phase_coded) |>
summarize(
`Total Users` = n(),
`Users with any missing values` = sum(`User has any missing values`),
`Users with all missing values` = sum(`User has all missing values`),
`Users with no missing values` = sum(`User has no missing values`)
) |>
kable("html", caption = "Users with missing values, per phase") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
phase_coded | Total Users | Users with any missing values | Users with all missing values | Users with no missing values |
---|---|---|---|---|
Phase 1 | 16680 | 987 | 0 | 15693 |
Phase 2 | 27160 | 46 | 0 | 27114 |
data_long |>
group_by(phase_coded) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per phase") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
phase_coded | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|
Phase 1 | 83400 | 463 | 558 | 510 | 34 | 82107 |
Phase 2 | 135800 | 35 | 32 | 37 | 22 | 135745 |
We now look into whether the missing values are concentrated in a particular post order.
data_long |>
group_by(post_number) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per post order") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
post_number | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|
1 | 36466 | 88 | 108 | 93 | 5 | 36215 |
2 | 36568 | 69 | 84 | 83 | 8 | 36374 |
3 | 36555 | 76 | 87 | 83 | 8 | 36349 |
4 | 36461 | 63 | 86 | 86 | 9 | 36269 |
5 | 36495 | 87 | 99 | 100 | 14 | 36270 |
6 | 36655 | 115 | 126 | 102 | 12 | 36375 |
data_long |>
group_by(post_id) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per post id") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
post_id | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|
discrediting_14 | 21902 | 46 | 62 | 56 | 6 | 21773 |
discrediting_17 | 21938 | 57 | 53 | 53 | 9 | 21806 |
emotional_27 | 21789 | 52 | 65 | 54 | 7 | 21653 |
emotional_28 | 22051 | 49 | 65 | 67 | 6 | 21905 |
false_dichotomy_21 | 21964 | 44 | 62 | 58 | 6 | 21828 |
false_dichotomy_23 | 21876 | 51 | 61 | 67 | 7 | 21729 |
non_misinfo_10 | 22039 | 36 | 56 | 53 | 5 | 21922 |
non_misinfo_2 | 21821 | 42 | 45 | 44 | 0 | 21698 |
non_misinfo_6 | 21944 | 69 | 56 | 45 | 3 | 21799 |
non_misinfo_8 | 21876 | 52 | 65 | 50 | 7 | 21739 |
data_long |>
group_by(phase_coded, arm_coded) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per phase and treatment arm") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
phase_coded | arm_coded | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|---|
Phase 1 | Game | 17265 | 99 | 132 | 125 | 5 | 16972 |
Phase 1 | Original Baseline | 24225 | 128 | 153 | 159 | 11 | 23864 |
Phase 1 | SMS | 21415 | 115 | 150 | 121 | 8 | 21078 |
Phase 1 | Video | 20495 | 121 | 123 | 105 | 10 | 20193 |
Phase 2 | Long Baseline | 45865 | 8 | 7 | 10 | 6 | 45853 |
Phase 2 | Original Baseline | 21355 | 5 | 5 | 6 | 4 | 21347 |
Phase 2 | SMS | 68580 | 22 | 20 | 21 | 12 | 68545 |
data_long |>
group_by(attention_check_passed) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per attention check") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
attention_check_passed | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|
FALSE | 23080 | 4 | 5 | 5 | 2 | 23072 |
TRUE | 112705 | 28 | 25 | 30 | 18 | 112661 |
NA | 83415 | 466 | 560 | 512 | 36 | 82119 |
Note that we will use a restrictive sample definition, where we only consider the users with no missing values. As a result we lose 1033 users out of 43840 users who completed the quiz. Vast majority of these users are from Phase 1. Concretely, for any user with one or more missing values in the quiz questions, we drop all their answers. These users will continue to be tracked in the dataset, but we will not use their answers in the survey analysis. In other words, for users who have at least one missing value in the main survey quiz questions, we will treat them as not completing the main survey quiz, but not dropping them.
mean_share_misinfo
: mean (across posts) intention to
share misinformation posts for a given user; range [0, 1]mean_reliable_misinfo
: mean (across posts) reliable
score for misinformation posts for a given user; range [1, 5]mean_manipulative_misinfo
: mean (across posts)
manipulative score for misinformation posts for a given user; range [1,
5]mean_share_nonmisinfo
: mean (across posts) intention to
share non-misinformation post for a given user; range [0, 1]mean_reliable_nonmisinfo
: mean (across posts) reliable
score for non-misinformation posts for a given user; range [1, 5]mean_manipulative_nonmisinfo
: mean (across posts)
manipulative score for non-misinformation posts for a given user; range
[1, 5]sharing_discernment
: difference in mean share score
between non-misinformation and misinformation posts for a given user;
range [-1, 1]reliability_discernment
: difference in mean reliable
score between non-misinformation and misinformation posts for a given
user; range [-4, 4]manipulation_discernment
: difference in mean
manipulative score between misinformation and non-misinformation posts
for a given user; range [-4, 4]Note that the discernment scores are constructed in such a way that a positive value indicates better discernment. For example, positive value for reliability discernment indicates that user scored non-misinformation posts as more reliable than misinformation posts. On the other hand, positive monipulative discernment indicates that user scored non-misinformation posts as less manipulative than misinformation posts.
# Find users with missing answers
missing_scores = data_long |>
group_by(analytic_id) |>
summarize(
any_missing_share = any(is.na(share_score)),
any_missing_reliable = any(is.na(reliable_score)),
any_missing_manipulative = any(is.na(manipulative_score)),
any_missing = any(is.na(share_score) | is.na(reliable_score) | is.na(manipulative_score))
) |>
filter(any_missing)
# Drop answers of users with missing scores
data_long = data_long |> filter(!(analytic_id %in% missing_scores$analytic_id))
# Compute outcomes
outcomes_misinfo = data_long |>
filter(post_type %in% c("discrediting", "false_dichotomy", "emotional")) |>
group_by(analytic_id) |>
summarize(
mean_share_misinfo = mean(share_score),
mean_reliable_misinfo = mean(reliable_score),
mean_manipulative_misinfo = mean(manipulative_score)
)
outcomes_nonmisinfo = data_long |>
filter(post_type == "non_misinfo") |>
group_by(analytic_id) |>
summarize(
mean_share_nonmisinfo = mean(share_score),
mean_reliable_nonmisinfo = mean(reliable_score),
mean_manipulative_nonmisinfo = mean(manipulative_score),
)
outcomes = outcomes_misinfo |>
left_join(outcomes_nonmisinfo,
by = "analytic_id",
relationship = "one-to-one",
unmatched = "error"
)
outcomes = outcomes |>
mutate(
sharing_discernment = mean_share_nonmisinfo - mean_share_misinfo,
reliability_discernment = mean_reliable_nonmisinfo - mean_reliable_misinfo,
manipulation_discernment = mean_manipulative_misinfo - mean_manipulative_nonmisinfo
)
# Check that there are no missing values in the outcomes
if (any(is.na(outcomes))) {
stop("There are missing values in the outcomes")
}
For the purpose of our analysis, we dropped participants with at
least one missing value in share_score
,
manipulative_score
, and
reliability_scores
.
We dropped 1033 participants who had at least one missing value in the scores. After dropping, we have 214035 user-post observations.
data_long |>
group_by(attention_check_passed) |>
summarize(
`Total` = n(),
`Missing Share Score` = sum(is.na(share_score)),
`Missing Reliable Score` = sum(is.na(reliable_score)),
`Missing Manipulative Score` = sum(is.na(manipulative_score)),
`Missing All Scores` = sum(is.na(share_score) & is.na(reliable_score) & is.na(manipulative_score)),
`All Scores Non-missing` = sum(!is.na(share_score) & !is.na(reliable_score) & !is.na(manipulative_score)),
) |>
kable("html", caption = "Missing values per attention check") |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) |>
scroll_box(width = "100%")
attention_check_passed | Total | Missing Share Score | Missing Reliable Score | Missing Manipulative Score | Missing All Scores | All Scores Non-missing |
---|---|---|---|---|---|---|
FALSE | 23045 | 0 | 0 | 0 | 0 | 23045 |
TRUE | 112515 | 0 | 0 | 0 | 0 | 112515 |
NA | 78475 | 0 | 0 | 0 | 0 | 78475 |
data_long_final <- data_long %>%
select(analytic_id, post_id, post_number, post_type, share_score, reliable_score, manipulative_score, phase_coded, quiz_completed_coded_num, arm_coded)
# labeling
library(Hmisc)
label(data_long_final$analytic_id) <- "Analytic ID"
label(data_long_final$post_id) <- "Post ID"
label(data_long_final$post_number) <- "Post Order Number"
label(data_long_final$post_type) <- "Post Type"
label(data_long_final$share_score) <- "Share Score"
label(data_long_final$reliable_score) <- "Reliable Score"
label(data_long_final$manipulative_score) <- "Manipulative Score"
label(data_long_final$phase_coded) <- "Phase"
label(data_long_final$quiz_completed_coded_num) <- "Quiz Completion"
label(data_long_final$arm_coded) <- "Treatment arm"
var_names <- names(data_long_final)
# Extracting labels
var_labels <- sapply(data_long_final, label)
# Combine into a data frame
label_info_long <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)
write.csv(label_info_long, "./data/chatfuel/processed/long_data_variable.csv", row.names = FALSE)
fwrite(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.csv.gz")
saveRDS(data_long_final, "./data/chatfuel/processed/misinfo_clean_long.rds")
Long dataset is exported as
misinfo_clean_long.csv
The dataset contains the following variables:
analytic_id
: unique identifier for each
participantpost_id
: unique identifier for each postpost_number
: order of the postpost_type
: type of the postshare_score
: score for sharing discernmentreliable_score
: score for reliability discernmentmanipulative_score
: score for manipulation
discernmentphase_coded
: phase of the studyquiz_completed_coded_num
: 1 if the participant
completed the quiz, 0 otherwisearm_coded
: treatment armThe dataset is also saved as misinfo_clean_long.rds
for future use.
The dataset contains 214035 user-post observations corresponding to 42807 participants.
data_wide = data |>
select(
analytic_id, phase_coded, country_coded, payment_amount_coded, payment_condition_dollar, arm_coded,
MisinfoChat_start_time, total_duration, intervention_duration,
misinfo_quiz_duration, misinfoQuiz_end_time, MisinfoQuiz_start_time,
missing_treatment_arm, abnormal_posts, stage_reached_num,
stage_reached_char, consent_coded_num, manipulative_reliable_order,
subphase, strata
) |> left_join(
outcomes,
by = "analytic_id",
relationship = "one-to-one"
) |> left_join(
data_attention,
by = "analytic_id",
relationship = "one-to-one"
)
data_wide = data_wide |>
mutate(
quiz_completed = !any(is.na(manipulative_reliable_order), is.na(mean_share_misinfo), is.na(mean_reliable_misinfo), is.na(mean_manipulative_misinfo), is.na(mean_share_nonmisinfo), is.na(mean_reliable_nonmisinfo), is.na(mean_manipulative_nonmisinfo))
)
# Recode attention check to NA for those who did not complete the quiz
data_wide = data_wide |>
mutate(
attention_check_passed = ifelse(quiz_completed, attention_check_passed, NA_real_)
)
# Assert number of those who completed the quiz corresponds to the number of users in long dataset
if (sum(data_wide$quiz_completed) != nrow(data_long_final |> distinct(analytic_id))) {
stop("Number of users in wide dataset does not correspond to the number of users in long dataset")
}
# Rename to avoid confusion
data_wide$quiz_completed_coded <- data_wide$quiz_completed
#labeling
label(data_wide$analytic_id) <- "Analytic ID"
label(data_wide$phase_coded) <- "Phase"
label(data_wide$country_coded) <- "Country"
label(data_wide$stage_reached_num) <- "Stage Reached (Numeric, 1-7; 7 = Completed initial chatbot)"
label(data_wide$stage_reached_char) <- "Stage Reached"
label(data_wide$payment_amount_coded) <- "Payment Amount"
label(data_wide$arm_coded) <- "Treatment Arm"
label(data_wide$intervention_duration) <- "time duration between the start and the end of the intervention in minutes"
label(data_wide$total_duration) <- "time duration between the start of the chatbot and the end of the misinformation quiz in minutes"
label(data_wide$misinfo_quiz_duration) <- "time duration between the start and the end of the misinformation quiz in minutes"
label(data_wide$misinfoQuiz_end_time) <- "Misinfo Quiz End Time"
label(data_wide$MisinfoQuiz_start_time) <- "Misinfo Quiz Start Time"
label(data_wide$consent_coded_num) <- "Consent: 1 for I consent, start now, 0 for No let's stop here"
label(data_wide$attention_check_passed) <- "Attention Check Passed; 1 = PASS; 0 = FAIL"
label(data_wide$manipulative_reliable_order) <- "Order of Manipulative and Reliable Questions"
label(data_wide$mean_share_misinfo) <- "Mean Share Score for Misinformation Posts"
label(data_wide$mean_reliable_misinfo) <- "Mean Reliable Score for Misinformation Posts"
label(data_wide$mean_manipulative_misinfo) <- "Mean Manipulative Score for Misinformation Posts"
label(data_wide$mean_share_nonmisinfo) <- "Mean Share Score for Non-Misinformation Posts"
label(data_wide$mean_reliable_nonmisinfo) <- "Mean Reliable Score for Non-Misinformation Posts"
label(data_wide$mean_manipulative_nonmisinfo) <- "Mean Manipulative Score for Non-Misinformation Posts"
label(data_wide$sharing_discernment) <- "Discernment in Sharing Misinformation and Non-Misinformation Posts"
label(data_wide$reliability_discernment) <- "Discernment in Reliability of Misinformation and Non-Misinformation Posts"
label(data_wide$manipulation_discernment) <- "Discernment in Manipulation of Misinformation and Non-Misinformation Posts"
label(data_wide$MisinfoChat_start_time) <- "Misinformation Chatbot Start Time"
label(data_wide$subphase) <- "Sub-Phase (for Stratification)"
label(data_wide$strata) <- "Subphase-Country-Payment Strata"
label(data_wide$quiz_completed_coded) <- "Quiz Completion: 1 = Completed, 0 = Did not complete"
var_names <- names(data_wide)
# Extracting labels
var_labels <- sapply(data_wide, label)
# Combine into a data frame
label_info_wide <- data.frame(Variable = var_names, Label = var_labels, stringsAsFactors = FALSE)
write.csv(label_info_wide, "./data/chatfuel/processed/wide_data_variable.csv", row.names = FALSE)
# Save data to csv.gz
fwrite(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.csv.gz")
# Save data to .rds
saveRDS(data_wide, "./data/chatfuel/processed/misinfo_clean_wide.rds")
#### Save to stata format
library(haven)
# shorten variable names to first 30 characters to fit stata limit
data_wide_stata = data_wide |>
rename_all(~str_replace_all(., "followup", "fllwp")) |>
rename_all(~str_replace_all(., "reflection", "refl")) |>
rename_all(~str_replace_all(., "manipulative", "mani")) |>
rename_all(~str_replace_all(., "reliable", "reli")) |>
rename_all(~str_replace_all(., "reflection", "refl")) |>
rename_all(~str_replace_all(., "discernment", "disc")) |>
rename_all(~str_replace_all(., "misinfo", "mis")) |>
rename_all(~str_replace_all(., "nonmisinfo", "nonmis")) |>
rename_all(~str_replace_all(., "share", "shre")) |>
rename_all(~str_replace_all(., "abnormal", "abn")) |>
rename_all(~str_replace_all(., "completed", "compl")) |>
rename_all(~str_replace_all(., "inter", "int"))
write_dta(data_wide_stata, "./data/chatfuel/processed/misinfo_clean_wide.dta")
Wide dataset is exported as
misinfo_clean_wide.csv
The dataset contains the following variables:
analytic_id
: unique identifier for each
participantphase_coded
: phase of the studycountry_coded
: country of the participantpayment_amount_coded
: payment amountarm_coded
: treatment armMisinfoChat_start_time
: time stamp for starting the
chatbottotal_duration
: time taken to complete the chatbotintervention_duration
: time taken to complete the
interventionmisinfo_quiz_duration
: time taken to complete the
quizmisinfoQuiz_end_time
: time stamp for ending the Misinfo
quizMisinfoQuiz_start_time
: time stamp for starting the
Misinfo quizmissing_treatment_arm
: 1 if the participant did not
have a treatment arm, 0 otherwiseabnormal_posts
: 1 if the participant had abnormal
posts, 0 otherwisestage_reached_num
: stage reached by the participant
(numeric)stage_reached_char
: stage reached by the
participantconsent_coded_num
: 1 if the participant consented to
the quiz, 0 otherwisemanipulative_reliable_order
: order of manipulative and
reliable questionsmean_share_misinfo
: mean share score for misinformation
postsmean_reliable_misinfo
: mean reliable score for
misinformation postsmean_manipulative_misinfo
: mean manipulative score for
misinformation postsmean_share_nonmisinfo
: mean share score for
non-misinformation postsmean_reliable_nonmisinfo
: mean reliable score for
non-misinformation postsmean_manipulative_nonmisinfo
: mean manipulative score
for non-misinformation postssharing_discernment
: discernment in sharing
misinformation and non-misinformation postsreliability_discernment
: discernment in reliability of
misinformation and non-misinformation postsmanipulation_discernment
: discernment in manipulation
of misinformation and non-misinformation postsattention_check_passed
: 1 if the participant passed the
attention check, 0 otherwise, NA if did not complete the quizThe dataset is also saved as misinfo_clean_wide.rds
for future use.
The dataset contains 70355 observations corresponding to 70355 participants who consented.
Among the participants, 42807 (60.84%) completed the quiz.
Among the participants who completed the quiz, 22946 (53.6%) passed the attention check.
quiz_completers
to not have any missing values for the
intervention_duration
, quiz_duration
, and
attention_check_passed
variable.df_wide_check <- data_wide %>%
filter(quiz_completed == 1) %>%
select(analytic_id, intervention_duration, misinfo_quiz_duration, total_duration, attention_check_passed) %>%
filter(is.na(intervention_duration) | is.na(misinfo_quiz_duration) | is.na(total_duration) | is.na(attention_check_passed))
n_manual_check <- as.character(nrow(df_wide_check))
# Save to csv
write.csv(df_wide_check , "./data/chatfuel/processed/need_manual_check.csv", row.names = FALSE)
intervention_duration
, misinfo_quiz_duration
,
total_duration
, and attention_check_passed
variables. We will manually check these participants on Chatfuel. The
list of participants is saved as need_manual_check.csv
in
the chatfuel
folder.