We’ll begin by defining a few helper functions which will be used throughout the script to create tables and summaries of variables.
This first function creates a table of the five most common values of a variable and includes a row for the sum of all other responses (other).
most_common_table <- function(variable, title){
table <- table(variable, useNA = 'ifany')
table <- sort(table, decreasing = TRUE)
k = 5
top_k <- head(table, 5)
other <- sum(table[-(1:k)])
freq_df <- data.frame(Response = names(top_k), Frequency = as.vector(top_k))
other_df <- data.frame(Response = "Other", Frequency = other)
result <- rbind(freq_df, other_df)
kable(result, caption = title, align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
}This function creates a clean table to look at the frequencies of different values of a variable. Unlike the function above, it will list all values.
We will create binary flags for passing the greeting, pre-consent, and consent variables.
First we will create a binary flag indicating if someone has passed the greeting. We’ll look at the most common values of the greeting variable. Participants were told “Hi! I’m Sam, a friendly chatbot, and I want to help you find your charity soulmate. With just 6 questions, I’ll find a highly-rated charity that inspires you!”. They were presented the option “Let’s do this!”.
| Response | Frequency |
|---|---|
| Let’s do this! | 30772 |
| Yes! | 8482 |
| NA | 4901 |
| No | 200 |
| Stop | 166 |
| Other | 3144 |
Currently, Chatfuel lists “Let’s do this!” as the only given response option to greeting, but we see that “Yes!” is also a common response. It could be that previously, chatfuel also (or only) listed “Yes!” as an option. To test this hypothesis, we will check the time periods in which “Let’s do this!” and “Yes!” reponses were recorded.
data$charitable_intro_start <- as.Date(data$charitable_intro_start)
yes_responses <- subset(data, greeting == "Yes!" & !is.na(charitable_intro_start))
lets_do_responses <- subset(data, greeting == "Let's do this!" & !is.na(charitable_intro_start))
yes_responses$charitable_intro_start <- as.POSIXct(yes_responses$charitable_intro_start)
time_range <- difftime(max(yes_responses$charitable_intro_start), min(yes_responses$charitable_intro_start))
binwidth <- 60 * 60 * 24
ggplot(yes_responses, aes(x = charitable_intro_start)) +
geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
labs(title = "Frequency of \'Yes!\'' Responses Over Time",
x = "Time",
y = "Frequency")lets_do_responses$charitable_intro_start <- as.POSIXct(lets_do_responses$charitable_intro_start)
time_range <- difftime(max(lets_do_responses$charitable_intro_start), min(lets_do_responses$charitable_intro_start))
ggplot(lets_do_responses, aes(x = charitable_intro_start)) +
geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
labs(title = "Frequency of \'Let's do this!\'' Responses Over Time",
x = "Time",
y = "Frequency")There is not a clear cutoff between when the two types of responses were recorded. Upon review, we found that chatfuel automatically added “Yes!” as a response option to the greeting.
We want the flag to be true if the response to the greeting is exactly “Let’s do this!” and false otherwise.
data$flag_greeting <- ifelse(grepl("Let's do this!", data$greeting, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_greeting, "Summary of flag_greeting", "flag_greeting", "Frequency")| flag_greeting | Frequency |
|---|---|
| FALSE | 16893 |
| TRUE | 30772 |
We want to check that those who answered “Yes!” to the greeting did not pass on to the other stages. We will check that the consent response is NA in these cases.
data_greeting_yes <- subset(data, greeting == "Yes!")
data_greeting_yes$NA_check <- ifelse(!is.na(data_greeting_yes$consent), TRUE, FALSE)
summary_table(data_greeting_yes$NA_check, "Summary of Greeting NA Check", "greeting_NA_check", "Frequency")| greeting_NA_check | Frequency |
|---|---|
| FALSE | 8203 |
| TRUE | 279 |
We see there are 279 people who responded “Yes!” to the greeting but still answered the consent question. We will add a flag for these people.
data$flag_greeting_yes_check <- ifelse(data$greeting == "Yes!" & !is.na(data$consent), TRUE, FALSE)
summary_table(data$flag_greeting_yes_check, "Summary of flag_greeting_NA_check", "flag_greeting_NA_check", "Frequency")| flag_greeting_NA_check | Frequency |
|---|---|
| FALSE | 47385 |
| TRUE | 279 |
| NA | 1 |
We will now create a binary flag for the pre-consent variable. Below is the most common values for the pre-consent variable. Participants were told “But first! Because this chat is part of an academic study, we need your consent to participate.” They were presented with the option “What should I know?”.
| Response | Frequency |
|---|---|
| What should I know? | 26390 |
| NA | 17801 |
| Ok | 222 |
| No | 144 |
| Yes | 136 |
| Other | 2972 |
We will only mark the binary flag as true if they responded exactly “What should I know?”.
data$flag_pre_consent <- ifelse(grepl("What should I know?", data$pre_consent, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_pre_consent, "Summary of flag_pre_consent", "flag_pre_consent", "Frequency")| flag_pre_consent | Frequency |
|---|---|
| FALSE | 21267 |
| TRUE | 26398 |
We will perform the same process with the consent variable. Participants were asked “Okay! Are you over the age of 18 and ready to get started?! (Tap button below to continue)”. They were presented the options “I Consent, Start Now” and “No, let’s stop here”.
| Response | Frequency |
|---|---|
| NA | 23659 |
| I Consent, Start Now | 19325 |
| No, let’s stop here | 4098 |
| Yes | 38 |
| No | 27 |
| Other | 518 |
We will only mark the consent flag as true if they responded exactly “I consent, start now”.
data$flag_consent <- ifelse(grepl("I Consent, Start Now", data$consent, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_consent, "Summary of flag_consent", "flag_consent", "Frequency")| flag_consent | Frequency |
|---|---|
| FALSE | 28337 |
| TRUE | 19328 |
We now want to check that if a participant did not pass the consent stage that future variables are NA. Specifically, we will check that the following variables are NA if the consent flag is false:
We create a new flag, consent_NA_check, to keep track of cases where not all listed variables are NA.
data$consent_NA_check <- ifelse(data$flag_consent == FALSE & (!is.na(data$treatment_group) | !is.na(data$donate_today) | !is.na(data$stay_connected) | !is.na(data$charity_name) | !is.na(data$charity_match_index)), TRUE, FALSE)
summary_table(data$consent_NA_check, "Summary of Consent NA Check", "consent_NA_check", "Frequency")| consent_NA_check | Frequency |
|---|---|
| FALSE | 47582 |
| TRUE | 83 |
There are 83 instances of participants not passing the consent stage but still having values for at least one of the listed variables. This could be because participants restarted the quiz then didn’t pass the consent stage the second time.
To test this hypothesis, at the end of the quiz, there is a question asking participants if they would like to take it again. This question was asked if the participants indicated they were not happy with their final charity match. They were asked “Do you want to try the charity match quiz again to get a different charity?” and were presented with the options “Sure!” and “No thanks.” We check the most common values for this variable below.
| Response | Frequency |
|---|---|
| NA | 47260 |
| No thanks | 370 |
| Sure! | 17 |
| No | 5 |
| no | 2 |
| Other | 11 |
We will create a flag, flag_repeat_quiz, that is true if the participant responded “Sure!” and false for all other values.
data$flag_repeat_quiz <- ifelse(grepl("Sure!", data$repeat_quiz, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_repeat_quiz, "Summary of flag_repeat_quiz", "flag_repeat_quiz", "Frequency")| flag_repeat_quiz | Frequency |
|---|---|
| FALSE | 47648 |
| TRUE | 17 |
We now check how many participants have both the repeat quiz flag and the consent NA flag.
data$repeat_consent_NA_check <- ifelse(data$flag_repeat_quiz & data$consent_NA_check, TRUE, FALSE)
summary_table(data$repeat_consent_NA_check, "Summary of Repeat Consent NA Check", "repeat_consent_NA_check", "Frequency")| repeat_consent_NA_check | Frequency |
|---|---|
| FALSE | 47665 |
None of the 83 observations are explained by the 17 people who indicated they would like to take the quiz again. Even so, there could be people who started the quiz again without reaching the repeat_quiz question stage.
We will now check that all time variables make logical sense, meaning they are in the correct order.
We’ll begin by examining charitable_affirm_start and charitable_affirm_end as an example, then run through all other time variables.
Charitable_affirm_start should always come before charitable_affirm_end. Below, we check that this is true, excluding NA cases for now. We create a flag, charitable_affirm_correct_order, for observations where the end time is before the start time
data$charitable_affirm_correct_order <- ifelse(data$charitable_affirm_start < data$charitable_affirm_end | is.na(data$charitable_affirm_start) | is.na(data$charitable_affirm_end), FALSE, TRUE)
summary_table(data$charitable_affirm_correct_order, "Summary of Charitable Affirm Correct Order", "charitable_affirm_correct_order", "Frequency")| charitable_affirm_correct_order | Frequency |
|---|---|
| FALSE | 47633 |
| TRUE | 32 |
We see that there are 32 observations where the end time was before the start time. We can look at the first five of these observations to see if there is a pattern.
incorrect_order <- data[data$charitable_affirm_correct_order,]
first_5 <- head(incorrect_order, 5)
first_5 <- first_5[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(first_5, caption = "First 5 Observations with Charitable Affirm End Before Start", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | charitable_affirm_start | charitable_affirm_end |
|---|---|
| 2023-12-03 12:35:11 | 2023-12-02 14:21:47 |
| 2023-12-02 21:50:44 | 2023-12-01 20:18:10 |
| 2023-12-03 20:35:44 | 2023-12-02 23:20:52 |
| 2023-12-07 02:07:12 | 2023-12-03 06:01:08 |
| 2023-12-09 02:12:32 | 2023-12-08 05:25:03 |
It appears that in these obeservations the end time is a day before the start time. These cases likely occur when someone takes the quiz twice and does not finish the second time. As in the introduction section, we will check if any of these inconsistencies are explained by participants indicating they would like to take the quiz again.
data$repeat_charitable_affirm_order_check <- ifelse(data$flag_repeat_quiz & data$charitable_affirm_correct_order, TRUE, FALSE)
summary_table(data$repeat_charitable_affirm_order_check, "Summary of Repeat Charitable Affirm Order Check", "repeat_charitable_affirm_order_check", "Frequency")| repeat_charitable_affirm_order_check | Frequency |
|---|---|
| FALSE | 47665 |
None of these inconsistencies are explained by the 17 people who indicated they would like to take the quiz again.
We will now check that for instances of there being an end time but no start time. We create a flag, charitable_affirm_no_start, for these cases.
data$charitable_affirm_no_start <- ifelse(is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end), TRUE, FALSE)
summary_table(data$charitable_affirm_no_start, "Summary of Charitable Affirm No Start", "charitable_affirm_no_start", "Frequency")| charitable_affirm_no_start | Frequency |
|---|---|
| FALSE | 47662 |
| TRUE | 3 |
We see there are three observations of participants having an end time but no start time. We can look at these three observations.
no_start <- data[is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end),]
no_start <- no_start[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(no_start, caption = "Observations with Charitable Affirm End Time but no Start Time", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | charitable_affirm_start | charitable_affirm_end |
|---|---|
| NA | 2023-12-22 19:57:45 |
| NA | 2023-12-27 15:00:07 |
| NA | 2023-12-23 10:12:53 |
We will now run through the same process for the following start time variables (and associated end variables):
We will create a flag that is triggered if there is a logical issue with any of these time variables, then we produce a table showing how many logical inconsistencies are associated with each variable.
data$start_end_logical_issue <- FALSE
starts = c("charitable_affirm_start_followup1", "charitable_feedback_start", "charitable_fllwup_start", "charitable_intro_start", "charitable_match_start", "charitable_match_start_followup1", "charitable_reveal_start", "charitable_reveal_start_followup1", "charitable_treatment_start")
prev_total = 0
issues <- data.frame('Time Variable' = character(), 'Total Issues' = numeric())
for (start in starts){
end = sub("start", "end", start)
data$start_end_logical_issue <- ifelse(data[[start]] < data[[end]] | is.na(data[[start]]) | is.na(data[[end]]), data$start_end_logical_issue, TRUE)
data$start_end_logical_issue <- ifelse(is.na(data[[start]]) & !is.na(data[[end]]), TRUE, data$start_end_logical_issue)
issue <- data[data$start_end_logical_issue,]
total = nrow(issue) - prev_total
new_row <- data.frame('Time Variable' = start, 'Total Issues' = total)
issues <- rbind(issues, new_row)
prev_total = nrow(issue)
}
kable(issues, caption = "Total Logical Issues for Each Time Variable", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | Time.Variable | Total.Issues |
|---|---|
| charitable_affirm_start_followup1 | 0 |
| charitable_feedback_start | 45 |
| charitable_fllwup_start | 2 |
| charitable_intro_start | 2434 |
| charitable_match_start | 73 |
| charitable_match_start_followup1 | 0 |
| charitable_reveal_start | 82 |
| charitable_reveal_start_followup1 | 0 |
| charitable_treatment_start | 47 |
Charitable_intro_start has an abnormal amount of logical inconsistencies, so we’ll dive deeper.
data$charitable_intro_correct_order <- ifelse(data$charitable_intro_start < data$charitable_intro_end | is.na(data$charitable_intro_start) | is.na(data$charitable_intro_end), FALSE, TRUE)
summary_table(data$charitable_intro_correct_order, "Summary of Charitable Intro Correct Order", "charitable_intro_correct_order", "Frequency")| charitable_intro_correct_order | Frequency |
|---|---|
| FALSE | 47185 |
| TRUE | 480 |
There are 529 observations where the end time was before the start time. We will now look for observations where the start time is NA but there exists an end time.
data$charitable_intro_no_start <- ifelse(is.na(data$charitable_intro_start) & !is.na(data$charitable_intro_end), TRUE, FALSE)
summary_table(data$charitable_intro_no_start, "Summary of Charitable Intro No Start", "charitable_intro_no_start", "Frequency")| charitable_intro_no_start | Frequency |
|---|---|
| FALSE | 45709 |
| TRUE | 1956 |
There are 1956 observations where the start time is NA but there exists an end time.
We want to create a flag that is true if there is a logical issue with any of the time variables.
data$time_logical_issue <- ifelse(data$charitable_match_start < data$charitable_match_end &
data$charitable_match_end < data$charitable_affirm_start &
data$charitable_affirm_start < data$charitable_affirm_end &
data$charitable_affirm_end < data$charitable_reveal_start &
data$charitable_reveal_start < data$charitable_reveal_end &
data$charitable_reveal_end < data$charitable_treatment_start &
data$charitable_treatment_start < data$charitable_treatment_end, TRUE, FALSE)
summary_table(data$time_logical_issue, "Summary of Time Logical Issue", "time_logical_issue", "Frequency")| time_logical_issue | Frequency |
|---|---|
| FALSE | 17285 |
| NA | 30380 |
We’ll start with the US_or_Global variable, using the ‘labelled’ package to create a labelled version of the variable.
data$match_quiz_q1_impact_location <- labelled(data$US_or_Global, labels = c("US" = 1, "International" = 2))
summary_table(data$match_quiz_q1_impact_location, "Match Quiz Impact Location", "match_quiz_q1_impact_location", "Frequency")| match_quiz_q1_impact_location | Frequency |
|---|---|
| 1 | 15125 |
| 2 | 3191 |
| NA | 29349 |
We’ll now create a labelled version of all variables in the match quiz section. We have to convert all raw variables to numeric as they are currently stored as strings. Note that the warning “NAs introduced by coercion” indicates that some non-numeric values are being converted to NA. Is this how we should handle it?
data$SuperheroUS <- ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), as.integer(data$SuperheroUS), NA)## Warning in ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), : NAs
## introduced by coercion
data$match_quiz_q2_superhero_US <- labelled(data$SuperheroUS, labels = c("Transform education" = 1, "Eradicate hunger and homelessness" = 2,
"Defend the oppressed and marginalized" = 3, "Rescue the environment" = 4, "Heal the sick" = 5, "Protect the animals" = 6))
summary_table(data$match_quiz_q2_superhero_US, "Match Quiz Superhero US", "match_quiz_q2_superhero_US", "Frequency")| match_quiz_q2_superhero_US | Frequency |
|---|---|
| 1 | 1577 |
| 2 | 5570 |
| 3 | 1680 |
| 4 | 1153 |
| 5 | 2275 |
| 6 | 2478 |
| NA | 32932 |
data$Global_Option <- ifelse(data$Global_Option %in% c("1", "2", "3", "4", "5"), as.integer(data$Global_Option), NA)
data$match_quiz_q2_superhero_global <- labelled(data$Global_Option, labels = c("Eradicate poverty worldwide" = 1, "Defend the oppressed and marginalized" = 2,
"Rescue the environment" = 3, "Heal the sick" = 4, "Shield the animals" = 5))
summary_table(data$match_quiz_q2_superhero_global, "Match Quiz Superhero Global", "match_quiz_q2_superhero_global", "Frequency")| match_quiz_q2_superhero_global | Frequency |
|---|---|
| 1 | 1070 |
| 2 | 788 |
| 3 | 552 |
| 4 | 399 |
| 5 | 365 |
| NA | 44491 |
data$US_Help_eduction <- ifelse(data$US_Help_eduction %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_eduction), NA)
data$match_quiz_q3_education_US <- labelled(data$US_Help_eduction, labels = c("help all children learn to read" = 1, "send a low income student to college" = 2,
"support girls in STEM (Science, Tech, Engineering, & Medicide)" = 3, "help teachers fund classroom projects" = 4, "provide tutoring and support to underserved children" = 5))
summary_table(data$match_quiz_q3_education_US, "Match Quiz Education US", "match_quiz_q3_education_US", "Frequency")| match_quiz_q3_education_US | Frequency |
|---|---|
| 1 | 667 |
| 2 | 149 |
| 3 | 142 |
| 4 | 156 |
| 5 | 422 |
| NA | 46129 |
data$US_Help_HungerHomeless <- ifelse(data$US_Help_HungerHomeless %in% c("1", "2", "3"), as.integer(data$US_Help_HungerHomeless), NA)
data$match_quiz_q3_hunger_homelessness_US <- labelled(data$US_Help_HungerHomeless, labels = c("a roof over their head" = 1, "hot food in their belly" = 2,
"employment training and opportunities" = 3))
summary_table(data$match_quiz_q3_hunger_homelessness_US, "Match Quiz Hunger and Homelessness US", "match_quiz_q3_hunger_homelessness_US", "Frequency")| match_quiz_q3_hunger_homelessness_US | Frequency |
|---|---|
| 1 | 1941 |
| 2 | 1872 |
| 3 | 1777 |
| NA | 42075 |
data$US_Help_oppressed <- ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_oppressed), NA)## Warning in ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), : NAs
## introduced by coercion
data$match_quiz_q3_marginalized_US <- labelled(data$US_Help_oppressed, labels = c("immigrants and refugees fleeing violence" = 1, "LGBTQ+ communities" = 2,
"racial or ethnic minorities" = 3, "women" = 4, "people with disabilities" = 5))
summary_table(data$match_quiz_q3_marginalized_US, "Match Quiz Marginalized US", "match_quiz_q3_marginalized_US", "Frequency")| match_quiz_q3_marginalized_US | Frequency |
|---|---|
| 1 | 353 |
| 2 | 300 |
| 3 | 334 |
| 4 | 278 |
| 5 | 374 |
| NA | 46026 |
data$US_Help_environment <- ifelse(data$US_Help_environment %in% c("1", "2", "3", "4"), as.integer(data$US_Help_environment), NA)
data$match_quiz_q3_environment_US <- labelled(data$US_Help_environment, labels = c("innovations that further clean energy technology" = 1, "massive ocean clean up" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new polivies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_US, "Match Quiz Environment US", "match_quiz_q3_environment_US", "Frequency")| match_quiz_q3_environment_US | Frequency |
|---|---|
| 1 | 266 |
| 2 | 194 |
| 3 | 451 |
| 4 | 228 |
| NA | 46526 |
data$US_help_sick <- ifelse(data$US_help_sick %in% c("1", "2", "3", "4"), as.integer(data$US_help_sick), NA)
data$match_quiz_q3_sick_US <- labelled(data$US_help_sick, labels = c("people suffering from preventable diseases because of inadequate healthcare" = 1,
"rare diseases that need more research" = 2, "cancer: the Big C" = 3, "people with heart disease" = 4))
summary_table(data$match_quiz_q3_sick_US, "Match Quiz Sick US", "match_quiz_q3_sick_US", "Frequency")| match_quiz_q3_sick_US | Frequency |
|---|---|
| 1 | 623 |
| 2 | 347 |
| 3 | 1064 |
| 4 | 249 |
| NA | 45382 |
data$US_Help_animals <- ifelse(data$US_Help_animals %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_animals), NA)
data$match_quiz_q3_animals_US <- labelled(data$US_Help_animals, labels = c("dogs & cats humans' best friends" = 1,
"all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 2, "any species on the brink of extinction" = 3, "animals in factory farms" = 4,
"whales and ocean animals (Baby Beluga! Nemo!!)" = 5))
summary_table(data$match_quiz_q3_animals_US, "Match Quiz Animals US", "match_quiz_q3_animals_US", "Frequency")| match_quiz_q3_animals_US | Frequency |
|---|---|
| 1 | 1736 |
| 2 | 212 |
| 3 | 206 |
| 4 | 226 |
| 5 | 73 |
| NA | 45212 |
data$Global_help_poverty <- ifelse(data$Global_help_poverty %in% c("1", "2", "3"), as.integer(data$Global_help_poverty), NA)
data$match_quiz_q3_poverty_global <- labelled(data$Global_help_poverty, labels = c("the opportunity to go to school" = 1, "food in their belly" = 2,
"the skills to grow food sustainably" = 3))
summary_table(data$match_quiz_q3_poverty_global, "Match Quiz Poverty Global", "match_quiz_q3_poverty_global", "Frequency")| match_quiz_q3_poverty_global | Frequency |
|---|---|
| 1 | 259 |
| 2 | 259 |
| 3 | 312 |
| NA | 46835 |
data$Global_help_oppressed <- ifelse(data$Global_help_oppressed %in% c("1", "2", "3", "4"), as.integer(data$Global_help_oppressed), NA)
data$match_quiz_q3_oppressed_global <- labelled(data$Global_help_oppressed, labels = c("refugees fleeing violence or other disasters" = 1,
"people that are victims of human trafficking" = 2, "women" = 3, "groups targeted by their government" = 4))
summary_table(data$match_quiz_q3_oppressed_global, "Match Quiz Oppressed Global", "match_quiz_q3_oppressed_global", "Frequency")| match_quiz_q3_oppressed_global | Frequency |
|---|---|
| 1 | 387 |
| 2 | 113 |
| 3 | 100 |
| 4 | 162 |
| NA | 46903 |
data$Global_help_enivronment <- ifelse(data$Global_help_enivronment %in% c("1", "2", "3", "4"), as.integer(data$Global_help_enivronment), NA)
data$match_quiz_q3_environment_global <- labelled(data$Global_help_enivronment, labels = c("innovatiuos that further clean energy technology" = 1, "massive pollution cleanup" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new policies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_global, "Match Quiz Environment Global", "match_quiz_q3_environment_global", "Frequency")| match_quiz_q3_environment_global | Frequency |
|---|---|
| 1 | 123 |
| 2 | 112 |
| 3 | 202 |
| 4 | 113 |
| NA | 47115 |
data$Global_help_sick <- ifelse(data$Global_help_sick %in% c("1", "2", "3", "4"), as.integer(data$Global_help_sick), NA)
data$match_quiz_q3_sick_global <- labelled(data$Global_help_sick, labels = c("protecting against preventable childhood diseases" = 1,
"treating the big three: HIV, AIDS, Tuburculosis and Malaria" = 2, "eradicating polio" = 3, "preventing and curing blindness" = 4))
summary_table(data$match_quiz_q3_sick_global, "Match Quiz Sick Global", "match_quiz_q3_sick_global", "Frequency")| match_quiz_q3_sick_global | Frequency |
|---|---|
| 1 | 204 |
| 2 | 107 |
| 3 | 12 |
| 4 | 62 |
| NA | 47280 |
data$Global_help_animals <- ifelse(data$Global_help_animals %in% c("1", "2", "3"), as.integer(data$Global_help_animals), NA)
data$match_quiz_q3_animals_global <- labelled(data$Global_help_animals, labels = c("all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 1,
"any species on the brink of extinction" = 2, "whales and ocean animals (Baby Beluga! Nemo!!)" = 3))
summary_table(data$match_quiz_q3_animals_global, "Match Quiz Animals Global", "match_quiz_q3_animals_global", "Frequency")| match_quiz_q3_animals_global | Frequency |
|---|---|
| 1 | 147 |
| 2 | 184 |
| 3 | 25 |
| NA | 47309 |
We want to check that each combination of answers to the matchmaking quiz leads to the same charity match. The code below displays the combinations of answers that do not always lead to the same charity_match_index. The column unique_charity_match_index is the number of unique charity_match_index values for each combination of answers, and the column observation_count is the number of observations for each combination of answers.
filtered_data <- subset(data, !is.na(charity_match_index))
grouped_data <- filtered_data |> group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global,
match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US,
match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global,
match_quiz_q3_environment_global, match_quiz_q3_animals_global) |>
summarize(unique_charity_match_index = n_distinct(charity_match_index), observation_count = n())## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.
inconsistent_groups <- grouped_data |>
filter(unique_charity_match_index > 1)
kable(inconsistent_groups, caption = "Inconsistent Groups", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | match_quiz_q1_impact_location | match_quiz_q2_superhero_US | match_quiz_q2_superhero_global | match_quiz_q3_education_US | match_quiz_q3_hunger_homelessness_US | match_quiz_q3_marginalized_US | match_quiz_q3_environment_US | match_quiz_q3_sick_US | match_quiz_q3_animals_US | match_quiz_q3_poverty_global | match_quiz_q3_oppressed_global | match_quiz_q3_sick_global | match_quiz_q3_environment_global | match_quiz_q3_animals_global | unique_charity_match_index | observation_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 2 |
We now write a CSV file for each unique combination of answers to the matchmaking quiz that only contains the answers with exactly one unique match. We also add a column for the unique charity_match_index.
filtered_data <- subset(data, !is.na(charity_match_index))
grouped_data <- filtered_data %>%
group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global,
match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US,
match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global,
match_quiz_q3_environment_global, match_quiz_q3_animals_global) %>%
summarize(unique_charity_match_index = n_distinct(charity_match_index),
observation_count = n(),
charity_match_index = first(charity_match_index))## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.
consistent_groups <- grouped_data %>%
filter(unique_charity_match_index == 1)
# write.csv(consistent_groups, "Data/Processed/Consistent_Groups.csv")We see that in all but one inconsistent combination, participants have not finished the quiz.
We will go through a similar process as with the Match Quiz Variables, creating labelled versions of each.
data$important_smart <- ifelse(data$important_smart %in% c("1", "2", "3"), as.integer(data$important_smart), NA)
data$values_quiz_q1_smart <- labelled(data$important_smart, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q1_smart, "Summary of Values Quiz Q1: Smart", "values_quiz_q1_smart", "Frequency")| values_quiz_q1_smart | Frequency |
|---|---|
| 1 | 908 |
| 2 | 2439 |
| 3 | 13656 |
| NA | 30662 |
data$important_responsive <- ifelse(data$important_responsive %in% c("1", "2", "3"), as.integer(data$important_responsive), NA)
data$values_quiz_q2_responsive <- labelled(data$important_responsive, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q2_responsive, "Summary of Values Quiz Q2: Responsive", "values_quiz_q2_responsive", "Frequency")| values_quiz_q2_responsive | Frequency |
|---|---|
| 1 | 982 |
| 2 | 4672 |
| 3 | 11162 |
| NA | 30849 |
data$important_forwardlooking <- ifelse(data$important_forwardlooking %in% c("1", "2", "3"), as.integer(data$important_forwardlooking), NA)
data$values_quiz_q3_forwardlooking <- labelled(data$important_forwardlooking, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q3_forwardlooking, "Summary of Values Quiz Q3: Forward Looking", "values_quiz_q3_forwardlooking", "Frequency")| values_quiz_q3_forwardlooking | Frequency |
|---|---|
| 1 | 546 |
| 2 | 4260 |
| 3 | 11830 |
| NA | 31029 |
We will also check the values of the donor type.
| donor_type | Frequency |
|---|---|
| forward_looking | 5349 |
| responsive | 4797 |
| smart | 6387 |
| unsure | 99 |
| NA | 31033 |
There do not appear to be any issues with this variable.
First we’ll look at the treatment_group variable, which appears to have no issues.
| treatment_group | Frequency |
|---|---|
| control | 2265 |
| obligation | 4543 |
| opportunity | 4420 |
| NA | 36437 |
We now want to check that those who have a treatment value of NA also have NA values for the following variables. `
data$treatment_NA_check <- ifelse(is.na(data$treatment_group) & (!is.na(data$treatment_RF) | !is.na(data$treatment_FR) | !is.na(data$donate_today) | !is.na(data$stay_connected)), TRUE, FALSE)
summary_table(data$treatment_NA_check, "Summary of Treatment NA Check", "treatment_NA_check", "Frequency")| treatment_NA_check | Frequency |
|---|---|
| FALSE | 47662 |
| TRUE | 3 |
There are 3 cases in which the treatment group is NA but at least one of the other variables are not. This could be because people are restarting the quiz, then not passing the consent stage the second time.
We’ll now look at the free responses within the treatment variable group. Namely, treatment_RF and treatment_FR. For each, we’ll look at the number of NAs along with the average length of response when it is not NA. The treatment_RF variable is the free response when asked, “imagine talking to (text1_opportunity), image what they might say to you. How would this make you feel?” The question leading to the treatment_FR variable is posed the same way, exept the text1_opportunity is replaced with text1_obligation.
data$treatment_RF <- as.character(data$treatment_RF)
data$treatment_FR <- as.character(data$treatment_FR)
treatment_RF_na <- sum(is.na(data$treatment_RF))
treatment_FR_na <- sum(is.na(data$treatment_FR))
treatment_RF_avg_length <- mean(nchar(data$treatment_RF[!is.na(data$treatment_RF)]))
treatment_FR_avg_length <- mean(nchar(data$treatment_FR[!is.na(data$treatment_FR)]))
treatment_na <- data.frame('Variable' = c("treatment_RF", "treatment_FR"), 'NAs' = c(treatment_RF_na, treatment_FR_na), 'Average Length' = c(treatment_RF_avg_length, treatment_FR_avg_length))
kable(treatment_na, caption = "Summary of NAs and Average Length of Treatment Variables", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | Variable | NAs | Average.Length |
|---|---|---|
| treatment_RF | 44451 | 53.87243 |
| treatment_FR | 44498 | 50.88254 |
There appear to be no issues with the manipulation order variable.
| manipulation_order | Frequency |
|---|---|
| 1 | 4262 |
| 2 | 4383 |
| NA | 39020 |
We will create labelled versions of the manipulation value question as done with the mathmaking and values quizzes.
data$manipulation_q1_value <- ifelse(data$manipulation_question_value %in% c("1", "2", "3", "4", "5"), as.integer(data$manipulation_question_value), NA)## Warning in ifelse(data$manipulation_question_value %in% c("1", "2", "3", : NAs
## introduced by coercion
data$manipulation_q1_value <- labelled(data$manipulation_q1_value, labels = c("Not at all important" = 1, "Not too important" = 2, "Somewhat important" = 3,
"Quite important" = 4, "Very important" = 5))
summary_table(data$manipulation_q1_value, "Manipulation Question 1: Value", "manipulation_q1_value", "Frequency")| manipulation_q1_value | Frequency |
|---|---|
| 1 | 126 |
| 2 | 28 |
| 3 | 221 |
| 4 | 1020 |
| 5 | 5337 |
| NA | 40933 |
We’ll look at the frequencies of the most common values of the donate_today variable.
| Response | Frequency |
|---|---|
| NA | 39770 |
| No, not ready | 6668 |
| Yes, let’s do it! | 1046 |
| No | 10 |
| Stop | 4 |
| Other | 167 |
We will only accept “Yes, let’s do it!” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$boolean_donate_today_NA <- ifelse(is.na(data$donate_today), NA, ifelse(grepl("Yes, let's do it!", data$donate_today, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$boolean_donate_today_NA, "Summary of Boolean Donate Today Variable (With NA)", "boolean_donate_today", "Frequency")| boolean_donate_today | Frequency |
|---|---|
| FALSE | 6849 |
| TRUE | 1046 |
| NA | 39770 |
data$boolean_donate_today_noNA <- ifelse(grepl("Yes, let's do it!", data$donate_today, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$boolean_donate_today_noNA, "Summary of Boolean Donate Today Variable (With NA = FALSE)", "boolean_donate_today_noNA", "Frequency")| boolean_donate_today_noNA | Frequency |
|---|---|
| FALSE | 46619 |
| TRUE | 1046 |
We look at the five most common values of the donate_later variable.
| Response | Frequency |
|---|---|
| NA | 41824 |
| Explore others | 2238 |
| Learn more | 1939 |
| Do it later | 1205 |
| Stop | 57 |
| Other | 402 |
We’ll look at the frequencies of the most common values of the stay_connected variable.
| Response | Frequency |
|---|---|
| NA | 41499 |
| No thanks! | 3813 |
| Yes please! | 2318 |
| Yes | 4 |
| No | 3 |
| Other | 28 |
We will only accept “Yes please!” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$boolean_stay_connected_NA <- ifelse(is.na(data$stay_connected), NA, ifelse(grepl("Yes please!", data$stay_connected, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$boolean_stay_connected_NA, "Summary of Boolean Stay Connected Variable (With NA)", "boolean_stay_connected_NA", "Frequency")| boolean_stay_connected_NA | Frequency |
|---|---|
| FALSE | 3848 |
| TRUE | 2318 |
| NA | 41499 |
data$boolean_stay_connected_noNA <- ifelse(grepl("Yes please!", data$stay_connected, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$boolean_stay_connected_noNA, "Summary of Boolean Stay Connected Variable (With NA = FALSE)", "boolean_stay_connected_noNA", "Frequency")| boolean_stay_connected_noNA | Frequency |
|---|---|
| FALSE | 45347 |
| TRUE | 2318 |
We will begin by looking at all values and frequencies of the charity_match_index variable.
| charity_match_index | Frequency |
|---|---|
| global_animals_1 | 144 |
| global_animals_2 | 179 |
| global_animals_3 | 22 |
| global_educ_1 | 251 |
| global_educ_2 | 250 |
| global_educ_3 | 298 |
| global_educ_4 | 229 |
| global_envir_1 | 122 |
| global_envir_2 | 108 |
| global_envir_3 | 194 |
| global_envir_4 | 111 |
| global_oppressed_1 | 377 |
| global_oppressed_2 | 111 |
| global_oppressed_3 | 99 |
| global_oppressed_4 | 157 |
| global_sick_1 | 197 |
| global_sick_2 | 98 |
| global_sick_3 | 12 |
| global_sick_4 | 54 |
| us_animals_#1 | 1 |
| us_animals_1 | 1708 |
| us_animals_2 | 205 |
| us_animals_3 | 201 |
| us_animals_4 | 224 |
| us_animals_5 | 68 |
| us_educ_1 | 638 |
| us_educ_2 | 139 |
| us_educ_3 | 139 |
| us_educ_4 | 155 |
| us_educ_5 | 406 |
| us_envir_1 | 254 |
| us_envir_2 | 190 |
| us_envir_3 | 442 |
| us_envir_4 | 220 |
| us_hunger_,,3 | 1 |
| us_hunger_#1 | 1 |
| us_hunger_1 | 1911 |
| us_hunger_2 | 1830 |
| us_hunger_3 | 1734 |
| us_oppressed_1 | 342 |
| us_oppressed_2 | 295 |
| us_oppressed_3 | 328 |
| us_oppressed_4 | 273 |
| us_oppressed_5 | 357 |
| us_sick_1 | 601 |
| us_sick_2 | 340 |
| us_sick_3 | 1030 |
| us_sick_4 | 238 |
| NA | 30381 |
There are several values that seem to be incorrect. For example, there is a value with “us_hunger_,,3,” “us_hunger_#1,” and “us_animals_#1,” all of which only have one observation and don’t fit the name patterns of the other values.
We will begin by looking at all values and frequencies of the charity_name variable.
| charity_name | Frequency |
|---|---|
| American Heart Association | 237 |
| American Society for the Prevention of Cruelty to Animals | 1701 |
| Amnesty International | 156 |
| BRAC | 296 |
| Clean Air Task Force | 328 |
| Concern Worldwide | 98 |
| Dana-Farber Cancer Institute | 1029 |
| DonorsChoose | 155 |
| Evidence Action Inc. | 195 |
| Feeding America | 1821 |
| Freedom Network | 111 |
| Girls Inc. | 139 |
| Global Fund for Women | 98 |
| HealthWell Foundation | 598 |
| HIAS | 717 |
| Housing Matters | 1905 |
| Mercy For Animals | 224 |
| National Women’s Law Center | 270 |
| Natural Resources Defense Council | 374 |
| Rainforest Trust | 634 |
| Reading is Fundamental | 635 |
| Rocketship Education | 403 |
| Room to Read | 248 |
| Sightsavers Inc. | 54 |
| Southern Poverty Law Center | 326 |
| Special Olympics | 354 |
| Surfrider Foundation | 295 |
| The Leukemia & Lymphoma Society | 339 |
| The Ocean Foundation | 90 |
| The Rotary Foundation | 12 |
| The Trevor Project | 294 |
| Thurgood Marshall College Fund | 139 |
| WaterAid | 228 |
| Wild Earth Allies | 349 |
| Wildlife SOS | 379 |
| World Food Program | 247 |
| Year Up | 1728 |
| NA | 30459 |
Here there are no clear instances of incorrect values. There are 78 more NA values for charity_name than for charity_match_index. This is because there was a periods of time where the corresponding app wasn’t serving data over to chatfuel. We will create a flag for when charity_name is NA and charity_match_index is not NA.
data$charity_name_NA <- ifelse(is.na(data$charity_name) & !is.na(data$charity_match_index), TRUE, FALSE)
summary_table(data$charity_name_NA, "Summary of Charity Name NA", "charity_name_NA", "Frequency")| charity_name_NA | Frequency |
|---|---|
| FALSE | 47587 |
| TRUE | 78 |
We see that there are exactly 78 of such instances.
We want to check that each charity_match_index corresponds to exactly one charity_name. When doing so, we will ignore the 78 cases where charity_name is NA and charity_match_index is not NA.
data_filtered <- subset(data, charity_name_NA == FALSE)
result <- data_filtered |>
group_by(charity_match_index) |>
summarise(distinct_charity_names = n_distinct(charity_name))
kable(result, caption = "Number of Charity Names Associated With Each Charity Match Index", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) | charity_match_index | distinct_charity_names |
|---|---|
| global_animals_1 | 1 |
| global_animals_2 | 1 |
| global_animals_3 | 1 |
| global_educ_1 | 1 |
| global_educ_2 | 1 |
| global_educ_3 | 2 |
| global_educ_4 | 1 |
| global_envir_1 | 1 |
| global_envir_2 | 1 |
| global_envir_3 | 1 |
| global_envir_4 | 1 |
| global_oppressed_1 | 1 |
| global_oppressed_2 | 1 |
| global_oppressed_3 | 1 |
| global_oppressed_4 | 1 |
| global_sick_1 | 1 |
| global_sick_2 | 1 |
| global_sick_3 | 1 |
| global_sick_4 | 1 |
| us_animals_1 | 1 |
| us_animals_2 | 1 |
| us_animals_3 | 1 |
| us_animals_4 | 1 |
| us_animals_5 | 1 |
| us_educ_1 | 1 |
| us_educ_2 | 1 |
| us_educ_3 | 1 |
| us_educ_4 | 1 |
| us_educ_5 | 1 |
| us_envir_1 | 1 |
| us_envir_2 | 1 |
| us_envir_3 | 1 |
| us_envir_4 | 1 |
| us_hunger_1 | 1 |
| us_hunger_2 | 2 |
| us_hunger_3 | 1 |
| us_oppressed_1 | 1 |
| us_oppressed_2 | 1 |
| us_oppressed_3 | 2 |
| us_oppressed_4 | 1 |
| us_oppressed_5 | 1 |
| us_sick_1 | 1 |
| us_sick_2 | 1 |
| us_sick_3 | 2 |
| us_sick_4 | 1 |
| NA | 1 |
We see that there are four charity_match_indexes that have more than one distinct charity_name, namely global_educ_3, us_hunger_2, us_oppressed_3, and us_sick_3.
data_global_educ_3 <- subset(data_filtered, charity_match_index == "global_educ_3")
data_us_hunger_2 <- subset(data_filtered, charity_match_index == "us_hunger_2")
data_us_oppressed_3 <- subset(data_filtered, charity_match_index == "us_oppressed_3")
data_us_sick_3 <- subset(data_filtered, charity_match_index == "us_sick_3")
summary_table(data_global_educ_3$charity_name, "Charity Name for Global Education 3", "charity_name_global_educ_3", "Frequency")| charity_name_global_educ_3 | Frequency |
|---|---|
| BRAC | 296 |
| HIAS | 1 |
summary_table(data_us_hunger_2$charity_name, "Charity Name for US Hunger 2", "charity_name_us_hunger_2", "Frequency")| charity_name_us_hunger_2 | Frequency |
|---|---|
| Dana-Farber Cancer Institute | 1 |
| Feeding America | 1821 |
summary_table(data_us_oppressed_3$charity_name, "Charity Name for US Oppressed 3", "charity_name_us_oppressed_3", "Frequency")| charity_name_us_oppressed_3 | Frequency |
|---|---|
| Southern Poverty Law Center | 326 |
| The Trevor Project | 1 |
summary_table(data_us_sick_3$charity_name, "Charity Name for US Sick 3", "charity_name_us_sick_3", "Frequency")| charity_name_us_sick_3 | Frequency |
|---|---|
| Dana-Farber Cancer Institute | 1028 |
| Housing Matters | 1 |
For each one, there is only one case of an incorrect value. ## Follow Up Variables We want to create boolean versions of the follow up variables. ### follow_up_donated We’ll look at the frequencies of the most common values of the follow_up_donated variable.
| Response | Frequency |
|---|---|
| NA | 47260 |
| Not yet | 195 |
| Yes, donated! | 166 |
| No | 8 |
| Stop | 2 |
| Other | 34 |
We will only accept “Yes, donated!” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$same_day_donated_NA <- ifelse(is.na(data$follow_up_donated), NA, ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated_NA, "Summary of Boolean Follow Up Donated Variable (With NA)", "boolean_follow_up_donated_NA", "Frequency")| boolean_follow_up_donated_NA | Frequency |
|---|---|
| FALSE | 239 |
| TRUE | 166 |
| NA | 47260 |
data$same_day_donated_noNA <- ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated_noNA, "Summary of Boolean Follow Up Donated Variable (With NA = FALSE)", "boolean_follow_up_donated_noNA", "Frequency")| boolean_follow_up_donated_noNA | Frequency |
|---|---|
| FALSE | 47499 |
| TRUE | 166 |
We’ll look at the frequencies of the most common values of the good_match variable.
| Response | Frequency |
|---|---|
| NA | 45611 |
| Not really… | 934 |
| Yes! | 736 |
| Stop | 77 |
| Yes | 41 |
| Other | 266 |
We will only accept “Yes!” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$same_day_good_match_NA <- ifelse(is.na(data$good_match), NA, ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_good_match_NA, "Summary of Boolean Good Match Variable (With NA)", "boolean_good_match_NA", "Frequency")| boolean_good_match_NA | Frequency |
|---|---|
| FALSE | 1318 |
| TRUE | 736 |
| NA | 45611 |
data$same_day_good_match_noNA <- ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_good_match_noNA, "Summary of Boolean Good Match Variable (With NA = FALSE)", "boolean_good_match_noNA", "Frequency")| boolean_good_match_noNA | Frequency |
|---|---|
| FALSE | 46929 |
| TRUE | 736 |
We’ll look at the frequencies of the most common values of the follow_up_donated_1 variable.
| Response | Frequency |
|---|---|
| NA | 46998 |
| No, not yet. | 555 |
| Yes, donated. | 88 |
| Stop | 4 |
| no, not yet. | 2 |
| Other | 18 |
We will only accept “Yes, doanted.” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$same_day_donated1_NA <- ifelse(is.na(data$followup_donated_1), NA, ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated1_NA, "Summary of Boolean Follow Up Donated 1 Variable (With NA)", "boolean_follow_up_donated_1_NA", "Frequency")| boolean_follow_up_donated_1_NA | Frequency |
|---|---|
| FALSE | 579 |
| TRUE | 88 |
| NA | 46998 |
data$same_day_donated1_noNA <- ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated1_noNA, "Summary of Boolean Follow Up Donated 1 Variable (With NA = FALSE)", "boolean_follow_up_donated_1_noNA", "Frequency")| boolean_follow_up_donated_1_noNA | Frequency |
|---|---|
| FALSE | 47577 |
| TRUE | 88 |
We’ll look at the frequencies of the most common values of the not_good_match_learn_more variable.
most_common_table(data$not_good_match_learn_more, "Top 5 Most Common Responses for Not Good Match Learn More")| Response | Frequency |
|---|---|
| NA | 46860 |
| Nope | 685 |
| Sure | 96 |
| Stop | 5 |
| STOP | 4 |
| Other | 15 |
We will only accept “Sure” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$same_day_not_good_match_learn_more_NA <- ifelse(is.na(data$not_good_match_learn_more), NA, ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_not_good_match_learn_more_NA, "Summary of Boolean Not Good Match Learn More Variable (With NA)", "boolean_not_good_match_learn_more_NA", "Frequency")| boolean_not_good_match_learn_more_NA | Frequency |
|---|---|
| FALSE | 709 |
| TRUE | 96 |
| NA | 46860 |
data$same_day_not_good_match_learn_more_noNA <- ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_not_good_match_learn_more_noNA, "Summary of Boolean Not Good Match Learn More Variable (With NA = FALSE)", "boolean_not_good_match_learn_more_noNA", "Frequency")| boolean_not_good_match_learn_more_noNA | Frequency |
|---|---|
| FALSE | 47569 |
| TRUE | 96 |
We’ll look at the frequencies of the most common values of the donate_amount variable.
| Response | Frequency |
|---|---|
| NA | 47442 |
| Prefer not to share | 141 |
| 25 | 13 |
| $25 | 9 |
| $100 | 5 |
| Other | 55 |
We want to remove any dollar signs then convert to numeric.
data$donate_amount_numeric <- gsub("\\$", "", data$donate_amount)
data$donate_amount_numeric <- as.numeric(data$donate_amount)## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.0 25.0 25.0 55.2 50.0 260.0 47635
We’ll look at the frequencies of the most common values of the recurring variable.
| Response | Frequency |
|---|---|
| NA | 47457 |
| Nope | 158 |
| Yes! | 45 |
| If this is Charity Navigator, you should have that information | 1 |
| No | 1 |
| Other | 3 |
We will only accept “Yes!” as a positive response. We will create two versions of this variable, one with NA values and one without.
data$same_day_recurring_NA <- ifelse(is.na(data$recurring), NA, ifelse(grepl("Yes!", data$recurring, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_recurring_NA, "Summary of Boolean Recurring Variable (With NA)", "boolean_recurring_NA", "Frequency")| boolean_recurring_NA | Frequency |
|---|---|
| FALSE | 163 |
| TRUE | 45 |
| NA | 47457 |
data$same_day_recurring_noNA <- ifelse(grepl("Yes!", data$recurring, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_recurring_noNA, "Summary of Boolean Recurring Variable (With NA = FALSE)", "boolean_recurring_noNA", "Frequency")| boolean_recurring_noNA | Frequency |
|---|---|
| FALSE | 47620 |
| TRUE | 45 |