Helper Functions

We’ll begin by defining a few helper functions which will be used throughout the script to create tables and summaries of variables.

Most Common Table

This first function creates a table of the five most common values of a variable and includes a row for the sum of all other responses (other).

most_common_table <- function(variable, title){
    table <- table(variable, useNA = 'ifany')
    table <- sort(table, decreasing = TRUE)
    k = 5
    top_k <- head(table, 5)
    other <- sum(table[-(1:k)])
    freq_df <- data.frame(Response = names(top_k), Frequency = as.vector(top_k))
    other_df <- data.frame(Response = "Other", Frequency = other)
    result <- rbind(freq_df, other_df)
    kable(result,  caption = title, align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
}

Summary Table

This function creates a clean table to look at the frequencies of different values of a variable. Unlike the function above, it will list all values.

summary_table <- function(variable, title, var1, var2){
    bool_df <- as.data.frame(table(variable, useNA = 'ifany'))
    colnames(bool_df) <- c(var1, var2)
    kable(bool_df, caption = title, align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
}

Intro Variables

We will create binary flags for passing the greeting, pre-consent, and consent variables.

Greeting

First we will create a binary flag indicating if someone has passed the greeting. We’ll look at the most common values of the greeting variable. Participants were told “Hi! I’m Sam, a friendly chatbot, and I want to help you find your charity soulmate. With just 6 questions, I’ll find a highly-rated charity that inspires you!”. They were presented the option “Let’s do this!”.

most_common_table(data$greeting, "Top 5 Most Common Responses for Greeting")
Top 5 Most Common Responses for Greeting
Response Frequency
Let’s do this! 30772
Yes! 8482
NA 4901
No 200
Stop 166
Other 3144

Currently, Chatfuel lists “Let’s do this!” as the only given response option to greeting, but we see that “Yes!” is also a common response. It could be that previously, chatfuel also (or only) listed “Yes!” as an option. To test this hypothesis, we will check the time periods in which “Let’s do this!” and “Yes!” reponses were recorded.

data$charitable_intro_start <- as.Date(data$charitable_intro_start)
yes_responses <- subset(data, greeting == "Yes!" & !is.na(charitable_intro_start))
lets_do_responses <- subset(data, greeting == "Let's do this!" & !is.na(charitable_intro_start))

yes_responses$charitable_intro_start <- as.POSIXct(yes_responses$charitable_intro_start)
time_range <- difftime(max(yes_responses$charitable_intro_start), min(yes_responses$charitable_intro_start))
binwidth <- 60 * 60 * 24
ggplot(yes_responses, aes(x = charitable_intro_start)) +
  geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
  labs(title = "Frequency of \'Yes!\'' Responses Over Time",
       x = "Time",
       y = "Frequency")

lets_do_responses$charitable_intro_start <- as.POSIXct(lets_do_responses$charitable_intro_start)
time_range <- difftime(max(lets_do_responses$charitable_intro_start), min(lets_do_responses$charitable_intro_start)) 
ggplot(lets_do_responses, aes(x = charitable_intro_start)) +
  geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
  labs(title = "Frequency of \'Let's do this!\'' Responses Over Time",
       x = "Time",
       y = "Frequency")

There is not a clear cutoff between when the two types of responses were recorded. Upon review, we found that chatfuel automatically added “Yes!” as a response option to the greeting.

We want the flag to be true if the response to the greeting is exactly “Let’s do this!” and false otherwise.

data$flag_greeting <- ifelse(grepl("Let's do this!", data$greeting, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_greeting, "Summary of flag_greeting", "flag_greeting", "Frequency")
Summary of flag_greeting
flag_greeting Frequency
FALSE 16893
TRUE 30772

We want to check that those who answered “Yes!” to the greeting did not pass on to the other stages. We will check that the consent response is NA in these cases.

data_greeting_yes <- subset(data, greeting == "Yes!")
data_greeting_yes$NA_check <- ifelse(!is.na(data_greeting_yes$consent), TRUE, FALSE)
summary_table(data_greeting_yes$NA_check, "Summary of Greeting NA Check", "greeting_NA_check", "Frequency")
Summary of Greeting NA Check
greeting_NA_check Frequency
FALSE 8203
TRUE 279

We see there are 279 people who responded “Yes!” to the greeting but still answered the consent question. We will add a flag for these people.

data$flag_greeting_yes_check <- ifelse(data$greeting == "Yes!" & !is.na(data$consent), TRUE, FALSE)
summary_table(data$flag_greeting_yes_check, "Summary of flag_greeting_NA_check", "flag_greeting_NA_check", "Frequency")
Summary of flag_greeting_NA_check
flag_greeting_NA_check Frequency
FALSE 47385
TRUE 279
NA 1

Time variables

We will now check that all time variables make logical sense, meaning they are in the correct order.

Charitable Affirm Start

We’ll begin by examining charitable_affirm_start and charitable_affirm_end as an example, then run through all other time variables.

Charitable_affirm_start should always come before charitable_affirm_end. Below, we check that this is true, excluding NA cases for now. We create a flag, charitable_affirm_correct_order, for observations where the end time is before the start time

data$charitable_affirm_correct_order <- ifelse(data$charitable_affirm_start < data$charitable_affirm_end | is.na(data$charitable_affirm_start) | is.na(data$charitable_affirm_end), FALSE, TRUE)
summary_table(data$charitable_affirm_correct_order, "Summary of Charitable Affirm Correct Order", "charitable_affirm_correct_order", "Frequency")
Summary of Charitable Affirm Correct Order
charitable_affirm_correct_order Frequency
FALSE 47633
TRUE 32

We see that there are 32 observations where the end time was before the start time. We can look at the first five of these observations to see if there is a pattern.

incorrect_order <- data[data$charitable_affirm_correct_order,]
first_5 <- head(incorrect_order, 5)
first_5 <- first_5[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(first_5, caption = "First 5 Observations with Charitable Affirm End Before Start", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
First 5 Observations with Charitable Affirm End Before Start
charitable_affirm_start charitable_affirm_end
2023-12-03 12:35:11 2023-12-02 14:21:47
2023-12-02 21:50:44 2023-12-01 20:18:10
2023-12-03 20:35:44 2023-12-02 23:20:52
2023-12-07 02:07:12 2023-12-03 06:01:08
2023-12-09 02:12:32 2023-12-08 05:25:03

It appears that in these obeservations the end time is a day before the start time. These cases likely occur when someone takes the quiz twice and does not finish the second time. As in the introduction section, we will check if any of these inconsistencies are explained by participants indicating they would like to take the quiz again.

data$repeat_charitable_affirm_order_check <- ifelse(data$flag_repeat_quiz & data$charitable_affirm_correct_order, TRUE, FALSE)
summary_table(data$repeat_charitable_affirm_order_check, "Summary of Repeat Charitable Affirm Order Check", "repeat_charitable_affirm_order_check", "Frequency")
Summary of Repeat Charitable Affirm Order Check
repeat_charitable_affirm_order_check Frequency
FALSE 47665

None of these inconsistencies are explained by the 17 people who indicated they would like to take the quiz again.

We will now check that for instances of there being an end time but no start time. We create a flag, charitable_affirm_no_start, for these cases.

data$charitable_affirm_no_start <- ifelse(is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end), TRUE, FALSE)
summary_table(data$charitable_affirm_no_start, "Summary of Charitable Affirm No Start", "charitable_affirm_no_start", "Frequency")
Summary of Charitable Affirm No Start
charitable_affirm_no_start Frequency
FALSE 47662
TRUE 3

We see there are three observations of participants having an end time but no start time. We can look at these three observations.

no_start <- data[is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end),]
no_start <- no_start[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(no_start, caption = "Observations with Charitable Affirm End Time but no Start Time", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Observations with Charitable Affirm End Time but no Start Time
charitable_affirm_start charitable_affirm_end
NA 2023-12-22 19:57:45
NA 2023-12-27 15:00:07
NA 2023-12-23 10:12:53

Other Time Variables

We will now run through the same process for the following start time variables (and associated end variables):

  • charitable_affirm_start_follwup1
  • charitable_feedback_start
  • charitable_fllwup_start
  • charitable_intro_start
  • charitable_match_start
  • charitable_match_start_followup1
  • charitable_reveal_start
  • chartiable_reveal_start_followup1
  • chartiable_treatment_start

We will create a flag that is triggered if there is a logical issue with any of these time variables, then we produce a table showing how many logical inconsistencies are associated with each variable.

data$start_end_logical_issue <- FALSE
starts = c("charitable_affirm_start_followup1", "charitable_feedback_start", "charitable_fllwup_start", "charitable_intro_start", "charitable_match_start", "charitable_match_start_followup1", "charitable_reveal_start", "charitable_reveal_start_followup1", "charitable_treatment_start")
prev_total = 0
issues <- data.frame('Time Variable' = character(), 'Total Issues' = numeric())
for (start in starts){
    end = sub("start", "end", start)
    data$start_end_logical_issue <- ifelse(data[[start]] < data[[end]] | is.na(data[[start]]) | is.na(data[[end]]), data$start_end_logical_issue, TRUE)
    data$start_end_logical_issue <- ifelse(is.na(data[[start]]) & !is.na(data[[end]]), TRUE, data$start_end_logical_issue)
    issue <- data[data$start_end_logical_issue,]
    total = nrow(issue) - prev_total
    new_row <- data.frame('Time Variable' = start, 'Total Issues' = total)
    issues <- rbind(issues, new_row)
    prev_total = nrow(issue)
}
kable(issues, caption = "Total Logical Issues for Each Time Variable", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Total Logical Issues for Each Time Variable
Time.Variable Total.Issues
charitable_affirm_start_followup1 0
charitable_feedback_start 45
charitable_fllwup_start 2
charitable_intro_start 2434
charitable_match_start 73
charitable_match_start_followup1 0
charitable_reveal_start 82
charitable_reveal_start_followup1 0
charitable_treatment_start 47

Charitable_intro_start has an abnormal amount of logical inconsistencies, so we’ll dive deeper.

data$charitable_intro_correct_order <- ifelse(data$charitable_intro_start < data$charitable_intro_end | is.na(data$charitable_intro_start) | is.na(data$charitable_intro_end), FALSE, TRUE)
summary_table(data$charitable_intro_correct_order, "Summary of Charitable Intro Correct Order", "charitable_intro_correct_order", "Frequency")
Summary of Charitable Intro Correct Order
charitable_intro_correct_order Frequency
FALSE 47185
TRUE 480

There are 529 observations where the end time was before the start time. We will now look for observations where the start time is NA but there exists an end time.

data$charitable_intro_no_start <- ifelse(is.na(data$charitable_intro_start) & !is.na(data$charitable_intro_end), TRUE, FALSE)
summary_table(data$charitable_intro_no_start, "Summary of Charitable Intro No Start", "charitable_intro_no_start", "Frequency")
Summary of Charitable Intro No Start
charitable_intro_no_start Frequency
FALSE 45709
TRUE 1956

There are 1956 observations where the start time is NA but there exists an end time.

General Time Variable flag

We want to create a flag that is true if there is a logical issue with any of the time variables.

data$time_logical_issue <- ifelse(data$charitable_match_start < data$charitable_match_end &
                                  data$charitable_match_end < data$charitable_affirm_start &
                                  data$charitable_affirm_start < data$charitable_affirm_end &
                                  data$charitable_affirm_end < data$charitable_reveal_start &
                                  data$charitable_reveal_start < data$charitable_reveal_end &
                                  data$charitable_reveal_end < data$charitable_treatment_start &
                                  data$charitable_treatment_start < data$charitable_treatment_end, TRUE, FALSE)
summary_table(data$time_logical_issue, "Summary of Time Logical Issue", "time_logical_issue", "Frequency")
Summary of Time Logical Issue
time_logical_issue Frequency
FALSE 17285
NA 30380

Matchmaking Quiz Variables

Creating labelled versions of each variable

We’ll start with the US_or_Global variable, using the ‘labelled’ package to create a labelled version of the variable.

data$match_quiz_q1_impact_location <- labelled(data$US_or_Global, labels = c("US" = 1, "International" = 2))
summary_table(data$match_quiz_q1_impact_location, "Match Quiz Impact Location", "match_quiz_q1_impact_location", "Frequency")
Match Quiz Impact Location
match_quiz_q1_impact_location Frequency
1 15125
2 3191
NA 29349

We’ll now create a labelled version of all variables in the match quiz section. We have to convert all raw variables to numeric as they are currently stored as strings. Note that the warning “NAs introduced by coercion” indicates that some non-numeric values are being converted to NA. Is this how we should handle it?

data$SuperheroUS <- ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), as.integer(data$SuperheroUS), NA)
## Warning in ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), : NAs
## introduced by coercion
data$match_quiz_q2_superhero_US <- labelled(data$SuperheroUS, labels = c("Transform education" = 1, "Eradicate hunger and homelessness" = 2, 
"Defend the oppressed and marginalized" = 3, "Rescue the environment" = 4, "Heal the sick" = 5, "Protect the animals" = 6))
summary_table(data$match_quiz_q2_superhero_US, "Match Quiz Superhero US", "match_quiz_q2_superhero_US", "Frequency")
Match Quiz Superhero US
match_quiz_q2_superhero_US Frequency
1 1577
2 5570
3 1680
4 1153
5 2275
6 2478
NA 32932
data$Global_Option <- ifelse(data$Global_Option %in% c("1", "2", "3", "4", "5"), as.integer(data$Global_Option), NA)
data$match_quiz_q2_superhero_global <- labelled(data$Global_Option, labels = c("Eradicate poverty worldwide" = 1,  "Defend the oppressed and marginalized" = 2, 
"Rescue the environment" = 3, "Heal the sick" = 4, "Shield the animals" = 5))
summary_table(data$match_quiz_q2_superhero_global, "Match Quiz Superhero Global", "match_quiz_q2_superhero_global", "Frequency")
Match Quiz Superhero Global
match_quiz_q2_superhero_global Frequency
1 1070
2 788
3 552
4 399
5 365
NA 44491
data$US_Help_eduction <- ifelse(data$US_Help_eduction %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_eduction), NA)
data$match_quiz_q3_education_US <- labelled(data$US_Help_eduction, labels = c("help all children learn to read" = 1, "send a low income student to college" = 2,
"support girls in STEM (Science, Tech, Engineering, & Medicide)" = 3, "help teachers fund classroom projects" = 4, "provide tutoring and support to underserved children" = 5))
summary_table(data$match_quiz_q3_education_US, "Match Quiz Education US", "match_quiz_q3_education_US", "Frequency")
Match Quiz Education US
match_quiz_q3_education_US Frequency
1 667
2 149
3 142
4 156
5 422
NA 46129
data$US_Help_HungerHomeless <- ifelse(data$US_Help_HungerHomeless %in% c("1", "2", "3"), as.integer(data$US_Help_HungerHomeless), NA)
data$match_quiz_q3_hunger_homelessness_US <- labelled(data$US_Help_HungerHomeless, labels = c("a roof over their head" = 1, "hot food in their belly" = 2, 
"employment training and opportunities" = 3))
summary_table(data$match_quiz_q3_hunger_homelessness_US, "Match Quiz Hunger and Homelessness US", "match_quiz_q3_hunger_homelessness_US", "Frequency")
Match Quiz Hunger and Homelessness US
match_quiz_q3_hunger_homelessness_US Frequency
1 1941
2 1872
3 1777
NA 42075
data$US_Help_oppressed <- ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_oppressed), NA)
## Warning in ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), : NAs
## introduced by coercion
data$match_quiz_q3_marginalized_US <- labelled(data$US_Help_oppressed, labels = c("immigrants and refugees fleeing violence" = 1, "LGBTQ+ communities" = 2,
"racial or ethnic minorities" = 3, "women" = 4, "people with disabilities" = 5))
summary_table(data$match_quiz_q3_marginalized_US, "Match Quiz Marginalized US", "match_quiz_q3_marginalized_US", "Frequency")
Match Quiz Marginalized US
match_quiz_q3_marginalized_US Frequency
1 353
2 300
3 334
4 278
5 374
NA 46026
data$US_Help_environment <- ifelse(data$US_Help_environment %in% c("1", "2", "3", "4"), as.integer(data$US_Help_environment), NA)
data$match_quiz_q3_environment_US <- labelled(data$US_Help_environment, labels = c("innovations that further clean energy technology" = 1, "massive ocean clean up" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new polivies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_US, "Match Quiz Environment US", "match_quiz_q3_environment_US", "Frequency")
Match Quiz Environment US
match_quiz_q3_environment_US Frequency
1 266
2 194
3 451
4 228
NA 46526
data$US_help_sick <- ifelse(data$US_help_sick %in% c("1", "2", "3", "4"), as.integer(data$US_help_sick), NA)
data$match_quiz_q3_sick_US <- labelled(data$US_help_sick, labels = c("people suffering from preventable diseases because of inadequate healthcare" = 1,
"rare diseases that need more research" = 2, "cancer: the Big C" = 3, "people with heart disease" = 4))
summary_table(data$match_quiz_q3_sick_US, "Match Quiz Sick US", "match_quiz_q3_sick_US", "Frequency")
Match Quiz Sick US
match_quiz_q3_sick_US Frequency
1 623
2 347
3 1064
4 249
NA 45382
data$US_Help_animals <- ifelse(data$US_Help_animals %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_animals), NA)
data$match_quiz_q3_animals_US <- labelled(data$US_Help_animals, labels = c("dogs & cats humans' best friends" = 1, 
"all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 2, "any species on the brink of extinction" = 3, "animals in factory farms" = 4, 
"whales and ocean animals (Baby Beluga! Nemo!!)" = 5))
summary_table(data$match_quiz_q3_animals_US, "Match Quiz Animals US", "match_quiz_q3_animals_US", "Frequency")
Match Quiz Animals US
match_quiz_q3_animals_US Frequency
1 1736
2 212
3 206
4 226
5 73
NA 45212
data$Global_help_poverty <- ifelse(data$Global_help_poverty %in% c("1", "2", "3"), as.integer(data$Global_help_poverty), NA)
data$match_quiz_q3_poverty_global <- labelled(data$Global_help_poverty, labels = c("the opportunity to go to school" = 1, "food in their belly" = 2,
"the skills to grow food sustainably" = 3))
summary_table(data$match_quiz_q3_poverty_global, "Match Quiz Poverty Global", "match_quiz_q3_poverty_global", "Frequency")
Match Quiz Poverty Global
match_quiz_q3_poverty_global Frequency
1 259
2 259
3 312
NA 46835
data$Global_help_oppressed <- ifelse(data$Global_help_oppressed %in% c("1", "2", "3", "4"), as.integer(data$Global_help_oppressed), NA)
data$match_quiz_q3_oppressed_global <- labelled(data$Global_help_oppressed, labels = c("refugees fleeing violence or other disasters" = 1,
"people that are victims of human trafficking" = 2, "women" = 3, "groups targeted by their government" = 4))
summary_table(data$match_quiz_q3_oppressed_global, "Match Quiz Oppressed Global", "match_quiz_q3_oppressed_global", "Frequency")
Match Quiz Oppressed Global
match_quiz_q3_oppressed_global Frequency
1 387
2 113
3 100
4 162
NA 46903
data$Global_help_enivronment <- ifelse(data$Global_help_enivronment %in% c("1", "2", "3", "4"), as.integer(data$Global_help_enivronment), NA)
data$match_quiz_q3_environment_global <- labelled(data$Global_help_enivronment, labels = c("innovatiuos that further clean energy technology" = 1, "massive pollution cleanup" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new policies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_global, "Match Quiz Environment Global", "match_quiz_q3_environment_global", "Frequency")
Match Quiz Environment Global
match_quiz_q3_environment_global Frequency
1 123
2 112
3 202
4 113
NA 47115
data$Global_help_sick <- ifelse(data$Global_help_sick %in% c("1", "2", "3", "4"), as.integer(data$Global_help_sick), NA)
data$match_quiz_q3_sick_global <- labelled(data$Global_help_sick, labels = c("protecting against preventable childhood diseases" = 1, 
"treating the big three: HIV, AIDS, Tuburculosis and Malaria" = 2, "eradicating polio" = 3, "preventing and curing blindness" = 4))
summary_table(data$match_quiz_q3_sick_global, "Match Quiz Sick Global", "match_quiz_q3_sick_global", "Frequency")
Match Quiz Sick Global
match_quiz_q3_sick_global Frequency
1 204
2 107
3 12
4 62
NA 47280
data$Global_help_animals <- ifelse(data$Global_help_animals %in% c("1", "2", "3"), as.integer(data$Global_help_animals), NA)
data$match_quiz_q3_animals_global <- labelled(data$Global_help_animals, labels = c("all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 1,
"any species on the brink of extinction" = 2, "whales and ocean animals (Baby Beluga! Nemo!!)" = 3))
summary_table(data$match_quiz_q3_animals_global, "Match Quiz Animals Global", "match_quiz_q3_animals_global", "Frequency")
Match Quiz Animals Global
match_quiz_q3_animals_global Frequency
1 147
2 184
3 25
NA 47309

Matchmaking Sanity Checks

We want to check that each combination of answers to the matchmaking quiz leads to the same charity match. The code below displays the combinations of answers that do not always lead to the same charity_match_index. The column unique_charity_match_index is the number of unique charity_match_index values for each combination of answers, and the column observation_count is the number of observations for each combination of answers.

filtered_data <- subset(data, !is.na(charity_match_index))
grouped_data <- filtered_data |> group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global, 
match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US, 
match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global, 
match_quiz_q3_environment_global, match_quiz_q3_animals_global) |>
summarize(unique_charity_match_index = n_distinct(charity_match_index), observation_count = n())
## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.
inconsistent_groups <- grouped_data |>
  filter(unique_charity_match_index > 1)
kable(inconsistent_groups, caption = "Inconsistent Groups", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Inconsistent Groups
match_quiz_q1_impact_location match_quiz_q2_superhero_US match_quiz_q2_superhero_global match_quiz_q3_education_US match_quiz_q3_hunger_homelessness_US match_quiz_q3_marginalized_US match_quiz_q3_environment_US match_quiz_q3_sick_US match_quiz_q3_animals_US match_quiz_q3_poverty_global match_quiz_q3_oppressed_global match_quiz_q3_sick_global match_quiz_q3_environment_global match_quiz_q3_animals_global unique_charity_match_index observation_count
NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 2

We now write a CSV file for each unique combination of answers to the matchmaking quiz that only contains the answers with exactly one unique match. We also add a column for the unique charity_match_index.

filtered_data <- subset(data, !is.na(charity_match_index))

grouped_data <- filtered_data %>%
  group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global, 
           match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US, 
           match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global, 
           match_quiz_q3_environment_global, match_quiz_q3_animals_global) %>%
  summarize(unique_charity_match_index = n_distinct(charity_match_index), 
            observation_count = n(),
            charity_match_index = first(charity_match_index))
## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.
consistent_groups <- grouped_data %>%
  filter(unique_charity_match_index == 1)

# write.csv(consistent_groups, "Data/Processed/Consistent_Groups.csv")

We see that in all but one inconsistent combination, participants have not finished the quiz.

Values Quiz Variables

We will go through a similar process as with the Match Quiz Variables, creating labelled versions of each.

data$important_smart <- ifelse(data$important_smart %in% c("1", "2", "3"), as.integer(data$important_smart), NA)
data$values_quiz_q1_smart <- labelled(data$important_smart, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q1_smart, "Summary of Values Quiz Q1: Smart", "values_quiz_q1_smart", "Frequency")
Summary of Values Quiz Q1: Smart
values_quiz_q1_smart Frequency
1 908
2 2439
3 13656
NA 30662
data$important_responsive <- ifelse(data$important_responsive %in% c("1", "2", "3"), as.integer(data$important_responsive), NA)
data$values_quiz_q2_responsive <- labelled(data$important_responsive, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q2_responsive, "Summary of Values Quiz Q2: Responsive", "values_quiz_q2_responsive", "Frequency")
Summary of Values Quiz Q2: Responsive
values_quiz_q2_responsive Frequency
1 982
2 4672
3 11162
NA 30849
data$important_forwardlooking <- ifelse(data$important_forwardlooking %in% c("1", "2", "3"), as.integer(data$important_forwardlooking), NA)
data$values_quiz_q3_forwardlooking <- labelled(data$important_forwardlooking, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q3_forwardlooking, "Summary of Values Quiz Q3: Forward Looking", "values_quiz_q3_forwardlooking", "Frequency")
Summary of Values Quiz Q3: Forward Looking
values_quiz_q3_forwardlooking Frequency
1 546
2 4260
3 11830
NA 31029

We will also check the values of the donor type.

summary_table(data$donor_type, "Donor Type", "donor_type", "Frequency")
Donor Type
donor_type Frequency
forward_looking 5349
responsive 4797
smart 6387
unsure 99
NA 31033

There do not appear to be any issues with this variable.

Treatment Variables

Treatment Group

First we’ll look at the treatment_group variable, which appears to have no issues.

summary_table(data$treatment_group, "Treatment Group", "treatment_group", "Frequency")
Treatment Group
treatment_group Frequency
control 2265
obligation 4543
opportunity 4420
NA 36437

We now want to check that those who have a treatment value of NA also have NA values for the following variables. `

  • treatment_RF
  • treatment_FR
  • donate_today
  • stay_connected
data$treatment_NA_check <- ifelse(is.na(data$treatment_group) & (!is.na(data$treatment_RF) | !is.na(data$treatment_FR) | !is.na(data$donate_today) | !is.na(data$stay_connected)), TRUE, FALSE)
summary_table(data$treatment_NA_check, "Summary of Treatment NA Check", "treatment_NA_check", "Frequency")
Summary of Treatment NA Check
treatment_NA_check Frequency
FALSE 47662
TRUE 3

There are 3 cases in which the treatment group is NA but at least one of the other variables are not. This could be because people are restarting the quiz, then not passing the consent stage the second time.

Free Response Variables

We’ll now look at the free responses within the treatment variable group. Namely, treatment_RF and treatment_FR. For each, we’ll look at the number of NAs along with the average length of response when it is not NA. The treatment_RF variable is the free response when asked, “imagine talking to (text1_opportunity), image what they might say to you. How would this make you feel?” The question leading to the treatment_FR variable is posed the same way, exept the text1_opportunity is replaced with text1_obligation.

data$treatment_RF <- as.character(data$treatment_RF)
data$treatment_FR <- as.character(data$treatment_FR)
treatment_RF_na <- sum(is.na(data$treatment_RF))
treatment_FR_na <- sum(is.na(data$treatment_FR))
treatment_RF_avg_length <- mean(nchar(data$treatment_RF[!is.na(data$treatment_RF)]))
treatment_FR_avg_length <- mean(nchar(data$treatment_FR[!is.na(data$treatment_FR)]))
treatment_na <- data.frame('Variable' = c("treatment_RF", "treatment_FR"), 'NAs' = c(treatment_RF_na, treatment_FR_na), 'Average Length' = c(treatment_RF_avg_length, treatment_FR_avg_length))
kable(treatment_na, caption = "Summary of NAs and Average Length of Treatment Variables", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Summary of NAs and Average Length of Treatment Variables
Variable NAs Average.Length
treatment_RF 44451 53.87243
treatment_FR 44498 50.88254

Manipulation Order

There appear to be no issues with the manipulation order variable.

summary_table(data$manipulation_order, "Manipulation Order", "manipulation_order", "Frequency")
Manipulation Order
manipulation_order Frequency
1 4262
2 4383
NA 39020

Manipulation Value Question

We will create labelled versions of the manipulation value question as done with the mathmaking and values quizzes.

data$manipulation_q1_value <- ifelse(data$manipulation_question_value %in% c("1", "2", "3", "4", "5"), as.integer(data$manipulation_question_value), NA)
## Warning in ifelse(data$manipulation_question_value %in% c("1", "2", "3", : NAs
## introduced by coercion
data$manipulation_q1_value <- labelled(data$manipulation_q1_value, labels = c("Not at all important" = 1, "Not too important" = 2, "Somewhat important" = 3, 
"Quite important" = 4, "Very important" = 5))
summary_table(data$manipulation_q1_value, "Manipulation Question 1: Value", "manipulation_q1_value", "Frequency")
Manipulation Question 1: Value
manipulation_q1_value Frequency
1 126
2 28
3 221
4 1020
5 5337
NA 40933

Charity Match Variables

charity_match_index

We will begin by looking at all values and frequencies of the charity_match_index variable.

summary_table(data$charity_match_index, "Charity Match Index", "charity_match_index", "Frequency")
Charity Match Index
charity_match_index Frequency
global_animals_1 144
global_animals_2 179
global_animals_3 22
global_educ_1 251
global_educ_2 250
global_educ_3 298
global_educ_4 229
global_envir_1 122
global_envir_2 108
global_envir_3 194
global_envir_4 111
global_oppressed_1 377
global_oppressed_2 111
global_oppressed_3 99
global_oppressed_4 157
global_sick_1 197
global_sick_2 98
global_sick_3 12
global_sick_4 54
us_animals_#1 1
us_animals_1 1708
us_animals_2 205
us_animals_3 201
us_animals_4 224
us_animals_5 68
us_educ_1 638
us_educ_2 139
us_educ_3 139
us_educ_4 155
us_educ_5 406
us_envir_1 254
us_envir_2 190
us_envir_3 442
us_envir_4 220
us_hunger_,,3 1
us_hunger_#1 1
us_hunger_1 1911
us_hunger_2 1830
us_hunger_3 1734
us_oppressed_1 342
us_oppressed_2 295
us_oppressed_3 328
us_oppressed_4 273
us_oppressed_5 357
us_sick_1 601
us_sick_2 340
us_sick_3 1030
us_sick_4 238
NA 30381

There are several values that seem to be incorrect. For example, there is a value with “us_hunger_,,3,” “us_hunger_#1,” and “us_animals_#1,” all of which only have one observation and don’t fit the name patterns of the other values.

charity_name

We will begin by looking at all values and frequencies of the charity_name variable.

summary_table(data$charity_name, "Charity Name", "charity_name", "Frequency")
Charity Name
charity_name Frequency
American Heart Association 237
American Society for the Prevention of Cruelty to Animals 1701
Amnesty International 156
BRAC 296
Clean Air Task Force 328
Concern Worldwide 98
Dana-Farber Cancer Institute 1029
DonorsChoose 155
Evidence Action Inc.  195
Feeding America 1821
Freedom Network 111
Girls Inc.  139
Global Fund for Women 98
HealthWell Foundation 598
HIAS 717
Housing Matters 1905
Mercy For Animals 224
National Women’s Law Center 270
Natural Resources Defense Council 374
Rainforest Trust 634
Reading is Fundamental 635
Rocketship Education 403
Room to Read 248
Sightsavers Inc.  54
Southern Poverty Law Center 326
Special Olympics 354
Surfrider Foundation 295
The Leukemia & Lymphoma Society 339
The Ocean Foundation 90
The Rotary Foundation 12
The Trevor Project 294
Thurgood Marshall College Fund 139
WaterAid 228
Wild Earth Allies 349
Wildlife SOS 379
World Food Program 247
Year Up 1728
NA 30459

Here there are no clear instances of incorrect values. There are 78 more NA values for charity_name than for charity_match_index. This is because there was a periods of time where the corresponding app wasn’t serving data over to chatfuel. We will create a flag for when charity_name is NA and charity_match_index is not NA.

data$charity_name_NA <- ifelse(is.na(data$charity_name) & !is.na(data$charity_match_index), TRUE, FALSE)
summary_table(data$charity_name_NA, "Summary of Charity Name NA", "charity_name_NA", "Frequency")
Summary of Charity Name NA
charity_name_NA Frequency
FALSE 47587
TRUE 78

We see that there are exactly 78 of such instances.

We want to check that each charity_match_index corresponds to exactly one charity_name. When doing so, we will ignore the 78 cases where charity_name is NA and charity_match_index is not NA.

data_filtered <- subset(data, charity_name_NA == FALSE)
result <- data_filtered |>
  group_by(charity_match_index) |>
  summarise(distinct_charity_names = n_distinct(charity_name))

kable(result, caption = "Number of Charity Names Associated With Each Charity Match Index", align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Number of Charity Names Associated With Each Charity Match Index
charity_match_index distinct_charity_names
global_animals_1 1
global_animals_2 1
global_animals_3 1
global_educ_1 1
global_educ_2 1
global_educ_3 2
global_educ_4 1
global_envir_1 1
global_envir_2 1
global_envir_3 1
global_envir_4 1
global_oppressed_1 1
global_oppressed_2 1
global_oppressed_3 1
global_oppressed_4 1
global_sick_1 1
global_sick_2 1
global_sick_3 1
global_sick_4 1
us_animals_1 1
us_animals_2 1
us_animals_3 1
us_animals_4 1
us_animals_5 1
us_educ_1 1
us_educ_2 1
us_educ_3 1
us_educ_4 1
us_educ_5 1
us_envir_1 1
us_envir_2 1
us_envir_3 1
us_envir_4 1
us_hunger_1 1
us_hunger_2 2
us_hunger_3 1
us_oppressed_1 1
us_oppressed_2 1
us_oppressed_3 2
us_oppressed_4 1
us_oppressed_5 1
us_sick_1 1
us_sick_2 1
us_sick_3 2
us_sick_4 1
NA 1

We see that there are four charity_match_indexes that have more than one distinct charity_name, namely global_educ_3, us_hunger_2, us_oppressed_3, and us_sick_3.

data_global_educ_3 <- subset(data_filtered, charity_match_index == "global_educ_3")
data_us_hunger_2 <- subset(data_filtered, charity_match_index == "us_hunger_2")
data_us_oppressed_3 <- subset(data_filtered, charity_match_index == "us_oppressed_3")
data_us_sick_3 <- subset(data_filtered, charity_match_index == "us_sick_3")
summary_table(data_global_educ_3$charity_name, "Charity Name for Global Education 3", "charity_name_global_educ_3", "Frequency")
Charity Name for Global Education 3
charity_name_global_educ_3 Frequency
BRAC 296
HIAS 1
summary_table(data_us_hunger_2$charity_name, "Charity Name for US Hunger 2", "charity_name_us_hunger_2", "Frequency")
Charity Name for US Hunger 2
charity_name_us_hunger_2 Frequency
Dana-Farber Cancer Institute 1
Feeding America 1821
summary_table(data_us_oppressed_3$charity_name, "Charity Name for US Oppressed 3", "charity_name_us_oppressed_3", "Frequency")
Charity Name for US Oppressed 3
charity_name_us_oppressed_3 Frequency
Southern Poverty Law Center 326
The Trevor Project 1
summary_table(data_us_sick_3$charity_name, "Charity Name for US Sick 3", "charity_name_us_sick_3", "Frequency")
Charity Name for US Sick 3
charity_name_us_sick_3 Frequency
Dana-Farber Cancer Institute 1028
Housing Matters 1

For each one, there is only one case of an incorrect value. ## Follow Up Variables We want to create boolean versions of the follow up variables. ### follow_up_donated We’ll look at the frequencies of the most common values of the follow_up_donated variable.

most_common_table(data$follow_up_donated, "Top 5 Most Common Responses for Follow Up Donated")
Top 5 Most Common Responses for Follow Up Donated
Response Frequency
NA 47260
Not yet 195
Yes, donated! 166
No 8
Stop 2
Other 34

We will only accept “Yes, donated!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_donated_NA <- ifelse(is.na(data$follow_up_donated), NA, ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated_NA, "Summary of Boolean Follow Up Donated Variable (With NA)", "boolean_follow_up_donated_NA", "Frequency")
Summary of Boolean Follow Up Donated Variable (With NA)
boolean_follow_up_donated_NA Frequency
FALSE 239
TRUE 166
NA 47260
data$same_day_donated_noNA <- ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated_noNA, "Summary of Boolean Follow Up Donated Variable (With NA = FALSE)", "boolean_follow_up_donated_noNA", "Frequency")
Summary of Boolean Follow Up Donated Variable (With NA = FALSE)
boolean_follow_up_donated_noNA Frequency
FALSE 47499
TRUE 166

good_match

We’ll look at the frequencies of the most common values of the good_match variable.

most_common_table(data$good_match, "Top 5 Most Common Responses for Good Match")
Top 5 Most Common Responses for Good Match
Response Frequency
NA 45611
Not really… 934
Yes! 736
Stop 77
Yes 41
Other 266

We will only accept “Yes!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_good_match_NA <- ifelse(is.na(data$good_match), NA, ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_good_match_NA, "Summary of Boolean Good Match Variable (With NA)", "boolean_good_match_NA", "Frequency")
Summary of Boolean Good Match Variable (With NA)
boolean_good_match_NA Frequency
FALSE 1318
TRUE 736
NA 45611
data$same_day_good_match_noNA <- ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_good_match_noNA, "Summary of Boolean Good Match Variable (With NA = FALSE)", "boolean_good_match_noNA", "Frequency")
Summary of Boolean Good Match Variable (With NA = FALSE)
boolean_good_match_noNA Frequency
FALSE 46929
TRUE 736

follow_up_donated_1

We’ll look at the frequencies of the most common values of the follow_up_donated_1 variable.

most_common_table(data$followup_donated_1, "Top 5 Most Common Responses for Follow Up Donated 1")
Top 5 Most Common Responses for Follow Up Donated 1
Response Frequency
NA 46998
No, not yet. 555
Yes, donated. 88
Stop 4
no, not yet. 2
Other 18

We will only accept “Yes, doanted.” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_donated1_NA <- ifelse(is.na(data$followup_donated_1), NA, ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated1_NA, "Summary of Boolean Follow Up Donated 1 Variable (With NA)", "boolean_follow_up_donated_1_NA", "Frequency")
Summary of Boolean Follow Up Donated 1 Variable (With NA)
boolean_follow_up_donated_1_NA Frequency
FALSE 579
TRUE 88
NA 46998
data$same_day_donated1_noNA <- ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated1_noNA, "Summary of Boolean Follow Up Donated 1 Variable (With NA = FALSE)", "boolean_follow_up_donated_1_noNA", "Frequency")
Summary of Boolean Follow Up Donated 1 Variable (With NA = FALSE)
boolean_follow_up_donated_1_noNA Frequency
FALSE 47577
TRUE 88

not_good_match_learn_more

We’ll look at the frequencies of the most common values of the not_good_match_learn_more variable.

most_common_table(data$not_good_match_learn_more, "Top 5 Most Common Responses for Not Good Match Learn More")
Top 5 Most Common Responses for Not Good Match Learn More
Response Frequency
NA 46860
Nope 685
Sure 96
Stop 5
STOP 4
Other 15

We will only accept “Sure” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_not_good_match_learn_more_NA <- ifelse(is.na(data$not_good_match_learn_more), NA, ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_not_good_match_learn_more_NA, "Summary of Boolean Not Good Match Learn More Variable (With NA)", "boolean_not_good_match_learn_more_NA", "Frequency")
Summary of Boolean Not Good Match Learn More Variable (With NA)
boolean_not_good_match_learn_more_NA Frequency
FALSE 709
TRUE 96
NA 46860
data$same_day_not_good_match_learn_more_noNA <- ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_not_good_match_learn_more_noNA, "Summary of Boolean Not Good Match Learn More Variable (With NA = FALSE)", "boolean_not_good_match_learn_more_noNA", "Frequency")
Summary of Boolean Not Good Match Learn More Variable (With NA = FALSE)
boolean_not_good_match_learn_more_noNA Frequency
FALSE 47569
TRUE 96

recurring

We’ll look at the frequencies of the most common values of the recurring variable.

most_common_table(data$recurring, "Top 5 Most Common Responses for Recurring")
Top 5 Most Common Responses for Recurring
Response Frequency
NA 47457
Nope 158
Yes! 45
If this is Charity Navigator, you should have that information 1
No 1
Other 3

We will only accept “Yes!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_recurring_NA <- ifelse(is.na(data$recurring), NA, ifelse(grepl("Yes!", data$recurring, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_recurring_NA, "Summary of Boolean Recurring Variable (With NA)", "boolean_recurring_NA", "Frequency")
Summary of Boolean Recurring Variable (With NA)
boolean_recurring_NA Frequency
FALSE 163
TRUE 45
NA 47457
data$same_day_recurring_noNA <- ifelse(grepl("Yes!", data$recurring, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_recurring_noNA, "Summary of Boolean Recurring Variable (With NA = FALSE)", "boolean_recurring_noNA", "Frequency")
Summary of Boolean Recurring Variable (With NA = FALSE)
boolean_recurring_noNA Frequency
FALSE 47620
TRUE 45