FB:Charitable: Cleaning of Survey Data

Helper Functions

We’ll begin by defining a few helper functions which will be used throughout the script to create tables and summaries of variables.

Most Common Table

This first function creates a table of the five most common values of a variable and includes a row for the sum of all other responses (other).

most_common_table <- function(variable, title){
    table <- table(variable, useNA = 'ifany')
    table <- sort(table, decreasing = TRUE)
    k = 5
    top_k <- head(table, 5)
    other <- sum(table[-(1:k)])
    freq_df <- data.frame(Response = names(top_k), Frequency = as.vector(top_k))
    other_df <- data.frame(Response = "Other", Frequency = other)
    result <- rbind(freq_df, other_df)
    kable(result,  caption = title, align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
}

Summary Table

This function creates a clean table to look at the frequencies of different values of a variable. Unlike the function above, it will list all values.

summary_table <- function(variable, title, var1, var2){
    bool_df <- as.data.frame(table(variable, useNA = 'ifany'))
    colnames(bool_df) <- c(var1, var2)
    kable(bool_df, caption = title, align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
}

Intro Variables

We will create binary flags for passing the greeting, pre-consent, and consent variables.

Greeting

First we will create a binary flag indicating if someone has passed the greeting. We’ll look at the most common values of the greeting variable. Participants were told “Hi! I’m Sam, a friendly chatbot, and I want to help you find your charity soulmate. With just 6 questions, I’ll find a highly-rated charity that inspires you!”. They were presented the option “Let’s do this!”.

most_common_table(data$greeting, "Top 5 Most Common Responses for Greeting")

Top 5 Most Common Responses for Greeting
Response	Frequency
Let’s do this!	30772
Yes!	8482
NA	4901
No	200
Stop	166
Other	3144

Currently, Chatfuel lists “Let’s do this!” as the only given response option to greeting, but we see that “Yes!” is also a common response. It could be that previously, chatfuel also (or only) listed “Yes!” as an option. To test this hypothesis, we will check the time periods in which “Let’s do this!” and “Yes!” reponses were recorded.

data$charitable_intro_start <- as.Date(data$charitable_intro_start)
yes_responses <- subset(data, greeting == "Yes!" & !is.na(charitable_intro_start))
lets_do_responses <- subset(data, greeting == "Let's do this!" & !is.na(charitable_intro_start))

yes_responses$charitable_intro_start <- as.POSIXct(yes_responses$charitable_intro_start)
time_range <- difftime(max(yes_responses$charitable_intro_start), min(yes_responses$charitable_intro_start))
binwidth <- 60 * 60 * 24
ggplot(yes_responses, aes(x = charitable_intro_start)) +
  geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
  labs(title = "Frequency of \'Yes!\'' Responses Over Time",
       x = "Time",
       y = "Frequency")

lets_do_responses$charitable_intro_start <- as.POSIXct(lets_do_responses$charitable_intro_start)
time_range <- difftime(max(lets_do_responses$charitable_intro_start), min(lets_do_responses$charitable_intro_start)) 
ggplot(lets_do_responses, aes(x = charitable_intro_start)) +
  geom_histogram(binwidth = binwidth, fill = "skyblue", color = "black") +
  labs(title = "Frequency of \'Let's do this!\'' Responses Over Time",
       x = "Time",
       y = "Frequency")

There is not a clear cutoff between when the two types of responses were recorded. Upon review, we found that chatfuel automatically added “Yes!” as a response option to the greeting.

We want the flag to be true if the response to the greeting is exactly “Let’s do this!” and false otherwise.

data$flag_greeting <- ifelse(grepl("Let's do this!", data$greeting, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_greeting, "Summary of flag_greeting", "flag_greeting", "Frequency")

Summary of flag_greeting
flag_greeting	Frequency
FALSE	16893
TRUE	30772

We want to check that those who answered “Yes!” to the greeting did not pass on to the other stages. We will check that the consent response is NA in these cases.

data_greeting_yes <- subset(data, greeting == "Yes!")
data_greeting_yes$NA_check <- ifelse(!is.na(data_greeting_yes$consent), TRUE, FALSE)
summary_table(data_greeting_yes$NA_check, "Summary of Greeting NA Check", "greeting_NA_check", "Frequency")

Summary of Greeting NA Check
greeting_NA_check	Frequency
FALSE	8203
TRUE	279

We see there are 279 people who responded “Yes!” to the greeting but still answered the consent question. We will add a flag for these people.

data$flag_greeting_yes_check <- ifelse(data$greeting == "Yes!" & !is.na(data$consent), TRUE, FALSE)
summary_table(data$flag_greeting_yes_check, "Summary of flag_greeting_NA_check", "flag_greeting_NA_check", "Frequency")

Summary of flag_greeting_NA_check
flag_greeting_NA_check	Frequency
FALSE	47385
TRUE	279
NA	1

Pre-Consent

We will now create a binary flag for the pre-consent variable. Below is the most common values for the pre-consent variable. Participants were told “But first! Because this chat is part of an academic study, we need your consent to participate.” They were presented with the option “What should I know?”.

most_common_table(data$pre_consent, "Top 5 Most Common Responses for Pre-Consent")

Top 5 Most Common Responses for Pre-Consent
Response	Frequency
What should I know?	26390
NA	17801
Ok	222
No	144
Yes	136
Other	2972

We will only mark the binary flag as true if they responded exactly “What should I know?”.

data$flag_pre_consent <- ifelse(grepl("What should I know?", data$pre_consent, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_pre_consent, "Summary of flag_pre_consent", "flag_pre_consent", "Frequency")

Summary of flag_pre_consent
flag_pre_consent	Frequency
FALSE	21267
TRUE	26398

Consent

We will perform the same process with the consent variable. Participants were asked “Okay! Are you over the age of 18 and ready to get started?! (Tap button below to continue)”. They were presented the options “I Consent, Start Now” and “No, let’s stop here”.

most_common_table(data$consent, "Top 5 Most Common Responses for Consent")

Top 5 Most Common Responses for Consent
Response	Frequency
NA	23659
I Consent, Start Now	19325
No, let’s stop here	4098
Yes	38
No	27
Other	518

We will only mark the consent flag as true if they responded exactly “I consent, start now”.

data$flag_consent <- ifelse(grepl("I Consent, Start Now", data$consent, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_consent, "Summary of flag_consent", "flag_consent", "Frequency")

Summary of flag_consent
flag_consent	Frequency
FALSE	28337
TRUE	19328

Consent Variable Sanity Check

We now want to check that if a participant did not pass the consent stage that future variables are NA. Specifically, we will check that the following variables are NA if the consent flag is false:

treatment_group
donate_today
stay_connected
charity_name
charity_match_index

We create a new flag, consent_NA_check, to keep track of cases where not all listed variables are NA.

data$consent_NA_check <- ifelse(data$flag_consent == FALSE & (!is.na(data$treatment_group) | !is.na(data$donate_today) | !is.na(data$stay_connected) | !is.na(data$charity_name) | !is.na(data$charity_match_index)), TRUE, FALSE)
summary_table(data$consent_NA_check, "Summary of Consent NA Check", "consent_NA_check", "Frequency")

Summary of Consent NA Check
consent_NA_check	Frequency
FALSE	47582
TRUE	83

There are 83 instances of participants not passing the consent stage but still having values for at least one of the listed variables. This could be because participants restarted the quiz then didn’t pass the consent stage the second time.

To test this hypothesis, at the end of the quiz, there is a question asking participants if they would like to take it again. This question was asked if the participants indicated they were not happy with their final charity match. They were asked “Do you want to try the charity match quiz again to get a different charity?” and were presented with the options “Sure!” and “No thanks.” We check the most common values for this variable below.

most_common_table(data$repeat_quiz, "Top 5 Most Common Responses for Repeat Quiz")

Top 5 Most Common Responses for Repeat Quiz
Response	Frequency
NA	47260
No thanks	370
Sure!	17
No	5
no	2
Other	11

We will create a flag, flag_repeat_quiz, that is true if the participant responded “Sure!” and false for all other values.

data$flag_repeat_quiz <- ifelse(grepl("Sure!", data$repeat_quiz, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$flag_repeat_quiz, "Summary of flag_repeat_quiz", "flag_repeat_quiz", "Frequency")

Summary of flag_repeat_quiz
flag_repeat_quiz	Frequency
FALSE	47648
TRUE	17

We now check how many participants have both the repeat quiz flag and the consent NA flag.

data$repeat_consent_NA_check <- ifelse(data$flag_repeat_quiz & data$consent_NA_check, TRUE, FALSE)
summary_table(data$repeat_consent_NA_check, "Summary of Repeat Consent NA Check", "repeat_consent_NA_check", "Frequency")

Summary of Repeat Consent NA Check
repeat_consent_NA_check	Frequency
FALSE	47665

None of the 83 observations are explained by the 17 people who indicated they would like to take the quiz again. Even so, there could be people who started the quiz again without reaching the repeat_quiz question stage.

Time variables

We will now check that all time variables make logical sense, meaning they are in the correct order.

Charitable Affirm Start

We’ll begin by examining charitable_affirm_start and charitable_affirm_end as an example, then run through all other time variables.

Charitable_affirm_start should always come before charitable_affirm_end. Below, we check that this is true, excluding NA cases for now. We create a flag, charitable_affirm_correct_order, for observations where the end time is before the start time

data$charitable_affirm_correct_order <- ifelse(data$charitable_affirm_start < data$charitable_affirm_end | is.na(data$charitable_affirm_start) | is.na(data$charitable_affirm_end), FALSE, TRUE)
summary_table(data$charitable_affirm_correct_order, "Summary of Charitable Affirm Correct Order", "charitable_affirm_correct_order", "Frequency")

Summary of Charitable Affirm Correct Order
charitable_affirm_correct_order	Frequency
FALSE	47633
TRUE	32

We see that there are 32 observations where the end time was before the start time. We can look at the first five of these observations to see if there is a pattern.

incorrect_order <- data[data$charitable_affirm_correct_order,]
first_5 <- head(incorrect_order, 5)
first_5 <- first_5[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(first_5, caption = "First 5 Observations with Charitable Affirm End Before Start", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

First 5 Observations with Charitable Affirm End Before Start
charitable_affirm_start	charitable_affirm_end
2023-12-03 12:35:11	2023-12-02 14:21:47
2023-12-02 21:50:44	2023-12-01 20:18:10
2023-12-03 20:35:44	2023-12-02 23:20:52
2023-12-07 02:07:12	2023-12-03 06:01:08
2023-12-09 02:12:32	2023-12-08 05:25:03

It appears that in these obeservations the end time is a day before the start time. These cases likely occur when someone takes the quiz twice and does not finish the second time. As in the introduction section, we will check if any of these inconsistencies are explained by participants indicating they would like to take the quiz again.

data$repeat_charitable_affirm_order_check <- ifelse(data$flag_repeat_quiz & data$charitable_affirm_correct_order, TRUE, FALSE)
summary_table(data$repeat_charitable_affirm_order_check, "Summary of Repeat Charitable Affirm Order Check", "repeat_charitable_affirm_order_check", "Frequency")

Summary of Repeat Charitable Affirm Order Check
repeat_charitable_affirm_order_check	Frequency
FALSE	47665

None of these inconsistencies are explained by the 17 people who indicated they would like to take the quiz again.

We will now check that for instances of there being an end time but no start time. We create a flag, charitable_affirm_no_start, for these cases.

data$charitable_affirm_no_start <- ifelse(is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end), TRUE, FALSE)
summary_table(data$charitable_affirm_no_start, "Summary of Charitable Affirm No Start", "charitable_affirm_no_start", "Frequency")

Summary of Charitable Affirm No Start
charitable_affirm_no_start	Frequency
FALSE	47662
TRUE	3

We see there are three observations of participants having an end time but no start time. We can look at these three observations.

no_start <- data[is.na(data$charitable_affirm_start) & !is.na(data$charitable_affirm_end),]
no_start <- no_start[, c("charitable_affirm_start", "charitable_affirm_end")]
kable(no_start, caption = "Observations with Charitable Affirm End Time but no Start Time", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Observations with Charitable Affirm End Time but no Start Time
charitable_affirm_start	charitable_affirm_end
NA	2023-12-22 19:57:45
NA	2023-12-27 15:00:07
NA	2023-12-23 10:12:53

Other Time Variables

We will now run through the same process for the following start time variables (and associated end variables):

charitable_affirm_start_follwup1
charitable_feedback_start
charitable_fllwup_start
charitable_intro_start
charitable_match_start
charitable_match_start_followup1
charitable_reveal_start
chartiable_reveal_start_followup1
chartiable_treatment_start

We will create a flag that is triggered if there is a logical issue with any of these time variables, then we produce a table showing how many logical inconsistencies are associated with each variable.

data$start_end_logical_issue <- FALSE
starts = c("charitable_affirm_start_followup1", "charitable_feedback_start", "charitable_fllwup_start", "charitable_intro_start", "charitable_match_start", "charitable_match_start_followup1", "charitable_reveal_start", "charitable_reveal_start_followup1", "charitable_treatment_start")
prev_total = 0
issues <- data.frame('Time Variable' = character(), 'Total Issues' = numeric())
for (start in starts){
    end = sub("start", "end", start)
    data$start_end_logical_issue <- ifelse(data[[start]] < data[[end]] | is.na(data[[start]]) | is.na(data[[end]]), data$start_end_logical_issue, TRUE)
    data$start_end_logical_issue <- ifelse(is.na(data[[start]]) & !is.na(data[[end]]), TRUE, data$start_end_logical_issue)
    issue <- data[data$start_end_logical_issue,]
    total = nrow(issue) - prev_total
    new_row <- data.frame('Time Variable' = start, 'Total Issues' = total)
    issues <- rbind(issues, new_row)
    prev_total = nrow(issue)
}
kable(issues, caption = "Total Logical Issues for Each Time Variable", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Total Logical Issues for Each Time Variable
Time.Variable	Total.Issues
charitable_affirm_start_followup1	0
charitable_feedback_start	45
charitable_fllwup_start	2
charitable_intro_start	2434
charitable_match_start	73
charitable_match_start_followup1	0
charitable_reveal_start	82
charitable_reveal_start_followup1	0
charitable_treatment_start	47

Charitable_intro_start has an abnormal amount of logical inconsistencies, so we’ll dive deeper.

data$charitable_intro_correct_order <- ifelse(data$charitable_intro_start < data$charitable_intro_end | is.na(data$charitable_intro_start) | is.na(data$charitable_intro_end), FALSE, TRUE)
summary_table(data$charitable_intro_correct_order, "Summary of Charitable Intro Correct Order", "charitable_intro_correct_order", "Frequency")

Summary of Charitable Intro Correct Order
charitable_intro_correct_order	Frequency
FALSE	47185
TRUE	480

There are 529 observations where the end time was before the start time. We will now look for observations where the start time is NA but there exists an end time.

data$charitable_intro_no_start <- ifelse(is.na(data$charitable_intro_start) & !is.na(data$charitable_intro_end), TRUE, FALSE)
summary_table(data$charitable_intro_no_start, "Summary of Charitable Intro No Start", "charitable_intro_no_start", "Frequency")

Summary of Charitable Intro No Start
charitable_intro_no_start	Frequency
FALSE	45709
TRUE	1956

There are 1956 observations where the start time is NA but there exists an end time.

General Time Variable flag

We want to create a flag that is true if there is a logical issue with any of the time variables.

data$time_logical_issue <- ifelse(data$charitable_match_start < data$charitable_match_end &
                                  data$charitable_match_end < data$charitable_affirm_start &
                                  data$charitable_affirm_start < data$charitable_affirm_end &
                                  data$charitable_affirm_end < data$charitable_reveal_start &
                                  data$charitable_reveal_start < data$charitable_reveal_end &
                                  data$charitable_reveal_end < data$charitable_treatment_start &
                                  data$charitable_treatment_start < data$charitable_treatment_end, TRUE, FALSE)
summary_table(data$time_logical_issue, "Summary of Time Logical Issue", "time_logical_issue", "Frequency")

Summary of Time Logical Issue
time_logical_issue	Frequency
FALSE	17285
NA	30380

Matchmaking Quiz Variables

Creating labelled versions of each variable

We’ll start with the US_or_Global variable, using the ‘labelled’ package to create a labelled version of the variable.

data$match_quiz_q1_impact_location <- labelled(data$US_or_Global, labels = c("US" = 1, "International" = 2))
summary_table(data$match_quiz_q1_impact_location, "Match Quiz Impact Location", "match_quiz_q1_impact_location", "Frequency")

Match Quiz Impact Location
match_quiz_q1_impact_location	Frequency
1	15125
2	3191
NA	29349

We’ll now create a labelled version of all variables in the match quiz section. We have to convert all raw variables to numeric as they are currently stored as strings. Note that the warning “NAs introduced by coercion” indicates that some non-numeric values are being converted to NA. Is this how we should handle it?

data$SuperheroUS <- ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), as.integer(data$SuperheroUS), NA)

## Warning in ifelse(data$SuperheroUS %in% c("1", "2", "3", "4", "5", "6"), : NAs
## introduced by coercion

data$match_quiz_q2_superhero_US <- labelled(data$SuperheroUS, labels = c("Transform education" = 1, "Eradicate hunger and homelessness" = 2, 
"Defend the oppressed and marginalized" = 3, "Rescue the environment" = 4, "Heal the sick" = 5, "Protect the animals" = 6))
summary_table(data$match_quiz_q2_superhero_US, "Match Quiz Superhero US", "match_quiz_q2_superhero_US", "Frequency")

Match Quiz Superhero US
match_quiz_q2_superhero_US	Frequency
1	1577
2	5570
3	1680
4	1153
5	2275
6	2478
NA	32932

data$Global_Option <- ifelse(data$Global_Option %in% c("1", "2", "3", "4", "5"), as.integer(data$Global_Option), NA)
data$match_quiz_q2_superhero_global <- labelled(data$Global_Option, labels = c("Eradicate poverty worldwide" = 1,  "Defend the oppressed and marginalized" = 2, 
"Rescue the environment" = 3, "Heal the sick" = 4, "Shield the animals" = 5))
summary_table(data$match_quiz_q2_superhero_global, "Match Quiz Superhero Global", "match_quiz_q2_superhero_global", "Frequency")

Match Quiz Superhero Global
match_quiz_q2_superhero_global	Frequency
1	1070
2	788
3	552
4	399
5	365
NA	44491

data$US_Help_eduction <- ifelse(data$US_Help_eduction %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_eduction), NA)
data$match_quiz_q3_education_US <- labelled(data$US_Help_eduction, labels = c("help all children learn to read" = 1, "send a low income student to college" = 2,
"support girls in STEM (Science, Tech, Engineering, & Medicide)" = 3, "help teachers fund classroom projects" = 4, "provide tutoring and support to underserved children" = 5))
summary_table(data$match_quiz_q3_education_US, "Match Quiz Education US", "match_quiz_q3_education_US", "Frequency")

Match Quiz Education US
match_quiz_q3_education_US	Frequency
1	667
2	149
3	142
4	156
5	422
NA	46129

data$US_Help_HungerHomeless <- ifelse(data$US_Help_HungerHomeless %in% c("1", "2", "3"), as.integer(data$US_Help_HungerHomeless), NA)
data$match_quiz_q3_hunger_homelessness_US <- labelled(data$US_Help_HungerHomeless, labels = c("a roof over their head" = 1, "hot food in their belly" = 2, 
"employment training and opportunities" = 3))
summary_table(data$match_quiz_q3_hunger_homelessness_US, "Match Quiz Hunger and Homelessness US", "match_quiz_q3_hunger_homelessness_US", "Frequency")

Match Quiz Hunger and Homelessness US
match_quiz_q3_hunger_homelessness_US	Frequency
1	1941
2	1872
3	1777
NA	42075

data$US_Help_oppressed <- ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_oppressed), NA)

## Warning in ifelse(data$US_Help_oppressed %in% c("1", "2", "3", "4", "5"), : NAs
## introduced by coercion

data$match_quiz_q3_marginalized_US <- labelled(data$US_Help_oppressed, labels = c("immigrants and refugees fleeing violence" = 1, "LGBTQ+ communities" = 2,
"racial or ethnic minorities" = 3, "women" = 4, "people with disabilities" = 5))
summary_table(data$match_quiz_q3_marginalized_US, "Match Quiz Marginalized US", "match_quiz_q3_marginalized_US", "Frequency")

Match Quiz Marginalized US
match_quiz_q3_marginalized_US	Frequency
1	353
2	300
3	334
4	278
5	374
NA	46026

data$US_Help_environment <- ifelse(data$US_Help_environment %in% c("1", "2", "3", "4"), as.integer(data$US_Help_environment), NA)
data$match_quiz_q3_environment_US <- labelled(data$US_Help_environment, labels = c("innovations that further clean energy technology" = 1, "massive ocean clean up" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new polivies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_US, "Match Quiz Environment US", "match_quiz_q3_environment_US", "Frequency")

Match Quiz Environment US
match_quiz_q3_environment_US	Frequency
1	266
2	194
3	451
4	228
NA	46526

data$US_help_sick <- ifelse(data$US_help_sick %in% c("1", "2", "3", "4"), as.integer(data$US_help_sick), NA)
data$match_quiz_q3_sick_US <- labelled(data$US_help_sick, labels = c("people suffering from preventable diseases because of inadequate healthcare" = 1,
"rare diseases that need more research" = 2, "cancer: the Big C" = 3, "people with heart disease" = 4))
summary_table(data$match_quiz_q3_sick_US, "Match Quiz Sick US", "match_quiz_q3_sick_US", "Frequency")

Match Quiz Sick US
match_quiz_q3_sick_US	Frequency
1	623
2	347
3	1064
4	249
NA	45382

data$US_Help_animals <- ifelse(data$US_Help_animals %in% c("1", "2", "3", "4", "5"), as.integer(data$US_Help_animals), NA)
data$match_quiz_q3_animals_US <- labelled(data$US_Help_animals, labels = c("dogs & cats humans' best friends" = 1, 
"all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 2, "any species on the brink of extinction" = 3, "animals in factory farms" = 4, 
"whales and ocean animals (Baby Beluga! Nemo!!)" = 5))
summary_table(data$match_quiz_q3_animals_US, "Match Quiz Animals US", "match_quiz_q3_animals_US", "Frequency")

Match Quiz Animals US
match_quiz_q3_animals_US	Frequency
1	1736
2	212
3	206
4	226
5	73
NA	45212

data$Global_help_poverty <- ifelse(data$Global_help_poverty %in% c("1", "2", "3"), as.integer(data$Global_help_poverty), NA)
data$match_quiz_q3_poverty_global <- labelled(data$Global_help_poverty, labels = c("the opportunity to go to school" = 1, "food in their belly" = 2,
"the skills to grow food sustainably" = 3))
summary_table(data$match_quiz_q3_poverty_global, "Match Quiz Poverty Global", "match_quiz_q3_poverty_global", "Frequency")

Match Quiz Poverty Global
match_quiz_q3_poverty_global	Frequency
1	259
2	259
3	312
NA	46835

data$Global_help_oppressed <- ifelse(data$Global_help_oppressed %in% c("1", "2", "3", "4"), as.integer(data$Global_help_oppressed), NA)
data$match_quiz_q3_oppressed_global <- labelled(data$Global_help_oppressed, labels = c("refugees fleeing violence or other disasters" = 1,
"people that are victims of human trafficking" = 2, "women" = 3, "groups targeted by their government" = 4))
summary_table(data$match_quiz_q3_oppressed_global, "Match Quiz Oppressed Global", "match_quiz_q3_oppressed_global", "Frequency")

Match Quiz Oppressed Global
match_quiz_q3_oppressed_global	Frequency
1	387
2	113
3	100
4	162
NA	46903

data$Global_help_enivronment <- ifelse(data$Global_help_enivronment %in% c("1", "2", "3", "4"), as.integer(data$Global_help_enivronment), NA)
data$match_quiz_q3_environment_global <- labelled(data$Global_help_enivronment, labels = c("innovatiuos that further clean energy technology" = 1, "massive pollution cleanup" = 2,
"conserving natural spaces and habitats like old rainforests and national parks" = 3, "new policies to regulate carbon emissions" = 4))
summary_table(data$match_quiz_q3_environment_global, "Match Quiz Environment Global", "match_quiz_q3_environment_global", "Frequency")

Match Quiz Environment Global
match_quiz_q3_environment_global	Frequency
1	123
2	112
3	202
4	113
NA	47115

data$Global_help_sick <- ifelse(data$Global_help_sick %in% c("1", "2", "3", "4"), as.integer(data$Global_help_sick), NA)
data$match_quiz_q3_sick_global <- labelled(data$Global_help_sick, labels = c("protecting against preventable childhood diseases" = 1, 
"treating the big three: HIV, AIDS, Tuburculosis and Malaria" = 2, "eradicating polio" = 3, "preventing and curing blindness" = 4))
summary_table(data$match_quiz_q3_sick_global, "Match Quiz Sick Global", "match_quiz_q3_sick_global", "Frequency")

Match Quiz Sick Global
match_quiz_q3_sick_global	Frequency
1	204
2	107
3	12
4	62
NA	47280

data$Global_help_animals <- ifelse(data$Global_help_animals %in% c("1", "2", "3"), as.integer(data$Global_help_animals), NA)
data$match_quiz_q3_animals_global <- labelled(data$Global_help_animals, labels = c("all the WILD furry animals like lions, tigers, and bears (oh and koalas!)" = 1,
"any species on the brink of extinction" = 2, "whales and ocean animals (Baby Beluga! Nemo!!)" = 3))
summary_table(data$match_quiz_q3_animals_global, "Match Quiz Animals Global", "match_quiz_q3_animals_global", "Frequency")

Match Quiz Animals Global
match_quiz_q3_animals_global	Frequency
1	147
2	184
3	25
NA	47309

Matchmaking Sanity Checks

We want to check that each combination of answers to the matchmaking quiz leads to the same charity match. The code below displays the combinations of answers that do not always lead to the same charity_match_index. The column unique_charity_match_index is the number of unique charity_match_index values for each combination of answers, and the column observation_count is the number of observations for each combination of answers.

filtered_data <- subset(data, !is.na(charity_match_index))
grouped_data <- filtered_data |> group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global, 
match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US, 
match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global, 
match_quiz_q3_environment_global, match_quiz_q3_animals_global) |>
summarize(unique_charity_match_index = n_distinct(charity_match_index), observation_count = n())

## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.

inconsistent_groups <- grouped_data |>
  filter(unique_charity_match_index > 1)
kable(inconsistent_groups, caption = "Inconsistent Groups", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Inconsistent Groups
match_quiz_q1_impact_location	match_quiz_q2_superhero_US	match_quiz_q2_superhero_global	match_quiz_q3_education_US	match_quiz_q3_hunger_homelessness_US	match_quiz_q3_marginalized_US	match_quiz_q3_environment_US	match_quiz_q3_sick_US	match_quiz_q3_animals_US	match_quiz_q3_poverty_global	match_quiz_q3_oppressed_global	match_quiz_q3_sick_global	match_quiz_q3_environment_global	match_quiz_q3_animals_global	unique_charity_match_index	observation_count
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	2

We now write a CSV file for each unique combination of answers to the matchmaking quiz that only contains the answers with exactly one unique match. We also add a column for the unique charity_match_index.

filtered_data <- subset(data, !is.na(charity_match_index))

grouped_data <- filtered_data %>%
  group_by(match_quiz_q1_impact_location, match_quiz_q2_superhero_US, match_quiz_q2_superhero_global, 
           match_quiz_q3_education_US, match_quiz_q3_hunger_homelessness_US, match_quiz_q3_marginalized_US, match_quiz_q3_environment_US, 
           match_quiz_q3_sick_US, match_quiz_q3_animals_US, match_quiz_q3_poverty_global, match_quiz_q3_oppressed_global, match_quiz_q3_sick_global, 
           match_quiz_q3_environment_global, match_quiz_q3_animals_global) %>%
  summarize(unique_charity_match_index = n_distinct(charity_match_index), 
            observation_count = n(),
            charity_match_index = first(charity_match_index))

## `summarise()` has grouped output by 'match_quiz_q1_impact_location',
## 'match_quiz_q2_superhero_US', 'match_quiz_q2_superhero_global',
## 'match_quiz_q3_education_US', 'match_quiz_q3_hunger_homelessness_US',
## 'match_quiz_q3_marginalized_US', 'match_quiz_q3_environment_US',
## 'match_quiz_q3_sick_US', 'match_quiz_q3_animals_US',
## 'match_quiz_q3_poverty_global', 'match_quiz_q3_oppressed_global',
## 'match_quiz_q3_sick_global', 'match_quiz_q3_environment_global'. You can
## override using the `.groups` argument.

consistent_groups <- grouped_data %>%
  filter(unique_charity_match_index == 1)

# write.csv(consistent_groups, "Data/Processed/Consistent_Groups.csv")

We see that in all but one inconsistent combination, participants have not finished the quiz.

Values Quiz Variables

We will go through a similar process as with the Match Quiz Variables, creating labelled versions of each.

data$important_smart <- ifelse(data$important_smart %in% c("1", "2", "3"), as.integer(data$important_smart), NA)
data$values_quiz_q1_smart <- labelled(data$important_smart, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q1_smart, "Summary of Values Quiz Q1: Smart", "values_quiz_q1_smart", "Frequency")

Summary of Values Quiz Q1: Smart
values_quiz_q1_smart	Frequency
1	908
2	2439
3	13656
NA	30662

data$important_responsive <- ifelse(data$important_responsive %in% c("1", "2", "3"), as.integer(data$important_responsive), NA)
data$values_quiz_q2_responsive <- labelled(data$important_responsive, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q2_responsive, "Summary of Values Quiz Q2: Responsive", "values_quiz_q2_responsive", "Frequency")

Summary of Values Quiz Q2: Responsive
values_quiz_q2_responsive	Frequency
1	982
2	4672
3	11162
NA	30849

data$important_forwardlooking <- ifelse(data$important_forwardlooking %in% c("1", "2", "3"), as.integer(data$important_forwardlooking), NA)
data$values_quiz_q3_forwardlooking <- labelled(data$important_forwardlooking, labels = c("Not at all" = 1, "Somewhat" = 2, "Very" = 3))
summary_table(data$values_quiz_q3_forwardlooking, "Summary of Values Quiz Q3: Forward Looking", "values_quiz_q3_forwardlooking", "Frequency")

Summary of Values Quiz Q3: Forward Looking
values_quiz_q3_forwardlooking	Frequency
1	546
2	4260
3	11830
NA	31029

We will also check the values of the donor type.

summary_table(data$donor_type, "Donor Type", "donor_type", "Frequency")

Donor Type
donor_type	Frequency
forward_looking	5349
responsive	4797
smart	6387
unsure	99
NA	31033

There do not appear to be any issues with this variable.

Treatment Variables

Treatment Group

First we’ll look at the treatment_group variable, which appears to have no issues.

summary_table(data$treatment_group, "Treatment Group", "treatment_group", "Frequency")

Treatment Group
treatment_group	Frequency
control	2265
obligation	4543
opportunity	4420
NA	36437

We now want to check that those who have a treatment value of NA also have NA values for the following variables. `

treatment_RF
treatment_FR
donate_today
stay_connected

data$treatment_NA_check <- ifelse(is.na(data$treatment_group) & (!is.na(data$treatment_RF) | !is.na(data$treatment_FR) | !is.na(data$donate_today) | !is.na(data$stay_connected)), TRUE, FALSE)
summary_table(data$treatment_NA_check, "Summary of Treatment NA Check", "treatment_NA_check", "Frequency")

Summary of Treatment NA Check
treatment_NA_check	Frequency
FALSE	47662
TRUE	3

There are 3 cases in which the treatment group is NA but at least one of the other variables are not. This could be because people are restarting the quiz, then not passing the consent stage the second time.

Free Response Variables

We’ll now look at the free responses within the treatment variable group. Namely, treatment_RF and treatment_FR. For each, we’ll look at the number of NAs along with the average length of response when it is not NA. The treatment_RF variable is the free response when asked, “imagine talking to (text1_opportunity), image what they might say to you. How would this make you feel?” The question leading to the treatment_FR variable is posed the same way, exept the text1_opportunity is replaced with text1_obligation.

data$treatment_RF <- as.character(data$treatment_RF)
data$treatment_FR <- as.character(data$treatment_FR)
treatment_RF_na <- sum(is.na(data$treatment_RF))
treatment_FR_na <- sum(is.na(data$treatment_FR))
treatment_RF_avg_length <- mean(nchar(data$treatment_RF[!is.na(data$treatment_RF)]))
treatment_FR_avg_length <- mean(nchar(data$treatment_FR[!is.na(data$treatment_FR)]))
treatment_na <- data.frame('Variable' = c("treatment_RF", "treatment_FR"), 'NAs' = c(treatment_RF_na, treatment_FR_na), 'Average Length' = c(treatment_RF_avg_length, treatment_FR_avg_length))
kable(treatment_na, caption = "Summary of NAs and Average Length of Treatment Variables", align = "c") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Summary of NAs and Average Length of Treatment Variables
Variable	NAs	Average.Length
treatment_RF	44451	53.87243
treatment_FR	44498	50.88254

Manipulation Order

There appear to be no issues with the manipulation order variable.

summary_table(data$manipulation_order, "Manipulation Order", "manipulation_order", "Frequency")

Manipulation Order
manipulation_order	Frequency
1	4262
2	4383
NA	39020

Manipulation Value Question

We will create labelled versions of the manipulation value question as done with the mathmaking and values quizzes.

data$manipulation_q1_value <- ifelse(data$manipulation_question_value %in% c("1", "2", "3", "4", "5"), as.integer(data$manipulation_question_value), NA)

## Warning in ifelse(data$manipulation_question_value %in% c("1", "2", "3", : NAs
## introduced by coercion

data$manipulation_q1_value <- labelled(data$manipulation_q1_value, labels = c("Not at all important" = 1, "Not too important" = 2, "Somewhat important" = 3, 
"Quite important" = 4, "Very important" = 5))
summary_table(data$manipulation_q1_value, "Manipulation Question 1: Value", "manipulation_q1_value", "Frequency")

Manipulation Question 1: Value
manipulation_q1_value	Frequency
1	126
2	28
3	221
4	1020
5	5337
NA	40933

Donate Variables

Donate Today

We’ll look at the frequencies of the most common values of the donate_today variable.

most_common_table(data$donate_today, "Top 5 Most Common Responses for Donate Today")

Top 5 Most Common Responses for Donate Today
Response	Frequency
NA	39770
No, not ready	6668
Yes, let’s do it!	1046
No	10
Stop	4
Other	167

We will only accept “Yes, let’s do it!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$boolean_donate_today_NA <- ifelse(is.na(data$donate_today), NA, ifelse(grepl("Yes, let's do it!", data$donate_today, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$boolean_donate_today_NA, "Summary of Boolean Donate Today Variable (With NA)", "boolean_donate_today", "Frequency")

Summary of Boolean Donate Today Variable (With NA)
boolean_donate_today	Frequency
FALSE	6849
TRUE	1046
NA	39770

data$boolean_donate_today_noNA <- ifelse(grepl("Yes, let's do it!", data$donate_today, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$boolean_donate_today_noNA, "Summary of Boolean Donate Today Variable (With NA = FALSE)", "boolean_donate_today_noNA", "Frequency")

Summary of Boolean Donate Today Variable (With NA = FALSE)
boolean_donate_today_noNA	Frequency
FALSE	46619
TRUE	1046

Donate Later

We look at the five most common values of the donate_later variable.

most_common_table(data$donate_later, "Top 5 Most Common Responses for Donate Later")

Top 5 Most Common Responses for Donate Later
Response	Frequency
NA	41824
Explore others	2238
Learn more	1939
Do it later	1205
Stop	57
Other	402

Stay Connected

We’ll look at the frequencies of the most common values of the stay_connected variable.

most_common_table(data$stay_connected, "Top 5 Most Common Responses for Stay Connected")

Top 5 Most Common Responses for Stay Connected
Response	Frequency
NA	41499
No thanks!	3813
Yes please!	2318
Yes	4
No	3
Other	28

We will only accept “Yes please!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$boolean_stay_connected_NA <- ifelse(is.na(data$stay_connected), NA, ifelse(grepl("Yes please!", data$stay_connected, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$boolean_stay_connected_NA, "Summary of Boolean Stay Connected Variable (With NA)", "boolean_stay_connected_NA", "Frequency")

Summary of Boolean Stay Connected Variable (With NA)
boolean_stay_connected_NA	Frequency
FALSE	3848
TRUE	2318
NA	41499

data$boolean_stay_connected_noNA <- ifelse(grepl("Yes please!", data$stay_connected, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$boolean_stay_connected_noNA, "Summary of Boolean Stay Connected Variable (With NA = FALSE)", "boolean_stay_connected_noNA", "Frequency")

Summary of Boolean Stay Connected Variable (With NA = FALSE)
boolean_stay_connected_noNA	Frequency
FALSE	45347
TRUE	2318

Charity Match Variables

charity_match_index

We will begin by looking at all values and frequencies of the charity_match_index variable.

summary_table(data$charity_match_index, "Charity Match Index", "charity_match_index", "Frequency")

Charity Match Index
charity_match_index	Frequency
global_animals_1	144
global_animals_2	179
global_animals_3	22
global_educ_1	251
global_educ_2	250
global_educ_3	298
global_educ_4	229
global_envir_1	122
global_envir_2	108
global_envir_3	194
global_envir_4	111
global_oppressed_1	377
global_oppressed_2	111
global_oppressed_3	99
global_oppressed_4	157
global_sick_1	197
global_sick_2	98
global_sick_3	12
global_sick_4	54
us_animals_#1	1
us_animals_1	1708
us_animals_2	205
us_animals_3	201
us_animals_4	224
us_animals_5	68
us_educ_1	638
us_educ_2	139
us_educ_3	139
us_educ_4	155
us_educ_5	406
us_envir_1	254
us_envir_2	190
us_envir_3	442
us_envir_4	220
us_hunger_,,3	1
us_hunger_#1	1
us_hunger_1	1911
us_hunger_2	1830
us_hunger_3	1734
us_oppressed_1	342
us_oppressed_2	295
us_oppressed_3	328
us_oppressed_4	273
us_oppressed_5	357
us_sick_1	601
us_sick_2	340
us_sick_3	1030
us_sick_4	238
NA	30381

There are several values that seem to be incorrect. For example, there is a value with “us_hunger_,,3,” “us_hunger_#1,” and “us_animals_#1,” all of which only have one observation and don’t fit the name patterns of the other values.

charity_name

We will begin by looking at all values and frequencies of the charity_name variable.

summary_table(data$charity_name, "Charity Name", "charity_name", "Frequency")

Charity Name
charity_name	Frequency
American Heart Association	237
American Society for the Prevention of Cruelty to Animals	1701
Amnesty International	156
BRAC	296
Clean Air Task Force	328
Concern Worldwide	98
Dana-Farber Cancer Institute	1029
DonorsChoose	155
Evidence Action Inc.	195
Feeding America	1821
Freedom Network	111
Girls Inc.	139
Global Fund for Women	98
HealthWell Foundation	598
HIAS	717
Housing Matters	1905
Mercy For Animals	224
National Women’s Law Center	270
Natural Resources Defense Council	374
Rainforest Trust	634
Reading is Fundamental	635
Rocketship Education	403
Room to Read	248
Sightsavers Inc.	54
Southern Poverty Law Center	326
Special Olympics	354
Surfrider Foundation	295
The Leukemia & Lymphoma Society	339
The Ocean Foundation	90
The Rotary Foundation	12
The Trevor Project	294
Thurgood Marshall College Fund	139
WaterAid	228
Wild Earth Allies	349
Wildlife SOS	379
World Food Program	247
Year Up	1728
NA	30459

Here there are no clear instances of incorrect values. There are 78 more NA values for charity_name than for charity_match_index. This is because there was a periods of time where the corresponding app wasn’t serving data over to chatfuel. We will create a flag for when charity_name is NA and charity_match_index is not NA.

data$charity_name_NA <- ifelse(is.na(data$charity_name) & !is.na(data$charity_match_index), TRUE, FALSE)
summary_table(data$charity_name_NA, "Summary of Charity Name NA", "charity_name_NA", "Frequency")

Summary of Charity Name NA
charity_name_NA	Frequency
FALSE	47587
TRUE	78

We see that there are exactly 78 of such instances.

We want to check that each charity_match_index corresponds to exactly one charity_name. When doing so, we will ignore the 78 cases where charity_name is NA and charity_match_index is not NA.

data_filtered <- subset(data, charity_name_NA == FALSE)
result <- data_filtered |>
  group_by(charity_match_index) |>
  summarise(distinct_charity_names = n_distinct(charity_name))

kable(result, caption = "Number of Charity Names Associated With Each Charity Match Index", align = "c") |>
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Number of Charity Names Associated With Each Charity Match Index
charity_match_index	distinct_charity_names
global_animals_1	1
global_animals_2	1
global_animals_3	1
global_educ_1	1
global_educ_2	1
global_educ_3	2
global_educ_4	1
global_envir_1	1
global_envir_2	1
global_envir_3	1
global_envir_4	1
global_oppressed_1	1
global_oppressed_2	1
global_oppressed_3	1
global_oppressed_4	1
global_sick_1	1
global_sick_2	1
global_sick_3	1
global_sick_4	1
us_animals_1	1
us_animals_2	1
us_animals_3	1
us_animals_4	1
us_animals_5	1
us_educ_1	1
us_educ_2	1
us_educ_3	1
us_educ_4	1
us_educ_5	1
us_envir_1	1
us_envir_2	1
us_envir_3	1
us_envir_4	1
us_hunger_1	1
us_hunger_2	2
us_hunger_3	1
us_oppressed_1	1
us_oppressed_2	1
us_oppressed_3	2
us_oppressed_4	1
us_oppressed_5	1
us_sick_1	1
us_sick_2	1
us_sick_3	2
us_sick_4	1
NA	1

We see that there are four charity_match_indexes that have more than one distinct charity_name, namely global_educ_3, us_hunger_2, us_oppressed_3, and us_sick_3.

data_global_educ_3 <- subset(data_filtered, charity_match_index == "global_educ_3")
data_us_hunger_2 <- subset(data_filtered, charity_match_index == "us_hunger_2")
data_us_oppressed_3 <- subset(data_filtered, charity_match_index == "us_oppressed_3")
data_us_sick_3 <- subset(data_filtered, charity_match_index == "us_sick_3")
summary_table(data_global_educ_3$charity_name, "Charity Name for Global Education 3", "charity_name_global_educ_3", "Frequency")

Charity Name for Global Education 3
charity_name_global_educ_3	Frequency
BRAC	296
HIAS	1

summary_table(data_us_hunger_2$charity_name, "Charity Name for US Hunger 2", "charity_name_us_hunger_2", "Frequency")

Charity Name for US Hunger 2
charity_name_us_hunger_2	Frequency
Dana-Farber Cancer Institute	1
Feeding America	1821

summary_table(data_us_oppressed_3$charity_name, "Charity Name for US Oppressed 3", "charity_name_us_oppressed_3", "Frequency")

Charity Name for US Oppressed 3
charity_name_us_oppressed_3	Frequency
Southern Poverty Law Center	326
The Trevor Project	1

summary_table(data_us_sick_3$charity_name, "Charity Name for US Sick 3", "charity_name_us_sick_3", "Frequency")

Charity Name for US Sick 3
charity_name_us_sick_3	Frequency
Dana-Farber Cancer Institute	1028
Housing Matters	1

For each one, there is only one case of an incorrect value. ## Follow Up Variables We want to create boolean versions of the follow up variables. ### follow_up_donated We’ll look at the frequencies of the most common values of the follow_up_donated variable.

most_common_table(data$follow_up_donated, "Top 5 Most Common Responses for Follow Up Donated")

Top 5 Most Common Responses for Follow Up Donated
Response	Frequency
NA	47260
Not yet	195
Yes, donated!	166
No	8
Stop	2
Other	34

We will only accept “Yes, donated!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_donated_NA <- ifelse(is.na(data$follow_up_donated), NA, ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated_NA, "Summary of Boolean Follow Up Donated Variable (With NA)", "boolean_follow_up_donated_NA", "Frequency")

Summary of Boolean Follow Up Donated Variable (With NA)
boolean_follow_up_donated_NA	Frequency
FALSE	239
TRUE	166
NA	47260

data$same_day_donated_noNA <- ifelse(grepl("Yes, donated!", data$follow_up_donated, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated_noNA, "Summary of Boolean Follow Up Donated Variable (With NA = FALSE)", "boolean_follow_up_donated_noNA", "Frequency")

Summary of Boolean Follow Up Donated Variable (With NA = FALSE)
boolean_follow_up_donated_noNA	Frequency
FALSE	47499
TRUE	166

good_match

We’ll look at the frequencies of the most common values of the good_match variable.

most_common_table(data$good_match, "Top 5 Most Common Responses for Good Match")

Top 5 Most Common Responses for Good Match
Response	Frequency
NA	45611
Not really…	934
Yes!	736
Stop	77
Yes	41
Other	266

We will only accept “Yes!” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_good_match_NA <- ifelse(is.na(data$good_match), NA, ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_good_match_NA, "Summary of Boolean Good Match Variable (With NA)", "boolean_good_match_NA", "Frequency")

Summary of Boolean Good Match Variable (With NA)
boolean_good_match_NA	Frequency
FALSE	1318
TRUE	736
NA	45611

data$same_day_good_match_noNA <- ifelse(grepl("Yes!", data$good_match, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_good_match_noNA, "Summary of Boolean Good Match Variable (With NA = FALSE)", "boolean_good_match_noNA", "Frequency")

Summary of Boolean Good Match Variable (With NA = FALSE)
boolean_good_match_noNA	Frequency
FALSE	46929
TRUE	736

follow_up_donated_1

We’ll look at the frequencies of the most common values of the follow_up_donated_1 variable.

most_common_table(data$followup_donated_1, "Top 5 Most Common Responses for Follow Up Donated 1")

Top 5 Most Common Responses for Follow Up Donated 1
Response	Frequency
NA	46998
No, not yet.	555
Yes, donated.	88
Stop	4
no, not yet.	2
Other	18

We will only accept “Yes, doanted.” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_donated1_NA <- ifelse(is.na(data$followup_donated_1), NA, ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_donated1_NA, "Summary of Boolean Follow Up Donated 1 Variable (With NA)", "boolean_follow_up_donated_1_NA", "Frequency")

Summary of Boolean Follow Up Donated 1 Variable (With NA)
boolean_follow_up_donated_1_NA	Frequency
FALSE	579
TRUE	88
NA	46998

data$same_day_donated1_noNA <- ifelse(grepl("Yes, donated.", data$followup_donated_1, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_donated1_noNA, "Summary of Boolean Follow Up Donated 1 Variable (With NA = FALSE)", "boolean_follow_up_donated_1_noNA", "Frequency")

Summary of Boolean Follow Up Donated 1 Variable (With NA = FALSE)
boolean_follow_up_donated_1_noNA	Frequency
FALSE	47577
TRUE	88

not_good_match_learn_more

We’ll look at the frequencies of the most common values of the not_good_match_learn_more variable.

most_common_table(data$not_good_match_learn_more, "Top 5 Most Common Responses for Not Good Match Learn More")

Top 5 Most Common Responses for Not Good Match Learn More
Response	Frequency
NA	46860
Nope	685
Sure	96
Stop	5
STOP	4
Other	15

We will only accept “Sure” as a positive response. We will create two versions of this variable, one with NA values and one without.

data$same_day_not_good_match_learn_more_NA <- ifelse(is.na(data$not_good_match_learn_more), NA, ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE))
summary_table(data$same_day_not_good_match_learn_more_NA, "Summary of Boolean Not Good Match Learn More Variable (With NA)", "boolean_not_good_match_learn_more_NA", "Frequency")

Summary of Boolean Not Good Match Learn More Variable (With NA)
boolean_not_good_match_learn_more_NA	Frequency
FALSE	709
TRUE	96
NA	46860

data$same_day_not_good_match_learn_more_noNA <- ifelse(grepl("Sure", data$not_good_match_learn_more, ignore.case = TRUE), TRUE, FALSE)
summary_table(data$same_day_not_good_match_learn_more_noNA, "Summary of Boolean Not Good Match Learn More Variable (With NA = FALSE)", "boolean_not_good_match_learn_more_noNA", "Frequency")

Summary of Boolean Not Good Match Learn More Variable (With NA = FALSE)
boolean_not_good_match_learn_more_noNA	Frequency
FALSE	47569
TRUE	96

donate_amount

We’ll look at the frequencies of the most common values of the donate_amount variable.

most_common_table(data$donate_amount, "Top 5 Most Common Responses for Donate Amount")

Top 5 Most Common Responses for Donate Amount
Response	Frequency
NA	47442
Prefer not to share	141
25	13
$25	9
$100	5
Other	55

We want to remove any dollar signs then convert to numeric.

data$donate_amount_numeric <- gsub("\\$", "", data$donate_amount)
data$donate_amount_numeric <- as.numeric(data$donate_amount)

## Warning: NAs introduced by coercion

summary(data$donate_amount_numeric)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     5.0    25.0    25.0    55.2    50.0   260.0   47635

recurring

We’ll look at the frequencies of the most common values of the recurring variable.

most_common_table(data$recurring, "Top 5 Most Common Responses for Recurring")

Top 5 Most Common Responses for Recurring
Response	Frequency
NA	47457
Nope	158
Yes!	45
If this is Charity Navigator, you should have that information	1
No	1
Other	3