I was working on a project for the American Heart Association. They asked us to help predict donations by participants in fundraising events.
To do this, I created a Random Forest model that successfully predicted which individuals' total contributions would fall into the top 10% of lifetime donations given to the AHA. This model will help the organization develop targeted marketing efforts to increase the number of participants who fall into this category, thereby increasing the total donations the AHA receives from its various fundraising events.
The Random Forest model was useful for predicting “top_walker” status; however, we will now take these predictions a step further and develop a cohesive profile of these individuals to support the targeted campaigns. For this, we will create a Bayesian Belief Network. The Random Forest model helped us identify the key variables for predicting “top_walker” status, but the Belief Network takes the analysis a step further by illustrating the relationships between the variables in the data.
In order to implement the Bayesian model, we first have to do some manipulation of the data.
The “bnlearn” package, which we will be using to conduct this analysis, requires discrete input variables; in other words, the variables have to be categorical. However, the data we have been given contains many numeric columns, such as personal donations. We will have to convert these numbers to categories by placing the values into bins.
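As a quick illustration of what this binning will look like, here is a minimal, self-contained sketch on a made-up donation vector; the values and variable names are only for illustration, and the same quartile-based idea is applied to the real columns further down.
# Toy example: bin a numeric donation vector into quartile-based categories
donations <- c(0, 25, 50, 100, 150, 250, 500, 1000)
# Quartile break points (unique() guards against duplicated cut points)
breaks <- unique(quantile(donations, probs = seq(0, 1, 0.25)))
# cut() returns a factor, which is the discrete form "bnlearn" expects
donation_bin <- cut(donations, breaks = breaks, include.lowest = TRUE)
table(donation_bin)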
Let's get started:
# Load the required package
library(data.table)
# Load the data
dt <- setDT(read.csv("Data/LuminateDataExport_UTDP2_011818.csv"))
# Remove unwanted characters (dollar signs, spaces, and commas)
dt1 <- dt[, lapply(.SD, function(x) {
gsub("\\$| |,", "", x)
})]
# Lower-case the column names (the trailing [] prints the result)
setnames(dt1, names(dt1), tolower(names(dt1)))[]
## city company_goal company_name event_date event_year
## 1: 5/14/2016 FY2016
## 2: 9/26/2015 FY2016
## 3: 9/12/2015 FY2016
## 4: 6/3/2017 FY2017
## 5: 5/6/2017 FY2017
## ---
## 196875: SanAntonio 12/12/2015 FY2016
## 196876: Surrey 9/12/2015 FY2016
## 196877: BoyntonBeach 9/26/2015 FY2016
## 196878: Whitesboro 3/5/2016 FY2016
## 196879: Lakewood 3/7/2015 FY2015
## fundraising_goal name
## 1: 250 HeartWalk2015-2016St.LouisMO
## 2: 200 HeartWalk2015-2016CantonOH
## 3: 350 HeartWalk2015-2016DallasTX
## 4: 0 FDAHW2016-2017AlbanyNYWO-65355
## 5: 0 FDAHW2016-2017NewHavenCTWO-66497
## ---
## 196875: 350 HeartWalk2015-2016SanAntonioTX
## 196876: 350 HeartWalk2015-2016DallasTX
## 196877: 300 HeartWalk2015-2016PalmBeachCountyFL
## 196878: 50 HeartWalk2015-2016UticaNY
## 196879: 300 HeartWalk2014-2015OrangeCountyCA
## participant_id state street team_average
## 1: 2323216 176
## 2: 2273391 OH 123
## 3: 2419569 TX 306
## 4: 4088558 47
## 5: 4527010 0
## ---
## 196875: 2813767 TX 4500LockhillSelma 96
## 196876: 2937704 BC 240013450102Ave 318
## 196877: 3035497 FL 1666PalmBeachLakesBlvd 180
## 196878: 2073802 NY 8RooseveltDrive 86
## 196879: 3063505 CA 3700E.SouthhStreet 67
## team_captain team_count team_id team_member_goal
## 1: No 7 152788 0
## 2: No 6 127127 0
## 3: Yes 4 45273 0
## 4: No 46 179207 0
## 5: No 0 -1
## ---
## 196875: No 9 64567 0
## 196876: No 36 44816 0
## 196877: Yes 10 50585 0
## 196878: No 27 78604 0
## 196879: No 9 37613 0
## team_name team_total_gifts
## 1: TeamClayton 1230
## 2: CardiacCrusaders 738
## 3: BIS-TeamMyers 1225
## 4: IndependentWalkers 2145
## 5: 0
## ---
## 196875: RefundRockers 867
## 196876: International-HeartPumpingGlobeTrotters 11451
## 196877: David'sTeam 1800
## 196878: TeamDebbie 2313
## 196879: LakewoodRegionalMedicalCenter 600
## zip gifts_count registration_gift participant_gifts
## 1: 2 No 236
## 2: 1 No 218
## 3: 2 No 225
## 4: 1 No 0
## 5: 1 No 0
## ---
## 196875: Texas7824 3 Yes 57
## 196876: V3T0C3 17 No 1495
## 196877: WestPalm 1 No 100
## 196878: Whitesboro 1 No 0
## 196879: 09712 1 Yes 0
## personal_gift total_gifts match_code tap_level tap_desc tap_lifed
## 1: 0 236 UX000 X
## 2: 0 218 UX000 X
## 3: 0 225 UX000 X
## 4: 250 250 UX000 X
## 5: 250 250 UX000 X
## ---
## 196875: 50 107 UX000 X
## 196876: 0 1495 UX000 X
## 196877: 0 100 UX000 X
## 196878: 130 130 UX000 X
## 196879: 100 100 UX000 X
## medage_cy divindx_cy medhinc_cy meddi_cy mednw_cy
## 1: 0 0 0 0 0
## 2: 0 0 0 0 0
## 3: 0 0 0 0 0
## 4: 0 0 0 0 0
## 5: 0 0 0 0 0
## ---
## 196875: 0 0 0 0 0
## 196876: 0 0 0 0 0
## 196877: 0 0 0 0 0
## 196878: 0 0 0 0 0
## 196879: 0 0 0 0 0
# Remove unnecessary columns
important <- dt1[, `:=`(c("event_date", "city", "company_goal", "company_name",
"fundraising_goal", "state", "street", "team_count", "team_member_goal",
"team_name", "zip", "gifts_count", "registration_gift"), NULL)]
# Convert everything to character
imp1 <- important[, lapply(.SD, as.character)]
# Function to convert character columns back to their natural types (numeric where possible)
conversion <- function(x) {
stopifnot(is.list(x))
x[] <- rapply(x, utils::type.convert, classes = "character", how = "replace",
as.is = TRUE)
return(x)
}
# Apply function
imp2 <- conversion(imp1)
# Select factor columns
factors <- c("event_year", "name", "participant_id", "team_captain", "team_id",
"match_code", "tap_level", "tap_desc")
imp3 <- imp2[, `:=`((factors), lapply(.SD, as.factor)), .SDcols = factors]
# Count rows by category
imp3[, `:=`(team_count = .N), by = team_id]
imp3[, `:=`(event_count = .N), by = name]
imp3[, `:=`(segment_count = .N), by = tap_desc]
# Sum total gifts by category
imp3[, `:=`(personal_sum = sum(total_gifts)), by = participant_id]
imp3[, `:=`(segment_sum = sum(total_gifts)), by = tap_desc]
imp3[, `:=`(event_sum = sum(total_gifts)), by = name]
imp3[, `:=`(year_sum = sum(total_gifts)), by = event_year]
# Flag participants, teams, and events in the top 10%
n <- 10
imp3[, `:=`(top_walker, as.factor(ifelse(participant_id %in% imp3[personal_sum >
quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0)))]
imp3[, `:=`(top_team, as.factor(ifelse(team_id %in% imp3[team_total_gifts >
quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0)))]
imp3[, `:=`(top_event, as.factor(ifelse(name %in% imp3[event_sum > quantile(event_sum,
prob = 1 - n/100)]$name, 1, 0)))]
# By year
imp3[, `:=`(y1_top_walker, as.factor(ifelse(event_year == "FY2015" & participant_id %in%
imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id,
1, 0)))]
imp3[, `:=`(y2_top_walker, as.factor(ifelse(event_year == "FY2016" & participant_id %in%
imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id,
1, 0)))]
imp3[, `:=`(y3_top_walker, as.factor(ifelse(event_year == "FY2017" & participant_id %in%
imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id,
1, 0)))]
imp3[, `:=`(y1_top_team, as.factor(ifelse(event_year == "FY2015" & team_id %in%
imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id,
1, 0)))]
imp3[, `:=`(y2_top_team, as.factor(ifelse(event_year == "FY2016" & team_id %in%
imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id,
1, 0)))]
imp3[, `:=`(y3_top_team, as.factor(ifelse(event_year == "FY2017" & team_id %in%
imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id,
1, 0)))]
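Before moving on, a quick sanity check (my own addition, not part of the original pipeline) can confirm roughly how many participants end up flagged:
# How many distinct participants are flagged as top walkers?
imp3[top_walker == 1, uniqueN(participant_id)]
# Share of rows flagged; this will not be exactly 10% because top walkers
# can appear on many rows and ties at the cutoff are excluded
imp3[, mean(top_walker == 1)]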
# Remove demographic info
imp4 <- imp3[, c("medage_cy", "divindx_cy", "medhinc_cy", "meddi_cy", "mednw_cy") := NULL]
# Select numeric columns
nums <- which(sapply(imp4, is.numeric))
bn_dat <- imp4[, nums, with = FALSE]
# Names for the new categorical columns
out_cols <- paste(names(bn_dat), "cat", sep = "_")
# Bin each numeric column into quartiles
bn_dat <- bn_dat[, c(out_cols) := lapply(.SD, function(x) {
.bincode(x, unique(quantile(x, probs = seq(0, 1, 0.25))))
})]
# .bincode() leaves values at the lowest break as NA; recode these as 0
bn_dat[is.na(bn_dat)] = 0
# Column positions of the newly created *_cat columns
factors <- seq(13, 24, 1)
# Convert the binned columns from numeric to factor
bn_dat2 <- bn_dat[, c(factors) := lapply(.SD, as.factor), .SDcols = factors]
# Merge data with original factor columns
merge_cols <- which(sapply(imp4, is.factor))
combo <- cbind(bn_dat2[, factors, with = FALSE], imp4[, merge_cols, with = FALSE])
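Because “bnlearn” treats every column as a discrete variable, it is worth confirming that everything in the merged table really is a factor before fitting; the check below is my own addition rather than part of the original pipeline (some “bnlearn” versions are also happier with a plain data.frame than a data.table, so as.data.frame(combo) is a cautious extra step).
# Every column handed to bnlearn must be a factor for a discrete network
stopifnot(all(sapply(combo, is.factor)))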
Now that we have our data in a suitable format, we are ready to feed it to the “bnlearn” package:
# Load bnlearn and fit a naive Bayes structure with top_walker as the class variable
library(bnlearn)
bn <- naive.bayes(combo, "top_walker")
# Plot
plot(bn)
Unfortunately, in this instance the network plot of the naive Bayes model does not give us much information about the nature of the data. The objective was to find out whether specific variables were conditionally dependent on one another; however, the plot does not show any relationships between the explanatory variables. In a sense this is to be expected, because it reflects the assumption underlying the naive Bayes model: the explanatory variables are treated as conditionally independent of one another given the class, so the structure contains only arcs from “top_walker” to each feature and can never reveal dependencies among the features themselves.
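If we want the network itself to reveal dependencies between the explanatory variables, one option is to learn the structure from the data instead of fixing it in advance. The sketch below is a minimal illustration using the combo table prepared above; hc() (hill-climbing) and tree.bayes() (tree-augmented naive Bayes) are both part of the “bnlearn” package, while the choice to drop the high-cardinality identifier columns first is my own assumption, made only to keep the search tractable.
# Drop ID-like columns with thousands of levels before structure learning
# (assumption: they add little to a dependency graph and slow the search)
bn_in <- as.data.frame(combo[, !c("participant_id", "team_id", "name"), with = FALSE])
# Learn a general structure with hill-climbing...
bn_hc <- hc(bn_in)
plot(bn_hc)
# ...or fit a tree-augmented naive Bayes, which keeps top_walker as the class
# but allows one extra parent per explanatory variable
bn_tan <- tree.bayes(bn_in, "top_walker")
plot(bn_tan)
A structure learned this way can then be compared against the naive Bayes plot to see which conditional dependencies the data actually supports.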