Introduction:

I was working on a project for the American Heart Association. They asked us to help predict donations by participants in fundraising events.

To do this, I created a Random Forest model that was able to successfully predict individuals whose total contributions would fall into the top 10% of lifetime donations given to the AHA. This model will help the organization develop targeted marketing efforts to increase the number of participants who will fall into this category- thereby increasing total donations received by the AHA from various fundraising events.

The Random Forest model was useful in predicting “top_walker”’s, however, we will now take these predictions a step further to develop a cohesive profile of these individuals to help with the targeted campaigns. For this, we will create a Bayesian Belief Network. The previous Random Forest model was good in helping us identify key variables that are useful in predicting “top_walker” status, but the Belief Network will take this analysis a step further by illustrating the relationships within the data between variables.


The model:

In order to implement the Bayesian model, we will first have to do some manipulation with the data.

The “bnlearn” package which we will be using to conduct this analysis requires discrete input variables. In other words, the variables have to be categorical. However, the data that we have been given contains many numeric columns with values such as personal donations, etc. We will have to convert these numbers to categories by putting the values into bins.

Lets get started-

# Load the data
dt <- setDT(read.csv("Data/LuminateDataExport_UTDP2_011818.csv"))

# Remove the unwanted characters
dt1 <- dt[, lapply(.SD, function(x) {
    gsub(c("\\$| |,"), "", x)
})]

# Lower-casing
setnames(dt1, names(dt1), tolower(names(dt1)))[]
##                 city company_goal company_name event_date event_year
##      1:                                         5/14/2016     FY2016
##      2:                                         9/26/2015     FY2016
##      3:                                         9/12/2015     FY2016
##      4:                                          6/3/2017     FY2017
##      5:                                          5/6/2017     FY2017
##     ---                                                             
## 196875:   SanAntonio                           12/12/2015     FY2016
## 196876:       Surrey                            9/12/2015     FY2016
## 196877: BoyntonBeach                            9/26/2015     FY2016
## 196878:   Whitesboro                             3/5/2016     FY2016
## 196879:     Lakewood                             3/7/2015     FY2015
##         fundraising_goal                                name
##      1:              250        HeartWalk2015-2016St.LouisMO
##      2:              200          HeartWalk2015-2016CantonOH
##      3:              350          HeartWalk2015-2016DallasTX
##      4:                0      FDAHW2016-2017AlbanyNYWO-65355
##      5:                0    FDAHW2016-2017NewHavenCTWO-66497
##     ---                                                     
## 196875:              350      HeartWalk2015-2016SanAntonioTX
## 196876:              350          HeartWalk2015-2016DallasTX
## 196877:              300 HeartWalk2015-2016PalmBeachCountyFL
## 196878:               50           HeartWalk2015-2016UticaNY
## 196879:              300    HeartWalk2014-2015OrangeCountyCA
##         participant_id state                 street team_average
##      1:        2323216                                       176
##      2:        2273391    OH                                 123
##      3:        2419569    TX                                 306
##      4:        4088558                                        47
##      5:        4527010                                         0
##     ---                                                         
## 196875:        2813767    TX      4500LockhillSelma           96
## 196876:        2937704    BC        240013450102Ave          318
## 196877:        3035497    FL 1666PalmBeachLakesBlvd          180
## 196878:        2073802    NY        8RooseveltDrive           86
## 196879:        3063505    CA     3700E.SouthhStreet           67
##         team_captain team_count team_id team_member_goal
##      1:           No          7  152788                0
##      2:           No          6  127127                0
##      3:          Yes          4   45273                0
##      4:           No         46  179207                0
##      5:           No          0      -1                 
##     ---                                                 
## 196875:           No          9   64567                0
## 196876:           No         36   44816                0
## 196877:          Yes         10   50585                0
## 196878:           No         27   78604                0
## 196879:           No          9   37613                0
##                                       team_name team_total_gifts
##      1:                             TeamClayton             1230
##      2:                        CardiacCrusaders              738
##      3:                           BIS-TeamMyers             1225
##      4:                      IndependentWalkers             2145
##      5:                                                        0
##     ---                                                         
## 196875:                           RefundRockers              867
## 196876: International-HeartPumpingGlobeTrotters            11451
## 196877:                             David'sTeam             1800
## 196878:                              TeamDebbie             2313
## 196879:           LakewoodRegionalMedicalCenter              600
##                zip gifts_count registration_gift participant_gifts
##      1:                      2                No               236
##      2:                      1                No               218
##      3:                      2                No               225
##      4:                      1                No                 0
##      5:                      1                No                 0
##     ---                                                           
## 196875:  Texas7824           3               Yes                57
## 196876:     V3T0C3          17                No              1495
## 196877:   WestPalm           1                No               100
## 196878: Whitesboro           1                No                 0
## 196879:      09712           1               Yes                 0
##         personal_gift total_gifts match_code tap_level tap_desc tap_lifed
##      1:             0         236      UX000         X                   
##      2:             0         218      UX000         X                   
##      3:             0         225      UX000         X                   
##      4:           250         250      UX000         X                   
##      5:           250         250      UX000         X                   
##     ---                                                                  
## 196875:            50         107      UX000         X                   
## 196876:             0        1495      UX000         X                   
## 196877:             0         100      UX000         X                   
## 196878:           130         130      UX000         X                   
## 196879:           100         100      UX000         X                   
##         medage_cy divindx_cy medhinc_cy meddi_cy mednw_cy
##      1:         0          0          0        0        0
##      2:         0          0          0        0        0
##      3:         0          0          0        0        0
##      4:         0          0          0        0        0
##      5:         0          0          0        0        0
##     ---                                                  
## 196875:         0          0          0        0        0
## 196876:         0          0          0        0        0
## 196877:         0          0          0        0        0
## 196878:         0          0          0        0        0
## 196879:         0          0          0        0        0
# Remove unnecessary columns
important <- dt1[, `:=`(c("event_date", "city", "company_goal", "company_name", 
    "fundraising_goal", "state", "street", "team_count", "team_member_goal", 
    "team_name", "zip", "gifts_count", "registration_gift"), NULL)]

# Convert everything to character
imp1 <- important[, lapply(.SD, as.character, stringsAsFactors = FALSE)]

# Function to convert numeric columns
conversion <- function(x) {
    
    stopifnot(is.list(x))
    
    x[] <- rapply(x, utils::type.convert, classes = "character", how = "replace", 
        as.is = TRUE)
    
    return(x)
}

# Apply function
imp2 <- conversion(imp1)

# Select factor columns
factors <- c("event_year", "name", "participant_id", "team_captain", "team_id", 
    "match_code", "tap_level", "tap_desc")

imp3 <- imp2[, `:=`((factors), lapply(.SD, as.factor)), .SDcols = factors]

# count by category
imp3[, `:=`(team_count = .N), by = team_id]
imp3[, `:=`(event_count = .N), by = name]
imp3[, `:=`(segment_count = .N), by = tap_desc]

# sum $ by category
imp3[, `:=`(personal_sum = sum(total_gifts)), by = participant_id]

imp3[, `:=`(segment_sum = sum(total_gifts)), by = tap_desc]

imp3[, `:=`(event_sum = sum(total_gifts)), by = name]

imp3[, `:=`(year_sum = sum(total_gifts)), by = event_year]

# top 10%
n <- 10

imp3[, `:=`(top_walker, as.factor(ifelse(participant_id %in% imp3[personal_sum > 
    quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0)))]

imp3[, `:=`(top_team, as.factor(ifelse(team_id %in% imp3[team_total_gifts > 
    quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0)))]

imp3[, `:=`(top_event, as.factor(ifelse(name %in% imp3[event_sum > quantile(event_sum, 
    prob = 1 - n/100)]$name, 1, 0)))]

# By year
imp3[, `:=`(y1_top_walker, as.factor(ifelse(event_year == "FY2015" & participant_id %in% 
    imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 
    1, 0)))]
imp3[, `:=`(y2_top_walker, as.factor(ifelse(event_year == "FY2016" & participant_id %in% 
    imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 
    1, 0)))]
imp3[, `:=`(y3_top_walker, as.factor(ifelse(event_year == "FY2017" & participant_id %in% 
    imp3[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 
    1, 0)))]

imp3[, `:=`(y1_top_team, as.factor(ifelse(event_year == "FY2015" & team_id %in% 
    imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 
    1, 0)))]
imp3[, `:=`(y2_top_team, as.factor(ifelse(event_year == "FY2016" & team_id %in% 
    imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 
    1, 0)))]
imp3[, `:=`(y3_top_team, as.factor(ifelse(event_year == "FY2017" & team_id %in% 
    imp3[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 
    1, 0)))]
# Remove demographic info
imp4 <- imp3[,  c("medage_cy", "divindx_cy", "medhinc_cy", "meddi_cy", "mednw_cy") := NULL]

# Select numeric columns
nums <- which(sapply(imp4, is.numeric))

bn_dat <- imp4[, nums, with = FALSE]

# Create categorical columns
out_cols <- paste(names(bn_dat), "cat", sep = "_")

# Transform
bn_dat <- bn_dat[, c(out_cols) := lapply(.SD, function(x) { .bincode(x, unique(quantile(x, probs = seq(0, 1, .25))))})]

# Remove NA values
bn_dat[is.na(bn_dat)] = 0

factors <- seq(13, 24, 1)

# Convert numeric to factor
bn_dat2 <- bn_dat[, c(factors) := lapply(.SD, as.factor), .SDcols = factors]

# Merge data with original factor columns
merge_cols <- which(sapply(imp4, is.factor))

combo <- cbind(bn_dat2[, factors, with = FALSE], imp4[, merge_cols, with = FALSE])

Now that we have our data in a suitable format, we are ready to feed it into the “bnlearn” algorithm:

# Give the data to the algorithm
bn <- naive.bayes(combo, "top_walker")

# Plot
plot(bn)


Conclusion:

Unfortunately, in this instance- it would seem like the network plot of the naive bayes model does not give us much information on the nature of the data. The object was to try and find out if there were specific variables that we conditionally dependent upon one another. However, the plot does not show any relationships between the explanatory variables. In a sense, this is OK because it would further impact the concept underlying the Naive Bayes model which is that the variables should be independent of one another.