A national veterans’ organization aims to improve the cost-effectiveness of its direct marketing campaign. With an in-house database of over 13 million donors, the organization is one of the largest direct-mail fundraisers in the United States. This report presents the background, business objectives, and data used to develop a classification model for capturing donors more effectively and maximizing expected net profit.
According to the organization’s recent mailing records:
To achieve the desired goal, we will:
Successful businesses are based on both goals and objectives, as they clarify the purpose of the business and help identify necessary actions. Goals are general statements of desired achievement, while objectives are the specific steps or actions you take to reach your goal.
The organization has made available a sample data file named “fundraising.rds” from its latest fundraising campaign. This data set consists of 21 variables and 3,000 records. It is worth noting that the data is weighted, with a deliberate under-representation of non-donors, which gives us a balanced sample of donors and non-donors for the analysis.
* The target variable, “target,” is a categorical variable indicating whether an individual is a donor or not.
* The remaining variables are either categorical or numerical.
* The categorical variables are zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, and female.
* The numerical variables are num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift.
train = readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\fundraising.rds")
test = readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\future_fundraising.rds")
sum(is.na(train))
## [1] 0
summary(train)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Calculating Correlations
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
# temp: an all-numeric copy of train (factors coerced to numeric codes) so cor() can run
temp = data.frame(lapply(train, as.numeric))
res = cor(temp)
round(res, 2)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner
## zipconvert2 1.00 -0.25 -0.27 -0.42 0.02
## zipconvert3 -0.25 1.00 -0.25 -0.38 -0.08
## zipconvert4 -0.27 -0.25 1.00 -0.41 -0.04
## zipconvert5 -0.42 -0.38 -0.41 1.00 0.08
## homeowner 0.02 -0.08 -0.04 0.08 1.00
## num_child 0.03 -0.01 0.03 -0.04 0.05
## income 0.00 -0.09 -0.02 0.10 0.32
## female -0.01 -0.02 0.02 0.01 0.00
## wealth 0.04 -0.08 -0.01 0.04 0.07
## home_value -0.13 -0.24 -0.18 0.45 0.12
## med_fam_inc -0.09 -0.10 -0.04 0.18 0.14
## avg_fam_inc -0.08 -0.11 -0.05 0.20 0.13
## pct_lt15k 0.05 0.03 0.06 -0.12 -0.14
## num_prom -0.03 0.04 0.03 -0.03 0.00
## lifetime_gifts -0.03 -0.01 0.03 0.01 -0.03
## largest_gift -0.02 0.00 -0.02 0.03 -0.03
## last_gift -0.01 -0.05 -0.03 0.07 0.00
## months_since_donate 0.02 -0.02 0.03 -0.02 0.02
## time_lag -0.02 0.04 -0.01 -0.01 0.03
## avg_gift -0.01 -0.06 -0.03 0.08 -0.01
## target -0.01 -0.01 -0.01 0.02 0.03
## num_child income female wealth home_value med_fam_inc
## zipconvert2 0.03 0.00 -0.01 0.04 -0.13 -0.09
## zipconvert3 -0.01 -0.09 -0.02 -0.08 -0.24 -0.10
## zipconvert4 0.03 -0.02 0.02 -0.01 -0.18 -0.04
## zipconvert5 -0.04 0.10 0.01 0.04 0.45 0.18
## homeowner 0.05 0.32 0.00 0.07 0.12 0.14
## num_child 1.00 0.09 -0.03 0.06 -0.01 0.05
## income 0.09 1.00 -0.04 0.21 0.29 0.37
## female -0.03 -0.04 1.00 -0.03 -0.02 -0.02
## wealth 0.06 0.21 -0.03 1.00 0.26 0.38
## home_value -0.01 0.29 -0.02 0.26 1.00 0.74
## med_fam_inc 0.05 0.37 -0.02 0.38 0.74 1.00
## avg_fam_inc 0.05 0.38 -0.03 0.39 0.75 0.97
## pct_lt15k -0.03 -0.28 0.06 -0.38 -0.40 -0.67
## num_prom -0.09 -0.07 0.04 -0.41 -0.06 -0.05
## lifetime_gifts -0.05 -0.02 0.04 -0.23 -0.02 -0.04
## largest_gift -0.02 0.03 0.00 -0.03 0.06 0.05
## last_gift -0.01 0.11 -0.05 0.05 0.16 0.14
## months_since_donate -0.01 0.08 -0.05 0.03 0.02 0.03
## time_lag -0.01 0.00 -0.01 -0.07 0.00 0.02
## avg_gift -0.02 0.12 -0.07 0.09 0.17 0.14
## target -0.04 0.04 0.02 0.00 0.02 0.01
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## zipconvert2 -0.08 0.05 -0.03 -0.03 -0.02
## zipconvert3 -0.11 0.03 0.04 -0.01 0.00
## zipconvert4 -0.05 0.06 0.03 0.03 -0.02
## zipconvert5 0.20 -0.12 -0.03 0.01 0.03
## homeowner 0.13 -0.14 0.00 -0.03 -0.03
## num_child 0.05 -0.03 -0.09 -0.05 -0.02
## income 0.38 -0.28 -0.07 -0.02 0.03
## female -0.03 0.06 0.04 0.04 0.00
## wealth 0.39 -0.38 -0.41 -0.23 -0.03
## home_value 0.75 -0.40 -0.06 -0.02 0.06
## med_fam_inc 0.97 -0.67 -0.05 -0.04 0.05
## avg_fam_inc 1.00 -0.68 -0.06 -0.04 0.04
## pct_lt15k -0.68 1.00 0.04 0.06 -0.01
## num_prom -0.06 0.04 1.00 0.54 0.11
## lifetime_gifts -0.04 0.06 0.54 1.00 0.51
## largest_gift 0.04 -0.01 0.11 0.51 1.00
## last_gift 0.13 -0.06 -0.06 0.20 0.45
## months_since_donate 0.03 -0.01 -0.28 -0.14 0.02
## time_lag 0.02 -0.02 0.12 0.04 0.04
## avg_gift 0.13 -0.06 -0.15 0.18 0.47
## target 0.00 0.00 0.07 0.02 -0.02
## last_gift months_since_donate time_lag avg_gift target
## zipconvert2 -0.01 0.02 -0.02 -0.01 -0.01
## zipconvert3 -0.05 -0.02 0.04 -0.06 -0.01
## zipconvert4 -0.03 0.03 -0.01 -0.03 -0.01
## zipconvert5 0.07 -0.02 -0.01 0.08 0.02
## homeowner 0.00 0.02 0.03 -0.01 0.03
## num_child -0.01 -0.01 -0.01 -0.02 -0.04
## income 0.11 0.08 0.00 0.12 0.04
## female -0.05 -0.05 -0.01 -0.07 0.02
## wealth 0.05 0.03 -0.07 0.09 0.00
## home_value 0.16 0.02 0.00 0.17 0.02
## med_fam_inc 0.14 0.03 0.02 0.14 0.01
## avg_fam_inc 0.13 0.03 0.02 0.13 0.00
## pct_lt15k -0.06 -0.01 -0.02 -0.06 0.00
## num_prom -0.06 -0.28 0.12 -0.15 0.07
## lifetime_gifts 0.20 -0.14 0.04 0.18 0.02
## largest_gift 0.45 0.02 0.04 0.47 -0.02
## last_gift 1.00 0.19 0.08 0.87 -0.08
## months_since_donate 0.19 1.00 0.02 0.19 -0.13
## time_lag 0.08 0.02 1.00 0.07 0.01
## avg_gift 0.87 0.19 0.07 1.00 -0.08
## target -0.08 -0.13 0.01 -0.08 1.00
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
There appears to be a strong negative correlation between med_fam_inc and avg_fam_inc on the one hand and pct_lt15k on the other (roughly -0.67 and -0.68, respectively).
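For reference, the two coefficients behind this observation can be pulled directly from the correlation matrix computed above:
res["med_fam_inc", "pct_lt15k"] # approximately -0.67
res["avg_fam_inc", "pct_lt15k"] # approximately -0.68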
highly_correlated <- findCorrelation(res, cutoff = 0.8, verbose = TRUE)
## Compare row 12 and column 11 with corr 0.972
## Means: 0.214 vs 0.106 so flagging column 12
## Compare row 20 and column 17 with corr 0.866
## Means: 0.151 vs 0.098 so flagging column 20
## All correlations <= 0.8
highly_correlated
## [1] 12 20
findCorrelation() compares each variable’s mean absolute correlation with the rest. Column 12 (avg_fam_inc) and column 11 (med_fam_inc) have a correlation of 0.972; since avg_fam_inc has the higher mean absolute correlation with the remaining variables (0.214 vs. 0.106), column 12 (avg_fam_inc) is flagged as redundant.
Likewise, column 20 (avg_gift) and column 17 (last_gift) have a correlation of 0.866, and avg_gift’s higher mean absolute correlation (0.151 vs. 0.098) causes column 20 (avg_gift) to be flagged.
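If we wanted to act on these flags, the redundant columns could be dropped before modeling. This is only a sketch; the models below retain all of the predictors:
# Sketch: remove the flagged columns (avg_fam_inc and avg_gift) from the numeric copy
reduced <- temp[, -highly_correlated]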
library(ggplot2)
# Set the figure size and resolution (repr options take effect in Jupyter/IRkernel sessions)
options(repr.plot.width=16, repr.plot.height=10, repr.plot.res=300)
# Define the list of excluded variables
excluded_vars <- c("zipconvert2","zipconvert3","zipconvert4", "zipconvert5", "female")
# Filter only numeric columns and remove excluded variables
numeric_data <- temp[sapply(temp, is.numeric)]
filtered_data <- numeric_data[!(names(numeric_data) %in% excluded_vars)]
# Remove rows with missing values from the filtered_data
filtered_data <- na.omit(filtered_data)
# Melt the filtered data into a long format
library(reshape2)
melted_data <- melt(filtered_data)
## No id variables; using all as measure variables
# Create the histograms
histograms <- ggplot(melted_data, aes(x = value)) +
geom_histogram(bins = 30, color = "black", fill = "skyblue", alpha = 0.7) +
facet_wrap(~ variable, scales = "free_x", ncol = 3) +
theme_minimal() +
theme(strip.text.x = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# Display the histograms
print(histograms)
Looking at the histograms, our numeric variables do not appear normally distributed; most are right-skewed, with long upper tails on variables such as lifetime_gifts, largest_gift, and avg_gift.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Select the desired variables from your dataset
selected_data <- temp %>%
select(zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, num_child, income,
female, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom,
lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, avg_gift, target)
# Create the pairwise plot using ggpairs()
# (target is numeric in temp, so ggplot drops the fill aesthetic in the
# continuous panels below; converting target to a factor and mapping colour
# would colour the panels by donor status)
pairwise_plot <- selected_data %>%
ggpairs(aes(fill = target))
# Display the pairwise plot
print(pairwise_plot)
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## (This warning was emitted once per panel; the verbatim repeats are omitted.)
The pairwise plot is a compact way to visualize all of these variables at once. The strong correlations from the correlation matrix are clearly visible, and several apparently non-linear relationships show up as well. The numeric variables also exhibit a noticeable degree of skewness, with both the skewness and skew.2SE (skew divided by twice its standard error) values above 1, which indicates non-normality.
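The skew.2SE figures referenced above can be reproduced with pastecs::stat.desc(); note that pastecs is an assumption here, as it is not loaded anywhere else in this report. Values of skew.2SE above 1 indicate significant skewness:
library(pastecs) # assumption: pastecs is installed; it is not used elsewhere in this report
round(stat.desc(filtered_data, norm = TRUE)["skew.2SE", ], 2)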
Now I am ready to begin analyzing the data, starting with the XGBoost model.
Although this model was not taught in class, I would like to use it, and the professor approved it for this project.
XGBoost (Extreme Gradient Boosting) is a popular machine learning algorithm that uses a boosted tree model to solve regression, classification, and ranking problems. The model is based on an ensemble of decision trees that are trained sequentially to correct errors made by the previous trees.
The main advantages of XGBoost include its ability to handle large datasets, its fast training speed, and its high predictive accuracy, which has made it a popular choice in data science competitions and industry applications. XGBoost also includes various regularization techniques to prevent overfitting and can handle missing values and outliers.
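As a concrete illustration of those regularization techniques, these are the main knobs xgboost exposes; the values below are purely illustrative, not the settings used in this report:
# Illustrative xgboost parameter list; values are examples, not tuned settings
params <- list(objective = "binary:logistic",
eta = 0.1, # learning rate: shrinks each tree's contribution
max_depth = 4, # caps tree depth to limit model complexity
lambda = 1, # L2 regularization on leaf weights
alpha = 0, # L1 regularization on leaf weights
subsample = 0.8) # row subsampling, a further guard against overfitting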
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
library(dplyr)
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.3
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ xgboost::slice() masks dplyr::slice()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 4.2.3
Here I will create the variable k to cross-validate my data.
k will be set to 5 folds, so each iteration trains on 80% of the rows and tests on the remaining 20%.
# set k folds:
k = 5 #allows for 2400 train (80% training) and 600 test (20% testing)
fold_nonNumeric = vector(mode = "list", length = k)
ID = vector(mode = "list", length = k)
pred = vector(mode = "list", length = k)
predID = vector(mode = "list", length = k)
status = vector(mode = "list", length = k)
Here we set a random seed to ensure reproducibility of the results, assign a unique ID to each row of the train dataset, and then create the k folds for cross-validation: the rows are randomly shuffled, each row is assigned to one of the k folds, and the dataset is sorted by fold. The result is a shuffled dataset with an additional fold column that drives the cross-validation below.
set.seed(12345)
# recode target to 0/1 (Donor = 1): xgboost's binary:logistic objective needs a numeric label
train$target = ifelse(train$target == "Donor", 1, 0)
# unique ID
train$index = 1:nrow(train)
# shuffle it up
train <- train %>%
sample_frac(size = 1) %>% # randomly shuffle the rows
mutate(fold = rep(1:k, length = n())) %>% # repeat 1:k out to the number of rows
arrange(fold)
Once the folds are created, we perform the 5-fold cross-validation using XGBoost:
set.seed(12345)
for(i in 1:k){
train_cv <- train %>%
filter(fold != i)
test_cv <- train %>%
filter(fold == i)
train_cv$index = NULL
test_cv$index = NULL
train_cv$fold = NULL
test_cv$fold = NULL
x_train <- train_cv
x_test <- test_cv
x_train$target = NULL
x_test$target = NULL
#loopID = test_cv[,22] #gets the index number
loopStatus = test_cv[,21]
#loopID = as.matrix(loopID)
loopStatus = as.matrix(loopStatus)
x_train = data.matrix(x_train) # data.matrix coerces the remaining factor columns to numeric codes, as xgboost requires
x_test = data.matrix(x_test)
test_cv = as.data.frame(test_cv)
train_cv = as.data.frame(train_cv)
#dtrain = xgb.DMatrix(x_train, label = train_cv$target)
#dtest = xgb.DMatrix(x_test, label = test_cv$target)
model = xgboost(data = x_train, label = train_cv$target,
nrounds = 10,
objective = "binary:logistic",
eval_metric = "error")
pred = predict(model, x_test)
fold_nonNumeric[[i]] = pred
#ID[[i]] = loopID
status[[i]] = loopStatus
}
## [1] train-error:0.362500
## [2] train-error:0.314167
## [3] train-error:0.296667
## [4] train-error:0.263333
## [5] train-error:0.256250
## [6] train-error:0.252917
## [7] train-error:0.230417
## [8] train-error:0.220417
## [9] train-error:0.199583
## [10] train-error:0.199583
## [1] train-error:0.367083
## [2] train-error:0.311667
## [3] train-error:0.307917
## [4] train-error:0.290417
## [5] train-error:0.277917
## [6] train-error:0.247083
## [7] train-error:0.246667
## [8] train-error:0.228750
## [9] train-error:0.215833
## [10] train-error:0.206250
## [1] train-error:0.357500
## [2] train-error:0.324167
## [3] train-error:0.297500
## [4] train-error:0.279583
## [5] train-error:0.269583
## [6] train-error:0.236250
## [7] train-error:0.225833
## [8] train-error:0.216667
## [9] train-error:0.201250
## [10] train-error:0.196667
## [1] train-error:0.364583
## [2] train-error:0.321667
## [3] train-error:0.310417
## [4] train-error:0.272500
## [5] train-error:0.262500
## [6] train-error:0.259167
## [7] train-error:0.251667
## [8] train-error:0.236250
## [9] train-error:0.237083
## [10] train-error:0.230417
## [1] train-error:0.357500
## [2] train-error:0.331250
## [3] train-error:0.285833
## [4] train-error:0.265833
## [5] train-error:0.251667
## [6] train-error:0.251667
## [7] train-error:0.250000
## [8] train-error:0.225833
## [9] train-error:0.218333
## [10] train-error:0.208750
ID = unlist(ID) # the ID entries were never filled in above (those lines are commented out), so this is NULL and cbind() drops it
fold_nonNumeric = unlist(fold_nonNumeric)
status = unlist(status)
Predictions = cbind(ID, fold_nonNumeric, status)
importance_matrix = xgb.importance(colnames(x_train), model = model)
xgb.plot.importance(importance_matrix)
xgb.plot.tree(model = model, trees = 1)
The variables that our model found to be the most important are home_value, lifetime_gifts, and avg_gift.
We can also visualize the individual trees created by the XGBoost model.
Predictions = as.data.frame(Predictions)
Predictions$binary_pred <- ifelse(Predictions$fold_nonNumeric > 0.5, 1, 0)
library(caret)
conf_mat <- confusionMatrix(factor(Predictions$binary_pred), factor(Predictions$status))
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 809 728
## 1 692 771
##
## Accuracy : 0.5267
## 95% CI : (0.5086, 0.5447)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : 0.002072
##
## Kappa : 0.0533
##
## Mcnemar's Test P-Value : 0.352991
##
## Sensitivity : 0.5390
## Specificity : 0.5143
## Pos Pred Value : 0.5264
## Neg Pred Value : 0.5270
## Prevalence : 0.5003
## Detection Rate : 0.2697
## Detection Prevalence : 0.5123
## Balanced Accuracy : 0.5267
##
## 'Positive' Class : 0
##
Despite being an efficient and scalable gradient boosting algorithm that is particularly effective on large datasets and complex classification and regression problems, XGBoost did not perform well here: cross-validated accuracy was only about 53%, with a kappa of 0.0533, barely better than chance.
Now that the model is trained, the next step is to score the holdout file; its variables first have to be reformatted to match the training data.
My second model of choice is Random Forest
A random forest is an ensemble learning method that constructs multiple decision trees during training and combines their predictions to produce a more accurate and robust result. It reduces overfitting by averaging the predictions of multiple trees and can handle both classification and regression tasks.
Note that this differs from XGBoost, which is an optimized implementation of gradient boosting and builds trees sequentially, with each new tree correcting the errors made by the previous ones. While both methods use decision trees, XGBoost employs boosting to minimize the loss function, whereas random forests use bagging to reduce model variance.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.2.3
library(caret)
Below, the data is split into 80% train and 20% test.
library(caret)
library(vcd)
## Warning: package 'vcd' was built under R version 4.2.3
## Loading required package: grid
# Set the seed for reproducibility
set.seed(12345)
# Create an index for the 80/20 split
index <- createDataPartition(train$target, p = 0.8, list = FALSE)
# Split the data into training and test sets
train$index <- NULL # drop the helper columns created for the CV loop above
train$fold <- NULL
trainn <- train # keep a copy of the full sample before overwriting train
train <- trainn[index, ]
test <- trainn[-index, ]
x = as.data.frame(train[,1:20])
y = as.data.frame(train[,21])
Creating our model
train$target <- as.factor(train$target) # convert to factor
control = trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed = 12345
metric = "Accuracy"
mtry = sqrt(ncol(x))
tunegrid = expand.grid(.mtry = mtry)
rf_default <- train(target~., data=train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest
##
## 2400 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2160, 2160, 2160, 2160, 2160, 2160, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5174972 0.03461029
##
## Tuning parameter 'mtry' was held constant at a value of 4.472136
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(12345)
mtry <- sqrt(ncol(x))
rf_random <- train(target~., data=train, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 2400 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2160, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.5493051 0.09762545
## 2 0.5233364 0.04617784
## 6 0.5161089 0.03182405
## 7 0.5190255 0.03776252
## 8 0.5237426 0.04721077
## 10 0.5216650 0.04304579
## 11 0.5166632 0.03310103
## 12 0.5162478 0.03223786
## 16 0.5188902 0.03755671
## 17 0.5183213 0.03639208
## 20 0.5187472 0.03719054
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(rf_random)
I ended up using a random search to identify the best parameter. The model was trained using cross-validation with 10 folds and 3 repetitions, and the best value for the mtry parameter was found to be 1, based on the highest accuracy score. The final model had an accuracy of 0.549 and a kappa value of 0.098.
# Get variable importance scores using varImp() function
var_importance <- varImp(rf_random)
# Print the variable importance scores
print(var_importance)
## rf variable importance
##
## Overall
## avg_gift 100.0000
## home_value 91.6774
## med_fam_inc 90.1522
## lifetime_gifts 88.7047
## avg_fam_inc 86.6479
## num_prom 83.8573
## largest_gift 81.3663
## months_since_donate 80.3790
## pct_lt15k 77.8428
## last_gift 75.9321
## time_lag 70.6874
## income 44.3764
## wealth 41.5966
## num_child 7.9709
## homeownerNo 1.8291
## homeownerYes 1.5980
## femaleNo 0.8964
## zipconvert5No 0.7741
## femaleYes 0.4881
## zipconvert5Yes 0.0000
# Plot the variable importance scores
plot(var_importance, main = "Variable Importance Plot")
Similar to the XGBoost model, the most important variables include:
* avg_gift
* home_value
* med_fam_inc
Extended caret:
Here we create a custom random forest model for use with the caret package, specifying the model type, library, parameters, and functions for creating a grid, fitting the model, making predictions, calculating probabilities, sorting results, and extracting class levels. This grid takes too long to run, so the code is left commented out.
# customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
# customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
# customRF$grid <- function(x, y, len = NULL, search = "grid") {}
# customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
# }
# customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata)
# customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata, type = "prob")
# customRF$sort <- function(x) x[order(x[,1]),]
# customRF$levels <- function(x) x$classes
# # train model
# control <- trainControl(method="repeatedcv", number=10, repeats=3)
# tunegrid <- expand.grid(.mtry=c(1:15), .ntree=c(1000, 1500, 2000, 2500))
# set.seed(seed)
# custom <- train(target~., data=train, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)
# summary(custom)
# plot(custom)
# Extract the best parameter value
best_mtry <- rf_random$bestTune$mtry
# Train a new model on the full training set using the best parameter value
model <- randomForest(target ~ ., data = train, mtry = best_mtry)
# Generate predictions on the test set using the trained model
predictions <- predict(model, newdata = test)
# Evaluate the performance of the model on the test set using the same metric as before
accuracy <- mean(predictions == test$target)
# Print the performance metrics
cat("Accuracy on test set: ", accuracy, "\n")
## Accuracy on test set: 0.778481
library(ROCR)
# Generate predictions on the test set using the trained model
predictions <- predict(model, newdata = test)
# Convert predictions and test$target to factors with the same levels
predictions <- factor(predictions, levels = c("0", "1"))
test$target <- factor(test$target, levels = c("0", "1"))
# Create a confusion matrix using caret::confusionMatrix()
conf_matrix <- confusionMatrix(predictions, test$target)
# Print the confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 169 38
## 1 68 199
##
## Accuracy : 0.7764
## 95% CI : (0.7361, 0.8131)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5527
##
## Mcnemar's Test P-Value : 0.004852
##
## Sensitivity : 0.7131
## Specificity : 0.8397
## Pos Pred Value : 0.8164
## Neg Pred Value : 0.7453
## Prevalence : 0.5000
## Detection Rate : 0.3565
## Detection Prevalence : 0.4367
## Balanced Accuracy : 0.7764
##
## 'Positive' Class : 0
##
The overall accuracy of the model is 0.7764 and the kappa coefficient is 0.5527, indicating moderate agreement between the predicted and actual classifications. However, the gap between this test accuracy and the cross-validated estimate (0.549) makes me suspect the model may be overfitted.
Here I will proceed to score the future_fundraising.rds holdout file and produce the upload file.
set.seed(12345)
# `test` was overwritten by the 80/20 split above, so re-read the holdout file
holdout <- readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\future_fundraising.rds")
pred <- predict(model, newdata = holdout)
pred <- as.data.frame(pred)
# Convert factor levels to "Donor" and "No Donor" (the model predicts 0/1, where 1 = Donor)
levels(pred$pred) <- c("No Donor", "Donor")
# Rename column to "value"
colnames(pred) <- "value"
#write.csv(pred, file ="C:\\Users\\lazar\\Downloads\\values_rf.csv" )
The reason for using weighted sampling to produce a training set with equal numbers of donors and non-donors is to address the issue of class imbalance. In this case, the original dataset contains a significantly higher number of non-donors than donors. If a simple random sample is taken from the original dataset, the training set may not have enough donors to accurately represent the donor class. This can lead to a biased model that performs poorly in predicting donor behavior.
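To make the idea concrete, here is a minimal sketch of how such a balanced sample could be drawn from a raw donor file. Here raw_data is hypothetical, since the provided fundraising.rds already arrives balanced:
# raw_data is a hypothetical unbalanced file with a 'target' column;
# downsample the majority class so both classes appear equally often
library(dplyr)
balanced <- raw_data %>%
group_by(target) %>%
slice_sample(n = min(table(raw_data$target))) %>%
ungroup()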
The important variables largely agreed across the two models, with avg_gift and home_value near the top of both rankings.
Note that both models were fit using ALL of the variables.
The XGBoost model performed poorly in cross-validation, with an accuracy of only 53% and a kappa coefficient of 0.0533. By contrast, the random forest model with random search reached an accuracy of 0.7764 and a kappa coefficient of 0.5527 on the test split, indicating moderate agreement between predicted and actual classifications, although an accuracy that high raises concerns about overfitting. (The corresponding tables appear with each model above.) On our own data, the random forest dominates.
It is important to note that for XGBoost I had to use the default parameters (explained in the recommendations); however, a 0.5 cutoff was used to convert the predicted probabilities into classes.
Interestingly, when the models were evaluated using the leaderboard website, the XGBoost model outperformed the random forest with an accuracy score of 61.7% compared to 58.33%.
Although the random forest model initially appeared promising, its underperformance in the modeling competition suggests that it might have been overfitted to the training data. On the other hand, despite the XGBoost model’s lower accuracy score on the training set, it proved to be more effective in predicting outcomes in the competition.
Overall, I believe that on our training data the random forest is the best model; in actual application, however, XGBoost takes the win.
As far as the modeling goes in XGBoost, I attempted to define a hyperparameter space such as the following:
# # Define the model's hyperparameter search space
# param_grid <- expand.grid(
# nrounds = c(50),
# max_depth = c(3, 4, 5),
# eta = c(0.01, 0.02, 0.05, 0.1, 0.2),
# gamma = c( 0.1, 0.2, 0.3, 0.4),
# colsample_bytree = c(0.5, 0.6, 0.7),
# min_child_weight = c(1, 2, 3, 4, 5),
# subsample = c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
# )
However, this search took more than an entire night to run (and did not finish), so I decided it was too computationally expensive to continue. The same applies to my extended caret model:
# customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
# customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
# customRF$grid <- function(x, y, len = NULL, search = "grid") {}
# customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
# }
# customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata)
# customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata, type = "prob")
# customRF$sort <- function(x) x[order(x[,1]),]
# customRF$levels <- function(x) x$classes
I attempted to tune hyperparameters by defining a search space for XGBoost and by using the extended caret model above. Given the computational expense, I recommend using better hardware, such as a compute cluster, to speed up training on this data set; even then, overfitting can still occur.
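A minimal sketch of what that could look like on a single machine, assuming the doParallel package (not used elsewhere in this report): caret automatically distributes the repeated-CV resamples across the registered workers.
library(doParallel) # assumption: doParallel is installed
cl <- makePSOCKcluster(parallel::detectCores() - 1) # leave one core free
registerDoParallel(cl)
# ...re-run the train() calls above; caret picks up the registered backend...
stopCluster(cl)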
I recommend adding more variables to the data set, such as marketing, holiday seasons, and advertisement media, to help pinpoint the donor population. These variables can provide additional insights into the factors that influence donor behavior.
In addition to XGBoost and random forest models, I suggest using other classification models to provide more insights into the data. Each model has its own strengths and weaknesses, and using a variety of models can help to identify patterns and relationships in the data that may be missed by a single model.
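For example, a logistic regression baseline could be run through the same caret workflow already used above. This is a sketch under the objects defined earlier (train, metric, control), not a model evaluated in this report:
# Sketch: a logistic regression baseline on the same resampling scheme
glm_fit <- train(target ~ ., data = train, method = "glm",
metric = metric, trControl = control)
print(glm_fit)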
Finally, I suggest exploring more advanced techniques such as ensemble learning or deep learning to improve the predictive accuracy of the model. These techniques can be more complex and require more computational resources, but they can also provide significant improvements in performance.