A national veterans’ organization aims to improve the cost-effectiveness of its direct marketing campaign. With an in-house database of over 13 million donors, the organization is one of the largest direct-mail fundraisers in the United States. This report presents the background, business objectives, and data used to develop a classification model for capturing donors more effectively and maximizing expected net profit.
According to the organization’s recent mailing records:
To achieve the desired goal, we will:
Successful businesses are based on both goals and objectives, as they clarify the purpose of the business and help identify necessary actions. Goals are general statements of desired achievement, while objectives are the specific steps or actions you take to reach your goal.
The organization has made available a sample data file named “fundraising.rds” from its latest fundraising campaign. This data set consists of 21 variables and 3,000 records. It is worth noting that the data is weighted, with a deliberate under-representation of non-donors, which gives us a balanced sample of donors and non-donors for the analysis.
* The target variable, “target,” is a categorical variable indicating whether an individual is a donor or not.
* The remaining variables are either categorical or numerical.
* The categorical variables are zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, and female.
* The numerical variables are num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift.
train = readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\fundraising.rds")
test = readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\future_fundraising.rds")
sum(is.na(train))
## [1] 0
summary(train)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Calculating Correlations
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
# temp: an all-numeric copy of train (factors coerced to numeric codes) so cor() can run
temp = data.frame(lapply(train, as.numeric))
res = cor(temp)
round(res, 2)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner
## zipconvert2 1.00 -0.25 -0.27 -0.42 0.02
## zipconvert3 -0.25 1.00 -0.25 -0.38 -0.08
## zipconvert4 -0.27 -0.25 1.00 -0.41 -0.04
## zipconvert5 -0.42 -0.38 -0.41 1.00 0.08
## homeowner 0.02 -0.08 -0.04 0.08 1.00
## num_child 0.03 -0.01 0.03 -0.04 0.05
## income 0.00 -0.09 -0.02 0.10 0.32
## female -0.01 -0.02 0.02 0.01 0.00
## wealth 0.04 -0.08 -0.01 0.04 0.07
## home_value -0.13 -0.24 -0.18 0.45 0.12
## med_fam_inc -0.09 -0.10 -0.04 0.18 0.14
## avg_fam_inc -0.08 -0.11 -0.05 0.20 0.13
## pct_lt15k 0.05 0.03 0.06 -0.12 -0.14
## num_prom -0.03 0.04 0.03 -0.03 0.00
## lifetime_gifts -0.03 -0.01 0.03 0.01 -0.03
## largest_gift -0.02 0.00 -0.02 0.03 -0.03
## last_gift -0.01 -0.05 -0.03 0.07 0.00
## months_since_donate 0.02 -0.02 0.03 -0.02 0.02
## time_lag -0.02 0.04 -0.01 -0.01 0.03
## avg_gift -0.01 -0.06 -0.03 0.08 -0.01
## target -0.01 -0.01 -0.01 0.02 0.03
## num_child income female wealth home_value med_fam_inc
## zipconvert2 0.03 0.00 -0.01 0.04 -0.13 -0.09
## zipconvert3 -0.01 -0.09 -0.02 -0.08 -0.24 -0.10
## zipconvert4 0.03 -0.02 0.02 -0.01 -0.18 -0.04
## zipconvert5 -0.04 0.10 0.01 0.04 0.45 0.18
## homeowner 0.05 0.32 0.00 0.07 0.12 0.14
## num_child 1.00 0.09 -0.03 0.06 -0.01 0.05
## income 0.09 1.00 -0.04 0.21 0.29 0.37
## female -0.03 -0.04 1.00 -0.03 -0.02 -0.02
## wealth 0.06 0.21 -0.03 1.00 0.26 0.38
## home_value -0.01 0.29 -0.02 0.26 1.00 0.74
## med_fam_inc 0.05 0.37 -0.02 0.38 0.74 1.00
## avg_fam_inc 0.05 0.38 -0.03 0.39 0.75 0.97
## pct_lt15k -0.03 -0.28 0.06 -0.38 -0.40 -0.67
## num_prom -0.09 -0.07 0.04 -0.41 -0.06 -0.05
## lifetime_gifts -0.05 -0.02 0.04 -0.23 -0.02 -0.04
## largest_gift -0.02 0.03 0.00 -0.03 0.06 0.05
## last_gift -0.01 0.11 -0.05 0.05 0.16 0.14
## months_since_donate -0.01 0.08 -0.05 0.03 0.02 0.03
## time_lag -0.01 0.00 -0.01 -0.07 0.00 0.02
## avg_gift -0.02 0.12 -0.07 0.09 0.17 0.14
## target -0.04 0.04 0.02 0.00 0.02 0.01
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## zipconvert2 -0.08 0.05 -0.03 -0.03 -0.02
## zipconvert3 -0.11 0.03 0.04 -0.01 0.00
## zipconvert4 -0.05 0.06 0.03 0.03 -0.02
## zipconvert5 0.20 -0.12 -0.03 0.01 0.03
## homeowner 0.13 -0.14 0.00 -0.03 -0.03
## num_child 0.05 -0.03 -0.09 -0.05 -0.02
## income 0.38 -0.28 -0.07 -0.02 0.03
## female -0.03 0.06 0.04 0.04 0.00
## wealth 0.39 -0.38 -0.41 -0.23 -0.03
## home_value 0.75 -0.40 -0.06 -0.02 0.06
## med_fam_inc 0.97 -0.67 -0.05 -0.04 0.05
## avg_fam_inc 1.00 -0.68 -0.06 -0.04 0.04
## pct_lt15k -0.68 1.00 0.04 0.06 -0.01
## num_prom -0.06 0.04 1.00 0.54 0.11
## lifetime_gifts -0.04 0.06 0.54 1.00 0.51
## largest_gift 0.04 -0.01 0.11 0.51 1.00
## last_gift 0.13 -0.06 -0.06 0.20 0.45
## months_since_donate 0.03 -0.01 -0.28 -0.14 0.02
## time_lag 0.02 -0.02 0.12 0.04 0.04
## avg_gift 0.13 -0.06 -0.15 0.18 0.47
## target 0.00 0.00 0.07 0.02 -0.02
## last_gift months_since_donate time_lag avg_gift target
## zipconvert2 -0.01 0.02 -0.02 -0.01 -0.01
## zipconvert3 -0.05 -0.02 0.04 -0.06 -0.01
## zipconvert4 -0.03 0.03 -0.01 -0.03 -0.01
## zipconvert5 0.07 -0.02 -0.01 0.08 0.02
## homeowner 0.00 0.02 0.03 -0.01 0.03
## num_child -0.01 -0.01 -0.01 -0.02 -0.04
## income 0.11 0.08 0.00 0.12 0.04
## female -0.05 -0.05 -0.01 -0.07 0.02
## wealth 0.05 0.03 -0.07 0.09 0.00
## home_value 0.16 0.02 0.00 0.17 0.02
## med_fam_inc 0.14 0.03 0.02 0.14 0.01
## avg_fam_inc 0.13 0.03 0.02 0.13 0.00
## pct_lt15k -0.06 -0.01 -0.02 -0.06 0.00
## num_prom -0.06 -0.28 0.12 -0.15 0.07
## lifetime_gifts 0.20 -0.14 0.04 0.18 0.02
## largest_gift 0.45 0.02 0.04 0.47 -0.02
## last_gift 1.00 0.19 0.08 0.87 -0.08
## months_since_donate 0.19 1.00 0.02 0.19 -0.13
## time_lag 0.08 0.02 1.00 0.07 0.01
## avg_gift 0.87 0.19 0.07 1.00 -0.08
## target -0.08 -0.13 0.01 -0.08 1.00
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
There appears to be a strong negative correlation between med_fam_inc and avg_fam_inc on the one hand and pct_lt15k on the other (roughly -0.67 and -0.68, respectively).
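For reference, the two coefficients behind this observation can be pulled directly from the correlation matrix computed above:
res["med_fam_inc", "pct_lt15k"] # approximately -0.67
res["avg_fam_inc", "pct_lt15k"] # approximately -0.68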
highly_correlated <- findCorrelation(res, cutoff = 0.8, verbose = TRUE)
## Compare row 12 and column 11 with corr 0.972
## Means: 0.214 vs 0.106 so flagging column 12
## Compare row 20 and column 17 with corr 0.866
## Means: 0.151 vs 0.098 so flagging column 20
## All correlations <= 0.8
highly_correlated
## [1] 12 20
findCorrelation() compares each variable’s mean absolute correlation with the rest. Column 12 (avg_fam_inc) and column 11 (med_fam_inc) have a correlation of 0.972; since avg_fam_inc has the higher mean absolute correlation with the remaining variables (0.214 vs. 0.106), column 12 (avg_fam_inc) is flagged as redundant.
Likewise, column 20 (avg_gift) and column 17 (last_gift) have a correlation of 0.866, and avg_gift’s higher mean absolute correlation (0.151 vs. 0.098) causes column 20 (avg_gift) to be flagged.
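If we wanted to act on these flags, the redundant columns could be dropped before modeling. This is only a sketch; the models below retain all of the predictors:
# Sketch: remove the flagged columns (avg_fam_inc and avg_gift) from the numeric copy
reduced <- temp[, -highly_correlated]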
library(ggplot2)
# Set the figure size and resolution (repr options take effect in Jupyter/IRkernel sessions)
options(repr.plot.width=16, repr.plot.height=10, repr.plot.res=300)
# Define the list of excluded variables
excluded_vars <- c("zipconvert2","zipconvert3","zipconvert4", "zipconvert5", "female")
# Filter only numeric columns and remove excluded variables
numeric_data <- temp[sapply(temp, is.numeric)]
filtered_data <- numeric_data[!(names(numeric_data) %in% excluded_vars)]
# Remove rows with missing values from the filtered_data
filtered_data <- na.omit(filtered_data)
# Melt the filtered data into a long format
library(reshape2)
melted_data <- melt(filtered_data)
## No id variables; using all as measure variables
# Create the histograms
histograms <- ggplot(melted_data, aes(x = value)) +
geom_histogram(bins = 30, color = "black", fill = "skyblue", alpha = 0.7) +
facet_wrap(~ variable, scales = "free_x", ncol = 3) +
theme_minimal() +
theme(strip.text.x = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# Display the histograms
print(histograms)
Looking at the histograms, our numeric variables do not appear normally distributed; most are right-skewed, with long upper tails on variables such as lifetime_gifts, largest_gift, and avg_gift.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Select the desired variables from your dataset
selected_data <- temp %>%
select(zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, num_child, income,
female, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom,
lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, avg_gift, target)
# Create the pairwise plot using ggpairs()
# (target is numeric in temp, so ggplot drops the fill aesthetic in the
# continuous panels below; converting target to a factor and mapping colour
# would colour the panels by donor status)
pairwise_plot <- selected_data %>%
ggpairs(aes(fill = target))
# Display the pairwise plot
print(pairwise_plot)
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## (This warning was emitted once per panel; the verbatim repeats are omitted.)
The pairwise plot is a compact way to visualize all of these variables at once. The strong correlations from the correlation matrix are clearly visible, and several apparently non-linear relationships show up as well. The numeric variables also exhibit a noticeable degree of skewness, with both the skewness and skew.2SE (skew divided by twice its standard error) values above 1, which indicates non-normality.
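The skew.2SE figures referenced above can be reproduced with pastecs::stat.desc(); note that pastecs is an assumption here, as it is not loaded anywhere else in this report. Values of skew.2SE above 1 indicate significant skewness:
library(pastecs) # assumption: pastecs is installed; it is not used elsewhere in this report
round(stat.desc(filtered_data, norm = TRUE)["skew.2SE", ], 2)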
Now I am ready to begin analyzing the data, starting with the XGBoost model.
Although this model was not taught in class, I would like to use it, and the professor approved it for this project.
XGBoost (Extreme Gradient Boosting) is a popular machine learning algorithm that uses a boosted tree model to solve regression, classification, and ranking problems. The model is based on an ensemble of decision trees that are trained sequentially to correct errors made by the previous trees.
The main advantages of XGBoost include its ability to handle large datasets, its fast training speed, and its high predictive accuracy, which has made it a popular choice in data science competitions and industry applications. XGBoost also includes various regularization techniques to prevent overfitting and can handle missing values and outliers.
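As a concrete illustration of those regularization techniques, these are the main knobs xgboost exposes; the values below are purely illustrative, not the settings used in this report:
# Illustrative xgboost parameter list; values are examples, not tuned settings
params <- list(objective = "binary:logistic",
eta = 0.1, # learning rate: shrinks each tree's contribution
max_depth = 4, # caps tree depth to limit model complexity
lambda = 1, # L2 regularization on leaf weights
alpha = 0, # L1 regularization on leaf weights
subsample = 0.8) # row subsampling, a further guard against overfitting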
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
library(dplyr)
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.3
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ xgboost::slice() masks dplyr::slice()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 4.2.3
Here I will create the variable k to cross-validate my data.
k will be set to 5 folds, so each iteration trains on 80% of the rows and tests on the remaining 20%.
# set k folds:
k = 5 #allows for 2400 train (80% training) and 600 test (20% testing)
fold_nonNumeric = vector(mode = "list", length = k)
ID = vector(mode = "list", length = k)
pred = vector(mode = "list", length = k)
predID = vector(mode = "list", length = k)
status = vector(mode = "list", length = k)
Here we set a random seed to ensure reproducibility of the results, assign a unique ID to each row of the train dataset, and then create the k folds for cross-validation: the rows are randomly shuffled, each row is assigned to one of the k folds, and the dataset is sorted by fold. The result is a shuffled dataset with an additional fold column that drives the cross-validation below.
set.seed(12345)
# recode target to 0/1 (Donor = 1): xgboost's binary:logistic objective needs a numeric label
train$target = ifelse(train$target == "Donor", 1, 0)
# unique ID
train$index = 1:nrow(train)
# shuffle it up
train <- train %>%
sample_frac(size = 1) %>% # randomly shuffle the rows
mutate(fold = rep(1:k, length = n())) %>% # repeat 1:k out to the number of rows
arrange(fold)
Once the folds are created, we perform the 5-fold cross-validation using XGBoost:
set.seed(12345)
for(i in 1:k){
train_cv <- train %>%
filter(fold != i)
test_cv <- train %>%
filter(fold == i)
train_cv$index = NULL
test_cv$index = NULL
train_cv$fold = NULL
test_cv$fold = NULL
x_train <- train_cv
x_test <- test_cv
x_train$target = NULL
x_test$target = NULL
#loopID = test_cv[,22] #gets the index number
loopStatus = test_cv[,21]
#loopID = as.matrix(loopID)
loopStatus = as.matrix(loopStatus)
x_train = data.matrix(x_train) # data.matrix coerces the remaining factor columns to numeric codes, as xgboost requires
x_test = data.matrix(x_test)
test_cv = as.data.frame(test_cv)
train_cv = as.data.frame(train_cv)
#dtrain = xgb.DMatrix(x_train, label = train_cv$target)
#dtest = xgb.DMatrix(x_test, label = test_cv$target)
model = xgboost(data = x_train, label = train_cv$target,
nrounds = 10,
objective = "binary:logistic",
eval_metric = "error")
pred = predict(model, x_test)
fold_nonNumeric[[i]] = pred
#ID[[i]] = loopID
status[[i]] = loopStatus
}
## [1] train-error:0.362500
## [2] train-error:0.314167
## [3] train-error:0.296667
## [4] train-error:0.263333
## [5] train-error:0.256250
## [6] train-error:0.252917
## [7] train-error:0.230417
## [8] train-error:0.220417
## [9] train-error:0.199583
## [10] train-error:0.199583
## [1] train-error:0.367083
## [2] train-error:0.311667
## [3] train-error:0.307917
## [4] train-error:0.290417
## [5] train-error:0.277917
## [6] train-error:0.247083
## [7] train-error:0.246667
## [8] train-error:0.228750
## [9] train-error:0.215833
## [10] train-error:0.206250
## [1] train-error:0.357500
## [2] train-error:0.324167
## [3] train-error:0.297500
## [4] train-error:0.279583
## [5] train-error:0.269583
## [6] train-error:0.236250
## [7] train-error:0.225833
## [8] train-error:0.216667
## [9] train-error:0.201250
## [10] train-error:0.196667
## [1] train-error:0.364583
## [2] train-error:0.321667
## [3] train-error:0.310417
## [4] train-error:0.272500
## [5] train-error:0.262500
## [6] train-error:0.259167
## [7] train-error:0.251667
## [8] train-error:0.236250
## [9] train-error:0.237083
## [10] train-error:0.230417
## [1] train-error:0.357500
## [2] train-error:0.331250
## [3] train-error:0.285833
## [4] train-error:0.265833
## [5] train-error:0.251667
## [6] train-error:0.251667
## [7] train-error:0.250000
## [8] train-error:0.225833
## [9] train-error:0.218333
## [10] train-error:0.208750
ID = unlist(ID) # the ID entries were never filled in above (those lines are commented out), so this is NULL and cbind() drops it
fold_nonNumeric = unlist(fold_nonNumeric)
status = unlist(status)
Predictions = cbind(ID, fold_nonNumeric, status)
importance_matrix = xgb.importance(colnames(x_train), model = model)
xgb.plot.importance(importance_matrix)
xgb.plot.tree(model = model, trees = 1)
The variables that our model found to be the most important are home_value, lifetime_gifts, and avg_gift.
We can also visualize the individual trees created by the XGBoost model.
Predictions = as.data.frame(Predictions)
Predictions$binary_pred <- ifelse(Predictions$fold_nonNumeric > 0.5, 1, 0)
library(caret)
conf_mat <- confusionMatrix(factor(Predictions$binary_pred), factor(Predictions$status))
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 809 728
## 1 692 771
##
## Accuracy : 0.5267
## 95% CI : (0.5086, 0.5447)
## No Information Rate : 0.5003
## P-Value [Acc > NIR] : 0.002072
##
## Kappa : 0.0533
##
## Mcnemar's Test P-Value : 0.352991
##
## Sensitivity : 0.5390
## Specificity : 0.5143
## Pos Pred Value : 0.5264
## Neg Pred Value : 0.5270
## Prevalence : 0.5003
## Detection Rate : 0.2697
## Detection Prevalence : 0.5123
## Balanced Accuracy : 0.5267
##
## 'Positive' Class : 0
##
Despite being an efficient and scalable gradient boosting algorithm that is particularly effective on large datasets and complex classification and regression problems, XGBoost did not perform well here: cross-validated accuracy was only about 53%, with a kappa of 0.0533, barely better than chance.
Now that the model is trained, the next step is to score the holdout file; its variables first have to be reformatted to match the training data.
My second model of choice is Random Forest
A random forest is an ensemble learning method that constructs multiple decision trees during training and combines their predictions to produce a more accurate and robust result. It reduces overfitting by averaging the predictions of multiple trees and can handle both classification and regression tasks.
Note that this differs from XGBoost, which is an optimized implementation of gradient boosting and builds trees sequentially, with each new tree correcting the errors made by the previous ones. While both methods use decision trees, XGBoost employs boosting to minimize the loss function, whereas random forests use bagging to reduce model variance.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.2.3
library(caret)
Below, the data is split into 80% train and 20% test.
library(caret)
library(vcd)
## Warning: package 'vcd' was built under R version 4.2.3
## Loading required package: grid
# Set the seed for reproducibility
set.seed(12345)
# Create an index for the 80/20 split
index <- createDataPartition(train$target, p = 0.8, list = FALSE)
# Split the data into training and test sets
train$index <- NULL # drop the helper columns created for the CV loop above
train$fold <- NULL
trainn <- train # keep a copy of the full sample before overwriting train
train <- trainn[index, ]
test <- trainn[-index, ]
x = as.data.frame(train[,1:20])
y = as.data.frame(train[,21])
Creating our model
train$target <- as.factor(train$target) # convert to factor
control = trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed = 12345
metric = "Accuracy"
mtry = sqrt(ncol(x))
tunegrid = expand.grid(.mtry = mtry)
rf_default <- train(target~., data=train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest
##
## 2400 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2160, 2160, 2160, 2160, 2160, 2160, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5174972 0.03461029
##
## Tuning parameter 'mtry' was held constant at a value of 4.472136
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(12345)
mtry <- sqrt(ncol(x))
rf_random <- train(target~., data=train, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 2400 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2160, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.5493051 0.09762545
## 2 0.5233364 0.04617784
## 6 0.5161089 0.03182405
## 7 0.5190255 0.03776252
## 8 0.5237426 0.04721077
## 10 0.5216650 0.04304579
## 11 0.5166632 0.03310103
## 12 0.5162478 0.03223786
## 16 0.5188902 0.03755671
## 17 0.5183213 0.03639208
## 20 0.5187472 0.03719054
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(rf_random)
I ended up using a random search to identify the best parameter. The model was trained using cross-validation with 10 folds and 3 repetitions, and the best value for the mtry parameter was found to be 1, based on the highest accuracy score. The final model had an accuracy of 0.549 and a kappa value of 0.098.
# Get variable importance scores using varImp() function
var_importance <- varImp(rf_random)
# Print the variable importance scores
print(var_importance)
## rf variable importance
##
## Overall
## avg_gift 100.0000
## home_value 91.6774
## med_fam_inc 90.1522
## lifetime_gifts 88.7047
## avg_fam_inc 86.6479
## num_prom 83.8573
## largest_gift 81.3663
## months_since_donate 80.3790
## pct_lt15k 77.8428
## last_gift 75.9321
## time_lag 70.6874
## income 44.3764
## wealth 41.5966
## num_child 7.9709
## homeownerNo 1.8291
## homeownerYes 1.5980
## femaleNo 0.8964
## zipconvert5No 0.7741
## femaleYes 0.4881
## zipconvert5Yes 0.0000
# Plot the variable importance scores
plot(var_importance, main = "Variable Importance Plot")
Similar to the XGBoost model, the most important variables include:
* avg_gift
* home_value
* med_fam_inc
Extended caret:
Here we create a custom random forest model for use with the caret package, specifying the model type, library, parameters, and functions for creating a grid, fitting the model, making predictions, calculating probabilities, sorting results, and extracting class levels. This grid takes too long to run, so the code is left commented out.
# customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
# customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
# customRF$grid <- function(x, y, len = NULL, search = "grid") {}
# customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
# }
# customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata)
# customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata, type = "prob")
# customRF$sort <- function(x) x[order(x[,1]),]
# customRF$levels <- function(x) x$classes
# # train model
# control <- trainControl(method="repeatedcv", number=10, repeats=3)
# tunegrid <- expand.grid(.mtry=c(1:15), .ntree=c(1000, 1500, 2000, 2500))
# set.seed(seed)
# custom <- train(target~., data=train, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)
# summary(custom)
# plot(custom)
# Extract the best parameter value
best_mtry <- rf_random$bestTune$mtry
# Train a new model on the full training set using the best parameter value
model <- randomForest(target ~ ., data = train, mtry = best_mtry)
# Generate predictions on the test set using the trained model
predictions <- predict(model, newdata = test)
# Evaluate the performance of the model on the test set using the same metric as before
accuracy <- mean(predictions == test$target)
# Print the performance metrics
cat("Accuracy on test set: ", accuracy, "\n")
## Accuracy on test set: 0.778481
library(ROCR)
# Generate predictions on the test set using the trained model
predictions <- predict(model, newdata = test)
# Convert predictions and test$target to factors with the same levels
predictions <- factor(predictions, levels = c("0", "1"))
test$target <- factor(test$target, levels = c("0", "1"))
# Create a confusion matrix using caret::confusionMatrix()
conf_matrix <- confusionMatrix(predictions, test$target)
# Print the confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 169 38
## 1 68 199
##
## Accuracy : 0.7764
## 95% CI : (0.7361, 0.8131)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5527
##
## Mcnemar's Test P-Value : 0.004852
##
## Sensitivity : 0.7131
## Specificity : 0.8397
## Pos Pred Value : 0.8164
## Neg Pred Value : 0.7453
## Prevalence : 0.5000
## Detection Rate : 0.3565
## Detection Prevalence : 0.4367
## Balanced Accuracy : 0.7764
##
## 'Positive' Class : 0
##
The overall accuracy of the model is 0.7764 and the kappa coefficient is 0.5527, indicating moderate agreement between the predicted and actual classifications. However, the gap between this test accuracy and the cross-validated estimate (0.549) makes me suspect the model may be overfitted.
Here I will proceed to score the future_fundraising.rds holdout file and produce the upload file.
set.seed(12345)
# `test` was overwritten by the 80/20 split above, so re-read the holdout file
holdout <- readRDS("C:\\Users\\lazar\\Documents\\Spring2023\\DataMining\\FinalProj\\future_fundraising.rds")
pred <- predict(model, newdata = holdout)
pred <- as.data.frame(pred)
# Convert factor levels to "Donor" and "No Donor" (the model predicts 0/1, where 1 = Donor)
levels(pred$pred) <- c("No Donor", "Donor")
# Rename column to "value"
colnames(pred) <- "value"
#write.csv(pred, file ="C:\\Users\\lazar\\Downloads\\values_rf.csv" )
The reason for using weighted sampling to produce a training set with equal numbers of donors and non-donors is to address the issue of class imbalance. In this case, the original dataset contains a significantly higher number of non-donors than donors. If a simple random sample is taken from the original dataset, the training set may not have enough donors to accurately represent the donor class. This can lead to a biased model that performs poorly in predicting donor behavior.
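To make the idea concrete, here is a minimal sketch of how such a balanced sample could be drawn from a raw donor file. Here raw_data is hypothetical, since the provided fundraising.rds already arrives balanced:
# raw_data is a hypothetical unbalanced file with a 'target' column;
# downsample the majority class so both classes appear equally often
library(dplyr)
balanced <- raw_data %>%
group_by(target) %>%
slice_sample(n = min(table(raw_data$target))) %>%
ungroup()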
The important variables largely agreed across the two models, with avg_gift and home_value near the top of both rankings.
Note that both models were fit using ALL of the variables.
The XGBoost model performed poorly in cross-validation, with an accuracy of only 53% and a kappa coefficient of 0.0533. By contrast, the random forest model with random search reached an accuracy of 0.7764 and a kappa coefficient of 0.5527 on the test split, indicating moderate agreement between predicted and actual classifications, although an accuracy that high raises concerns about overfitting. (The corresponding tables appear with each model above.) On our own data, the random forest dominates.
It is important to note that for XGBoost I had to use the default parameters (explained in the recommendations); however, a 0.5 cutoff was used to convert the predicted probabilities into classes.
Interestingly, when the models were evaluated using the leaderboard website, the XGBoost model outperformed the random forest with an accuracy score of 61.7% compared to 58.33%.
Although the random forest model initially appeared promising, its underperformance in the modeling competition suggests that it might have been overfitted to the training data. On the other hand, despite the XGBoost model’s lower accuracy score on the training set, it proved to be more effective in predicting outcomes in the competition.
Overall, I believe that on our training data the random forest is the best model; in actual application, however, XGBoost takes the win.
As far as the modeling goes in XGBoost, I attempted to define a hyperparameter space such as the following:
# # Define the model's hyperparameter search space
# param_grid <- expand.grid(
# nrounds = c(50),
# max_depth = c(3, 4, 5),
# eta = c(0.01, 0.02, 0.05, 0.1, 0.2),
# gamma = c( 0.1, 0.2, 0.3, 0.4),
# colsample_bytree = c(0.5, 0.6, 0.7),
# min_child_weight = c(1, 2, 3, 4, 5),
# subsample = c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
# )
However, this search took more than an entire night to run (and did not finish), so I decided it was too computationally expensive to continue. The same applies to my extended caret model:
# customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
# customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
# customRF$grid <- function(x, y, len = NULL, search = "grid") {}
# customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
# }
# customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata)
# customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
# predict(modelFit, newdata, type = "prob")
# customRF$sort <- function(x) x[order(x[,1]),]
# customRF$levels <- function(x) x$classes
I attempted to tune hyperparameters by defining a search space for XGBoost and by using the extended caret model above. Given the computational expense, I recommend using better hardware, such as a compute cluster, to speed up training on this data set; even then, overfitting can still occur.
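A minimal sketch of what that could look like on a single machine, assuming the doParallel package (not used elsewhere in this report): caret automatically distributes the repeated-CV resamples across the registered workers.
library(doParallel) # assumption: doParallel is installed
cl <- makePSOCKcluster(parallel::detectCores() - 1) # leave one core free
registerDoParallel(cl)
# ...re-run the train() calls above; caret picks up the registered backend...
stopCluster(cl)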
I recommend adding more variables to the data set, such as marketing, holiday seasons, and advertisement media, to help pinpoint the donor population. These variables can provide additional insights into the factors that influence donor behavior.
In addition to XGBoost and random forest models, I suggest using other classification models to provide more insights into the data. Each model has its own strengths and weaknesses, and using a variety of models can help to identify patterns and relationships in the data that may be missed by a single model.
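For example, a logistic regression baseline could be run through the same caret workflow already used above. This is a sketch under the objects defined earlier (train, metric, control), not a model evaluated in this report:
# Sketch: a logistic regression baseline on the same resampling scheme
glm_fit <- train(target ~ ., data = train, method = "glm",
metric = metric, trControl = control)
print(glm_fit)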
Finally, I suggest exploring more advanced techniques such as ensemble learning or deep learning to improve the predictive accuracy of the model. These techniques can be more complex and require more computational resources, but they can also provide significant improvements in performance.