Lauren Hall (DDV339), Christian Rivera (HVQ288), Erwin Gonzalez
(KJJ953)
Background: The Fundraising file comes from a national veterans
organization whose goal is to minimize direct-marketing campaign
expenditures through predictive modeling. This will be achieved by
leveraging a balanced dataset to train a model whose target variable is
donor vs. non-donor. Several classification methods will be applied to
the dataset: Logistic Regression, Random Forest, and XGBoost.
library(ggplot2)
library(tidyverse)
library(corrplot)
library(GGally)
library(caret)
library(randomForest)
library(e1071)
library(xgboost)
library(yardstick)
library(readr)
library(dplyr)
Step 1: Partitioning the data into an 80/20 training/testing split
The data is split 80/20 so that a held-out portion is available for
evaluating our classification models. The training data contains 2,400
observations and the test data contains 600; both contain all 21
variables.
# Load data
data <- read.csv("C:/Users/lclha/Documents/MSDA Program 2024-2026/Summer 2025/Predictive Modeling/Final Project/fundraising.csv")
set.seed(1)
# 80% training indices
train_indices <- sample(1:nrow(data), size = 0.8 * nrow(data))
# Split the data 80/20
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
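A quick dimension check (a minimal sanity-check sketch) confirms the split sizes quoted above:
# Sanity check: expect 2400 x 21 (train) and 600 x 21 (test)
dim(train_data)
dim(test_data)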
Step 2: Model Building
1. Exploratory data analysis. Examine the predictors and evaluate
their association with the response variable. Which might be good
candidate predictors? Are any collinear with each other?
The fundraising data contains 3,000 observations and 21 variables: a
mixture of 7 categorical and 14 numerical variables.
str(data)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : chr "Yes" "No" "No" "No" ...
## $ zipconvert3 : chr "No" "No" "No" "Yes" ...
## $ zipconvert4 : chr "No" "No" "No" "No" ...
## $ zipconvert5 : chr "No" "Yes" "Yes" "No" ...
## $ homeowner : chr "Yes" "No" "Yes" "Yes" ...
## $ num_child : int 1 2 1 1 1 1 1 1 1 1 ...
## $ income : int 1 5 3 4 4 4 4 4 4 1 ...
## $ female : chr "No" "Yes" "No" "No" ...
## $ wealth : int 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : int 698 828 1471 547 482 857 505 1438 1316 428 ...
## $ med_fam_inc : int 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : int 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : int 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : int 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: int 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : int 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : chr "Donor" "Donor" "No Donor" "No Donor" ...
summary(data)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5
## Length:3000 Length:3000 Length:3000 Length:3000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## homeowner num_child income female
## Length:3000 Min. :1.000 Min. :1.000 Length:3000
## Class :character 1st Qu.:1.000 1st Qu.:3.000 Class :character
## Mode :character Median :1.000 Median :4.000 Mode :character
## Mean :1.069 Mean :3.899
## 3rd Qu.:1.000 3rd Qu.:5.000
## Max. :5.000 Max. :7.000
## wealth home_value med_fam_inc avg_fam_inc
## Min. :0.000 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0 1st Qu.: 318.0
## Median :8.000 Median : 816.5 Median : 355.0 Median : 396.0
## Mean :6.396 Mean :1143.3 Mean : 388.4 Mean : 432.3
## 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0 3rd Qu.: 516.0
## Max. :9.000 Max. :5945.0 Max. :1500.0 Max. :1331.0
## pct_lt15k num_prom lifetime_gifts largest_gift
## Min. : 0.00 Min. : 11.00 Min. : 15.0 Min. : 5.00
## 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0 1st Qu.: 10.00
## Median :12.00 Median : 48.00 Median : 81.0 Median : 15.00
## Mean :14.71 Mean : 49.14 Mean : 110.7 Mean : 16.65
## 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0 3rd Qu.: 20.00
## Max. :90.00 Max. :157.00 Max. :5674.9 Max. :1000.00
## last_gift months_since_donate time_lag avg_gift
## Min. : 0.00 Min. :17.00 Min. : 0.000 Min. : 2.139
## 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000 1st Qu.: 6.333
## Median : 10.00 Median :31.00 Median : 5.000 Median : 9.000
## Mean : 13.48 Mean :31.13 Mean : 6.876 Mean : 10.669
## 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000 3rd Qu.: 12.800
## Max. :219.00 Max. :37.00 Max. :77.000 Max. :122.167
## target
## Length:3000
## Class :character
## Mode :character
##
##
##
# Convert character columns (including target) to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))
# Separate numeric and categorical predictors
numeric_vars <- data %>% dplyr::select(where(is.numeric))
cat_vars <- data %>% select(where(~ !is.numeric(.)))
# Histograms of the numeric variables
numeric_vars %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free") +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal()

[Figure: histograms of the numeric predictors]

# Boxplots of numeric predictors by response (for classification)
if (is.factor(data$target)) {
  numeric_vars %>%
    mutate(response = data$target) %>%
    pivot_longer(-response) %>%
    ggplot(aes(x = response, y = value)) +
    geom_boxplot() +
    facet_wrap(~name, scales = "free") +
    theme_minimal()
}

[Figure: boxplots of the numeric predictors by donor status]

# Select Yes/No variables
yesno_vars <- data %>%
  select(homeowner, female, zipconvert2, zipconvert3, zipconvert4, zipconvert5)
# Reshape and plot counts of each category
yesno_vars %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = value)) +
  geom_bar() +
  facet_wrap(~name, scales = "free_y") +
  labs(x = "Category", y = "Count") +
  theme_minimal()

[Figure: bar charts of the Yes/No predictors]
# Correlation matrix among numeric predictors
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)

[Figure: correlation heatmap of the numeric predictors]

# Identify highly correlated predictors (|r| > 0.75)
high_corr <- findCorrelation(cor_matrix, cutoff = 0.75)
if (length(high_corr) > 0) {
  cat("Highly collinear variables:\n")
  print(names(numeric_vars)[high_corr])
}
## Highly collinear variables:
## [1] "avg_fam_inc" "avg_gift"
Step 2.1:
Potentially good predictors include income, wealth, time_lag,
num_prom, months_since_donate, avg_fam_inc, home_value, avg_gift,
med_fam_inc, female, and homeowner. Yes, there is high collinearity:
flagging any variable with a correlation above 0.75 identifies
“avg_fam_inc” and “avg_gift”.
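To see what drives those flags, the pairwise correlations can be inspected directly. A minimal sketch; pairing avg_fam_inc with med_fam_inc and avg_gift with last_gift is a plausible reading of the correlation plot, not something the report states:
# Likely drivers of the findCorrelation() flags: two income measures
# and two gift-size measures of the same underlying quantities
cor(data$avg_fam_inc, data$med_fam_inc)
cor(data$avg_gift, data$last_gift)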
Step 2.2: Select classification tool and parameters. Run at least
two classification models of your choosing. Describe the two models that
you chose, with sufficient detail (method, parameters, variables,
etc.)
Logistic Regression: This is a classification method for binary
outcomes that estimates the log-odds of the target class. The
predictors used were income, wealth, time_lag, num_prom,
months_since_donate, home_value, med_fam_inc, female, and homeowner
(the collinear avg_fam_inc and avg_gift were dropped), with target
(donor vs. non-donor) as the response.
# Record the collinear variables flagged above (avg_fam_inc, avg_gift)
collinear_vars <- colnames(numeric_vars)[high_corr]
# Keep the candidate predictors, dropping the collinear pair
data_reduced <- data %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner, target)
# Apply the same reduction to both train and test sets
train_data_reduced <- train_data %>% select(all_of(colnames(data_reduced)))
test_data_reduced <- test_data %>% select(all_of(colnames(data_reduced)))
# Ensure target is a factor in both sets
train_data_reduced$target <- as.factor(train_data_reduced$target)
test_data_reduced$target <- as.factor(test_data_reduced$target)
# Fit on the training set only (fitting on the full data would leak
# test observations into the model)
log_model <- glm(target ~ ., data = train_data_reduced, family = "binomial")
# Check levels of the actual target
levels(test_data_reduced$target)
## [1] "Donor" "No Donor"
# Predict probabilities; glm models the probability of the SECOND
# factor level ("No Donor"), so p > 0.5 maps to "No Donor"
log_probs <- predict(log_model, test_data_reduced, type = "response")
log_preds <- ifelse(log_probs > 0.5, "No Donor", "Donor")
# Match levels explicitly
log_preds <- factor(log_preds, levels = levels(test_data_reduced$target))
# Evaluate
confusionMatrix(log_preds, test_data_reduced$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 141 180
## No Donor 159 120
##
## Accuracy : 0.435
## 95% CI : (0.3949, 0.4757)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.9994
##
## Kappa : -0.13
##
## Mcnemar's Test P-Value : 0.2774
##
## Sensitivity : 0.4700
## Specificity : 0.4000
## Pos Pred Value : 0.4393
## Neg Pred Value : 0.4301
## Prevalence : 0.5000
## Detection Rate : 0.2350
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.4350
##
## 'Positive' Class : Donor
##
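Because logistic regression estimates log-odds, exponentiating the coefficients yields odds ratios, which are easier to interpret than raw coefficients; a minimal sketch using the fitted model:
# Odds ratios: values above 1 raise the odds of the modeled (second)
# level, "No Donor"; values below 1 lower them
exp(coef(log_model))
# Profile-likelihood 95% confidence intervals on the odds-ratio scale
exp(confint(log_model))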
Random Forest Model: This method builds many decision trees on
bootstrap samples of the training data, considering a random subset of
predictors at each split, and aggregates their votes to improve accuracy
and reduce overfitting. The same predictor and target variables as in
the logistic regression were used, with ntree = 500.
rf_model <- randomForest(target ~ ., data = train_data_reduced, ntree = 500, importance = TRUE)
# Predict class labels
rf_preds <- predict(rf_model, test_data_reduced)
# Evaluate performance
confusionMatrix(rf_preds, test_data_reduced$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 166 143
## No Donor 134 157
##
## Accuracy : 0.5383
## 95% CI : (0.4975, 0.5788)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.03305
##
## Kappa : 0.0767
##
## Mcnemar's Test P-Value : 0.63075
##
## Sensitivity : 0.5533
## Specificity : 0.5233
## Pos Pred Value : 0.5372
## Neg Pred Value : 0.5395
## Prevalence : 0.5000
## Detection Rate : 0.2767
## Detection Prevalence : 0.5150
## Balanced Accuracy : 0.5383
##
## 'Positive' Class : Donor
##
# View variable importance
varImpPlot(rf_model)
[Figure: random forest variable importance plot]
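Beyond the importance plot, the forest's out-of-bag (OOB) error gives a built-in estimate of generalization error, and mtry can be tuned against it. A minimal sketch; tuneRF's step and improvement settings below are illustrative choices:
# OOB error rate as trees are added (print() shows the final rate)
plot(rf_model, main = "Random forest OOB error by number of trees")
print(rf_model)
# Tune mtry against the OOB error
set.seed(1)
tuned_mtry <- tuneRF(x = train_data_reduced %>% select(-target),
                     y = train_data_reduced$target,
                     stepFactor = 1.5, improve = 0.01, trace = FALSE)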
XGBoost Model: This model is a boosting-based classifier that builds
trees sequentially, with each tree correcting the errors of those
before it. The same predictor and target variables were used; because
xgboost requires numeric input, the target was recoded as 1 (Donor) /
0 (No Donor) and the categorical predictors were converted to dummy
variables.
data_reduced <- data %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner, target) %>%
  mutate(across(c(female, homeowner), as.factor),
         target = as.factor(target))
# Step 2: Encode categorical vars as numeric (required for xgboost)
data_reduced$target <- ifelse(data_reduced$target == "Donor", 1, 0)
dummies <- dummyVars(target ~ ., data = data_reduced)
data_xgb <- predict(dummies, newdata = data_reduced) %>% as.data.frame()
data_xgb$target <- data_reduced$target
# Step 3: Split into train/test (note: a fresh seed, so this split
# differs from the one used for the logistic and random forest models)
set.seed(123)
train_indices <- sample(1:nrow(data_xgb), 0.8 * nrow(data_xgb))
train_data <- data_xgb[train_indices, ]
test_data <- data_xgb[-train_indices, ]
# Step 4: Prepare matrices for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -ncol(train_data)]), label = train_data$target)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -ncol(test_data)]), label = test_data$target)
# Step 5: Train XGBoost model
xgb_model <- xgboost(data = train_matrix,
                     objective = "binary:logistic",
                     eval_metric = "logloss",
                     nrounds = 100,
                     verbose = 0)
# Step 6: Predict probabilities on test set
xgb_probs <- predict(xgb_model, newdata = test_matrix)
# Step 7: Raise the threshold to 0.7 to favor specificity
xgb_preds <- ifelse(xgb_probs > 0.7, 1, 0) %>% factor(levels = c(0, 1))
true_labels <- test_data$target %>% factor(levels = c(0, 1))
# Step 8: Evaluate with confusion matrix
confusionMatrix(xgb_preds, true_labels, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 226 238
## 1 62 74
##
## Accuracy : 0.5
## 95% CI : (0.4592, 0.5408)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 0.8465
##
## Kappa : 0.0214
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2372
## Specificity : 0.7847
## Pos Pred Value : 0.5441
## Neg Pred Value : 0.4871
## Prevalence : 0.5200
## Detection Rate : 0.1233
## Detection Prevalence : 0.2267
## Balanced Accuracy : 0.5110
##
## 'Positive' Class : 1
##
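The 0.7 cutoff in Step 7 deliberately trades sensitivity for specificity. Sweeping a range of thresholds makes that trade-off explicit; a minimal sketch reusing xgb_probs and true_labels from above (the grid of cutoffs is an arbitrary choice):
# Sensitivity/specificity across candidate probability cutoffs
thresholds <- seq(0.3, 0.8, by = 0.1)
threshold_sweep <- lapply(thresholds, function(t) {
  preds <- factor(ifelse(xgb_probs > t, 1, 0), levels = c(0, 1))
  cm <- confusionMatrix(preds, true_labels, positive = "1")
  tibble(Threshold   = t,
         Accuracy    = cm$overall["Accuracy"],
         Sensitivity = cm$byClass["Sensitivity"],
         Specificity = cm$byClass["Specificity"])
}) %>% bind_rows()
print(threshold_sweep)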
Step 2.3: Classification under asymmetric response and cost. Comment
on the reasoning behind using weighted sampling to produce a training
set with equal numbers of donors and non-donors. Why not use a simple
random sample from the original dataset?
Because the response is asymmetric, donors are far rarer than
non-donors in the population, so a simple random sample would be
dominated by the non-donor class and a model trained on it could score
high accuracy simply by predicting "no donor" for everyone. Weighting
the sample to equal numbers of donors and non-donors removes this
imbalance and forces the model to learn the characteristics that
distinguish donors. As a result, the confusion matrix provides a more
informative reflection of the model's ability to identify both classes.
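For reference, caret ships helpers for constructing balanced training sets; a minimal sketch using downsampling (upSample() is the oversampling analogue). Since the supplied fundraising file is already balanced by design, this illustrates what one would do when sampling from the raw, unbalanced donor population:
# Downsample the majority class so both classes have equal counts
set.seed(1)
balanced_train <- downSample(x = train_data_reduced %>% select(-target),
                             y = train_data_reduced$target,
                             yname = "target")
table(balanced_train$target)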
Step 2.4: Evaluate the fit. Examine the out-of-sample error for your
models. Use tables or graphs to display your results. Is there a model
that dominates?
# Logistic Regression
log_cm <- confusionMatrix(log_preds, test_data_reduced$target)
log_metrics <- tibble(
  Model = "Logistic Regression",
  Accuracy = log_cm$overall["Accuracy"],
  Sensitivity = log_cm$byClass["Sensitivity"],
  Specificity = log_cm$byClass["Specificity"],
  F1 = log_cm$byClass["F1"]
)
# Random Forest
rf_cm <- confusionMatrix(rf_preds, test_data_reduced$target)
rf_metrics <- tibble(
  Model = "Random Forest",
  Accuracy = rf_cm$overall["Accuracy"],
  Sensitivity = rf_cm$byClass["Sensitivity"],
  Specificity = rf_cm$byClass["Specificity"],
  F1 = rf_cm$byClass["F1"]
)
# XGBoost
xgb_cm <- confusionMatrix(xgb_preds, true_labels, positive = "1")
xgb_metrics <- tibble(
  Model = "XGBoost",
  Accuracy = xgb_cm$overall["Accuracy"],
  Sensitivity = xgb_cm$byClass["Sensitivity"],
  Specificity = xgb_cm$byClass["Specificity"],
  F1 = xgb_cm$byClass["F1"]
)
# Combine into one table
all_metrics <- bind_rows(log_metrics, rf_metrics, xgb_metrics)
# Display table
print(all_metrics)
## # A tibble: 3 × 5
## Model Accuracy Sensitivity Specificity F1
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.435 0.47 0.4 0.454
## 2 Random Forest 0.538 0.553 0.523 0.545
## 3 XGBoost 0.5 0.237 0.785 0.330
Random Forest is the most consistent of the three models, with the
highest Accuracy, Sensitivity, and F1 score.
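The same comparison works as a graph; a minimal sketch plotting the table above:
# Dodged bars make the sensitivity/specificity trade-off easy to see
all_metrics %>%
  pivot_longer(-Model, names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Out-of-sample performance by model", y = "Value") +
  theme_minimal()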
Step 2.5: Select best model. From your answer in (4), what do you
think is the “best” model?
We choose the XGBoost model because it has the highest specificity
(0.785). Specificity is the proportion of actual non-donors who are
correctly predicted not to donate. This matters when the goal is to
minimize costs and maximize net returns: correctly screening out likely
non-donors avoids wasted marketing expenditures on individuals who are
unlikely to respond.
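A back-of-the-envelope calculation shows why screening out non-donors matters; the mailing cost and average donation below are illustrative assumptions, not figures from the fundraising file:
# Hypothetical economics of one solicitation
mail_cost <- 0.70     # assumed cost per mailing
avg_donation <- 13    # assumed average gift from a responder
# Expected profit of mailing an individual with donor probability p
expected_profit <- function(p) p * avg_donation - mail_cost
# Break-even probability: mailing pays off only when p exceeds this
mail_cost / avg_donation
expected_profit(c(0.02, 0.10, 0.50))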
Step 3: Testing using the Future Fundraising Test File:
Results: The XGBoost model was used to predict donor vs. non-donor on
the future fundraising file; when the predictions were submitted to the
leaderboard, 53% of individuals were classified as likely donors.
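A minimal scoring sketch for that step; the file name future_fundraising.csv, its column layout, the 0.7 cutoff carried over from Step 2.2, and the single-column submission format are all assumptions, not details confirmed by the report:
# Score the future test file with the trained XGBoost model
future <- read.csv("future_fundraising.csv")  # assumed file name
future <- future %>% mutate(across(where(is.character), as.factor))
# Rebuild the reduced predictor set used in training; add a placeholder
# target column so the dummyVars formula (target ~ .) can be applied
future_reduced <- future %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner) %>%
  mutate(target = 0)
future_xgb <- predict(dummies, newdata = future_reduced) %>% as.data.frame()
# Predict donor probabilities and apply the same cutoff as above
future_probs <- predict(xgb_model, newdata = xgb.DMatrix(as.matrix(future_xgb)))
future_preds <- ifelse(future_probs > 0.7, "Donor", "No Donor")
# Submission format is an assumption; adjust to the leaderboard's spec
write.csv(data.frame(value = future_preds), "xgb_predictions.csv", row.names = FALSE)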