A national veterans’ organization aims to develop a predictive model to improve the cost-effectiveness of its direct marketing campaigns. With an in-house database of over 13 million donors, the organization is one of the largest direct-mail fundraisers in the United States. According to recent mailing records, the overall response rate is 5.1%. Among those who donated, the average gift is $13.00. Each mailing—containing personalized address labels and assorted cards—costs $0.68 to produce and send. Given these figures, a sample dataset has been drawn to develop a classification model that captures likely donors and maximizes expected net profit. Weighted sampling was used to balance the training data, ensuring equal representation of donors and non-donors.
The organization spends $0.68 per mailer, but only 5.1% of recipients respond, resulting in low returns when mailing to everyone. To avoid unnecessary costs, a more efficient, data-driven targeting strategy is needed.
The objective is to build a predictive classification model that identifies likely donors from a pool of candidates, in order to maximize the expected net profit of the direct-mail fundraising campaign.
The goal of this project is to develop a classification model that enables the veterans’ organization to target the right individuals for mailing campaigns. With a low historical response rate and a per-unit cost of $0.68, the model should identify individuals who are most likely to respond positively.
Instead of mailing to the full list—which leads to wasted costs—the model will predict donor likelihood using historical data. Mailings will be restricted to those with a high predicted probability of donating.
The model must also account for the asymmetric cost structure of the campaign:

- True Positive (correctly predicted donor) → Profit = $13.00 - $0.68 = $12.32
- False Positive (incorrectly predicted donor) → Loss = $0.68
Thus, the model should not only perform well in classification metrics like accuracy or recall, but also focus on maximizing expected net profit by reducing costly false positives while identifying valuable true positives.
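To make this concrete, the figures above imply a break-even response probability. The short R sketch below is illustrative only, using the numbers stated in the problem description:

# Campaign economics implied by the figures above (illustrative sketch)
response_rate <- 0.051   # overall response rate
avg_gift      <- 13.00   # average gift among donors
mail_cost     <- 0.68    # cost per mailing
# Expected net profit per piece when mailing everyone
response_rate * avg_gift - mail_cost   # about -$0.02, i.e., mass mailing loses money
# Break-even donor probability: only mail when P(donor) exceeds this
mail_cost / avg_gift                   # about 0.052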
This project uses two datasets from a national veterans’ organization: the training dataset fundraising.csv with 3,000 records and 21 variables, and the scoring dataset future_fundraising.csv with 120 records and 20 variables (excluding the target). The training data was created using weighted sampling to balance donors and non-donors evenly (50% each), allowing the model to better learn donor patterns despite the original population’s low 5.1% donor rate. This avoids bias from a simple random sample dominated by non-donors. The datasets had no missing values, but categorical variables such as zip code groups (zipconvert2 to zipconvert5), homeowner, female, and target were converted from character to factor types. Numeric variables including num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, months_since_donate, and time_lag were converted to numeric types. The data was then split into an 80/20 train-validation partition (2,401 and 599 records respectively), preserving class balance to support robust model evaluation.
Load required libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
#library(glmnet)
library(corrplot)
## corrplot 0.95 loaded
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(e1071)
library(ROCR)
#library(VIM)
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following object is masked from 'package:dplyr':
##
## combine
# Data Loading
fundraising <- read.csv("Final Project/fundraising.csv")
future_fundraising <- read.csv("Final Project/future_fundraising.csv")
#Copy of the dataset
df = fundraising
#Check missing values
sum(is.na(df))
## [1] 0
sum(is.na(future_fundraising))
## [1] 0
#Check dimensions of the datasets
dim(df)
## [1] 3000 21
dim(future_fundraising)
## [1] 120 20
#See the target distribution
table(df$target)
##
## Donor No Donor
## 1499 1501
#Structure of the data
str(df)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : chr "Yes" "No" "No" "No" ...
## $ zipconvert3 : chr "No" "No" "No" "Yes" ...
## $ zipconvert4 : chr "No" "No" "No" "No" ...
## $ zipconvert5 : chr "No" "Yes" "Yes" "No" ...
## $ homeowner : chr "Yes" "No" "Yes" "Yes" ...
## $ num_child : int 1 2 1 1 1 1 1 1 1 1 ...
## $ income : int 1 5 3 4 4 4 4 4 4 1 ...
## $ female : chr "No" "Yes" "No" "No" ...
## $ wealth : int 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : int 698 828 1471 547 482 857 505 1438 1316 428 ...
## $ med_fam_inc : int 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : int 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : int 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : int 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: int 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : int 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : chr "Donor" "Donor" "No Donor" "No Donor" ...
# Convert incorrect classes to their correct classes
#Convert characters to factor variables
df$zipconvert2 <- as.factor(df$zipconvert2)
df$zipconvert3 <- as.factor(df$zipconvert3)
df$zipconvert4 <- as.factor(df$zipconvert4)
df$zipconvert5 <- as.factor(df$zipconvert5)
df$homeowner <- as.factor(df$homeowner)
df$female <- as.factor(df$female)
df$target <- as.factor(df$target)
#Convert integers to numeric variables
df$num_child <- as.numeric(df$num_child)
df$income <- as.numeric(df$income)
df$wealth <- as.numeric(df$wealth)
df$home_value <- as.numeric(df$home_value)
df$med_fam_inc <- as.numeric(df$med_fam_inc)
df$avg_fam_inc <- as.numeric(df$avg_fam_inc)
df$pct_lt15k <- as.numeric(df$pct_lt15k)
df$num_prom <- as.numeric(df$num_prom)
df$months_since_donate <- as.numeric(df$months_since_donate)
df$time_lag <- as.numeric(df$time_lag)
#Verify the change
str(df)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ num_child : num 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 2 2 ...
## $ wealth : num 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num 698 828 1471 547 482 ...
## $ med_fam_inc : num 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
To evaluate model performance reliably, the dataset was randomly split into training and validation sets using an 80/20 ratio. A fixed seed (set.seed(12345)) was set to ensure reproducibility. This partitioning resulted in 2,401 records for training and 599 for validation, with the class distribution of donors and non-donors balanced evenly in both sets.
## Step 1: Partitioning
# Set seed for reproducibility
set.seed(12345)
# Create training and validation sets using your 'df' dataset
train_index <- createDataPartition(df$target, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
#See the dimensions of the train and test sets
dim(train_data)
## [1] 2401 21
dim(test_data)
## [1] 599 21
# Verify target distribution in both sets
print(table(train_data$target))
##
## Donor No Donor
## 1200 1201
print(table(test_data$target))
##
## Donor No Donor
## 299 300
Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
# Examine distribution of target variable
# Calculate percentages
target_counts <- train_data %>%
count(target) %>%
mutate(percentage = n / sum(n) * 100,
label = paste0(round(percentage, 1), "%"))
# Plot with percentage labels
ggplot(target_counts, aes(x = target, y = n, fill = target)) +
geom_col() +
geom_text(aes(label = label), vjust = -0.05, size = 5) +
labs(title = "Distribution of Target Variable",
x = "Target", y = "Count") +
theme_minimal()
The bar chart shows an essentially balanced training set (1,200 donors vs. 1,201 non-donors, about 50% each). This balance is ideal for classification modeling because it removes class-imbalance concerns.
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.3
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
##
## area
# Key numeric variables
key_numeric_vars <- c("lifetime_gifts", "largest_gift", "last_gift", "avg_gift",
"months_since_donate", "num_prom", "wealth", "income")
# Generate a list of plots
plot_list <- lapply(key_numeric_vars, function(var) {
ggplot(train_data, aes_string(x = "target", y = var, fill = "target")) +
geom_boxplot() +
labs(title = paste("Distribution of\n", var, "\nby Target"),
x = "Target", y = var) +
theme_minimal() +
scale_fill_manual(values = c("#1f77b4", "#ff7f0e")) +
theme(legend.position = "none",
plot.title = element_text(size = 10))
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Combine and display plots in a 4x4 grid (will auto-adjust if fewer than 16)
wrap_plots(plot_list, ncol = 4)
- Lifetime gifts, largest gift, last gift, avg gift: All show heavy right skew with many outliers. Donors and non-donors appear to have similar distributions for these variables, suggesting they may not be strong predictors on their own.
- Months since donate: Shows similar distributions between groups with medians around 30-35 months, indicating this temporal factor may not strongly differentiate donors from non-donors.
- Number of promotions: Donors appear to receive slightly fewer promotions on average than non-donors, which is counterintuitive and may suggest targeting inefficiencies.
- Wealth and Income: Both groups show similar distributions with many outliers, suggesting these socioeconomic factors alone may not be strong predictors of donation behavior.
cat_vars <- c("zipconvert2", "zipconvert3", "zipconvert4",
"zipconvert5", "homeowner", "female")
# Create list of categorical plots
cat_plot_list <- list()
for(i in 1:length(cat_vars)) {
var <- cat_vars[i]
# Create proportion table for plotting
prop_data <- train_data %>%
group_by(!!sym(var), target) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(!!sym(var)) %>%
mutate(prop = count / sum(count))
cat_plot_list[[i]] <- ggplot(prop_data, aes_string(x = var, y = "prop", fill = "target")) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = paste(var),
x = var, y = "Proportion") +
theme_minimal() +
theme(legend.position = if(i == length(cat_vars)) "right" else "none",
axis.text.x = element_text(angle = 45, hjust = 1))
}
# Arrange categorical plots in grid
grid.arrange(grobs = cat_plot_list, ncol = 3,
top = "Distribution of Categorical Variables by Target")
- Zipconvert variables (2-5): All show nearly identical proportions between donors and non-donors across categories, indicating these zip code conversion variables have little discriminatory power.
- Homeowner: There is a noticeable difference here: the proportion of non-donors is higher among non-homeowners (No), whereas donors are slightly more likely to be homeowners (Yes). This indicates that being a homeowner might be positively associated with donating.
- Female: The proportions of donors and non-donors among males (No) and females (Yes) are almost equal, indicating gender is not a significant predictor in this dataset.
Overall, the individual variables show surprisingly weak associations with the target variable, suggesting that donation behavior may depend on complex interactions between variables rather than simple univariate relationships.
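As a quick numerical check of these visual impressions, a chi-square test of independence can be run for each categorical predictor against the target. This is an illustrative sketch only; it is not part of the modeling pipeline and its output is not shown here.

# Chi-square tests of each categorical predictor against the target
# (illustrative check of the associations described above)
sapply(cat_vars, function(v) {
  chisq.test(table(train_data[[v]], train_data$target))$p.value
})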
# Correlation
# Select only numeric columns
numeric_vars <- train_data %>% select_if(is.numeric)
# Create correlation matrix
cor_matrix <- cor(numeric_vars)
cor_matrix
## num_child income wealth home_value
## num_child 1.000000000 0.08339983 0.061221609 -0.015495031
## income 0.083399826 1.00000000 0.213800254 0.284339261
## wealth 0.061221609 0.21380025 1.000000000 0.254019359
## home_value -0.015495031 0.28433926 0.254019359 1.000000000
## med_fam_inc 0.042532258 0.35891866 0.368738444 0.732555415
## avg_fam_inc 0.045063106 0.37202479 0.377668631 0.747760685
## pct_lt15k -0.025578299 -0.27905337 -0.365032790 -0.398634647
## num_prom -0.092865608 -0.06431244 -0.414796330 -0.060627169
## lifetime_gifts -0.050327680 -0.01031642 -0.216662769 -0.024815397
## largest_gift -0.018110511 0.08118317 -0.005723304 0.104641235
## last_gift -0.006292668 0.10415222 0.050857635 0.155872334
## months_since_donate -0.006843916 0.06515296 0.031986163 0.025694375
## time_lag 0.002649744 -0.01333546 -0.067629922 -0.006180833
## avg_gift -0.012829580 0.12673611 0.099395301 0.172480423
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.04253226 0.04506311 -0.02557830 -0.09286561
## income 0.35891866 0.37202479 -0.27905337 -0.06431244
## wealth 0.36873844 0.37766863 -0.36503279 -0.41479633
## home_value 0.73255541 0.74776069 -0.39863465 -0.06062717
## med_fam_inc 1.00000000 0.97101190 -0.66591144 -0.04501360
## avg_fam_inc 0.97101190 1.00000000 -0.68511328 -0.05072327
## pct_lt15k -0.66591144 -0.68511328 1.00000000 0.03128615
## num_prom -0.04501360 -0.05072327 0.03128615 1.00000000
## lifetime_gifts -0.03368242 -0.03900009 0.06266368 0.51747370
## largest_gift 0.08823892 0.08245552 -0.01211723 0.10157729
## last_gift 0.13557244 0.12746622 -0.05284895 -0.05000859
## months_since_donate 0.03548792 0.03492284 -0.01219579 -0.27191285
## time_lag 0.01281501 0.02304798 -0.02352929 0.13466978
## avg_gift 0.14162847 0.13242714 -0.05270529 -0.14979050
## lifetime_gifts largest_gift last_gift
## num_child -0.05032768 -0.018110511 -0.006292668
## income -0.01031642 0.081183171 0.104152224
## wealth -0.21666277 -0.005723304 0.050857635
## home_value -0.02481540 0.104641235 0.155872334
## med_fam_inc -0.03368242 0.088238920 0.135572437
## avg_fam_inc -0.03900009 0.082455521 0.127466219
## pct_lt15k 0.06266368 -0.012117231 -0.052848955
## num_prom 0.51747370 0.101577294 -0.050008591
## lifetime_gifts 1.00000000 0.636819069 0.203316447
## largest_gift 0.63681907 1.000000000 0.749200240
## last_gift 0.20331645 0.749200240 1.000000000
## months_since_donate -0.13033090 0.072066752 0.185651892
## time_lag 0.04814786 0.078718288 0.072982321
## avg_gift 0.17853780 0.730035131 0.859818678
## months_since_donate time_lag avg_gift
## num_child -0.006843916 0.002649744 -0.01282958
## income 0.065152965 -0.013335461 0.12673611
## wealth 0.031986163 -0.067629922 0.09939530
## home_value 0.025694375 -0.006180833 0.17248042
## med_fam_inc 0.035487919 0.012815010 0.14162847
## avg_fam_inc 0.034922841 0.023047982 0.13242714
## pct_lt15k -0.012195792 -0.023529286 -0.05270529
## num_prom -0.271912850 0.134669776 -0.14979050
## lifetime_gifts -0.130330897 0.048147865 0.17853780
## largest_gift 0.072066752 0.078718288 0.73003513
## last_gift 0.185651892 0.072982321 0.85981868
## months_since_donate 1.000000000 0.005744928 0.19128868
## time_lag 0.005744928 1.000000000 0.07432409
## avg_gift 0.191288678 0.074324093 1.00000000
# Plot the correlation matrix
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.6,
tl.cex = 0.8,
tl.col = "black",
tl.srt = 45,
mar = c(0, 0, 1, 0),
title = "Correlation Matrix of Numeric Features")
Highly Correlated Variable Pairs
Variable 1 | Variable 2 | Correlation
---|---|---
med_fam_inc | avg_fam_inc | 0.971
avg_fam_inc | home_value | 0.748
med_fam_inc | home_value | 0.733
largest_gift | avg_gift | 0.730
last_gift | avg_gift | 0.860
largest_gift | last_gift | 0.749
Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
Three classification models are used in this analysis: stepwise logistic regression, a random forest, and a support vector machine with a radial basis function (RBF) kernel.
# Set seed for reproducibility across all models
set.seed(12345)
# Method Description:
# - Uses forward, backward, and bidirectional stepwise selection
# - Based on Akaike Information Criterion (AIC) for variable selection
# - Automatically selects the most predictive variables
# - Fits a logistic regression model with selected variables
# Fit full logistic regression model (starting point)
full_model <- glm(target ~ ., data = train_data, family = binomial)
# Fit null model (intercept only)
null_model <- glm(target ~ 1, data = train_data, family = binomial)
# Perform stepwise selection using AIC
# Direction "both" allows forward and backward steps
stepwise_model <- stepAIC(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both",
trace = TRUE)
## Start: AIC=3330.49
## target ~ 1
##
## Df Deviance AIC
## + months_since_donate 1 3274.3 3278.3
## + last_gift 1 3313.4 3317.4
## + avg_gift 1 3316.0 3320.0
## + num_prom 1 3317.0 3321.0
## + income 1 3321.8 3325.8
## + largest_gift 1 3322.3 3326.3
## + home_value 1 3323.4 3327.4
## + num_child 1 3324.0 3328.0
## + zipconvert5 1 3325.2 3329.2
## + homeowner 1 3325.6 3329.6
## <none> 3328.5 3330.5
## + zipconvert2 1 3327.3 3331.3
## + med_fam_inc 1 3327.4 3331.4
## + lifetime_gifts 1 3327.6 3331.6
## + female 1 3327.7 3331.7
## + avg_fam_inc 1 3327.8 3331.8
## + zipconvert3 1 3328.0 3332.0
## + zipconvert4 1 3328.4 3332.4
## + pct_lt15k 1 3328.4 3332.4
## + wealth 1 3328.5 3332.5
## + time_lag 1 3328.5 3332.5
##
## Step: AIC=3278.35
## target ~ months_since_donate
##
## Df Deviance AIC
## + income 1 3264.6 3270.6
## + home_value 1 3268.2 3274.2
## + last_gift 1 3268.3 3274.3
## + num_child 1 3269.5 3275.5
## + avg_gift 1 3269.9 3275.9
## + largest_gift 1 3270.7 3276.7
## + homeowner 1 3270.8 3276.8
## + zipconvert5 1 3271.1 3277.1
## + num_prom 1 3272.0 3278.0
## <none> 3274.3 3278.3
## + med_fam_inc 1 3272.6 3278.6
## + zipconvert2 1 3273.0 3279.0
## + avg_fam_inc 1 3273.2 3279.2
## + zipconvert3 1 3273.6 3279.6
## + female 1 3274.1 3280.1
## + wealth 1 3274.2 3280.2
## + pct_lt15k 1 3274.2 3280.2
## + time_lag 1 3274.3 3280.3
## + lifetime_gifts 1 3274.3 3280.3
## + zipconvert4 1 3274.3 3280.3
## - months_since_donate 1 3328.5 3330.5
##
## Step: AIC=3270.61
## target ~ months_since_donate + income
##
## Df Deviance AIC
## + last_gift 1 3256.9 3264.9
## + avg_gift 1 3258.3 3266.3
## + num_child 1 3258.5 3266.5
## + largest_gift 1 3259.8 3267.8
## + num_prom 1 3261.8 3269.8
## + home_value 1 3261.8 3269.8
## + zipconvert5 1 3262.3 3270.3
## <none> 3264.6 3270.6
## + zipconvert2 1 3263.3 3271.3
## + homeowner 1 3263.8 3271.8
## + female 1 3264.2 3272.2
## + pct_lt15k 1 3264.3 3272.3
## + zipconvert3 1 3264.3 3272.3
## + wealth 1 3264.6 3272.6
## + med_fam_inc 1 3264.6 3272.6
## + avg_fam_inc 1 3264.6 3272.6
## + time_lag 1 3264.6 3272.6
## + lifetime_gifts 1 3264.6 3272.6
## + zipconvert4 1 3264.6 3272.6
## - income 1 3274.3 3278.3
## - months_since_donate 1 3321.8 3325.8
##
## Step: AIC=3264.87
## target ~ months_since_donate + income + last_gift
##
## Df Deviance AIC
## + num_child 1 3250.6 3260.6
## + home_value 1 3252.6 3262.6
## + zipconvert5 1 3253.8 3263.8
## + num_prom 1 3254.0 3264.0
## <none> 3256.9 3264.9
## + zipconvert2 1 3255.6 3265.6
## + homeowner 1 3256.2 3266.2
## + lifetime_gifts 1 3256.4 3266.4
## + zipconvert3 1 3256.4 3266.4
## + med_fam_inc 1 3256.6 3266.6
## + pct_lt15k 1 3256.6 3266.6
## + female 1 3256.6 3266.6
## + avg_gift 1 3256.8 3266.8
## + avg_fam_inc 1 3256.8 3266.8
## + time_lag 1 3256.8 3266.8
## + wealth 1 3256.8 3266.8
## + zipconvert4 1 3256.9 3266.9
## + largest_gift 1 3256.9 3266.9
## - last_gift 1 3264.6 3270.6
## - income 1 3268.3 3274.3
## - months_since_donate 1 3304.1 3310.1
##
## Step: AIC=3260.59
## target ~ months_since_donate + income + last_gift + num_child
##
## Df Deviance AIC
## + home_value 1 3246.7 3258.7
## + zipconvert5 1 3247.9 3259.9
## + num_prom 1 3248.5 3260.5
## <none> 3250.6 3260.6
## + zipconvert2 1 3249.5 3261.5
## + homeowner 1 3249.8 3261.8
## + zipconvert3 1 3250.1 3262.1
## + lifetime_gifts 1 3250.3 3262.3
## + med_fam_inc 1 3250.3 3262.3
## + pct_lt15k 1 3250.3 3262.3
## + female 1 3250.4 3262.4
## + avg_gift 1 3250.5 3262.5
## + avg_fam_inc 1 3250.5 3262.5
## + time_lag 1 3250.6 3262.6
## + largest_gift 1 3250.6 3262.6
## + wealth 1 3250.6 3262.6
## + zipconvert4 1 3250.6 3262.6
## - num_child 1 3256.9 3264.9
## - last_gift 1 3258.5 3266.5
## - income 1 3263.4 3271.4
## - months_since_donate 1 3298.3 3306.3
##
## Step: AIC=3258.74
## target ~ months_since_donate + income + last_gift + num_child +
## home_value
##
## Df Deviance AIC
## + avg_fam_inc 1 3243.9 3257.9
## + num_prom 1 3244.3 3258.3
## <none> 3246.7 3258.7
## + pct_lt15k 1 3245.2 3259.2
## + med_fam_inc 1 3245.4 3259.4
## + zipconvert5 1 3246.0 3260.0
## + zipconvert2 1 3246.1 3260.1
## + homeowner 1 3246.1 3260.1
## + lifetime_gifts 1 3246.3 3260.3
## + wealth 1 3246.5 3260.5
## + female 1 3246.6 3260.6
## + avg_gift 1 3246.6 3260.6
## - home_value 1 3250.6 3260.6
## + zipconvert4 1 3246.6 3260.6
## + zipconvert3 1 3246.6 3260.6
## + time_lag 1 3246.7 3260.7
## + largest_gift 1 3246.7 3260.7
## - num_child 1 3252.6 3262.6
## - income 1 3255.2 3265.2
## - last_gift 1 3256.1 3266.1
## - months_since_donate 1 3293.9 3303.9
##
## Step: AIC=3257.89
## target ~ months_since_donate + income + last_gift + num_child +
## home_value + avg_fam_inc
##
## Df Deviance AIC
## + num_prom 1 3241.4 3257.4
## <none> 3243.9 3257.9
## + med_fam_inc 1 3242.6 3258.6
## - avg_fam_inc 1 3246.7 3258.7
## + homeowner 1 3243.3 3259.3
## + zipconvert2 1 3243.3 3259.3
## + lifetime_gifts 1 3243.5 3259.5
## + zipconvert4 1 3243.6 3259.6
## + avg_gift 1 3243.7 3259.7
## + zipconvert5 1 3243.7 3259.7
## + female 1 3243.7 3259.7
## + time_lag 1 3243.8 3259.8
## + pct_lt15k 1 3243.8 3259.8
## + largest_gift 1 3243.9 3259.9
## + zipconvert3 1 3243.9 3259.9
## + wealth 1 3243.9 3259.9
## - num_child 1 3249.2 3261.2
## - home_value 1 3250.5 3262.5
## - last_gift 1 3253.2 3265.2
## - income 1 3254.3 3266.3
## - months_since_donate 1 3291.0 3303.0
##
## Step: AIC=3257.38
## target ~ months_since_donate + income + last_gift + num_child +
## home_value + avg_fam_inc + num_prom
##
## Df Deviance AIC
## <none> 3241.4 3257.4
## - num_prom 1 3243.9 3257.9
## + med_fam_inc 1 3240.2 3258.2
## - avg_fam_inc 1 3244.3 3258.3
## + wealth 1 3240.8 3258.8
## + homeowner 1 3240.8 3258.8
## + zipconvert2 1 3240.9 3258.9
## + zipconvert4 1 3241.1 3259.1
## + zipconvert5 1 3241.2 3259.2
## + largest_gift 1 3241.2 3259.2
## + female 1 3241.3 3259.3
## + pct_lt15k 1 3241.3 3259.3
## + lifetime_gifts 1 3241.3 3259.3
## + avg_gift 1 3241.4 3259.4
## + zipconvert3 1 3241.4 3259.4
## + time_lag 1 3241.4 3259.4
## - num_child 1 3246.0 3260.0
## - home_value 1 3248.4 3262.4
## - last_gift 1 3250.8 3264.8
## - income 1 3252.2 3266.2
## - months_since_donate 1 3280.2 3294.2
summary(stepwise_model)
##
## Call:
## glm(formula = target ~ months_since_donate + income + last_gift +
## num_child + home_value + avg_fam_inc + num_prom, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1834174 0.4254808 -5.132 2.87e-07 ***
## months_since_donate 0.0688887 0.0112683 6.114 9.75e-10 ***
## income -0.0908349 0.0277470 -3.274 0.00106 **
## last_gift 0.0126138 0.0042894 2.941 0.00327 **
## num_child 0.2647543 0.1256392 2.107 0.03510 *
## home_value -0.0001755 0.0000667 -2.632 0.00849 **
## avg_fam_inc 0.0006578 0.0003852 1.708 0.08772 .
## num_prom -0.0030592 0.0019329 -1.583 0.11348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3328.5 on 2400 degrees of freedom
## Residual deviance: 3241.4 on 2393 degrees of freedom
## AIC: 3257.4
##
## Number of Fisher Scoring iterations: 4
# Check Multicollinerity
vif(stepwise_model)
## months_since_donate income last_gift num_child
## 1.105146 1.179779 1.075611 1.022359
## home_value avg_fam_inc num_prom
## 2.312161 2.439406 1.074224
Because all VIF values are below 5, multicollinearity is not a concern among the selected predictors.
set.seed(12345)
# Make predictions on test set
stepwise_pred_prob <- predict(stepwise_model, test_data, type = "response")
stepwise_pred_class <- ifelse(stepwise_pred_prob > 0.5, "Donor", "No Donor")
stepwise_pred_class <- factor(stepwise_pred_class, levels = c("Donor", "No Donor"))
# Calculate performance metrics
stepwise_roc <- roc(test_data$target, stepwise_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls < cases
stepwise_auc <- auc(stepwise_roc)
print(paste("Stepwise Logistic Regression AUC:", round(stepwise_auc, 4)))
## [1] "Stepwise Logistic Regression AUC: 0.5504"
print("Stepwise Logistic Regression Confusion Matrix:")
## [1] "Stepwise Logistic Regression Confusion Matrix:"
print(confusionMatrix(stepwise_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 125 154
## No Donor 174 146
##
## Accuracy : 0.4524
## 95% CI : (0.412, 0.4933)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.9921
##
## Kappa : -0.0953
##
## Mcnemar's Test P-Value : 0.2941
##
## Sensitivity : 0.4181
## Specificity : 0.4867
## Pos Pred Value : 0.4480
## Neg Pred Value : 0.4563
## Prevalence : 0.4992
## Detection Rate : 0.2087
## Detection Prevalence : 0.4658
## Balanced Accuracy : 0.4524
##
## 'Positive' Class : Donor
##
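One detail worth verifying here: for a two-level factor response, R's glm() models the probability of the second factor level, which in this data is "No Donor". If predict(..., type = "response") therefore returns P(No Donor), the probability-to-class mapping above would be reversed, which would also be consistent with the below-chance accuracy and negative kappa in this confusion matrix. A quick check (not part of the original analysis):

# Which level does glm() treat as the modeled ("success") class?
levels(train_data$target)      # "Donor" "No Donor" -> the second level is modeled
contrasts(train_data$target)   # "No Donor" coded as 1
# If P(No Donor) is being returned, the mapping would need to be flipped (sketch only):
# stepwise_pred_class <- ifelse(stepwise_pred_prob > 0.5, "No Donor", "Donor")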
# Method Description:
# - Ensemble method using bootstrap aggregation of decision trees
# - Each tree trained on bootstrap sample of training data
# - At each split, random subset of variables considered
# - Final prediction: majority vote
# Set seed for reproducibility
set.seed(12345)
# Train Random Forest model
rf_model <- randomForest(
target ~ .,
data = train_data,
ntree = 500,
mtry = 4,
nodesize = 5,
importance = TRUE
)
print(rf_model)
##
## Call:
## randomForest(formula = target ~ ., data = train_data, ntree = 500, mtry = 4, nodesize = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 46.02%
## Confusion matrix:
## Donor No Donor class.error
## Donor 665 535 0.4458333
## No Donor 570 631 0.4746045
# Predictions
rf_pred_prob <- predict(rf_model, test_data, type = "prob")[,"Donor"]
rf_pred_class <- predict(rf_model, test_data, type = "response")
# Performance
rf_roc <- roc(test_data$target, rf_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls > cases
rf_auc <- auc(rf_roc)
print(paste("Random Forest AUC:", round(rf_auc, 4)))
## [1] "Random Forest AUC: 0.5267"
print(confusionMatrix(rf_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 167 157
## No Donor 132 143
##
## Accuracy : 0.5175
## 95% CI : (0.4767, 0.5582)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.2188
##
## Kappa : 0.0352
##
## Mcnemar's Test P-Value : 0.1580
##
## Sensitivity : 0.5585
## Specificity : 0.4767
## Pos Pred Value : 0.5154
## Neg Pred Value : 0.5200
## Prevalence : 0.4992
## Detection Rate : 0.2788
## Detection Prevalence : 0.5409
## Balanced Accuracy : 0.5176
##
## 'Positive' Class : Donor
##
# Variable Importance
importance_scores <- importance(rf_model)
importance_df <- data.frame(
Variable = rownames(importance_scores),
Importance = importance_scores[,"MeanDecreaseGini"]
)
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE),]
print(head(importance_df, 10))
## Variable Importance
## home_value home_value 95.77529
## avg_gift avg_gift 92.23919
## avg_fam_inc avg_fam_inc 86.56504
## lifetime_gifts lifetime_gifts 86.11906
## med_fam_inc med_fam_inc 86.06912
## num_prom num_prom 77.77546
## pct_lt15k pct_lt15k 73.31086
## months_since_donate months_since_donate 64.17823
## time_lag time_lag 62.64301
## last_gift last_gift 52.34361
# Next, create a variable importance plot of the top predictors
top_vars <- head(importance_df, 10)
# Plot
ggplot(top_vars, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
geom_col() +
coord_flip() + # flip to horizontal bars
labs(title = "Top 10 Variable",
x = "Variable",
y = "Importance") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
theme_minimal() +
theme(legend.position = "none")
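The mtry value of 4 above was chosen manually (roughly the square root of the number of predictors). A possible refinement, not run in the original analysis, is to tune mtry against the out-of-bag error with randomForest::tuneRF; the search settings below are assumptions for illustration.

# Tune mtry by OOB error (sketch only; parameter grid is an assumption)
set.seed(12345)
mtry_search <- tuneRF(
  x = train_data[, setdiff(names(train_data), "target")],
  y = train_data$target,
  mtryStart  = 4,
  ntreeTry   = 500,
  stepFactor = 1.5,
  improve    = 0.01,
  trace      = FALSE
)
mtry_search   # OOB error by mtry; the value with the lowest error would be refit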
set.seed(12345) # for reproducibility
# Scale the data for SVM
train_scaled <- train_data
test_scaled <- test_data
# Scale numeric variables
numeric_vars <- sapply(train_data, is.numeric)
train_scaled[numeric_vars] <- scale(train_data[numeric_vars])
# Use same scaling parameters for test data
scaling_params <- attributes(scale(train_data[numeric_vars]))
test_scaled[numeric_vars] <- scale(test_data[numeric_vars],
center = scaling_params$`scaled:center`,
scale = scaling_params$`scaled:scale`)
# Train SVM model
svm_model <- svm(
target ~ .,
data = train_scaled,
kernel = "radial",
cost = 10,
gamma = 0.1,
probability = TRUE
)
print(svm_model)
##
## Call:
## svm(formula = target ~ ., data = train_scaled, kernel = "radial",
## cost = 10, gamma = 0.1, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 2086
# Predictions
svm_pred <- predict(svm_model, test_scaled, probability = TRUE)
svm_pred_prob <- attr(svm_pred, "probabilities")[,"Donor"]
svm_pred_class <- svm_pred
# Performance
svm_roc <- roc(test_data$target, svm_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls > cases
svm_auc <- auc(svm_roc)
print(paste("SVM AUC:", round(svm_auc, 4)))
## [1] "SVM AUC: 0.4999"
print(confusionMatrix(svm_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 135 132
## No Donor 164 168
##
## Accuracy : 0.5058
## 95% CI : (0.465, 0.5466)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.41908
##
## Kappa : 0.0115
##
## Mcnemar's Test P-Value : 0.07157
##
## Sensitivity : 0.4515
## Specificity : 0.5600
## Pos Pred Value : 0.5056
## Neg Pred Value : 0.5060
## Prevalence : 0.4992
## Detection Rate : 0.2254
## Detection Prevalence : 0.4457
## Balanced Accuracy : 0.5058
##
## 'Positive' Class : Donor
##
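The cost and gamma values above (10 and 0.1) were fixed by hand. A possible refinement, not run here, is a cross-validated grid search with e1071::tune.svm; the grid below is an assumption for illustration.

# Cross-validated grid search over SVM cost and gamma (sketch only)
set.seed(12345)
svm_tune <- tune.svm(
  target ~ ., data = train_scaled,
  kernel = "radial",
  cost   = c(0.1, 1, 10),
  gamma  = c(0.01, 0.1, 1)
)
summary(svm_tune)          # CV error for each cost/gamma combination
svm_tune$best.parameters   # combination with the lowest CV error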
Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?

As noted in the data description, a simple random sample would contain roughly 95% non-donors, so a classifier could reach high accuracy simply by predicting "No Donor" for everyone and would learn little about donor characteristics. Weighted sampling balances the two classes so the models see enough donor examples to detect the patterns that distinguish them; the trade-off is that predicted probabilities reflect the balanced sample rather than the true 5.1% base rate.
Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
# Based on the confusion matrices and AUC values from the output
# Performance metrics for each model
performance_data <- data.frame(
Model = c("Stepwise Logistic", "Random Forest", "SVM"),
Accuracy = c(0.4524, 0.5175, 0.5058),
Sensitivity = c(0.4181, 0.5585, 0.4515),
Specificity = c(0.4867, 0.4767, 0.5600),
AUC = c(0.5504, 0.5267, 0.4999),
stringsAsFactors = FALSE
)
precision <- c(0.4480, 0.5154, 0.5056)
recall <- performance_data$Sensitivity
performance_data$F1_Score <- 2 * (precision * recall) / (precision + recall)
# Add error rate (1 - Accuracy)
performance_data$Error_Rate <- 1 - performance_data$Accuracy
# Create formatted performance table
performance_table <- performance_data
performance_table[, 2:7] <- round(performance_table[, 2:7], 4)
print(performance_table)
## Model Accuracy Sensitivity Specificity AUC F1_Score Error_Rate
## 1 Stepwise Logistic 0.4524 0.4181 0.4867 0.5504 0.4325 0.5476
## 2 Random Forest 0.5175 0.5585 0.4767 0.5267 0.5361 0.4825
## 3 SVM 0.5058 0.4515 0.5600 0.4999 0.4770 0.4942
# Performance summary
model_comparison <- data.frame(
Model = c("Stepwise Logistic Regression", "Random Forest", "SVM (RBF)"),
AUC = round(c(stepwise_auc, rf_auc, svm_auc), 4)
)
# Sort by AUC
model_comparison <- model_comparison[order(model_comparison$AUC, decreasing = TRUE),]
print(model_comparison)
## Model AUC
## 1 Stepwise Logistic Regression 0.5504
## 2 Random Forest 0.5267
## 3 SVM (RBF) 0.4999
# ROC Curves Plot
plot(stepwise_roc, col = "blue", main = "ROC Curves Comparison", lwd = 2)
plot(rf_roc, col = "red", add = TRUE, lwd = 2)
plot(svm_roc, col = "green", add = TRUE, lwd = 2)
legend("bottomright",
legend = c(paste("Stepwise LR (AUC =", round(stepwise_auc, 3), ")"),
paste("Random Forest (AUC =", round(rf_auc, 3), ")"),
paste("SVM RBF (AUC =", round(svm_auc, 3), ")")),
col = c("blue", "red", "green"),
lty = 1, lwd = 2, cex = 0.8)
# Create accuracy bar plot ordered by descending accuracy
accuracy_plot <- ggplot(performance_data, aes(x = reorder(Model, -Accuracy), y = Accuracy, fill = Model)) +
geom_bar(stat = "identity", alpha = 0.7, width = 0.6) +
geom_text(aes(label = paste0(round(Accuracy * 100, 1), "%")),
vjust = -0.3, size = 4) +
labs(title = "Model Accuracy Comparison",
x = "Model",
y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 14),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11)) +
scale_y_continuous(limits = c(0, 0.6),
labels = scales::percent_format()) +
scale_fill_brewer(type = "qual", palette = "Set2")
print(accuracy_plot)
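Because the stated objective is expected net profit rather than raw accuracy, the validation confusion matrices above can also be translated into dollars, at $12.32 per correctly targeted donor and -$0.68 per wasted mailing. The sketch below uses the counts printed in the confusion matrices; note that because the validation set is balanced (about 50% donors rather than the population's 5.1%), these figures overstate real-world profit and are useful only for comparing models.

# Net profit implied by each validation confusion matrix (counts copied from above)
profit <- function(tp, fp) tp * 12.32 - fp * 0.68
data.frame(
  Model  = c("Stepwise Logistic", "Random Forest", "SVM (RBF)"),
  Profit = c(profit(125, 154), profit(167, 157), profit(135, 132))
)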
From your answer in (4), what do you think is the “best”
model?
Random Forest dominates in the most critical areas:

- Sensitivity (55.85%): the highest ability to identify actual donors, capturing roughly 14 percentage points more donors than stepwise logistic regression (41.81%) and about 10 points more than the SVM (45.15%).
- F1-score (0.5361): the best balance between precision and recall for donor prediction, clearly ahead of the logistic and support vector models.
- Error rate (48.25%): the lowest of the three, making it the most reliable overall.

Although stepwise logistic regression has the highest AUC (0.5504), the random forest is preferred here because identifying actual donors (sensitivity) is what drives campaign profit.
Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
# Load the future fundraising test data
future_fundraising <- read.csv("Final Project/future_fundraising.csv")
# Convert character variables to factors to match df data format
future_fundraising$zipconvert2 <- factor(future_fundraising$zipconvert2, levels = c("No", "Yes"))
future_fundraising$zipconvert3 <- factor(future_fundraising$zipconvert3, levels = c("No", "Yes"))
future_fundraising$zipconvert4 <- factor(future_fundraising$zipconvert4, levels = c("No", "Yes"))
future_fundraising$zipconvert5 <- factor(future_fundraising$zipconvert5, levels = c("No", "Yes"))
future_fundraising$homeowner <- factor(future_fundraising$homeowner, levels = c("No", "Yes"))
future_fundraising$female <- factor(future_fundraising$female, levels = c("No", "Yes"))
# Convert integer variables to numeric to match df data
future_fundraising$num_child <- as.numeric(future_fundraising$num_child)
future_fundraising$income <- as.numeric(future_fundraising$income)
future_fundraising$wealth <- as.numeric(future_fundraising$wealth)
future_fundraising$home_value <- as.numeric(future_fundraising$home_value)
future_fundraising$med_fam_inc <- as.numeric(future_fundraising$med_fam_inc)
future_fundraising$avg_fam_inc <- as.numeric(future_fundraising$avg_fam_inc)
future_fundraising$pct_lt15k <- as.numeric(future_fundraising$pct_lt15k)
future_fundraising$num_prom <- as.numeric(future_fundraising$num_prom)
future_fundraising$months_since_donate <- as.numeric(future_fundraising$months_since_donate)
future_fundraising$time_lag <- as.numeric(future_fundraising$time_lag)
# Check the updated structure
str(future_fundraising)
## 'data.frame': 120 obs. of 20 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 1 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ num_child : num 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num 5 1 4 4 2 4 2 3 4 2 ...
## $ female : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 1 1 ...
## $ wealth : num 9 7 1 8 7 8 1 8 3 5 ...
## $ home_value : num 1399 1355 835 1019 992 ...
## $ med_fam_inc : num 637 411 310 389 524 371 209 253 302 335 ...
## $ avg_fam_inc : num 703 497 364 473 563 408 259 285 324 348 ...
## $ pct_lt15k : num 1 9 22 15 6 10 36 25 19 14 ...
## $ num_prom : num 74 77 70 21 63 35 72 68 55 59 ...
## $ lifetime_gifts : num 102 249 126 26 100 92 146 98 66 276 ...
## $ largest_gift : num 6 15 6 16 20 37 12 5 7 15 ...
## $ last_gift : num 5 7 6 16 3 37 11 3 5 13 ...
## $ months_since_donate: num 29 35 34 37 21 37 36 32 30 33 ...
## $ time_lag : num 3 3 8 5 6 5 5 9 9 10 ...
## $ avg_gift : num 4.86 9.58 4.34 13 7.69 ...
# Make predictions using the Random Forest model
future_predictions <- predict(rf_model, future_fundraising, type = "response")
# Create submission dataframe with required format
submission <- data.frame(value = future_predictions)
# Display first few predictions
print(head(submission, 10))
## value
## 1 Donor
## 2 Donor
## 3 Donor
## 4 No Donor
## 5 Donor
## 6 No Donor
## 7 No Donor
## 8 Donor
## 9 Donor
## 10 No Donor
# Summary of predictions
table(submission$value)
##
## Donor No Donor
## 60 60
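A caveat on these predictions: because the model was trained on a balanced sample while the true donor rate is about 5.1%, the raw predicted probabilities overstate donor likelihood. If ranked scores rather than hard labels were needed, one standard correction is to rescale the predicted odds by the ratio of the population prior to the sample prior. The sketch below is illustrative and was not used for the submission.

# Adjust balanced-sample probabilities toward the 5.1% population donor rate (sketch)
p_bal <- predict(rf_model, future_fundraising, type = "prob")[, "Donor"]
p_bal <- pmin(pmax(p_bal, 1e-6), 1 - 1e-6)   # avoid division by zero
pop_prior    <- 0.051   # population donor rate
sample_prior <- 0.50    # donor share in the balanced training sample
odds_adj <- (p_bal / (1 - p_bal)) *
            (pop_prior / (1 - pop_prior)) / (sample_prior / (1 - sample_prior))
p_adj <- odds_adj / (1 + odds_adj)
head(round(p_adj, 3))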
Out of all the models, the only one that excluded predictors was the stepwise logistic regression model. The following predictors were excluded from the final model: zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, female, wealth, med_fam_inc, pct_lt15k, lifetime_gifts, largest_gift, time_lag, and avg_gift.
As mentioned earlier in the data description and cleaning, all character variables were converted to factors and all integer variables to numeric.

After running the stepwise logistic regression, random forest, and support vector machine models, the model that performed best at identifying actual donors with the lowest error rate was the random forest.
months_since_donate (p = 9.75e-10): the most significant predictor, measuring the number of months from the last donation to July 2018. Its positive coefficient (0.0689) means a longer time since the last donation increases the likelihood of being classified as a current donor, suggesting the model may be identifying donors who gave earlier but are still considered “active”.

income (p = 0.00106): household income in hundreds of dollars. Its negative coefficient (-0.0908) means higher household income reduces donation probability, suggesting donors may come from middle- or lower-income households rather than wealthy ones.

last_gift (p = 0.00327): the dollar amount of the most recent gift. Its positive coefficient (0.0126) means larger previous gifts predict future donations, which makes sense: generous past donors are likely to donate again.

home_value (p = 0.00849): the average home value in the donor’s neighborhood, in hundreds of dollars. Its negative coefficient (-0.0002) means higher neighborhood home values reduce donation likelihood, consistent with the income finding that wealthier neighborhoods are less likely to donate.

num_child (p = 0.0351): the number of children. Its positive coefficient (0.265) means more children increases donation probability.
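The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. Note also that, because glm() models the probability of the second factor level ("No Donor" here), the signs above may describe the odds of not donating rather than donating, so the level check suggested earlier applies to these interpretations as well. A short sketch:

# Odds ratios (with profile-likelihood confidence intervals) for the stepwise model
round(exp(cbind(OddsRatio = coef(stepwise_model), confint(stepwise_model))), 3)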
# Write to CSV file for submission
write.csv(submission, "future_fundraising_predictions.csv", row.names = FALSE)
# Verify the written file by reading it back
submission_check <- read.csv("future_fundraising_predictions.csv")
head(submission_check, 10)