A national veterans’ organization aims to develop a predictive model to improve the cost-effectiveness of its direct marketing campaigns. With an in-house database of over 13 million donors, the organization is one of the largest direct-mail fundraisers in the United States. According to recent mailing records, the overall response rate is 5.1%. Among those who donated, the average gift is $13.00. Each mailing—containing personalized address labels and assorted cards—costs $0.68 to produce and send. Given these figures, a sample dataset has been drawn to develop a classification model that captures likely donors and maximizes expected net profit. Weighted sampling was used to balance the training data, ensuring equal representation of donors and non-donors.
The organization spends $0.68 per mailer, but only 5.1% of recipients respond, resulting in low returns when mailing to everyone. To avoid unnecessary costs, a more efficient, data-driven targeting strategy is needed.
The objective is to build a predictive classification model that identifies likely donors from a pool of candidates, in order to maximize the expected net profit of the direct-mail fundraising campaign.
The goal of this project is to develop a classification model that enables the veterans’ organization to target the right individuals for mailing campaigns. With a low historical response rate and a per-unit cost of $0.68, the model should identify individuals who are most likely to respond positively.
Instead of mailing to the full list—which leads to wasted costs—the model will predict donor likelihood using historical data. Mailings will be restricted to those with a high predicted probability of donating.
The model must also account for the asymmetric cost structure of the campaign:

- True Positive (correctly predicted donor) → Profit = $13.00 - $0.68 = $12.32
- False Positive (incorrectly predicted donor) → Loss = $0.68
Thus, the model should not only perform well in classification metrics like accuracy or recall, but also focus on maximizing expected net profit by reducing costly false positives while identifying valuable true positives.
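To make this concrete, the figures above imply a break-even response probability. The short R sketch below is illustrative only, using the numbers stated in the problem description:

# Campaign economics implied by the figures above (illustrative sketch)
response_rate <- 0.051   # overall response rate
avg_gift      <- 13.00   # average gift among donors
mail_cost     <- 0.68    # cost per mailing
# Expected net profit per piece when mailing everyone
response_rate * avg_gift - mail_cost   # about -$0.02, i.e., mass mailing loses money
# Break-even donor probability: only mail when P(donor) exceeds this
mail_cost / avg_gift                   # about 0.052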
This project uses two datasets from a national veterans’ organization: the training dataset fundraising.csv with 3,000 records and 21 variables, and the scoring dataset future_fundraising.csv with 120 records and 20 variables (excluding the target). The training data was created using weighted sampling to balance donors and non-donors evenly (50% each), allowing the model to better learn donor patterns despite the original population’s low 5.1% donor rate. This avoids bias from a simple random sample dominated by non-donors. The datasets had no missing values, but categorical variables such as zip code groups (zipconvert2 to zipconvert5), homeowner, female, and target were converted from character to factor types. Numeric variables including num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, months_since_donate, and time_lag were converted to numeric types. The data was then split into an 80/20 train-validation partition (2,401 and 599 records respectively), preserving class balance to support robust model evaluation.
Load required libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
#library(glmnet)
library(corrplot)
## corrplot 0.95 loaded
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(e1071)
library(ROCR)
#library(VIM)
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following object is masked from 'package:dplyr':
##
## combine
# Data Loading
fundraising <- read.csv("Final Project/fundraising.csv")
future_fundraising <- read.csv("Final Project/future_fundraising.csv")
#Copy of the dataset
df = fundraising
#Check missing values
sum(is.na(df))
## [1] 0
sum(is.na(future_fundraising))
## [1] 0
#Check dimensions of the datasets
dim(df)
## [1] 3000 21
dim(future_fundraising)
## [1] 120 20
#See the target distribution
table(df$target)
##
## Donor No Donor
## 1499 1501
#Structure of the data
str(df)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : chr "Yes" "No" "No" "No" ...
## $ zipconvert3 : chr "No" "No" "No" "Yes" ...
## $ zipconvert4 : chr "No" "No" "No" "No" ...
## $ zipconvert5 : chr "No" "Yes" "Yes" "No" ...
## $ homeowner : chr "Yes" "No" "Yes" "Yes" ...
## $ num_child : int 1 2 1 1 1 1 1 1 1 1 ...
## $ income : int 1 5 3 4 4 4 4 4 4 1 ...
## $ female : chr "No" "Yes" "No" "No" ...
## $ wealth : int 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : int 698 828 1471 547 482 857 505 1438 1316 428 ...
## $ med_fam_inc : int 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : int 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : int 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : int 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: int 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : int 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : chr "Donor" "Donor" "No Donor" "No Donor" ...
# Convert incorrect classes to their correct classes
#Convert characters to factor variables
df$zipconvert2 <- as.factor(df$zipconvert2)
df$zipconvert3 <- as.factor(df$zipconvert3)
df$zipconvert4 <- as.factor(df$zipconvert4)
df$zipconvert5 <- as.factor(df$zipconvert5)
df$homeowner <- as.factor(df$homeowner)
df$female <- as.factor(df$female)
df$target <- as.factor(df$target)
#Convert integers to numeric variables
df$num_child <- as.numeric(df$num_child)
df$income <- as.numeric(df$income)
df$wealth <- as.numeric(df$wealth)
df$home_value <- as.numeric(df$home_value)
df$med_fam_inc <- as.numeric(df$med_fam_inc)
df$avg_fam_inc <- as.numeric(df$avg_fam_inc)
df$pct_lt15k <- as.numeric(df$pct_lt15k)
df$num_prom <- as.numeric(df$num_prom)
df$months_since_donate <- as.numeric(df$months_since_donate)
df$time_lag <- as.numeric(df$time_lag)
#Verify the change
str(df)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ num_child : num 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 2 2 ...
## $ wealth : num 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num 698 828 1471 547 482 ...
## $ med_fam_inc : num 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
To evaluate model performance reliably, the dataset was randomly split into training and validation sets using an 80/20 ratio. A fixed seed (set.seed(12345)) was set to ensure reproducibility. This partitioning resulted in 2,401 records for training and 599 for validation, with the class distribution of donors and non-donors balanced evenly in both sets.
## Step 1: Partitioning
# Set seed for reproducibility
set.seed(12345)
# Create training and validation sets using your 'df' dataset
train_index <- createDataPartition(df$target, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
#See the dimensions of the train and test sets
dim(train_data)
## [1] 2401 21
dim(test_data)
## [1] 599 21
# Verify target distribution in both sets
print(table(train_data$target))
##
## Donor No Donor
## 1200 1201
print(table(test_data$target))
##
## Donor No Donor
## 299 300
Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
# Examine distribution of target variable
# Calculate percentages
target_counts <- train_data %>%
count(target) %>%
mutate(percentage = n / sum(n) * 100,
label = paste0(round(percentage, 1), "%"))
# Plot with percentage labels
ggplot(target_counts, aes(x = target, y = n, fill = target)) +
geom_col() +
geom_text(aes(label = label), vjust = -0.05, size = 5) +
labs(title = "Distribution of Target Variable",
x = "Target", y = "Count") +
theme_minimal()
The bar chart shows an essentially balanced training set (1,200 donors vs. 1,201 non-donors, about 50% each). This balance is ideal for classification modeling because it removes class-imbalance concerns.
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.3
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
##
## area
# Key numeric variables
key_numeric_vars <- c("lifetime_gifts", "largest_gift", "last_gift", "avg_gift",
"months_since_donate", "num_prom", "wealth", "income")
# Generate a list of plots
plot_list <- lapply(key_numeric_vars, function(var) {
ggplot(train_data, aes_string(x = "target", y = var, fill = "target")) +
geom_boxplot() +
labs(title = paste("Distribution of\n", var, "\nby Target"),
x = "Target", y = var) +
theme_minimal() +
scale_fill_manual(values = c("#1f77b4", "#ff7f0e")) +
theme(legend.position = "none",
plot.title = element_text(size = 10))
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Combine and display plots in a 4x4 grid (will auto-adjust if fewer than 16)
wrap_plots(plot_list, ncol = 4)
- Lifetime gifts, largest gift, last gift, avg gift: All show heavy right skew with many outliers. Donors and non-donors appear to have similar distributions for these variables, suggesting they may not be strong predictors on their own.
- Months since donate: Shows similar distributions between groups with medians around 30-35 months, indicating this temporal factor may not strongly differentiate donors from non-donors.
- Number of promotions: Donors appear to receive slightly fewer promotions on average than non-donors, which is counterintuitive and may suggest targeting inefficiencies.
- Wealth and Income: Both groups show similar distributions with many outliers, suggesting these socioeconomic factors alone may not be strong predictors of donation behavior.
cat_vars <- c("zipconvert2", "zipconvert3", "zipconvert4",
"zipconvert5", "homeowner", "female")
# Create list of categorical plots
cat_plot_list <- list()
for(i in 1:length(cat_vars)) {
var <- cat_vars[i]
# Create proportion table for plotting
prop_data <- train_data %>%
group_by(!!sym(var), target) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(!!sym(var)) %>%
mutate(prop = count / sum(count))
cat_plot_list[[i]] <- ggplot(prop_data, aes_string(x = var, y = "prop", fill = "target")) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = paste(var),
x = var, y = "Proportion") +
theme_minimal() +
theme(legend.position = if(i == length(cat_vars)) "right" else "none",
axis.text.x = element_text(angle = 45, hjust = 1))
}
# Arrange categorical plots in grid
grid.arrange(grobs = cat_plot_list, ncol = 3,
top = "Distribution of Categorical Variables by Target")
- Zipconvert variables (2-5): All show nearly identical proportions between donors and non-donors across categories, indicating these zip code conversion variables have little discriminatory power.
- Homeowner: There is a noticeable difference here: the proportion of non-donors is higher among non-homeowners (No), whereas donors are slightly more likely to be homeowners (Yes). This indicates that being a homeowner might be positively associated with donating.
- Female: The proportions of donors and non-donors among males (No) and females (Yes) are almost equal, indicating gender is not a significant predictor in this dataset.
Overall, the individual variables show surprisingly weak associations with the target variable, suggesting that donation behavior may depend on complex interactions between variables rather than simple univariate relationships.
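As a quick numerical check of these visual impressions, a chi-square test of independence can be run for each categorical predictor against the target. This is an illustrative sketch only; it is not part of the modeling pipeline and its output is not shown here.

# Chi-square tests of each categorical predictor against the target
# (illustrative check of the associations described above)
sapply(cat_vars, function(v) {
  chisq.test(table(train_data[[v]], train_data$target))$p.value
})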
# Correlation
# Select only numeric columns
numeric_vars <- train_data %>% select_if(is.numeric)
# Create correlation matrix
cor_matrix <- cor(numeric_vars)
cor_matrix
## num_child income wealth home_value
## num_child 1.000000000 0.08339983 0.061221609 -0.015495031
## income 0.083399826 1.00000000 0.213800254 0.284339261
## wealth 0.061221609 0.21380025 1.000000000 0.254019359
## home_value -0.015495031 0.28433926 0.254019359 1.000000000
## med_fam_inc 0.042532258 0.35891866 0.368738444 0.732555415
## avg_fam_inc 0.045063106 0.37202479 0.377668631 0.747760685
## pct_lt15k -0.025578299 -0.27905337 -0.365032790 -0.398634647
## num_prom -0.092865608 -0.06431244 -0.414796330 -0.060627169
## lifetime_gifts -0.050327680 -0.01031642 -0.216662769 -0.024815397
## largest_gift -0.018110511 0.08118317 -0.005723304 0.104641235
## last_gift -0.006292668 0.10415222 0.050857635 0.155872334
## months_since_donate -0.006843916 0.06515296 0.031986163 0.025694375
## time_lag 0.002649744 -0.01333546 -0.067629922 -0.006180833
## avg_gift -0.012829580 0.12673611 0.099395301 0.172480423
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.04253226 0.04506311 -0.02557830 -0.09286561
## income 0.35891866 0.37202479 -0.27905337 -0.06431244
## wealth 0.36873844 0.37766863 -0.36503279 -0.41479633
## home_value 0.73255541 0.74776069 -0.39863465 -0.06062717
## med_fam_inc 1.00000000 0.97101190 -0.66591144 -0.04501360
## avg_fam_inc 0.97101190 1.00000000 -0.68511328 -0.05072327
## pct_lt15k -0.66591144 -0.68511328 1.00000000 0.03128615
## num_prom -0.04501360 -0.05072327 0.03128615 1.00000000
## lifetime_gifts -0.03368242 -0.03900009 0.06266368 0.51747370
## largest_gift 0.08823892 0.08245552 -0.01211723 0.10157729
## last_gift 0.13557244 0.12746622 -0.05284895 -0.05000859
## months_since_donate 0.03548792 0.03492284 -0.01219579 -0.27191285
## time_lag 0.01281501 0.02304798 -0.02352929 0.13466978
## avg_gift 0.14162847 0.13242714 -0.05270529 -0.14979050
## lifetime_gifts largest_gift last_gift
## num_child -0.05032768 -0.018110511 -0.006292668
## income -0.01031642 0.081183171 0.104152224
## wealth -0.21666277 -0.005723304 0.050857635
## home_value -0.02481540 0.104641235 0.155872334
## med_fam_inc -0.03368242 0.088238920 0.135572437
## avg_fam_inc -0.03900009 0.082455521 0.127466219
## pct_lt15k 0.06266368 -0.012117231 -0.052848955
## num_prom 0.51747370 0.101577294 -0.050008591
## lifetime_gifts 1.00000000 0.636819069 0.203316447
## largest_gift 0.63681907 1.000000000 0.749200240
## last_gift 0.20331645 0.749200240 1.000000000
## months_since_donate -0.13033090 0.072066752 0.185651892
## time_lag 0.04814786 0.078718288 0.072982321
## avg_gift 0.17853780 0.730035131 0.859818678
## months_since_donate time_lag avg_gift
## num_child -0.006843916 0.002649744 -0.01282958
## income 0.065152965 -0.013335461 0.12673611
## wealth 0.031986163 -0.067629922 0.09939530
## home_value 0.025694375 -0.006180833 0.17248042
## med_fam_inc 0.035487919 0.012815010 0.14162847
## avg_fam_inc 0.034922841 0.023047982 0.13242714
## pct_lt15k -0.012195792 -0.023529286 -0.05270529
## num_prom -0.271912850 0.134669776 -0.14979050
## lifetime_gifts -0.130330897 0.048147865 0.17853780
## largest_gift 0.072066752 0.078718288 0.73003513
## last_gift 0.185651892 0.072982321 0.85981868
## months_since_donate 1.000000000 0.005744928 0.19128868
## time_lag 0.005744928 1.000000000 0.07432409
## avg_gift 0.191288678 0.074324093 1.00000000
# Plot the correlation matrix
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.6,
tl.cex = 0.8,
tl.col = "black",
tl.srt = 45,
mar = c(0, 0, 1, 0),
title = "Correlation Matrix of Numeric Features")
Highly Correlated Variable Pairs
Variable 1 | Variable 2 | Correlation
---|---|---
med_fam_inc | avg_fam_inc | 0.971
avg_fam_inc | home_value | 0.748
med_fam_inc | home_value | 0.733
largest_gift | avg_gift | 0.730
last_gift | avg_gift | 0.860
largest_gift | last_gift | 0.749
Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
Three classification models are used in this analysis: stepwise logistic regression, a random forest, and a support vector machine with a radial basis function (RBF) kernel.
# Set seed for reproducibility across all models
set.seed(12345)
# Method Description:
# - Uses forward, backward, and bidirectional stepwise selection
# - Based on Akaike Information Criterion (AIC) for variable selection
# - Automatically selects the most predictive variables
# - Fits a logistic regression model with selected variables
# Fit full logistic regression model (starting point)
full_model <- glm(target ~ ., data = train_data, family = binomial)
# Fit null model (intercept only)
null_model <- glm(target ~ 1, data = train_data, family = binomial)
# Perform stepwise selection using AIC
# Direction "both" allows forward and backward steps
stepwise_model <- stepAIC(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both",
trace = TRUE)
## Start: AIC=3330.49
## target ~ 1
##
## Df Deviance AIC
## + months_since_donate 1 3274.3 3278.3
## + last_gift 1 3313.4 3317.4
## + avg_gift 1 3316.0 3320.0
## + num_prom 1 3317.0 3321.0
## + income 1 3321.8 3325.8
## + largest_gift 1 3322.3 3326.3
## + home_value 1 3323.4 3327.4
## + num_child 1 3324.0 3328.0
## + zipconvert5 1 3325.2 3329.2
## + homeowner 1 3325.6 3329.6
## <none> 3328.5 3330.5
## + zipconvert2 1 3327.3 3331.3
## + med_fam_inc 1 3327.4 3331.4
## + lifetime_gifts 1 3327.6 3331.6
## + female 1 3327.7 3331.7
## + avg_fam_inc 1 3327.8 3331.8
## + zipconvert3 1 3328.0 3332.0
## + zipconvert4 1 3328.4 3332.4
## + pct_lt15k 1 3328.4 3332.4
## + wealth 1 3328.5 3332.5
## + time_lag 1 3328.5 3332.5
##
## Step: AIC=3278.35
## target ~ months_since_donate
##
## Df Deviance AIC
## + income 1 3264.6 3270.6
## + home_value 1 3268.2 3274.2
## + last_gift 1 3268.3 3274.3
## + num_child 1 3269.5 3275.5
## + avg_gift 1 3269.9 3275.9
## + largest_gift 1 3270.7 3276.7
## + homeowner 1 3270.8 3276.8
## + zipconvert5 1 3271.1 3277.1
## + num_prom 1 3272.0 3278.0
## <none> 3274.3 3278.3
## + med_fam_inc 1 3272.6 3278.6
## + zipconvert2 1 3273.0 3279.0
## + avg_fam_inc 1 3273.2 3279.2
## + zipconvert3 1 3273.6 3279.6
## + female 1 3274.1 3280.1
## + wealth 1 3274.2 3280.2
## + pct_lt15k 1 3274.2 3280.2
## + time_lag 1 3274.3 3280.3
## + lifetime_gifts 1 3274.3 3280.3
## + zipconvert4 1 3274.3 3280.3
## - months_since_donate 1 3328.5 3330.5
##
## Step: AIC=3270.61
## target ~ months_since_donate + income
##
## Df Deviance AIC
## + last_gift 1 3256.9 3264.9
## + avg_gift 1 3258.3 3266.3
## + num_child 1 3258.5 3266.5
## + largest_gift 1 3259.8 3267.8
## + num_prom 1 3261.8 3269.8
## + home_value 1 3261.8 3269.8
## + zipconvert5 1 3262.3 3270.3
## <none> 3264.6 3270.6
## + zipconvert2 1 3263.3 3271.3
## + homeowner 1 3263.8 3271.8
## + female 1 3264.2 3272.2
## + pct_lt15k 1 3264.3 3272.3
## + zipconvert3 1 3264.3 3272.3
## + wealth 1 3264.6 3272.6
## + med_fam_inc 1 3264.6 3272.6
## + avg_fam_inc 1 3264.6 3272.6
## + time_lag 1 3264.6 3272.6
## + lifetime_gifts 1 3264.6 3272.6
## + zipconvert4 1 3264.6 3272.6
## - income 1 3274.3 3278.3
## - months_since_donate 1 3321.8 3325.8
##
## Step: AIC=3264.87
## target ~ months_since_donate + income + last_gift
##
## Df Deviance AIC
## + num_child 1 3250.6 3260.6
## + home_value 1 3252.6 3262.6
## + zipconvert5 1 3253.8 3263.8
## + num_prom 1 3254.0 3264.0
## <none> 3256.9 3264.9
## + zipconvert2 1 3255.6 3265.6
## + homeowner 1 3256.2 3266.2
## + lifetime_gifts 1 3256.4 3266.4
## + zipconvert3 1 3256.4 3266.4
## + med_fam_inc 1 3256.6 3266.6
## + pct_lt15k 1 3256.6 3266.6
## + female 1 3256.6 3266.6
## + avg_gift 1 3256.8 3266.8
## + avg_fam_inc 1 3256.8 3266.8
## + time_lag 1 3256.8 3266.8
## + wealth 1 3256.8 3266.8
## + zipconvert4 1 3256.9 3266.9
## + largest_gift 1 3256.9 3266.9
## - last_gift 1 3264.6 3270.6
## - income 1 3268.3 3274.3
## - months_since_donate 1 3304.1 3310.1
##
## Step: AIC=3260.59
## target ~ months_since_donate + income + last_gift + num_child
##
## Df Deviance AIC
## + home_value 1 3246.7 3258.7
## + zipconvert5 1 3247.9 3259.9
## + num_prom 1 3248.5 3260.5
## <none> 3250.6 3260.6
## + zipconvert2 1 3249.5 3261.5
## + homeowner 1 3249.8 3261.8
## + zipconvert3 1 3250.1 3262.1
## + lifetime_gifts 1 3250.3 3262.3
## + med_fam_inc 1 3250.3 3262.3
## + pct_lt15k 1 3250.3 3262.3
## + female 1 3250.4 3262.4
## + avg_gift 1 3250.5 3262.5
## + avg_fam_inc 1 3250.5 3262.5
## + time_lag 1 3250.6 3262.6
## + largest_gift 1 3250.6 3262.6
## + wealth 1 3250.6 3262.6
## + zipconvert4 1 3250.6 3262.6
## - num_child 1 3256.9 3264.9
## - last_gift 1 3258.5 3266.5
## - income 1 3263.4 3271.4
## - months_since_donate 1 3298.3 3306.3
##
## Step: AIC=3258.74
## target ~ months_since_donate + income + last_gift + num_child +
## home_value
##
## Df Deviance AIC
## + avg_fam_inc 1 3243.9 3257.9
## + num_prom 1 3244.3 3258.3
## <none> 3246.7 3258.7
## + pct_lt15k 1 3245.2 3259.2
## + med_fam_inc 1 3245.4 3259.4
## + zipconvert5 1 3246.0 3260.0
## + zipconvert2 1 3246.1 3260.1
## + homeowner 1 3246.1 3260.1
## + lifetime_gifts 1 3246.3 3260.3
## + wealth 1 3246.5 3260.5
## + female 1 3246.6 3260.6
## + avg_gift 1 3246.6 3260.6
## - home_value 1 3250.6 3260.6
## + zipconvert4 1 3246.6 3260.6
## + zipconvert3 1 3246.6 3260.6
## + time_lag 1 3246.7 3260.7
## + largest_gift 1 3246.7 3260.7
## - num_child 1 3252.6 3262.6
## - income 1 3255.2 3265.2
## - last_gift 1 3256.1 3266.1
## - months_since_donate 1 3293.9 3303.9
##
## Step: AIC=3257.89
## target ~ months_since_donate + income + last_gift + num_child +
## home_value + avg_fam_inc
##
## Df Deviance AIC
## + num_prom 1 3241.4 3257.4
## <none> 3243.9 3257.9
## + med_fam_inc 1 3242.6 3258.6
## - avg_fam_inc 1 3246.7 3258.7
## + homeowner 1 3243.3 3259.3
## + zipconvert2 1 3243.3 3259.3
## + lifetime_gifts 1 3243.5 3259.5
## + zipconvert4 1 3243.6 3259.6
## + avg_gift 1 3243.7 3259.7
## + zipconvert5 1 3243.7 3259.7
## + female 1 3243.7 3259.7
## + time_lag 1 3243.8 3259.8
## + pct_lt15k 1 3243.8 3259.8
## + largest_gift 1 3243.9 3259.9
## + zipconvert3 1 3243.9 3259.9
## + wealth 1 3243.9 3259.9
## - num_child 1 3249.2 3261.2
## - home_value 1 3250.5 3262.5
## - last_gift 1 3253.2 3265.2
## - income 1 3254.3 3266.3
## - months_since_donate 1 3291.0 3303.0
##
## Step: AIC=3257.38
## target ~ months_since_donate + income + last_gift + num_child +
## home_value + avg_fam_inc + num_prom
##
## Df Deviance AIC
## <none> 3241.4 3257.4
## - num_prom 1 3243.9 3257.9
## + med_fam_inc 1 3240.2 3258.2
## - avg_fam_inc 1 3244.3 3258.3
## + wealth 1 3240.8 3258.8
## + homeowner 1 3240.8 3258.8
## + zipconvert2 1 3240.9 3258.9
## + zipconvert4 1 3241.1 3259.1
## + zipconvert5 1 3241.2 3259.2
## + largest_gift 1 3241.2 3259.2
## + female 1 3241.3 3259.3
## + pct_lt15k 1 3241.3 3259.3
## + lifetime_gifts 1 3241.3 3259.3
## + avg_gift 1 3241.4 3259.4
## + zipconvert3 1 3241.4 3259.4
## + time_lag 1 3241.4 3259.4
## - num_child 1 3246.0 3260.0
## - home_value 1 3248.4 3262.4
## - last_gift 1 3250.8 3264.8
## - income 1 3252.2 3266.2
## - months_since_donate 1 3280.2 3294.2
summary(stepwise_model)
##
## Call:
## glm(formula = target ~ months_since_donate + income + last_gift +
## num_child + home_value + avg_fam_inc + num_prom, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1834174 0.4254808 -5.132 2.87e-07 ***
## months_since_donate 0.0688887 0.0112683 6.114 9.75e-10 ***
## income -0.0908349 0.0277470 -3.274 0.00106 **
## last_gift 0.0126138 0.0042894 2.941 0.00327 **
## num_child 0.2647543 0.1256392 2.107 0.03510 *
## home_value -0.0001755 0.0000667 -2.632 0.00849 **
## avg_fam_inc 0.0006578 0.0003852 1.708 0.08772 .
## num_prom -0.0030592 0.0019329 -1.583 0.11348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3328.5 on 2400 degrees of freedom
## Residual deviance: 3241.4 on 2393 degrees of freedom
## AIC: 3257.4
##
## Number of Fisher Scoring iterations: 4
# Check Multicollinerity
vif(stepwise_model)
## months_since_donate income last_gift num_child
## 1.105146 1.179779 1.075611 1.022359
## home_value avg_fam_inc num_prom
## 2.312161 2.439406 1.074224
Because all VIF values are below 5, multicollinearity is not a concern among the selected predictors.
set.seed(12345)
# Make predictions on test set
stepwise_pred_prob <- predict(stepwise_model, test_data, type = "response")
stepwise_pred_class <- ifelse(stepwise_pred_prob > 0.5, "Donor", "No Donor")
stepwise_pred_class <- factor(stepwise_pred_class, levels = c("Donor", "No Donor"))
# Calculate performance metrics
stepwise_roc <- roc(test_data$target, stepwise_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls < cases
stepwise_auc <- auc(stepwise_roc)
print(paste("Stepwise Logistic Regression AUC:", round(stepwise_auc, 4)))
## [1] "Stepwise Logistic Regression AUC: 0.5504"
print("Stepwise Logistic Regression Confusion Matrix:")
## [1] "Stepwise Logistic Regression Confusion Matrix:"
print(confusionMatrix(stepwise_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 125 154
## No Donor 174 146
##
## Accuracy : 0.4524
## 95% CI : (0.412, 0.4933)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.9921
##
## Kappa : -0.0953
##
## Mcnemar's Test P-Value : 0.2941
##
## Sensitivity : 0.4181
## Specificity : 0.4867
## Pos Pred Value : 0.4480
## Neg Pred Value : 0.4563
## Prevalence : 0.4992
## Detection Rate : 0.2087
## Detection Prevalence : 0.4658
## Balanced Accuracy : 0.4524
##
## 'Positive' Class : Donor
##
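One detail worth verifying here: for a two-level factor response, R's glm() models the probability of the second factor level, which in this data is "No Donor". If predict(..., type = "response") therefore returns P(No Donor), the probability-to-class mapping above would be reversed, which would also be consistent with the below-chance accuracy and negative kappa in this confusion matrix. A quick check (not part of the original analysis):

# Which level does glm() treat as the modeled ("success") class?
levels(train_data$target)      # "Donor" "No Donor" -> the second level is modeled
contrasts(train_data$target)   # "No Donor" coded as 1
# If P(No Donor) is being returned, the mapping would need to be flipped (sketch only):
# stepwise_pred_class <- ifelse(stepwise_pred_prob > 0.5, "No Donor", "Donor")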
# Method Description:
# - Ensemble method using bootstrap aggregation of decision trees
# - Each tree trained on bootstrap sample of training data
# - At each split, random subset of variables considered
# - Final prediction: majority vote
# Set seed for reproducibility
set.seed(12345)
# Train Random Forest model
rf_model <- randomForest(
target ~ .,
data = train_data,
ntree = 500,
mtry = 4,
nodesize = 5,
importance = TRUE
)
print(rf_model)
##
## Call:
## randomForest(formula = target ~ ., data = train_data, ntree = 500, mtry = 4, nodesize = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 46.02%
## Confusion matrix:
## Donor No Donor class.error
## Donor 665 535 0.4458333
## No Donor 570 631 0.4746045
# Predictions
rf_pred_prob <- predict(rf_model, test_data, type = "prob")[,"Donor"]
rf_pred_class <- predict(rf_model, test_data, type = "response")
# Performance
rf_roc <- roc(test_data$target, rf_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls > cases
rf_auc <- auc(rf_roc)
print(paste("Random Forest AUC:", round(rf_auc, 4)))
## [1] "Random Forest AUC: 0.5267"
print(confusionMatrix(rf_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 167 157
## No Donor 132 143
##
## Accuracy : 0.5175
## 95% CI : (0.4767, 0.5582)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.2188
##
## Kappa : 0.0352
##
## Mcnemar's Test P-Value : 0.1580
##
## Sensitivity : 0.5585
## Specificity : 0.4767
## Pos Pred Value : 0.5154
## Neg Pred Value : 0.5200
## Prevalence : 0.4992
## Detection Rate : 0.2788
## Detection Prevalence : 0.5409
## Balanced Accuracy : 0.5176
##
## 'Positive' Class : Donor
##
# Variable Importance
importance_scores <- importance(rf_model)
importance_df <- data.frame(
Variable = rownames(importance_scores),
Importance = importance_scores[,"MeanDecreaseGini"]
)
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE),]
print(head(importance_df, 10))
## Variable Importance
## home_value home_value 95.77529
## avg_gift avg_gift 92.23919
## avg_fam_inc avg_fam_inc 86.56504
## lifetime_gifts lifetime_gifts 86.11906
## med_fam_inc med_fam_inc 86.06912
## num_prom num_prom 77.77546
## pct_lt15k pct_lt15k 73.31086
## months_since_donate months_since_donate 64.17823
## time_lag time_lag 62.64301
## last_gift last_gift 52.34361
# Next, create a variable importance plot of the top predictors
top_vars <- head(importance_df, 10)
# Plot
ggplot(top_vars, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
geom_col() +
coord_flip() + # flip to horizontal bars
labs(title = "Top 10 Variable",
x = "Variable",
y = "Importance") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
theme_minimal() +
theme(legend.position = "none")
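The mtry value of 4 above was chosen manually (roughly the square root of the number of predictors). A possible refinement, not run in the original analysis, is to tune mtry against the out-of-bag error with randomForest::tuneRF; the search settings below are assumptions for illustration.

# Tune mtry by OOB error (sketch only; parameter grid is an assumption)
set.seed(12345)
mtry_search <- tuneRF(
  x = train_data[, setdiff(names(train_data), "target")],
  y = train_data$target,
  mtryStart  = 4,
  ntreeTry   = 500,
  stepFactor = 1.5,
  improve    = 0.01,
  trace      = FALSE
)
mtry_search   # OOB error by mtry; the value with the lowest error would be refit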
set.seed(12345) # for reproducibility
# Scale the data for SVM
train_scaled <- train_data
test_scaled <- test_data
# Scale numeric variables
numeric_vars <- sapply(train_data, is.numeric)
train_scaled[numeric_vars] <- scale(train_data[numeric_vars])
# Use same scaling parameters for test data
scaling_params <- attributes(scale(train_data[numeric_vars]))
test_scaled[numeric_vars] <- scale(test_data[numeric_vars],
center = scaling_params$`scaled:center`,
scale = scaling_params$`scaled:scale`)
# Train SVM model
svm_model <- svm(
target ~ .,
data = train_scaled,
kernel = "radial",
cost = 10,
gamma = 0.1,
probability = TRUE
)
print(svm_model)
##
## Call:
## svm(formula = target ~ ., data = train_scaled, kernel = "radial",
## cost = 10, gamma = 0.1, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 2086
# Predictions
svm_pred <- predict(svm_model, test_scaled, probability = TRUE)
svm_pred_prob <- attr(svm_pred, "probabilities")[,"Donor"]
svm_pred_class <- svm_pred
# Performance
svm_roc <- roc(test_data$target, svm_pred_prob)
## Setting levels: control = Donor, case = No Donor
## Setting direction: controls > cases
svm_auc <- auc(svm_roc)
print(paste("SVM AUC:", round(svm_auc, 4)))
## [1] "SVM AUC: 0.4999"
print(confusionMatrix(svm_pred_class, test_data$target, positive = "Donor"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 135 132
## No Donor 164 168
##
## Accuracy : 0.5058
## 95% CI : (0.465, 0.5466)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.41908
##
## Kappa : 0.0115
##
## Mcnemar's Test P-Value : 0.07157
##
## Sensitivity : 0.4515
## Specificity : 0.5600
## Pos Pred Value : 0.5056
## Neg Pred Value : 0.5060
## Prevalence : 0.4992
## Detection Rate : 0.2254
## Detection Prevalence : 0.4457
## Balanced Accuracy : 0.5058
##
## 'Positive' Class : Donor
##
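The cost and gamma values above (10 and 0.1) were fixed by hand. A possible refinement, not run here, is a cross-validated grid search with e1071::tune.svm; the grid below is an assumption for illustration.

# Cross-validated grid search over SVM cost and gamma (sketch only)
set.seed(12345)
svm_tune <- tune.svm(
  target ~ ., data = train_scaled,
  kernel = "radial",
  cost   = c(0.1, 1, 10),
  gamma  = c(0.01, 0.1, 1)
)
summary(svm_tune)          # CV error for each cost/gamma combination
svm_tune$best.parameters   # combination with the lowest CV error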
Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?

As noted in the data description, a simple random sample would contain roughly 95% non-donors, so a classifier could reach high accuracy simply by predicting "No Donor" for everyone and would learn little about donor characteristics. Weighted sampling balances the two classes so the models see enough donor examples to detect the patterns that distinguish them; the trade-off is that predicted probabilities reflect the balanced sample rather than the true 5.1% base rate.
Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
# Based on the confusion matrices and AUC values from the output
# Performance metrics for each model
performance_data <- data.frame(
Model = c("Stepwise Logistic", "Random Forest", "SVM"),
Accuracy = c(0.4524, 0.5175, 0.5058),
Sensitivity = c(0.4181, 0.5585, 0.4515),
Specificity = c(0.4867, 0.4767, 0.5600),
AUC = c(0.5504, 0.5267, 0.4999),
stringsAsFactors = FALSE
)
precision <- c(0.4480, 0.5154, 0.5056)
recall <- performance_data$Sensitivity
performance_data$F1_Score <- 2 * (precision * recall) / (precision + recall)
# Add error rate (1 - Accuracy)
performance_data$Error_Rate <- 1 - performance_data$Accuracy
# Create formatted performance table
performance_table <- performance_data
performance_table[, 2:7] <- round(performance_table[, 2:7], 4)
print(performance_table)
## Model Accuracy Sensitivity Specificity AUC F1_Score Error_Rate
## 1 Stepwise Logistic 0.4524 0.4181 0.4867 0.5504 0.4325 0.5476
## 2 Random Forest 0.5175 0.5585 0.4767 0.5267 0.5361 0.4825
## 3 SVM 0.5058 0.4515 0.5600 0.4999 0.4770 0.4942
# Performance summary
model_comparison <- data.frame(
Model = c("Stepwise Logistic Regression", "Random Forest", "SVM (RBF)"),
AUC = round(c(stepwise_auc, rf_auc, svm_auc), 4)
)
# Sort by AUC
model_comparison <- model_comparison[order(model_comparison$AUC, decreasing = TRUE),]
print(model_comparison)
## Model AUC
## 1 Stepwise Logistic Regression 0.5504
## 2 Random Forest 0.5267
## 3 SVM (RBF) 0.4999
# ROC Curves Plot
plot(stepwise_roc, col = "blue", main = "ROC Curves Comparison", lwd = 2)
plot(rf_roc, col = "red", add = TRUE, lwd = 2)
plot(svm_roc, col = "green", add = TRUE, lwd = 2)
legend("bottomright",
legend = c(paste("Stepwise LR (AUC =", round(stepwise_auc, 3), ")"),
paste("Random Forest (AUC =", round(rf_auc, 3), ")"),
paste("SVM RBF (AUC =", round(svm_auc, 3), ")")),
col = c("blue", "red", "green"),
lty = 1, lwd = 2, cex = 0.8)
# Create accuracy bar plot ordered by descending accuracy
accuracy_plot <- ggplot(performance_data, aes(x = reorder(Model, -Accuracy), y = Accuracy, fill = Model)) +
geom_bar(stat = "identity", alpha = 0.7, width = 0.6) +
geom_text(aes(label = paste0(round(Accuracy * 100, 1), "%")),
vjust = -0.3, size = 4) +
labs(title = "Model Accuracy Comparison",
x = "Model",
y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 14),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11)) +
scale_y_continuous(limits = c(0, 0.6),
labels = scales::percent_format()) +
scale_fill_brewer(type = "qual", palette = "Set2")
print(accuracy_plot)
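Because the stated objective is expected net profit rather than raw accuracy, the validation confusion matrices above can also be translated into dollars, at $12.32 per correctly targeted donor and -$0.68 per wasted mailing. The sketch below uses the counts printed in the confusion matrices; note that because the validation set is balanced (about 50% donors rather than the population's 5.1%), these figures overstate real-world profit and are useful only for comparing models.

# Net profit implied by each validation confusion matrix (counts copied from above)
profit <- function(tp, fp) tp * 12.32 - fp * 0.68
data.frame(
  Model  = c("Stepwise Logistic", "Random Forest", "SVM (RBF)"),
  Profit = c(profit(125, 154), profit(167, 157), profit(135, 132))
)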
From your answer in (4), what do you think is the “best”
model?
Random Forest dominates in the most critical areas:

- Sensitivity (55.85%): the highest ability to identify actual donors, capturing roughly 14 percentage points more donors than stepwise logistic regression (41.81%) and about 10 points more than the SVM (45.15%).
- F1-score (0.5361): the best balance between precision and recall for donor prediction, clearly ahead of the logistic and support vector models.
- Error rate (48.25%): the lowest of the three, making it the most reliable overall.

Although stepwise logistic regression has the highest AUC (0.5504), the random forest is preferred here because identifying actual donors (sensitivity) is what drives campaign profit.
Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
# Load the future fundraising test data
future_fundraising <- read.csv("Final Project/future_fundraising.csv")
# Convert character variables to factors to match df data format
future_fundraising$zipconvert2 <- factor(future_fundraising$zipconvert2, levels = c("No", "Yes"))
future_fundraising$zipconvert3 <- factor(future_fundraising$zipconvert3, levels = c("No", "Yes"))
future_fundraising$zipconvert4 <- factor(future_fundraising$zipconvert4, levels = c("No", "Yes"))
future_fundraising$zipconvert5 <- factor(future_fundraising$zipconvert5, levels = c("No", "Yes"))
future_fundraising$homeowner <- factor(future_fundraising$homeowner, levels = c("No", "Yes"))
future_fundraising$female <- factor(future_fundraising$female, levels = c("No", "Yes"))
# Convert integer variables to numeric to match df data
future_fundraising$num_child <- as.numeric(future_fundraising$num_child)
future_fundraising$income <- as.numeric(future_fundraising$income)
future_fundraising$wealth <- as.numeric(future_fundraising$wealth)
future_fundraising$home_value <- as.numeric(future_fundraising$home_value)
future_fundraising$med_fam_inc <- as.numeric(future_fundraising$med_fam_inc)
future_fundraising$avg_fam_inc <- as.numeric(future_fundraising$avg_fam_inc)
future_fundraising$pct_lt15k <- as.numeric(future_fundraising$pct_lt15k)
future_fundraising$num_prom <- as.numeric(future_fundraising$num_prom)
future_fundraising$months_since_donate <- as.numeric(future_fundraising$months_since_donate)
future_fundraising$time_lag <- as.numeric(future_fundraising$time_lag)
# Check the updated structure
str(future_fundraising)
## 'data.frame': 120 obs. of 20 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 1 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ num_child : num 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num 5 1 4 4 2 4 2 3 4 2 ...
## $ female : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 1 1 ...
## $ wealth : num 9 7 1 8 7 8 1 8 3 5 ...
## $ home_value : num 1399 1355 835 1019 992 ...
## $ med_fam_inc : num 637 411 310 389 524 371 209 253 302 335 ...
## $ avg_fam_inc : num 703 497 364 473 563 408 259 285 324 348 ...
## $ pct_lt15k : num 1 9 22 15 6 10 36 25 19 14 ...
## $ num_prom : num 74 77 70 21 63 35 72 68 55 59 ...
## $ lifetime_gifts : num 102 249 126 26 100 92 146 98 66 276 ...
## $ largest_gift : num 6 15 6 16 20 37 12 5 7 15 ...
## $ last_gift : num 5 7 6 16 3 37 11 3 5 13 ...
## $ months_since_donate: num 29 35 34 37 21 37 36 32 30 33 ...
## $ time_lag : num 3 3 8 5 6 5 5 9 9 10 ...
## $ avg_gift : num 4.86 9.58 4.34 13 7.69 ...
# Make predictions using the Random Forest model
future_predictions <- predict(rf_model, future_fundraising, type = "response")
# Create submission dataframe with required format
submission <- data.frame(value = future_predictions)
# Display first few predictions
print(head(submission, 10))
## value
## 1 Donor
## 2 Donor
## 3 Donor
## 4 No Donor
## 5 Donor
## 6 No Donor
## 7 No Donor
## 8 Donor
## 9 Donor
## 10 No Donor
# Summary of predictions
table(submission$value)
##
## Donor No Donor
## 60 60
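A caveat on these predictions: because the model was trained on a balanced sample while the true donor rate is about 5.1%, the raw predicted probabilities overstate donor likelihood. If ranked scores rather than hard labels were needed, one standard correction is to rescale the predicted odds by the ratio of the population prior to the sample prior. The sketch below is illustrative and was not used for the submission.

# Adjust balanced-sample probabilities toward the 5.1% population donor rate (sketch)
p_bal <- predict(rf_model, future_fundraising, type = "prob")[, "Donor"]
p_bal <- pmin(pmax(p_bal, 1e-6), 1 - 1e-6)   # avoid division by zero
pop_prior    <- 0.051   # population donor rate
sample_prior <- 0.50    # donor share in the balanced training sample
odds_adj <- (p_bal / (1 - p_bal)) *
            (pop_prior / (1 - pop_prior)) / (sample_prior / (1 - sample_prior))
p_adj <- odds_adj / (1 + odds_adj)
head(round(p_adj, 3))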
Out of all the models, the only one that excluded predictors was the stepwise logistic regression model. The following predictors were excluded from the final model: zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, female, wealth, med_fam_inc, pct_lt15k, lifetime_gifts, largest_gift, time_lag, and avg_gift.
As mentioned earlier in the data description and cleaning, all character variables were converted to factors and all integer variables to numeric.

After running the stepwise logistic regression, random forest, and support vector machine models, the model that performed best at identifying actual donors with the lowest error rate was the random forest.
months_since_donate (p = 9.75e-10): the most significant predictor, measuring the number of months from the last donation to July 2018. Its positive coefficient (0.0689) means a longer time since the last donation increases the likelihood of being classified as a current donor, suggesting the model may be identifying donors who gave earlier but are still considered “active”.

income (p = 0.00106): household income in hundreds of dollars. Its negative coefficient (-0.0908) means higher household income reduces donation probability, suggesting donors may come from middle- or lower-income households rather than wealthy ones.

last_gift (p = 0.00327): the dollar amount of the most recent gift. Its positive coefficient (0.0126) means larger previous gifts predict future donations, which makes sense: generous past donors are likely to donate again.

home_value (p = 0.00849): the average home value in the donor’s neighborhood, in hundreds of dollars. Its negative coefficient (-0.0002) means higher neighborhood home values reduce donation likelihood, consistent with the income finding that wealthier neighborhoods are less likely to donate.

num_child (p = 0.0351): the number of children. Its positive coefficient (0.265) means more children increases donation probability.
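The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. Note also that, because glm() models the probability of the second factor level ("No Donor" here), the signs above may describe the odds of not donating rather than donating, so the level check suggested earlier applies to these interpretations as well. A short sketch:

# Odds ratios (with profile-likelihood confidence intervals) for the stepwise model
round(exp(cbind(OddsRatio = coef(stepwise_model), confint(stepwise_model))), 3)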
# Write to CSV file for submission
write.csv(submission, "future_fundraising_predictions.csv", row.names = FALSE)
# Verify the written file by reading it back
submission_check <- read.csv("future_fundraising_predictions.csv")
head(submission_check, 10)