Lauren Hall (DDV339), Christian Rivera (HVQ288), Erwin Gonzalez (KJJ953)

Background: The Fundraising file comes from National Veterans organization with the goal of minimizing direct marketing campaining expenditures through predictive modeling. This will be achieved by leveraging a balanced dataset to train a model with a target variable of donor vs. not donor. Various classification methods will be attempted on the dataset such as: Logistic Regression, Random Forest, and XGBoost.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.2
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071) 
## Warning: package 'e1071' was built under R version 4.3.3
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.3
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(yardstick)
## Warning: package 'yardstick' was built under R version 4.3.2
## 
## Attaching package: 'yardstick'
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:readr':
## 
##     spec
library(readr)
library(dplyr)

Step 1: Partitioning 80/20 split training and testing dataset

The data is split 80/20 so that there is a testable portion to be utilized in our classification models. There are 2400 observations for the train data and there are 600 observations in the test data and they both contain 24 variables.

# Load data

data <- read.csv("C:/Users/lclha/Documents/MSDA Program 2024-2026/Summer 2025/Predictive Modeling/Final Project/fundraising.csv")


set.seed(1)

# 80% training indices
train_indices <- sample(1:nrow(data), size = 0.8 * nrow(data))

# Split the data 80/20
train_data <- data[train_indices, ]
test_data   <- data[-train_indices, ]

Step 2: Model Building

1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other.

In the fundraising data there is 3000 observations and 21 variables with a mixture of 7 categorical and 14 numerical variables.

str(data)
## 'data.frame':    3000 obs. of  21 variables:
##  $ zipconvert2        : chr  "Yes" "No" "No" "No" ...
##  $ zipconvert3        : chr  "No" "No" "No" "Yes" ...
##  $ zipconvert4        : chr  "No" "No" "No" "No" ...
##  $ zipconvert5        : chr  "No" "Yes" "Yes" "No" ...
##  $ homeowner          : chr  "Yes" "No" "Yes" "Yes" ...
##  $ num_child          : int  1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : int  1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : chr  "No" "Yes" "No" "No" ...
##  $ wealth             : int  7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : int  698 828 1471 547 482 857 505 1438 1316 428 ...
##  $ med_fam_inc        : int  422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : int  463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : int  4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : int  46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num  94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num  12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num  12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: int  34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : int  6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num  9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : chr  "Donor" "Donor" "No Donor" "No Donor" ...
summary(data)
##  zipconvert2        zipconvert3        zipconvert4        zipconvert5       
##  Length:3000        Length:3000        Length:3000        Length:3000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   homeowner           num_child         income         female         
##  Length:3000        Min.   :1.000   Min.   :1.000   Length:3000       
##  Class :character   1st Qu.:1.000   1st Qu.:3.000   Class :character  
##  Mode  :character   Median :1.000   Median :4.000   Mode  :character  
##                     Mean   :1.069   Mean   :3.899                     
##                     3rd Qu.:1.000   3rd Qu.:5.000                     
##                     Max.   :5.000   Max.   :7.000                     
##      wealth        home_value      med_fam_inc      avg_fam_inc    
##  Min.   :0.000   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0   1st Qu.: 318.0  
##  Median :8.000   Median : 816.5   Median : 355.0   Median : 396.0  
##  Mean   :6.396   Mean   :1143.3   Mean   : 388.4   Mean   : 432.3  
##  3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0   3rd Qu.: 516.0  
##  Max.   :9.000   Max.   :5945.0   Max.   :1500.0   Max.   :1331.0  
##    pct_lt15k        num_prom      lifetime_gifts    largest_gift    
##  Min.   : 0.00   Min.   : 11.00   Min.   :  15.0   Min.   :   5.00  
##  1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0   1st Qu.:  10.00  
##  Median :12.00   Median : 48.00   Median :  81.0   Median :  15.00  
##  Mean   :14.71   Mean   : 49.14   Mean   : 110.7   Mean   :  16.65  
##  3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0   3rd Qu.:  20.00  
##  Max.   :90.00   Max.   :157.00   Max.   :5674.9   Max.   :1000.00  
##    last_gift      months_since_donate    time_lag         avg_gift      
##  Min.   :  0.00   Min.   :17.00       Min.   : 0.000   Min.   :  2.139  
##  1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000   1st Qu.:  6.333  
##  Median : 10.00   Median :31.00       Median : 5.000   Median :  9.000  
##  Mean   : 13.48   Mean   :31.13       Mean   : 6.876   Mean   : 10.669  
##  3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000   3rd Qu.: 12.800  
##  Max.   :219.00   Max.   :37.00       Max.   :77.000   Max.   :122.167  
##     target         
##  Length:3000       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
data <- data %>% 
  mutate(across(where(is.character),as.factor))

data$target <- as.factor(data$target)

# Separate numeric and categorical predictors
numeric_vars <- data %>% dplyr::select(where(is.numeric))
cat_vars <- data %>% select(where(~!is.numeric(.)))

# Histogram of numeric variables
numeric_vars %>% 
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free") +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal()

# Boxplots of numeric predictors by response (for classification)
if (is.factor(data$target)) {
  numeric_vars %>%
    mutate(response = data$target) %>%
    pivot_longer(-response) %>%
    ggplot(aes(x = response, y = value)) +
    geom_boxplot() +
    facet_wrap(~name, scales = "free") +
    theme_minimal()
}

# Select Yes/No variables
yesno_vars <- data %>%
  select(homeowner, female, zipconvert2, zipconvert3, zipconvert4, zipconvert5) 

# Reshape and plot
yesno_vars %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = value)) +
  geom_bar() +
  facet_wrap(~name, scales = "free_y") +
  labs(x = "Category", y = "Count") +
  theme_minimal()

# Correlation matrix among numeric predictors
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)

# Identify high correlations
high_corr <- findCorrelation(cor_matrix, cutoff = 0.75)
if (length(high_corr) > 0) {
  cat("Highly collinear variables:/n")
  print(names(numeric_vars)[high_corr])
}
## Highly collinear variables:/n[1] "avg_fam_inc" "avg_gift"

Step 2.1:

Potentially good predictors would be the following: income, wealth, time lag, num_prom, months_since_donate, avg_fam_inc, home_value,avg_gift,med_fam_inc, female, homeowner.Yes there are high collinearility which are identified as any variables with a correlation greater than 0.75 which are “avg_fam_inc” and “avg_gift”.

Step 2.2: Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.)

Logistic Regresssion: This method is a classification method used for binary outcomes and estimates the log-odds of the target class. The predictor variables used were income, wealth, time lag, num_prom, months_since_donate, avg_fam_inc, home_value,avg_gift,med_fam_inc, female, homeowner and target variable being target (donor vs not donor).

collinear_vars <- colnames(numeric_vars)[high_corr]
data_reduced <- data %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner, target)
# Ensure the same reduction applied to both train and test sets
train_data_reduced <- train_data %>% select(all_of(colnames(data_reduced)))
test_data_reduced  <- test_data %>% select(all_of(colnames(data_reduced)))

# Ensure target is a factor in both sets
train_data_reduced$target <- as.factor(train_data_reduced$target)
test_data_reduced$target <- as.factor(test_data_reduced$target)
# Ensure target is a factor with correct levels
test_data_reduced$target <- as.factor(test_data_reduced$target)

log_model <- glm (target ~ ., data = data_reduced, family = "binomial")
# Check levels of the actual target
levels(test_data_reduced$target)
## [1] "Donor"    "No Donor"
# e.g., should be something like: "Donor", "No Donor"

# Predict probabilities
log_probs <- predict(log_model, test_data_reduced, type = "response")

# Map probabilities to the correct labels
log_preds <- ifelse(log_probs > 0.5, "Donor", "No Donor") %>% as.factor()

# Match levels explicitly
log_preds <- factor(log_preds, levels = levels(test_data_reduced$target))

# Evaluate
confusionMatrix(log_preds, test_data_reduced$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      141      180
##   No Donor   159      120
##                                           
##                Accuracy : 0.435           
##                  95% CI : (0.3949, 0.4757)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.9994          
##                                           
##                   Kappa : -0.13           
##                                           
##  Mcnemar's Test P-Value : 0.2774          
##                                           
##             Sensitivity : 0.4700          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.4393          
##          Neg Pred Value : 0.4301          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2350          
##    Detection Prevalence : 0.5350          
##       Balanced Accuracy : 0.4350          
##                                           
##        'Positive' Class : Donor           
## 

Random Forest Model: This method builds multiple decision trees and combines their predictions randomly to improve accuracy and reduce overfitting. Same predictor variables were used as the ones in logistic regression and same target variable.

rf_model <- randomForest(target ~ ., data = train_data_reduced, ntree = 500, importance = TRUE)

# Predict class labels
rf_preds <- predict(rf_model, test_data_reduced)

# Evaluate performance
confusionMatrix(rf_preds, test_data_reduced$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      166      143
##   No Donor   134      157
##                                           
##                Accuracy : 0.5383          
##                  95% CI : (0.4975, 0.5788)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.03305         
##                                           
##                   Kappa : 0.0767          
##                                           
##  Mcnemar's Test P-Value : 0.63075         
##                                           
##             Sensitivity : 0.5533          
##             Specificity : 0.5233          
##          Pos Pred Value : 0.5372          
##          Neg Pred Value : 0.5395          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2767          
##    Detection Prevalence : 0.5150          
##       Balanced Accuracy : 0.5383          
##                                           
##        'Positive' Class : Donor           
## 
# View variable importance
varImpPlot(rf_model)

XGBoost Model: This model os a boosting-based classifier tha builds trees sequentially to correct previous errors. Same predictor variables were used and the same target variable. Target variables were converted into dummy variables of 1 or 0.

data_reduced <- data %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner, target) %>%
  mutate(across(c(female, homeowner), as.factor),
         target = as.factor(target))

# Step 2: Encode categorical vars as numeric (required for xgboost)
data_reduced$target <- ifelse(data_reduced$target == "Donor", 1, 0)

dummies <- dummyVars(target ~ ., data = data_reduced)
data_xgb <- predict(dummies, newdata = data_reduced) %>% as.data.frame()
data_xgb$target <- data_reduced$target

# Step 3: Split into train/test
set.seed(123)
train_indices <- sample(1:nrow(data_xgb), 0.8 * nrow(data_xgb))
train_data <- data_xgb[train_indices, ]
test_data  <- data_xgb[-train_indices, ]

# Step 4: Prepare matrices for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -ncol(train_data)]), label = train_data$target)
test_matrix  <- xgb.DMatrix(data = as.matrix(test_data[, -ncol(test_data)]), label = test_data$target)

# Step 5: Train XGBoost model
xgb_model <- xgboost(data = train_matrix,
                     objective = "binary:logistic",
                     eval_metric = "logloss",
                     nrounds = 100,
                     verbose = 0)

# Step 6: Predict probabilities on test set
xgb_probs <- predict(xgb_model, newdata = test_matrix)

# Step 7: Adjust threshold for higher specificity (e.g., 0.7)
xgb_preds <- ifelse(xgb_probs > 0.7, 1, 0) %>% factor(levels = c(0, 1))
true_labels <- test_data$target %>% factor(levels = c(0, 1))

# Step 8: Evaluate with confusion matrix
confusionMatrix(xgb_preds, true_labels, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 226 238
##          1  62  74
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4592, 0.5408)
##     No Information Rate : 0.52            
##     P-Value [Acc > NIR] : 0.8465          
##                                           
##                   Kappa : 0.0214          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2372          
##             Specificity : 0.7847          
##          Pos Pred Value : 0.5441          
##          Neg Pred Value : 0.4871          
##              Prevalence : 0.5200          
##          Detection Rate : 0.1233          
##    Detection Prevalence : 0.2267          
##       Balanced Accuracy : 0.5110          
##                                           
##        'Positive' Class : 1               
## 

Step 2.3: Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?

Using an equal weighted sample between donor and non-donor creates a balanced data set which reduces bias. This can improve the model learning of classes that might be overlooked in a real world scenario. As a result, the confusion matrix provides a more accurate reflection of the model’s performance.

Step 2.4: Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?

# Logistic Regression
log_cm <- confusionMatrix(log_preds, test_data_reduced$target)
log_metrics <- tibble(
  Model = "Logistic Regression",
  Accuracy = log_cm$overall["Accuracy"],
  Sensitivity = log_cm$byClass["Sensitivity"],
  Specificity = log_cm$byClass["Specificity"],
  F1 = log_cm$byClass["F1"]
)

# Random Forest
rf_cm <- confusionMatrix(rf_preds, test_data_reduced$target)
rf_metrics <- tibble(
  Model = "Random Forest",
  Accuracy = rf_cm$overall["Accuracy"],
  Sensitivity = rf_cm$byClass["Sensitivity"],
  Specificity = rf_cm$byClass["Specificity"],
  F1 = rf_cm$byClass["F1"]
)


# XGBoost
xgb_cm <- confusionMatrix(xgb_preds, true_labels, positive = "1")
xgb_metrics <- tibble(
  Model = "XGBoost",
  Accuracy = xgb_cm$overall["Accuracy"],
  Sensitivity = xgb_cm$byClass["Sensitivity"],
  Specificity = xgb_cm$byClass["Specificity"],
  F1 = xgb_cm$byClass["F1"]
)

# Combine into one table
all_metrics <- bind_rows(log_metrics, rf_metrics, xgb_metrics)

# Display table
print(all_metrics)
## # A tibble: 3 × 5
##   Model               Accuracy Sensitivity Specificity    F1
##   <chr>                  <dbl>       <dbl>       <dbl> <dbl>
## 1 Logistic Regression    0.435       0.47        0.4   0.454
## 2 Random Forest          0.538       0.553       0.523 0.545
## 3 XGBoost                0.5         0.237       0.785 0.330

Random Forest is the most consistent of the three models and has the highest Accuracy, Sensitivity, and F1 Score.

Step 2.5: Select best model. From your answer in (4), what do you think is the “best” model?

We choose the XGBoost model because it has the highest specificity score (0.785). The specificity score is the percentage of individuals that are accurately predicted to not donate. This is useful information when the goal is to minimize costs and maximize profits efficiently so there are no wasted marketing expenditures to these individuals.

Step 3: Testing using Future Fundraising Test File:

XGBoost model will be performed showing donor predictions below:

# Step 0: (Recreate dummies object properly if not already)
dummies <- dummyVars(~ ., data = data_reduced %>% select(-target))

# Step 1: Load future data
future_data <- read.csv("C:/Users/lclha/Documents/MSDA Program 2024-2026/Summer 2025/Predictive Modeling/Final Project/future_fundraising.csv")

# Step 2: Select relevant predictors
future_data_xgb <- future_data %>%
  select(income, wealth, time_lag, num_prom, months_since_donate,
         home_value, med_fam_inc, female, homeowner)

# Step 3: Ensure correct types
future_data_xgb <- future_data_xgb %>%
  mutate(
    female = as.factor(female),
    homeowner = as.factor(homeowner),
    across(c(income, wealth, time_lag, num_prom, months_since_donate,
             home_value, med_fam_inc), as.numeric)
  )

# Step 4: Apply dummy encoding
future_matrix_df <- predict(dummies, newdata = future_data_xgb) %>% as.data.frame()

# Step 5: Convert to DMatrix
future_matrix <- xgb.DMatrix(data = as.matrix(future_matrix_df))

# Step 6: Predict probabilities
xgb_probs_future <- predict(xgb_model, newdata = future_matrix)

# Step 7: Convert to class predictions (adjust threshold if needed)
xgb_preds_future <- ifelse(xgb_probs_future > 0.7, "Donor", "No Donor")

# Step 8: Save results
xgb_results <- future_data_xgb %>%
  mutate(XGBoost_Probability = xgb_probs_future,
         XGBoost_Prediction = xgb_preds_future)

write.csv(xgb_results, "xgb_future_predictions.csv", row.names = FALSE)

# Preview output
head(xgb_results)

Results: The XGBoost model succesfully predicted donor vs. not donor 53% are likely to be donors when submitted into the leaderboard.