Predictive Modeling Final Project

library(readr)
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(rpart) # DT
library(tidyverse)
library(tree)
library(Metrics)
library(caret)
library(car)
library(kernlab)
library(MASS)
library(ROCR)
library(performance)
library(randomForest)
library(corrplot)
library(ggcorrplot) # corr plot
library(reshape2) # heat map
library(GGally) # Pais ggplot2
library(kableExtra) # Kabble tables
library(e1071) # Tune SVM
library(broom) # Tidy function
library("viridis") # heat map
library(naivebayes)

Business Problem

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Business Objective & Goals

Explore different models and find the best classification model that will lead to maximize the net profit. The objective is to develop a predictive model for national veterans’ organization. The model is to improve the cost effectiveness of their direct marketing campaign. This will predict which of their donors should receive a mail campaign letter.

Data

The data was provided as 2 .rds files:

Fundraise_data: 3,000 observations and 20 predictor variables (6 factor and 14 numerical) and 1 response variable (Factor)
- zip convert 1: 00000–19999
- zip convert 2: 20000–39999
- zip convert 3: 40000–59999
- zip convert 4: 60000–79999
- zip convert 5: 80000–99999
- homeowner: Yes = homeowner, No = not a homeowner
- num_child: Number of children income Household income
- female: No = male, Yes = female
- wealth: Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state
- home_value: Average home value in potential donor’s neighborhood in hundreds of dollars
- med_fam_inc Median family income in potential donor’s neighborhood in hundreds of dollars
- avg_fam_inc: Average family income in potential donor’s neighborhood in hundreds
- pct_lt15k: Percent earning less than $15K in potential donor’s neighborhood
- num_prom: Lifetime number of promotions received to date
- lifetime_gifts Dollar amount of lifetime gifts to date
- largest_gift: Dollar amount of largest gift to date
- last_gift Dollar amount of most recent gift
- months_since_donate: Number of months from last donation to July 2018
- time_lag Number of months between first and second gift
- avg_gift: Average dollar amount of gifts to date
- target Outcome variable: binary indicator for response Yes = donor, No = non-donor
Future_fundraiser: 120 observations and 20 predictor variables, no response variable, since it is unknown.

Loading the data

fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")

fundraise_data = as.data.frame(fundraise_data)

str(fundraise_data)

## 'data.frame':    3000 obs. of  21 variables:
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num  1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num  1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num  7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num  698 828 1471 547 482 ...
##  $ med_fam_inc        : num  422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num  463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num  4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num  46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num  94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num  12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num  12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num  34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num  6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num  9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

Preview of the Data

summary(fundraise_data)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

Comments: From the summary above we can do some pre-analysis, firstly by seeing that our target variable is balanced. We can get a glance that there the variables such as: lifetime_gifts, largest_gift, last_gift and avg_gift are skewed and might be significant, it will be important to do some exploratory analysis on them, and later to see if they are significant by running the models.

tbl_donor = table(fundraise_data$target)
cbind(tbl_donor, round(prop.table(tbl_donor)*100,2))

##          tbl_donor      
## Donor         1499 49.97
## No Donor      1501 50.03

Comments: The data is balanced, with an almost 50/50 split as we can see from the last table, which will definitely be an advantage for training purposes of our models.

fundraise_data$zipconvert2 = as.factor(fundraise_data$zipconvert2)

Exploratory Analysis

Heatmap - correlation

We will start our analysis with a heatmap to help us find any possible correlation between our numeric variables.

cormat = round(cor(fundraise_data[,-c(1:5,8,21)]),2)
melted_cormat = melt(cormat)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_fill_viridis(option = "A")

Comments: The first thing that stands out is the pct_lt15k variable, which is the one with the lowest correlation with multiple variables. The variables that we can see are strongly correlated between each other are: avg_fam_inc and med_fam_inc (for obvious reasons, since one is the calculation of the other), however, they are also correlated with home_value, income and wealth. When it comes to the gifts section, the ones correlated are last_gift with avg_gift, and also life_time with num_prom and largest_gift.

Pairwise Plots

From the Heat map we got some interesting insights that deserve to be looked at with more detail. We will use the pairwise plots matrix, to analyze them. The variables that we will look at, are the ones that showed the highest correlation in the previous heat map.

ggpairs(fundraise_data[, c(10, 11, 12, 15, 17, 20)], ggplot2::aes(colour = fundraise_data$target))

Comments: We can confirm from the pairs plots what we saw in the pre-analysis, where the variables lifetime_gifts, last_gift and avg_gift are skewed. We can also confirm the Correlation between avg_fam_inc and med_fam_inc , but also with the target (response variable).

Box plots / Bar plots

Zip codes

Now, for us to check the categorical varaibles, we will use some box plots and bar plots. For that I need to revert the one-hot-encoding process done with the zipconvert column, also I found that we have 4 observations that are not part of any zipconvert category.

x = ifelse(fundraise_data$zipconvert2 =="No", 
           ifelse(fundraise_data$zipconvert3 =="No", 
                  ifelse(fundraise_data$zipconvert4 == "No", 
                         ifelse(fundraise_data$zipconvert5 == "No", 0, "zipconvert5"), "zipconvert4"), "zipconvert3"), "zipconvert2")

x = as.factor(x)
x = droplevels(x)

table(x) # to verify if my reverse one-hot-encoding matches the summary() results

## x
##           0 zipconvert2 zipconvert3 zipconvert4 zipconvert5 
##           4         648         551         643        1154

# mutate the column just created with my data frame

fundraise_data = fundraise_data %>% 
  mutate(zipconvertAll = ifelse(fundraise_data$zipconvert2 =="No", 
           ifelse(fundraise_data$zipconvert3 =="No", 
                  ifelse(fundraise_data$zipconvert4 == "No", 
                         ifelse(fundraise_data$zipconvert5 == "No", 0, "zipconvert5"), "zipconvert4"), 
                  "zipconvert3"), "zipconvert2"))

fundraise_data$zipconvertAll = as.factor(fundraise_data$zipconvertAll)

Bar Plot

Just to see how the Donors and No Donor are distribuited throughout the zip codes.

ggplot(fundraise_data, aes(x = zipconvertAll, fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))

Comments: From the bar plot we can see that the Donors and Non-donors are equally distributed. And the only zipcode that has more donors than non-donors is ZipConvert5.

Gender

The last variable that is worth taking a look is Gender (Female).

ggplot(fundraise_data, aes(x = ifelse(female=="Yes", "Female", "Male"), fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1)) +
  labs(x = "Donors by Gender")

Comments: From here we can see that women are more likely to donate than males. And also we see that our database of “members”, has more females than males.

Feature Selection

Colinearity and Significance

Logistic Regression - Significance and VIF

Usually for my projects I like to run the basic models (either Linear Regression for continuous responses or Logistic Regression for Binomial responses). These models give a very good first glance at which variables are significant, and can help us when we compare that information VS the Random Forest Variable Importance.

For reporting purposes, I won’t show my full model, since only 3 variables were significant: months_since_donate, num_child, and income, this might indicate that we have some colinearity that we should get rid of first.

After running the full model for the first time, I deleted both zipconvertAll and zipconvert3, because adding all the zipconvert columns would create a very high VIF coefficient.

glm.fit.full = glm(target ~ . -zipconvertAll -zipconvert3, data = fundraise_data, family = binomial)

vif.glm.full = vif(glm.fit.full)

vif.glm.full %>% 
  kable(col.names = c("VIF coefficient")) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))

	VIF coefficient
zipconvert2	1.754935
zipconvert4	1.726092
zipconvert5	2.386175
homeowner	1.133303
num_child	1.026263
income	1.312074
female	1.016380
wealth	1.521926
home_value	3.283724
med_fam_inc	18.773974
avg_fam_inc	21.009871
pct_lt15k	2.100204
num_prom	1.962842
lifetime_gifts	2.137244
largest_gift	2.104440
last_gift	3.950453
months_since_donate	1.132120
time_lag	1.037998
avg_gift	4.235878

Comments: We can see that med_fam_inc and avg_fam_inc like we saw in the previous exploratory analysis are highly correlated and one of them should go. Before deciding which one of them should. I will run a random forest model (again, full model) and do the Variable Importance to see which one is more important.

Variable Importance

I will use the full data set for this process.

set.seed(12345)
forest1 = train(target ~ . -zipconvertAll -time_lag, 
               data = fundraise_data, 
               method = "rf", 
               importance = TRUE, 
               trControl = trainControl(method = "cv", number = 10))

forest1

## Random Forest 
## 
## 3000 samples
##   21 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.5320012  0.06404249
##   10    0.5313334  0.06268251
##   19    0.5243356  0.04868397
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Comments: Here we can see the accuracy of the model going down from 2 predictors, to 13, and then with 24 predictors. When running the Random Forest with no tuning or without any cross-validation method, the highest accuracy rate we get is: 54.13%. Really low, but we will run a preliminary Variable Importance analysis.

plot(forest1, m.target = "NULL",
  plots.one.page = TRUE, sorted = TRUE, verbose = F)

Variable Importance Plot

From the variable importance plot, we can see that the variables that we should for sure keep are: months_since_donate and largest_gift. Also from the VIF test that we got colinearity between med_fam_inc and avg_fam_inc, according to this plot, we should keep med_fam_inc.

plot(varImp(forest1))

Methodology

Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).

set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train.df = fundraise_data[train_index, ]
test.df = fundraise_data[-train_index, ]

Models

Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.

Logistic Regression

I will start with the full model like in the Feature selection, as well I am excluding the new variable I created (-zipconvertAll).

glm.fit.null = glm(target ~ 1, data = fundraise_data, family = binomial)
glm.fit.full = glm(target ~ . -zipconvertAll, data = fundraise_data, family = binomial)

step.AIC = step(glm.fit.null, scope = list(upper=glm.fit.full),
     direction ="both",test ="Chisq", trace = F)

summary(step.AIC)

## 
## Call:
## glm(formula = target ~ months_since_donate + last_gift + income + 
##     num_child + num_prom, family = binomial, data = fundraise_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9735  -1.1544   0.6177   1.1603   1.7317  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.908023   0.368624  -5.176 2.27e-07 ***
## months_since_donate  0.059997   0.009927   6.044 1.51e-09 ***
## last_gift            0.012379   0.003914   3.163  0.00156 ** 
## income              -0.073489   0.022964  -3.200  0.00137 ** 
## num_child            0.283879   0.112839   2.516  0.01188 *  
## num_prom            -0.002907   0.001711  -1.699  0.08937 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4076.8  on 2994  degrees of freedom
## AIC: 4088.8
## 
## Number of Fisher Scoring iterations: 4

Train Logistic Regression Model

We will train and test our Logistic Regression based on the AIC selected model.

glm.fit = glm(target ~ months_since_donate + last_gift + income + 
    num_child + num_prom, family = binomial, data = train.df)

Predictions on the Test data set

glm.probs = predict.glm(step.AIC, newdata = test.df, type="response")
glm.preds = as.factor(ifelse(glm.probs >= 0.5, "Donor", "No Donor"))

confusionMatrix(test.df$target, glm.preds)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      118      172
##   No Donor   164      146
##                                           
##                Accuracy : 0.44            
##                  95% CI : (0.3998, 0.4808)
##     No Information Rate : 0.53            
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : -0.1222         
##                                           
##  Mcnemar's Test P-Value : 0.7025          
##                                           
##             Sensitivity : 0.4184          
##             Specificity : 0.4591          
##          Pos Pred Value : 0.4069          
##          Neg Pred Value : 0.4710          
##              Prevalence : 0.4700          
##          Detection Rate : 0.1967          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.4388          
##                                           
##        'Positive' Class : Donor           
##

Comments: Our test accuracy for the variables selected by the AIC method is 44.00%.

Random Forest

Repeated CV Random Forest

set.seed(12345)
train_control = trainControl(method="repeatedcv", number=10,repeats=3)

best.rf = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
                data = train.df,
                method = 'rf',
                trControl = train_control,
                importance = TRUE)

pred.best.rf = predict(best.rf, test.df)
confusionMatrix(pred.best.rf, test.df$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      167      145
##   No Donor   123      165
##                                           
##                Accuracy : 0.5533          
##                  95% CI : (0.5125, 0.5936)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.03938         
##                                           
##                   Kappa : 0.1079          
##                                           
##  Mcnemar's Test P-Value : 0.19957         
##                                           
##             Sensitivity : 0.5759          
##             Specificity : 0.5323          
##          Pos Pred Value : 0.5353          
##          Neg Pred Value : 0.5729          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2783          
##    Detection Prevalence : 0.5200          
##       Balanced Accuracy : 0.5541          
##                                           
##        'Positive' Class : Donor           
##

KNN (K-Nearest Neighbors)

For this model we will create a train_ctrl variable, for the cross-validation.

set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)

knn.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
                data=train.df,
                method='knn',
                trControl = train_ctrl,
                tuneLength=20)

pred.probs.knn = predict(knn.fit, test.df)

confusionMatrix(as.factor(pred.probs.knn), test.df$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      179      152
##   No Donor   111      158
##                                           
##                Accuracy : 0.5617          
##                  95% CI : (0.5209, 0.6018)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.01509         
##                                           
##                   Kappa : 0.1263          
##                                           
##  Mcnemar's Test P-Value : 0.01364         
##                                           
##             Sensitivity : 0.6172          
##             Specificity : 0.5097          
##          Pos Pred Value : 0.5408          
##          Neg Pred Value : 0.5874          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2983          
##    Detection Prevalence : 0.5517          
##       Balanced Accuracy : 0.5635          
##                                           
##        'Positive' Class : Donor           
##

Comments: Accuracy for K-Nearest Neighbor is the best one so far.

Naive Bayes

The last model I will perform is the Naive Bayes, since it has been proven that it has a great performance.

set.seed(12345)
nb.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom, 
                data = train.df, method = "naive_bayes", 
                trControl = train_ctrl)
nb.fit

## Naive Bayes 
## 
## 2400 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.5491628  0.09476239
##    TRUE      0.5394503  0.07420510
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.

pred.probs.nb = predict(nb.fit, test.df)

confusionMatrix(as.factor(pred.probs.nb), test.df$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      236      227
##   No Donor    54       83
##                                           
##                Accuracy : 0.5317          
##                  95% CI : (0.4908, 0.5722)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.2438          
##                                           
##                   Kappa : 0.08            
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8138          
##             Specificity : 0.2677          
##          Pos Pred Value : 0.5097          
##          Neg Pred Value : 0.6058          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3933          
##    Detection Prevalence : 0.7717          
##       Balanced Accuracy : 0.5408          
##                                           
##        'Positive' Class : Donor           
##

Comments: Accuracy for Naive Bayes is lower than K-Nearest Neighbor, here is 53.17%.

Model Performance

accuracy.all = data.frame(0.44,0.5533,0.5617,0.5317)
colnames(accuracy.all) <- c("Logistic Regression","Random Forest","K-Nearest Neighbor","Naive Bayes")

accuracy.all %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped","hover"))

Logistic Regression	Random Forest	K-Nearest Neighbor	Naive Bayes
0.44	0.5533	0.5617	0.5317

Best Model and Conclusions

The best model we could find was the K-Nearest Neighbor with a k-fold of 10. The accuracy was 56.17% in the Test data.

Important factors to consider when targeting people who may be a donor are the number of months from the last donation to July 2018, the dollar amount of the largest gift to date, number of children, dollar amount of most recent gift, percent earnings that are less than $15k in potential donor’s neighborhood, household income, and wealth ratings.

Testing the future file

set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)

knn.fit.final = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
                data=fundraise_data,
                method='knn',
                trControl = train_ctrl,
                tuneLength=20)

pred.probs.knn = predict(knn.fit.final, future_fundraise_data)

write.table(pred.probs.knn, file = "predictions_knn_final.csv", col.names = c("value"), row.names = FALSE)