Modeling Competition

Background A national veterans’ organization wishes to develop a predictive model to improve the costeffectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Business goals and objectives objective Develop a categorization model to help maximize expected net profit by predicting who will be more likely to donate in direct mail fundraising campaigns. target Improve national veterans organizations through the use of data analytics to help them achieve cost effectiveness in their direct marketing campaigns. Data sources and data used The fundraising file used contains 3,000 records. About 50 percent of donors and 50 percent of non-donors were recorded.

Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345)

library(ISLR)
library(tidyverse)
library(MASS)
library(ResourceSelection)
library(caret)
library(e1071)
library(kernlab)

fundraising <- read_rds("C:/Users/yuan1/Downloads/fundraising.rds")
future_fundraising <- read_rds("C:/Users/yuan1/Downloads/future_fundraising.rds")

set.seed(12345)
train_index = sample(1:nrow(fundraising), round(nrow(fundraising) * 0.80))
train = fundraising[train_index, ]
test = fundraising[-train_index, ]

Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.

1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?

summary(train)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child   
##  No :1897    Yes: 450    No :1887    No :1468    Yes:1848   Min.   :1.00  
##  Yes: 503    No :1950    Yes: 513    Yes: 932    No : 552   1st Qu.:1.00  
##                                                             Median :1.00  
##                                                             Mean   :1.07  
##                                                             3rd Qu.:1.00  
##                                                             Max.   :4.00  
##      income      female         wealth        home_value    med_fam_inc    
##  Min.   :1.000   Yes:1471   Min.   :0.000   Min.   :   0   Min.   :   0.0  
##  1st Qu.:3.000   No : 929   1st Qu.:5.000   1st Qu.: 554   1st Qu.: 277.8  
##  Median :4.000              Median :8.000   Median : 811   Median : 355.0  
##  Mean   :3.912              Mean   :6.349   Mean   :1146   Mean   : 387.6  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1358   3rd Qu.: 465.2  
##  Max.   :7.000              Max.   :9.000   Max.   :5945   Max.   :1500.0  
##   avg_fam_inc     pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318   1st Qu.: 6.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 397   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432   Mean   :14.71   Mean   : 49.13   Mean   : 110.9  
##  3rd Qu.: 519   3rd Qu.:21.00   3rd Qu.: 64.00   3rd Qu.: 134.6  
##  Max.   :1331   Max.   :90.00   Max.   :144.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.76   Mean   : 13.48   Mean   :31.16       Mean   : 6.915  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :125.00   Max.   :37.00       Max.   :62.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1209  
##  1st Qu.:  6.400   No Donor:1191  
##  Median :  9.160                  
##  Mean   : 10.700                  
##  3rd Qu.: 12.918                  
##  Max.   :100.000

str(train)

## tibble [2,400 × 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 2 2 2 1 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 1 2 2 1 1 1 1 1 2 ...
##  $ num_child          : num [1:2400] 2 1 1 1 1 1 1 1 2 1 ...
##  $ income             : num [1:2400] 4 4 2 1 3 3 4 7 2 3 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 1 2 ...
##  $ wealth             : num [1:2400] 3 3 4 4 8 8 7 8 8 4 ...
##  $ home_value         : num [1:2400] 541 1229 444 442 2702 ...
##  $ med_fam_inc        : num [1:2400] 335 359 196 315 637 273 437 463 374 295 ...
##  $ avg_fam_inc        : num [1:2400] 367 490 263 343 695 331 454 597 434 319 ...
##  $ pct_lt15k          : num [1:2400] 13 10 38 24 2 21 9 13 3 19 ...
##  $ num_prom           : num [1:2400] 63 39 36 52 16 54 71 30 22 82 ...
##  $ lifetime_gifts     : num [1:2400] 91 35 178 134 20 110 118 57 29 242 ...
##  $ largest_gift       : num [1:2400] 10 15 20 20 20 9 10 13 15 12 ...
##  $ last_gift          : num [1:2400] 10 15 20 20 20 5 10 11 15 9 ...
##  $ months_since_donate: num [1:2400] 37 34 37 30 37 33 30 35 30 32 ...
##  $ time_lag           : num [1:2400] 4 13 0 10 5 4 3 2 6 7 ...
##  $ avg_gift           : num [1:2400] 6.5 11.67 9.89 16.75 20 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 2 2 2 1 2 2 1 1 1 1 ...

cor_train = train[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(cor_train)
cor(correlation)

##                       num_child      income       wealth  home_value
## num_child            1.00000000  0.05324112  0.121260748 -0.08781965
## income               0.05324112  1.00000000  0.527787162  0.58269772
## wealth               0.12126075  0.52778716  1.000000000  0.61785407
## home_value          -0.08781965  0.58269772  0.617854066  1.00000000
## med_fam_inc         -0.01755322  0.64991520  0.690415307  0.94969773
## avg_fam_inc         -0.01806456  0.65298184  0.693733960  0.95187155
## pct_lt15k           -0.04152967 -0.66561065 -0.713586018 -0.84639070
## num_prom            -0.23341537 -0.31727614 -0.702114485 -0.30241433
## lifetime_gifts      -0.27064886 -0.35807130 -0.606524506 -0.35859538
## largest_gift        -0.24197375 -0.24710454 -0.289981618 -0.21609775
## last_gift           -0.20667139 -0.02352432  0.007317074  0.01300455
## months_since_donate -0.06846089  0.03186073  0.139739273 -0.05500163
## time_lag            -0.11305265 -0.20851710 -0.269820082 -0.20642827
## avg_gift            -0.19240949  0.02166657  0.087178847  0.05394789
##                      med_fam_inc  avg_fam_inc    pct_lt15k   num_prom
## num_child           -0.017553217 -0.018064564 -0.041529668 -0.2334154
## income               0.649915203  0.652981840 -0.665610653 -0.3172761
## wealth               0.690415307  0.693733960 -0.713586018 -0.7021145
## home_value           0.949697730  0.951871551 -0.846390704 -0.3024143
## med_fam_inc          1.000000000  0.999624897 -0.947660353 -0.3025280
## avg_fam_inc          0.999624897  1.000000000 -0.947893805 -0.3075575
## pct_lt15k           -0.947660353 -0.947893805  1.000000000  0.2582344
## num_prom            -0.302527954 -0.307557507  0.258234397  1.0000000
## lifetime_gifts      -0.364872497 -0.370028633  0.286199372  0.7658079
## largest_gift        -0.222372635 -0.225844481  0.130634961  0.2546896
## last_gift           -0.008132127 -0.009336843 -0.063620846 -0.2351458
## months_since_donate -0.038734221 -0.036885528  0.001862361 -0.5810286
## time_lag            -0.182913312 -0.176791543  0.101515793  0.1879551
## avg_gift             0.031410526  0.030528762 -0.096170282 -0.3275139
##                     lifetime_gifts largest_gift    last_gift
## num_child              -0.27064886  -0.24197375 -0.206671392
## income                 -0.35807130  -0.24710454 -0.023524325
## wealth                 -0.60652451  -0.28998162  0.007317074
## home_value             -0.35859538  -0.21609775  0.013004548
## med_fam_inc            -0.36487250  -0.22237264 -0.008132127
## avg_fam_inc            -0.37002863  -0.22584448 -0.009336843
## pct_lt15k               0.28619937   0.13063496 -0.063620846
## num_prom                0.76580786   0.25468960 -0.235145812
## lifetime_gifts          1.00000000   0.69397810  0.121824828
## largest_gift            0.69397810   1.00000000  0.514762502
## last_gift               0.12182483   0.51476250  1.000000000
## months_since_donate    -0.41825900  -0.10318214  0.318396902
## time_lag                0.03071744  -0.03687018  0.001689584
## avg_gift                0.06906320   0.52092773  0.976814005
##                     months_since_donate     time_lag    avg_gift
## num_child                  -0.068460886 -0.113052655 -0.19240949
## income                      0.031860730 -0.208517098  0.02166657
## wealth                      0.139739273 -0.269820082  0.08717885
## home_value                 -0.055001633 -0.206428268  0.05394789
## med_fam_inc                -0.038734221 -0.182913312  0.03141053
## avg_fam_inc                -0.036885528 -0.176791543  0.03052876
## pct_lt15k                   0.001862361  0.101515793 -0.09617028
## num_prom                   -0.581028646  0.187955103 -0.32751391
## lifetime_gifts             -0.418259004  0.030717436  0.06906320
## largest_gift               -0.103182141 -0.036870178  0.52092773
## last_gift                   0.318396902  0.001689584  0.97681401
## months_since_donate         1.000000000 -0.073204830  0.33461668
## time_lag                   -0.073204830  1.000000000 -0.02634570
## avg_gift                    0.334616679 -0.026345702  1.00000000

# Compute correlation matrix
cor_train <- train[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation <- cor(cor_train)
# Find highly correlated variables
highly_correlated <- findCorrelation(correlation, cutoff = 0.7)
# Get the names of the highly correlated variablesnames(cor_train)[highly_correlated]
names(cor_train)[highly_correlated]

## [1] "avg_fam_inc" "med_fam_inc" "avg_gift"

with a cutoff point of 0.7, the heavy correlated variables are:“avg_fam_inc” “med_fam_inc” “avg_gift”

Loop through each predictor variable and create a histogram with the target variable

# Create a list of predictor variables
predictors <- c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5", "homeowner", "female",
                "num_child", "income", "wealth", "home_value", "med_fam_inc", "avg_fam_inc",
                "pct_lt15k", "num_prom", "lifetime_gifts", "largest_gift", "months_since_donate",
                "time_lag", "avg_gift")

# Loop through each predictor variable and create a histogram with the target variable
for (i in predictors) {
  plot_data <- train[, c(i, "target")]
  plot_data <- plot_data[complete.cases(plot_data), ]
  plot <- ggplot(data = plot_data, aes(x = .data[[i]], y = ..count..)) +
    stat_count(binwidth = 5, fill = "blue", alpha = 0.5) +
    labs(title = paste("Histogram of", i, "vs. Target Variable"), x = i, y = "Frequency")
  print(plot)
}

2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.

#Model1 :Logistic Regression#

temp = fundraising[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(temp)
round(correlation, 5)

##                     num_child   income   wealth home_value med_fam_inc
## num_child             1.00000  0.09189  0.06018   -0.01196     0.04696
## income                0.09189  1.00000  0.20899    0.29197     0.36751
## wealth                0.06018  0.20899  1.00000    0.26116     0.37776
## home_value           -0.01196  0.29197  0.26116    1.00000     0.73815
## med_fam_inc           0.04696  0.36751  0.37776    0.73815     1.00000
## avg_fam_inc           0.04726  0.37859  0.38589    0.75257     0.97227
## pct_lt15k            -0.03172 -0.28319 -0.37515   -0.39909    -0.66536
## num_prom             -0.08643 -0.06901 -0.41212   -0.06451    -0.05078
## lifetime_gifts       -0.05095 -0.01957 -0.22547   -0.02407    -0.03525
## largest_gift         -0.01755  0.03318 -0.02528    0.05649     0.04703
## last_gift            -0.01295  0.10959  0.05259    0.15886     0.13598
## months_since_donate  -0.00556  0.07724  0.03371    0.02343     0.03234
## time_lag             -0.00607 -0.00155 -0.06642    0.00068     0.01520
## avg_gift             -0.01969  0.12406  0.09108    0.16877     0.13716
##                     avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child               0.04726  -0.03172 -0.08643       -0.05095     -0.01755
## income                  0.37859  -0.28319 -0.06901       -0.01957      0.03318
## wealth                  0.38589  -0.37515 -0.41212       -0.22547     -0.02528
## home_value              0.75257  -0.39909 -0.06451       -0.02407      0.05649
## med_fam_inc             0.97227  -0.66536 -0.05078       -0.03525      0.04703
## avg_fam_inc             1.00000  -0.68028 -0.05731       -0.04033      0.04310
## pct_lt15k              -0.68028   1.00000  0.03778        0.05962     -0.00788
## num_prom               -0.05731   0.03778  1.00000        0.53862      0.11381
## lifetime_gifts         -0.04033   0.05962  0.53862        1.00000      0.50726
## largest_gift            0.04310  -0.00788  0.11381        0.50726      1.00000
## last_gift               0.13138  -0.06175 -0.05587        0.20206      0.44724
## months_since_donate     0.03127  -0.00901 -0.28232       -0.14462      0.01979
## time_lag                0.02434  -0.01991  0.11962        0.03855      0.03998
## avg_gift                0.13176  -0.06248 -0.14725        0.18232      0.47483
##                     last_gift months_since_donate time_lag avg_gift
## num_child            -0.01295            -0.00556 -0.00607 -0.01969
## income                0.10959             0.07724 -0.00155  0.12406
## wealth                0.05259             0.03371 -0.06642  0.09108
## home_value            0.15886             0.02343  0.00068  0.16877
## med_fam_inc           0.13598             0.03234  0.01520  0.13716
## avg_fam_inc           0.13138             0.03127  0.02434  0.13176
## pct_lt15k            -0.06175            -0.00901 -0.01991 -0.06248
## num_prom             -0.05587            -0.28232  0.11962 -0.14725
## lifetime_gifts        0.20206            -0.14462  0.03855  0.18232
## largest_gift          0.44724             0.01979  0.03998  0.47483
## last_gift             1.00000             0.18672  0.07511  0.86640
## months_since_donate   0.18672             1.00000  0.01553  0.18911
## time_lag              0.07511             0.01553  1.00000  0.07008
## avg_gift              0.86640             0.18911  0.07008  1.00000

glm.fund = glm(target ~., data = train, family = 'binomial')
glm.step = step(glm.fund, scope = list(upper = glm.fund),
                direction = "both", test = "Chisq", trace = F)
summary(glm.step)

## 
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 + 
##     zipconvert5 + homeowner + num_child + income + home_value + 
##     avg_fam_inc + last_gift + months_since_donate, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7887  -1.1411  -0.7736   1.1703   1.6751  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.335e+00  3.900e-01  -5.987 2.14e-09 ***
## zipconvert2Yes      -1.266e+01  2.271e+02  -0.056  0.95553    
## zipconvert3No        1.257e+01  2.271e+02   0.055  0.95586    
## zipconvert4Yes      -1.264e+01  2.271e+02  -0.056  0.95563    
## zipconvert5Yes      -1.258e+01  2.271e+02  -0.055  0.95582    
## homeownerNo          1.505e-01  1.052e-01   1.430  0.15272    
## num_child            3.424e-01  1.271e-01   2.694  0.00706 ** 
## income              -5.308e-02  2.858e-02  -1.857  0.06330 .  
## home_value          -1.091e-04  7.675e-05  -1.422  0.15508    
## avg_fam_inc          5.741e-04  4.042e-04   1.420  0.15550    
## last_gift            1.610e-02  4.798e-03   3.356  0.00079 ***
## months_since_donate  5.852e-02  1.076e-02   5.438 5.39e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.0  on 2399  degrees of freedom
## Residual deviance: 3255.7  on 2388  degrees of freedom
## AIC: 3279.7
## 
## Number of Fisher Scoring iterations: 11

hoslem.test(glm.step$y, fitted(glm.step), g=10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.step$y, fitted(glm.step)
## X-squared = 1.7672, df = 8, p-value = 0.9873

Based on the above analysis, we fit the final model with the following predictors: num_child, last_gift, and months_since_donate

glm.fund_final = glm(target ~ num_child + last_gift + months_since_donate, data = train, family = 'binomial')

pred.prob = predict.glm(glm.fund_final, newdata = test, type = 'response')
pred = ifelse(pred.prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       98      131
##   No Donor   192      179
##                                           
##                Accuracy : 0.4617          
##                  95% CI : (0.4212, 0.5025)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.9968933       
##                                           
##                   Kappa : -0.0852         
##                                           
##  Mcnemar's Test P-Value : 0.0008424       
##                                           
##             Sensitivity : 0.3379          
##             Specificity : 0.5774          
##          Pos Pred Value : 0.4279          
##          Neg Pred Value : 0.4825          
##              Prevalence : 0.4833          
##          Detection Rate : 0.1633          
##    Detection Prevalence : 0.3817          
##       Balanced Accuracy : 0.4577          
##                                           
##        'Positive' Class : Donor           
##

*Hosmer and Lemeshow goodness of fit (GOF) test yielded a p-value of 0.9873 which is above the significance level of 0.05. therefore, the model is adequate.Logistic Regression model come out with a accuracy rate of 46.17%**

#Model2: Random Forest#

train_control = trainControl(method="repeatedcv",number=2,repeats=1)
rf.fit = train(target~.,
               data = train,
               method ='rf',
               trControl = train_control,
               importance = TRUE)
rf.fit$besttune

## NULL

varImp(rf.fit)

## rf variable importance
## 
##                     Importance
## months_since_donate    100.000
## largest_gift            84.900
## last_gift               59.995
## num_child               57.812
## avg_gift                50.522
## pct_lt15k               46.783
## income                  42.604
## home_value              40.148
## avg_fam_inc             34.732
## med_fam_inc             32.086
## homeownerNo             26.448
## zipconvert3No           26.428
## wealth                  18.079
## num_prom                16.817
## zipconvert2Yes          13.093
## femaleNo                 9.895
## time_lag                 6.264
## lifetime_gifts           4.329
## zipconvert4Yes           2.327
## zipconvert5Yes           0.000

plot(varImp(rf.fit))

We remove avg_gift as it is collinear with last_gift, and we remove med_fam_inc as it is collinear with pct_lt15k.

train_control = trainControl(method="repeatedcv",number=2,repeats=1)


rf.fit_refitted = train(target~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
               data = train,
               method ='rf',
               trControl = train_control,
               importance = TRUE)

pred.rf_refitted = predict(rf.fit_refitted,test)
confusionMatrix(pred.rf_refitted,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      160      144
##   No Donor   130      166
##                                           
##                Accuracy : 0.5433          
##                  95% CI : (0.5025, 0.5837)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.1026          
##                                           
##                   Kappa : 0.0871          
##                                           
##  Mcnemar's Test P-Value : 0.4322          
##                                           
##             Sensitivity : 0.5517          
##             Specificity : 0.5355          
##          Pos Pred Value : 0.5263          
##          Neg Pred Value : 0.5608          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.5067          
##       Balanced Accuracy : 0.5436          
##                                           
##        'Positive' Class : Donor           
##

Model3 Support Vector Machine

# Fit a support vector machine model
svm_model <- svm(target ~ ., data = train)

# Make predictions on the test set
svm_pred <- predict(svm_model, newdata = test)

# Evaluate the model
table(svm_pred, test$target)

##           
## svm_pred   Donor No Donor
##   Donor      186      161
##   No Donor   104      149

# Create a data partition for cross-validation
folds <- createFolds(train$target, k = 5)

# Define the SVM model
model <- svm(target ~ ., data = train)

# Train the model using 5-fold cross-validation
ctrl <- trainControl(method = "cv", index = folds)
fit <- train(target ~ ., data = train, method = "svmRadial", trControl = ctrl, tuneLength = 10)

# Predict the class labels of the test set
predictions <- predict(fit, newdata = test)

# Create a confusion matrix
cm <- confusionMatrix(predictions, test$target)
print(cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      182      169
##   No Donor   108      141
##                                           
##                Accuracy : 0.5383          
##                  95% CI : (0.4975, 0.5788)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.1535766       
##                                           
##                   Kappa : 0.0819          
##                                           
##  Mcnemar's Test P-Value : 0.0003121       
##                                           
##             Sensitivity : 0.6276          
##             Specificity : 0.4548          
##          Pos Pred Value : 0.5185          
##          Neg Pred Value : 0.5663          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3033          
##    Detection Prevalence : 0.5850          
##       Balanced Accuracy : 0.5412          
##                                           
##        'Positive' Class : Donor           
##

3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset? A weighted sample is utilized in producing a training set for the model that contains equal numbers of donors and non-donors to adjust for potential imbalance in the data. If the response is not balanced, the model may be biased towards the class that is dominant which can cause poor test performance. A simple random sample is not enough to compensate for this imbalance; rather, it will preserve the imbalance.

4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?

models = c('Logistic Regression',  'Random Forest','Support Vector Machine')
acc= c(46.17, 53.83, 52.67 )

acc.summary= as.data.frame(acc, row.names = models)
acc.summary

##                          acc
## Logistic Regression    46.17
## Random Forest          53.83
## Support Vector Machine 52.67

barplot(acc,names.arg = models, ylab="Accuracy Score", col="pink",
main="Model Results", border="orange")

The Random Forest model appears to be the model that dominates.

5. Select best model. From your answer in (4), what do you think is the “best” model? Model Selected: Random Forest is the selected model due to its slightly higher accuracy.

6. Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.

train_control = trainControl(method="repeatedcv",number=2,repeats=1)


rf.fit_final = train(target~months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
               data = fundraising,
               method ='rf',
               trControl = train_control,
               importance = TRUE)

pred.rf_final = predict(rf.fit_final,future_fundraising)

pred.rf_final

##   [1] Donor    No Donor Donor    Donor    Donor    No Donor No Donor No Donor
##   [9] Donor    No Donor No Donor Donor    No Donor Donor    No Donor No Donor
##  [17] Donor    Donor    Donor    No Donor Donor    Donor    No Donor No Donor
##  [25] Donor    No Donor Donor    No Donor No Donor Donor    Donor    Donor   
##  [33] No Donor No Donor Donor    No Donor No Donor Donor    Donor    Donor   
##  [41] No Donor No Donor Donor    No Donor No Donor Donor    Donor    No Donor
##  [49] No Donor Donor    No Donor Donor    Donor    Donor    No Donor Donor   
##  [57] No Donor No Donor Donor    Donor    Donor    No Donor No Donor No Donor
##  [65] No Donor Donor    No Donor No Donor No Donor Donor    Donor    No Donor
##  [73] No Donor Donor    Donor    Donor    No Donor No Donor No Donor Donor   
##  [81] No Donor Donor    No Donor Donor    No Donor Donor    Donor    No Donor
##  [89] Donor    Donor    Donor    Donor    Donor    No Donor No Donor Donor   
##  [97] Donor    No Donor Donor    No Donor No Donor Donor    Donor    No Donor
## [105] No Donor No Donor Donor    Donor    Donor    Donor    Donor    Donor   
## [113] No Donor No Donor Donor    Donor    No Donor No Donor No Donor Donor   
## Levels: Donor No Donor

summary(pred.rf_final)

##    Donor No Donor 
##       62       58

7. Submission File. For each row in the test set, you must predict whether or not the candidate is a donor or not. The .csv file should contain a header and have the following format:

write.table(pred.rf_final, file = "predictions_randomforest_final.csv", col.names = c("value"), row.names = FALSE)

#"pred.rf_final" is your predicted values for the test set.
# Create a data frame with the predicted values
submission <- data.frame(value = pred.rf_final)

# Write the data frame to a CSV file with header
write.csv(submission, file = "submission.csv", row.names = FALSE)

Modeling Competition

Yuan Torres

2023-05-08