Background
A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
Business goals and objectives
Objective: Develop a classification model that helps maximize expected net profit by predicting who is more likely to donate in direct-mail fundraising campaigns.
Target: Use data analytics to help the national veterans’ organization make its direct marketing campaigns more cost-effective.
Data sources and data used
The fundraising file contains 3,000 records, of which about 50 percent are donors and 50 percent are non-donors.
Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345)
library(ISLR)
library(tidyverse)
library(MASS)
library(ResourceSelection)
library(caret)
library(e1071)
library(kernlab)
fundraising <- read_rds("C:/Users/yuan1/Downloads/fundraising.rds")
future_fundraising <- read_rds("C:/Users/yuan1/Downloads/future_fundraising.rds")
set.seed(12345)
train_index = sample(1:nrow(fundraising), round(nrow(fundraising) * 0.80))
train = fundraising[train_index, ]
test = fundraising[-train_index, ]
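The simple random split above does not guarantee the same donor share in both partitions. As a minimal sketch, assuming a synthetic stand-in data frame (`toy` and its column `x` are hypothetical, not the fundraising file), a stratified 80/20 split can be done in base R by sampling within each class:

```r
# Synthetic stand-in for the fundraising data (hypothetical; 50/50 classes)
set.seed(12345)
toy <- data.frame(
  target = factor(rep(c("Donor", "No Donor"), each = 50)),
  x      = rnorm(100)
)

# Sample 80% of the row indices within each class, then combine them
rows_by_class <- split(seq_len(nrow(toy)), toy$target)
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(length(idx) * 0.80))
}), use.names = FALSE)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Both partitions keep the 50/50 class balance
table(toy_train$target)
table(toy_test$target)
```

caret's createDataPartition(y, p = 0.8, list = FALSE) performs the same kind of class-stratified sampling and could replace the manual version.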
Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.
1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
summary(train)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :1897 Yes: 450 No :1887 No :1468 Yes:1848 Min. :1.00
## Yes: 503 No :1950 Yes: 513 Yes: 932 No : 552 1st Qu.:1.00
## Median :1.00
## Mean :1.07
## 3rd Qu.:1.00
## Max. :4.00
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1471 Min. :0.000 Min. : 0 Min. : 0.0
## 1st Qu.:3.000 No : 929 1st Qu.:5.000 1st Qu.: 554 1st Qu.: 277.8
## Median :4.000 Median :8.000 Median : 811 Median : 355.0
## Mean :3.912 Mean :6.349 Mean :1146 Mean : 387.6
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1358 3rd Qu.: 465.2
## Max. :7.000 Max. :9.000 Max. :5945 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318 1st Qu.: 6.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 397 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432 Mean :14.71 Mean : 49.13 Mean : 110.9
## 3rd Qu.: 519 3rd Qu.:21.00 3rd Qu.: 64.00 3rd Qu.: 134.6
## Max. :1331 Max. :90.00 Max. :144.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.76 Mean : 13.48 Mean :31.16 Mean : 6.915
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :125.00 Max. :37.00 Max. :62.000
## avg_gift target
## Min. : 2.139 Donor :1209
## 1st Qu.: 6.400 No Donor:1191
## Median : 9.160
## Mean : 10.700
## 3rd Qu.: 12.918
## Max. :100.000
str(train)
## tibble [2,400 × 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 2 2 2 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 2 2 1 1 1 1 1 2 ...
## $ num_child : num [1:2400] 2 1 1 1 1 1 1 1 2 1 ...
## $ income : num [1:2400] 4 4 2 1 3 3 4 7 2 3 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 1 2 ...
## $ wealth : num [1:2400] 3 3 4 4 8 8 7 8 8 4 ...
## $ home_value : num [1:2400] 541 1229 444 442 2702 ...
## $ med_fam_inc : num [1:2400] 335 359 196 315 637 273 437 463 374 295 ...
## $ avg_fam_inc : num [1:2400] 367 490 263 343 695 331 454 597 434 319 ...
## $ pct_lt15k : num [1:2400] 13 10 38 24 2 21 9 13 3 19 ...
## $ num_prom : num [1:2400] 63 39 36 52 16 54 71 30 22 82 ...
## $ lifetime_gifts : num [1:2400] 91 35 178 134 20 110 118 57 29 242 ...
## $ largest_gift : num [1:2400] 10 15 20 20 20 9 10 13 15 12 ...
## $ last_gift : num [1:2400] 10 15 20 20 20 5 10 11 15 9 ...
## $ months_since_donate: num [1:2400] 37 34 37 30 37 33 30 35 30 32 ...
## $ time_lag : num [1:2400] 4 13 0 10 5 4 3 2 6 7 ...
## $ avg_gift : num [1:2400] 6.5 11.67 9.89 16.75 20 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 2 2 2 1 2 2 1 1 1 1 ...
cor_train = train[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(cor_train)
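# cor(correlation) correlates the columns of the correlation matrix with each other;
# print `correlation` itself to see the pairwise correlations of the predictors
cor(correlation)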
## num_child income wealth home_value
## num_child 1.00000000 0.05324112 0.121260748 -0.08781965
## income 0.05324112 1.00000000 0.527787162 0.58269772
## wealth 0.12126075 0.52778716 1.000000000 0.61785407
## home_value -0.08781965 0.58269772 0.617854066 1.00000000
## med_fam_inc -0.01755322 0.64991520 0.690415307 0.94969773
## avg_fam_inc -0.01806456 0.65298184 0.693733960 0.95187155
## pct_lt15k -0.04152967 -0.66561065 -0.713586018 -0.84639070
## num_prom -0.23341537 -0.31727614 -0.702114485 -0.30241433
## lifetime_gifts -0.27064886 -0.35807130 -0.606524506 -0.35859538
## largest_gift -0.24197375 -0.24710454 -0.289981618 -0.21609775
## last_gift -0.20667139 -0.02352432 0.007317074 0.01300455
## months_since_donate -0.06846089 0.03186073 0.139739273 -0.05500163
## time_lag -0.11305265 -0.20851710 -0.269820082 -0.20642827
## avg_gift -0.19240949 0.02166657 0.087178847 0.05394789
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child -0.017553217 -0.018064564 -0.041529668 -0.2334154
## income 0.649915203 0.652981840 -0.665610653 -0.3172761
## wealth 0.690415307 0.693733960 -0.713586018 -0.7021145
## home_value 0.949697730 0.951871551 -0.846390704 -0.3024143
## med_fam_inc 1.000000000 0.999624897 -0.947660353 -0.3025280
## avg_fam_inc 0.999624897 1.000000000 -0.947893805 -0.3075575
## pct_lt15k -0.947660353 -0.947893805 1.000000000 0.2582344
## num_prom -0.302527954 -0.307557507 0.258234397 1.0000000
## lifetime_gifts -0.364872497 -0.370028633 0.286199372 0.7658079
## largest_gift -0.222372635 -0.225844481 0.130634961 0.2546896
## last_gift -0.008132127 -0.009336843 -0.063620846 -0.2351458
## months_since_donate -0.038734221 -0.036885528 0.001862361 -0.5810286
## time_lag -0.182913312 -0.176791543 0.101515793 0.1879551
## avg_gift 0.031410526 0.030528762 -0.096170282 -0.3275139
## lifetime_gifts largest_gift last_gift
## num_child -0.27064886 -0.24197375 -0.206671392
## income -0.35807130 -0.24710454 -0.023524325
## wealth -0.60652451 -0.28998162 0.007317074
## home_value -0.35859538 -0.21609775 0.013004548
## med_fam_inc -0.36487250 -0.22237264 -0.008132127
## avg_fam_inc -0.37002863 -0.22584448 -0.009336843
## pct_lt15k 0.28619937 0.13063496 -0.063620846
## num_prom 0.76580786 0.25468960 -0.235145812
## lifetime_gifts 1.00000000 0.69397810 0.121824828
## largest_gift 0.69397810 1.00000000 0.514762502
## last_gift 0.12182483 0.51476250 1.000000000
## months_since_donate -0.41825900 -0.10318214 0.318396902
## time_lag 0.03071744 -0.03687018 0.001689584
## avg_gift 0.06906320 0.52092773 0.976814005
## months_since_donate time_lag avg_gift
## num_child -0.068460886 -0.113052655 -0.19240949
## income 0.031860730 -0.208517098 0.02166657
## wealth 0.139739273 -0.269820082 0.08717885
## home_value -0.055001633 -0.206428268 0.05394789
## med_fam_inc -0.038734221 -0.182913312 0.03141053
## avg_fam_inc -0.036885528 -0.176791543 0.03052876
## pct_lt15k 0.001862361 0.101515793 -0.09617028
## num_prom -0.581028646 0.187955103 -0.32751391
## lifetime_gifts -0.418259004 0.030717436 0.06906320
## largest_gift -0.103182141 -0.036870178 0.52092773
## last_gift 0.318396902 0.001689584 0.97681401
## months_since_donate 1.000000000 -0.073204830 0.33461668
## time_lag -0.073204830 1.000000000 -0.02634570
## avg_gift 0.334616679 -0.026345702 1.00000000
# Compute correlation matrix
cor_train <- train[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation <- cor(cor_train)
# Find highly correlated variables
highly_correlated <- findCorrelation(correlation, cutoff = 0.7)
# Get the names of the highly correlated variables
names(cor_train)[highly_correlated]
## [1] "avg_fam_inc" "med_fam_inc" "avg_gift"
With a cutoff of 0.7, findCorrelation() flags three highly correlated variables: "avg_fam_inc", "med_fam_inc", and "avg_gift".
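findCorrelation() names the variables to drop but not the pairs that triggered the flag. The sketch below, assuming a small synthetic data frame (`toy` with columns `a`, `b`, `c` is hypothetical), lists each pair whose absolute correlation exceeds the cutoff:

```r
# Synthetic numeric predictors (hypothetical): b is built to track a closely
set.seed(12345)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))

cm <- cor(toy)

# Keep each pair once (upper triangle) and report those with |r| above the cutoff
cutoff <- 0.7
hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r    = cm[hits]
)
high_pairs
```

Running the same steps on the `correlation` matrix from the training data would show which predictors each flagged variable overlaps with.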
Loop through each predictor variable and create a histogram with the target variable
# Create a list of predictor variables
predictors <- c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5", "homeowner", "female",
"num_child", "income", "wealth", "home_value", "med_fam_inc", "avg_fam_inc",
"pct_lt15k", "num_prom", "lifetime_gifts", "largest_gift", "months_since_donate",
"time_lag", "avg_gift")
# Loop through each predictor and plot its distribution, split by the target variable
for (i in predictors) {
  plot_data <- train[, c(i, "target")]
  plot_data <- plot_data[complete.cases(plot_data), ]
  # Binned histogram for numeric predictors, bars for factor predictors
  layer <- if (is.numeric(plot_data[[i]])) {
    geom_histogram(bins = 30, alpha = 0.5, position = "identity")
  } else {
    geom_bar(alpha = 0.5, position = "dodge")
  }
  plot <- ggplot(data = plot_data, aes(x = .data[[i]], fill = target)) + layer +
    labs(title = paste("Histogram of", i, "by target"), x = i, y = "Frequency")
  print(plot)
}
2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
#Model1: Logistic Regression#
temp = fundraising[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(temp)
round(correlation, 5)
## num_child income wealth home_value med_fam_inc
## num_child 1.00000 0.09189 0.06018 -0.01196 0.04696
## income 0.09189 1.00000 0.20899 0.29197 0.36751
## wealth 0.06018 0.20899 1.00000 0.26116 0.37776
## home_value -0.01196 0.29197 0.26116 1.00000 0.73815
## med_fam_inc 0.04696 0.36751 0.37776 0.73815 1.00000
## avg_fam_inc 0.04726 0.37859 0.38589 0.75257 0.97227
## pct_lt15k -0.03172 -0.28319 -0.37515 -0.39909 -0.66536
## num_prom -0.08643 -0.06901 -0.41212 -0.06451 -0.05078
## lifetime_gifts -0.05095 -0.01957 -0.22547 -0.02407 -0.03525
## largest_gift -0.01755 0.03318 -0.02528 0.05649 0.04703
## last_gift -0.01295 0.10959 0.05259 0.15886 0.13598
## months_since_donate -0.00556 0.07724 0.03371 0.02343 0.03234
## time_lag -0.00607 -0.00155 -0.06642 0.00068 0.01520
## avg_gift -0.01969 0.12406 0.09108 0.16877 0.13716
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child 0.04726 -0.03172 -0.08643 -0.05095 -0.01755
## income 0.37859 -0.28319 -0.06901 -0.01957 0.03318
## wealth 0.38589 -0.37515 -0.41212 -0.22547 -0.02528
## home_value 0.75257 -0.39909 -0.06451 -0.02407 0.05649
## med_fam_inc 0.97227 -0.66536 -0.05078 -0.03525 0.04703
## avg_fam_inc 1.00000 -0.68028 -0.05731 -0.04033 0.04310
## pct_lt15k -0.68028 1.00000 0.03778 0.05962 -0.00788
## num_prom -0.05731 0.03778 1.00000 0.53862 0.11381
## lifetime_gifts -0.04033 0.05962 0.53862 1.00000 0.50726
## largest_gift 0.04310 -0.00788 0.11381 0.50726 1.00000
## last_gift 0.13138 -0.06175 -0.05587 0.20206 0.44724
## months_since_donate 0.03127 -0.00901 -0.28232 -0.14462 0.01979
## time_lag 0.02434 -0.01991 0.11962 0.03855 0.03998
## avg_gift 0.13176 -0.06248 -0.14725 0.18232 0.47483
## last_gift months_since_donate time_lag avg_gift
## num_child -0.01295 -0.00556 -0.00607 -0.01969
## income 0.10959 0.07724 -0.00155 0.12406
## wealth 0.05259 0.03371 -0.06642 0.09108
## home_value 0.15886 0.02343 0.00068 0.16877
## med_fam_inc 0.13598 0.03234 0.01520 0.13716
## avg_fam_inc 0.13138 0.03127 0.02434 0.13176
## pct_lt15k -0.06175 -0.00901 -0.01991 -0.06248
## num_prom -0.05587 -0.28232 0.11962 -0.14725
## lifetime_gifts 0.20206 -0.14462 0.03855 0.18232
## largest_gift 0.44724 0.01979 0.03998 0.47483
## last_gift 1.00000 0.18672 0.07511 0.86640
## months_since_donate 0.18672 1.00000 0.01553 0.18911
## time_lag 0.07511 0.01553 1.00000 0.07008
## avg_gift 0.86640 0.18911 0.07008 1.00000
glm.fund = glm(target ~., data = train, family = 'binomial')
glm.step = step(glm.fund, scope = list(upper = glm.fund),
direction = "both", test = "Chisq", trace = F)
summary(glm.step)
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + homeowner + num_child + income + home_value +
## avg_fam_inc + last_gift + months_since_donate, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7887 -1.1411 -0.7736 1.1703 1.6751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.335e+00 3.900e-01 -5.987 2.14e-09 ***
## zipconvert2Yes -1.266e+01 2.271e+02 -0.056 0.95553
## zipconvert3No 1.257e+01 2.271e+02 0.055 0.95586
## zipconvert4Yes -1.264e+01 2.271e+02 -0.056 0.95563
## zipconvert5Yes -1.258e+01 2.271e+02 -0.055 0.95582
## homeownerNo 1.505e-01 1.052e-01 1.430 0.15272
## num_child 3.424e-01 1.271e-01 2.694 0.00706 **
## income -5.308e-02 2.858e-02 -1.857 0.06330 .
## home_value -1.091e-04 7.675e-05 -1.422 0.15508
## avg_fam_inc 5.741e-04 4.042e-04 1.420 0.15550
## last_gift 1.610e-02 4.798e-03 3.356 0.00079 ***
## months_since_donate 5.852e-02 1.076e-02 5.438 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3255.7 on 2388 degrees of freedom
## AIC: 3279.7
##
## Number of Fisher Scoring iterations: 11
hoslem.test(glm.step$y, fitted(glm.step), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.step$y, fitted(glm.step)
## X-squared = 1.7672, df = 8, p-value = 0.9873
Based on the above analysis, we fit the final model with only the predictors that are significant at the 0.05 level in the stepwise model: num_child, last_gift, and months_since_donate.
glm.fund_final = glm(target ~ num_child + last_gift + months_since_donate, data = train, family = 'binomial')
pred.prob = predict.glm(glm.fund_final, newdata = test, type = 'response')
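# glm() models the probability of the second factor level ("No Donor"), so pred.prob is
# P(No Donor) and the mapping below assigns the two labels in reverse
pred = ifelse(pred.prob > .5, 'Donor', 'No Donor')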
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 98 131
## No Donor 192 179
##
## Accuracy : 0.4617
## 95% CI : (0.4212, 0.5025)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.9968933
##
## Kappa : -0.0852
##
## Mcnemar's Test P-Value : 0.0008424
##
## Sensitivity : 0.3379
## Specificity : 0.5774
## Pos Pred Value : 0.4279
## Neg Pred Value : 0.4825
## Prevalence : 0.4833
## Detection Rate : 0.1633
## Detection Prevalence : 0.3817
## Balanced Accuracy : 0.4577
##
## 'Positive' Class : Donor
##
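The Hosmer and Lemeshow goodness-of-fit test, run on the stepwise model, yields a p-value of 0.9873, which is above the 0.05 significance level, so there is no evidence of lack of fit. The logistic regression model has a test accuracy of 46.17%, below the no-information rate of 51.67%. Because glm() with a factor response models the probability of the second level ("No Donor"), mapping pred.prob > .5 to 'Donor' inverts the labels; with the mapping reversed, the same predictions would score about 53.83%.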
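Which class a 0.5 cutoff refers to depends on how glm() codes a factor response: the first level is treated as failure, and the fitted probability is for the remaining level. The following is a minimal sketch on synthetic data (`toy`, `x`, and `is_no_donor` are hypothetical) that maps probabilities to labels using the factor's own level order:

```r
# Synthetic data (hypothetical): larger x makes "No Donor" more likely
set.seed(12345)
x <- rnorm(500)
is_no_donor <- rbinom(500, 1, plogis(2 * x))
toy <- data.frame(
  x = x,
  target = factor(ifelse(is_no_donor == 1, "No Donor", "Donor"),
                  levels = c("Donor", "No Donor"))
)

fit <- glm(target ~ x, data = toy, family = "binomial")
p_second <- predict(fit, type = "response")  # P(target == "No Donor")

# Map probabilities to labels using the factor's level order
lv <- levels(toy$target)
pred <- factor(ifelse(p_second > 0.5, lv[2], lv[1]), levels = lv)

mean(pred == toy$target)  # accuracy with the level-aware mapping
coef(fit)["x"]            # positive: x raises the odds of "No Donor"
```

Indexing `levels()` rather than typing the labels by hand keeps the mapping correct even if the factor levels are reordered.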
#Model2: Random Forest#
train_control = trainControl(method="repeatedcv",number=2,repeats=1)
rf.fit = train(target~.,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
rf.fit$bestTune # the list element is named bestTune; the misspelled besttune returns NULL
varImp(rf.fit)
## rf variable importance
##
## Importance
## months_since_donate 100.000
## largest_gift 84.900
## last_gift 59.995
## num_child 57.812
## avg_gift 50.522
## pct_lt15k 46.783
## income 42.604
## home_value 40.148
## avg_fam_inc 34.732
## med_fam_inc 32.086
## homeownerNo 26.448
## zipconvert3No 26.428
## wealth 18.079
## num_prom 16.817
## zipconvert2Yes 13.093
## femaleNo 9.895
## time_lag 6.264
## lifetime_gifts 4.329
## zipconvert4Yes 2.327
## zipconvert5Yes 0.000
plot(varImp(rf.fit))
Before refitting, we remove avg_gift because it is collinear with last_gift (r = 0.87) and med_fam_inc because it is correlated with pct_lt15k (r = -0.67).
train_control = trainControl(method="repeatedcv",number=2,repeats=1)
rf.fit_refitted = train(target~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_refitted = predict(rf.fit_refitted,test)
confusionMatrix(pred.rf_refitted,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 160 144
## No Donor 130 166
##
## Accuracy : 0.5433
## 95% CI : (0.5025, 0.5837)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.1026
##
## Kappa : 0.0871
##
## Mcnemar's Test P-Value : 0.4322
##
## Sensitivity : 0.5517
## Specificity : 0.5355
## Pos Pred Value : 0.5263
## Neg Pred Value : 0.5608
## Prevalence : 0.4833
## Detection Rate : 0.2667
## Detection Prevalence : 0.5067
## Balanced Accuracy : 0.5436
##
## 'Positive' Class : Donor
##
#Model3: Support Vector Machine#
# Fit a support vector machine model
svm_model <- svm(target ~ ., data = train)
# Make predictions on the test set
svm_pred <- predict(svm_model, newdata = test)
# Evaluate the model
table(svm_pred, test$target)
##
## svm_pred Donor No Donor
## Donor 186 161
## No Donor 104 149
# Create 5 folds for cross-validation
# createFolds() returns held-out rows by default, while trainControl(index = ) expects
# training rows, so each resample below fits on one fold; returnTrain = TRUE fits on the other four
folds <- createFolds(train$target, k = 5)
# Train a radial-kernel SVM with caret, trying 10 candidate tuning values
ctrl <- trainControl(method = "cv", index = folds)
fit <- train(target ~ ., data = train, method = "svmRadial", trControl = ctrl, tuneLength = 10)
# Predict the class labels of the test set
predictions <- predict(fit, newdata = test)
# Create a confusion matrix
cm <- confusionMatrix(predictions, test$target)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 182 169
## No Donor 108 141
##
## Accuracy : 0.5383
## 95% CI : (0.4975, 0.5788)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.1535766
##
## Kappa : 0.0819
##
## Mcnemar's Test P-Value : 0.0003121
##
## Sensitivity : 0.6276
## Specificity : 0.4548
## Pos Pred Value : 0.5185
## Neg Pred Value : 0.5663
## Prevalence : 0.4833
## Detection Rate : 0.3033
## Detection Prevalence : 0.5850
## Balanced Accuracy : 0.5412
##
## 'Positive' Class : Donor
##
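In k-fold cross-validation each resample fits on k - 1 folds and evaluates on the remaining one, and caret's trainControl(index = ) expects the rows used for fitting. The base R sketch below, using 20 hypothetical row indices, shows the two index lists side by side:

```r
# Split 20 hypothetical row indices into 5 folds of held-out rows
set.seed(12345)
n <- 20
k <- 5
fold_id  <- sample(rep(seq_len(k), length.out = n))
held_out <- split(seq_len(n), fold_id)

# The training rows for each resample are the complement of the held-out fold
train_rows <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

lengths(held_out)    # 4 rows held out per fold
lengths(train_rows)  # 16 rows used for fitting per fold
```

A list shaped like `train_rows` is what the `index` argument is meant to receive; createFolds(..., returnTrain = TRUE) returns this form directly.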
3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?
Weighted sampling produces a training set with equal numbers of donors and non-donors to correct for the strong class imbalance in the original data. With a response rate of only 5.1%, a simple random sample of 3,000 records would contain roughly 150 donors, and a model could reach about 95% accuracy by predicting "No Donor" for everyone while identifying no donors at all. A simple random sample preserves this imbalance rather than compensating for it. Under-representing the non-responders gives the model enough donor records to learn from; the trade-off is that predicted probabilities reflect the 50/50 sample and must be adjusted back to the 5.1% base rate before computing expected net profit.
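The cost asymmetry can be made concrete with the figures from the background section. The sketch below computes the expected net profit of mailing everyone, the break-even response probability, and a standard prior correction that rescales a probability from the balanced sample to the true base rate; `adjust_prob` is a hypothetical helper, and the correction assumes sampling depended only on the class:

```r
# Figures from the background section
response_rate <- 0.051   # overall response rate
avg_donation  <- 13.00   # average donation among responders
mail_cost     <- 0.68    # cost per mailing

# Expected net profit per piece when mailing everyone
profit_mail_all <- response_rate * avg_donation - mail_cost

# Response probability at which a mailing breaks even
break_even <- mail_cost / avg_donation

# Rescale a probability estimated on the 50/50 sample to the true 5.1% base rate
adjust_prob <- function(p, sample_rate = 0.5, true_rate = response_rate) {
  donor    <- p * true_rate / sample_rate
  nondonor <- (1 - p) * (1 - true_rate) / (1 - sample_rate)
  donor / (donor + nondonor)
}

profit_mail_all      # -0.017: mailing everyone loses money on average
break_even           # about 0.0523
adjust_prob(0.5)     # 0.051: a coin-flip score maps back to the base rate
```

Under these assumptions a mailing is worthwhile only for candidates whose adjusted response probability exceeds the break-even value of about 5.2%.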
4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
models = c('Logistic Regression', 'Random Forest','Support Vector Machine')
# Test-set accuracy (%) taken from the confusion matrices above
acc= c(46.17, 54.33, 53.83 )
acc.summary= as.data.frame(acc, row.names = models)
acc.summary
## acc
## Logistic Regression 46.17
## Random Forest 54.33
## Support Vector Machine 53.83
barplot(acc,names.arg = models, ylab="Accuracy Score", col="pink",
main="Model Results", border="orange")
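The Random Forest model has the highest test accuracy, but its margin over the Support Vector Machine is small and the 95% confidence intervals reported above overlap, so no model clearly dominates.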
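The accuracies can be recomputed directly from the test-set confusion matrices printed earlier rather than typed by hand. The sketch below copies the three matrices (the caret-tuned one for the SVM) and adds an exact binomial interval for each; `cms` and `summarise_cm` are hypothetical names:

```r
# Test-set confusion matrices copied from the output above
# (rows = prediction, columns = reference; order: Donor, No Donor)
cms <- list(
  "Logistic Regression"    = matrix(c(98, 192, 131, 179), nrow = 2),
  "Random Forest"          = matrix(c(160, 130, 144, 166), nrow = 2),
  "Support Vector Machine" = matrix(c(182, 108, 169, 141), nrow = 2)
)

# Accuracy is the diagonal share; binom.test gives an exact 95% interval
summarise_cm <- function(m) {
  ci <- binom.test(sum(diag(m)), sum(m))$conf.int
  c(accuracy = sum(diag(m)) / sum(m), lower = ci[1], upper = ci[2])
}

acc_table <- t(sapply(cms, summarise_cm))
round(acc_table, 4)
```

Deriving the summary table from the matrices keeps the comparison consistent with the reported output and shows how much the intervals overlap.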
5. Select best model. From your answer in (4), what do you think is the “best” model? Model Selected: Random Forest is the selected model due to its slightly higher accuracy.
6. Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
train_control = trainControl(method="repeatedcv",number=2,repeats=1)
rf.fit_final = train(target~months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = fundraising,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_final = predict(rf.fit_final,future_fundraising)
pred.rf_final
## [1] Donor No Donor Donor Donor Donor No Donor No Donor No Donor
## [9] Donor No Donor No Donor Donor No Donor Donor No Donor No Donor
## [17] Donor Donor Donor No Donor Donor Donor No Donor No Donor
## [25] Donor No Donor Donor No Donor No Donor Donor Donor Donor
## [33] No Donor No Donor Donor No Donor No Donor Donor Donor Donor
## [41] No Donor No Donor Donor No Donor No Donor Donor Donor No Donor
## [49] No Donor Donor No Donor Donor Donor Donor No Donor Donor
## [57] No Donor No Donor Donor Donor Donor No Donor No Donor No Donor
## [65] No Donor Donor No Donor No Donor No Donor Donor Donor No Donor
## [73] No Donor Donor Donor Donor No Donor No Donor No Donor Donor
## [81] No Donor Donor No Donor Donor No Donor Donor Donor No Donor
## [89] Donor Donor Donor Donor Donor No Donor No Donor Donor
## [97] Donor No Donor Donor No Donor No Donor Donor Donor No Donor
## [105] No Donor No Donor Donor Donor Donor Donor Donor Donor
## [113] No Donor No Donor Donor Donor No Donor No Donor No Donor Donor
## Levels: Donor No Donor
summary(pred.rf_final)
## Donor No Donor
## 62 58
7. Submission File. For each row in the test set, you must predict whether or not the candidate is a donor or not. The .csv file should contain a header and have the following format:
write.table(pred.rf_final, file = "predictions_randomforest_final.csv", col.names = c("value"), row.names = FALSE)
#"pred.rf_final" is your predicted values for the test set.
# Create a data frame with the predicted values
submission <- data.frame(value = pred.rf_final)
# Write the data frame to a CSV file with header
write.csv(submission, file = "submission.csv", row.names = FALSE)
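Before uploading, it can help to read the file back and confirm its layout. The sketch below writes a few hypothetical predictions (`preds` stands in for pred.rf_final) to a temporary file in the same format and checks the header, row count, and labels:

```r
# Hypothetical predictions standing in for pred.rf_final
preds <- factor(c("Donor", "No Donor", "Donor"), levels = c("Donor", "No Donor"))

# Write to a temporary file in the same layout as the submission
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = preds), file = path, row.names = FALSE)

# Read it back and check header, row count, and labels
check <- read.csv(path, stringsAsFactors = FALSE)
stopifnot(
  identical(names(check), "value"),
  nrow(check) == length(preds),
  all(check$value %in% c("Donor", "No Donor"))
)
readLines(path)
```

The same checks can be pointed at "submission.csv" once the real predictions have been written.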