A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with an in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%, and among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that identifies likely donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-respondents so that the sample has equal numbers of donors and non-donors.
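These figures imply that mailing everyone is unprofitable on average, which is why a targeting model is needed. A quick sanity check in R, using only the numbers quoted above:
response_rate <- 0.051 # overall response rate from the mailing records
avg_donation <- 13.00 # average gift among respondents, in USD
mail_cost <- 0.68 # cost to produce and send one mailing, in USD
response_rate * avg_donation - mail_cost # expected net profit per mailing: about -0.017 USD
mail_cost / avg_donation # break-even response rate: about 5.23%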
The dataset for this project was provided by the American Legion. One set containing 3,000 observations was provided for training the models, and another set containing 120 observations was provided for predicting future fundraising with the best model. There are 20 predictor variables, and the target variable is target, which indicates whether the customer will donate or not. The target variable is absent from the future dataset.
#load datasets
fundraising = readRDS("fundraising.rds")
fundraising_future = readRDS("future_fundraising.rds")
str(fundraising) # check the type of the data
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
#univariate statistics to summarize each variables
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
summary(fundraising_future)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :99 Yes:27 No :94 No :74 Yes:91 Min. :1.000
## Yes:21 No :93 Yes:26 Yes:46 No :29 1st Qu.:1.000
## Median :1.000
## Mean :1.067
## 3rd Qu.:1.000
## Max. :4.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:70 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:2.750 No :50 1st Qu.:5.750 1st Qu.: 579.5 1st Qu.:276.2
## Median :4.000 Median :8.000 Median : 897.5 Median :359.0
## Mean :3.767 Mean :6.567 Mean :1093.7 Mean :384.7
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1319.5 3rd Qu.:474.8
## Max. :7.000 Max. :9.000 Max. :5413.0 Max. :823.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 15.00 Min. : 15.0
## 1st Qu.:311.8 1st Qu.: 5.00 1st Qu.: 29.75 1st Qu.: 45.0
## Median :405.5 Median :12.50 Median : 47.00 Median : 82.0
## Mean :426.5 Mean :14.52 Mean : 47.85 Mean :101.9
## 3rd Qu.:522.2 3rd Qu.:21.25 3rd Qu.: 64.00 3rd Qu.:126.5
## Max. :813.0 Max. :63.00 Max. :103.00 Max. :431.0
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 1.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.750
## Median : 15.00 Median : 12.00 Median :31.00 Median : 5.000
## Mean : 16.79 Mean : 14.52 Mean :31.24 Mean : 6.508
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:35.00 3rd Qu.: 9.000
## Max. :100.00 Max. :100.00 Max. :37.00 Max. :26.000
## avg_gift
## Min. : 2.963
## 1st Qu.: 6.819
## Median : 9.331
## Mean :11.235
## 3rd Qu.:13.354
## Max. :42.500
The values seem to be within a reasonable range, and I don’t notice any data inconsistency issues in the dataset.
#Check for missing values
sum(is.na(fundraising)) # check for missing values
## [1] 0
sum(duplicated(fundraising)) # check for duplicated observations
## [1] 0
sum(is.na(fundraising_future)) # check for missing values
## [1] 0
sum(duplicated(fundraising_future)) # check for duplicated observations
## [1] 0
There are no missing values and no duplicated rows in either dataset.
An exploratory data analysis involving univariate statistics, a correlation matrix, histograms, and boxplots will be performed to understand the dataset, examine the relationship between the predictors and the response variable target, and check for correlation and collinearity between predictors. This, in combination with the variable importance feature of the randomForest package, will be used to select the predictors that are useful for predicting whether a customer will donate. Variables that are found to be collinear will be excluded, as collinearity is problematic for some models; in a logistic regression model, for example, highly correlated independent variables usually inflate the standard errors of the estimated beta coefficients (or slopes).
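As a concrete check, variance inflation factors can be computed on a logistic regression fit; a minimal sketch, assuming the car package is available:
library(car) # provides vif()
# Fit a logistic regression on all predictors and inspect the (generalized) VIFs;
# values well above 5-10 flag problematic collinearity. If R reports aliased
# coefficients, drop one of the perfectly collinear dummies first.
collin_fit <- glm(target ~ ., data = fundraising, family = binomial)
vif(collin_fit)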
Weighted sampling was used to produce a training dataset with equal numbers of donors and non-donors. A simple random sample would reflect the original class proportions (roughly 5% donors) and produce a heavily unbalanced dataset. A balanced dataset matters because, with unbalanced data, the model learns mostly about the majority class: it becomes very good at predicting the majority class but classifies the minority class poorly. With balanced data the model learns about the two classes equally and produces much better predictions of donors.
Several statistical learning algorithms will be fit to predict the response variable: random forest, logistic regression, k-nearest neighbors, support vector machines, and a decision tree. In order to use all the observations for training, repeated cross-validation (repeatedcv) will be used to estimate the out-of-sample error. Accuracy will be used to evaluate, compare, and choose the best model.
I will perform the tasks in the assignment step by step and describe results along the way.
Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
I will first make a few plots to see the distribution of the variables to better understand the dataset.
#plot histograms
plot_histogram(fundraising, theme_config = list("strip.background" = element_rect(fill = "lightblue"),"aspect.ratio" = 1/3), ggtheme = theme_bw()) # plot_histogram() is from the DataExplorer package
#color-coded histograms of numeric variables, filled by the target variable
plt1 = ggplot(fundraising,aes(home_value)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt2 = ggplot(fundraising,aes(avg_gift)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt3 = ggplot(fundraising,aes(avg_fam_inc)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt4 = ggplot(fundraising,aes(num_prom)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt5 = ggplot(fundraising,aes(lifetime_gifts)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt6 = ggplot(fundraising,aes(pct_lt15k)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt7 = ggplot(fundraising,aes(income)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt8 = ggplot(fundraising,aes(time_lag)) +
geom_histogram(aes(fill= as.factor(target))) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# combine the histogram in one plot
ggarrange(plt1, plt2, plt3, plt4, plt5, plt6,plt7, plt8,
ncol = 2, nrow = 4,
widths = 1,
heights = 1,
font.label = list(size = 20, color = "green", face = "bold", family = NULL),
common.legend = T)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated once per histogram panel)
The histograms show that most of the variables are skewed; the exception is income, which records income level as an ordinal code. I will center and scale the variables while fitting models in caret.
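The visual impression of skew can be quantified; a quick sketch using the skewness() function from the e1071 package (an assumption, any skewness implementation would do):
library(e1071) # provides skewness()
num_vars <- sapply(fundraising, is.numeric) # identify the numeric columns
sort(sapply(fundraising[, num_vars], skewness), decreasing = TRUE) # most skewed first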
Correlation and collinearity analysis
I will use the correlation matrix and a multicollinearity test to evaluate the association between the predictors. Color-coded histograms and the variable importance feature of the random forest method will be used to analyze the association between the predictors and the response variable and to select predictors that are useful in predicting whether a customer will donate.
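Highly correlated predictors can also be flagged programmatically with caret's findCorrelation(); a minimal sketch (the 0.9 cutoff is a judgment call):
num_vars <- sapply(fundraising, is.numeric) # numeric predictors only
cor_mat <- cor(fundraising[, num_vars])
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE) # names of columns suggested for removal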
Let’s check the correlation matrix for the numerical variables.
#correlation matrix
corrplot(cor(fundraising[,-c(1:5,8,21)]), diag = F, method = "color", type = "lower", col=brewer.pal(n=10, name="RdYlBu"), tl.col = "Black", tl.srt = 45, order = "hclust", outline = T)
avg_fam_inc and med_fam_inc, and avg_gift and last_gift, are highly correlated, with correlation coefficients of 0.98 and 0.87 respectively. Including such highly correlated variables leads to multicollinearity. med_fam_inc and avg_fam_inc are also negatively correlated with the percentage of people earning less than $15K in the potential donor’s neighborhood (pct_lt15k), and home_value appears correlated with med_fam_inc and avg_fam_inc.
#boxplots of numeric predictors color coded by the target variable
plt1 = ggplot(fundraising,aes(x = as.factor(target), y = home_value)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt2 = ggplot(fundraising,aes(x = as.factor(target), y = avg_gift)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt3 = ggplot(fundraising,aes(x = as.factor(target), y = avg_fam_inc)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt4 = ggplot(fundraising,aes(x = as.factor(target), y = num_prom)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt5 = ggplot(fundraising,aes(x = as.factor(target), y = lifetime_gifts)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
plt6 = ggplot(fundraising,aes(x = as.factor(target), y = pct_lt15k)) +
geom_boxplot(aes(fill= as.factor(target))) + xlab(" ") + theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggarrange(plt1, plt2, plt3, plt4, plt5, plt6,
ncol = 6, nrow = 1,
widths = 1,
heights = 1,
font.label = list(size = 20, color = "green", face = "bold", family = NULL),
legend = "none")These predictors seems to have different median for the two classes, but the difference is not large.
#bar plots of categorical predictors color coded by the target variable
pl1 = ggplot(fundraising,aes(homeowner)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pl2 = ggplot(fundraising,aes(female)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pl3 = ggplot(fundraising,aes(num_child)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pl4 = ggplot(fundraising,aes(income)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pl5 = ggplot(fundraising,aes(wealth)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pl6 = ggplot(fundraising,aes(target)) + geom_bar(aes(fill=target), position = position_dodge()) +
theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggarrange(pl1, pl2, pl3, pl4, pl5, pl6,
ncol = 2, nrow = 3,
widths = 1,
heights = 1,
font.label = list(size = 20, color = "green", face = "bold", family = NULL),
common.legend = T)
We can see that the variables flagged as unimportant indeed do little to separate donors from non-donors.
Variable importance plot
Random forests provide variable importance measures that can be used to identify the most important predictor variables.
#fit a basic randomforest model to select the important variables based on the variable importance feature.
set.seed(12345) #set seed for reproducibility
rf=train(target~.,data=fundraising) #train
rfImp<-varImp(rf) #compute the variable importance
plot(rfImp,top=10) #plot the top 10 most important variables
According to the importance plot, avg_gift, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, time_lag, and months_since_donate are important.
Though med_fam_inc has a high score in the variable importance table, I chose not to include it in the models because it is highly correlated with avg_fam_inc.
Based on the correlation matrix, the plots, and the variable importance plot, the following predictors are selected for the models:
home_value, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, time_lag, avg_gift, months_since_donate
Several models will be fit to select the best one. The fitting is done with the caret package. Repeated 5-fold cross-validation is used to obtain a better estimate of the test-set error than a single cross-validation: the entire cross-validation procedure is repeated 3 times. This yields a more precise assessment of how well each model performs, as we have many more out-of-sample datasets to look at.
Since the majority of the predictors are skewed and their scales are quite different, the preProcess option will be used to ensure the variables are centered around zero and have roughly unit variance. Center and scale are chosen.
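For reference, the effect of preProcess = c("center","scale") can be reproduced standalone; a minimal sketch:
num_vars <- sapply(fundraising, is.numeric) # numeric predictors only
pp <- preProcess(fundraising[, num_vars], method = c("center", "scale"))
scaled <- predict(pp, fundraising[, num_vars])
round(colMeans(scaled), 3) # all approximately 0 after centering, with unit variance from scaling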
A tuneLength of 10 is used to explore more candidate models and find a better one. The hyperparameters that result in the lowest out-of-sample error (highest accuracy) will be presented under each model.
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, missing values, complex interactions, and correlations between variables (multicollinearity). The hyperparameter required to fit the model in caret is mtry, which defines the number of variables randomly sampled as candidates at each split.
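Instead of tuneLength, an explicit grid of mtry values could be supplied; a small hypothetical sketch (mtry can range from 1 up to the number of predictors):
rf_grid <- expand.grid(mtry = 2:8) # candidate numbers of variables tried at each split
# then pass tuneGrid = rf_grid to train() in place of tuneLength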
set.seed(12345) #set seed for reproducibility
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) #train control to be applied for all the models.
# Repeated 5 fold cv
rfFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
pct_lt15k + num_prom + lifetime_gifts + time_lag*avg_gift, # note: time_lag*avg_gift expands to both main effects plus their interaction
data = fundraising,
method = "rf",
trControl = ctrl,
preProcess = c("center","scale"), # standardise variables
tuneLength = 10)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
rfFit$bestTune
plot(rfFit)
Logistic regression is one of the most widely known and used machine learning algorithms for classification problems.
set.seed(12345) #set seed for reproducibility
glmFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift,
data = fundraising,
method = "glm",
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 10)
glmFit$results$Accuracy #retrieve accuracy from the final model
## [1] 0.5538848
The support vector machine algorithm tries to find a hyperplane in N-dimensional space (N being the number of features) that gives the biggest separation between the classes, i.e. the points are as far from the boundary as possible. The basic classifier is a linear hyperplane, but non-linearly separable points can be handled with the kernel trick: a transformation that maps the original data into a new dimension. Here a linear kernel (svmLinear) and a radial kernel (svmRadial) are used. The hyperparameter in support vector machines is C, a penalty term that controls how much one wants to avoid misclassifying each training example. The SVM with the radial kernel has an additional parameter in caret, called sigma.
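For intuition, the radial (RBF) kernel that svmRadial uses, as implemented in kernlab, measures similarity as K(x, x') = exp(-sigma * ||x - x'||^2); a tiny illustrative sketch:
rbf_kernel <- function(x1, x2, sigma = 0.015) {
  exp(-sigma * sum((x1 - x2)^2)) # near 1 for close points, near 0 for distant ones
}
rbf_kernel(c(1, 2), c(1.2, 2.1)) # similar observations give a kernel value near 1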
set.seed(12345) #set seed for reproducibility
svmFit.lin <- train(target ~ home_value + avg_fam_inc + months_since_donate +
pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift,
data = fundraising,
method = "svmLinear",
trControl = ctrl,
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0.1, 2, length = 10)),
tuneLength = 10)
res.svm.lin<-as_tibble(svmFit.lin$results[which.max(svmFit.lin$results[,2]),]) #keep the row with the highest accuracy
plot(svmFit.lin)
Non-linear SVM classifier: we can also fit non-linear kernels, such as the radial kernel below, to this dataset.
- Radial Kernel
set.seed(12345)
svmFit.rad <- train(target ~ home_value + avg_fam_inc + months_since_donate +
pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift,
data = fundraising,
method = "svmRadial",
trControl = ctrl,
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0, 3, length = 10), sigma = c(.01, .015, 0.2)), # note: C = 0 is degenerate and triggers the warnings below
tuneLength = 5)
## Warning: model fit failed for Fold1.Rep1: C=0.0000, sigma=0.010 Error in .local(x, ...) :
## No Support Vectors found. You may want to change your parameters
## (the same warning is repeated for every fold, repeat, and sigma value whenever C = 0)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
svmFit.rad$bestTune
res.svm.rad<-as_tibble(svmFit.rad$results[which.max(svmFit.rad$results[,3]),]) #keep the row with the highest accuracy
plot(svmFit.rad)
KNN is a supervised learning algorithm that predicts the class of a test observation by computing the distance between it and all of the training points and taking a majority vote among the k closest ones. The only required parameter is the value of k; the tuning procedure chooses the optimal number of neighbors.
set.seed(12345) #set seed for reproducibility, matching the other fits
knnFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift,
data = fundraising,
method = "knn",
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 10)
knnFit$bestTune
plot(knnFit)
Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?
Weighted sampling was used to produce a training set with equal numbers of donors and non-donors because a simple random sample from the original dataset would preserve the original class proportions, in which only about 5% are donors, yielding a heavily unbalanced training set. With unbalanced data the model learns mostly about the majority class and predicts it well while misclassifying the minority class, which here is exactly the class we care about. With balanced data the model learns about both classes equally and produces much better predictions of donors; a sketch of how such a balanced set could be produced follows.
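For illustration, here is how such a balanced set could be built from the full donor database using caret's downSample(); full_db is hypothetical, since the sample provided to us is already balanced:
set.seed(12345)
balanced <- downSample(x = full_db[, setdiff(names(full_db), "target")], # full_db is a hypothetical full database
                       y = full_db$target, yname = "target")
table(balanced$target) # equal counts of Donor and No Donor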
Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
# save all the accuracies to a tibble.
df<-tibble(Model=c('Random Forest','Logistic Regression', "Linear SVM", "SVM Radial",'K-Nearest Neighbour'),
Accuracy=c(rfFit$results[which.max(rfFit$results[,2]),][2],
glmFit$results$Accuracy,
res.svm.lin$Accuracy,
res.svm.rad$Accuracy,
knnFit$results[which.max(knnFit$results[,2]),][2]))
(df$Accuracy = as.numeric(df$Accuracy)) #transform to numeric
## [1] 0.5296600 0.5538848 0.5466641 0.5548859 0.5328932
df %>% arrange(Accuracy) #sort models by accuracy
From your answer in (4), what do you think is the “best” model?
The SVM with the radial kernel achieves the highest cross-validated accuracy (0.5549), so it is chosen as the best model; no model dominates by a wide margin, as all accuracies fall between roughly 0.53 and 0.55.
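As a cross-check, caret can compare the models directly from their stored resampling results; this assumes all fits share the same resampling setup, which holds here via the common ctrl object and seed:
resamps <- resamples(list(RF = rfFit, GLM = glmFit, SVM_Linear = svmFit.lin,
                          SVM_Radial = svmFit.rad, KNN = knnFit))
summary(resamps) # accuracy and kappa distributions across the 15 resamples
bwplot(resamps) # side-by-side boxplots of the resampled accuracies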
The file FutureFundraising.csv contains the attributes for future mailing candidates.
Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
#predict new data using the best model
preds = predict(svmFit.rad, fundraising_future)
For each row in the test set, you must predict whether or not the candidate is a donor. The .csv file should contain a header and have the following format:
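The expected template is not reproduced here; based on the column name used in the code below, it presumably looks like this, with one predicted label per row:
value
Donor
No Donor
...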
#write.csv
pr = as.data.frame(preds) #save predictions as dataframe
colnames(pr)= "value" # name the column
write.csv(pr,file="mySubmission.csv", row.names = FALSE) #omit row numbers so the file contains only the "value" column
The maximum cross-validated accuracy, 0.5549, was obtained by the support vector machine with the radial kernel, which is not impressive. Accuracy on the future fundraising dataset is 65%, which suggests our models are underfitting. Further analysis of interaction terms between the variables and a wider grid search for better hyperparameters are therefore recommended to improve the prediction performance of the models.