Background

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%, and among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that captures donors effectively so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-respondents so that the sample has equal numbers of donors and non-donors.
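
As a sanity check on these figures, the expected net profit of an untargeted mailing can be computed directly. The short R sketch below uses only the numbers quoted above:

# back-of-the-envelope economics: mailing everyone loses money,
# which is why a targeting model is needed
response_rate <- 0.051    # overall response rate
avg_donation  <- 13.00    # average donation (USD) among respondents
mailing_cost  <- 0.68     # cost (USD) to produce and send one piece
response_rate * avg_donation - mailing_cost   # expected net profit per piece
## [1] -0.017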

Data

The dataset for this project was provided by the American Legion. A dataset containing 3,000 observations was provided for training the model. Another set containing 120 observations was provided for predicting future fundraising with the best model. There are 20 predictor variables, and the target variable is target, which indicates whether the customer will donate or not. The target variable is absent from the future dataset.

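The code in this report relies on several packages; a minimal setup, assuming they are all installed:

# packages assumed to be loaded throughout this report
library(tidyverse)      # ggplot2, dplyr, tibble, ...
library(caret)          # model training and cross-validation
library(DataExplorer)   # plot_histogram()
library(ggpubr)         # ggarrange()
library(corrplot)       # correlation matrix plots
library(RColorBrewer)   # brewer.pal()
library(randomForest)   # backend for caret's method = "rf"
library(kernlab)        # backend for svmLinear / svmRadial
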
#load datasets
fundraising = readRDS("fundraising.rds")
fundraising_future = readRDS("future_fundraising.rds")
str(fundraising) # check for the type of the data 
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
#univariate statistics to summarize each variable
summary(fundraising)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167
summary(fundraising_future)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner   num_child    
##  No :99      Yes:27      No :94      No :74      Yes:91    Min.   :1.000  
##  Yes:21      No :93      Yes:26      Yes:46      No :29    1st Qu.:1.000  
##                                                            Median :1.000  
##                                                            Mean   :1.067  
##                                                            3rd Qu.:1.000  
##                                                            Max.   :4.000  
##      income      female       wealth        home_value      med_fam_inc   
##  Min.   :1.000   Yes:70   Min.   :0.000   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:2.750   No :50   1st Qu.:5.750   1st Qu.: 579.5   1st Qu.:276.2  
##  Median :4.000            Median :8.000   Median : 897.5   Median :359.0  
##  Mean   :3.767            Mean   :6.567   Mean   :1093.7   Mean   :384.7  
##  3rd Qu.:5.000            3rd Qu.:8.000   3rd Qu.:1319.5   3rd Qu.:474.8  
##  Max.   :7.000            Max.   :9.000   Max.   :5413.0   Max.   :823.0  
##   avg_fam_inc      pct_lt15k        num_prom      lifetime_gifts 
##  Min.   :  0.0   Min.   : 0.00   Min.   : 15.00   Min.   : 15.0  
##  1st Qu.:311.8   1st Qu.: 5.00   1st Qu.: 29.75   1st Qu.: 45.0  
##  Median :405.5   Median :12.50   Median : 47.00   Median : 82.0  
##  Mean   :426.5   Mean   :14.52   Mean   : 47.85   Mean   :101.9  
##  3rd Qu.:522.2   3rd Qu.:21.25   3rd Qu.: 64.00   3rd Qu.:126.5  
##  Max.   :813.0   Max.   :63.00   Max.   :103.00   Max.   :431.0  
##   largest_gift      last_gift      months_since_donate    time_lag     
##  Min.   :  5.00   Min.   :  1.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.: 10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.750  
##  Median : 15.00   Median : 12.00   Median :31.00       Median : 5.000  
##  Mean   : 16.79   Mean   : 14.52   Mean   :31.24       Mean   : 6.508  
##  3rd Qu.: 20.00   3rd Qu.: 16.00   3rd Qu.:35.00       3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :100.00   Max.   :37.00       Max.   :26.000  
##     avg_gift     
##  Min.   : 2.963  
##  1st Qu.: 6.819  
##  Median : 9.331  
##  Mean   :11.235  
##  3rd Qu.:13.354  
##  Max.   :42.500

The values seem to be within a reasonable range, and I don’t notice any data-inconsistency issues in the dataset.

#Check for missing values
sum(is.na(fundraising))               # total missing values in the training set
## [1] 0
sum(duplicated(fundraising))          # duplicated observations in the training set
## [1] 0
sum(is.na(fundraising_future))        # total missing values in the future set
## [1] 0
sum(duplicated(fundraising_future))   # duplicated observations in the future set
## [1] 0

There are no missing values and no duplicated observations in the datasets.



Methodology

The dataset is clean, with no missing values or duplicated rows. An exploratory data analysis involving univariate statistics, a correlation matrix, histograms, and boxplots will be performed to understand the dataset, examine the relationship between the predictors and the response variable target, and assess the correlation and possible collinearity between predictors. This, in combination with the variable importance feature of the randomForest package, will be used to select the predictors that matter for predicting whether a customer will donate or not. Variables found to be collinear will be excluded, as collinearity is problematic when fitting some models. For example, in a logistic regression model, highly correlated independent variables usually result in large standard errors for the estimated beta coefficients (slopes) of those variables.
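
To illustrate that point, variance inflation factors (VIFs) from a quick logistic fit on the numeric predictors would flag the collinear pairs. A minimal sketch, assuming the car package is available (glm_probe is an illustrative name, not part of the analysis):

library(car)   # for vif()

# logistic fit on the numeric predictors only, used purely as a collinearity probe
glm_probe <- glm(target ~ home_value + med_fam_inc + avg_fam_inc + pct_lt15k +
                   num_prom + lifetime_gifts + largest_gift + last_gift +
                   avg_gift + months_since_donate + time_lag,
                 data = fundraising, family = binomial)
vif(glm_probe)   # VIFs well above 5-10 signal problematic collinearity;
                 # med_fam_inc/avg_fam_inc and last_gift/avg_gift should stand out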

Weighted sampling was used to produce a training dataset with equal numbers of donors and non-donors. A simple random sample would instead reflect the original class proportions, roughly 5% donors, and would produce an unbalanced dataset. A balanced dataset is important because, with unbalanced data, the model learns mostly about the majority class: it becomes very good at predicting the majority class but fails to classify the minority class accurately. With balanced data, the model learns about the two classes equally and produces much better predictions.

Several statistical learning algorithms will be fit to predict the response variable: Random Forest, Logistic Regression, K-Nearest Neighbors, Support Vector Machines, and a Decision Tree. In order to use all the observations for training, repeated cross-validation (repeatedcv) will be used to estimate the out-of-sample error. Accuracy will be used to evaluate, compare, and choose the best model.



Assignment

I will perform the tasks in the assignment step by step and describe results along the way.




Step 1: Partitioning

  • Cross-validation will be used to estimate the out-of-sample error.

Step 2: Model Building





1. Exploratory data analysis

Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?

I will first make a few plots to see the distribution of the variables to better understand the dataset.

#plot histograms of all numeric variables (DataExplorer)
plot_histogram(fundraising,
               theme_config = list("strip.background" = element_rect(fill = "lightblue"),
                                   "aspect.ratio" = 1/3),
               ggtheme = theme_bw())

#color coded histograms of numeric variables, filled by the target variable
num_vars <- c("home_value", "avg_gift", "avg_fam_inc", "num_prom",
              "lifetime_gifts", "pct_lt15k", "income", "time_lag")
hist_plts <- lapply(num_vars, function(v) {
  ggplot(fundraising, aes(.data[[v]])) +
    geom_histogram(aes(fill = target)) + theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
})

# combine the histograms in one plot
ggarrange(plotlist = hist_plts, ncol = 2, nrow = 4, common.legend = TRUE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms show that most of the variables are right-skewed. The exception is income, which is an ordinal code for income level. I will center and scale the variables while fitting in caret.

Correlation and collinearity analysis

I will use the correlation matrix and a multicollinearity check to evaluate the association between the predictors. Color-coded histograms and the variable importance feature of the random forest method will be used to analyse the association between the predictors and the response variable and to select predictors that are useful for predicting whether a customer will donate or not.

Let’s check the correlation matrix for the numerical variables.

#correlation matrix of the numeric predictors (factor columns dropped)
corrplot(cor(fundraising[, -c(1:5, 8, 21)]),
         diag = FALSE, method = "color", type = "lower",
         col = brewer.pal(n = 10, name = "RdYlBu"),
         tl.col = "Black", tl.srt = 45, order = "hclust", outline = TRUE)

  • The avg_fam_inc and med_fam_inc pair, and the avg_gift and last_gift pair, are highly correlated, with correlation coefficients of 0.98 and 0.87 respectively. Including such highly correlated variables leads to multicollinearity.
  • med_fam_inc and avg_fam_inc also correlate negatively with the percentage of people earning less than $15K in the potential donor’s neighborhood (pct_lt15k). home_value also appears correlated with med_fam_inc and avg_fam_inc. A numeric spot-check of these pairs follows below.
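A minimal numeric spot-check of the flagged pairs (the values should agree with the corrplot):

cor(fundraising$med_fam_inc, fundraising$avg_fam_inc)   # expected ~0.98
cor(fundraising$last_gift,   fundraising$avg_gift)      # expected ~0.87
cor(fundraising$pct_lt15k,   fundraising$med_fam_inc)   # expected to be negative
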
#boxplots of selected numeric predictors, split by the target variable
box_vars <- c("home_value", "avg_gift", "avg_fam_inc",
              "num_prom", "lifetime_gifts", "pct_lt15k")
box_plts <- lapply(box_vars, function(v) {
  ggplot(fundraising, aes(x = target, y = .data[[v]])) +
    geom_boxplot(aes(fill = target)) + xlab(" ") + theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
})

ggarrange(plotlist = box_plts, ncol = 6, nrow = 1, legend = "none")

These predictors appear to have different medians for the two classes, but the differences are not large.

#bar plots of the categorical and count predictors, color coded by the target variable

bar_vars <- c("homeowner", "female", "num_child", "income", "wealth", "target")
bar_plts <- lapply(bar_vars, function(v) {
  ggplot(fundraising, aes(.data[[v]])) +
    geom_bar(aes(fill = target), position = position_dodge()) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
})

ggarrange(plotlist = bar_plts, ncol = 2, nrow = 3, common.legend = TRUE)

The bar plots show that the variables deemed unimportant indeed do little to separate donors from non-donors.

Variable importance plot

Random forests provide variable importance measures that can be used to identify the most important predictor variables.

#fit a baseline random forest model to select important variables via variable importance
set.seed(12345)                          # set seed for reproducibility
rf <- train(target ~ ., data = fundraising)   # train (caret's default method is "rf")
rfImp <- varImp(rf)                      # compute the variable importance
plot(rfImp, top = 10)                    # plot the 10 most important variables

According to the importance plot, avg_gift, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, time_lag, and months_since_donate are important.

Although med_fam_inc has a higher score in the variable importance table, I chose not to include it in the model because it is highly correlated with avg_fam_inc.

Based on the correlation matrix, the plots, and the variable importance plot, the following predictors are selected for the models.

  • home_value
  • avg_fam_inc
  • pct_lt15k
  • num_prom
  • lifetime_gifts
  • time_lag
  • avg_gift
  • months_since_donate



2. Select Classification tools and parameters

Several models will be fit in order to select the best one. The fitting is done with the caret package. Repeated 5-fold cross-validation was used to obtain a better estimate of the test-set error than a single round of cross-validation: the entire cross-validation procedure is repeated 3 times. This yields a more precise assessment of how well the model performs, as there are many more out-of-sample resamples to look at.

Since the majority of the predictors are skewed and the scales of the predictors are quite different, the preProcess option will be used to ensure the variables are centered around zero and have roughly unit variance. The center and scale transformations are chosen.

A tuneLength of 10 was used to explore more candidate models and find a better one. The hyperparameters that resulted in the lowest out-of-sample error (highest accuracy) are presented under each model.

Random Forest

Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, missing values, complex interactions, and correlations between variables (multicollinearity). The hyperparameter required to fit the model in caret is mtry, which defines the number of variables randomly sampled as candidates at each split.

set.seed(12345)                                                      #set seed for reproducibility
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)   #train control to be applied for all the models. 
                                                                     # Repeated 5 fold cv
rfFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
                 pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift, 
               data = fundraising, 
               method = "rf",
               trControl = ctrl, 
               preProcess = c("center","scale"),                     # standardise variables
               tuneLength = 10) 
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
rfFit$bestTune
plot(rfFit)



Logistic Regression

Logistic regression is one of the most widely known and used machine learning algorithms for classification problems.

set.seed(12345)                        #set seed for reproducibility
glmFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
                 pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift, 
               data = fundraising,
               method = "glm",
               trControl = ctrl,
               preProcess = c("center","scale"), 
               tuneLength = 10)
glmFit$results$Accuracy               #retrieve accuracy from final model
## [1] 0.5538848



Support Vector Machines

The support vector machine algorithm looks for the hyperplane in N-dimensional feature space (N being the number of features) that gives the largest separation between the classes, i.e. places the training points as far from the decision boundary as possible. The basic classifier is linear, but non-linearly separable points can be handled by the kernel trick: a transformation that maps the original data into a new, higher-dimensional space. Here both a linear kernel (svmLinear) and a radial kernel (svmRadial) are used. The main hyperparameter for support vector machines is C, a penalty term that defines how strongly misclassified training examples are penalized. The radial kernel, K(x, x') = exp(-sigma * ||x - x'||^2), has an additional parameter in caret called sigma, which controls the width of the kernel.

set.seed(12345)                        #set seed for reproducibility
svmFit.lin <- train(target ~ home_value + avg_fam_inc + months_since_donate +
                 pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift, 
               data = fundraising,
               method = "svmLinear",
               trControl = ctrl, 
               preProcess = c("center","scale"),
               tuneGrid = expand.grid(C = seq(0.1, 2, length = 10)),
               tuneLength = 10)

# keep the tuning-grid row with the highest accuracy (column 2)
res.svm.lin <- as_tibble(svmFit.lin$results[which.max(svmFit.lin$results[, 2]), ])
plot(svmFit.lin)

Non-linear SVM classifier. We can also fit a non-linear kernel to the data.

- Radial Kernel

set.seed(12345)
svmFit.rad <- train(target ~ home_value + avg_fam_inc + months_since_donate +
                 pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift, 
               data = fundraising,
               method = "svmRadial",
               trControl = ctrl, 
               preProcess = c("center","scale"),
               tuneGrid = expand.grid(C = seq(0, 3, length = 10), sigma = c(.01, .015, 0.2)),
               tuneLength = 5)
## Warning: model fit failed for Fold1.Rep1: C=0.0000, sigma=0.010 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## (the same warning repeats for every fold, repeat, and sigma value at C = 0:
##  a cost of zero admits no support vectors, so those fits fail and caret
##  drops them from the resampling results)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
svmFit.rad$bestTune
# keep the tuning-grid row with the highest accuracy (column 3)
res.svm.rad <- as_tibble(svmFit.rad$results[which.max(svmFit.rad$results[, 3]), ])
plot(svmFit.rad)



K-Nearest Neighbours

KNN is a supervised learning algorithm that predicts the class of a test observation by computing the distance between it and all the training points, then taking a majority vote among the K closest training points. The only required tuning parameter is the value of k; caret chooses the optimal number of neighbours.

set.seed(12345)                        #set seed for reproducibility, as for the other fits
knnFit <- train(target ~ home_value + avg_fam_inc + months_since_donate +
                 pct_lt15k + num_prom + lifetime_gifts + time_lag + avg_gift, 
               data = fundraising,
               method = "knn",
               trControl = ctrl,
               preProcess = c("center","scale"), 
               tuneLength = 10)

knnFit$bestTune
plot(knnFit)

3. Classification under asymmetric response and cost.

Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?

Weighted sampling was used to produce a training set with equal numbers of donors and non-donors because a simple random sample would mirror the original class proportions, only about 5.1% donors, and would therefore be highly unbalanced. On unbalanced data a classifier learns mostly about the majority class: it predicts the majority class well but misclassifies the minority class, which here is exactly the class we care about. With balanced data the model learns about both classes equally and yields much better predictions of donors. See the sketch below.
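
As a sketch of the mechanics, caret's downSample() can produce such a balanced set from raw, unbalanced data. The raw data below is simulated at the reported 5.1% response rate, since only the pre-balanced sample was provided:

set.seed(12345)
# simulate an unbalanced raw response vector at the reported 5.1% donor rate
raw_y <- factor(sample(c("Donor", "No Donor"), size = 10000,
                       replace = TRUE, prob = c(0.051, 0.949)))
raw_x <- data.frame(row_id = seq_along(raw_y))   # placeholder predictors

balanced <- downSample(x = raw_x, y = raw_y, yname = "target")
table(balanced$target)   # equal counts of Donor and No Donor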



4. Evaluate the fit

Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?

# save all the accuracies to a tibble. 
df <- tibble(Model = c("Random Forest", "Logistic Regression", "Linear SVM", "SVM Radial", "K-Nearest Neighbour"),
Accuracy=c(rfFit$results[which.max(rfFit$results[,2]),][2], 
           glmFit$results$Accuracy, 
           res.svm.lin$Accuracy, 
           res.svm.rad$Accuracy,
           knnFit$results[which.max(knnFit$results[,2]),][2]))
(df$Accuracy = as.numeric(df$Accuracy))    #transform to numeric
## [1] 0.5296600 0.5538848 0.5466641 0.5548859 0.5328932
df %>% arrange(Accuracy)            # sorted ascending, so the best model is last
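
Since all five models were fit with the same trainControl (and the same seed), their per-fold accuracies can also be compared graphically via caret's resamples interface. A sketch:

# collect the per-resample accuracies and compare the models side by side
resamps <- resamples(list(RF      = rfFit,
                          GLM     = glmFit,
                          SVM.lin = svmFit.lin,
                          SVM.rad = svmFit.rad,
                          KNN     = knnFit))
summary(resamps)
bwplot(resamps, metric = "Accuracy")   # box-and-whisker plot per model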



5. Select best model.

From your answer in (4), what do you think is the “best” model?

  • The support vector machine with the radial kernel performs better than the rest of the models.



Step 3: Testing.

The file FutureFundraising.csv contains the attributes for future mailing candidates.

6. Test best model

Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.

#predict new data using the best model
preds = predict(svmFit.rad,  fundraising_future)
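
A quick tally of the predicted classes among the 120 candidates (the exact counts depend on the fitted model):

table(preds)   # predicted Donor vs No Donor counts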

7. Submission File.

For each row in the test set, you must predict whether the candidate is a donor or not. The .csv file should contain a header and have the following format:

#write.csv
pr = as.data.frame(preds)     #save predictions as dataframe
colnames(pr)= "value"         # name the column
write.csv(pr,file="mySubmission.csv")

Conclusion

The maximum cross-validated accuracy, 0.5549, was obtained from the support vector machine with the radial kernel, which is not fabulous. Accuracy on the future fundraising dataset is 65%, which suggests that the models are underfitting. Further analysis of interaction terms between the variables and a wider hyperparameter grid search are therefore recommended to improve the predictive performance of the models.