# Load libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(leaps)
library(tidyverse)
library(psych)
library(rpart)
library(kableExtra)
library(C50)Given the college dataset attached to this assignment, the goal of this problem is to predict the number of applications received (“Apps” variable) using the other variables in the dataset. The variables are:
•X: Name of the college
•Private : Public/private indicator
•Apps : Number of applications received
•Accept : Number of applicants accepted
•Enroll : Number of new students enrolled
•Top10perc : Proportion of New students from top 10% of high school class
•Top25perc : Proportion of New students from top 25% of high school class
•F.Undergrad : Number of full-time undergraduates
•P.Undergrad : Number of part-time undergraduates
•Outstate : Out-of-state tuition
•Room.Board : Room and board costs
•Books : Estimated book costs
•Personal : Estimated personal spending
•PhD : Percent of faculty with Ph.D.’s
•Terminal : Percent of faculty with terminal degree
•S.F.Ratio : Student/faculty ratio
•perc.alumni : Percent of alumni who donate
•Expend : Instructional expenditure per student
•Grad.Rate : Graduation rate
1. Download the dataset college.csv and explore its overall structure. Get a summary statistics of each variable. Answer the following questions:
# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/College.csv", header = T, sep = ",", na.strings = "?", strip.white=TRUE)
summary(df)## X Private Apps Accept
## Length:777 Length:777 Min. : 81 Min. : 72
## Class :character Class :character 1st Qu.: 776 1st Qu.: 604
## Mode :character Mode :character Median : 1558 Median : 1110
## Mean : 3002 Mean : 2019
## 3rd Qu.: 3624 3rd Qu.: 2424
## Max. :48094 Max. :26330
## Enroll Top10perc Top25perc F.Undergrad
## Min. : 35 Min. : 1.00 Min. : 9.0 Min. : 139
## 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
## Median : 434 Median :23.00 Median : 54.0 Median : 1707
## Mean : 780 Mean :27.56 Mean : 55.8 Mean : 3700
## 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
## Max. :6392 Max. :96.00 Max. :100.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
## perc.alumni Expend Grad.Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :21.00 Median : 8377 Median : 65.00
## Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :118.00
## [1] 777 19
There are 777 observations in the dataset.
## 'data.frame': 777 obs. of 19 variables:
## $ X : chr "Abilene Christian University" "Adelphi University" "Adrian College" "Agnes Scott College" ...
## $ Private : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
# Count the number of categorical and numeric variables
categorical_variable <- sapply(df, is.character)
numerical_variables <- sapply(df, is.numeric)
count_categorical <- sum(categorical_variable)
count_numerical <- sum(numerical_variables)
cat("categorical variables: ",count_categorical,"\n")## categorical variables: 2
## numerical variables: 17
## X Private Apps Accept Enroll Top10perc
## 0 0 0 0 0 0
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books
## 0 0 0 0 0 0
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 0 0 0 0 0 0
## Grad.Rate
## 0
No missing value in the dataset.
2. Remove the first column (the name of the college)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
3. Which variables are associated with “Apps”? Use appropriate plots and statistics test to answer this question.
numeric_values <- df2 %>%
dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>%
data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt', number.cex = 0.5)Variables Accept, Enroll and F.Undergrad are highly positively associate with Apps and perc.alumni negatively associate with Apps.
Apps vs Private
# Create the boxplot
boxPlot_private = ggplot(df2, aes(x = Private, y = Apps, color = Private)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_private##
## Welch Two Sample t-test
##
## data: Apps by Private
## t = 9.7985, df = 244.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2997.758 4506.223
## sample estimates:
## mean in group No mean in group Yes
## 5729.920 1977.929
Here p-value is less than 0.05, that is, we can conclude that there is a statistically significant difference between the means of the numeric variable for each level of the categorical variable. The Apps is associated with Types of college.
4. plot the histogram of the number of applications “Apps” variable. Explain what the histogram shows?
# Draw a histogram
ggplot(df2, aes(Apps)) +
geom_histogram(fill = "blue", alpha = 0.5, bins = 50) +
labs(x = "Applications", y = "Frequency") +
ggtitle("Histogram of Number of applications received")The histogram is right skewed. The applications having high frequancy between 1000 to 5000. We see an outlier where the number of applications between 45,000 to 50,000.
5.(0.5 pt) Split the data into train and test ( you can use the first 621 rows for training and rest for testing).
## [1] 621 18
## [1] 156 18
6. set the random seed: set.seed(123)
7. Use caret package to run 10 fold cross validation using linear regression method on the train data . Print the resulting model to see the cross validation RMSE. In addition, take a summary of the model and interpret the coefficients. Which coefficients are statistically different from zero? What does this meant?
train.control= trainControl(method = "cv", number = 10)
cv_model_1 = train(Apps ~ .,data = df_train,method= "lm",trControl= train.control)Above built the model using 10 fold cross validation and model used train data.
## Linear Regression
##
## 621 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 559, 558, 558, 560, 558, 559, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1095.186 0.915507 608.9292
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Rsquared is 0.915507, the model explains 0.915507 (91.5%) of the variance of Apps, and the remaining 0.084493 (8.5%) is unexplained.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5249.5 -379.4 2.6 283.3 7372.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -674.81356 449.81107 -1.500 0.13408
## PrivateYes -492.49111 159.85957 -3.081 0.00216 **
## Accept 1.68045 0.04429 37.944 < 2e-16 ***
## Enroll -1.18227 0.20477 -5.774 1.24e-08 ***
## Top10perc 43.67102 6.13052 7.124 3.01e-12 ***
## Top25perc -11.60268 4.94035 -2.349 0.01917 *
## F.Undergrad 0.07072 0.03604 1.962 0.05017 .
## P.Undergrad 0.02374 0.04871 0.487 0.62620
## Outstate -0.08905 0.02099 -4.242 2.56e-05 ***
## Room.Board 0.14926 0.05276 2.829 0.00482 **
## Books 0.05611 0.25857 0.217 0.82828
## Personal 0.03970 0.07157 0.555 0.57932
## PhD -8.61642 5.02397 -1.715 0.08685 .
## Terminal -2.45096 5.49433 -0.446 0.65569
## S.F.Ratio 29.32492 14.57724 2.012 0.04470 *
## perc.alumni 1.14958 4.57727 0.251 0.80178
## Expend 0.09614 0.01460 6.585 9.89e-11 ***
## Grad.Rate 6.10467 3.21809 1.897 0.05831 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1032 on 603 degrees of freedom
## Multiple R-squared: 0.9279, Adjusted R-squared: 0.9258
## F-statistic: 456.3 on 17 and 603 DF, p-value: < 2.2e-16
The t-statistics and p-values shows how likely true coefficient is zero. The summary shows, PrivateYes, Accept, Enroll, Top10perc, Outstate, Room.Board, and Expend are more significant due to lower P-value. Top25perc and S.F.Ratio are significant. P.Undergrad, Books, Personal, Terminal, PHD,and perc.alumni are not significant due to higher p-value.
8. Compute RMSE of the model on the test data. You can call “predict” function and pass to it the model (returned by caret train method) and the test data. Then compute RMSE of the predictions returned by the “predict” method.
result <- predict(cv_model_1, df_test)
test_rmse_1 <- RMSE(result, df_test$Apps)
cat("RMSE of the stepwise model on the test data is",test_rmse_1)## RMSE of the stepwise model on the test data is 1115.375
9. Set the random seed again. We need to do this before each training to ensure we get the same folds in cross validation. Set.seed(123) so we can compare the models using their cross validation RMSE.
11. Use caret and leap packages to run a 10 fold cross validation using step wise linear regression method with backward selection on the train data. The train method by default uses maximum of 4 predictors and reports the best models with 1..4 predictors. We need to change this parameter to consider all predictors. So inside your train function, add the following parameter tuneGrid =data.frame(nvmax = 1:n), where n is the number of variables you use to predict “Apps” . Which model (with how many variables or nvmax ) has the lowest cross validation RMSE? Take the summary of the final model, which variables are selected in the model with the lowest RMSE?
train_control_2= trainControl(method = "cv", number = 10)
cv_model_2 <- train(Apps ~., data = df_train,method= "leapBackward",
tuneGrid = data.frame(nvmax = 1:4), trControl= train_control_2)## Linear Regression with Backwards Selection
##
## 621 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 559, 558, 558, 560, 558, 559, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 1221.693 0.8853502 581.6962
## 2 1119.005 0.9083869 638.5812
## 3 1137.909 0.9038610 636.0471
## 4 1143.690 0.9033961 631.8455
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 2.
## The best one predictor model had an RMSE= 1221.693
## The best two predictor model had an RMSE= 1119.005
## The best three predictor model had an RMSE= 1137.909
## The best four predictor model had an RMSE= 1143.69
RMSE was used to select the optimal model using the smallest value. The final value used for the model was nvmax= 2.
## nvmax
## 2 2
## Subset selection object
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: backward
## PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## Expend Grad.Rate
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
The best one variable model only contains “Accept” variable. The best two variables model contains Top10perc and Accept variables.
12. Compute the RMSE of the stepwise model on the test data.
result_2 <- predict(cv_model_2, df_test)
test_rmse_2 <- RMSE(result_2, df_test$Apps)
cat("RMSE of the stepwise model on the test data is",test_rmse_2)## RMSE of the stepwise model on the test data is 1112.527
13. use “rpart” function to create a regression tree model from the train data. Get the predictions on test data and compute the RMSE.
## n= 621
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 621 8902853000 2729.8080
## 2) Accept< 2877.5 515 977531000 1491.7590
## 4) Accept< 1398.5 381 118534600 916.2782 *
## 5) Accept>=1398.5 134 374054800 3128.0150
## 10) Top10perc< 72.5 127 93741430 2864.5830 *
## 11) Top10perc>=72.5 7 111600500 7907.4290 *
## 3) Accept>=2877.5 106 3300780000 8744.8580
## 6) Accept< 8061.5 89 588818700 7078.2250
## 12) Accept< 6013 73 334578600 6374.2600
## 24) Top25perc< 88.5 62 139193600 5857.6450 *
## 25) Top25perc>=88.5 11 85571440 9286.0910 *
## 13) Accept>=6013 16 53009390 10290.0600 *
## 7) Accept>=8061.5 17 1170519000 17470.1800 *
result_3 <- predict(tree_model_3, df_test)
test_rmse_3 <- RMSE(result_3, df_test$Apps)
cat("RMSE of the tree model on the test data is",test_rmse_3)## RMSE of the tree model on the test data is 1441.396
14. Compare the RMSE on the test data for linear regression, stepwise regression, and the regression tree.
tabularview <- data.frame("Models" = c("Linear Regression", "Stepwise Regression","Regression Tree"),
"RMSE" = c(test_rmse_1, test_rmse_2, test_rmse_3))
kableExtra::kable(tabularview) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)| Models | RMSE |
|---|---|
| Linear Regression | 1115.375 |
| Stepwise Regression | 1112.527 |
| Regression Tree | 1441.396 |
For this problem, you will use customer churn modeling dataset from this Kaggle project (https://www.kaggle.com/shrutimechlearn/churn-modelling ).
The dataset contains details of a bank’s customers. The target variable we want to predict is: “Exited”: a binary variable reflecting whether the customer left the bank (closed his account, Exited=1) or she/he continued to be a customer (Exited=0).
1. Load the dataset. Examine its structure and remove the first three variables(RowNumber,CustomerId, and Surname). These variables are unique for each sample hence, they arenot useful for prediction. Convert all string variables to factors
# Load data
df_bank <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/Churn_Modelling.csv", header = T, sep = ",", na.strings = "?", strip.white=TRUE)
# Structure of data
str(df_bank)## 'data.frame': 10000 obs. of 14 variables:
## $ RowNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CustomerId : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ Surname : chr "Hargrave" "Hill" "Onio" "Boni" ...
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : chr "France" "Spain" "France" "France" ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
## 'data.frame': 10000 obs. of 11 variables:
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : chr "France" "Spain" "France" "France" ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
#convert string to factor
df_bank$Geography <- as.factor(df_bank$Geography)
df_bank$Gender <- as.factor(df_bank$Gender)
# Structure of data
str(df_bank)## 'data.frame': 10000 obs. of 11 variables:
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
2. Use appropriate plots and statistical tests to find which variables are associated with “Exited”.Remove variables not associated with “Exited”
Convert Exited to the factor.
CreditScore vs Exited
boxPlot_cs = ggplot(df_bank, aes(x = Exited, y = CreditScore, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_csThe customer who are existed vs who are not, do not see significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: CreditScore by Exited
## t = 2.6347, df = 3050.9, p-value = 0.008465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.663067 11.340331
## sample estimates:
## mean in group 0 mean in group 1
## 651.8532 645.3515
Age vs Exited
boxPlot_age = ggplot(df_bank, aes(x = Exited, y = Age, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_ageThe customer who are existed vs who are not, does show significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: Age by Exited
## t = -30.419, df = 3248.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.908490 -6.950727
## sample estimates:
## mean in group 0 mean in group 1
## 37.40839 44.83800
Exited vs Tenure
boxPlot_Tenure = ggplot(df_bank, aes(x = Exited, y = Tenure, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_TenureThe customer who are existed vs who are not, does show not significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: Tenure by Exited
## t = 1.3843, df = 3113.8, p-value = 0.1664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04185994 0.24292931
## sample estimates:
## mean in group 0 mean in group 1
## 5.033279 4.932744
The test shows boxTenure and Existed are not significant.
Exited vs Balance
boxPlot_bl = ggplot(df_bank, aes(x = Exited, y = Balance, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_blThe customer who are existed vs who are not, does show some significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: Balance by Exited
## t = -12.471, df = 3347.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -21250.22 -15476.26
## sample estimates:
## mean in group 0 mean in group 1
## 72745.30 91108.54
The test shows Balance and Existed are significant.
Exited vs NumOfProducts
boxPlot_np = ggplot(df_bank, aes(x = Exited, y = NumOfProducts, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_npThe customer who are existed vs NumOfProducts are not significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: NumOfProducts by Exited
## t = 3.702, df = 2472, p-value = 0.0002186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03247871 0.10563848
## sample estimates:
## mean in group 0 mean in group 1
## 1.544267 1.475209
The test shows boxTenure and Existed are significant.
Exited vs EstimatedSalary
boxPlot_es = ggplot(df_bank, aes(x = Exited, y = EstimatedSalary, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_esThe customer who are existed vs who are not, does not show much significant difference between their mean.
##
## Welch Two Sample t-test
##
## data: EstimatedSalary by Exited
## t = -1.2034, df = 3137.4, p-value = 0.2289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4541.656 1087.085
## sample estimates:
## mean in group 0 mean in group 1
## 99738.39 101465.68
The test shows EstimatedSalary and Existed customer are not significant.
Exited vs HasCrCard
table_HasCrCard <- table(df_bank$HasCrCard, df_bank$Exited)
mosaicplot(table_HasCrCard, ylab= "Exited", xlab="HasCrCard", main = "Exited vs HasCrCard", shade=TRUE)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_HasCrCard
## X-squared = 0.47134, df = 1, p-value = 0.4924
Due to high p-value, Exited vs HasCrCard are not significant.
Exited vs IsActiveMember
table_IsActiveMember <- table(df_bank$IsActiveMember, df_bank$Exited)
mosaicplot(table_IsActiveMember, ylab= "Exited ", xlab="IsActiveMember", main = "Exited vs IsActiveMember", shade=TRUE)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_IsActiveMember
## X-squared = 242.99, df = 1, p-value < 2.2e-16
Due to very low p-value, Exited vs IsActiveMember are significant.
Exited vs Geography
table_Geography <- table(df_bank$Geography, df_bank$Exited)
mosaicplot(table_Geography, ylab= "Exited", xlab="Geography", main = "Exited vs Geography", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_Geography
## X-squared = 301.26, df = 2, p-value < 2.2e-16
Due to very low p-value, Exited vs Geography are significant.
Exited vs Gender
table_Gender <- table(df_bank$Gender, df_bank$Exited)
mosaicplot(table_Gender, ylab= "Exited ", xlab="Gender", main = "Exited vs Gender", shade=TRUE)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_Gender
## X-squared = 112.92, df = 1, p-value < 2.2e-16
Due to very low p-value, Exited vs Gender are significant.
3. Set the random seed, set.seed(123), and split the data to train/test. Use 80% of samplesfor training and the remaining 20% for testing. You can use “sample” (as we did in slide 37 ofweek 6 lecture) or alternatively, you can use “createDataPartition” method from caret package.
# set the seed
set.seed(123)
split <- createDataPartition(df_bank$Exited, p = 0.8, list = FALSE)
# created train and test set
training_set <- df_bank[split, ]
test_set <- df_bank[-split, ]
dim(training_set)## [1] 8001 11
## [1] 1999 11
4. Train a logistic regression model on the train data using the glm package and use it topredict “Exited” for the test data. Note: As explained in the lectures, “predict” method will returnpredicted probabilities. To convert them to labels, you need to use some threshold ( typically setas 50%) and if the predicted probability is greater than 50% you predict label “1” for Exited;otherwise predict label “0” ( please review the example in lecture 7.2).
##
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2776 -0.6601 -0.4622 -0.2741 2.9777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.577e+00 2.723e-01 -13.138 < 2e-16 ***
## CreditScore -4.089e-04 3.120e-04 -1.311 0.1900
## GeographyGermany 7.609e-01 7.535e-02 10.099 < 2e-16 ***
## GeographySpain 4.787e-02 7.882e-02 0.607 0.5436
## GenderMale -5.049e-01 6.072e-02 -8.315 < 2e-16 ***
## Age 7.125e-02 2.845e-03 25.044 < 2e-16 ***
## Tenure -1.952e-02 1.043e-02 -1.871 0.0613 .
## Balance 2.711e-06 5.718e-07 4.742 2.12e-06 ***
## NumOfProducts -5.851e-02 5.173e-02 -1.131 0.2581
## HasCrCard -5.177e-02 6.579e-02 -0.787 0.4313
## IsActiveMember -1.076e+00 6.436e-02 -16.719 < 2e-16 ***
## EstimatedSalary 6.310e-07 5.262e-07 1.199 0.2305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8089.4 on 8000 degrees of freedom
## Residual deviance: 6891.9 on 7989 degrees of freedom
## AIC: 6915.9
##
## Number of Fisher Scoring iterations: 5
## 1 5 11 14 16 19 20
## 0.11929779 0.16135817 0.11961030 0.10768453 0.23333315 0.24169327 0.02955896
## 23 25 30
## 0.24164741 0.08463062 0.03794250
## 1 5 11 14 16 19 20 23 25 30
## 0 0 0 0 0 0 0 0 0 0
5. Get the cross table between the predicted labels and true labels in the test data andcompute total_error, false positive rate, and false negative rate.
Confusion Matrix
## actual.label
## predicted.label 0 1
## 0 1548 326
## 1 44 81
Error
## Total Error = 0.1850925
Accuracy
missing_classerr <- mean(predicted.label != test_set$Exited)
print(paste('Accuracy =', 1 - missing_classerr))## [1] "Accuracy = 0.814907453726863"
False Positive Rate fpr = fp / (fp + tn)
## [1] "False Positive Rate = 0.0276381909547739"
False Negative Rate fnr = fn / (fn + tp)
## [1] "False Negative Rate = 0.800982800982801"
6. The target variable “Exited” is severely imbalanced; the number of customers who stayed withthe bank is almost four times the number of customer who exited the bank. Most classificationmodels trained on imbalanced data are biased towards predicting the majority class ( Exited=0 in thiscase) and yield a higher classification error on the minority class (Exited=1).
One way to deal with class imbalance problem is to down-sample the majority class; meaningrandomly sample the observations in the majority class to make it the same size as the minority class.
The downside of this approach is that for smaller datasets, removing data will result in significant loss of information and lower performance. In Module 12, we will learn about other techniques to deal with data imbalance without removing information, but for this assignment, you are to use down-sampling in an attempt to address data imbalance and increase model performance in predicting customers who will exit the bank.
Note: Down-sampling should only be done on the training data and the test data should have the original imbalance distribution:
1. Divide your training data into two sets, customers who exited and customer who did not exit.
2. Sample the non-exiting customers set such that you have the same number of exiting andnon-exiting customers You can use “sample” from the base package to sample the rows or alternatively,you can use the method “sample_n” from dplyr package to directly sample thedataframe
3. Combine the exiting and non-exiting customers into one dataframe
4. Re-train the logistic regression model on the down-sampled training data and evaluate it onthe test data. Compare the total error, false positive rate, and false negative rate with th eprevious model. Which model does better at predicting exiting customers?
##
## 0 1
## 7963 2037
The data is not balanced. Let’s do the down sample.
set.seed(1)
smpl1 <- df_bank %>% filter(df_bank$Exited == 0) %>% sample_n(size = 2037)
smpl2 <- df_bank %>% filter(df_bank$Exited == 1)
smpl_2037 <- rbind(smpl1, smpl2)
dim(smpl_2037)## [1] 4074 11
##
## 0 1
## 2037 2037
Now the data is balanced.
Randomize the order of the rows in the dataset.
set.seed(10)
split_2 <- createDataPartition(smpl_2037$Exited, p = 0.8, list = FALSE)
# created train and test set
smpl_training_set <- smpl_2037[split_2, ]
smpl_test_set <- smpl_2037[-split_2, ]
dim(smpl_training_set)## [1] 3260 11
## [1] 814 11
##
## 0 1
## 407 407
logistic_model_2 = glm(Exited~., data = smpl_training_set, family = "binomial" )
summary(logistic_model_2)##
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = smpl_training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.58690 -0.93990 -0.04068 0.95696 2.50148
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.312e+00 3.589e-01 -6.440 1.20e-10 ***
## CreditScore -7.564e-04 4.040e-04 -1.872 0.061161 .
## GeographyGermany 8.520e-01 1.002e-01 8.506 < 2e-16 ***
## GeographySpain 1.452e-01 1.006e-01 1.443 0.148945
## GenderMale -5.759e-01 7.935e-02 -7.258 3.92e-13 ***
## Age 7.569e-02 4.127e-03 18.339 < 2e-16 ***
## Tenure -1.404e-02 1.359e-02 -1.033 0.301631
## Balance 2.456e-06 7.112e-07 3.453 0.000555 ***
## NumOfProducts -5.011e-02 5.943e-02 -0.843 0.399140
## HasCrCard 7.308e-02 8.600e-02 0.850 0.395497
## IsActiveMember -9.439e-01 8.047e-02 -11.730 < 2e-16 ***
## EstimatedSalary 3.611e-07 6.926e-07 0.521 0.602110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4519.3 on 3259 degrees of freedom
## Residual deviance: 3756.6 on 3248 degrees of freedom
## AIC: 3780.6
##
## Number of Fisher Scoring iterations: 4
## 2787 3324 1748 355 798 3058 686 115
## 0.4994796 0.4096367 0.4739061 0.1725451 0.2295362 0.8195343 0.3815372 0.1953250
## 3878 2707
## 0.7150617 0.5699479
## 2787 3324 1748 355 798 3058 686 115 3878 2707
## 0 0 0 0 0 1 0 0 1 1
Confusion Matrix
## actual.label_2
## predicted.label_2 0 1
## 0 287 111
## 1 120 296
Error
## Total Error = 0.2837838
False Positive Rate
## [1] "False Positive Rate = 0.294840294840295"
False Negative Rate
## [1] "False Negative Rate = 0.272727272727273"
Compair both the models
tabularview_2 <- data.frame("Matrices" = c("Total Error", "False Positive Rate", "False Negative Rate"),
"Model 1" = c(error, fpr, fnr),
"Model 2" = c(error_2, fpr_2, fnr_2)
)
kableExtra::kable(tabularview_2) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)| Matrices | Model.1 | Model.2 |
|---|---|---|
| Total Error | 0.1850925 | 0.2837838 |
| False Positive Rate | 0.0276382 | 0.2948403 |
| False Negative Rate | 0.8009828 | 0.2727273 |
Note: Above table shows:
Model 1 represents the Logistic Regression
Model 2 represents the Down sample Logistic regression
We see Model 1 has lower Total Error rate and False Positive rate than Model 2. Model 2 has lower False Nagative rate than Model 1.
7. Repeat steps 4,5,6 above but this time, use a C5.0 decision tree model to predict “Exited”. (usetrials=30 for boosting multiple decision trees (see an example in slide 44, module 6) Compare the logistic regression model with the boosted C5.0 model.
##
## Call:
## C5.0.default(x = training_set[-11], y = training_set$Exited, trials = 10)
##
## Classification Tree
## Number of samples: 8001
## Number of predictors: 10
##
## Number of boosting iterations: 10
## Average tree size: 26.4
##
## Non-standard options: attempt to group attributes
## [1] 0 0 0 0 0 1 0 0 0 0
## Levels: 0 1
Confusion Matrix
## actual_3
## prediction_3 0 1
## 0 1529 220
## 1 63 187
Error
## Total Error = 0.1415708
False Positive Rate
## [1] "False Positive Rate = 0.039572864321608"
False Negative Rate
## [1] "False Negative Rate = 0.540540540540541"
C5.0 decision tree model with down sample data
##
## Call:
## C5.0.default(x = smpl_training_set[-11], y = smpl_training_set$Exited, trials
## = 10)
##
## Classification Tree
## Number of samples: 3260
## Number of predictors: 10
##
## Number of boosting iterations: 10
## Average tree size: 18.5
##
## Non-standard options: attempt to group attributes
## [1] 0 0 0 0 0 1 0 0 1 1
## Levels: 0 1
Confusion Matrix
## actual_4
## prediction_4 0 1
## 0 334 97
## 1 73 310
Error
## Total Error = 0.2088452
False Positive Rate
## [1] "False Positive Rate = 0.179361179361179"
False Negative Rate
## [1] "False Negative Rate = 0.238329238329238"
Compare the logistic regression model with the boosted C5.0 model:
tabularview_3 <- data.frame("Matrices" = c("Total Error", "False Positive Rate", "False Negative Rate"),
"Model 1" = c(error, fpr, fnr),
"Model 2" = c(error_2, fpr_2, fnr_2),
"Model 3" = c(error_3, fpr_3, fnr_3),
"Model 4" = c(error_4, fpr_4, fnr_4)
)
kableExtra::kable(tabularview_3) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)| Matrices | Model.1 | Model.2 | Model.3 | Model.4 |
|---|---|---|---|---|
| Total Error | 0.1850925 | 0.2837838 | 0.1415708 | 0.2088452 |
| False Positive Rate | 0.0276382 | 0.2948403 | 0.0395729 | 0.1793612 |
| False Negative Rate | 0.8009828 | 0.2727273 | 0.5405405 | 0.2383292 |
Note :
Model 1 represents the Logistic Regression
Model 2 represents the Down sample Logistic regression
Model 3 represents C5.0 decision tree regression
Model 4 represents Down sample C5.0 decision tree regression
Above table shows, Model 3 has lower Error rate than other models. Model 1 has lower False Positive Rate than other models. Model 4 has lower False Nagative Rat ethan other models.
Lower the error, better the model performnace. So Model 3 i.e C5.0 decision tree regression model performs well for predicting existing customers.