Assignment 3

# Load libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(leaps)
library(tidyverse)
library(psych)
library(rpart)
library(kableExtra)
library(C50)

Problem 1: Predicting number of college applications

Given the college dataset attached to this assignment, the goal of this problem is to predict the number of applications received (“Apps” variable) using the other variables in the dataset. The variables are:

•X: Name of the college

•Private : Public/private indicator

•Apps : Number of applications received

•Accept : Number of applicants accepted

•Enroll : Number of new students enrolled

•Top10perc : Proportion of New students from top 10% of high school class

•Top25perc : Proportion of New students from top 25% of high school class

•F.Undergrad : Number of full-time undergraduates

•P.Undergrad : Number of part-time undergraduates

•Outstate : Out-of-state tuition

•Room.Board : Room and board costs

•Books : Estimated book costs

•Personal : Estimated personal spending

•PhD : Percent of faculty with Ph.D.’s

•Terminal : Percent of faculty with terminal degree

•S.F.Ratio : Student/faculty ratio

•perc.alumni : Percent of alumni who donate

•Expend : Instructional expenditure per student

•Grad.Rate : Graduation rate

1. Download the dataset college.csv and explore its overall structure. Get a summary statistics of each variable. Answer the following questions:

How many observations do you have in the data?

How many categorical and numeric variables you have in your data?

Is there any missing value?

# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/College.csv", header = T, sep = ",", na.strings = "?", strip.white=TRUE)

summary(df)

##       X               Private               Apps           Accept     
##  Length:777         Length:777         Min.   :   81   Min.   :   72  
##  Class :character   Class :character   1st Qu.:  776   1st Qu.:  604  
##  Mode  :character   Mode  :character   Median : 1558   Median : 1110  
##                                        Mean   : 3002   Mean   : 2019  
##                                        3rd Qu.: 3624   3rd Qu.: 2424  
##                                        Max.   :48094   Max.   :26330  
##      Enroll       Top10perc       Top25perc      F.Undergrad   
##  Min.   :  35   Min.   : 1.00   Min.   :  9.0   Min.   :  139  
##  1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992  
##  Median : 434   Median :23.00   Median : 54.0   Median : 1707  
##  Mean   : 780   Mean   :27.56   Mean   : 55.8   Mean   : 3700  
##  3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005  
##  Max.   :6392   Max.   :96.00   Max.   :100.0   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0  
##  Median :  353.0   Median : 9990   Median :4200   Median : 500.0  
##  Mean   :  855.3   Mean   :10441   Mean   :4358   Mean   : 549.4  
##  3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD            Terminal       S.F.Ratio    
##  Min.   : 250   Min.   :  8.00   Min.   : 24.0   Min.   : 2.50  
##  1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50  
##  Median :1200   Median : 75.00   Median : 82.0   Median :13.60  
##  Mean   :1341   Mean   : 72.66   Mean   : 79.7   Mean   :14.09  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50  
##  Max.   :6800   Max.   :103.00   Max.   :100.0   Max.   :39.80  
##   perc.alumni        Expend        Grad.Rate     
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median :21.00   Median : 8377   Median : 65.00  
##  Mean   :22.74   Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :64.00   Max.   :56233   Max.   :118.00

dim(df)

## [1] 777  19

There are 777 observations in the dataset.

str(df)

## 'data.frame':    777 obs. of  19 variables:
##  $ X          : chr  "Abilene Christian University" "Adelphi University" "Adrian College" "Agnes Scott College" ...
##  $ Private    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...

# Count the number of categorical and numeric variables
categorical_variable <- sapply(df, is.character)
numerical_variables <- sapply(df, is.numeric)
count_categorical <- sum(categorical_variable)
count_numerical <- sum(numerical_variables)
cat("categorical variables: ",count_categorical,"\n")

## categorical variables:  2

cat("numerical variables: ", count_numerical)

## numerical variables:  17

colSums(is.na(df))

##           X     Private        Apps      Accept      Enroll   Top10perc 
##           0           0           0           0           0           0 
##   Top25perc F.Undergrad P.Undergrad    Outstate  Room.Board       Books 
##           0           0           0           0           0           0 
##    Personal         PhD    Terminal   S.F.Ratio perc.alumni      Expend 
##           0           0           0           0           0           0 
##   Grad.Rate 
##           0

No missing value in the dataset.

2. Remove the first column (the name of the college)

# Remove Columns by Index
df2 <- df[,-1]
str(df2)

## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...

3. Which variables are associated with “Apps”? Use appropriate plots and statistics test to answer this question.

numeric_values <- df2 %>% 
  dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>% 
  data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt', number.cex = 0.5)

pairs.panels(df2[c("Apps", "Accept", "Enroll", "Top10perc","Top25perc")])

pairs.panels(df2[c("Apps","F.Undergrad","P.Undergrad","Outstate","Room.Board")])

pairs.panels(df2[c("Apps","Books","Personal","PhD","Terminal")])

pairs.panels(df2[c("Apps","S.F.Ratio","perc.alumni","Expend","Grad.Rate")])

Variables Accept, Enroll and F.Undergrad are highly positively associate with Apps and perc.alumni negatively associate with Apps.

Apps vs Private

# Create the boxplot
boxPlot_private = ggplot(df2, aes(x = Private, y = Apps, color = Private)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_private

# t-test between Private and Apps
t.test(Apps~Private,alternative="two.sided", data=df2)

## 
##  Welch Two Sample t-test
## 
## data:  Apps by Private
## t = 9.7985, df = 244.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2997.758 4506.223
## sample estimates:
##  mean in group No mean in group Yes 
##          5729.920          1977.929

Here p-value is less than 0.05, that is, we can conclude that there is a statistically significant difference between the means of the numeric variable for each level of the categorical variable. The Apps is associated with Types of college.

4. plot the histogram of the number of applications “Apps” variable. Explain what the histogram shows?

# Draw a histogram
ggplot(df2, aes(Apps)) +
  geom_histogram(fill = "blue", alpha = 0.5, bins = 50) +
  labs(x = "Applications", y = "Frequency") +
  ggtitle("Histogram of Number of applications received")

The histogram is right skewed. The applications having high frequancy between 1000 to 5000. We see an outlier where the number of applications between 45,000 to 50,000.

5.(0.5 pt) Split the data into train and test ( you can use the first 621 rows for training and rest for testing).

df_train<-df2[1:621, ]
df_test<-df2[622:dim(df2)[1], ]

dim(df_train)

## [1] 621  18

dim(df_test)

## [1] 156  18

6. set the random seed: set.seed(123)

set.seed(123)

7. Use caret package to run 10 fold cross validation using linear regression method on the train data . Print the resulting model to see the cross validation RMSE. In addition, take a summary of the model and interpret the coefficients. Which coefficients are statistically different from zero? What does this meant?

train.control= trainControl(method = "cv", number = 10)
cv_model_1 = train(Apps ~ .,data = df_train,method= "lm",trControl= train.control)

Above built the model using 10 fold cross validation and model used train data.

print(cv_model_1)

## Linear Regression 
## 
## 621 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 559, 558, 558, 560, 558, 559, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   1095.186  0.915507  608.9292
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Rsquared is 0.915507, the model explains 0.915507 (91.5%) of the variance of Apps, and the remaining 0.084493 (8.5%) is unexplained.

summary(cv_model_1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5249.5  -379.4     2.6   283.3  7372.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -674.81356  449.81107  -1.500  0.13408    
## PrivateYes  -492.49111  159.85957  -3.081  0.00216 ** 
## Accept         1.68045    0.04429  37.944  < 2e-16 ***
## Enroll        -1.18227    0.20477  -5.774 1.24e-08 ***
## Top10perc     43.67102    6.13052   7.124 3.01e-12 ***
## Top25perc    -11.60268    4.94035  -2.349  0.01917 *  
## F.Undergrad    0.07072    0.03604   1.962  0.05017 .  
## P.Undergrad    0.02374    0.04871   0.487  0.62620    
## Outstate      -0.08905    0.02099  -4.242 2.56e-05 ***
## Room.Board     0.14926    0.05276   2.829  0.00482 ** 
## Books          0.05611    0.25857   0.217  0.82828    
## Personal       0.03970    0.07157   0.555  0.57932    
## PhD           -8.61642    5.02397  -1.715  0.08685 .  
## Terminal      -2.45096    5.49433  -0.446  0.65569    
## S.F.Ratio     29.32492   14.57724   2.012  0.04470 *  
## perc.alumni    1.14958    4.57727   0.251  0.80178    
## Expend         0.09614    0.01460   6.585 9.89e-11 ***
## Grad.Rate      6.10467    3.21809   1.897  0.05831 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1032 on 603 degrees of freedom
## Multiple R-squared:  0.9279, Adjusted R-squared:  0.9258 
## F-statistic: 456.3 on 17 and 603 DF,  p-value: < 2.2e-16

The t-statistics and p-values shows how likely true coefficient is zero. The summary shows, PrivateYes, Accept, Enroll, Top10perc, Outstate, Room.Board, and Expend are more significant due to lower P-value. Top25perc and S.F.Ratio are significant. P.Undergrad, Books, Personal, Terminal, PHD,and perc.alumni are not significant due to higher p-value.

8. Compute RMSE of the model on the test data. You can call “predict” function and pass to it the model (returned by caret train method) and the test data. Then compute RMSE of the predictions returned by the “predict” method.

result <- predict(cv_model_1, df_test)
test_rmse_1 <- RMSE(result, df_test$Apps)
cat("RMSE of the stepwise model on the test data is",test_rmse_1)

## RMSE of the stepwise model on the test data is 1115.375

9. Set the random seed again. We need to do this before each training to ensure we get the same folds in cross validation. Set.seed(123) so we can compare the models using their cross validation RMSE.

set.seed(123)

11. Use caret and leap packages to run a 10 fold cross validation using step wise linear regression method with backward selection on the train data. The train method by default uses maximum of 4 predictors and reports the best models with 1..4 predictors. We need to change this parameter to consider all predictors. So inside your train function, add the following parameter tuneGrid =data.frame(nvmax = 1:n), where n is the number of variables you use to predict “Apps” . Which model (with how many variables or nvmax ) has the lowest cross validation RMSE? Take the summary of the final model, which variables are selected in the model with the lowest RMSE?

train_control_2= trainControl(method = "cv", number = 10)
cv_model_2 <- train(Apps ~., data = df_train,method= "leapBackward",
                     tuneGrid = data.frame(nvmax = 1:4), trControl= train_control_2)

print(cv_model_2)

## Linear Regression with Backwards Selection 
## 
## 621 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 559, 558, 558, 560, 558, 559, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   1      1221.693  0.8853502  581.6962
##   2      1119.005  0.9083869  638.5812
##   3      1137.909  0.9038610  636.0471
##   4      1143.690  0.9033961  631.8455
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 2.

cat("The best one predictor model had an RMSE=", cv_model_2$results$RMSE[1],"\n")

## The best one predictor model had an RMSE= 1221.693

cat("The best two predictor model had an RMSE=", cv_model_2$results$RMSE[2],"\n")

## The best two predictor model had an RMSE= 1119.005

cat("The best three predictor model had an RMSE=", cv_model_2$results$RMSE[3],"\n")

## The best three predictor model had an RMSE= 1137.909

cat("The best four predictor model had an RMSE=", cv_model_2$results$RMSE[4],"\n")

## The best four predictor model had an RMSE= 1143.69

RMSE was used to select the optimal model using the smallest value. The final value used for the model was nvmax= 2.

cv_model_2$bestTune

##   nvmax
## 2     2

summary(cv_model_2$finalModel)

## Subset selection object
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: backward
##          PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 ) " "        "*"    " "    " "       " "       " "         " "        
## 2  ( 1 ) " "        "*"    " "    "*"       " "       " "         " "        
##          Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 1  ( 1 ) " "      " "        " "   " "      " " " "      " "       " "        
## 2  ( 1 ) " "      " "        " "   " "      " " " "      " "       " "        
##          Expend Grad.Rate
## 1  ( 1 ) " "    " "      
## 2  ( 1 ) " "    " "

The best one variable model only contains “Accept” variable. The best two variables model contains Top10perc and Accept variables.

12. Compute the RMSE of the stepwise model on the test data.

result_2 <- predict(cv_model_2, df_test)
test_rmse_2 <- RMSE(result_2, df_test$Apps)
cat("RMSE of the stepwise model on the test data is",test_rmse_2)

## RMSE of the stepwise model on the test data is 1112.527

13. use “rpart” function to create a regression tree model from the train data. Get the predictions on test data and compute the RMSE.

set.seed(123)

tree_model_3 <- rpart::rpart(Apps ~ ., data = df_train)
tree_model_3

## n= 621 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 621 8902853000  2729.8080  
##    2) Accept< 2877.5 515  977531000  1491.7590  
##      4) Accept< 1398.5 381  118534600   916.2782 *
##      5) Accept>=1398.5 134  374054800  3128.0150  
##       10) Top10perc< 72.5 127   93741430  2864.5830 *
##       11) Top10perc>=72.5 7  111600500  7907.4290 *
##    3) Accept>=2877.5 106 3300780000  8744.8580  
##      6) Accept< 8061.5 89  588818700  7078.2250  
##       12) Accept< 6013 73  334578600  6374.2600  
##         24) Top25perc< 88.5 62  139193600  5857.6450 *
##         25) Top25perc>=88.5 11   85571440  9286.0910 *
##       13) Accept>=6013 16   53009390 10290.0600 *
##      7) Accept>=8061.5 17 1170519000 17470.1800 *

result_3 <- predict(tree_model_3, df_test)
test_rmse_3 <- RMSE(result_3, df_test$Apps)
cat("RMSE of the tree model on the test data is",test_rmse_3)

## RMSE of the tree model on the test data is 1441.396

14. Compare the RMSE on the test data for linear regression, stepwise regression, and the regression tree.

tabularview <- data.frame("Models" = c("Linear Regression", "Stepwise Regression","Regression Tree"),
                          "RMSE" = c(test_rmse_1, test_rmse_2, test_rmse_3))

kableExtra::kable(tabularview) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)

Models	RMSE
Linear Regression	1115.375
Stepwise Regression	1112.527
Regression Tree	1441.396

Problem2—Predicting Customer Churn using Logistic Regression and Decision Trees

For this problem, you will use customer churn modeling dataset from this Kaggle project (https://www.kaggle.com/shrutimechlearn/churn-modelling ).

The dataset contains details of a bank’s customers. The target variable we want to predict is: “Exited”: a binary variable reflecting whether the customer left the bank (closed his account, Exited=1) or she/he continued to be a customer (Exited=0).

1. Load the dataset. Examine its structure and remove the first three variables(RowNumber,CustomerId, and Surname). These variables are unique for each sample hence, they arenot useful for prediction. Convert all string variables to factors

# Load data
df_bank <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/Churn_Modelling.csv", header = T, sep = ",", na.strings = "?", strip.white=TRUE)

# Structure of data
str(df_bank)

## 'data.frame':    10000 obs. of  14 variables:
##  $ RowNumber      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ Surname        : chr  "Hargrave" "Hill" "Onio" "Boni" ...
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

# Remove Columns by Index
df_bank <- df_bank[,c(-1,-2,-3)]

# Structure of data
str(df_bank)

## 'data.frame':    10000 obs. of  11 variables:
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

#convert string to factor
df_bank$Geography <- as.factor(df_bank$Geography) 
df_bank$Gender <- as.factor(df_bank$Gender)

# Structure of data
str(df_bank)

## 'data.frame':    10000 obs. of  11 variables:
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
##  $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

2. Use appropriate plots and statistical tests to find which variables are associated with “Exited”.Remove variables not associated with “Exited”

Convert Exited to the factor.

df_bank$Exited <- as.factor(df_bank$Exited)

CreditScore vs Exited

boxPlot_cs = ggplot(df_bank, aes(x = Exited, y = CreditScore, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_cs

The customer who are existed vs who are not, do not see significant difference between their mean.

# t-test between Private and Apps
t.test(CreditScore~Exited,alternative="two.sided", data=df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  CreditScore by Exited
## t = 2.6347, df = 3050.9, p-value = 0.008465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   1.663067 11.340331
## sample estimates:
## mean in group 0 mean in group 1 
##        651.8532        645.3515

Age vs Exited

boxPlot_age = ggplot(df_bank, aes(x = Exited, y = Age, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_age

The customer who are existed vs who are not, does show significant difference between their mean.

# t-test between Private and Apps
t.test(Age~Exited,alternative="two.sided", df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  Age by Exited
## t = -30.419, df = 3248.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.908490 -6.950727
## sample estimates:
## mean in group 0 mean in group 1 
##        37.40839        44.83800

Exited vs Tenure

boxPlot_Tenure = ggplot(df_bank, aes(x = Exited, y = Tenure, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_Tenure

The customer who are existed vs who are not, does show not significant difference between their mean.

# t-test between Private and Apps
t.test(Tenure~Exited,alternative="two.sided", df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  Tenure by Exited
## t = 1.3843, df = 3113.8, p-value = 0.1664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04185994  0.24292931
## sample estimates:
## mean in group 0 mean in group 1 
##        5.033279        4.932744

The test shows boxTenure and Existed are not significant.

Exited vs Balance

boxPlot_bl = ggplot(df_bank, aes(x = Exited, y = Balance, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_bl

The customer who are existed vs who are not, does show some significant difference between their mean.

# t-test between Private and Apps
t.test(Balance~Exited,alternative="two.sided", df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  Balance by Exited
## t = -12.471, df = 3347.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -21250.22 -15476.26
## sample estimates:
## mean in group 0 mean in group 1 
##        72745.30        91108.54

The test shows Balance and Existed are significant.

Exited vs NumOfProducts

boxPlot_np = ggplot(df_bank, aes(x = Exited, y = NumOfProducts, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_np

The customer who are existed vs NumOfProducts are not significant difference between their mean.

# t-test between Private and Apps
t.test(NumOfProducts~Exited,alternative="two.sided", df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  NumOfProducts by Exited
## t = 3.702, df = 2472, p-value = 0.0002186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03247871 0.10563848
## sample estimates:
## mean in group 0 mean in group 1 
##        1.544267        1.475209

The test shows boxTenure and Existed are significant.

Exited vs EstimatedSalary

boxPlot_es = ggplot(df_bank, aes(x = Exited, y = EstimatedSalary, color = Exited)) + geom_boxplot(outlier.colour = "blue",outlier.size = 2) + theme_classic()
boxPlot_es

The customer who are existed vs who are not, does not show much significant difference between their mean.

# t-test between Private and Apps
t.test(EstimatedSalary~Exited,alternative="two.sided", df_bank)

## 
##  Welch Two Sample t-test
## 
## data:  EstimatedSalary by Exited
## t = -1.2034, df = 3137.4, p-value = 0.2289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4541.656  1087.085
## sample estimates:
## mean in group 0 mean in group 1 
##        99738.39       101465.68

The test shows EstimatedSalary and Existed customer are not significant.

Exited vs HasCrCard

table_HasCrCard <- table(df_bank$HasCrCard, df_bank$Exited)
mosaicplot(table_HasCrCard, ylab= "Exited", xlab="HasCrCard", main = "Exited vs HasCrCard", shade=TRUE)

chisq.test(table_HasCrCard)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_HasCrCard
## X-squared = 0.47134, df = 1, p-value = 0.4924

Due to high p-value, Exited vs HasCrCard are not significant.

Exited vs IsActiveMember

table_IsActiveMember <- table(df_bank$IsActiveMember, df_bank$Exited)
mosaicplot(table_IsActiveMember, ylab= "Exited ", xlab="IsActiveMember", main = "Exited vs IsActiveMember", shade=TRUE)

chisq.test(table_IsActiveMember)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_IsActiveMember
## X-squared = 242.99, df = 1, p-value < 2.2e-16

Due to very low p-value, Exited vs IsActiveMember are significant.

Exited vs Geography

table_Geography <- table(df_bank$Geography, df_bank$Exited)
mosaicplot(table_Geography, ylab= "Exited", xlab="Geography", main = "Exited vs Geography", shade=TRUE)

chisq.test(table_Geography)

## 
##  Pearson's Chi-squared test
## 
## data:  table_Geography
## X-squared = 301.26, df = 2, p-value < 2.2e-16

Due to very low p-value, Exited vs Geography are significant.

Exited vs Gender

table_Gender <- table(df_bank$Gender, df_bank$Exited)
mosaicplot(table_Gender, ylab= "Exited ", xlab="Gender", main = "Exited vs Gender", shade=TRUE)

chisq.test(table_Gender)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_Gender
## X-squared = 112.92, df = 1, p-value < 2.2e-16

Due to very low p-value, Exited vs Gender are significant.

3. Set the random seed, set.seed(123), and split the data to train/test. Use 80% of samplesfor training and the remaining 20% for testing. You can use “sample” (as we did in slide 37 ofweek 6 lecture) or alternatively, you can use “createDataPartition” method from caret package.

# set the seed
set.seed(123)
split <- createDataPartition(df_bank$Exited, p = 0.8, list = FALSE)
# created train and test set
training_set <- df_bank[split, ]
test_set <- df_bank[-split, ]

dim(training_set)

## [1] 8001   11

dim(test_set)

## [1] 1999   11

4. Train a logistic regression model on the train data using the glm package and use it topredict “Exited” for the test data. Note: As explained in the lectures, “predict” method will returnpredicted probabilities. To convert them to labels, you need to use some threshold ( typically setas 50%) and if the predicted probability is greater than 50% you predict label “1” for Exited;otherwise predict label “0” ( please review the example in lecture 7.2).

logistic_model = glm(Exited~., data = training_set, family = "binomial" )
summary(logistic_model)

## 
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2776  -0.6601  -0.4622  -0.2741   2.9777  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.577e+00  2.723e-01 -13.138  < 2e-16 ***
## CreditScore      -4.089e-04  3.120e-04  -1.311   0.1900    
## GeographyGermany  7.609e-01  7.535e-02  10.099  < 2e-16 ***
## GeographySpain    4.787e-02  7.882e-02   0.607   0.5436    
## GenderMale       -5.049e-01  6.072e-02  -8.315  < 2e-16 ***
## Age               7.125e-02  2.845e-03  25.044  < 2e-16 ***
## Tenure           -1.952e-02  1.043e-02  -1.871   0.0613 .  
## Balance           2.711e-06  5.718e-07   4.742 2.12e-06 ***
## NumOfProducts    -5.851e-02  5.173e-02  -1.131   0.2581    
## HasCrCard        -5.177e-02  6.579e-02  -0.787   0.4313    
## IsActiveMember   -1.076e+00  6.436e-02 -16.719  < 2e-16 ***
## EstimatedSalary   6.310e-07  5.262e-07   1.199   0.2305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8089.4  on 8000  degrees of freedom
## Residual deviance: 6891.9  on 7989  degrees of freedom
## AIC: 6915.9
## 
## Number of Fisher Scoring iterations: 5

prediction_1 = predict(logistic_model, test_set, type = "response")
head(prediction_1,10)

##          1          5         11         14         16         19         20 
## 0.11929779 0.16135817 0.11961030 0.10768453 0.23333315 0.24169327 0.02955896 
##         23         25         30 
## 0.24164741 0.08463062 0.03794250

# to convert lebel
predicted.label = ifelse(prediction_1 > 0.5, 1,0)
head(predicted.label,10)

##  1  5 11 14 16 19 20 23 25 30 
##  0  0  0  0  0  0  0  0  0  0

5. Get the cross table between the predicted labels and true labels in the test data andcompute total_error, false positive rate, and false negative rate.

Confusion Matrix

actual.label = test_set$Exited
t = table(predicted.label, actual.label)
print(t)

##                actual.label
## predicted.label    0    1
##               0 1548  326
##               1   44   81

Error

error = (t[1,2] + t[2,1])/sum(t)
cat("Total Error = ", error)

## Total Error =  0.1850925

Accuracy

missing_classerr <- mean(predicted.label != test_set$Exited)
print(paste('Accuracy =', 1 - missing_classerr))

## [1] "Accuracy = 0.814907453726863"

#t[1,1] = 'TN'
#t[1,2] = 'FN'
#t[2,1] = 'FP'
#t[2,2] = 'TP'

False Positive Rate fpr = fp / (fp + tn)

fpr = t[2,1]/(t[2,1] + t[1,1])
print(paste('False Positive Rate =', fpr))

## [1] "False Positive Rate = 0.0276381909547739"

False Negative Rate fnr = fn / (fn + tp)

fnr = t[1,2]/(t[1,2] + t[2,2])
print(paste('False Negative Rate =', fnr))

## [1] "False Negative Rate = 0.800982800982801"

6. The target variable “Exited” is severely imbalanced; the number of customers who stayed withthe bank is almost four times the number of customer who exited the bank. Most classificationmodels trained on imbalanced data are biased towards predicting the majority class ( Exited=0 in thiscase) and yield a higher classification error on the minority class (Exited=1).

One way to deal with class imbalance problem is to down-sample the majority class; meaningrandomly sample the observations in the majority class to make it the same size as the minority class.

The downside of this approach is that for smaller datasets, removing data will result in significant loss of information and lower performance. In Module 12, we will learn about other techniques to deal with data imbalance without removing information, but for this assignment, you are to use down-sampling in an attempt to address data imbalance and increase model performance in predicting customers who will exit the bank.

Note: Down-sampling should only be done on the training data and the test data should have the original imbalance distribution:

1. Divide your training data into two sets, customers who exited and customer who did not exit.

2. Sample the non-exiting customers set such that you have the same number of exiting andnon-exiting customers You can use “sample” from the base package to sample the rows or alternatively,you can use the method “sample_n” from dplyr package to directly sample thedataframe

3. Combine the exiting and non-exiting customers into one dataframe

4. Re-train the logistic regression model on the down-sampled training data and evaluate it onthe test data. Compare the total error, false positive rate, and false negative rate with th eprevious model. Which model does better at predicting exiting customers?

table(df_bank$Exited)

## 
##    0    1 
## 7963 2037

The data is not balanced. Let’s do the down sample.

set.seed(1)
smpl1 <- df_bank %>% filter(df_bank$Exited == 0) %>% sample_n(size = 2037)
smpl2 <- df_bank %>% filter(df_bank$Exited == 1) 
smpl_2037 <- rbind(smpl1, smpl2)

dim(smpl_2037)

## [1] 4074   11

table(smpl_2037$Exited)

## 
##    0    1 
## 2037 2037

Now the data is balanced.

smpl_2037 = smpl_2037[sample(nrow(smpl_2037),replace =FALSE),]

Randomize the order of the rows in the dataset.

set.seed(10)
split_2 <- createDataPartition(smpl_2037$Exited, p = 0.8, list = FALSE)
# created train and test set
smpl_training_set <- smpl_2037[split_2, ]
smpl_test_set <- smpl_2037[-split_2, ]

dim(smpl_training_set)

## [1] 3260   11

dim(smpl_test_set)

## [1] 814  11

table(smpl_test_set$Exited)

## 
##   0   1 
## 407 407

logistic_model_2 = glm(Exited~., data = smpl_training_set, family = "binomial" )
summary(logistic_model_2)

## 
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = smpl_training_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.58690  -0.93990  -0.04068   0.95696   2.50148  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.312e+00  3.589e-01  -6.440 1.20e-10 ***
## CreditScore      -7.564e-04  4.040e-04  -1.872 0.061161 .  
## GeographyGermany  8.520e-01  1.002e-01   8.506  < 2e-16 ***
## GeographySpain    1.452e-01  1.006e-01   1.443 0.148945    
## GenderMale       -5.759e-01  7.935e-02  -7.258 3.92e-13 ***
## Age               7.569e-02  4.127e-03  18.339  < 2e-16 ***
## Tenure           -1.404e-02  1.359e-02  -1.033 0.301631    
## Balance           2.456e-06  7.112e-07   3.453 0.000555 ***
## NumOfProducts    -5.011e-02  5.943e-02  -0.843 0.399140    
## HasCrCard         7.308e-02  8.600e-02   0.850 0.395497    
## IsActiveMember   -9.439e-01  8.047e-02 -11.730  < 2e-16 ***
## EstimatedSalary   3.611e-07  6.926e-07   0.521 0.602110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4519.3  on 3259  degrees of freedom
## Residual deviance: 3756.6  on 3248  degrees of freedom
## AIC: 3780.6
## 
## Number of Fisher Scoring iterations: 4

prediction_2 = predict(logistic_model_2, smpl_test_set, type = "response")
head(prediction_2,10)

##      2787      3324      1748       355       798      3058       686       115 
## 0.4994796 0.4096367 0.4739061 0.1725451 0.2295362 0.8195343 0.3815372 0.1953250 
##      3878      2707 
## 0.7150617 0.5699479

# to convert lebel
predicted.label_2 = ifelse(prediction_2 > 0.5, 1,0)
head(predicted.label_2,10)

## 2787 3324 1748  355  798 3058  686  115 3878 2707 
##    0    0    0    0    0    1    0    0    1    1

Confusion Matrix

actual.label_2 = smpl_test_set$Exited
t2 = table(predicted.label_2, actual.label_2)
print(t2)

##                  actual.label_2
## predicted.label_2   0   1
##                 0 287 111
##                 1 120 296

Error

error_2 = (t2[1,2] + t2[2,1])/sum(t2)
cat("Total Error = ", error_2)

## Total Error =  0.2837838

False Positive Rate

fpr_2 = t2[2,1]/(t2[2,1] + t2[1,1])
print(paste('False Positive Rate =', fpr_2))

## [1] "False Positive Rate = 0.294840294840295"

False Negative Rate

fnr_2 = t2[1,2]/(t2[1,2] + t2[2,2])
print(paste('False Negative Rate =', fnr_2))

## [1] "False Negative Rate = 0.272727272727273"

Compair both the models

tabularview_2 <- data.frame("Matrices" = c("Total Error", "False Positive Rate", "False Negative Rate"),
                          "Model 1" = c(error, fpr, fnr),
                           "Model 2" = c(error_2, fpr_2, fnr_2)
                          
                          )

kableExtra::kable(tabularview_2) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)

Matrices	Model.1	Model.2
Total Error	0.1850925	0.2837838
False Positive Rate	0.0276382	0.2948403
False Negative Rate	0.8009828	0.2727273

Note: Above table shows:

Model 1 represents the Logistic Regression
Model 2 represents the Down sample Logistic regression

We see Model 1 has lower Total Error rate and False Positive rate than Model 2. Model 2 has lower False Nagative rate than Model 1.

7. Repeat steps 4,5,6 above but this time, use a C5.0 decision tree model to predict “Exited”. (usetrials=30 for boosting multiple decision trees (see an example in slide 44, module 6) Compare the logistic regression model with the boosted C5.0 model.

lg_model_3 = C5.0(training_set[-11], training_set$Exited, trials = 10)
lg_model_3

## 
## Call:
## C5.0.default(x = training_set[-11], y = training_set$Exited, trials = 10)
## 
## Classification Tree
## Number of samples: 8001 
## Number of predictors: 10 
## 
## Number of boosting iterations: 10 
## Average tree size: 26.4 
## 
## Non-standard options: attempt to group attributes

prediction_3 = predict(lg_model_3, test_set)
head(prediction_3,10)

##  [1] 0 0 0 0 0 1 0 0 0 0
## Levels: 0 1

Confusion Matrix

actual_3 = test_set$Exited
t3 = table(prediction_3, actual_3)
print(t3)

##             actual_3
## prediction_3    0    1
##            0 1529  220
##            1   63  187

Error

error_3 = (t3[1,2] + t3[2,1])/sum(t3)
cat("Total Error = ", error_3)

## Total Error =  0.1415708

False Positive Rate

fpr_3 = t3[2,1]/(t3[2,1] + t3[1,1])
print(paste('False Positive Rate =', fpr_3))

## [1] "False Positive Rate = 0.039572864321608"

False Negative Rate

fnr_3 = t3[1,2]/(t3[1,2] + t3[2,2])
print(paste('False Negative Rate =', fnr_3))

## [1] "False Negative Rate = 0.540540540540541"

C5.0 decision tree model with down sample data

lg_model_4 = C5.0(smpl_training_set[-11], smpl_training_set$Exited, trials = 10)
lg_model_4

## 
## Call:
## C5.0.default(x = smpl_training_set[-11], y = smpl_training_set$Exited, trials
##  = 10)
## 
## Classification Tree
## Number of samples: 3260 
## Number of predictors: 10 
## 
## Number of boosting iterations: 10 
## Average tree size: 18.5 
## 
## Non-standard options: attempt to group attributes

prediction_4 = predict(lg_model_4, smpl_test_set)
head(prediction_4,10)

##  [1] 0 0 0 0 0 1 0 0 1 1
## Levels: 0 1

Confusion Matrix

actual_4 = smpl_test_set$Exited
t4 = table(prediction_4, actual_4)
print(t4)

##             actual_4
## prediction_4   0   1
##            0 334  97
##            1  73 310

Error

error_4 = (t4[1,2] + t4[2,1])/sum(t4)
cat("Total Error = ", error_4)

## Total Error =  0.2088452

False Positive Rate

fpr_4 = t4[2,1]/(t4[2,1] + t4[1,1])
print(paste('False Positive Rate =', fpr_4))

## [1] "False Positive Rate = 0.179361179361179"

False Negative Rate

fnr_4 = t4[1,2]/(t4[1,2] + t4[2,2])
print(paste('False Negative Rate =', fnr_4))

## [1] "False Negative Rate = 0.238329238329238"

Compare the logistic regression model with the boosted C5.0 model:

tabularview_3 <- data.frame("Matrices" = c("Total Error", "False Positive Rate", "False Negative Rate"),
                          "Model 1" = c(error, fpr, fnr),
                           "Model 2" = c(error_2, fpr_2, fnr_2),
                          "Model 3" = c(error_3, fpr_3, fnr_3),
                          "Model 4" = c(error_4, fpr_4, fnr_4)
                          
                          )

kableExtra::kable(tabularview_3) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)

Matrices	Model.1	Model.2	Model.3	Model.4
Total Error	0.1850925	0.2837838	0.1415708	0.2088452
False Positive Rate	0.0276382	0.2948403	0.0395729	0.1793612
False Negative Rate	0.8009828	0.2727273	0.5405405	0.2383292

Note :

Model 1 represents the Logistic Regression
Model 2 represents the Down sample Logistic regression
Model 3 represents C5.0 decision tree regression
Model 4 represents Down sample C5.0 decision tree regression

Above table shows, Model 3 has lower Error rate than other models. Model 1 has lower False Positive Rate than other models. Model 4 has lower False Nagative Rat ethan other models.

Lower the error, better the model performnace. So Model 3 i.e C5.0 decision tree regression model performs well for predicting existing customers.

Assignment 3

Group_Assignment: Venkata Naga Vamsidhar reddy karasani(vkara4), Anila Cheekati(vchee3), Venkata sai ram tirunagari(Vtiru5) , Pradeep kumar Naidu(Pnaid2), Simhadri Ramanjaneyulu(rsimh3), Subhalaxmi Rout(srout2)

Problem 1: Predicting number of college applications

Problem2—Predicting Customer Churn using Logistic Regression and Decision Trees