GLM_Project

### loading libraries
library(readr)
library(knitr)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(e1071)
library(generalhoslem)

## Loading required package: reshape

## Loading required package: MASS

Prepairing data and some exploratory analysis

Churn_Data <- read_csv("/Users/hamed/Documents/Academics/Generalized Linear Models/Term Project/Churn_Modelling.csv")

## Parsed with column specification:
## cols(
##   RowNumber = col_double(),
##   CustomerId = col_double(),
##   Surname = col_character(),
##   CreditScore = col_double(),
##   Geography = col_character(),
##   Gender = col_character(),
##   Age = col_double(),
##   Tenure = col_double(),
##   Balance = col_double(),
##   NumOfProducts = col_double(),
##   HasCrCard = col_double(),
##   IsActiveMember = col_double(),
##   EstimatedSalary = col_double(),
##   Exited = col_double()
## )

# Excluding columns of id and name
Churn_Data <- subset(Churn_Data,select = 4:ncol(Churn_Data))

Churn_Data$Exited <- as.factor(Churn_Data$Exited)

# Examining multi-collinearity between continuous independant variables.

con_var <- subset(Churn_Data, select = c("CreditScore","Age","Tenure","Balance","EstimatedSalary"))
pairs(con_var)

kable(cor(con_var))

	CreditScore	Age	Tenure	Balance	EstimatedSalary
CreditScore	1.0000000	-0.0039649	0.0008419	0.0062684	-0.0013843
Age	-0.0039649	1.0000000	-0.0099968	0.0283084	-0.0072010
Tenure	0.0008419	-0.0099968	1.0000000	-0.0122539	0.0077838
Balance	0.0062684	0.0283084	-0.0122539	1.0000000	0.0127975
EstimatedSalary	-0.0013843	-0.0072010	0.0077838	0.0127975	1.0000000

str(Churn_Data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    10000 obs. of  11 variables:
##  $ CreditScore    : num  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : num  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : num  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : num  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : num  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : num  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 1 ...

No multi-collinearity problem: correlations are not extreme and basically as long as our objective is prediction, multi-collinearity is not a concern. Because even in extreme case of correlation between independant variables, coefficients’ estimates are still unbiased.

Spliting dataset to train and test segments

set.seed(648)
ind <- sample(2, nrow(Churn_Data), replace = TRUE, prob = c(0.8, 0.2))
train_data <- Churn_Data[ind==1,]
test_data <- Churn_Data[ind==2,]

#Check proportion of churned and unchurned users in train and test data
prop.table(table(train_data$Exited))

## 
##        0        1 
## 0.792057 0.207943

prop.table(table(test_data$Exited))

## 
##         0         1 
## 0.8133467 0.1866533

Fit model 1

I fitted model 1, which is the model with all independant variables and data set. I also tried to find influential values (outliers) using cook’s distance, but results are a bit strange. Standardized residuals does not show any problematic point. it will be great if you could add some analyses to this part

#### MODEL 1
model1 <- glm(Exited ~., family = binomial(), data = train_data)
summary(model1)

## 
## Call:
## glm(formula = Exited ~ ., family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3013  -0.6670  -0.4632  -0.2745   2.9528  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.391e+00  2.720e-01 -12.465  < 2e-16 ***
## CreditScore      -4.582e-04  3.107e-04  -1.475   0.1402    
## GeographyGermany  7.305e-01  7.527e-02   9.706  < 2e-16 ***
## GeographySpain   -2.749e-02  7.821e-02  -0.351   0.7252    
## GenderMale       -5.216e-01  6.034e-02  -8.644  < 2e-16 ***
## Age               7.157e-02  2.827e-03  25.315  < 2e-16 ***
## Tenure           -1.551e-02  1.037e-02  -1.497   0.1345    
## Balance           2.376e-06  5.682e-07   4.182 2.89e-05 ***
## NumOfProducts    -1.228e-01  5.258e-02  -2.336   0.0195 *  
## HasCrCard        -4.335e-02  6.590e-02  -0.658   0.5107    
## IsActiveMember   -1.042e+00  6.366e-02 -16.363  < 2e-16 ***
## EstimatedSalary   4.618e-07  5.245e-07   0.880   0.3786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6965.1  on 7995  degrees of freedom
## AIC: 6989.1
## 
## Number of Fisher Scoring iterations: 5

#fit$fitted.values
cooksd <- cooks.distance(model1)
plot(cooksd, pch="*", cex=1, main="Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="red")

plot(rstandard(model1))

Suggesting a more parsimonous model

I am dropping insignificant independant variables to see if we can have more parsimonous model

model2 <- glm(Exited ~.-CreditScore-Tenure-HasCrCard-EstimatedSalary, family = binomial(), data = train_data)
summary(model2)

## 
## Call:
## glm(formula = Exited ~ . - CreditScore - Tenure - HasCrCard - 
##     EstimatedSalary, family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3467  -0.6651  -0.4646  -0.2749   2.9265  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.748e+00  1.609e-01 -23.290  < 2e-16 ***
## GeographyGermany  7.273e-01  7.519e-02   9.672  < 2e-16 ***
## GeographySpain   -2.850e-02  7.817e-02  -0.365   0.7154    
## GenderMale       -5.230e-01  6.029e-02  -8.675  < 2e-16 ***
## Age               7.152e-02  2.825e-03  25.316  < 2e-16 ***
## Balance           2.394e-06  5.677e-07   4.217 2.48e-05 ***
## NumOfProducts    -1.226e-01  5.253e-02  -2.334   0.0196 *  
## IsActiveMember   -1.040e+00  6.353e-02 -16.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8186.7  on 8006  degrees of freedom
## Residual deviance: 6970.7  on 7999  degrees of freedom
## AIC: 6986.7
## 
## Number of Fisher Scoring iterations: 5

model2$deviance

## [1] 6970.709

Dev_test <- abs(model2$deviance - model1$deviance) > qchisq(0.95,4)
Dev_test

## [1] FALSE

According to Deviance test, null hypothesis is not rejected and covariates in model 2 are enough to explain response variable.

Hosmer-Lemeshow goodness-of-fit statistic

hoslem <- logitgof(train_data$Exited, model2$fitted.values, g=10)
hoslem

## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  train_data$Exited, model2$fitted.values
## X-squared = 13.22, df = 8, p-value = 0.1045

we cannot reject null-hypothesis (goodness of fit of this model) in %95 significance level

Confusion matrix and checking accuracy of models

Model 1

###accuracy of prediction for model 1

prediction1 <- predict(model1, test_data,type = "response")
#prediction1
threshold = 0.5
prediction1[prediction1 >= threshold] <- 1
prediction1[prediction1 < threshold] <- 0
#prediction1
confusionMatrix(as.factor(prediction1), test_data$Exited, positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1572  291
##          1   49   81
##                                           
##                Accuracy : 0.8294          
##                  95% CI : (0.8122, 0.8457)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.03398         
##                                           
##                   Kappa : 0.2502          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21774         
##             Specificity : 0.96977         
##          Pos Pred Value : 0.62308         
##          Neg Pred Value : 0.84380         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04064         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59376         
##                                           
##        'Positive' Class : 1               
##

Although accuracy of this model is more than %82, attending accuracy as the only criteria is misleading. Since, we are mainly interested in predicting churning users precisely (i.e. sensitivity) and this model predicts churning users correctly in %22 of cases. This is terrible for our objective. Thus, we can improve sensitivity (at the cost of specificity and therefore accuracy) by changing threshold of classificatin. In the plot below, we examine how these three criteria change in deferent thresholds.

thresholds <- seq(0.1,1,0.1)
acc <- rep(0,10)
sens <- rep(0,10)
spec <- rep(0,10)
for (i in 1:10){
  pred <- predict(model1, test_data,type = "response")
  pred[pred >= thresholds[i]] <- 1
  pred[pred < thresholds[i]] <- 0
  c <- confusionMatrix(as.factor(pred), test_data$Exited, positive = '1')
  acc[i] <- c$overall['Accuracy']
  sens[i] <- c$byClass['Sensitivity']
  spec[i] <- c$byClass['Specificity']
}

## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.

## Warning in confusionMatrix.default(as.factor(pred), test_data$Exited, positive =
## "1"): Levels are not in the same order for reference and data. Refactoring data
## to match.

plot(thresholds,acc, type = "o", pch=19, ylim = c(0,1))
points(thresholds,sens, type = "o", pch=19, col="red")
points(thresholds,spec, type = "o", pch=19, col="blue")

legend("right",c("Accuracy","Sensitivity","Specificity"),pch=c(19,19,19),col = c("black","red","blue"))

Model 2

###accuracy of prediction for model 2

prediction2 <- predict(model2, test_data,type = "response")
threshold = 0.5
prediction2[prediction2 >= threshold] <- 1
prediction2[prediction2 < threshold] <- 0
confusionMatrix(as.factor(prediction2), test_data$Exited, positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1571  292
##          1   50   80
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8111, 0.8447)
##     No Information Rate : 0.8133          
##     P-Value [Acc > NIR] : 0.04384         
##                                           
##                   Kappa : 0.2458          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.21505         
##             Specificity : 0.96915         
##          Pos Pred Value : 0.61538         
##          Neg Pred Value : 0.84326         
##              Prevalence : 0.18665         
##          Detection Rate : 0.04014         
##    Detection Prevalence : 0.06523         
##       Balanced Accuracy : 0.59210         
##                                           
##        'Positive' Class : 1               
##

thresholds <- seq(0.1,1,0.1)
acc2 <- rep(0,10)
sens2 <- rep(0,10)
spec2 <- rep(0,10)
for (i in 1:10){
  pred2 <- predict(model2, test_data,type = "response")
  pred2[pred2 >= thresholds[i]] <- 1
  pred2[pred2 < thresholds[i]] <- 0
  c2 <- confusionMatrix(as.factor(pred2), test_data$Exited, positive = '1')
  acc2[i] <- c2$overall['Accuracy']
  sens2[i] <- c2$byClass['Sensitivity']
  spec2[i] <- c2$byClass['Specificity']
}

## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.

## Warning in confusionMatrix.default(as.factor(pred2), test_data$Exited, positive
## = "1"): Levels are not in the same order for reference and data. Refactoring
## data to match.

plot(thresholds,acc2, type = "o", pch=19, ylim = c(0,1))
points(thresholds,sens2, type = "o", pch=19, col="red")
points(thresholds,spec2, type = "o", pch=19, col="blue")
points(thresholds,sens, type = "o", pch=19, col="green")
legend("right",c("Accuracy-Model2","Sensitivity-Model2","Specificity-Model2", "Sensitivity-Model1"),pch=c(19,19,19,19),col = c("black","red","blue","green"))

Since sensitivity is important to us, I have also compared with the sensitivity of model 1. Not much difference can be seen.