knitr::opts_chunk$set(echo = TRUE)
options(width=120)
library(car)
library(ggplot2)
library(reshape2)
library(psych)
library(pastecs)
library(tidyr)
library(dplyr)

## SeniorCitizen tenure MultipleLines InternetService Contract MonthlyCharges Churn
## 1 0 71 Yes DSL Two year 80.10 No
## 2 0 53 Yes Fiber optic One year 91.15 No
## 3 0 67 No No Two year 19.40 No
## 4 0 6 No Fiber optic Month-to-month 69.80 No
## 5 0 14 No No Month-to-month 19.60 No
## 6 0 68 No Fiber optic One year 100.30 No
The dataset contains data on customers of a telecommunications company, so the unit of observation is a customer. The full dataset includes 21 variables and 7043 customers. For easier data manipulation I reduced it to 1000 observations and 7 variables, as the remaining variables are not needed for this analysis.
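The reduction step itself is not shown; a minimal sketch of how it could look is given below. The file name matches the Kaggle download, but the seed and the sample() call are assumptions rather than the original code.

# Hedged reconstruction of the data-reduction step (not the original code):
df <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = FALSE)
set.seed(123) # hypothetical seed; the original sample cannot be reproduced from the text
df <- df[sample(nrow(df), 1000),
         c("SeniorCitizen", "tenure", "MultipleLines",
           "InternetService", "Contract", "MonthlyCharges", "Churn")]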
Source:
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
The variables are described below.
SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
tenure: Number of months the customer has stayed with the company
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService: Customer internet service provider (DSL, Fiber optic, No)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
MonthlyCharges: The amount charged to the customer monthly ($)
Churn: Whether the customer churned or not (Yes or No)
The main goal of this study is to build a model for predicting which customers are likely to churn from the network. The descriptive analysis was conducted in the first section and the hypothesis testing in the third, so the prediction model is built in this section.
Before moving to the modeling part, data preprocessing is a necessary step to convert the variables to suitable types so that the further analysis can proceed without errors. Several required preprocessing steps were identified in this dataset and carried out before modeling.
All the categorical variables in the dataset are stored as character values (SeniorCitizen as 0/1), so they are converted to factors before moving on.

df$SeniorCitizen <- factor(df$SeniorCitizen,
                           levels = c("0", "1"),
                           labels = c("No", "Yes"))
df$MultipleLines <- as.factor(df$MultipleLines)
df$InternetService <- as.factor(df$InternetService)
df$Contract <- as.factor(df$Contract)
df$Churn <- as.factor(df$Churn)

The first model is an intercept-only model: no independent variables are used.
#Fitting the intercept-only logistic model
fit0 <- glm(df$Churn ~ 1,      #Dependent variable, intercept only
            family = binomial, #Binary logistic regression
            data = df)

The output below shows the summary of the intercept-only model; the intercept estimate is -1.046.

summary(fit0)

##
## Call:
## glm(formula = df$Churn ~ 1, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.776 -0.776 -0.776 1.641 1.641
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.04597 0.07209 -14.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1146.1 on 999 degrees of freedom
## Residual deviance: 1146.1 on 999 degrees of freedom
## AIC: 1148.1
##
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #Odds for Y=1

## odds 2.5 % 97.5 %
## (Intercept) 0.3513514 0.3050531 0.4046764
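The intercept odds translate directly into a probability, which previews the fitted values in the next step; a one-line check using only quantities already computed:

# Convert the intercept odds (~0.351) to a probability: p = odds / (1 + odds)
odds <- exp(coef(fit0)[1])
odds / (1 + odds) # ~0.26, the overall churn rate in the sample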
The next step is to compare the predicted values with the actuals; for that a classification matrix is created.
#Confusion Matrix
head(fitted(fit0))

## 1 2 3 4 5 6
## 0.26 0.26 0.26 0.26 0.26 0.26

correctly <- 740/1000 #Share of actual "No" cases, all classified correctly by the null model

The fitted probabilities are all identical because no explanatory variables were included. The estimated probability of a client churning is 26 percent, so the null model classifies every client as a non-churner and 74 percent of units are classified correctly.
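The hard-coded 740/1000 can also be derived directly from the data; a small sketch:

# Proportion of actual non-churners = classification accuracy of the null model
mean(df$Churn == "No") # 0.74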
For the second model we include tenure as an independent variable to see whether the model improves.
#Fitting the logistic model
fit1 <- glm(df$Churn ~ tenure,     #Dependent and explanatory variables
            family = binomial,     #Binary logistic regression
            data = df)
summary(fit1)

##
## Call:
## glm(formula = df$Churn ~ tenure, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1429 -0.8486 -0.5072 1.2277 2.2956
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08163 0.11218 -0.728 0.467
## tenure -0.03592 0.00372 -9.658 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1146.1 on 999 degrees of freedom
## Residual deviance: 1031.3 on 998 degrees of freedom
## AIC: 1035.3
##
## Number of Fisher Scoring iterations: 4
We see that tenure is statistically significant.
anova(fit0, fit1, test = "Chi") #Comparison of models based on -2LL statistics

## Analysis of Deviance Table
##
## Model 1: df$Churn ~ 1
## Model 2: df$Churn ~ tenure
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 999 1146.1
## 2 998 1031.3 1 114.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The simple model is better.
H1: The complex model is better.
The difference in degrees of freedom (m) equals 1 and the residual deviance is smaller; given the p-value we reject the null hypothesis and conclude that the complex model, which includes tenure as an explanatory variable, is better (p < 0.001).
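The likelihood-ratio statistic in the table can be reproduced by hand from the two residual deviances:

# LR statistic is the drop in deviance; compare to a chi-square with 1 df
LR <- fit0$deviance - fit1$deviance # 1146.1 - 1031.3 = 114.81
pchisq(LR, df = 1, lower.tail = FALSE) # matches the p-value in the anova table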
exp(cbind(OR = fit1$coefficients, confint.default(fit1)))

## OR 2.5 % 97.5 %
## (Intercept) 0.9216090 0.7397017 1.1482509
## tenure 0.9647124 0.9577048 0.9717713
If tenure increases by 1 month, the odds of churning are 0.96 times the initial odds (p < 0.001).
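Because the effect is multiplicative, it compounds over time; for example, a full year of extra tenure scales the odds as follows:

# Multiplicative effect of 12 additional months of tenure on the odds of churning
exp(coef(fit1)["tenure"] * 12) # ~0.65, i.e. the odds fall by roughly a third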
The third model is built with all the remaining explanatory variables in the table.

fit2 <- glm(Churn ~ SeniorCitizen + tenure + MultipleLines + InternetService + Contract + MonthlyCharges,
            family = binomial,
            data = df)
vif(fit2)

## GVIF Df GVIF^(1/(2*Df))
## SeniorCitizen 1.085280 1 1.041768
## tenure 1.903811 1 1.379786
## MultipleLines 2.620181 2 1.272280
## InternetService 7.379071 2 1.648164
## Contract 1.531826 2 1.112506
## MonthlyCharges 8.484722 1 2.912855
MonthlyCharges and InternetService both show high GVIF values (above the commonly used threshold of 5), so we will remove MonthlyCharges because of the multicollinearity.
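The collinearity is unsurprising, since the monthly charge is largely determined by the type of internet service. A quick illustrative check, run before the column is dropped:

# Mean monthly charge by internet service type; the strong separation explains the high GVIFs
aggregate(MonthlyCharges ~ InternetService, data = df, FUN = mean)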
df <- df[ , -c(6)] #Drop column 6, MonthlyCharges
fit2 <- glm(Churn ~ SeniorCitizen + tenure + MultipleLines + InternetService + Contract,
            family = binomial,
            data = df)
vif(fit2)

## GVIF Df GVIF^(1/(2*Df))
## SeniorCitizen 1.080885 1 1.039656
## tenure 1.745943 1 1.321341
## MultipleLines 1.746574 2 1.149600
## InternetService 1.635842 2 1.130929
## Contract 1.474496 2 1.101947
After removing MonthlyCharges, the GVIF of InternetService drops substantially, so multicollinearity is no longer a concern.
df$StdResid <- rstandard(fit2)
df$CookDist <- cooks.distance(fit2)
hist(df$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")
Standardized residuals greater than 3 or less than -3 can be considered outliers. In this histogram no such outliers are observed.
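The same conclusion can be confirmed numerically with a one-line check:

# Count of standardized residuals outside [-3, 3]; per the histogram this should be 0
sum(abs(df$StdResid) > 3)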
head(df$CookDist)

## [1] 2.689948e-06 2.028701e-04 6.770075e-07 1.238401e-03 1.422785e-04 1.397198e-04

head(df[order(-df$CookDist), ])

## SeniorCitizen tenure MultipleLines InternetService Contract Churn StdResid CookDist
## 401 No 55 No DSL Two year Yes 2.868602 0.02351044
## 604 No 57 No Fiber optic Two year Yes 2.389266 0.02100341
## 537 No 54 No Fiber optic Two year Yes 2.353393 0.02064268
## 112 No 66 Yes Fiber optic Two year Yes 2.490158 0.01965960
## 152 No 65 Yes Fiber optic Two year Yes 2.478531 0.01956842
## 57 No 61 Yes Fiber optic Two year Yes 2.431698 0.01925458
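A common rule of thumb flags observations whose Cook's distance exceeds 4/n; a sketch of that check (the cutoff is a convention, not part of the original analysis):

# Flag potentially influential observations using the conventional 4/n cutoff
which(df$CookDist > 4 / nrow(df))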
df <- df[-c(401) , ] #Remove the most influential unit (row 401)

The model is refitted after removing this unit.
Chi-square test to compare the two models:
H0: The simple model is better.
H1: The complex model is better.
fit1 <- glm(df$Churn ~ tenure,
            family = binomial,
            data = df)
fit2 <- glm(Churn ~ SeniorCitizen + tenure + MultipleLines + InternetService + Contract,
            family = binomial,
            data = df)

anova(fit1, fit2, test = "Chi")

## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: df$Churn
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 998 1143.4
## tenure 1 116.49 997 1026.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At p < 0.001 we reject the null hypothesis; the complex model fits the data better.
summary(fit2)

##
## Call:
## glm(formula = Churn ~ SeniorCitizen + tenure + MultipleLines +
## InternetService + Contract, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6737 -0.7248 -0.3129 0.7725 2.5500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.75688 0.19625 -3.857 0.000115 ***
## SeniorCitizenYes 0.46060 0.21782 2.115 0.034463 *
## tenure -0.03121 0.00554 -5.634 1.76e-08 ***
## MultipleLinesNo phone service 0.74996 0.32134 2.334 0.019606 *
## MultipleLinesYes 0.03004 0.20363 0.148 0.882725
## InternetServiceFiber optic 1.41505 0.22573 6.269 3.64e-10 ***
## InternetServiceNo -0.87388 0.35979 -2.429 0.015145 *
## ContractOne year -0.61237 0.27626 -2.217 0.026647 *
## ContractTwo year -1.84009 0.50240 -3.663 0.000250 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1143.42 on 998 degrees of freedom
## Residual deviance: 857.46 on 990 degrees of freedom
## AIC: 875.46
##
## Number of Fisher Scoring iterations: 6
We see that having multiple phone lines, compared to a single line, is not statistically significant.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))

## OR 2.5 % 97.5 %
## (Intercept) 0.4691263 0.3193317 0.6891877
## SeniorCitizenYes 1.5850301 1.0342590 2.4291019
## tenure 0.9692736 0.9588066 0.9798548
## MultipleLinesNo phone service 2.1169086 1.1276451 3.9740356
## MultipleLinesYes 1.0304944 0.6913771 1.5359472
## InternetServiceFiber optic 4.1167125 2.6448701 6.4076196
## InternetServiceNo 0.4173290 0.2061717 0.8447496
## ContractOne year 0.5420625 0.3154242 0.9315448
## ContractTwo year 0.1588025 0.0593216 0.4251107
If tenure increases by 1 month, the odds of churning are 0.97 times the initial odds, holding the remaining explanatory variables constant (p < 0.001).
Holding all other variables constant, the odds of churning for a customer with no phone service are 2.12 times the odds of a customer with one phone line (p < 0.05).
Holding all other variables constant, the odds of churning for a senior citizen are 1.59 times the odds of a non-senior citizen (p < 0.05).
Holding all other variables constant, the odds of churning for a customer with fiber optic are 4.12 times the odds of a customer with DSL (p < 0.001).
Holding all other variables constant, the odds of churning for a customer with no internet service are 0.42 times the odds of a customer with DSL (p < 0.05).
Holding all other variables constant, the odds of churning for a customer with a one-year contract are 0.54 times the odds of a customer with a month-to-month contract (p < 0.05).
Holding all other variables constant, the odds of churning for a customer with a two-year contract are 0.16 times the odds of a customer with a month-to-month contract (p < 0.001).
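These odds ratios combine multiplicatively. As an illustration, below is a sketch of the predicted churn probability for a hypothetical customer profile (the profile is invented for the example):

# Hypothetical profile: not senior, 12 months of tenure, multiple lines,
# fiber optic internet, month-to-month contract
newcust <- data.frame(SeniorCitizen = "No", tenure = 12,
                      MultipleLines = "Yes", InternetService = "Fiber optic",
                      Contract = "Month-to-month")
predict(fit2, newdata = newcust, type = "response") # predicted churn probability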
df$EstimatedProbab <- fitted(fit2)
head(df)

## SeniorCitizen tenure MultipleLines InternetService Contract Churn StdResid CookDist EstimatedProbab
## 1 No 71 Yes DSL Two year No -0.14401934 2.689948e-06 0.008303542
## 2 No 53 Yes Fiber optic One year No -0.61239629 2.028701e-04 0.171049282
## 3 No 67 No No Two year No -0.09658068 6.770075e-07 0.003827074
## 4 No 6 No Fiber optic Month-to-month No -1.38722998 1.238401e-03 0.615601779
## 5 No 14 No No Month-to-month No -0.48866018 1.422785e-04 0.112277938
## 6 No 68 No Fiber optic One year No -0.49242353 1.397198e-04 0.111414469
In the far-right column we can observe each customer's estimated probability of churning, given the values of their explanatory variables.
df$Classification <- ifelse(test = df$EstimatedProbab < 0.5,
                            yes = "No",
                            no = "Yes")
ClassificationTable <- table(df$Churn, df$Classification)
ClassificationTable

##
## No Yes
## No 664 76
## Yes 133 126
The correctly classified units of observation lie on the main diagonal of the matrix.
Pseudo_R2_fit2 <- (ClassificationTable[1,1] + ClassificationTable[2,2])/nrow(df)
Pseudo_R2_fit2

## [1] 0.7907908
79.1 percent of all observations were correctly classified (the count pseudo-R² computed above), an improvement over the 74 percent of the intercept-only model.
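Overall accuracy hides the asymmetry between the two classes; sensitivity and specificity can be read off the same table:

# Sensitivity: share of actual churners classified as churners (126/259)
# Specificity: share of actual non-churners classified as non-churners (664/740)
sens <- ClassificationTable["Yes", "Yes"] / sum(ClassificationTable["Yes", ])
spec <- ClassificationTable["No", "No"] / sum(ClassificationTable["No", ])
c(sensitivity = sens, specificity = spec) # ~0.49 and ~0.90

The model identifies non-churners far more reliably than churners, which is worth keeping in mind if the goal is to target customers who are likely to churn.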