INTRODUCTION:
The Telco data contains information about almost seven thousand three hundred users, their demographic characteristics, the services they use, the duration of using the operator’s services, the method of payment, and the amount of payment. The analysis of the data is to predict the churn of users (to identify people who will and will not renew their contract).
PROBLEM STATEMENT:
The Telco industry heavily relies on retaining existing customers. It’s vital for these organizations looking to grow their business without relying too heavily on the significantly higher cost of acquiring new customers. Marketing, sales, and customer retention departments need to work to make sure customers are satisfied, provide them with incentives, and present offers at the right time to reduce churn
Objective
library("dplyr")
library("tidyverse")
df = read.csv("Telco_Customer_Dataset_Dirty.csv")
head(df)
dim(df)
## [1] 7322 23
str(df)
## 'data.frame': 7322 obs. of 23 variables:
## $ RowId : chr "0096-BXERSMonth-to-monthElectronic check" "0178-CIIKRMonth-to-monthMailed check" "0318-ZOPWSTwo yearBank transfer (automatic)" "4445-ZJNMUMonth-to-monthCredit card (automatic)" ...
## $ customerID : chr "0096-BXERS" "0178-CIIKR" "0318-ZOPWS" "4445-ZJNMU" ...
## $ gender : chr "female" "female" "Female" "M" ...
## $ SeniorCitizen : num 0 0 0 0 0 0 1 1 1 0 ...
## $ Partner : chr "Yes" "No" "Yes" "No" ...
## $ Supplementary.Line: chr "" "" "" "" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : num 6 3 49 9 12 1 1 16 24 36 ...
## $ PhoneService : chr "Yes" "Yes" "Yes" "Yes" ...
## $ MultipleLines : chr "Yes" "No" "No" "Yes" ...
## $ InternetService : chr "DSL" "No" "No" "Fiber optic" ...
## $ OnlineSecurity : chr "No" "No internet service" "No internet service" "No" ...
## $ OnlineBackup : chr "No" "No internet service" "No internet service" "Yes" ...
## $ DeviceProtection : chr "No" "No internet service" "No internet service" "No" ...
## $ TechSupport : chr "No" "No internet service" "No internet service" "No" ...
## $ StreamingTV : chr "No" "No internet service" "No internet service" "Yes" ...
## $ StreamingMovies : chr "No" "No internet service" "No internet service" "Yes" ...
## $ Contract : chr "Month-to-month" "Month-to-month" "Two year" "Month-to-month" ...
## $ PaperlessBilling : chr "n" "No" "Yes" "Yes" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Bank transfer (automatic)" "Credit card (automatic)" ...
## $ MonthlyCharges : num 50.4 19.9 20.1 99.3 19.8 ...
## $ TotalCharges : num 315 58 973 919 202 ...
## $ Churn : chr "No" "No" "No" "No" ...
sum(duplicated(df$customerID))
## [1] 279
df = df[!duplicated(df$customerID), ]
sum(duplicated(df$customerID))
## [1] 0
print(dim(df))
## [1] 7043 23
colSums(is.na(df))
## RowId customerID gender SeniorCitizen
## 0 0 0 0
## Partner Supplementary.Line Dependents tenure
## 0 0 0 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 0 0 0 0
## OnlineBackup DeviceProtection TechSupport StreamingTV
## 0 0 0 0
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 0 0 0 0
## MonthlyCharges TotalCharges Churn
## 0 11 0
colSums(df == "")
## RowId customerID gender SeniorCitizen
## 0 0 0 0
## Partner Supplementary.Line Dependents tenure
## 0 6996 0 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 0 0 0 0
## OnlineBackup DeviceProtection TechSupport StreamingTV
## 0 0 0 0
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 0 0 0 0
## MonthlyCharges TotalCharges Churn
## 0 NA 0
drop <- c("RowId","Supplementary.Line")
df = df[,!(names(df) %in% drop)]
‘Row ID’ and ‘Supplementary.Line’ is dropped.
RowID is a concatenation of customerID + Contract + PaymentMethod which is probably only being used in Data Warehousing
Supplementary.Line has more than 80% missing values and it’s safe to drop as it may cause bias
new_DF <- df[rowSums(is.na(df)) > 0,]
head(new_DF,20)
df = na.omit(df)
colSums(is.na(df))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Churn
## 0
head(df)
df %>% group_by(gender)%>% dplyr::summarise(count = n())
male_cat <- c('m','male','M')
female_cat <- c('f','female','F')
df$gender[df$gender %in% male_cat] <- 'Male'
df$gender[df$gender %in% female_cat] <- 'Female'
df %>% group_by(gender)%>% dplyr::summarise(count = n())
df %>% group_by(PaperlessBilling)%>% dplyr::summarise(count = n())
yes_cat <- c('n','no','N')
no_cat <- c('y','yes','Y')
df$PaperlessBilling[df$PaperlessBilling %in% yes_cat] <- 'Yes'
df$PaperlessBilling[df$PaperlessBilling %in% no_cat] <- 'No'
df %>% group_by(PaperlessBilling)%>% dplyr::summarise(count = n())
df %>% group_by(Contract)%>% dplyr::summarise(count = n())
one_cat <- c('1 year')
two_cat <- c('2 year')
df$Contract[df$Contract %in% one_cat] <- 'One year'
df$Contract[df$Contract %in% two_cat] <- 'Two year'
df %>% group_by(Contract)%>% dplyr::summarise(count = n())
dim(df)
## [1] 7032 21
write.csv(df,'Telco_Customer_Dataset_Cleaned.csv')
| Column Name | Description |
|---|---|
| customerID | Customer ID |
| gender | Customer gender |
| SeniorCitizen | Whether the customer is a senior citizen or not |
| Partner | Whether the customer has a partner or not |
| Dependents | Whether the customer has dependents or not |
| tenure | Number of months the customer has stayed with the company |
| PhoneService | Whether the customer has a phone service or not |
| MultipleLines | Whether the customer has multiple lines or not |
| InternetService | Customer’s internet service provider |
| OnlineSecurity | Whether the customer has online security or not |
| OnlineBackup | Whether the customer has online backup or not |
| DeviceProtection | Whether the customer has device protection or not |
| TechSupport | Whether the customer has tech support or not |
| StreamingTV | Whether the customer has streaming TV or not |
| StreamingMovies | Whether the customer has streaming movies or not |
| Contract | The contract term of the customer |
| PaperlessBilling | Whether the customer has paperless billing or not |
| PaymentMethod | The customer’s payment method |
| MonthlyCharges | The amount charged to the customer monthly |
| TotalCharges | The total amount charged to the customer |
library(ggplot2)
library(caret)
library(cowplot)
library(ggcorrplot)
library(gridExtra)
df = read.csv("Telco_Customer_Dataset_Cleaned.csv")
x = ggplot(df, aes(Partner, fill = Churn)) +
geom_bar() +
labs(title = "Customer Partner Status",
x = "Does the Customer have a Partner?",
y = "Count")
y= ggplot(df, aes(Dependents, fill = Churn)) +
geom_bar() +
labs(title = "Customer Dependents Status",
x = "Does the Customer have Dependents?",
y = "Count")
grid.arrange(x, y,nrow=1,ncol=2)
df <- mutate(df, tenure_bin = tenure)
df$tenure_bin[df$tenure_bin >=0 & df$tenure_bin <= 12] <- '0-1 year'
df$tenure_bin[df$tenure_bin > 12 & df$tenure_bin <= 24] <- '1-2 years'
df$tenure_bin[df$tenure_bin > 24 & df$tenure_bin <= 36] <- '2-3 years'
df$tenure_bin[df$tenure_bin > 36 & df$tenure_bin <= 48] <- '3-4 years'
df$tenure_bin[df$tenure_bin > 48 & df$tenure_bin <= 60] <- '4-5 years'
df$tenure_bin[df$tenure_bin > 60 & df$tenure_bin <= 72] <- '5-6 years'
df$tenure_bin <- as.factor(df$tenure_bin)
options(repr.plot.width =6, repr.plot.height = 3)
ggplot(df, aes(tenure_bin, fill = Partner)) + geom_bar()
options(repr.plot.width =6, repr.plot.height = 2)
ggplot(df, aes(y= tenure, x = "", fill = Churn)) +
geom_boxplot()+
theme_bw()+
xlab("") +ylab("Tenure (months)")
library("tidyverse")
library("dplyr")
df = read.csv("Telco_Customer_Dataset_Cleaned.csv")
ggplot(df, aes(Contract, fill = Churn)) +
geom_bar() +
labs(x = "Type of Contract Customer Has",
y = "Count") +
ggtitle("Popularity of Contract Types") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
ggplot(df, aes(MonthlyCharges, fill = Churn)) +
geom_histogram() +
labs(x = "Monthly Charge to Customer",
y = "Count") +
ggtitle("Monthly Charges Histogram") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
df %>%
dplyr::select(Churn, InternetService, OnlineSecurity, OnlineBackup) %>%
mutate(OnlineSecurity = factor(OnlineSecurity,
levels = c("Yes", "No", "No internet service"),
labels = c("Security", "No security", "No internet")),
OnlineBackup = factor(OnlineBackup,
levels = c("Yes", "No", "No internet service"),
labels = c("Backup", "No backup", "No internet"))) %>%
group_by(Churn, InternetService, OnlineSecurity, OnlineBackup) %>%
count() %>%
ggplot(aes(x = InternetService, y = n, fill = Churn)) +
geom_col(position = "fill") +
scale_fill_brewer(palette = "Set2") +
facet_grid(OnlineBackup ~ OnlineSecurity) +
labs(y = "Customers") +
ggtitle("Customer Churn by Internet service") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
scale_y_continuous(labels = scales::percent)
df %>%
mutate(SeniorCitizen = ifelse(SeniorCitizen == 0, "No", "Yes")) -> categorical
categorical %>%
dplyr::select(gender:Dependents, PhoneService:PaymentMethod, Churn) -> categorical
categorical %>%
dplyr::select(MultipleLines, OnlineSecurity:StreamingMovies, Churn) %>%
filter(MultipleLines != "No phone service" &
OnlineSecurity != "No internet service") -> c2
gather(c2, columns, value, -Churn) -> c3
ggplot(c3)+
geom_bar(aes(x = value, fill = Churn), position = "fill", stat = "count")+
facet_wrap(~columns)+
xlab("Attributes") +
ggtitle("Value-Added Service Against Churn") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
library(caTools)
library(car)
library(pROC)
library(MASS)
library(cowplot)
library(e1071)
data_raw = read.csv("Telco_Customer_Dataset_Cleaned.csv")
data_raw <- dplyr::select(data_raw, -1)
data_raw <- data_raw[complete.cases(data_raw),]
data_raw$SeniorCitizen <- as.factor(ifelse(data_raw$SeniorCitizen==1, 'YES', 'NO'))
For our machine learning purpose, we are aware that there are few categorical features for example that have ‘No Internet Service’ or ‘No Phone Service’ as categories. Thus we can transform them as ‘No’ and reorganize these features.
data_raw <- data.frame(lapply(data_raw, function(x) {
gsub("No internet service", "No", x)}))
data_raw <- data.frame(lapply(data_raw, function(x) {
gsub("No phone service", "No", x)}))
columns <- c("tenure", "MonthlyCharges", "TotalCharges")
data_raw[columns] <- sapply(data_raw[columns], as.numeric)
data_raw_int<- data_raw[,c("tenure", "MonthlyCharges", "TotalCharges")]
data_raw_int <- data.frame(scale(data_raw_int))
Let’s proceed to create derived features from tenure, where we have made multiple bins of tenure which are in months format to ‘0-1 year’, ‘2-3 years’, ‘3-4 years’, ‘4-5 years’, and etc.
data_raw <- mutate(data_raw, tenure_bin = tenure)
data_raw$tenure_bin[data_raw$tenure_bin >=0 & data_raw$tenure_bin <= 12] <- '0-1 year'
data_raw$tenure_bin[data_raw$tenure_bin > 12 & data_raw$tenure_bin <= 24] <- '1-2 years'
data_raw$tenure_bin[data_raw$tenure_bin > 24 & data_raw$tenure_bin <= 36] <- '2-3 years'
data_raw$tenure_bin[data_raw$tenure_bin > 36 & data_raw$tenure_bin <= 48] <- '3-4 years'
data_raw$tenure_bin[data_raw$tenure_bin > 48 & data_raw$tenure_bin <= 60] <- '4-5 years'
data_raw$tenure_bin[data_raw$tenure_bin > 60 & data_raw$tenure_bin <= 72] <- '5-6 years'
data_raw$tenure_bin <- as.factor(data_raw$tenure_bin)
data_raw_cat <- data_raw[,-c(1,6,19,20)]
dummy<- data.frame(sapply(data_raw_cat,function(x) data.frame(model.matrix(~x-1,data =data_raw_cat))[,-1]))
head(dummy)
data_final <- cbind(data_raw_int,dummy)
head(data_final)
set.seed(123)
indices = sample.split(data_final$Churn, SplitRatio = 0.7)
train = data_final[indices,]
validation = data_final[!(indices),]
logisticRegressionModel_1 = glm(Churn ~ ., data = train, family = "binomial")
summary(logisticRegressionModel_1)
##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0302 -0.6825 -0.2813 0.6462 3.1922
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.658413 1.594718 -2.921 0.003487
## tenure -2.267207 0.312586 -7.253 4.07e-13
## MonthlyCharges -1.573734 1.162395 -1.354 0.175777
## TotalCharges 0.047889 0.191794 0.250 0.802826
## gender -0.044823 0.079352 -0.565 0.572167
## SeniorCitizen 0.412759 0.102490 4.027 5.64e-05
## Partner 0.005844 0.095418 0.061 0.951166
## Dependents -0.086100 0.108181 -0.796 0.426097
## PhoneService 0.611067 0.790021 0.773 0.439237
## MultipleLines 0.628124 0.213993 2.935 0.003333
## InternetService.xFiber.optic 2.190157 0.971833 2.254 0.024219
## InternetService.xNo -2.157244 0.983457 -2.194 0.028269
## OnlineSecurity -0.069821 0.216431 -0.323 0.746996
## OnlineBackup 0.121141 0.214402 0.565 0.572064
## DeviceProtection 0.298124 0.214337 1.391 0.164251
## TechSupport 0.017730 0.218627 0.081 0.935366
## StreamingTV 0.759168 0.396137 1.916 0.055310
## StreamingMovies 0.843451 0.398770 2.115 0.034419
## Contract.xOne.year -0.713523 0.129933 -5.491 3.99e-08
## Contract.xTwo.year -1.500326 0.216948 -6.916 4.66e-12
## PaperlessBilling -0.110683 0.125114 -0.885 0.376340
## PaymentMethod.xCredit.card..automatic. 0.022790 0.136162 0.167 0.867077
## PaymentMethod.xElectronic.check 0.377997 0.113721 3.324 0.000888
## PaymentMethod.xMailed.check -0.011456 0.138454 -0.083 0.934058
## tenure_bin.x1.2.years 0.257856 0.188121 1.371 0.170472
## tenure_bin.x2.3.years 0.877473 0.312040 2.812 0.004923
## tenure_bin.x3.4.years 1.927149 0.448153 4.300 1.71e-05
## tenure_bin.x4.5.years 2.885506 0.578430 4.989 6.08e-07
## tenure_bin.x5.6.years 3.581315 0.721274 4.965 6.86e-07
##
## (Intercept) **
## tenure ***
## MonthlyCharges
## TotalCharges
## gender
## SeniorCitizen ***
## Partner
## Dependents
## PhoneService
## MultipleLines **
## InternetService.xFiber.optic *
## InternetService.xNo *
## OnlineSecurity
## OnlineBackup
## DeviceProtection
## TechSupport
## StreamingTV .
## StreamingMovies *
## Contract.xOne.year ***
## Contract.xTwo.year ***
## PaperlessBilling
## PaymentMethod.xCredit.card..automatic.
## PaymentMethod.xElectronic.check ***
## PaymentMethod.xMailed.check
## tenure_bin.x1.2.years
## tenure_bin.x2.3.years **
## tenure_bin.x3.4.years ***
## tenure_bin.x4.5.years ***
## tenure_bin.x5.6.years ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 4034.4 on 4893 degrees of freedom
## AIC: 4092.4
##
## Number of Fisher Scoring iterations: 6
We will be using stepAIC function for variable selection. It is a continpus process of adding and removing variables for us to get a subset of variables that will provide us the best performing logistic regression model.
logisticRegressionModel_2<- stepAIC(logisticRegressionModel_1, direction="both")
## Start: AIC=4092.35
## Churn ~ tenure + MonthlyCharges + TotalCharges + gender + SeniorCitizen +
## Partner + Dependents + PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## TechSupport + StreamingTV + StreamingMovies + Contract.xOne.year +
## Contract.xTwo.year + PaperlessBilling + PaymentMethod.xCredit.card..automatic. +
## PaymentMethod.xElectronic.check + PaymentMethod.xMailed.check +
## tenure_bin.x1.2.years + tenure_bin.x2.3.years + tenure_bin.x3.4.years +
## tenure_bin.x4.5.years + tenure_bin.x5.6.years
##
## Df Deviance AIC
## - Partner 1 4034.4 4090.4
## - TechSupport 1 4034.4 4090.4
## - PaymentMethod.xMailed.check 1 4034.4 4090.4
## - PaymentMethod.xCredit.card..automatic. 1 4034.4 4090.4
## - TotalCharges 1 4034.4 4090.4
## - OnlineSecurity 1 4034.5 4090.5
## - gender 1 4034.7 4090.7
## - OnlineBackup 1 4034.7 4090.7
## - PhoneService 1 4035.0 4091.0
## - Dependents 1 4035.0 4091.0
## - PaperlessBilling 1 4035.1 4091.1
## - MonthlyCharges 1 4036.2 4092.2
## - tenure_bin.x1.2.years 1 4036.2 4092.2
## - DeviceProtection 1 4036.3 4092.3
## <none> 4034.4 4092.4
## - StreamingTV 1 4038.0 4094.0
## - StreamingMovies 1 4038.8 4094.8
## - InternetService.xNo 1 4039.2 4095.2
## - InternetService.xFiber.optic 1 4039.4 4095.4
## - tenure_bin.x2.3.years 1 4042.3 4098.3
## - MultipleLines 1 4043.0 4099.0
## - PaymentMethod.xElectronic.check 1 4045.5 4101.5
## - SeniorCitizen 1 4050.6 4106.6
## - tenure_bin.x3.4.years 1 4052.9 4108.9
## - tenure_bin.x5.6.years 1 4059.2 4115.2
## - tenure_bin.x4.5.years 1 4059.4 4115.4
## - Contract.xOne.year 1 4066.0 4122.0
## - tenure 1 4088.6 4144.6
## - Contract.xTwo.year 1 4091.7 4147.7
##
## Step: AIC=4090.36
## Churn ~ tenure + MonthlyCharges + TotalCharges + gender + SeniorCitizen +
## Dependents + PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## TechSupport + StreamingTV + StreamingMovies + Contract.xOne.year +
## Contract.xTwo.year + PaperlessBilling + PaymentMethod.xCredit.card..automatic. +
## PaymentMethod.xElectronic.check + PaymentMethod.xMailed.check +
## tenure_bin.x1.2.years + tenure_bin.x2.3.years + tenure_bin.x3.4.years +
## tenure_bin.x4.5.years + tenure_bin.x5.6.years
##
## Df Deviance AIC
## - TechSupport 1 4034.4 4088.4
## - PaymentMethod.xMailed.check 1 4034.4 4088.4
## - PaymentMethod.xCredit.card..automatic. 1 4034.4 4088.4
## - TotalCharges 1 4034.4 4088.4
## - OnlineSecurity 1 4034.5 4088.5
## - OnlineBackup 1 4034.7 4088.7
## - gender 1 4034.7 4088.7
## - PhoneService 1 4035.0 4089.0
## - Dependents 1 4035.1 4089.1
## - PaperlessBilling 1 4035.1 4089.1
## - MonthlyCharges 1 4036.2 4090.2
## - tenure_bin.x1.2.years 1 4036.2 4090.2
## - DeviceProtection 1 4036.3 4090.3
## <none> 4034.4 4090.4
## - StreamingTV 1 4038.0 4092.0
## + Partner 1 4034.4 4092.4
## - StreamingMovies 1 4038.8 4092.8
## - InternetService.xNo 1 4039.2 4093.2
## - InternetService.xFiber.optic 1 4039.4 4093.4
## - tenure_bin.x2.3.years 1 4042.3 4096.3
## - MultipleLines 1 4043.0 4097.0
## - PaymentMethod.xElectronic.check 1 4045.5 4099.5
## - SeniorCitizen 1 4050.8 4104.8
## - tenure_bin.x3.4.years 1 4052.9 4106.9
## - tenure_bin.x5.6.years 1 4059.2 4113.2
## - tenure_bin.x4.5.years 1 4059.4 4113.4
## - Contract.xOne.year 1 4066.0 4120.0
## - tenure 1 4088.9 4142.9
## - Contract.xTwo.year 1 4091.8 4145.8
##
## Step: AIC=4088.36
## Churn ~ tenure + MonthlyCharges + TotalCharges + gender + SeniorCitizen +
## Dependents + PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xCredit.card..automatic. +
## PaymentMethod.xElectronic.check + PaymentMethod.xMailed.check +
## tenure_bin.x1.2.years + tenure_bin.x2.3.years + tenure_bin.x3.4.years +
## tenure_bin.x4.5.years + tenure_bin.x5.6.years
##
## Df Deviance AIC
## - PaymentMethod.xMailed.check 1 4034.4 4086.4
## - PaymentMethod.xCredit.card..automatic. 1 4034.4 4086.4
## - TotalCharges 1 4034.4 4086.4
## - gender 1 4034.7 4086.7
## - OnlineSecurity 1 4034.7 4086.7
## - OnlineBackup 1 4035.0 4087.0
## - Dependents 1 4035.1 4087.1
## - PaperlessBilling 1 4035.1 4087.1
## - tenure_bin.x1.2.years 1 4036.2 4088.2
## - PhoneService 1 4036.3 4088.3
## <none> 4034.4 4088.4
## + TechSupport 1 4034.4 4090.4
## + Partner 1 4034.4 4090.4
## - DeviceProtection 1 4038.9 4090.9
## - MonthlyCharges 1 4041.7 4093.7
## - tenure_bin.x2.3.years 1 4042.3 4094.3
## - PaymentMethod.xElectronic.check 1 4045.5 4097.5
## - StreamingTV 1 4046.7 4098.7
## - StreamingMovies 1 4049.6 4101.6
## - SeniorCitizen 1 4050.8 4102.8
## - InternetService.xNo 1 4051.0 4103.0
## - tenure_bin.x3.4.years 1 4053.0 4105.0
## - InternetService.xFiber.optic 1 4056.3 4108.3
## - MultipleLines 1 4056.7 4108.7
## - tenure_bin.x5.6.years 1 4059.2 4111.2
## - tenure_bin.x4.5.years 1 4059.4 4111.4
## - Contract.xOne.year 1 4066.0 4118.0
## - tenure 1 4088.9 4140.9
## - Contract.xTwo.year 1 4092.0 4144.0
##
## Step: AIC=4086.37
## Churn ~ tenure + MonthlyCharges + TotalCharges + gender + SeniorCitizen +
## Dependents + PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xCredit.card..automatic. +
## PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - PaymentMethod.xCredit.card..automatic. 1 4034.4 4084.4
## - TotalCharges 1 4034.4 4084.4
## - gender 1 4034.7 4084.7
## - OnlineSecurity 1 4034.7 4084.7
## - OnlineBackup 1 4035.0 4085.0
## - Dependents 1 4035.1 4085.1
## - PaperlessBilling 1 4035.2 4085.2
## - tenure_bin.x1.2.years 1 4036.2 4086.2
## - PhoneService 1 4036.3 4086.3
## <none> 4034.4 4086.4
## + PaymentMethod.xMailed.check 1 4034.4 4088.4
## + TechSupport 1 4034.4 4088.4
## + Partner 1 4034.4 4088.4
## - DeviceProtection 1 4038.9 4088.9
## - MonthlyCharges 1 4041.7 4091.7
## - tenure_bin.x2.3.years 1 4042.3 4092.3
## - StreamingTV 1 4046.7 4096.7
## - StreamingMovies 1 4049.7 4099.7
## - SeniorCitizen 1 4050.9 4100.9
## - InternetService.xNo 1 4051.1 4101.1
## - PaymentMethod.xElectronic.check 1 4051.9 4101.9
## - tenure_bin.x3.4.years 1 4053.0 4103.0
## - InternetService.xFiber.optic 1 4056.4 4106.4
## - MultipleLines 1 4056.8 4106.8
## - tenure_bin.x5.6.years 1 4059.2 4109.2
## - tenure_bin.x4.5.years 1 4059.5 4109.5
## - Contract.xOne.year 1 4066.1 4116.1
## - tenure 1 4089.4 4139.4
## - Contract.xTwo.year 1 4092.0 4142.0
##
## Step: AIC=4084.43
## Churn ~ tenure + MonthlyCharges + TotalCharges + gender + SeniorCitizen +
## Dependents + PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - TotalCharges 1 4034.5 4082.5
## - gender 1 4034.8 4082.8
## - OnlineSecurity 1 4034.8 4082.8
## - OnlineBackup 1 4035.1 4083.1
## - Dependents 1 4035.2 4083.2
## - PaperlessBilling 1 4035.2 4083.2
## - tenure_bin.x1.2.years 1 4036.3 4084.3
## - PhoneService 1 4036.4 4084.4
## <none> 4034.4 4084.4
## + PaymentMethod.xCredit.card..automatic. 1 4034.4 4086.4
## + PaymentMethod.xMailed.check 1 4034.4 4086.4
## + TechSupport 1 4034.4 4086.4
## + Partner 1 4034.4 4086.4
## - DeviceProtection 1 4039.0 4087.0
## - MonthlyCharges 1 4041.7 4089.7
## - tenure_bin.x2.3.years 1 4042.3 4090.3
## - StreamingTV 1 4046.8 4094.8
## - StreamingMovies 1 4049.7 4097.7
## - SeniorCitizen 1 4051.0 4099.0
## - InternetService.xNo 1 4051.2 4099.2
## - tenure_bin.x3.4.years 1 4053.0 4101.0
## - PaymentMethod.xElectronic.check 1 4054.3 4102.3
## - InternetService.xFiber.optic 1 4056.4 4104.4
## - MultipleLines 1 4056.9 4104.9
## - tenure_bin.x5.6.years 1 4059.3 4107.3
## - tenure_bin.x4.5.years 1 4059.5 4107.5
## - Contract.xOne.year 1 4066.1 4114.1
## - tenure 1 4089.4 4137.4
## - Contract.xTwo.year 1 4092.1 4140.1
##
## Step: AIC=4082.48
## Churn ~ tenure + MonthlyCharges + gender + SeniorCitizen + Dependents +
## PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - gender 1 4034.8 4080.8
## - OnlineSecurity 1 4034.8 4080.8
## - OnlineBackup 1 4035.1 4081.1
## - Dependents 1 4035.2 4081.2
## - PaperlessBilling 1 4035.2 4081.2
## - tenure_bin.x1.2.years 1 4036.3 4082.3
## - PhoneService 1 4036.4 4082.4
## <none> 4034.5 4082.5
## + TotalCharges 1 4034.4 4084.4
## + PaymentMethod.xCredit.card..automatic. 1 4034.4 4084.4
## + PaymentMethod.xMailed.check 1 4034.5 4084.5
## + TechSupport 1 4034.5 4084.5
## + Partner 1 4034.5 4084.5
## - DeviceProtection 1 4039.0 4085.0
## - MonthlyCharges 1 4041.8 4087.8
## - tenure_bin.x2.3.years 1 4042.4 4088.4
## - StreamingTV 1 4046.8 4092.8
## - StreamingMovies 1 4049.7 4095.7
## - SeniorCitizen 1 4051.0 4097.0
## - InternetService.xNo 1 4051.3 4097.3
## - tenure_bin.x3.4.years 1 4053.3 4099.3
## - PaymentMethod.xElectronic.check 1 4054.3 4100.3
## - InternetService.xFiber.optic 1 4056.4 4102.4
## - MultipleLines 1 4057.0 4103.0
## - tenure_bin.x4.5.years 1 4060.1 4106.1
## - tenure_bin.x5.6.years 1 4060.4 4106.4
## - Contract.xOne.year 1 4066.2 4112.2
## - Contract.xTwo.year 1 4093.3 4139.3
## - tenure 1 4097.4 4143.4
##
## Step: AIC=4080.81
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + Dependents +
## PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + DeviceProtection +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - OnlineSecurity 1 4035.1 4079.1
## - PaperlessBilling 1 4035.4 4079.4
## - OnlineBackup 1 4035.5 4079.5
## - Dependents 1 4035.6 4079.6
## - tenure_bin.x1.2.years 1 4036.7 4080.7
## - PhoneService 1 4036.7 4080.7
## <none> 4034.8 4080.8
## + gender 1 4034.5 4082.5
## + TotalCharges 1 4034.8 4082.8
## + PaymentMethod.xCredit.card..automatic. 1 4034.8 4082.8
## + PaymentMethod.xMailed.check 1 4034.8 4082.8
## + TechSupport 1 4034.8 4082.8
## + Partner 1 4034.8 4082.8
## - DeviceProtection 1 4039.4 4083.4
## - MonthlyCharges 1 4042.1 4086.1
## - tenure_bin.x2.3.years 1 4042.8 4086.8
## - StreamingTV 1 4047.2 4091.2
## - StreamingMovies 1 4050.2 4094.2
## - SeniorCitizen 1 4051.4 4095.4
## - InternetService.xNo 1 4051.7 4095.7
## - tenure_bin.x3.4.years 1 4053.7 4097.7
## - PaymentMethod.xElectronic.check 1 4054.6 4098.6
## - InternetService.xFiber.optic 1 4056.9 4100.9
## - MultipleLines 1 4057.4 4101.4
## - tenure_bin.x4.5.years 1 4060.5 4104.5
## - tenure_bin.x5.6.years 1 4060.8 4104.8
## - Contract.xOne.year 1 4066.5 4110.5
## - Contract.xTwo.year 1 4093.6 4137.6
## - tenure 1 4097.9 4141.9
##
## Step: AIC=4079.15
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + Dependents +
## PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineBackup + DeviceProtection + StreamingTV +
## StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - PaperlessBilling 1 4035.8 4077.8
## - Dependents 1 4035.9 4077.9
## - OnlineBackup 1 4036.7 4078.7
## - tenure_bin.x1.2.years 1 4037.0 4079.0
## <none> 4035.1 4079.1
## + OnlineSecurity 1 4034.8 4080.8
## + gender 1 4034.8 4080.8
## + TechSupport 1 4034.9 4080.9
## + TotalCharges 1 4035.1 4081.1
## + PaymentMethod.xCredit.card..automatic. 1 4035.1 4081.1
## + PaymentMethod.xMailed.check 1 4035.1 4081.1
## + Partner 1 4035.1 4081.1
## - PhoneService 1 4039.9 4081.9
## - DeviceProtection 1 4042.8 4084.8
## - tenure_bin.x2.3.years 1 4043.2 4085.2
## - SeniorCitizen 1 4051.7 4093.7
## - MonthlyCharges 1 4053.0 4095.0
## - tenure_bin.x3.4.years 1 4054.1 4096.1
## - PaymentMethod.xElectronic.check 1 4055.0 4097.0
## - StreamingTV 1 4059.0 4101.0
## - tenure_bin.x4.5.years 1 4061.0 4103.0
## - tenure_bin.x5.6.years 1 4061.3 4103.3
## - StreamingMovies 1 4064.3 4106.3
## - InternetService.xNo 1 4065.8 4107.8
## - Contract.xOne.year 1 4066.7 4108.7
## - MultipleLines 1 4067.6 4109.6
## - InternetService.xFiber.optic 1 4084.4 4126.4
## - Contract.xTwo.year 1 4093.7 4135.7
## - tenure 1 4098.7 4140.7
##
## Step: AIC=4077.77
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + Dependents +
## PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineBackup + DeviceProtection + StreamingTV +
## StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - Dependents 1 4036.5 4076.5
## - OnlineBackup 1 4037.2 4077.2
## - tenure_bin.x1.2.years 1 4037.7 4077.7
## <none> 4035.8 4077.8
## + PaperlessBilling 1 4035.1 4079.1
## + OnlineSecurity 1 4035.4 4079.4
## + TechSupport 1 4035.5 4079.5
## + gender 1 4035.6 4079.6
## + TotalCharges 1 4035.7 4079.7
## + PaymentMethod.xCredit.card..automatic. 1 4035.7 4079.7
## + PaymentMethod.xMailed.check 1 4035.7 4079.7
## + Partner 1 4035.8 4079.8
## - PhoneService 1 4040.4 4080.4
## - DeviceProtection 1 4043.3 4083.3
## - tenure_bin.x2.3.years 1 4043.9 4083.9
## - SeniorCitizen 1 4052.2 4092.2
## - MonthlyCharges 1 4053.5 4093.5
## - tenure_bin.x3.4.years 1 4054.9 4094.9
## - PaymentMethod.xElectronic.check 1 4055.7 4095.7
## - StreamingTV 1 4059.3 4099.3
## - tenure_bin.x4.5.years 1 4061.9 4101.9
## - tenure_bin.x5.6.years 1 4062.1 4102.1
## - StreamingMovies 1 4064.6 4104.6
## - InternetService.xNo 1 4066.2 4106.2
## - Contract.xOne.year 1 4067.2 4107.2
## - MultipleLines 1 4067.9 4107.9
## - InternetService.xFiber.optic 1 4084.7 4124.7
## - Contract.xTwo.year 1 4094.3 4134.3
## - tenure 1 4099.6 4139.6
##
## Step: AIC=4076.54
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + PhoneService +
## MultipleLines + InternetService.xFiber.optic + InternetService.xNo +
## OnlineBackup + DeviceProtection + StreamingTV + StreamingMovies +
## Contract.xOne.year + Contract.xTwo.year + PaymentMethod.xElectronic.check +
## tenure_bin.x1.2.years + tenure_bin.x2.3.years + tenure_bin.x3.4.years +
## tenure_bin.x4.5.years + tenure_bin.x5.6.years
##
## Df Deviance AIC
## - OnlineBackup 1 4038.0 4076.0
## - tenure_bin.x1.2.years 1 4038.5 4076.5
## <none> 4036.5 4076.5
## + Dependents 1 4035.8 4077.8
## + PaperlessBilling 1 4035.9 4077.9
## + OnlineSecurity 1 4036.2 4078.2
## + TechSupport 1 4036.3 4078.3
## + gender 1 4036.4 4078.4
## + Partner 1 4036.4 4078.4
## + TotalCharges 1 4036.5 4078.5
## + PaymentMethod.xCredit.card..automatic. 1 4036.5 4078.5
## + PaymentMethod.xMailed.check 1 4036.5 4078.5
## - PhoneService 1 4041.3 4079.3
## - DeviceProtection 1 4044.2 4082.2
## - tenure_bin.x2.3.years 1 4044.7 4082.7
## - MonthlyCharges 1 4054.4 4092.4
## - SeniorCitizen 1 4054.5 4092.5
## - tenure_bin.x3.4.years 1 4055.7 4093.7
## - PaymentMethod.xElectronic.check 1 4056.6 4094.6
## - StreamingTV 1 4060.3 4098.3
## - tenure_bin.x4.5.years 1 4062.8 4100.8
## - tenure_bin.x5.6.years 1 4063.0 4101.0
## - StreamingMovies 1 4065.7 4103.7
## - InternetService.xNo 1 4067.3 4105.3
## - Contract.xOne.year 1 4068.5 4106.5
## - MultipleLines 1 4068.8 4106.8
## - InternetService.xFiber.optic 1 4086.1 4124.1
## - Contract.xTwo.year 1 4096.3 4134.3
## - tenure 1 4100.8 4138.8
##
## Step: AIC=4076.01
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + PhoneService +
## MultipleLines + InternetService.xFiber.optic + InternetService.xNo +
## DeviceProtection + StreamingTV + StreamingMovies + Contract.xOne.year +
## Contract.xTwo.year + PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years
##
## Df Deviance AIC
## - tenure_bin.x1.2.years 1 4039.9 4075.9
## <none> 4038.0 4076.0
## + OnlineBackup 1 4036.5 4076.5
## + OnlineSecurity 1 4036.8 4076.8
## + Dependents 1 4037.2 4077.2
## - PhoneService 1 4041.3 4077.3
## + PaperlessBilling 1 4037.5 4077.5
## + gender 1 4037.8 4077.8
## + Partner 1 4037.9 4077.9
## + TotalCharges 1 4038.0 4078.0
## + PaymentMethod.xCredit.card..automatic. 1 4038.0 4078.0
## + PaymentMethod.xMailed.check 1 4038.0 4078.0
## + TechSupport 1 4038.0 4078.0
## - DeviceProtection 1 4044.3 4080.3
## - tenure_bin.x2.3.years 1 4046.2 4082.2
## - SeniorCitizen 1 4056.3 4092.3
## - tenure_bin.x3.4.years 1 4057.4 4093.4
## - MonthlyCharges 1 4057.4 4093.4
## - PaymentMethod.xElectronic.check 1 4058.5 4094.5
## - StreamingTV 1 4062.1 4098.1
## - tenure_bin.x4.5.years 1 4064.7 4100.7
## - tenure_bin.x5.6.years 1 4064.9 4100.9
## - StreamingMovies 1 4068.4 4104.4
## - MultipleLines 1 4069.4 4105.4
## - InternetService.xNo 1 4071.0 4107.0
## - Contract.xOne.year 1 4071.5 4107.5
## - InternetService.xFiber.optic 1 4098.4 4134.4
## - Contract.xTwo.year 1 4101.0 4137.0
## - tenure 1 4102.2 4138.2
##
## Step: AIC=4075.92
## Churn ~ tenure + MonthlyCharges + SeniorCitizen + PhoneService +
## MultipleLines + InternetService.xFiber.optic + InternetService.xNo +
## DeviceProtection + StreamingTV + StreamingMovies + Contract.xOne.year +
## Contract.xTwo.year + PaymentMethod.xElectronic.check + tenure_bin.x2.3.years +
## tenure_bin.x3.4.years + tenure_bin.x4.5.years + tenure_bin.x5.6.years
##
## Df Deviance AIC
## <none> 4039.9 4075.9
## + tenure_bin.x1.2.years 1 4038.0 4076.0
## + OnlineBackup 1 4038.5 4076.5
## + OnlineSecurity 1 4038.7 4076.7
## + Dependents 1 4039.2 4077.2
## - PhoneService 1 4043.2 4077.2
## + PaperlessBilling 1 4039.3 4077.3
## + gender 1 4039.7 4077.7
## + Partner 1 4039.8 4077.8
## + PaymentMethod.xCredit.card..automatic. 1 4039.9 4077.9
## + PaymentMethod.xMailed.check 1 4039.9 4077.9
## + TotalCharges 1 4039.9 4077.9
## + TechSupport 1 4039.9 4077.9
## - DeviceProtection 1 4046.2 4080.2
## - tenure_bin.x2.3.years 1 4048.3 4082.3
## - SeniorCitizen 1 4058.3 4092.3
## - MonthlyCharges 1 4059.5 4093.5
## - PaymentMethod.xElectronic.check 1 4060.6 4094.6
## - StreamingTV 1 4063.9 4097.9
## - StreamingMovies 1 4070.5 4104.5
## - tenure_bin.x3.4.years 1 4070.5 4104.5
## - MultipleLines 1 4071.0 4105.0
## - InternetService.xNo 1 4073.1 4107.1
## - Contract.xOne.year 1 4073.4 4107.4
## - tenure_bin.x5.6.years 1 4088.6 4122.6
## - tenure_bin.x4.5.years 1 4088.6 4122.6
## - InternetService.xFiber.optic 1 4100.8 4134.8
## - Contract.xTwo.year 1 4104.6 4138.6
## - tenure 1 4172.8 4206.8
summary(logisticRegressionModel_2)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges + SeniorCitizen +
## PhoneService + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + DeviceProtection + StreamingTV + StreamingMovies +
## Contract.xOne.year + Contract.xTwo.year + PaymentMethod.xElectronic.check +
## tenure_bin.x2.3.years + tenure_bin.x3.4.years + tenure_bin.x4.5.years +
## tenure_bin.x5.6.years, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0192 -0.6820 -0.2846 0.6391 3.1572
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.22546 0.43105 -9.803 < 2e-16 ***
## tenure -1.94005 0.17406 -11.146 < 2e-16 ***
## MonthlyCharges -1.39722 0.31817 -4.391 1.13e-05 ***
## SeniorCitizen 0.43003 0.10031 4.287 1.81e-05 ***
## PhoneService 0.48815 0.26831 1.819 0.06886 .
## MultipleLines 0.60072 0.10849 5.537 3.07e-08 ***
## InternetService.xFiber.optic 2.08003 0.27135 7.665 1.78e-14 ***
## InternetService.xNo -2.00103 0.34864 -5.739 9.50e-09 ***
## DeviceProtection 0.27377 0.10984 2.492 0.01269 *
## StreamingTV 0.70389 0.14465 4.866 1.14e-06 ***
## StreamingMovies 0.79492 0.14529 5.471 4.47e-08 ***
## Contract.xOne.year -0.72829 0.12903 -5.644 1.66e-08 ***
## Contract.xTwo.year -1.55603 0.21379 -7.278 3.38e-13 ***
## PaymentMethod.xElectronic.check 0.38119 0.08339 4.571 4.85e-06 ***
## tenure_bin.x2.3.years 0.55021 0.18992 2.897 0.00377 **
## tenure_bin.x3.4.years 1.46331 0.26611 5.499 3.82e-08 ***
## tenure_bin.x4.5.years 2.30721 0.33517 6.884 5.83e-12 ***
## tenure_bin.x5.6.years 2.87120 0.41717 6.883 5.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 4039.9 on 4904 degrees of freedom
## AIC: 4075.9
##
## Number of Fisher Scoring iterations: 6
Also, we can use Variance Inflation Factor (vif) to remove redundant variables or predictors that have high multicollinearity between them. Multicollinearity exists between two or more predictor variables when they are highly related to each other. As a result, then it becomes tough for us to analyze and understand the impact of independent variable on the dependent variables.
A predictor getting VIF lesser than 2 are considered safe and it can be understood that the predictor is not correlated with other predictor variables in the dataset.
The higher the VIF is, the more signifcant the correlation of the predictor variable is with other predictor variables.
vif(logisticRegressionModel_2)
## tenure MonthlyCharges
## 14.765303 52.984214
## SeniorCitizen PhoneService
## 1.085076 4.168944
## MultipleLines InternetService.xFiber.optic
## 1.898677 11.696648
## InternetService.xNo DeviceProtection
## 7.357807 1.721037
## StreamingTV StreamingMovies
## 3.307173 3.357872
## Contract.xOne.year Contract.xTwo.year
## 1.373647 1.398162
## PaymentMethod.xElectronic.check tenure_bin.x2.3.years
## 1.134441 2.424132
## tenure_bin.x3.4.years tenure_bin.x4.5.years
## 3.756845 6.281268
## tenure_bin.x5.6.years
## 8.415471
We are now removing the DeviceProtection variable since it has high p-value
logisticRegressionModel_3 <-glm(formula = Churn ~ tenure + MonthlyCharges + SeniorCitizen +
Partner + InternetService.xFiber.optic + InternetService.xNo +
OnlineSecurity + OnlineBackup + TechSupport +
StreamingTV + Contract.xOne.year + Contract.xTwo.year + PaperlessBilling +
PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
tenure_bin.x5.6.years, family = "binomial", data = train)
summary(logisticRegressionModel_3)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges + SeniorCitizen +
## Partner + InternetService.xFiber.optic + InternetService.xNo +
## OnlineSecurity + OnlineBackup + TechSupport + StreamingTV +
## Contract.xOne.year + Contract.xTwo.year + PaperlessBilling +
## PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
## tenure_bin.x5.6.years, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9480 -0.6785 -0.2977 0.6795 3.1342
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.13909 0.19859 -5.736 9.71e-09 ***
## tenure -0.85816 0.07363 -11.654 < 2e-16 ***
## MonthlyCharges 0.20547 0.15646 1.313 0.189116
## SeniorCitizen 0.46288 0.09948 4.653 3.27e-06 ***
## Partner -0.04999 0.08506 -0.588 0.556713
## InternetService.xFiber.optic 0.64672 0.19855 3.257 0.001125 **
## InternetService.xNo -0.88288 0.19095 -4.624 3.77e-06 ***
## OnlineSecurity -0.41112 0.10500 -3.915 9.02e-05 ***
## OnlineBackup -0.15733 0.09571 -1.644 0.100222
## TechSupport -0.30170 0.10639 -2.836 0.004573 **
## StreamingTV 0.26649 0.11462 2.325 0.020066 *
## Contract.xOne.year -0.70924 0.12702 -5.584 2.36e-08 ***
## Contract.xTwo.year -1.52795 0.21586 -7.078 1.46e-12 ***
## PaperlessBilling -0.11478 0.12215 -0.940 0.347392
## PaymentMethod.xElectronic.check 0.41434 0.08247 5.024 5.05e-07 ***
## tenure_bin.x1.2.years -0.38893 0.10330 -3.765 0.000166 ***
## tenure_bin.x5.6.years 0.45586 0.19036 2.395 0.016631 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 4108.1 on 4905 degrees of freedom
## AIC: 4142.1
##
## Number of Fisher Scoring iterations: 6
vif(logisticRegressionModel_3)
## tenure MonthlyCharges
## 2.609193 12.829118
## SeniorCitizen Partner
## 1.082235 1.149315
## InternetService.xFiber.optic InternetService.xNo
## 6.401652 2.252302
## OnlineSecurity OnlineBackup
## 1.223534 1.307788
## TechSupport StreamingTV
## 1.303619 2.104326
## Contract.xOne.year Contract.xTwo.year
## 1.347368 1.434496
## PaperlessBilling PaymentMethod.xElectronic.check
## 1.008861 1.130146
## tenure_bin.x1.2.years tenure_bin.x5.6.years
## 1.034265 1.787422
We are now removing the StreamingTV variable since it has high p-value
logisticRegressionModel_4 <- glm(formula = Churn ~ tenure + MonthlyCharges + SeniorCitizen +
Partner + InternetService.xFiber.optic + InternetService.xNo +
OnlineSecurity + OnlineBackup + TechSupport +
Contract.xOne.year + Contract.xTwo.year + PaperlessBilling +
PaymentMethod.xElectronic.check + tenure_bin.x1.2.years +
tenure_bin.x5.6.years, family = "binomial", data = train)
summary(logisticRegressionModel_4)
##
## Call:
## glm(formula = Churn ~ tenure + MonthlyCharges + SeniorCitizen +
## Partner + InternetService.xFiber.optic + InternetService.xNo +
## OnlineSecurity + OnlineBackup + TechSupport + Contract.xOne.year +
## Contract.xTwo.year + PaperlessBilling + PaymentMethod.xElectronic.check +
## tenure_bin.x1.2.years + tenure_bin.x5.6.years, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9217 -0.6749 -0.2987 0.6852 3.1315
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.93249 0.17679 -5.274 1.33e-07 ***
## tenure -0.85359 0.07355 -11.606 < 2e-16 ***
## MonthlyCharges 0.43511 0.12221 3.560 0.000370 ***
## SeniorCitizen 0.45674 0.09936 4.597 4.29e-06 ***
## Partner -0.05066 0.08500 -0.596 0.551210
## InternetService.xFiber.optic 0.41937 0.17221 2.435 0.014882 *
## InternetService.xNo -0.75129 0.18346 -4.095 4.22e-05 ***
## OnlineSecurity -0.45903 0.10297 -4.458 8.28e-06 ***
## OnlineBackup -0.19495 0.09431 -2.067 0.038723 *
## TechSupport -0.33342 0.10546 -3.161 0.001570 **
## Contract.xOne.year -0.69931 0.12685 -5.513 3.53e-08 ***
## Contract.xTwo.year -1.51470 0.21570 -7.022 2.19e-12 ***
## PaperlessBilling -0.11570 0.12215 -0.947 0.343557
## PaymentMethod.xElectronic.check 0.42527 0.08228 5.168 2.36e-07 ***
## tenure_bin.x1.2.years -0.38750 0.10315 -3.757 0.000172 ***
## tenure_bin.x5.6.years 0.44641 0.19012 2.348 0.018870 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5699.5 on 4921 degrees of freedom
## Residual deviance: 4113.5 on 4906 degrees of freedom
## AIC: 4145.5
##
## Number of Fisher Scoring iterations: 6
vif(logisticRegressionModel_4)
## tenure MonthlyCharges
## 2.605061 7.796109
## SeniorCitizen Partner
## 1.081037 1.149391
## InternetService.xFiber.optic InternetService.xNo
## 4.821043 2.081572
## OnlineSecurity OnlineBackup
## 1.178617 1.271566
## TechSupport Contract.xOne.year
## 1.284166 1.346749
## Contract.xTwo.year PaperlessBilling
## 1.433403 1.008819
## PaymentMethod.xElectronic.check tenure_bin.x1.2.years
## 1.126829 1.034175
## tenure_bin.x5.6.years
## 1.784223
logisticRegressionModel_3 has the best set of significant variables, let’s use this model for prediction first
logisticRegressionModel_final <- logisticRegressionModel_3
Model Evaluation using the validation data:
logisticRegression_Pred_result <- predict(logisticRegressionModel_final, type = "response", newdata = validation[,-24])
summary(logisticRegression_Pred_result)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00676 0.05082 0.20036 0.26891 0.44784 0.86188
validation$prob <- logisticRegression_Pred_result
Lets use probability cutoff of 50%.
pred_churn <- factor(ifelse(logisticRegression_Pred_result >= 0.50, "Yes", "No"))
actual_churn <- factor(ifelse(validation$Churn==1,"Yes","No"))
table(actual_churn,pred_churn)
## pred_churn
## actual_churn No Yes
## No 1399 150
## Yes 281 280
Let’s find the Accuracy, Sensitivity and Specificity of our model using when the cutoff is at 50%
cutoff_churn <- factor(ifelse(logisticRegression_Pred_result >=0.50, "Yes", "No"))
conf_final <- confusionMatrix(cutoff_churn, actual_churn, positive = "Yes")
accuracy <- conf_final$overall[1]
sensitivity <- conf_final$byClass[1]
specificity <- conf_final$byClass[2]
accuracy
## Accuracy
## 0.7957346
sensitivity
## Sensitivity
## 0.4991087
specificity
## Specificity
## 0.9031633
As shown above, when we use a cutoff of 0.50, we are yielding a good accuracy and specificity, however the sensitivity is very less. Thus, we need to be able to find the optimal probability cutoff which will yield maximum accuracy, sensitivity, and specificity
perform_fn <- function(cutoff)
{
predicted_churn <- factor(ifelse(logisticRegression_Pred_result >= cutoff, "Yes", "No"))
conf <- confusionMatrix(predicted_churn, actual_churn, positive = "Yes")
accuray <- conf$overall[1]
sensitivity <- conf$byClass[1]
specificity <- conf$byClass[2]
out <- t(as.matrix(c(sensitivity, specificity, accuray)))
colnames(out) <- c("Sensitivity", "Specificity", "Accuracy")
return(out)
}
options(repr.plot.width =8, repr.plot.height =6)
summary(logisticRegression_Pred_result)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00676 0.05082 0.20036 0.26891 0.44784 0.86188
s = seq(0.01,0.80,length=100)
OUT = matrix(0,100,3)
for(i in 1:100)
{
OUT[i,] = perform_fn(s[i])
}
plot(s, OUT[,1],xlab="Cutoff",ylab="Value",cex.lab=1.5,cex.axis=1.5,ylim=c(0,1),
type="l",lwd=2,axes=FALSE,col=2)
axis(1,seq(0,1,length=5),seq(0,1,length=5),cex.lab=1.5)
axis(2,seq(0,1,length=5),seq(0,1,length=5),cex.lab=1.5)
lines(s,OUT[,2],col="darkgreen",lwd=2)
lines(s,OUT[,3],col=4,lwd=2)
box()
legend("bottom",col=c(2,"darkgreen",4,"darkred"),text.font =3,inset = 0.02,
box.lty=0,cex = 0.8,
lwd=c(2,2,2,2),c("Sensitivity","Specificity","Accuracy"))
abline(v = 0.32, col="red", lwd=1, lty=2)
axis(1, at = seq(0.1, 1, by = 0.1))
Choosing a cutoff value of 0.32% for our final Logistisc Regression model, where the three curves of accuracy, specificty and sensitivity meet
cutoff_churn <- factor(ifelse(logisticRegression_Pred_result >=0.32, "Yes", "No"))
conf_final <- confusionMatrix(cutoff_churn, actual_churn, positive = "Yes")
logisticRegression_accuracy <- conf_final$overall[1]
logisticRegression_sensitivity <- conf_final$byClass[1]
logisticRegression_specificity <- conf_final$byClass[2]
logisticRegression_accuracy
## Accuracy
## 0.7701422
logisticRegression_sensitivity
## Sensitivity
## 0.7557932
logisticRegression_specificity
## Specificity
## 0.7753389
With a cutoff probability value of 0.32, we are getting better values of accuracy, sensitivity and specificity in the validation data.
Preparing the dataset for training and testing
set.seed(123)
data_final$Churn <- as.factor(data_final$Churn)
indices = sample.split(data_final$Churn, SplitRatio = 0.7)
train = data_final[indices,]
validation = data_final[!(indices),]
Training the Decision Tree model using all the predictor variables & performing prediction on the validation dataset
options(repr.plot.width = 10, repr.plot.height = 8)
library(rpart)
library(rpart.plot)
#Training The Decision Tree Model
Dtree_model = rpart(Churn ~., data = train, method = "class")
summary(Dtree_model)
## Call:
## rpart(formula = Churn ~ ., data = train, method = "class")
## n= 4922
##
## CP nsplit rel error xerror xstd
## 1 0.1127676 0 1.0000000 1.0000000 0.02369296
## 2 0.0100000 2 0.7744648 0.7752294 0.02169290
##
## Variable importance
## tenure TotalCharges
## 27 24
## InternetService.xFiber.optic MonthlyCharges
## 16 15
## PaymentMethod.xElectronic.check MultipleLines
## 5 5
## InternetService.xNo Partner
## 4 2
## PaymentMethod.xMailed.check
## 1
##
## Node number 1: 4922 observations, complexity param=0.1127676
## predicted class=0 expected loss=0.2657456 P(node) =1
## class counts: 3614 1308
## probabilities: 0.734 0.266
## left son=2 (3091 obs) right son=3 (1831 obs)
## Primary splits:
## tenure < -0.6079294 to the right, improve=230.9873, (0 missing)
## Contract.xTwo.year < 0.5 to the right, improve=179.3544, (0 missing)
## PaymentMethod.xElectronic.check < 0.5 to the left, improve=178.0319, (0 missing)
## InternetService.xFiber.optic < 0.5 to the left, improve=177.0621, (0 missing)
## TotalCharges < -0.8594936 to the right, improve=104.2756, (0 missing)
## Surrogate splits:
## TotalCharges < -0.6010313 to the right, agree=0.892, adj=0.708, (0 split)
## Partner < 0.5 to the right, agree=0.654, adj=0.069, (0 split)
## PaymentMethod.xMailed.check < 0.5 to the left, agree=0.643, adj=0.039, (0 split)
## MonthlyCharges < -1.519752 to the right, agree=0.629, adj=0.002, (0 split)
## PaymentMethod.xElectronic.check < 0.5 to the left, agree=0.628, adj=0.001, (0 split)
##
## Node number 2: 3091 observations
## predicted class=0 expected loss=0.1478486 P(node) =0.6279967
## class counts: 2634 457
## probabilities: 0.852 0.148
##
## Node number 3: 1831 observations, complexity param=0.1127676
## predicted class=0 expected loss=0.4647733 P(node) =0.3720033
## class counts: 980 851
## probabilities: 0.535 0.465
## left son=6 (1026 obs) right son=7 (805 obs)
## Primary splits:
## InternetService.xFiber.optic < 0.5 to the left, improve=137.11870, (0 missing)
## MonthlyCharges < 0.1188857 to the left, improve=116.66510, (0 missing)
## InternetService.xNo < 0.5 to the right, improve= 95.88223, (0 missing)
## PaymentMethod.xElectronic.check < 0.5 to the left, improve= 66.61460, (0 missing)
## PaymentMethod.xMailed.check < 0.5 to the right, improve= 43.01084, (0 missing)
## Surrogate splits:
## MonthlyCharges < 0.1188857 to the left, agree=0.973, adj=0.939, (0 split)
## PaymentMethod.xElectronic.check < 0.5 to the left, agree=0.701, adj=0.320, (0 split)
## TotalCharges < -0.8083967 to the left, agree=0.692, adj=0.299, (0 split)
## MultipleLines < 0.5 to the left, agree=0.691, adj=0.297, (0 split)
## InternetService.xNo < 0.5 to the right, agree=0.670, adj=0.248, (0 split)
##
## Node number 6: 1026 observations
## predicted class=0 expected loss=0.2933723 P(node) =0.2084518
## class counts: 725 301
## probabilities: 0.707 0.293
##
## Node number 7: 805 observations
## predicted class=1 expected loss=0.3167702 P(node) =0.1635514
## class counts: 255 550
## probabilities: 0.317 0.683
#Predicting the result of churning on the validation dataset using the trained model
DT_Pred_result <- predict(Dtree_model,type = "class", newdata = validation[,-24])
conf_final <- confusionMatrix(validation$Churn, DT_Pred_result)
DT_accuracy <- conf_final$overall[1]
DT_sensitivity <- conf_final$byClass[1]
DT_specificity <- conf_final$byClass[2]
DT_accuracy
## Accuracy
## 0.7810427
DT_sensitivity
## Sensitivity
## 0.804141
DT_specificity
## Specificity
## 0.6532508
Preparing the dataset for training and testing
library(randomForest)
set.seed(123)
data_final$Churn <- as.factor(data_final$Churn)
indices = sample.split(data_final$Churn, SplitRatio = 0.7)
train = data_final[indices,]
validation = data_final[!(indices),]
Model Training
model.rf <- randomForest(Churn ~ ., data=train, proximity=FALSE,importance = FALSE,
ntree=500,mtry=4, do.trace=FALSE)
model.rf
##
## Call:
## randomForest(formula = Churn ~ ., data = train, proximity = FALSE, importance = FALSE, ntree = 500, mtry = 4, do.trace = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 20.44%
## Confusion matrix:
## 0 1 class.error
## 0 3265 349 0.0965689
## 1 657 651 0.5022936
Let’s predict the result on the validation set and check the Confusion Matrix.
RandomForest_Pred_result <- predict(model.rf, newdata=validation[,-24])
table(RandomForest_Pred_result, validation$Churn)
##
## RandomForest_Pred_result 0 1
## 0 1414 298
## 1 135 263
conf_final <- confusionMatrix(validation$Churn, RandomForest_Pred_result)
RF_accuracy <- conf_final$overall[1]
RF_sensitivity <- conf_final$byClass[1]
RF_specificity <- conf_final$byClass[2]
RF_accuracy
## Accuracy
## 0.7947867
RF_sensitivity
## Sensitivity
## 0.8259346
RF_specificity
## Specificity
## 0.660804
Variable Importance Plot: As you can see below, this is the Variable Importance Plot which shows the most significant variable or attribute in descending order by mean decrease in Gini. The Mean decrease Gini shows how pure the nodes are at the end of the tree. The Higher the Gini Index is, the better the homogeneity of it.
varImpPlot(model.rf)
Let’s Check the AUC for all three machine learning models. The Area Under the Curve (AUC) is the ability of a model to differentiate between classes and is used as a summary of the ROC curve. The higher the AUC is, the better the performance of the model in differentiating between the positive and negative classes.
The ROC curve basically shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Models that result in curves closer to the top-left corner of the graph indicate a better performance.The closer the curve is to the 45-degree diagonal line of the ROC space, the less accurate the model is in terms on performance.
options(repr.plot.width =10, repr.plot.height = 8)
glm.roc <- roc(response = validation$Churn, predictor = as.numeric(logisticRegression_Pred_result))
DT.roc <- roc(response = validation$Churn, predictor = as.numeric(DT_Pred_result))
rf.roc <- roc(response = validation$Churn, predictor = as.numeric(RandomForest_Pred_result))
plot(glm.roc, legacy.axes = TRUE, print.auc.y = 1.0, print.auc = TRUE)
plot(DT.roc, col = "blue", add = TRUE, print.auc.y = 0.65, print.auc = TRUE)
plot(rf.roc, col = "red" , add = TRUE, print.auc.y = 0.85, print.auc = TRUE)
legend("bottom", c("Random Forest", "Decision Tree", "Logistic"),
lty = c(1,1), lwd = c(2, 2), col = c("red", "blue", "black"), cex = 0.75)
As we can see from the above graph, we can see that the the curve of Logistic Regression model curves closer to the top-left corner which in turn indicates the best performer among the other two models namely Decision Tree and Random Forest.
sprintf("Accuracy: %f", logisticRegression_accuracy)
## [1] "Accuracy: 0.770142"
sprintf("Sensitivity: %f", logisticRegression_sensitivity)
## [1] "Sensitivity: 0.755793"
sprintf("Specificity: %f", logisticRegression_specificity)
## [1] "Specificity: 0.775339"
sprintf("Accuracy: %f", DT_accuracy)
## [1] "Accuracy: 0.781043"
sprintf("Sensitivity: %f", DT_sensitivity)
## [1] "Sensitivity: 0.804141"
sprintf("Specificity: %f", DT_specificity)
## [1] "Specificity: 0.653251"
sprintf("Accuracy: %f", RF_accuracy)
## [1] "Accuracy: 0.794787"
sprintf("Sensitivity: %f", RF_sensitivity)
## [1] "Sensitivity: 0.825935"
sprintf("Specificity: %f", RF_specificity)
## [1] "Specificity: 0.660804"