This report describes prediction of Telco Churn using Machine Learning Algorithm. We investigated 4 algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is Telco Customer Churn hosted in Kaggle.
The dataset can be downloaded here.
Report Outline:
1.Data Extaction
2.Exploratory Data Analysis
3.Data Preparation
4.Modeling
5.Evaluation
6.Recommendation
The dataset is downloaded from Kaggle and saved in the data folder. We use read read.csv() function to read the dataset and put in telco_df data frame.
telco_df<-read.csv("data/Churn.csv")
To see the number of rows and column, we used dim() function.The dataset has 7043 rows and 21 columns.
dim(telco_df)
## [1] 7043 21
To find out the column names and types, we used str() function.
str(telco_df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
From the result above, we know the following:
# remove unnecessary columns
telco_df$customerID<-NULL
# change to factor for a target variable
telco_df$Churn<- as.factor(telco_df$Churn)
telco_df$gender<- as.factor(telco_df$gender)
telco_df$SeniorCitizen<- as.factor(telco_df$SeniorCitizen)
telco_df$Partner<- as.factor(telco_df$Partner)
telco_df$Dependents<- as.factor(telco_df$Dependents)
telco_df$PhoneService<- as.factor(telco_df$PhoneService)
telco_df$MultipleLines<- as.factor(telco_df$MultipleLines)
telco_df$InternetService<- as.factor(telco_df$InternetService)
telco_df$OnlineSecurity<- as.factor(telco_df$OnlineSecurity)
telco_df$OnlineBackup<- as.factor(telco_df$OnlineBackup)
telco_df$DeviceProtection<- as.factor(telco_df$DeviceProtection)
telco_df$TechSupport<- as.factor(telco_df$TechSupport)
telco_df$StreamingTV<- as.factor(telco_df$StreamingTV)
telco_df$StreamingMovies<- as.factor(telco_df$StreamingMovies)
telco_df$Contract<- as.factor(telco_df$Contract)
telco_df$PaperlessBilling<- as.factor(telco_df$PaperlessBilling)
telco_df$PaymentMethod<- as.factor(telco_df$PaymentMethod)
Analysis of a targer variable. Number of Yes(Y)and No(N) in Churn column.
#multiple graph
library(ggplot2)
p1 <- ggplot(data=telco_df, aes(x=Churn)) + geom_bar()+
geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count', position = position_stack(vjust=0.5), size = 5,color= "white") +
labs(title="Telco Churn Analysis", x="Churn")
p2 <- ggplot(data=telco_df, aes(y=tenure)) + geom_boxplot() + labs(title="Telco Churn Analysis", y="tenure")
p3 <- ggplot(data=telco_df, aes(x=MonthlyCharges)) + geom_histogram(binwidth = 10) +
labs(title="Telco Churn Analysis", x="MonthlyCharges")
library(gridExtra)
grid.arrange(p1,p2,p3, ncol =3)
From graphs above we can know that :
1.The customer who churn based on dataset is 26.54% and not churn73.46%.
2.Average tenure telco customer is around 30 days.
3.There are so many customers with MonthlyCharges below 25
Analysis of two variables.
ggplot(telco_df, aes(x=Churn,y=MonthlyCharges)) + geom_boxplot()+
geom_jitter(alpha=0.3,
color ='blue',
width =0.2)+
labs(title="Telco Customer Churn Analysis Data",y="MonthlyCharges")
Based on boxplot above we know customer who churn has average MontlyCharges higher than not churn.
ggplot(data=telco_df, aes(x=tenure,fill=Churn)) + geom_bar()+labs(title="Tenure vs Churn", x="tenure")
Customer with short tenure are likely to Churn and customer with long period tenure are not churn.
ggplot(data=telco_df, aes(x=PhoneService,fill=Churn)) + geom_bar() +
labs(title="PhoneService vs Churn",x="PhoneService") + geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)
Customer has Phone Service are likely to Churn compare who have no PhoneService.
ggplot(data=telco_df, aes(x=InternetService,fill=Churn)) + geom_bar()+
labs(title="InternetService vs Churn", x="InternetService") +
geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)
Customers subscribe InternetService using fiber optic are likely to Churn compare who have use DSL and no InternetService.
ggplot(data=telco_df, aes(x=OnlineSecurity,fill=Churn)) + geom_bar()+
labs(title="OnlineSecurity vs Churn", x="OnlineSecurity") +
geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)
Customers with No OnlineSecurity are likely to Churn.
ggplot(data=telco_df, aes(x=TechSupport,fill=Churn)) + geom_bar()+
labs(title="TechSupport vs Churn", x="TechSupport") +
geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)
Customers with no TechSupport are likely to Churn.
ggplot(data=telco_df, aes(x=Contract,fill=Churn)) + geom_bar()+
labs(title="Contract vs Churn", x="Contract") + geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)
Customers with Month-to-month contract are likely to Churn compare who apply one year and two year contract.
In general, variables like PhoneService-Yes,InternetService-FiberOptic,OnlineSecurity-No,TechSupport-No,Contract-Month-to-Month have Churn indicator bigger than others variables. However, analysis between these variables and the target variable are not enough to separate the classes.
There are three contionous variables , we want to compute and vizualize correlation coefficient of each measuremet.
Vizualize Pearson’s Correlation Coefficient for continous variables.
library(corrgram)
corrgram(telco_df[,c("tenure", "MonthlyCharges", "TotalCharges")],main="Correlation Between Continous Variable")
Using corrgram() we get a display of the cells in matrix R. The positive correlations are shown in blue, while the negative correlations are shown in red. The darker the hue, the greater the magnitude of the correlation.
We can see tenure has postive correlation with MonthlyCharges & TotalCharges
Remove some variables which have weak correlation with target variable
telco_df$gender<- NULL
telco_df$SeniorCitizen<- NULL
telco_df$Partner<- NULL
telco_df$Dependents<- NULL
telco_df$MultipleLines<- NULL
telco_df$OnlineSecurity<- NULL
telco_df$StreamingTV<- NULL
dim(telco_df)
## [1] 7043 13
Remove missing values
telco_df<-telco_df[complete.cases(telco_df),]
set.seed() for reproducible result. Ratio train:test = 70:30.
# - Train and Test
#divide data into training and test data 70 : 30
m = nrow(telco_df)
set.seed(2021)
train_idx <- sample(m,0.7 *m)
train_df<-telco_df[train_idx,]
test_df<-telco_df[-train_idx,]
This is Data Preparation Data Analysis part.
This is Modelling Data Analysis part.
fit.logit <-glm(formula=Churn ~.,
data= train_df,
family=binomial)
summary(fit.logit)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7505 -0.6796 -0.2946 0.7517 3.4642
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2869942 0.3010547 -0.953 0.340440
## tenure -0.0576998 0.0073176 -7.885 3.14e-15 ***
## PhoneServiceYes -0.8958214 0.2212879 -4.048 5.16e-05 ***
## InternetServiceFiber optic 0.5658518 0.2297895 2.462 0.013798 *
## InternetServiceNo -0.4996879 0.2763549 -1.808 0.070585 .
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes -0.1764940 0.0989492 -1.784 0.074475 .
## DeviceProtectionNo internet service NA NA NA NA
## DeviceProtectionYes -0.1119973 0.1042619 -1.074 0.282736
## TechSupportNo internet service NA NA NA NA
## TechSupportYes -0.4465764 0.1089340 -4.100 4.14e-05 ***
## StreamingMoviesNo internet service NA NA NA NA
## StreamingMoviesYes 0.1453457 0.1346666 1.079 0.280454
## ContractOne year -0.7016891 0.1254488 -5.593 2.23e-08 ***
## ContractTwo year -1.4411463 0.2071223 -6.958 3.45e-12 ***
## PaperlessBillingYes 0.3863045 0.0885594 4.362 1.29e-05 ***
## PaymentMethodCredit card (automatic) -0.0109239 0.1367971 -0.080 0.936353
## PaymentMethodElectronic check 0.4167933 0.1132955 3.679 0.000234 ***
## PaymentMethodMailed check -0.0120486 0.1376069 -0.088 0.930228
## MonthlyCharges 0.0110594 0.0079475 1.392 0.164054
## TotalCharges 0.0002983 0.0000836 3.568 0.000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5650.1 on 4921 degrees of freedom
## Residual deviance: 4093.2 on 4905 degrees of freedom
## AIC: 4127.2
##
## Number of Fisher Scoring iterations: 6
library(party)
fit.ctree <- ctree(Churn~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")
library(randomForest)
set.seed(2021)
fit.forest <- randomForest(Churn~., data=train_df,
na.action=na.roughfix,
importance=TRUE)
fit.forest
##
## Call:
## randomForest(formula = Churn ~ ., data = train_df, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.6%
## Confusion matrix:
## No Yes class.error
## No 3249 389 0.1069269
## Yes 625 659 0.4867601
library(e1071)
set.seed(2021)
fit.svm <- svm(Churn~., data=train_df)
fit.svm
##
## Call:
## svm(formula = Churn ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2228
We compute accuracy, precision, recall and F1 Score
performance <- function(table, n=2){
tn = table[1,1]
fp = table[1,2]
fn = table[2,1]
tp = table[2,2]
sensitivity = tp/(tp+fn) # recall
specificity = tn/(tn+fp)
ppp = tp/(tp+fp) # precision
npp = tn/(tn+fn)
hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
result <- paste("Sensitivity = ", round(sensitivity, n) ,
"\nSpecificity = ", round(specificity, n),
"\nPositive Predictive Value = ", round(ppp, n),
"\nNegative Predictive Value = ", round(npp, n),
"\nAccuracy = ", round(hitrate, n), "\n", sep="")
cat(result)
}
prob <- predict(fit.logit,test_df, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
labels=c("No", "Yes"))
logit.perf <- table(test_df$Churn, logit.pred,
dnn=c("Actual","Predicted"))
logit.perf
## Predicted
## Actual No Yes
## No 1363 162
## Yes 255 330
performance(logit.perf)
## Sensitivity = 0.56
## Specificity = 0.89
## Positive Predictive Value = 0.67
## Negative Predictive Value = 0.84
## Accuracy = 0.8
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$Churn, ctree.pred,
dnn=c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual No Yes
## No 1296 229
## Yes 210 375
performance(ctree.perf)
## Sensitivity = 0.64
## Specificity = 0.85
## Positive Predictive Value = 0.62
## Negative Predictive Value = 0.86
## Accuracy = 0.79
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$Churn, forest.pred,
dnn=c("Actual","Predicted"))
forest.perf
## Predicted
## Actual No Yes
## No 1374 151
## Yes 278 307
performance(forest.perf)
## Sensitivity = 0.52
## Specificity = 0.9
## Positive Predictive Value = 0.67
## Negative Predictive Value = 0.83
## Accuracy = 0.8
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$Churn, svm.pred,
dnn=c("Actual","Predicted"))
svm.perf
## Predicted
## Actual No Yes
## No 1415 110
## Yes 320 265
performance(svm.perf)
## Sensitivity = 0.45
## Specificity = 0.93
## Positive Predictive Value = 0.71
## Negative Predictive Value = 0.82
## Accuracy = 0.8
1.Decision Tree algorithm is the best among all the tested algorithms.
2.Based on decision tree model, the most important variables are Contract-Month-to-Month, InternetService-Fiber Optic and TechSupport-No
3.The results can be improved by better data preparation or using other algorithms. However, the current results can predict at least 64% customer who will churn.