Description

This report describes prediction of Telco Churn using Machine Learning Algorithm. We investigated 4 algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).

The dataset used in this report is Telco Customer Churn hosted in Kaggle.

The dataset can be downloaded here.

Report Outline:
1.Data Extaction
2.Exploratory Data Analysis
3.Data Preparation
4.Modeling
5.Evaluation
6.Recommendation

1.Data Extraction

The dataset is downloaded from Kaggle and saved in the data folder. We use read read.csv() function to read the dataset and put in telco_df data frame.

telco_df<-read.csv("data/Churn.csv")

To see the number of rows and column, we used dim() function.The dataset has 7043 rows and 21 columns.

dim(telco_df)
## [1] 7043   21

2.Exploratory Data Analysis

To find out the column names and types, we used str() function.

str(telco_df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

From the result above, we know the following:

  1. The first column is id.It is unique and unnecessary for prediction. So, it should be removed.
  2. There are only 3 continous variable , those are tenure, MonthlyCharges , TotalCharges. an the remain variables are char type and they are should be converted to factor.
  3. The last column is Churn.This is a target variable to be predicted.
# remove unnecessary columns
telco_df$customerID<-NULL

# change to factor for a target variable
telco_df$Churn<- as.factor(telco_df$Churn)
telco_df$gender<- as.factor(telco_df$gender)
telco_df$SeniorCitizen<- as.factor(telco_df$SeniorCitizen)
telco_df$Partner<- as.factor(telco_df$Partner)
telco_df$Dependents<- as.factor(telco_df$Dependents)
telco_df$PhoneService<- as.factor(telco_df$PhoneService)
telco_df$MultipleLines<- as.factor(telco_df$MultipleLines)
telco_df$InternetService<- as.factor(telco_df$InternetService)
telco_df$OnlineSecurity<- as.factor(telco_df$OnlineSecurity)
telco_df$OnlineBackup<- as.factor(telco_df$OnlineBackup)
telco_df$DeviceProtection<- as.factor(telco_df$DeviceProtection)
telco_df$TechSupport<- as.factor(telco_df$TechSupport)
telco_df$StreamingTV<- as.factor(telco_df$StreamingTV)
telco_df$StreamingMovies<- as.factor(telco_df$StreamingMovies)
telco_df$Contract<- as.factor(telco_df$Contract)
telco_df$PaperlessBilling<- as.factor(telco_df$PaperlessBilling)
telco_df$PaymentMethod<- as.factor(telco_df$PaymentMethod)

2.1. Univariate Data Analysis

Analysis of a targer variable. Number of Yes(Y)and No(N) in Churn column.

#multiple graph
library(ggplot2)
p1 <- ggplot(data=telco_df, aes(x=Churn)) + geom_bar()+ 
  geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', position = position_stack(vjust=0.5), size = 5,color= "white") +
            labs(title="Telco Churn Analysis", x="Churn") 
p2 <- ggplot(data=telco_df, aes(y=tenure)) + geom_boxplot() + labs(title="Telco Churn Analysis", y="tenure") 
p3 <- ggplot(data=telco_df, aes(x=MonthlyCharges)) + geom_histogram(binwidth = 10) +
  labs(title="Telco Churn Analysis", x="MonthlyCharges")


library(gridExtra)
grid.arrange(p1,p2,p3, ncol =3)

From graphs above we can know that :
1.The customer who churn based on dataset is 26.54% and not churn73.46%.
2.Average tenure telco customer is around 30 days.
3.There are so many customers with MonthlyCharges below 25

2.2. Bivariate Data Analysis

Analysis of two variables.

ggplot(telco_df, aes(x=Churn,y=MonthlyCharges)) + geom_boxplot()+
  geom_jitter(alpha=0.3, 
              color ='blue',
              width =0.2)+
  labs(title="Telco Customer Churn Analysis Data",y="MonthlyCharges")

Based on boxplot above we know customer who churn has average MontlyCharges higher than not churn.

ggplot(data=telco_df, aes(x=tenure,fill=Churn)) + geom_bar()+labs(title="Tenure vs Churn", x="tenure")

Customer with short tenure are likely to Churn and customer with long period tenure are not churn.

ggplot(data=telco_df, aes(x=PhoneService,fill=Churn)) + geom_bar() + 
  labs(title="PhoneService vs Churn",x="PhoneService") + geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)

Customer has Phone Service are likely to Churn compare who have no PhoneService.

ggplot(data=telco_df, aes(x=InternetService,fill=Churn)) + geom_bar()+
  labs(title="InternetService vs Churn", x="InternetService") +
  geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)

Customers subscribe InternetService using fiber optic are likely to Churn compare who have use DSL and no InternetService.

ggplot(data=telco_df, aes(x=OnlineSecurity,fill=Churn)) + geom_bar()+
  labs(title="OnlineSecurity vs Churn", x="OnlineSecurity") +
  geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)

Customers with No OnlineSecurity are likely to Churn.

ggplot(data=telco_df, aes(x=TechSupport,fill=Churn)) + geom_bar()+
  labs(title="TechSupport vs Churn", x="TechSupport") +
  geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)

Customers with no TechSupport are likely to Churn.

 ggplot(data=telco_df, aes(x=Contract,fill=Churn)) + geom_bar()+
  labs(title="Contract vs Churn", x="Contract") + geom_text(aes(y = ..count.. -200,label = paste0(round(prop.table(..count..),4) * 100, '%')),stat = 'count',position = position_stack(vjust=0.8), size = 3)

Customers with Month-to-month contract are likely to Churn compare who apply one year and two year contract.

In general, variables like PhoneService-Yes,InternetService-FiberOptic,OnlineSecurity-No,TechSupport-No,Contract-Month-to-Month have Churn indicator bigger than others variables. However, analysis between these variables and the target variable are not enough to separate the classes.

2.3. Multivariate Data Analysis

There are three contionous variables , we want to compute and vizualize correlation coefficient of each measuremet.

Vizualize Pearson’s Correlation Coefficient for continous variables.

library(corrgram)
corrgram(telco_df[,c("tenure", "MonthlyCharges", "TotalCharges")],main="Correlation Between Continous Variable")

Using corrgram() we get a display of the cells in matrix R. The positive correlations are shown in blue, while the negative correlations are shown in red. The darker the hue, the greater the magnitude of the correlation.

We can see tenure has postive correlation with MonthlyCharges & TotalCharges

3.Data Preparation

3.1. Feature Selection

Remove some variables which have weak correlation with target variable

telco_df$gender<- NULL
telco_df$SeniorCitizen<- NULL
telco_df$Partner<- NULL
telco_df$Dependents<- NULL
telco_df$MultipleLines<- NULL
telco_df$OnlineSecurity<- NULL
telco_df$StreamingTV<- NULL
dim(telco_df)
## [1] 7043   13

Remove missing values

telco_df<-telco_df[complete.cases(telco_df),]

3.2 Training and Test Division

set.seed() for reproducible result. Ratio train:test = 70:30.

# - Train and Test
#divide data into training and test data 70 : 30
m = nrow(telco_df)
set.seed(2021)
train_idx <- sample(m,0.7 *m)

train_df<-telco_df[train_idx,]
test_df<-telco_df[-train_idx,]

This is Data Preparation Data Analysis part.

4.Modelling

This is Modelling Data Analysis part.

4.1 Logistic Regression

fit.logit <-glm(formula=Churn ~.,
                data= train_df,
                family=binomial)
summary(fit.logit)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7505  -0.6796  -0.2946   0.7517   3.4642  
## 
## Coefficients: (4 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -0.2869942  0.3010547  -0.953 0.340440    
## tenure                               -0.0576998  0.0073176  -7.885 3.14e-15 ***
## PhoneServiceYes                      -0.8958214  0.2212879  -4.048 5.16e-05 ***
## InternetServiceFiber optic            0.5658518  0.2297895   2.462 0.013798 *  
## InternetServiceNo                    -0.4996879  0.2763549  -1.808 0.070585 .  
## OnlineBackupNo internet service              NA         NA      NA       NA    
## OnlineBackupYes                      -0.1764940  0.0989492  -1.784 0.074475 .  
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                  -0.1119973  0.1042619  -1.074 0.282736    
## TechSupportNo internet service               NA         NA      NA       NA    
## TechSupportYes                       -0.4465764  0.1089340  -4.100 4.14e-05 ***
## StreamingMoviesNo internet service           NA         NA      NA       NA    
## StreamingMoviesYes                    0.1453457  0.1346666   1.079 0.280454    
## ContractOne year                     -0.7016891  0.1254488  -5.593 2.23e-08 ***
## ContractTwo year                     -1.4411463  0.2071223  -6.958 3.45e-12 ***
## PaperlessBillingYes                   0.3863045  0.0885594   4.362 1.29e-05 ***
## PaymentMethodCredit card (automatic) -0.0109239  0.1367971  -0.080 0.936353    
## PaymentMethodElectronic check         0.4167933  0.1132955   3.679 0.000234 ***
## PaymentMethodMailed check            -0.0120486  0.1376069  -0.088 0.930228    
## MonthlyCharges                        0.0110594  0.0079475   1.392 0.164054    
## TotalCharges                          0.0002983  0.0000836   3.568 0.000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5650.1  on 4921  degrees of freedom
## Residual deviance: 4093.2  on 4905  degrees of freedom
## AIC: 4127.2
## 
## Number of Fisher Scoring iterations: 6

4.2 Decision Tree

library(party)
fit.ctree <- ctree(Churn~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")

4.3 Random Forest

library(randomForest)
set.seed(2021)
fit.forest <- randomForest(Churn~., data=train_df,
                           na.action=na.roughfix,
                           importance=TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = Churn ~ ., data = train_df, importance = TRUE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20.6%
## Confusion matrix:
##       No Yes class.error
## No  3249 389   0.1069269
## Yes  625 659   0.4867601

4.4 Support Vector Machine

library(e1071)
set.seed(2021)
fit.svm <- svm(Churn~., data=train_df)
fit.svm
## 
## Call:
## svm(formula = Churn ~ ., data = train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  2228

5.Evaluation

We compute accuracy, precision, recall and F1 Score

performance <- function(table, n=2){
  tn = table[1,1]
  fp = table[1,2] 
  fn = table[2,1]
  tp = table[2,2]
  
  sensitivity = tp/(tp+fn) # recall
  specificity = tn/(tn+fp) 
  ppp = tp/(tp+fp) # precision
  npp = tn/(tn+fn)
  hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
  
  result <- paste("Sensitivity = ", round(sensitivity, n) ,
  "\nSpecificity = ", round(specificity, n),
  "\nPositive Predictive Value = ", round(ppp, n),
  "\nNegative Predictive Value = ", round(npp, n),
  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  
  cat(result)
}
prob <- predict(fit.logit,test_df, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                     labels=c("No", "Yes"))
logit.perf <- table(test_df$Churn, logit.pred,
                    dnn=c("Actual","Predicted"))
logit.perf
##       Predicted
## Actual   No  Yes
##    No  1363  162
##    Yes  255  330
performance(logit.perf)
## Sensitivity = 0.56
## Specificity = 0.89
## Positive Predictive Value = 0.67
## Negative Predictive Value = 0.84
## Accuracy = 0.8
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$Churn, ctree.pred,
                    dnn=c("Actual","Predicted"))
ctree.perf
##       Predicted
## Actual   No  Yes
##    No  1296  229
##    Yes  210  375
performance(ctree.perf)
## Sensitivity = 0.64
## Specificity = 0.85
## Positive Predictive Value = 0.62
## Negative Predictive Value = 0.86
## Accuracy = 0.79
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$Churn, forest.pred,
                    dnn=c("Actual","Predicted"))
forest.perf
##       Predicted
## Actual   No  Yes
##    No  1374  151
##    Yes  278  307
performance(forest.perf)
## Sensitivity = 0.52
## Specificity = 0.9
## Positive Predictive Value = 0.67
## Negative Predictive Value = 0.83
## Accuracy = 0.8
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$Churn, svm.pred,
                    dnn=c("Actual","Predicted"))
svm.perf
##       Predicted
## Actual   No  Yes
##    No  1415  110
##    Yes  320  265
performance(svm.perf)
## Sensitivity = 0.45
## Specificity = 0.93
## Positive Predictive Value = 0.71
## Negative Predictive Value = 0.82
## Accuracy = 0.8

6.Recommendation

1.Decision Tree algorithm is the best among all the tested algorithms.

2.Based on decision tree model, the most important variables are Contract-Month-to-Month, InternetService-Fiber Optic and TechSupport-No

3.The results can be improved by better data preparation or using other algorithms. However, the current results can predict at least 64% customer who will churn.