Group Members

Name	Matric No
Choo En Ming	22079929
Han Die	22062358
Lim Min Kye Andy	22083095
Soong Sing Ying	S2191652
Tneu Zhen Yang	22081192

Introduction

Customer churn, the loss of customers or subscribers, can have a significant impact on a company’s profitability. For businesses, retaining existing customers is more cost-effective than acquiring new ones. Understanding the factors that contribute to customer churn is essential for reducing churn rates and improving revenue. This project aims to analyze customer data and develop targeted customer retention programs in the competitive telecommunication industry.

Objectives

To identify potential factors leading to customer churn.
To compare and evaluate the performance of different models in predicting high-risk customers who are more likely to churn from the service.
To recommend telecommunication companies with actionable strategies for customer retention based on the results of our data analysis.

CRISP-DM

Data Collection: The Telco Customer Churn dataset was obtained from Kaggle. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Data Cleaning: The process starts by performing some initial data cleaning steps. This includes setting the row names / index using the customerID column, converting factor columns into that of factor type. Followed by displaying the summary statistics, structure information, etc.
Exploratory Data Analysis: Various visualizations info graphics are displayed to understand different aspects of the dataset, such as distribution, kurtosis and skewness of the variables, identifying possible outliers and correlations between the variables.
Data Preprocessing: Chi-squared tests are adopted to identify if there is significant association categorical variables associated with Churn, performs one-hot encoding on selected variables, and applies SMOTE oversampling to create a more balanced training set for modelling.
Modelling: Several classification machine learning models are trained to predict customer churn, including:
- Naive Bayes
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- k-Nearest Neighbors (kNN)
- Adaptive Boosting (AdaBoost)
- Xtreme Gradient Boodting (XGBoost)
Model Evaluation: The evaluation results (accuracy, precision, sensitivity, specificity, F1-score) for each model are summarized in a data frame, and the performance is visualized using ROC curves.
Association Analysis: Apriori algorithm is utilised to conduct association rule mining on Telco customer churn data with Internet service. It mines not only the association rules but also generates visualizations to uncover the most meaningful associations among various features.

Data Cleaning

# read data file
df <- read.csv("/Users/chooenming/Desktop/Programming for Data Science/Telco_Customer_Churn_Data.csv",
               header = T, stringsAsFactors = T)
rownames(df) <- df$customerID # put customerID as index
df$customerID <- NULL # remove customerID from variables

str(df) #

## 'data.frame':    7043 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Dataset has 7043 observations of 20 variables

# Change SeniorCitizen to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)

# return distinct values of those factor variables
for(i in colnames(df)){
  if(is.factor(df[[i]])){
    cat(i, "\n distinct values: ", levels(df[[i]]), "\n")
  }
}

## gender 
##  distinct values:  Female Male 
## SeniorCitizen 
##  distinct values:  0 1 
## Partner 
##  distinct values:  No Yes 
## Dependents 
##  distinct values:  No Yes 
## PhoneService 
##  distinct values:  No Yes 
## MultipleLines 
##  distinct values:  No No phone service Yes 
## InternetService 
##  distinct values:  DSL Fiber optic No 
## OnlineSecurity 
##  distinct values:  No No internet service Yes 
## OnlineBackup 
##  distinct values:  No No internet service Yes 
## DeviceProtection 
##  distinct values:  No No internet service Yes 
## TechSupport 
##  distinct values:  No No internet service Yes 
## StreamingTV 
##  distinct values:  No No internet service Yes 
## StreamingMovies 
##  distinct values:  No No internet service Yes 
## Contract 
##  distinct values:  Month-to-month One year Two year 
## PaperlessBilling 
##  distinct values:  No Yes 
## PaymentMethod 
##  distinct values:  Bank transfer (automatic) Credit card (automatic) Electronic check Mailed check 
## Churn 
##  distinct values:  No Yes

# replace those with more than 3 levels to 2 levels if possible
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- df %>%
  mutate(MultipleLines = as.factor(case_when(MultipleLines == "No phone service"~"No",
                                            MultipleLines == "No"~"No",
                                            TRUE ~ "Yes"))) %>%
  mutate(OnlineSecurity = as.factor(case_when(OnlineSecurity == "No internet service"~"No",
                                              OnlineSecurity == "No"~"No",
                                              TRUE ~ "Yes"))) %>%
  mutate(OnlineBackup = as.factor(case_when(OnlineBackup == "No internet service"~"No",
                                              OnlineBackup == "No"~"No",
                                              TRUE ~ "Yes"))) %>%
  mutate(DeviceProtection = as.factor(case_when(DeviceProtection == "No internet service"~"No",
                                                DeviceProtection == "No"~"No",
                                                TRUE ~ "Yes"))) %>%
  mutate(TechSupport = as.factor(case_when(TechSupport == "No internet service"~"No",
                                           TechSupport == "No"~"No",
                                           TRUE ~ "Yes"))) %>%  
  mutate(StreamingTV = as.factor(case_when(StreamingTV == "No internet service"~"No",
                                           StreamingTV == "No"~"No",
                                           TRUE ~ "Yes"))) %>%    
  mutate(StreamingMovies = as.factor(case_when(StreamingMovies == "No internet service"~"No",
                                           StreamingMovies == "No"~"No",
                                           TRUE ~ "Yes")))

# check for missing values
sum(is.na(df))

## [1] 11

There are 11 rows that contain missing data.
Since the amount is relatively small, the rows with missing value will be removed

# Remove missing rows
df <- na.omit(df)
summary(df)

##     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
##  Female:3483   0:5890        No :3639   No :4933   Min.   : 1.00   No : 680    
##  Male  :3549   1:1142        Yes:3393   Yes:2099   1st Qu.: 9.00   Yes:6352    
##                                                    Median :29.00               
##                                                    Mean   :32.42               
##                                                    3rd Qu.:55.00               
##                                                    Max.   :72.00               
##  MultipleLines    InternetService OnlineSecurity OnlineBackup DeviceProtection
##  No :4065      DSL        :2416   No :5017       No :4607     No :4614        
##  Yes:2967      Fiber optic:3096   Yes:2015       Yes:2425     Yes:2418        
##                No         :1520                                               
##                                                                               
##                                                                               
##                                                                               
##  TechSupport StreamingTV StreamingMovies           Contract    PaperlessBilling
##  No :4992    No :4329    No :4301        Month-to-month:3875   No :2864        
##  Yes:2040    Yes:2703    Yes:2731        One year      :1472   Yes:4168        
##                                          Two year      :1685                   
##                                                                                
##                                                                                
##                                                                                
##                    PaymentMethod  MonthlyCharges    TotalCharges    Churn     
##  Bank transfer (automatic):1542   Min.   : 18.25   Min.   :  18.8   No :5163  
##  Credit card (automatic)  :1521   1st Qu.: 35.59   1st Qu.: 401.4   Yes:1869  
##  Electronic check         :2365   Median : 70.35   Median :1397.5             
##  Mailed check             :1604   Mean   : 64.80   Mean   :2283.3             
##                                   3rd Qu.: 89.86   3rd Qu.:3794.7             
##                                   Max.   :118.75   Max.   :8684.8

# check if each attribute is numeric
is_number <- sapply(df, is.numeric)

# Plot boxplot for numerical attributes to check for outliers
par(mfrow = c(ceiling(sum(is_number)/3), 3))
for(col in colnames(df)){
  if(is_number[col]){
    boxplot(df[[col]], main = col)
  }
}

No outliers were found for all numerical attributes

Exploratory Data Analysis

Customer Churn by Gender

library(ggplot2)
library(cowplot)
library(ggcorrplot)

g1 <- ggplot(data = df , aes(x = Churn,fill = Churn)) + geom_bar()  + geom_text(stat = 'count' , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5) + theme_minimal() + ggtitle("Number of customer churn") + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")

g2 <- ggplot(data = df , aes(x = gender,fill = gender)) + geom_bar()  + geom_text(stat = 'count' , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5) + theme_minimal() + ggtitle("Number by Gender") + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")

g3 <- ggplot(data = df, aes(x = Churn,fill = gender)) + geom_bar(stat = "count",position = position_dodge()) + geom_text(stat = "count" , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5 , position = position_dodge(0.9)) + ggtitle("Customers Churn by Gender") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")

print(plot_grid(g1,g2,g3, ncol = 2, nrow = 2))

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Customer Churn by Charges and Tenure

g1 = ggplot(data = df , aes(x = Churn , y = MonthlyCharges, fill = Churn)) + geom_bar(stat = "summary" , fun = "median")  +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of MonthlyCharges") +
ggtitle("Median monthly charges of customers") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
g2 = ggplot(data = df , aes(x = Churn , y = tenure, fill = Churn)) + geom_bar(stat = "summary" , fun = "median")  +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of tenure") +
ggtitle("Median tenure of customers") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
g3 = ggplot(data = df , aes(x = Churn , y = TotalCharges, fill = Churn)) + geom_bar(stat = "summary" , fun = "median")  +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of TotalCharges") +
ggtitle("Median total charges") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
print(plot_grid(g1,g2,g3, ncol = 2, nrow = 2))

Customer Churn by Contract Type

g1 <- ggplot(data = df, aes(x = Churn,fill = Contract)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g2 <- ggplot(data = df, aes(x = Churn,fill = PaymentMethod)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g3 <- ggplot(data = df, aes(x = Churn,fill = InternetService)) + geom_bar(stat = "count",position = position_dodge()) +  ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
print(plot_grid(g1,g2,g3, ncol = 1, nrow = 3))

Customer churn by Service Ordered

g1 <- ggplot(data = df, aes(x = Churn,fill = OnlineSecurity)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("OnlineSecurity") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g2 <- ggplot(data = df, aes(x = Churn,fill = OnlineBackup)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("OnlineBackup") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g3 <- ggplot(data = df, aes(x = Churn,fill = DeviceProtection)) + geom_bar(stat = "count",position = position_dodge()) +  ggtitle("DeviceProtection") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g4 <- ggplot(data = df, aes(x = Churn,fill = TechSupport)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("TechSupport") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g5 <- ggplot(data = df, aes(x = Churn,fill = StreamingTV)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("StreamingTV") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
g6 <- ggplot(data = df, aes(x = Churn,fill = StreamingMovies)) + geom_bar(stat = "count",position = position_dodge()) +  ggtitle("StreamingMovies") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count") 
print(plot_grid(g1,g2,g3,g4,g5,g6, ncol = 2, nrow = 3))

Boxplot & Histogram

Boxplot to display outliers, minimum, 25th percentile, median, 75percentile, and maximum
Histogram to show to skewness and distribution of the data

g1 <- ggplot(data = df , aes(x = tenure)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of tenure") + theme(plot.title = element_text(hjust = 0.5))
g2 <- ggplot(data = df , aes(x = tenure)) + geom_boxplot(fill="lightgreen") + theme_classic() + ggtitle("Boxplot of tenure") + theme(plot.title = element_text(hjust = 0.5))
g3 <- ggplot(data = df , aes(x = MonthlyCharges)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of MonthlyCharges") + theme(plot.title = element_text(hjust = 0.5))
g4 <- ggplot(data = df , aes(x = MonthlyCharges)) + geom_boxplot(fill = "lightgreen") + theme_classic() + ggtitle("Boxplot of MonthlyCharges") + theme(plot.title = element_text(hjust = 0.5))
g5 <- ggplot(data = df , aes(x = TotalCharges)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of TotalCharges") + theme(plot.title = element_text(hjust = 0.5))
g6 <- ggplot(data = df , aes(x = TotalCharges)) + geom_boxplot(fill = "lightgreen") + theme_classic() + ggtitle("Boxplot of TotalCharges") + theme(plot.title = element_text(hjust = 0.5))
print(plot_grid(g1,g2,g3,g4,g5,g6,ncol = 2, nrow = 3))

It becomes very hard to figure out how data correlates if there are more than two features.
Data visualization can help find how individual features may correlate with the output.
This is due to not all data is relevant to the project objective. The importance of data correlation has an effect when the dataset is with many features.
It’s tempting to think that a larger number of features will help a model make better predictions.

Correlation Plots

To show the correlation among the numerical variables

data_col <- df[ , c('tenure','MonthlyCharges','TotalCharges')]
corr <- cor(data_col)
ggcorrplot(corr , hc.order = TRUE , type = 'lower', lab = TRUE)

Data Preprocessing

Chi-squared tests

To include only those having significant association to Churn

Null hypothesis, H0: There is no association between the variable and Churn
Alternative hypothesis, H1: There is association between the variable and Churn
Decision: Reject H0 if p-value < 0.05

var_categorical <- df[, sapply(df, is.factor)]
colnames(var_categorical)

##  [1] "gender"           "SeniorCitizen"    "Partner"          "Dependents"      
##  [5] "PhoneService"     "MultipleLines"    "InternetService"  "OnlineSecurity"  
##  [9] "OnlineBackup"     "DeviceProtection" "TechSupport"      "StreamingTV"     
## [13] "StreamingMovies"  "Contract"         "PaperlessBilling" "PaymentMethod"   
## [17] "Churn"

chi_sq <- lapply(var_categorical[, -which(colnames(var_categorical) == "Churn")], 
              function(x) chisq.test(var_categorical[, which(colnames(var_categorical) == "Churn")], x))

chi_sq_result <- data.frame(Variable = colnames(var_categorical)[-which(colnames(var_categorical) == "Churn")],
                            p_value = sapply(chi_sq, function(x) x$p.value),
                            stringsAsFactors = F)
chi_sq_result$Conclusion <- ifelse(chi_sq_result$p_value > 0.05, "No Association", "Significant Association")
chi_sq_result

##                          Variable       p_value              Conclusion
## gender                     gender  4.904885e-01          No Association
## SeniorCitizen       SeniorCitizen  2.479256e-36 Significant Association
## Partner                   Partner  3.973798e-36 Significant Association
## Dependents             Dependents  2.019659e-42 Significant Association
## PhoneService         PhoneService  3.499240e-01          No Association
## MultipleLines       MultipleLines  8.694083e-04 Significant Association
## InternetService   InternetService 5.831199e-159 Significant Association
## OnlineSecurity     OnlineSecurity  1.374240e-46 Significant Association
## OnlineBackup         OnlineBackup  6.259257e-12 Significant Association
## DeviceProtection DeviceProtection  3.346075e-08 Significant Association
## TechSupport           TechSupport  3.232868e-43 Significant Association
## StreamingTV           StreamingTV  1.316434e-07 Significant Association
## StreamingMovies   StreamingMovies  3.857900e-07 Significant Association
## Contract                 Contract 7.326182e-257 Significant Association
## PaperlessBilling PaperlessBilling  8.236203e-58 Significant Association
## PaymentMethod       PaymentMethod 1.426310e-139 Significant Association

Result indicates that gender and PhoneService have no association to Churn at 5% significance level.

One-Hot encoding

library(caret)

## Loading required package: lattice

col_encode <- c("InternetService", "Contract", "PaymentMethod")
df_encode <- data.frame(df[, !names(df) %in% col_encode], 
                        model.matrix(~. - 1, data = df[, col_encode]))
str(df_encode)

## 'data.frame':    7032 obs. of  25 variables:
##  $ gender                              : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen                       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner                             : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents                          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure                              : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService                        : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines                       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ OnlineSecurity                      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup                        : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection                    : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport                         : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV                         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies                     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ PaperlessBilling                    : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ MonthlyCharges                      : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges                        : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn                               : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
##  $ InternetServiceDSL                  : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber.optic          : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ContractOne.year                    : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo.year                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaymentMethodCredit.card..automatic.: num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic.check       : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed.check           : num  0 1 1 0 0 0 0 1 0 0 ...

df_encode <- df_encode %>%
  mutate(
    InternetServiceDSL = as.factor(InternetServiceDSL),
    InternetServiceFiber.optic = as.factor(InternetServiceFiber.optic),
    InternetServiceNo = as.factor(InternetServiceNo),
    ContractOne.year = as.factor(ContractOne.year),
    ContractTwo.year = as.factor(ContractTwo.year),
    PaymentMethodCredit.card..automatic. = as.factor(PaymentMethodCredit.card..automatic.),
    PaymentMethodElectronic.check = as.factor(PaymentMethodElectronic.check),
    PaymentMethodMailed.check = as.factor(PaymentMethodMailed.check)
  )

df_churn_summary <- df_encode %>% group_by(Churn) %>%
  summarise(count = n(), percentage = n() / nrow(df_encode) * 100)
df_churn_summary

## # A tibble: 2 × 3
##   Churn count percentage
##   <fct> <int>      <dbl>
## 1 No     5163       73.4
## 2 Yes    1869       26.6

The data is quite imbalanced, where customer churning from the telco subscription has only 26.58% of the total customer base.

Oversampling

Since data is quite imbalanced, Synthetic Minority Oversampling (SMOTE) is used to create more evenly distributed training set

set.seed(123)
df_churn <- createDataPartition(df_encode$Churn, p = 0.7, list = F)
col_exclude <- c("gender", "PhoneService")
df_train <- df_encode[df_churn, setdiff(names(df_encode), col_exclude)]
df_test <- df_encode[-df_churn, setdiff(names(df_encode), col_exclude)]
library(performanceEstimation)
df_train_smote <- smote(Churn~., data = data.frame(df_train), perc.over = 1, perc.under = 2 )

Modeling

Naive Bayes (NB)

# Naive Bayes as the baseline
library(e1071)
# training model
nb_model <- naiveBayes(Churn~., data = df_train_smote)

# testing model
nb_model_pred <- predict(nb_model, newdata = df_test)

# evaluation
library(caret)
nb_model_confusion_mat <- confusionMatrix(nb_model_pred, df_test$Churn)
nb_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1145  123
##        Yes  403  437
##                                           
##                Accuracy : 0.7505          
##                  95% CI : (0.7314, 0.7688)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.04857         
##                                           
##                   Kappa : 0.4485          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.7397          
##             Specificity : 0.7804          
##          Pos Pred Value : 0.9030          
##          Neg Pred Value : 0.5202          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5432          
##    Detection Prevalence : 0.6015          
##       Balanced Accuracy : 0.7600          
##                                           
##        'Positive' Class : No              
##

Logistic Regression (LR)

Assumptions: observations independent of each other no multicollinearity among the independent variables

# cross validation with 10 fold
cross <- trainControl(method = "cv", number = 10, classProbs = T, summaryFunction = twoClassSummary)

# model fitting
log_reg <- train(Churn~., data = df_train_smote, method = "glm", metric = "ROC",
                 preProcess = c("center", "scale"), trControl = cross)
summary(log_reg)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.26699  -0.70490   0.09563   0.72727   3.05962  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.26379    0.04421  -5.967 2.42e-09 ***
## SeniorCitizen1                        -0.03294    0.03737  -0.881 0.378053    
## PartnerYes                            -0.07248    0.04146  -1.748 0.080416 .  
## DependentsYes                         -0.09526    0.04127  -2.308 0.020984 *  
## tenure                                -1.70405    0.15315 -11.126  < 2e-16 ***
## MultipleLinesYes                       0.21393    0.05003   4.276 1.90e-05 ***
## OnlineSecurityYes                     -0.09721    0.04192  -2.319 0.020385 *  
## OnlineBackupYes                       -0.09055    0.04161  -2.176 0.029553 *  
## DeviceProtectionYes                    0.11118    0.04398   2.528 0.011480 *  
## TechSupportYes                        -0.10531    0.04283  -2.459 0.013942 *  
## StreamingTVYes                         0.23721    0.05377   4.411 1.03e-05 ***
## StreamingMoviesYes                     0.26257    0.05572   4.713 2.45e-06 ***
## PaperlessBillingYes                    0.18641    0.03880   4.805 1.55e-06 ***
## MonthlyCharges                        -0.99276    0.21170  -4.689 2.74e-06 ***
## TotalCharges                           1.06164    0.15888   6.682 2.36e-11 ***
## InternetServiceDSL1                    1.44546    0.60192   2.401 0.016332 *  
## InternetServiceFiber.optic1            2.26206    0.65458   3.456 0.000549 ***
## InternetServiceNo1                     0.58112    0.50400   1.153 0.248907    
## ContractOne.year1                     -0.33529    0.04156  -8.068 7.14e-16 ***
## ContractTwo.year1                     -0.61591    0.06417  -9.598  < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.03420    0.04644  -0.737 0.461420    
## PaymentMethodElectronic.check1         0.13800    0.05055   2.730 0.006337 ** 
## PaymentMethodMailed.check1            -0.07659    0.05022  -1.525 0.127244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7258.6  on 5235  degrees of freedom
## Residual deviance: 4891.8  on 5213  degrees of freedom
## AIC: 4937.8
## 
## Number of Fisher Scoring iterations: 6

InternetServiceNo summary indicates a problem with perfect separation or perfect multicollinearity in the data, hence it is removed from the model

log_reg2 <- train(Churn~. -InternetServiceNo, data = df_train_smote, method = "glm", metric = "ROC",
                 preProcess = c("center", "scale"), trControl = cross)
summary(log_reg2)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2673  -0.7079   0.1605   0.7274   3.0564  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.26212    0.04413  -5.940 2.86e-09 ***
## SeniorCitizen1                        -0.03282    0.03738  -0.878  0.37995    
## PartnerYes                            -0.07241    0.04146  -1.747  0.08071 .  
## DependentsYes                         -0.09472    0.04126  -2.296  0.02169 *  
## tenure                                -1.70210    0.15305 -11.121  < 2e-16 ***
## MultipleLinesYes                       0.22013    0.04981   4.420 9.89e-06 ***
## OnlineSecurityYes                     -0.09454    0.04184  -2.259  0.02387 *  
## OnlineBackupYes                       -0.08831    0.04157  -2.125  0.03363 *  
## DeviceProtectionYes                    0.11397    0.04391   2.595  0.00945 ** 
## TechSupportYes                        -0.10192    0.04272  -2.385  0.01706 *  
## StreamingTVYes                         0.24412    0.05349   4.564 5.01e-06 ***
## StreamingMoviesYes                     0.26775    0.05558   4.817 1.45e-06 ***
## PaperlessBillingYes                    0.18543    0.03879   4.781 1.75e-06 ***
## MonthlyCharges                        -1.03751    0.20885  -4.968 6.77e-07 ***
## TotalCharges                           1.05965    0.15877   6.674 2.48e-11 ***
## InternetServiceDSL1                    0.76095    0.09136   8.329  < 2e-16 ***
## InternetServiceFiber.optic1            1.54227    0.19130   8.062 7.49e-16 ***
## ContractOne.year1                     -0.33456    0.04153  -8.056 7.91e-16 ***
## ContractTwo.year1                     -0.61033    0.06385  -9.559  < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.03195    0.04639  -0.689  0.49101    
## PaymentMethodElectronic.check1         0.13840    0.05054   2.739  0.00617 ** 
## PaymentMethodMailed.check1            -0.07482    0.05020  -1.490  0.13610    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7258.6  on 5235  degrees of freedom
## Residual deviance: 4893.3  on 5214  degrees of freedom
## AIC: 4937.3
## 
## Number of Fisher Scoring iterations: 6

Check multicollinearity among independent variables using VIF:

If VIF = 1, variables are not correlated 
If VIF between 1 and 5, variables are moderately correlated
If VIF > 5, variables are highly correlated

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif_values <- vif(log_reg2$finalModel)
vif_values

##                        SeniorCitizen1                            PartnerYes 
##                              1.175002                              1.355302 
##                         DependentsYes                                tenure 
##                              1.274230                             15.579454 
##                      MultipleLinesYes                     OnlineSecurityYes 
##                              1.983112                              1.374907 
##                       OnlineBackupYes                   DeviceProtectionYes 
##                              1.430221                              1.563433 
##                        TechSupportYes                        StreamingTVYes 
##                              1.448386                              2.314756 
##                    StreamingMoviesYes                   PaperlessBillingYes 
##                              2.487115                              1.172481 
##                        MonthlyCharges                          TotalCharges 
##                             33.063765                             21.013789 
##                   InternetServiceDSL1           InternetServiceFiber.optic1 
##                              6.706288                             28.849259 
##                     ContractOne.year1                     ContractTwo.year1 
##                              1.364193                              1.453915 
## PaymentMethodCredit.card..automatic.1        PaymentMethodElectronic.check1 
##                              1.614033                              2.068606 
##            PaymentMethodMailed.check1 
##                              2.093463

Remove variable with highest VIF values: MonthlyCharges

excl_var <- c("InternetServiceNo", "MonthlyCharges")
formula <- as.formula(paste("Churn ~ . -", paste(excl_var, collapse = " - ")))
log_reg3 <- train(formula, data = df_train_smote, method = "glm", metric = "ROC",
                  preProcess = c("center", "scale"), trControl = cross)
summary(log_reg3)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2811  -0.7207   0.1547   0.7301   2.9900  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.23973    0.04339  -5.525 3.30e-08 ***
## SeniorCitizen1                        -0.02246    0.03728  -0.603  0.54675    
## PartnerYes                            -0.06884    0.04131  -1.666  0.09565 .  
## DependentsYes                         -0.10283    0.04110  -2.502  0.01235 *  
## tenure                                -1.51520    0.14605 -10.374  < 2e-16 ***
## MultipleLinesYes                       0.09943    0.04346   2.288  0.02214 *  
## OnlineSecurityYes                     -0.17307    0.03885  -4.455 8.38e-06 ***
## OnlineBackupYes                       -0.15668    0.03926  -3.991 6.58e-05 ***
## DeviceProtectionYes                    0.04444    0.04172   1.065  0.28681    
## TechSupportYes                        -0.18032    0.03974  -4.538 5.68e-06 ***
## StreamingTVYes                         0.09824    0.04481   2.192  0.02835 *  
## StreamingMoviesYes                     0.10868    0.04538   2.395  0.01661 *  
## PaperlessBillingYes                    0.18727    0.03864   4.847 1.26e-06 ***
## TotalCharges                           0.86401    0.15215   5.679 1.36e-08 ***
## InternetServiceDSL1                    0.43005    0.06120   7.027 2.11e-12 ***
## InternetServiceFiber.optic1            0.66233    0.07287   9.090  < 2e-16 ***
## ContractOne.year1                     -0.34008    0.04134  -8.227  < 2e-16 ***
## ContractTwo.year1                     -0.61083    0.06355  -9.612  < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.02694    0.04619  -0.583  0.55968    
## PaymentMethodElectronic.check1         0.14555    0.05039   2.888  0.00387 ** 
## PaymentMethodMailed.check1            -0.07572    0.04985  -1.519  0.12875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7258.6  on 5235  degrees of freedom
## Residual deviance: 4918.6  on 5215  degrees of freedom
## AIC: 4960.6
## 
## Number of Fisher Scoring iterations: 6

vif_values2 <- vif(log_reg3$finalModel)
vif_values2

##                        SeniorCitizen1                            PartnerYes 
##                              1.168945                              1.352660 
##                         DependentsYes                                tenure 
##                              1.271493                             14.204876 
##                      MultipleLinesYes                     OnlineSecurityYes 
##                              1.516832                              1.191720 
##                       OnlineBackupYes                   DeviceProtectionYes 
##                              1.279963                              1.413164 
##                        TechSupportYes                        StreamingTVYes 
##                              1.259979                              1.627063 
##                    StreamingMoviesYes                   PaperlessBillingYes 
##                              1.662533                              1.170588 
##                          TotalCharges                   InternetServiceDSL1 
##                             19.203947                              3.061111 
##           InternetServiceFiber.optic1                     ContractOne.year1 
##                              4.223566                              1.365325 
##                     ContractTwo.year1 PaymentMethodCredit.card..automatic.1 
##                              1.449664                              1.612181 
##        PaymentMethodElectronic.check1            PaymentMethodMailed.check1 
##                              2.063831                              2.082495

Remove variable with highest VIF values: TotalCharges

excl_var <- c("InternetServiceNo", "MonthlyCharges", "TotalCharges")
formula <- as.formula(paste("Churn ~ . -", paste(excl_var, collapse = " - ")))
log_reg4 <- train(formula, data = df_train_smote, method = "glm", metric = "ROC",
                  preProcess = c("center", "scale"), trControl = cross)
summary(log_reg4)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.37222  -0.72722   0.09665   0.72949   2.86532  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.14716    0.03836  -3.836 0.000125 ***
## SeniorCitizen1                        -0.02101    0.03750  -0.560 0.575355    
## PartnerYes                            -0.06816    0.04127  -1.651 0.098658 .  
## DependentsYes                         -0.10630    0.04094  -2.596 0.009422 ** 
## tenure                                -0.78451    0.06032 -13.005  < 2e-16 ***
## MultipleLinesYes                       0.14864    0.04267   3.484 0.000494 ***
## OnlineSecurityYes                     -0.14650    0.03866  -3.789 0.000151 ***
## OnlineBackupYes                       -0.12326    0.03902  -3.159 0.001586 ** 
## DeviceProtectionYes                    0.08616    0.04142   2.080 0.037507 *  
## TechSupportYes                        -0.14707    0.03946  -3.727 0.000194 ***
## StreamingTVYes                         0.14738    0.04429   3.328 0.000876 ***
## StreamingMoviesYes                     0.16472    0.04462   3.692 0.000223 ***
## PaperlessBillingYes                    0.17935    0.03836   4.675 2.94e-06 ***
## InternetServiceDSL1                    0.41175    0.05964   6.903 5.07e-12 ***
## InternetServiceFiber.optic1            0.78964    0.06903  11.438  < 2e-16 ***
## ContractOne.year1                     -0.33935    0.04078  -8.322  < 2e-16 ***
## ContractTwo.year1                     -0.59655    0.06204  -9.616  < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.02760    0.04608  -0.599 0.549203    
## PaymentMethodElectronic.check1         0.15038    0.05038   2.985 0.002837 ** 
## PaymentMethodMailed.check1            -0.04521    0.04908  -0.921 0.356962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7258.6  on 5235  degrees of freedom
## Residual deviance: 4953.2  on 5216  degrees of freedom
## AIC: 4993.2
## 
## Number of Fisher Scoring iterations: 5

vif_values3 <- vif(log_reg4$finalModel)
vif_values3

##                        SeniorCitizen1                            PartnerYes 
##                              1.169375                              1.354581 
##                         DependentsYes                                tenure 
##                              1.271709                              2.417176 
##                      MultipleLinesYes                     OnlineSecurityYes 
##                              1.454261                              1.175463 
##                       OnlineBackupYes                   DeviceProtectionYes 
##                              1.255848                              1.379160 
##                        TechSupportYes                        StreamingTVYes 
##                              1.228959                              1.575330 
##                    StreamingMoviesYes                   PaperlessBillingYes 
##                              1.593262                              1.169039 
##                   InternetServiceDSL1           InternetServiceFiber.optic1 
##                              3.010753                              3.846592 
##                     ContractOne.year1                     ContractTwo.year1 
##                              1.357364                              1.420734 
## PaymentMethodCredit.card..automatic.1        PaymentMethodElectronic.check1 
##                              1.612775                              2.064202 
##            PaymentMethodMailed.check1 
##                              2.053385

VIF of all variables are below the rule of thumb of 5 now

# test the model
log_reg_pred <- predict(log_reg4, newdata = df_test)

# evaluation
log_reg_confusion_mat <- confusionMatrix(log_reg_pred, df_test$Churn)
log_reg_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1134  106
##        Yes  414  454
##                                           
##                Accuracy : 0.7533          
##                  95% CI : (0.7343, 0.7716)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.02505         
##                                           
##                   Kappa : 0.4622          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.7326          
##             Specificity : 0.8107          
##          Pos Pred Value : 0.9145          
##          Neg Pred Value : 0.5230          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5380          
##    Detection Prevalence : 0.5882          
##       Balanced Accuracy : 0.7716          
##                                           
##        'Positive' Class : No              
##

Random Forest (RF)

# grid search
ctrl <- trainControl(method = "cv", number = 10, search = "grid", classProbs = T)
param_grid <- expand.grid(.mtry = (1:10))
modellist <- list()

# training model
for(ntree in c(seq(200, 300, by = 50))){
  set.seed(123)
  rf_model <- train(Churn ~ ., data = df_train_smote, method = "rf",
                    trControl = ctrl, tuneGrid = param_grid, ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- rf_model
}

rf_model_result <- resamples(modellist)
summary(rf_model_result)

## 
## Call:
## summary.resamples(object = rf_model_result)
## 
## Models: 200, 250, 300 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## 200 0.8396947 0.8611641 0.8650707 0.8638283 0.8706423 0.889313    0
## 250 0.8377863 0.8573473 0.8641129 0.8619195 0.8682577 0.889313    0
## 300 0.8396947 0.8573473 0.8650689 0.8632561 0.8695658 0.889313    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## 200 0.6793893 0.7223282 0.7301468 0.7276587 0.7412872 0.778626    0
## 250 0.6755725 0.7146947 0.7282311 0.7238410 0.7365175 0.778626    0
## 300 0.6793893 0.7146947 0.7301421 0.7265146 0.7391423 0.778626    0

ntree = 250 has the highest accuracy, hence the parameter is chosen as the best parameter for random forest model

rf_model_final <- train(Churn ~., data = df_train_smote, method = "rf",
                        trControl = ctrl, tuneGrid = param_grid, ntree = 250)

# testing model
rf_model_pred <- predict(rf_model_final, newdata = df_test)

# evaluation
rf_model_confusion_mat <- confusionMatrix(rf_model_pred, df_test$Churn)
rf_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1140  124
##        Yes  408  436
##                                           
##                Accuracy : 0.7476          
##                  95% CI : (0.7285, 0.7661)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.087           
##                                           
##                   Kappa : 0.4433          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7364          
##             Specificity : 0.7786          
##          Pos Pred Value : 0.9019          
##          Neg Pred Value : 0.5166          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5408          
##    Detection Prevalence : 0.5996          
##       Balanced Accuracy : 0.7575          
##                                           
##        'Positive' Class : No              
##

Support Vector Machines (SVM)

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

param_grid <- expand.grid(C = c(0.1, 1, 5, 10))
svm_model <- train(Churn~., data = df_train_smote, method = "svmLinear",
                   trControl = cross, tuneGrid = param_grid)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

# testing model
svm_model_pred <- predict(svm_model, newdata = df_test)

# evaluation
svm_model_confusion_mat <- confusionMatrix(svm_model_pred, df_test$Churn)
svm_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1055  100
##        Yes  493  460
##                                          
##                Accuracy : 0.7187         
##                  95% CI : (0.699, 0.7378)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 0.9501         
##                                          
##                   Kappa : 0.4109         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6815         
##             Specificity : 0.8214         
##          Pos Pred Value : 0.9134         
##          Neg Pred Value : 0.4827         
##              Prevalence : 0.7343         
##          Detection Rate : 0.5005         
##    Detection Prevalence : 0.5479         
##       Balanced Accuracy : 0.7515         
##                                          
##        'Positive' Class : No             
##

k-Nearest Neighbours (kNN)

# extract the best model
knn_model <- train(Churn~., data = df_train_smote, method = "knn",
                         trControl = ctrl)
knn_model

## k-Nearest Neighbors 
## 
## 5236 samples
##   22 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4713, 4713, 4712, 4712, 4712, 4712, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7265183  0.4530373
##   7  0.7322453  0.4644925
##   9  0.7330080  0.4660186
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

# testing model
knn_model_pred <- predict(knn_model, newdata = df_test)
knn_model_confusion_mat <- confusionMatrix(knn_model_pred, df_test$Churn)
knn_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1096  161
##        Yes  452  399
##                                           
##                Accuracy : 0.7092          
##                  95% CI : (0.6893, 0.7285)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.9956          
##                                           
##                   Kappa : 0.3607          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7080          
##             Specificity : 0.7125          
##          Pos Pred Value : 0.8719          
##          Neg Pred Value : 0.4689          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5199          
##    Detection Prevalence : 0.5963          
##       Balanced Accuracy : 0.7103          
##                                           
##        'Positive' Class : No              
##

Adaptive Boosting (AdaBoost)

library(ada)

## Loading required package: rpart

# training model
set.seed(123)
adaboost_model <- ada(Churn ~ ., data = df_train_smote, iter = 50)

# testing model
adaboost_model_pred <- predict(adaboost_model, newdata = df_test)
adaboost_model_confusion_mat <- confusionMatrix(adaboost_model_pred, factor(df_test$Churn))
adaboost_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1150  120
##        Yes  398  440
##                                           
##                Accuracy : 0.7543          
##                  95% CI : (0.7353, 0.7725)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.01973         
##                                           
##                   Kappa : 0.4563          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.7429          
##             Specificity : 0.7857          
##          Pos Pred Value : 0.9055          
##          Neg Pred Value : 0.5251          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5455          
##    Detection Prevalence : 0.6025          
##       Balanced Accuracy : 0.7643          
##                                           
##        'Positive' Class : No              
##

Xtreme Gradient Boodting (XGBoost)

library(xgboost)

## 
## Attaching package: 'xgboost'

## The following object is masked from 'package:dplyr':
## 
##     slice

str(df_train_smote)

## 'data.frame':    5236 obs. of  23 variables:
##  $ SeniorCitizen                       : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Partner                             : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 2 2 1 ...
##  $ Dependents                          : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 1 1 1 ...
##  $ tenure                              : num  41 25 54 53 4 16 72 14 61 5 ...
##  $ MultipleLines                       : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 2 1 ...
##  $ OnlineSecurity                      : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ OnlineBackup                        : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 2 2 ...
##  $ DeviceProtection                    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 1 1 1 ...
##  $ TechSupport                         : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 1 1 1 ...
##  $ StreamingTV                         : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 2 1 ...
##  $ StreamingMovies                     : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 2 2 1 ...
##  $ PaperlessBilling                    : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 2 ...
##  $ MonthlyCharges                      : num  89.2 54.3 90 92.5 56.8 ...
##  $ TotalCharges                        : num  3646 1297 4932 4779 245 ...
##  $ Churn                               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ InternetServiceDSL                  : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 1 1 ...
##  $ InternetServiceFiber.optic          : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 2 2 2 ...
##  $ InternetServiceNo                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ContractOne.year                    : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ContractTwo.year                    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
##  $ PaymentMethodCredit.card..automatic.: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ PaymentMethodElectronic.check       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 1 2 ...
##  $ PaymentMethodMailed.check           : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...

convert_factors_to_binary <- function(df, cols) {
  for (col in cols) {
    df[[col]] <- as.numeric(df[[col]]) - 1
  }
  return(df)
}

# Specify the columns to convert
columns_to_convert <- c("SeniorCitizen", "Partner", "Dependents", "MultipleLines",
                        "OnlineSecurity", "OnlineBackup", "DeviceProtection",
                        "TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling",
                        "Churn", "InternetServiceDSL", "InternetServiceFiber.optic",
                        "InternetServiceNo", "ContractOne.year", "ContractTwo.year",
                        "PaymentMethodCredit.card..automatic.", "PaymentMethodElectronic.check",
                        "PaymentMethodMailed.check")
# Convert the specified columns in df_train_smote
df_train_smote <- convert_factors_to_binary(df_train_smote, columns_to_convert)

xgb_model <- xgboost(data = as.matrix(df_train_smote[, -c(15)]),
                     label = df_train_smote$Churn,
                     nrounds = 50, 
                     objective = "binary:logistic",
                     eval_metric = "logloss",
                     max_depth = 3,
                     eta = 0.1, 
                     nthread = 4,
                     verbose = 0)

df_test <- convert_factors_to_binary(df_test, columns_to_convert)

dtest <- xgb.DMatrix(as.matrix(df_test[, -c(15)]))
xgb_model_pred <- predict(xgb_model, dtest)
xgb_model_pred <- ifelse(xgb_model_pred >= 0.5, 1, 0)

# testing model
xgb_model_pred <- as.factor(xgb_model_pred)
df_test$Churn <- as.factor(df_test$Churn)
xgb_model_confusion_mat <- confusionMatrix(xgb_model_pred, df_test$Churn)
xgb_model_confusion_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1118   99
##          1  430  461
##                                         
##                Accuracy : 0.7491        
##                  95% CI : (0.73, 0.7674)
##     No Information Rate : 0.7343        
##     P-Value [Acc > NIR] : 0.06564       
##                                         
##                   Kappa : 0.4589        
##                                         
##  Mcnemar's Test P-Value : < 2e-16       
##                                         
##             Sensitivity : 0.7222        
##             Specificity : 0.8232        
##          Pos Pred Value : 0.9187        
##          Neg Pred Value : 0.5174        
##              Prevalence : 0.7343        
##          Detection Rate : 0.5304        
##    Detection Prevalence : 0.5773        
##       Balanced Accuracy : 0.7727        
##                                         
##        'Positive' Class : 0             
##

Model Comparison

Evaluation Metrics

Models are compared and evaluated with the below metrics:

Accuracy
Precision
Sensitivity
Specificity
F1-score
AUC-ROC

Receiver Operating Characteristics (ROC) Curves

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

df_test$Churn <- as.numeric(df_test$Churn) - 1
nb_model_pred <- as.numeric(nb_model_pred) - 1
roc_nb <- roc(df_test$Churn, nb_model_pred)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

log_reg_pred <- as.numeric(log_reg_pred) - 1
roc_lr <- roc(df_test$Churn, log_reg_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

rf_model_pred <- as.numeric(rf_model_pred) - 1
roc_rf <- roc(df_test$Churn, rf_model_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

svm_model_pred <- as.numeric(svm_model_pred) - 1
roc_svm <- roc(df_test$Churn, svm_model_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

knn_model_pred <- as.numeric(knn_model_pred) - 1
roc_knn <- roc(df_test$Churn, knn_model_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

adaboost_model_pred <- as.numeric(adaboost_model_pred) - 1
roc_adaboost <- roc(df_test$Churn, adaboost_model_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

xgb_model_pred <- as.numeric(xgb_model_pred) - 1
roc_xgb <- roc(df_test$Churn, xgb_model_pred)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

plot(roc_nb, col = "blue", main = "AUC-ROC Curve", print.auc = F)
plot(roc_lr, col = "red", add = TRUE, print.auc = F)
plot(roc_rf, col = "green", add = TRUE, print.auc = F)
plot(roc_svm, col = "orange", add = TRUE, print.auc = F)
plot(roc_knn, col = "purple", add = TRUE, print.auc = F)
plot(roc_adaboost, col = "brown", add = TRUE, print.auc = F)
plot(roc_xgb, col = "black", add = TRUE, print.auc = F)

# Add legend
legend("topright", legend = c("Naive Bayes", 
                              "Logistic Regression", 
                              "Random Forest", 
                              "SVM", 
                              "KNN", 
                              "AdaBoost", 
                              "XGBoost"), 
       col = c("blue", 
               "red", 
               "green", 
               "orange", 
               "purple", 
               "brown", 
               "black"), lwd = 2)

result_nb <- c(accuracy = nb_model_confusion_mat$overall["Accuracy"], 
               precision = nb_model_confusion_mat$byClass["Precision"],
               sensitivity = nb_model_confusion_mat$byClass["Sensitivity"],
               specificity = nb_model_confusion_mat$byClass["Specificity"],
               f1_score = nb_model_confusion_mat$byClass["F1"],
               auc_roc = roc_nb$auc)

result_lr <- c(accuracy = log_reg_confusion_mat$overall["Accuracy"], 
               precision = log_reg_confusion_mat$byClass["Precision"],
               sensitivity = log_reg_confusion_mat$byClass["Sensitivity"],
               specificity = log_reg_confusion_mat$byClass["Specificity"],
               f1_score = log_reg_confusion_mat$byClass["F1"],
               auc_roc = roc_lr$auc)

result_rf <- c(accuracy = rf_model_confusion_mat$overall["Accuracy"], 
               precision = rf_model_confusion_mat$byClass["Precision"],
               sensitivity = rf_model_confusion_mat$byClass["Sensitivity"],
               specificity = rf_model_confusion_mat$byClass["Specificity"],
               f1_score = rf_model_confusion_mat$byClass["F1"],
               auc_roc = roc_rf$auc)

result_svm <- c(accuracy = svm_model_confusion_mat$overall["Accuracy"], 
                precision = svm_model_confusion_mat$byClass["Precision"],
                sensitivity = svm_model_confusion_mat$byClass["Sensitivity"],
                specificity = svm_model_confusion_mat$byClass["Specificity"],
                f1_score = svm_model_confusion_mat$byClass["F1"],
                auc_roc = roc_svm$auc)

result_knn <- c(accuracy = knn_model_confusion_mat$overall["Accuracy"], 
                precision = knn_model_confusion_mat$byClass["Precision"],
                sensitivity = knn_model_confusion_mat$byClass["Sensitivity"],
                specificity = knn_model_confusion_mat$byClass["Specificity"],
                f1_score = knn_model_confusion_mat$byClass["F1"],
                auc_roc = roc_knn$auc)

result_adaboost <- c(accuracy = adaboost_model_confusion_mat$overall["Accuracy"], 
                precision = adaboost_model_confusion_mat$byClass["Precision"],
                sensitivity = adaboost_model_confusion_mat$byClass["Sensitivity"],
                specificity = adaboost_model_confusion_mat$byClass["Specificity"],
                f1_score = adaboost_model_confusion_mat$byClass["F1"],
                auc_roc = roc_adaboost$auc)

result_xgb <- c(accuracy = xgb_model_confusion_mat$overall["Accuracy"], 
                precision = xgb_model_confusion_mat$byClass["Precision"],
                sensitivity = xgb_model_confusion_mat$byClass["Sensitivity"],
                specificity = xgb_model_confusion_mat$byClass["Specificity"],
                f1_score = xgb_model_confusion_mat$byClass["F1"],
                auc_roc = roc_xgb$auc)

summary_result <- data.frame(NaiveBayes = result_nb,
                             LogisticRegression = result_lr,
                             RandomForest = result_rf,
                             SVM = result_svm,
                             KNN = result_knn,
                             AdaBoost = result_adaboost,
                             XGBoost = result_xgb)
summary_result <- t(summary_result)
summary_result

##                    accuracy.Accuracy precision.Precision
## NaiveBayes                 0.7504744           0.9029968
## LogisticRegression         0.7533207           0.9145161
## RandomForest               0.7476281           0.9018987
## SVM                        0.7186907           0.9134199
## KNN                        0.7092030           0.8719173
## AdaBoost                   0.7542694           0.9055118
## XGBoost                    0.7490512           0.9186524
##                    sensitivity.Sensitivity specificity.Specificity f1_score.F1
## NaiveBayes                       0.7396641               0.7803571   0.8132102
## LogisticRegression               0.7325581               0.8107143   0.8134864
## RandomForest                     0.7364341               0.7785714   0.8108108
## SVM                              0.6815245               0.8214286   0.7806141
## KNN                              0.7080103               0.7125000   0.7814617
## AdaBoost                         0.7428941               0.7857143   0.8161817
## XGBoost                          0.7222222               0.8232143   0.8086799
##                      auc_roc
## NaiveBayes         0.7600106
## LogisticRegression 0.7716362
## RandomForest       0.7575028
## SVM                0.7514766
## KNN                0.7102552
## AdaBoost           0.7643042
## XGBoost            0.7727183

Based on the metrics output, all models demonstrated competitive performance across multiple metrics.
All models achieved relatively high values in these metrics, indicating a good balance between identifying churn cases and acoiding false positives.
Further evaluation and considering additional factors, such as interpretability and computational requirements, Logistic Regression model will be chosen as the final model due to its simplicity.

Association Analysis

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following object is masked from 'package:kernlab':
## 
##     size

## The following object is masked from 'package:car':
## 
##     recode

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

# Select rows that all have internet service
Telco_customer_churn_with_InternetService <- df[df$InternetService != 'No', ]

# Features extraction
association_rule_features <- Telco_customer_churn_with_InternetService[, c('PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies')]

# Convert the association rule features data frame to a transaction object
association_transactions <- as(association_rule_features, "transactions")

# Fitting model
# Training Apriori on the dataset with support > 0.2 , confidence > 0.7 and max item number = 4
set.seed = 123 # Setting seed
associa_rules <- apriori(association_transactions, 
                         parameter = list(support = 0.2, 
                                          confidence = 0.7,
                                          maxlen = 4),
                         control = list(verbose =TRUE))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1102 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 5512 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4

## Warning in apriori(association_transactions, parameter = list(support = 0.2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!

##  done [0.00s].
## writing ... [185 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# The Apriori algorithm generated 185 rules with the parameter of support = 2% and confidence = 80%.
# The maximum length number of rules involved 4 item sets.
# Absolute minimum is 2% from the total of 5512 transactions, which is 1102

library(RColorBrewer)
itemFrequencyPlot(items(associa_rules), topN = 12, 
                  col = brewer.pal(8, 'Pastel2'),
                  main = 'Relative Item Frequency Plot',
                  type = "relative",
                  ylab = "Item Frequency (Relative)")

According to the plot, majority of data composed of those with phone services
The items frequency of them with streaming TV, Streaming movies, devices protection, online backup and online security were far lower

# Visualizing the results
inspect(sort(associa_rules, by = 'lift')[1:20])

##      lhs                        rhs                     support confidence  coverage     lift count
## [1]  {DeviceProtection=Yes,                                                                        
##       StreamingTV=Yes}       => {StreamingMovies=Yes} 0.2273222  0.8016635 0.2835631 1.618004  1253
## [2]  {PhoneService=Yes,                                                                            
##       DeviceProtection=Yes,                                                                        
##       StreamingMovies=Yes}   => {StreamingTV=Yes}     0.2022859  0.7919034 0.2554427 1.614862  1115
## [3]  {PhoneService=Yes,                                                                            
##       DeviceProtection=Yes,                                                                        
##       StreamingTV=Yes}       => {StreamingMovies=Yes} 0.2022859  0.7998565 0.2529028 1.614357  1115
## [4]  {DeviceProtection=Yes,                                                                        
##       StreamingMovies=Yes}   => {StreamingTV=Yes}     0.2273222  0.7860728 0.2891872 1.602972  1253
## [5]  {DeviceProtection=No,                                                                         
##       TechSupport=No,                                                                              
##       StreamingTV=No}        => {StreamingMovies=No}  0.2117199  0.7811245 0.2710450 1.548205  1167
## [6]  {DeviceProtection=No,                                                                         
##       TechSupport=No,                                                                              
##       StreamingMovies=No}    => {StreamingTV=No}      0.2117199  0.7853297 0.2695936 1.541024  1167
## [7]  {DeviceProtection=No,                                                                         
##       StreamingTV=No}        => {StreamingMovies=No}  0.2726778  0.7691914 0.3544993 1.524553  1503
## [8]  {PhoneService=Yes,                                                                            
##       DeviceProtection=No,                                                                         
##       StreamingTV=No}        => {StreamingMovies=No}  0.2340348  0.7628622 0.3067852 1.512009  1290
## [9]  {DeviceProtection=No,                                                                         
##       StreamingMovies=No}    => {StreamingTV=No}      0.2726778  0.7680123 0.3550435 1.507043  1503
## [10] {PhoneService=Yes,                                                                            
##       DeviceProtection=No,                                                                         
##       StreamingMovies=No}    => {StreamingTV=No}      0.2340348  0.7624113 0.3069666 1.496052  1290
## [11] {OnlineBackup=No,                                                                             
##       StreamingMovies=No}    => {StreamingTV=No}      0.2407475  0.7604585 0.3165820 1.492220  1327
## [12] {OnlineBackup=No,                                                                             
##       StreamingTV=No}        => {StreamingMovies=No}  0.2407475  0.7480271 0.3218433 1.482605  1327
## [13] {PhoneService=Yes,                                                                            
##       OnlineBackup=No,                                                                             
##       StreamingMovies=No}    => {StreamingTV=No}      0.2060958  0.7538155 0.2734035 1.479185  1136
## [14] {TechSupport=No,                                                                              
##       StreamingTV=No}        => {StreamingMovies=No}  0.2677794  0.7439516 0.3599419 1.474528  1476
## [15] {PhoneService=Yes,                                                                            
##       OnlineBackup=No,                                                                             
##       StreamingTV=No}        => {StreamingMovies=No}  0.2060958  0.7434555 0.2772134 1.473544  1136
## [16] {PhoneService=Yes,                                                                            
##       TechSupport=No,                                                                              
##       StreamingTV=No}        => {StreamingMovies=No}  0.2347605  0.7423982 0.3162192 1.471449  1294
## [17] {TechSupport=No,                                                                              
##       StreamingMovies=No}    => {StreamingTV=No}      0.2677794  0.7496191 0.3572206 1.470951  1476
## [18] {PhoneService=Yes,                                                                            
##       TechSupport=No,                                                                              
##       StreamingMovies=No}    => {StreamingTV=No}      0.2347605  0.7428243 0.3160377 1.457618  1294
## [19] {PhoneService=Yes,                                                                            
##       StreamingMovies=Yes}   => {StreamingTV=Yes}     0.3154935  0.7147554 0.4414006 1.457541  1739
## [20] {OnlineSecurity=No,                                                                           
##       StreamingTV=No}        => {StreamingMovies=No}  0.2443759  0.7320652 0.3338171 1.450969  1347

plot(associa_rules, method = "graph", measure = "confidence", shading = "lift")

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

The top 20 rules were inspected.
To determine the existing services that attained the customers,only rhs=yes is focused

People who open device protection and streamingTV services are very likely to open streaming movies service, with support = 0.23, confidence = 0.80, lift = 1.62;
People who open phone services, device protection and streamingMovies services are very likely to open streaming TV service, with support = 0.20, confidence = 0.80, lift = 1.62;

# 3rd objective -> promote streaming Movies and streaming TV sales 
library(dplyr) 
aveMonthlyCharges <- mean(df$MonthlyCharges) 
#find MonthlyCharges for customer who open StreamingTV and StreamingMovies 
averageMonthlyCharges<-df %>% group_by(StreamingMovies,StreamingTV) %>% summarise(Average = mean(MonthlyCharges))

## `summarise()` has grouped output by 'StreamingMovies'. You can override using
## the `.groups` argument.

avgMonthlyChargesBoth <- averageMonthlyCharges[4,3] 
avgMonthlyChargesMovies <- averageMonthlyCharges[3,3] 
avgMonthlyChargesTV <- averageMonthlyCharges[2,3]
avgMonthlyChargesMoviesorTV <- sum(avgMonthlyChargesTV,avgMonthlyChargesMovies)/2

print(paste('Average monthly charges for customers with both Movies and TV =',avgMonthlyChargesBoth))

## [1] "Average monthly charges for customers with both Movies and TV = 93.2438886023724"

print(paste('Average monthly charges for customers with only Movies =',avgMonthlyChargesMovies))

## [1] "Average monthly charges for customers with only Movies = 76.8117424242424"

print(paste('Average monthly charges for customers with only TV =',avgMonthlyChargesTV))

## [1] "Average monthly charges for customers with only TV = 77.418390052356"

print(paste('Average monthly charges for customers with only 1 service =',avgMonthlyChargesMoviesorTV))

## [1] "Average monthly charges for customers with only 1 service = 77.1150662382992"

totalCount <- df %>% count(df$StreamingMovies,StreamingTV) 
totalCountBoth <- totalCount[4,3] 
totalCountMovies <- totalCount[3,3] 
totalCountTV <- totalCount[2,3] 
totalCountMoviesorTV <- sum(totalCountMovies,totalCountTV)/2 
print(paste('Total count for customers with both Movies and TV =', totalCountBoth))

## [1] "Total count for customers with both Movies and TV = 1939"

print(paste('Total count for customers with only Movies =', totalCountMovies))

## [1] "Total count for customers with only Movies = 792"

print(paste('Total count for customers with only TV =', totalCountTV))

## [1] "Total count for customers with only TV = 764"

print(paste('Total count for customers with only 1 service =', totalCountMoviesorTV))

## [1] "Total count for customers with only 1 service = 778"

diffenceMonthlyCharges <- avgMonthlyChargesBoth - avgMonthlyChargesMoviesorTV 
print(paste('Difference in monthly charges of both services and 1 service =', diffenceMonthlyCharges))

## [1] "Difference in monthly charges of both services and 1 service = 16.1288223640731"

Sales / revenue can be increased if it is able to attract customers with only 1 services (792) to open both services as the monthly charges different is less than 20 only

Assuming bigger discount could attract more customers to purchase
Setting maximum discount allowed to 30%
Assuming half of the customers (50%) with only one Streaming Service will also have another Streaming Service by given 30% discount.

#Set the range of discount percentages
discount_percent <- seq(0, 0.3, length.out = 11) 
discount_percent

##  [1] 0.00 0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.30

for (i in discount_percent) {
  attractive_percent <- i / 0.6
  loss <- as.numeric(diffenceMonthlyCharges) * i * totalCountBoth
  gain <- as.numeric(diffenceMonthlyCharges) * (1 - i) * (totalCountMovies+totalCountTV) * attractive_percent
  marginal_profit <- gain - loss
  print(paste("The marginal profit is:", marginal_profit, "when the discount is", i))
}

## [1] "The marginal profit is: 0 when the discount is 0"
## [1] "The marginal profit is: 278.964111609009 when the discount is 0.03"
## [1] "The marginal profit is: 482.638880422524 when the discount is 0.06"
## [1] "The marginal profit is: 611.024306440547 when the discount is 0.09"
## [1] "The marginal profit is: 664.120389663075 when the discount is 0.12"
## [1] "The marginal profit is: 641.92713009011 when the discount is 0.15"
## [1] "The marginal profit is: 544.444527721653 when the discount is 0.18"
## [1] "The marginal profit is: 371.672582557701 when the discount is 0.21"
## [1] "The marginal profit is: 123.611294598258 when the discount is 0.24"
## [1] "The marginal profit is: -199.739336156681 when the discount is 0.27"
## [1] "The marginal profit is: -598.379309707114 when the discount is 0.3"

Result showed that the marginal profit is the highest when discount percentage is 12%.
Hence, it is suggested that the telco company is likely to have existing customers with only 1 service to subscribe for second service at an optimum discount rate of 12%.

Conclusion

In conclusion, multiple classification models were adopted, namely Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Adaptive Boosting (AdaBoost), Xtreme Gradient Boodting (XGBoost). Despite each model perform relatively well in predicting the customer churn, Logistic Regression was chosen as the final model to be implemented due to its simplicity and generally less computational expensive.
Based on the estimates of Logistic Regression, variables as stated below were found to be statistically significant at 5% signicance level in predicting the customer churn likelihood. This includes:

SeniorCitizen: Indicator of whether the customer is 65 years old or older
tenure: Total amount of months that the customer has been with the company
MultipleLine: Indicator of whether the customer subscribes to multiple telephone lines
OnlineSecurity: Indicator of whether the customer subscribes to additional online security service
TechSupport: Indicator if the customer subscribes to an additional technical support plan
StreamingTV: Indicator if the customer uses their Internet service to stream television programing from a third party provider
StreamingMovies: Indicator if the customer uses their Internet service to stream movies from a third party provider
PaperlessBilling: Indicator if the customer has chosen paperless billing
InternetService: Type of Internet service that the customer has with the company
Contract: Customer’s current contract type

After that, association analysis was conducted to determine the relationship between the factors. It is found that people who subscribe for device protection and streaming TV services are very likely to subscribe for streaming movies service.
The same goes for people who have subscribed to open phone services, device protection and streaming movies services, the customer has high change to subscribe streaming TV service as well.

Last but not least, it is recommended that the Telco company to have discount of 12% for the existing customers with only one service to subscribe for more service. By having this suggested discount rate, the marginal profit is estimated to have increased by approximately $644.12

Prediction and Association Analysis of Telco Customer Churn

Group 8: 888

Date: 2023-06-15