| Name | Matric No |
|---|---|
| Choo En Ming | 22079929 |
| Han Die | 22062358 |
| Lim Min Kye Andy | 22083095 |
| Soong Sing Ying | S2191652 |
| Tneu Zhen Yang | 22081192 |
Customer churn, the loss of customers or subscribers, can have a significant impact on a company’s profitability. For businesses, retaining existing customers is more cost-effective than acquiring new ones. Understanding the factors that contribute to customer churn is essential for reducing churn rates and improving revenue. This project aims to analyze customer data and develop targeted customer retention programs in the competitive telecommunication industry.
Data Collection: The Telco Customer Churn dataset was obtained from Kaggle. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Data Cleaning: The process starts by performing some initial data
cleaning steps. This includes setting the row names / index using the
customerID column, converting factor columns into that of
factor type. Followed by displaying the summary statistics, structure
information, etc.
Exploratory Data Analysis: Various visualizations info graphics are displayed to understand different aspects of the dataset, such as distribution, kurtosis and skewness of the variables, identifying possible outliers and correlations between the variables.
Data Preprocessing: Chi-squared tests are adopted to identify if
there is significant association categorical variables associated with
Churn, performs one-hot encoding on selected variables, and
applies SMOTE oversampling to create a more balanced training set for
modelling.
Modelling: Several classification machine learning models are trained to predict customer churn, including:
Model Evaluation: The evaluation results (accuracy, precision, sensitivity, specificity, F1-score) for each model are summarized in a data frame, and the performance is visualized using ROC curves.
Association Analysis: Apriori algorithm is utilised to conduct association rule mining on Telco customer churn data with Internet service. It mines not only the association rules but also generates visualizations to uncover the most meaningful associations among various features.
# read data file
df <- read.csv("/Users/chooenming/Desktop/Programming for Data Science/Telco_Customer_Churn_Data.csv",
header = T, stringsAsFactors = T)
rownames(df) <- df$customerID # put customerID as index
df$customerID <- NULL # remove customerID from variables
str(df) #
## 'data.frame': 7043 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
Dataset has 7043 observations of 20 variables
# Change SeniorCitizen to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
# return distinct values of those factor variables
for(i in colnames(df)){
if(is.factor(df[[i]])){
cat(i, "\n distinct values: ", levels(df[[i]]), "\n")
}
}
## gender
## distinct values: Female Male
## SeniorCitizen
## distinct values: 0 1
## Partner
## distinct values: No Yes
## Dependents
## distinct values: No Yes
## PhoneService
## distinct values: No Yes
## MultipleLines
## distinct values: No No phone service Yes
## InternetService
## distinct values: DSL Fiber optic No
## OnlineSecurity
## distinct values: No No internet service Yes
## OnlineBackup
## distinct values: No No internet service Yes
## DeviceProtection
## distinct values: No No internet service Yes
## TechSupport
## distinct values: No No internet service Yes
## StreamingTV
## distinct values: No No internet service Yes
## StreamingMovies
## distinct values: No No internet service Yes
## Contract
## distinct values: Month-to-month One year Two year
## PaperlessBilling
## distinct values: No Yes
## PaymentMethod
## distinct values: Bank transfer (automatic) Credit card (automatic) Electronic check Mailed check
## Churn
## distinct values: No Yes
# replace those with more than 3 levels to 2 levels if possible
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- df %>%
mutate(MultipleLines = as.factor(case_when(MultipleLines == "No phone service"~"No",
MultipleLines == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(OnlineSecurity = as.factor(case_when(OnlineSecurity == "No internet service"~"No",
OnlineSecurity == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(OnlineBackup = as.factor(case_when(OnlineBackup == "No internet service"~"No",
OnlineBackup == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(DeviceProtection = as.factor(case_when(DeviceProtection == "No internet service"~"No",
DeviceProtection == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(TechSupport = as.factor(case_when(TechSupport == "No internet service"~"No",
TechSupport == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(StreamingTV = as.factor(case_when(StreamingTV == "No internet service"~"No",
StreamingTV == "No"~"No",
TRUE ~ "Yes"))) %>%
mutate(StreamingMovies = as.factor(case_when(StreamingMovies == "No internet service"~"No",
StreamingMovies == "No"~"No",
TRUE ~ "Yes")))
# check for missing values
sum(is.na(df))
## [1] 11
There are 11 rows that contain missing data.
Since the amount is relatively small, the rows with missing value will
be removed
# Remove missing rows
df <- na.omit(df)
summary(df)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3483 0:5890 No :3639 No :4933 Min. : 1.00 No : 680
## Male :3549 1:1142 Yes:3393 Yes:2099 1st Qu.: 9.00 Yes:6352
## Median :29.00
## Mean :32.42
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## No :4065 DSL :2416 No :5017 No :4607 No :4614
## Yes:2967 Fiber optic:3096 Yes:2015 Yes:2425 Yes:2418
## No :1520
##
##
##
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## No :4992 No :4329 No :4301 Month-to-month:3875 No :2864
## Yes:2040 Yes:2703 Yes:2731 One year :1472 Yes:4168
## Two year :1685
##
##
##
## PaymentMethod MonthlyCharges TotalCharges Churn
## Bank transfer (automatic):1542 Min. : 18.25 Min. : 18.8 No :5163
## Credit card (automatic) :1521 1st Qu.: 35.59 1st Qu.: 401.4 Yes:1869
## Electronic check :2365 Median : 70.35 Median :1397.5
## Mailed check :1604 Mean : 64.80 Mean :2283.3
## 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
# check if each attribute is numeric
is_number <- sapply(df, is.numeric)
# Plot boxplot for numerical attributes to check for outliers
par(mfrow = c(ceiling(sum(is_number)/3), 3))
for(col in colnames(df)){
if(is_number[col]){
boxplot(df[[col]], main = col)
}
}
No outliers were found for all numerical attributes
library(ggplot2)
library(cowplot)
library(ggcorrplot)
g1 <- ggplot(data = df , aes(x = Churn,fill = Churn)) + geom_bar() + geom_text(stat = 'count' , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5) + theme_minimal() + ggtitle("Number of customer churn") + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g2 <- ggplot(data = df , aes(x = gender,fill = gender)) + geom_bar() + geom_text(stat = 'count' , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5) + theme_minimal() + ggtitle("Number by Gender") + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g3 <- ggplot(data = df, aes(x = Churn,fill = gender)) + geom_bar(stat = "count",position = position_dodge()) + geom_text(stat = "count" , aes(label = paste(NULL , formatC(..count..))),vjust = 3.5 , position = position_dodge(0.9)) + ggtitle("Customers Churn by Gender") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
print(plot_grid(g1,g2,g3, ncol = 2, nrow = 2))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
g1 = ggplot(data = df , aes(x = Churn , y = MonthlyCharges, fill = Churn)) + geom_bar(stat = "summary" , fun = "median") +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of MonthlyCharges") +
ggtitle("Median monthly charges of customers") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
g2 = ggplot(data = df , aes(x = Churn , y = tenure, fill = Churn)) + geom_bar(stat = "summary" , fun = "median") +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of tenure") +
ggtitle("Median tenure of customers") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
g3 = ggplot(data = df , aes(x = Churn , y = TotalCharges, fill = Churn)) + geom_bar(stat = "summary" , fun = "median") +
stat_summary(aes(label = paste(..y..)) , fun = median , geom = "text" , vjust = 3.5) + labs( y = "Median of TotalCharges") +
ggtitle("Median total charges") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
print(plot_grid(g1,g2,g3, ncol = 2, nrow = 2))
g1 <- ggplot(data = df, aes(x = Churn,fill = Contract)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g2 <- ggplot(data = df, aes(x = Churn,fill = PaymentMethod)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g3 <- ggplot(data = df, aes(x = Churn,fill = InternetService)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("Customers Churn by Contract Type") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
print(plot_grid(g1,g2,g3, ncol = 1, nrow = 3))
g1 <- ggplot(data = df, aes(x = Churn,fill = OnlineSecurity)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("OnlineSecurity") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g2 <- ggplot(data = df, aes(x = Churn,fill = OnlineBackup)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("OnlineBackup") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g3 <- ggplot(data = df, aes(x = Churn,fill = DeviceProtection)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("DeviceProtection") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g4 <- ggplot(data = df, aes(x = Churn,fill = TechSupport)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("TechSupport") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g5 <- ggplot(data = df, aes(x = Churn,fill = StreamingTV)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("StreamingTV") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
g6 <- ggplot(data = df, aes(x = Churn,fill = StreamingMovies)) + geom_bar(stat = "count",position = position_dodge()) + ggtitle("StreamingMovies") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(y="Count")
print(plot_grid(g1,g2,g3,g4,g5,g6, ncol = 2, nrow = 3))
g1 <- ggplot(data = df , aes(x = tenure)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of tenure") + theme(plot.title = element_text(hjust = 0.5))
g2 <- ggplot(data = df , aes(x = tenure)) + geom_boxplot(fill="lightgreen") + theme_classic() + ggtitle("Boxplot of tenure") + theme(plot.title = element_text(hjust = 0.5))
g3 <- ggplot(data = df , aes(x = MonthlyCharges)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of MonthlyCharges") + theme(plot.title = element_text(hjust = 0.5))
g4 <- ggplot(data = df , aes(x = MonthlyCharges)) + geom_boxplot(fill = "lightgreen") + theme_classic() + ggtitle("Boxplot of MonthlyCharges") + theme(plot.title = element_text(hjust = 0.5))
g5 <- ggplot(data = df , aes(x = TotalCharges)) + geom_histogram(aes(y = ..density..),color = "darkblue",fill="lightblue" , binwidth = 5) + theme_classic() + ggtitle("Histogram of TotalCharges") + theme(plot.title = element_text(hjust = 0.5))
g6 <- ggplot(data = df , aes(x = TotalCharges)) + geom_boxplot(fill = "lightgreen") + theme_classic() + ggtitle("Boxplot of TotalCharges") + theme(plot.title = element_text(hjust = 0.5))
print(plot_grid(g1,g2,g3,g4,g5,g6,ncol = 2, nrow = 3))
To show the correlation among the numerical variables
data_col <- df[ , c('tenure','MonthlyCharges','TotalCharges')]
corr <- cor(data_col)
ggcorrplot(corr , hc.order = TRUE , type = 'lower', lab = TRUE)
To include only those having significant association to
Churn
ChurnChurnvar_categorical <- df[, sapply(df, is.factor)]
colnames(var_categorical)
## [1] "gender" "SeniorCitizen" "Partner" "Dependents"
## [5] "PhoneService" "MultipleLines" "InternetService" "OnlineSecurity"
## [9] "OnlineBackup" "DeviceProtection" "TechSupport" "StreamingTV"
## [13] "StreamingMovies" "Contract" "PaperlessBilling" "PaymentMethod"
## [17] "Churn"
chi_sq <- lapply(var_categorical[, -which(colnames(var_categorical) == "Churn")],
function(x) chisq.test(var_categorical[, which(colnames(var_categorical) == "Churn")], x))
chi_sq_result <- data.frame(Variable = colnames(var_categorical)[-which(colnames(var_categorical) == "Churn")],
p_value = sapply(chi_sq, function(x) x$p.value),
stringsAsFactors = F)
chi_sq_result$Conclusion <- ifelse(chi_sq_result$p_value > 0.05, "No Association", "Significant Association")
chi_sq_result
## Variable p_value Conclusion
## gender gender 4.904885e-01 No Association
## SeniorCitizen SeniorCitizen 2.479256e-36 Significant Association
## Partner Partner 3.973798e-36 Significant Association
## Dependents Dependents 2.019659e-42 Significant Association
## PhoneService PhoneService 3.499240e-01 No Association
## MultipleLines MultipleLines 8.694083e-04 Significant Association
## InternetService InternetService 5.831199e-159 Significant Association
## OnlineSecurity OnlineSecurity 1.374240e-46 Significant Association
## OnlineBackup OnlineBackup 6.259257e-12 Significant Association
## DeviceProtection DeviceProtection 3.346075e-08 Significant Association
## TechSupport TechSupport 3.232868e-43 Significant Association
## StreamingTV StreamingTV 1.316434e-07 Significant Association
## StreamingMovies StreamingMovies 3.857900e-07 Significant Association
## Contract Contract 7.326182e-257 Significant Association
## PaperlessBilling PaperlessBilling 8.236203e-58 Significant Association
## PaymentMethod PaymentMethod 1.426310e-139 Significant Association
Result indicates that gender and
PhoneService have no association to Churn at
5% significance level.
library(caret)
## Loading required package: lattice
col_encode <- c("InternetService", "Contract", "PaymentMethod")
df_encode <- data.frame(df[, !names(df) %in% col_encode],
model.matrix(~. - 1, data = df[, col_encode]))
str(df_encode)
## 'data.frame': 7032 obs. of 25 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ PaperlessBilling : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## $ InternetServiceDSL : num 1 1 1 1 0 0 0 1 0 1 ...
## $ InternetServiceFiber.optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ContractOne.year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo.year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaymentMethodCredit.card..automatic.: num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic.check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed.check : num 0 1 1 0 0 0 0 1 0 0 ...
df_encode <- df_encode %>%
mutate(
InternetServiceDSL = as.factor(InternetServiceDSL),
InternetServiceFiber.optic = as.factor(InternetServiceFiber.optic),
InternetServiceNo = as.factor(InternetServiceNo),
ContractOne.year = as.factor(ContractOne.year),
ContractTwo.year = as.factor(ContractTwo.year),
PaymentMethodCredit.card..automatic. = as.factor(PaymentMethodCredit.card..automatic.),
PaymentMethodElectronic.check = as.factor(PaymentMethodElectronic.check),
PaymentMethodMailed.check = as.factor(PaymentMethodMailed.check)
)
df_churn_summary <- df_encode %>% group_by(Churn) %>%
summarise(count = n(), percentage = n() / nrow(df_encode) * 100)
df_churn_summary
## # A tibble: 2 × 3
## Churn count percentage
## <fct> <int> <dbl>
## 1 No 5163 73.4
## 2 Yes 1869 26.6
The data is quite imbalanced, where customer churning from the telco subscription has only 26.58% of the total customer base.
Since data is quite imbalanced, Synthetic Minority Oversampling (SMOTE) is used to create more evenly distributed training set
set.seed(123)
df_churn <- createDataPartition(df_encode$Churn, p = 0.7, list = F)
col_exclude <- c("gender", "PhoneService")
df_train <- df_encode[df_churn, setdiff(names(df_encode), col_exclude)]
df_test <- df_encode[-df_churn, setdiff(names(df_encode), col_exclude)]
library(performanceEstimation)
df_train_smote <- smote(Churn~., data = data.frame(df_train), perc.over = 1, perc.under = 2 )
# Naive Bayes as the baseline
library(e1071)
# training model
nb_model <- naiveBayes(Churn~., data = df_train_smote)
# testing model
nb_model_pred <- predict(nb_model, newdata = df_test)
# evaluation
library(caret)
nb_model_confusion_mat <- confusionMatrix(nb_model_pred, df_test$Churn)
nb_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1145 123
## Yes 403 437
##
## Accuracy : 0.7505
## 95% CI : (0.7314, 0.7688)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.04857
##
## Kappa : 0.4485
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7397
## Specificity : 0.7804
## Pos Pred Value : 0.9030
## Neg Pred Value : 0.5202
## Prevalence : 0.7343
## Detection Rate : 0.5432
## Detection Prevalence : 0.6015
## Balanced Accuracy : 0.7600
##
## 'Positive' Class : No
##
Assumptions: observations independent of each other no multicollinearity among the independent variables
# cross validation with 10 fold
cross <- trainControl(method = "cv", number = 10, classProbs = T, summaryFunction = twoClassSummary)
# model fitting
log_reg <- train(Churn~., data = df_train_smote, method = "glm", metric = "ROC",
preProcess = c("center", "scale"), trControl = cross)
summary(log_reg)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26699 -0.70490 0.09563 0.72727 3.05962
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26379 0.04421 -5.967 2.42e-09 ***
## SeniorCitizen1 -0.03294 0.03737 -0.881 0.378053
## PartnerYes -0.07248 0.04146 -1.748 0.080416 .
## DependentsYes -0.09526 0.04127 -2.308 0.020984 *
## tenure -1.70405 0.15315 -11.126 < 2e-16 ***
## MultipleLinesYes 0.21393 0.05003 4.276 1.90e-05 ***
## OnlineSecurityYes -0.09721 0.04192 -2.319 0.020385 *
## OnlineBackupYes -0.09055 0.04161 -2.176 0.029553 *
## DeviceProtectionYes 0.11118 0.04398 2.528 0.011480 *
## TechSupportYes -0.10531 0.04283 -2.459 0.013942 *
## StreamingTVYes 0.23721 0.05377 4.411 1.03e-05 ***
## StreamingMoviesYes 0.26257 0.05572 4.713 2.45e-06 ***
## PaperlessBillingYes 0.18641 0.03880 4.805 1.55e-06 ***
## MonthlyCharges -0.99276 0.21170 -4.689 2.74e-06 ***
## TotalCharges 1.06164 0.15888 6.682 2.36e-11 ***
## InternetServiceDSL1 1.44546 0.60192 2.401 0.016332 *
## InternetServiceFiber.optic1 2.26206 0.65458 3.456 0.000549 ***
## InternetServiceNo1 0.58112 0.50400 1.153 0.248907
## ContractOne.year1 -0.33529 0.04156 -8.068 7.14e-16 ***
## ContractTwo.year1 -0.61591 0.06417 -9.598 < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.03420 0.04644 -0.737 0.461420
## PaymentMethodElectronic.check1 0.13800 0.05055 2.730 0.006337 **
## PaymentMethodMailed.check1 -0.07659 0.05022 -1.525 0.127244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7258.6 on 5235 degrees of freedom
## Residual deviance: 4891.8 on 5213 degrees of freedom
## AIC: 4937.8
##
## Number of Fisher Scoring iterations: 6
InternetServiceNo summary indicates a problem with
perfect separation or perfect multicollinearity in the data, hence it is
removed from the model
log_reg2 <- train(Churn~. -InternetServiceNo, data = df_train_smote, method = "glm", metric = "ROC",
preProcess = c("center", "scale"), trControl = cross)
summary(log_reg2)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2673 -0.7079 0.1605 0.7274 3.0564
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26212 0.04413 -5.940 2.86e-09 ***
## SeniorCitizen1 -0.03282 0.03738 -0.878 0.37995
## PartnerYes -0.07241 0.04146 -1.747 0.08071 .
## DependentsYes -0.09472 0.04126 -2.296 0.02169 *
## tenure -1.70210 0.15305 -11.121 < 2e-16 ***
## MultipleLinesYes 0.22013 0.04981 4.420 9.89e-06 ***
## OnlineSecurityYes -0.09454 0.04184 -2.259 0.02387 *
## OnlineBackupYes -0.08831 0.04157 -2.125 0.03363 *
## DeviceProtectionYes 0.11397 0.04391 2.595 0.00945 **
## TechSupportYes -0.10192 0.04272 -2.385 0.01706 *
## StreamingTVYes 0.24412 0.05349 4.564 5.01e-06 ***
## StreamingMoviesYes 0.26775 0.05558 4.817 1.45e-06 ***
## PaperlessBillingYes 0.18543 0.03879 4.781 1.75e-06 ***
## MonthlyCharges -1.03751 0.20885 -4.968 6.77e-07 ***
## TotalCharges 1.05965 0.15877 6.674 2.48e-11 ***
## InternetServiceDSL1 0.76095 0.09136 8.329 < 2e-16 ***
## InternetServiceFiber.optic1 1.54227 0.19130 8.062 7.49e-16 ***
## ContractOne.year1 -0.33456 0.04153 -8.056 7.91e-16 ***
## ContractTwo.year1 -0.61033 0.06385 -9.559 < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.03195 0.04639 -0.689 0.49101
## PaymentMethodElectronic.check1 0.13840 0.05054 2.739 0.00617 **
## PaymentMethodMailed.check1 -0.07482 0.05020 -1.490 0.13610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7258.6 on 5235 degrees of freedom
## Residual deviance: 4893.3 on 5214 degrees of freedom
## AIC: 4937.3
##
## Number of Fisher Scoring iterations: 6
Check multicollinearity among independent variables using VIF:
If VIF = 1, variables are not correlated
If VIF between 1 and 5, variables are moderately correlated
If VIF > 5, variables are highly correlated
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif_values <- vif(log_reg2$finalModel)
vif_values
## SeniorCitizen1 PartnerYes
## 1.175002 1.355302
## DependentsYes tenure
## 1.274230 15.579454
## MultipleLinesYes OnlineSecurityYes
## 1.983112 1.374907
## OnlineBackupYes DeviceProtectionYes
## 1.430221 1.563433
## TechSupportYes StreamingTVYes
## 1.448386 2.314756
## StreamingMoviesYes PaperlessBillingYes
## 2.487115 1.172481
## MonthlyCharges TotalCharges
## 33.063765 21.013789
## InternetServiceDSL1 InternetServiceFiber.optic1
## 6.706288 28.849259
## ContractOne.year1 ContractTwo.year1
## 1.364193 1.453915
## PaymentMethodCredit.card..automatic.1 PaymentMethodElectronic.check1
## 1.614033 2.068606
## PaymentMethodMailed.check1
## 2.093463
Remove variable with highest VIF values:
MonthlyCharges
excl_var <- c("InternetServiceNo", "MonthlyCharges")
formula <- as.formula(paste("Churn ~ . -", paste(excl_var, collapse = " - ")))
log_reg3 <- train(formula, data = df_train_smote, method = "glm", metric = "ROC",
preProcess = c("center", "scale"), trControl = cross)
summary(log_reg3)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2811 -0.7207 0.1547 0.7301 2.9900
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.23973 0.04339 -5.525 3.30e-08 ***
## SeniorCitizen1 -0.02246 0.03728 -0.603 0.54675
## PartnerYes -0.06884 0.04131 -1.666 0.09565 .
## DependentsYes -0.10283 0.04110 -2.502 0.01235 *
## tenure -1.51520 0.14605 -10.374 < 2e-16 ***
## MultipleLinesYes 0.09943 0.04346 2.288 0.02214 *
## OnlineSecurityYes -0.17307 0.03885 -4.455 8.38e-06 ***
## OnlineBackupYes -0.15668 0.03926 -3.991 6.58e-05 ***
## DeviceProtectionYes 0.04444 0.04172 1.065 0.28681
## TechSupportYes -0.18032 0.03974 -4.538 5.68e-06 ***
## StreamingTVYes 0.09824 0.04481 2.192 0.02835 *
## StreamingMoviesYes 0.10868 0.04538 2.395 0.01661 *
## PaperlessBillingYes 0.18727 0.03864 4.847 1.26e-06 ***
## TotalCharges 0.86401 0.15215 5.679 1.36e-08 ***
## InternetServiceDSL1 0.43005 0.06120 7.027 2.11e-12 ***
## InternetServiceFiber.optic1 0.66233 0.07287 9.090 < 2e-16 ***
## ContractOne.year1 -0.34008 0.04134 -8.227 < 2e-16 ***
## ContractTwo.year1 -0.61083 0.06355 -9.612 < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.02694 0.04619 -0.583 0.55968
## PaymentMethodElectronic.check1 0.14555 0.05039 2.888 0.00387 **
## PaymentMethodMailed.check1 -0.07572 0.04985 -1.519 0.12875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7258.6 on 5235 degrees of freedom
## Residual deviance: 4918.6 on 5215 degrees of freedom
## AIC: 4960.6
##
## Number of Fisher Scoring iterations: 6
vif_values2 <- vif(log_reg3$finalModel)
vif_values2
## SeniorCitizen1 PartnerYes
## 1.168945 1.352660
## DependentsYes tenure
## 1.271493 14.204876
## MultipleLinesYes OnlineSecurityYes
## 1.516832 1.191720
## OnlineBackupYes DeviceProtectionYes
## 1.279963 1.413164
## TechSupportYes StreamingTVYes
## 1.259979 1.627063
## StreamingMoviesYes PaperlessBillingYes
## 1.662533 1.170588
## TotalCharges InternetServiceDSL1
## 19.203947 3.061111
## InternetServiceFiber.optic1 ContractOne.year1
## 4.223566 1.365325
## ContractTwo.year1 PaymentMethodCredit.card..automatic.1
## 1.449664 1.612181
## PaymentMethodElectronic.check1 PaymentMethodMailed.check1
## 2.063831 2.082495
Remove variable with highest VIF values:
TotalCharges
excl_var <- c("InternetServiceNo", "MonthlyCharges", "TotalCharges")
formula <- as.formula(paste("Churn ~ . -", paste(excl_var, collapse = " - ")))
log_reg4 <- train(formula, data = df_train_smote, method = "glm", metric = "ROC",
preProcess = c("center", "scale"), trControl = cross)
summary(log_reg4)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.37222 -0.72722 0.09665 0.72949 2.86532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14716 0.03836 -3.836 0.000125 ***
## SeniorCitizen1 -0.02101 0.03750 -0.560 0.575355
## PartnerYes -0.06816 0.04127 -1.651 0.098658 .
## DependentsYes -0.10630 0.04094 -2.596 0.009422 **
## tenure -0.78451 0.06032 -13.005 < 2e-16 ***
## MultipleLinesYes 0.14864 0.04267 3.484 0.000494 ***
## OnlineSecurityYes -0.14650 0.03866 -3.789 0.000151 ***
## OnlineBackupYes -0.12326 0.03902 -3.159 0.001586 **
## DeviceProtectionYes 0.08616 0.04142 2.080 0.037507 *
## TechSupportYes -0.14707 0.03946 -3.727 0.000194 ***
## StreamingTVYes 0.14738 0.04429 3.328 0.000876 ***
## StreamingMoviesYes 0.16472 0.04462 3.692 0.000223 ***
## PaperlessBillingYes 0.17935 0.03836 4.675 2.94e-06 ***
## InternetServiceDSL1 0.41175 0.05964 6.903 5.07e-12 ***
## InternetServiceFiber.optic1 0.78964 0.06903 11.438 < 2e-16 ***
## ContractOne.year1 -0.33935 0.04078 -8.322 < 2e-16 ***
## ContractTwo.year1 -0.59655 0.06204 -9.616 < 2e-16 ***
## PaymentMethodCredit.card..automatic.1 -0.02760 0.04608 -0.599 0.549203
## PaymentMethodElectronic.check1 0.15038 0.05038 2.985 0.002837 **
## PaymentMethodMailed.check1 -0.04521 0.04908 -0.921 0.356962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7258.6 on 5235 degrees of freedom
## Residual deviance: 4953.2 on 5216 degrees of freedom
## AIC: 4993.2
##
## Number of Fisher Scoring iterations: 5
vif_values3 <- vif(log_reg4$finalModel)
vif_values3
## SeniorCitizen1 PartnerYes
## 1.169375 1.354581
## DependentsYes tenure
## 1.271709 2.417176
## MultipleLinesYes OnlineSecurityYes
## 1.454261 1.175463
## OnlineBackupYes DeviceProtectionYes
## 1.255848 1.379160
## TechSupportYes StreamingTVYes
## 1.228959 1.575330
## StreamingMoviesYes PaperlessBillingYes
## 1.593262 1.169039
## InternetServiceDSL1 InternetServiceFiber.optic1
## 3.010753 3.846592
## ContractOne.year1 ContractTwo.year1
## 1.357364 1.420734
## PaymentMethodCredit.card..automatic.1 PaymentMethodElectronic.check1
## 1.612775 2.064202
## PaymentMethodMailed.check1
## 2.053385
VIF of all variables are below the rule of thumb of 5 now
# test the model
log_reg_pred <- predict(log_reg4, newdata = df_test)
# evaluation
log_reg_confusion_mat <- confusionMatrix(log_reg_pred, df_test$Churn)
log_reg_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1134 106
## Yes 414 454
##
## Accuracy : 0.7533
## 95% CI : (0.7343, 0.7716)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.02505
##
## Kappa : 0.4622
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7326
## Specificity : 0.8107
## Pos Pred Value : 0.9145
## Neg Pred Value : 0.5230
## Prevalence : 0.7343
## Detection Rate : 0.5380
## Detection Prevalence : 0.5882
## Balanced Accuracy : 0.7716
##
## 'Positive' Class : No
##
# grid search
ctrl <- trainControl(method = "cv", number = 10, search = "grid", classProbs = T)
param_grid <- expand.grid(.mtry = (1:10))
modellist <- list()
# training model
for(ntree in c(seq(200, 300, by = 50))){
set.seed(123)
rf_model <- train(Churn ~ ., data = df_train_smote, method = "rf",
trControl = ctrl, tuneGrid = param_grid, ntree = ntree)
key <- toString(ntree)
modellist[[key]] <- rf_model
}
rf_model_result <- resamples(modellist)
summary(rf_model_result)
##
## Call:
## summary.resamples(object = rf_model_result)
##
## Models: 200, 250, 300
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 200 0.8396947 0.8611641 0.8650707 0.8638283 0.8706423 0.889313 0
## 250 0.8377863 0.8573473 0.8641129 0.8619195 0.8682577 0.889313 0
## 300 0.8396947 0.8573473 0.8650689 0.8632561 0.8695658 0.889313 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 200 0.6793893 0.7223282 0.7301468 0.7276587 0.7412872 0.778626 0
## 250 0.6755725 0.7146947 0.7282311 0.7238410 0.7365175 0.778626 0
## 300 0.6793893 0.7146947 0.7301421 0.7265146 0.7391423 0.778626 0
ntree = 250 has the highest accuracy, hence the
parameter is chosen as the best parameter for random forest model
rf_model_final <- train(Churn ~., data = df_train_smote, method = "rf",
trControl = ctrl, tuneGrid = param_grid, ntree = 250)
# testing model
rf_model_pred <- predict(rf_model_final, newdata = df_test)
# evaluation
rf_model_confusion_mat <- confusionMatrix(rf_model_pred, df_test$Churn)
rf_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1140 124
## Yes 408 436
##
## Accuracy : 0.7476
## 95% CI : (0.7285, 0.7661)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.087
##
## Kappa : 0.4433
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7364
## Specificity : 0.7786
## Pos Pred Value : 0.9019
## Neg Pred Value : 0.5166
## Prevalence : 0.7343
## Detection Rate : 0.5408
## Detection Prevalence : 0.5996
## Balanced Accuracy : 0.7575
##
## 'Positive' Class : No
##
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
param_grid <- expand.grid(C = c(0.1, 1, 5, 10))
svm_model <- train(Churn~., data = df_train_smote, method = "svmLinear",
trControl = cross, tuneGrid = param_grid)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
# testing model
svm_model_pred <- predict(svm_model, newdata = df_test)
# evaluation
svm_model_confusion_mat <- confusionMatrix(svm_model_pred, df_test$Churn)
svm_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1055 100
## Yes 493 460
##
## Accuracy : 0.7187
## 95% CI : (0.699, 0.7378)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.9501
##
## Kappa : 0.4109
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6815
## Specificity : 0.8214
## Pos Pred Value : 0.9134
## Neg Pred Value : 0.4827
## Prevalence : 0.7343
## Detection Rate : 0.5005
## Detection Prevalence : 0.5479
## Balanced Accuracy : 0.7515
##
## 'Positive' Class : No
##
# extract the best model
knn_model <- train(Churn~., data = df_train_smote, method = "knn",
trControl = ctrl)
knn_model
## k-Nearest Neighbors
##
## 5236 samples
## 22 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4713, 4713, 4712, 4712, 4712, 4712, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7265183 0.4530373
## 7 0.7322453 0.4644925
## 9 0.7330080 0.4660186
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# testing model
knn_model_pred <- predict(knn_model, newdata = df_test)
knn_model_confusion_mat <- confusionMatrix(knn_model_pred, df_test$Churn)
knn_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1096 161
## Yes 452 399
##
## Accuracy : 0.7092
## 95% CI : (0.6893, 0.7285)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.9956
##
## Kappa : 0.3607
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7080
## Specificity : 0.7125
## Pos Pred Value : 0.8719
## Neg Pred Value : 0.4689
## Prevalence : 0.7343
## Detection Rate : 0.5199
## Detection Prevalence : 0.5963
## Balanced Accuracy : 0.7103
##
## 'Positive' Class : No
##
library(ada)
## Loading required package: rpart
# training model
set.seed(123)
adaboost_model <- ada(Churn ~ ., data = df_train_smote, iter = 50)
# testing model
adaboost_model_pred <- predict(adaboost_model, newdata = df_test)
adaboost_model_confusion_mat <- confusionMatrix(adaboost_model_pred, factor(df_test$Churn))
adaboost_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1150 120
## Yes 398 440
##
## Accuracy : 0.7543
## 95% CI : (0.7353, 0.7725)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.01973
##
## Kappa : 0.4563
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7429
## Specificity : 0.7857
## Pos Pred Value : 0.9055
## Neg Pred Value : 0.5251
## Prevalence : 0.7343
## Detection Rate : 0.5455
## Detection Prevalence : 0.6025
## Balanced Accuracy : 0.7643
##
## 'Positive' Class : No
##
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
str(df_train_smote)
## 'data.frame': 5236 obs. of 23 variables:
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 2 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 1 1 1 ...
## $ tenure : num 41 25 54 53 4 16 72 14 61 5 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 2 2 ...
## $ DeviceProtection : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 1 1 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 1 1 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 2 2 1 ...
## $ PaperlessBilling : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 2 ...
## $ MonthlyCharges : num 89.2 54.3 90 92.5 56.8 ...
## $ TotalCharges : num 3646 1297 4932 4779 245 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ InternetServiceDSL : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 1 1 ...
## $ InternetServiceFiber.optic : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 2 2 2 ...
## $ InternetServiceNo : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ ContractOne.year : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ ContractTwo.year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
## $ PaymentMethodCredit.card..automatic.: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ PaymentMethodElectronic.check : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 1 2 ...
## $ PaymentMethodMailed.check : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
convert_factors_to_binary <- function(df, cols) {
for (col in cols) {
df[[col]] <- as.numeric(df[[col]]) - 1
}
return(df)
}
# Specify the columns to convert
columns_to_convert <- c("SeniorCitizen", "Partner", "Dependents", "MultipleLines",
"OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling",
"Churn", "InternetServiceDSL", "InternetServiceFiber.optic",
"InternetServiceNo", "ContractOne.year", "ContractTwo.year",
"PaymentMethodCredit.card..automatic.", "PaymentMethodElectronic.check",
"PaymentMethodMailed.check")
# Convert the specified columns in df_train_smote
df_train_smote <- convert_factors_to_binary(df_train_smote, columns_to_convert)
xgb_model <- xgboost(data = as.matrix(df_train_smote[, -c(15)]),
label = df_train_smote$Churn,
nrounds = 50,
objective = "binary:logistic",
eval_metric = "logloss",
max_depth = 3,
eta = 0.1,
nthread = 4,
verbose = 0)
df_test <- convert_factors_to_binary(df_test, columns_to_convert)
dtest <- xgb.DMatrix(as.matrix(df_test[, -c(15)]))
xgb_model_pred <- predict(xgb_model, dtest)
xgb_model_pred <- ifelse(xgb_model_pred >= 0.5, 1, 0)
# testing model
xgb_model_pred <- as.factor(xgb_model_pred)
df_test$Churn <- as.factor(df_test$Churn)
xgb_model_confusion_mat <- confusionMatrix(xgb_model_pred, df_test$Churn)
xgb_model_confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1118 99
## 1 430 461
##
## Accuracy : 0.7491
## 95% CI : (0.73, 0.7674)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.06564
##
## Kappa : 0.4589
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7222
## Specificity : 0.8232
## Pos Pred Value : 0.9187
## Neg Pred Value : 0.5174
## Prevalence : 0.7343
## Detection Rate : 0.5304
## Detection Prevalence : 0.5773
## Balanced Accuracy : 0.7727
##
## 'Positive' Class : 0
##
Models are compared and evaluated with the below metrics:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
df_test$Churn <- as.numeric(df_test$Churn) - 1
nb_model_pred <- as.numeric(nb_model_pred) - 1
roc_nb <- roc(df_test$Churn, nb_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
log_reg_pred <- as.numeric(log_reg_pred) - 1
roc_lr <- roc(df_test$Churn, log_reg_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
rf_model_pred <- as.numeric(rf_model_pred) - 1
roc_rf <- roc(df_test$Churn, rf_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
svm_model_pred <- as.numeric(svm_model_pred) - 1
roc_svm <- roc(df_test$Churn, svm_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
knn_model_pred <- as.numeric(knn_model_pred) - 1
roc_knn <- roc(df_test$Churn, knn_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
adaboost_model_pred <- as.numeric(adaboost_model_pred) - 1
roc_adaboost <- roc(df_test$Churn, adaboost_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_model_pred <- as.numeric(xgb_model_pred) - 1
roc_xgb <- roc(df_test$Churn, xgb_model_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_nb, col = "blue", main = "AUC-ROC Curve", print.auc = F)
plot(roc_lr, col = "red", add = TRUE, print.auc = F)
plot(roc_rf, col = "green", add = TRUE, print.auc = F)
plot(roc_svm, col = "orange", add = TRUE, print.auc = F)
plot(roc_knn, col = "purple", add = TRUE, print.auc = F)
plot(roc_adaboost, col = "brown", add = TRUE, print.auc = F)
plot(roc_xgb, col = "black", add = TRUE, print.auc = F)
# Add legend
legend("topright", legend = c("Naive Bayes",
"Logistic Regression",
"Random Forest",
"SVM",
"KNN",
"AdaBoost",
"XGBoost"),
col = c("blue",
"red",
"green",
"orange",
"purple",
"brown",
"black"), lwd = 2)
result_nb <- c(accuracy = nb_model_confusion_mat$overall["Accuracy"],
precision = nb_model_confusion_mat$byClass["Precision"],
sensitivity = nb_model_confusion_mat$byClass["Sensitivity"],
specificity = nb_model_confusion_mat$byClass["Specificity"],
f1_score = nb_model_confusion_mat$byClass["F1"],
auc_roc = roc_nb$auc)
result_lr <- c(accuracy = log_reg_confusion_mat$overall["Accuracy"],
precision = log_reg_confusion_mat$byClass["Precision"],
sensitivity = log_reg_confusion_mat$byClass["Sensitivity"],
specificity = log_reg_confusion_mat$byClass["Specificity"],
f1_score = log_reg_confusion_mat$byClass["F1"],
auc_roc = roc_lr$auc)
result_rf <- c(accuracy = rf_model_confusion_mat$overall["Accuracy"],
precision = rf_model_confusion_mat$byClass["Precision"],
sensitivity = rf_model_confusion_mat$byClass["Sensitivity"],
specificity = rf_model_confusion_mat$byClass["Specificity"],
f1_score = rf_model_confusion_mat$byClass["F1"],
auc_roc = roc_rf$auc)
result_svm <- c(accuracy = svm_model_confusion_mat$overall["Accuracy"],
precision = svm_model_confusion_mat$byClass["Precision"],
sensitivity = svm_model_confusion_mat$byClass["Sensitivity"],
specificity = svm_model_confusion_mat$byClass["Specificity"],
f1_score = svm_model_confusion_mat$byClass["F1"],
auc_roc = roc_svm$auc)
result_knn <- c(accuracy = knn_model_confusion_mat$overall["Accuracy"],
precision = knn_model_confusion_mat$byClass["Precision"],
sensitivity = knn_model_confusion_mat$byClass["Sensitivity"],
specificity = knn_model_confusion_mat$byClass["Specificity"],
f1_score = knn_model_confusion_mat$byClass["F1"],
auc_roc = roc_knn$auc)
result_adaboost <- c(accuracy = adaboost_model_confusion_mat$overall["Accuracy"],
precision = adaboost_model_confusion_mat$byClass["Precision"],
sensitivity = adaboost_model_confusion_mat$byClass["Sensitivity"],
specificity = adaboost_model_confusion_mat$byClass["Specificity"],
f1_score = adaboost_model_confusion_mat$byClass["F1"],
auc_roc = roc_adaboost$auc)
result_xgb <- c(accuracy = xgb_model_confusion_mat$overall["Accuracy"],
precision = xgb_model_confusion_mat$byClass["Precision"],
sensitivity = xgb_model_confusion_mat$byClass["Sensitivity"],
specificity = xgb_model_confusion_mat$byClass["Specificity"],
f1_score = xgb_model_confusion_mat$byClass["F1"],
auc_roc = roc_xgb$auc)
summary_result <- data.frame(NaiveBayes = result_nb,
LogisticRegression = result_lr,
RandomForest = result_rf,
SVM = result_svm,
KNN = result_knn,
AdaBoost = result_adaboost,
XGBoost = result_xgb)
summary_result <- t(summary_result)
summary_result
## accuracy.Accuracy precision.Precision
## NaiveBayes 0.7504744 0.9029968
## LogisticRegression 0.7533207 0.9145161
## RandomForest 0.7476281 0.9018987
## SVM 0.7186907 0.9134199
## KNN 0.7092030 0.8719173
## AdaBoost 0.7542694 0.9055118
## XGBoost 0.7490512 0.9186524
## sensitivity.Sensitivity specificity.Specificity f1_score.F1
## NaiveBayes 0.7396641 0.7803571 0.8132102
## LogisticRegression 0.7325581 0.8107143 0.8134864
## RandomForest 0.7364341 0.7785714 0.8108108
## SVM 0.6815245 0.8214286 0.7806141
## KNN 0.7080103 0.7125000 0.7814617
## AdaBoost 0.7428941 0.7857143 0.8161817
## XGBoost 0.7222222 0.8232143 0.8086799
## auc_roc
## NaiveBayes 0.7600106
## LogisticRegression 0.7716362
## RandomForest 0.7575028
## SVM 0.7514766
## KNN 0.7102552
## AdaBoost 0.7643042
## XGBoost 0.7727183
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
# Select rows that all have internet service
Telco_customer_churn_with_InternetService <- df[df$InternetService != 'No', ]
# Features extraction
association_rule_features <- Telco_customer_churn_with_InternetService[, c('PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies')]
# Convert the association rule features data frame to a transaction object
association_transactions <- as(association_rule_features, "transactions")
# Fitting model
# Training Apriori on the dataset with support > 0.2 , confidence > 0.7 and max item number = 4
set.seed = 123 # Setting seed
associa_rules <- apriori(association_transactions,
parameter = list(support = 0.2,
confidence = 0.7,
maxlen = 4),
control = list(verbose =TRUE))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1102
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 5512 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4
## Warning in apriori(association_transactions, parameter = list(support = 0.2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!
## done [0.00s].
## writing ... [185 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# The Apriori algorithm generated 185 rules with the parameter of support = 2% and confidence = 80%.
# The maximum length number of rules involved 4 item sets.
# Absolute minimum is 2% from the total of 5512 transactions, which is 1102
library(RColorBrewer)
itemFrequencyPlot(items(associa_rules), topN = 12,
col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")
According to the plot, majority of data composed of those with phone
services
The items frequency of them with streaming TV, Streaming movies, devices
protection, online backup and online security were far lower
# Visualizing the results
inspect(sort(associa_rules, by = 'lift')[1:20])
## lhs rhs support confidence coverage lift count
## [1] {DeviceProtection=Yes,
## StreamingTV=Yes} => {StreamingMovies=Yes} 0.2273222 0.8016635 0.2835631 1.618004 1253
## [2] {PhoneService=Yes,
## DeviceProtection=Yes,
## StreamingMovies=Yes} => {StreamingTV=Yes} 0.2022859 0.7919034 0.2554427 1.614862 1115
## [3] {PhoneService=Yes,
## DeviceProtection=Yes,
## StreamingTV=Yes} => {StreamingMovies=Yes} 0.2022859 0.7998565 0.2529028 1.614357 1115
## [4] {DeviceProtection=Yes,
## StreamingMovies=Yes} => {StreamingTV=Yes} 0.2273222 0.7860728 0.2891872 1.602972 1253
## [5] {DeviceProtection=No,
## TechSupport=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2117199 0.7811245 0.2710450 1.548205 1167
## [6] {DeviceProtection=No,
## TechSupport=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2117199 0.7853297 0.2695936 1.541024 1167
## [7] {DeviceProtection=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2726778 0.7691914 0.3544993 1.524553 1503
## [8] {PhoneService=Yes,
## DeviceProtection=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2340348 0.7628622 0.3067852 1.512009 1290
## [9] {DeviceProtection=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2726778 0.7680123 0.3550435 1.507043 1503
## [10] {PhoneService=Yes,
## DeviceProtection=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2340348 0.7624113 0.3069666 1.496052 1290
## [11] {OnlineBackup=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2407475 0.7604585 0.3165820 1.492220 1327
## [12] {OnlineBackup=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2407475 0.7480271 0.3218433 1.482605 1327
## [13] {PhoneService=Yes,
## OnlineBackup=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2060958 0.7538155 0.2734035 1.479185 1136
## [14] {TechSupport=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2677794 0.7439516 0.3599419 1.474528 1476
## [15] {PhoneService=Yes,
## OnlineBackup=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2060958 0.7434555 0.2772134 1.473544 1136
## [16] {PhoneService=Yes,
## TechSupport=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2347605 0.7423982 0.3162192 1.471449 1294
## [17] {TechSupport=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2677794 0.7496191 0.3572206 1.470951 1476
## [18] {PhoneService=Yes,
## TechSupport=No,
## StreamingMovies=No} => {StreamingTV=No} 0.2347605 0.7428243 0.3160377 1.457618 1294
## [19] {PhoneService=Yes,
## StreamingMovies=Yes} => {StreamingTV=Yes} 0.3154935 0.7147554 0.4414006 1.457541 1739
## [20] {OnlineSecurity=No,
## StreamingTV=No} => {StreamingMovies=No} 0.2443759 0.7320652 0.3338171 1.450969 1347
plot(associa_rules, method = "graph", measure = "confidence", shading = "lift")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
The top 20 rules were inspected.
To determine the existing services that attained the customers,only
rhs=yes is focused
# 3rd objective -> promote streaming Movies and streaming TV sales
library(dplyr)
aveMonthlyCharges <- mean(df$MonthlyCharges)
#find MonthlyCharges for customer who open StreamingTV and StreamingMovies
averageMonthlyCharges<-df %>% group_by(StreamingMovies,StreamingTV) %>% summarise(Average = mean(MonthlyCharges))
## `summarise()` has grouped output by 'StreamingMovies'. You can override using
## the `.groups` argument.
avgMonthlyChargesBoth <- averageMonthlyCharges[4,3]
avgMonthlyChargesMovies <- averageMonthlyCharges[3,3]
avgMonthlyChargesTV <- averageMonthlyCharges[2,3]
avgMonthlyChargesMoviesorTV <- sum(avgMonthlyChargesTV,avgMonthlyChargesMovies)/2
print(paste('Average monthly charges for customers with both Movies and TV =',avgMonthlyChargesBoth))
## [1] "Average monthly charges for customers with both Movies and TV = 93.2438886023724"
print(paste('Average monthly charges for customers with only Movies =',avgMonthlyChargesMovies))
## [1] "Average monthly charges for customers with only Movies = 76.8117424242424"
print(paste('Average monthly charges for customers with only TV =',avgMonthlyChargesTV))
## [1] "Average monthly charges for customers with only TV = 77.418390052356"
print(paste('Average monthly charges for customers with only 1 service =',avgMonthlyChargesMoviesorTV))
## [1] "Average monthly charges for customers with only 1 service = 77.1150662382992"
totalCount <- df %>% count(df$StreamingMovies,StreamingTV)
totalCountBoth <- totalCount[4,3]
totalCountMovies <- totalCount[3,3]
totalCountTV <- totalCount[2,3]
totalCountMoviesorTV <- sum(totalCountMovies,totalCountTV)/2
print(paste('Total count for customers with both Movies and TV =', totalCountBoth))
## [1] "Total count for customers with both Movies and TV = 1939"
print(paste('Total count for customers with only Movies =', totalCountMovies))
## [1] "Total count for customers with only Movies = 792"
print(paste('Total count for customers with only TV =', totalCountTV))
## [1] "Total count for customers with only TV = 764"
print(paste('Total count for customers with only 1 service =', totalCountMoviesorTV))
## [1] "Total count for customers with only 1 service = 778"
diffenceMonthlyCharges <- avgMonthlyChargesBoth - avgMonthlyChargesMoviesorTV
print(paste('Difference in monthly charges of both services and 1 service =', diffenceMonthlyCharges))
## [1] "Difference in monthly charges of both services and 1 service = 16.1288223640731"
Sales / revenue can be increased if it is able to attract customers with only 1 services (792) to open both services as the monthly charges different is less than 20 only
#Set the range of discount percentages
discount_percent <- seq(0, 0.3, length.out = 11)
discount_percent
## [1] 0.00 0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.30
for (i in discount_percent) {
attractive_percent <- i / 0.6
loss <- as.numeric(diffenceMonthlyCharges) * i * totalCountBoth
gain <- as.numeric(diffenceMonthlyCharges) * (1 - i) * (totalCountMovies+totalCountTV) * attractive_percent
marginal_profit <- gain - loss
print(paste("The marginal profit is:", marginal_profit, "when the discount is", i))
}
## [1] "The marginal profit is: 0 when the discount is 0"
## [1] "The marginal profit is: 278.964111609009 when the discount is 0.03"
## [1] "The marginal profit is: 482.638880422524 when the discount is 0.06"
## [1] "The marginal profit is: 611.024306440547 when the discount is 0.09"
## [1] "The marginal profit is: 664.120389663075 when the discount is 0.12"
## [1] "The marginal profit is: 641.92713009011 when the discount is 0.15"
## [1] "The marginal profit is: 544.444527721653 when the discount is 0.18"
## [1] "The marginal profit is: 371.672582557701 when the discount is 0.21"
## [1] "The marginal profit is: 123.611294598258 when the discount is 0.24"
## [1] "The marginal profit is: -199.739336156681 when the discount is 0.27"
## [1] "The marginal profit is: -598.379309707114 when the discount is 0.3"
Result showed that the marginal profit is the highest when discount
percentage is 12%.
Hence, it is suggested that the telco company is likely to have existing
customers with only 1 service to subscribe for second service at an
optimum discount rate of 12%.
In conclusion, multiple classification models were adopted, namely
Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine
(SVM), k-Nearest Neighbors (kNN), Adaptive Boosting (AdaBoost), Xtreme
Gradient Boodting (XGBoost). Despite each model perform relatively well
in predicting the customer churn, Logistic Regression
was chosen as the final model to be implemented due to its simplicity
and generally less computational expensive.
Based on the estimates of Logistic Regression, variables as stated below
were found to be statistically significant at 5% signicance level in
predicting the customer churn likelihood. This includes:
SeniorCitizen: Indicator of whether the customer is 65
years old or oldertenure: Total amount of months that the customer has
been with the companyMultipleLine: Indicator of whether the customer
subscribes to multiple telephone linesOnlineSecurity: Indicator of whether the customer
subscribes to additional online security serviceTechSupport: Indicator if the customer subscribes to an
additional technical support planStreamingTV: Indicator if the customer uses their
Internet service to stream television programing from a third party
providerStreamingMovies: Indicator if the customer uses their
Internet service to stream movies from a third party providerPaperlessBilling: Indicator if the customer has chosen
paperless billingInternetService: Type of Internet service that the
customer has with the companyContract: Customer’s current contract typeAfter that, association analysis was conducted to determine the
relationship between the factors. It is found that people who subscribe
for device protection and streaming TV services are very likely to
subscribe for streaming movies service.
The same goes for people who have subscribed to open phone services,
device protection and streaming movies services, the customer has high
change to subscribe streaming TV service as well.
Last but not least, it is recommended that the Telco company to have discount of 12% for the existing customers with only one service to subscribe for more service. By having this suggested discount rate, the marginal profit is estimated to have increased by approximately $644.12