In this project, I analyzed the Telco Customer Churn dataset, one of the most popular datasets on Kaggle. Provided by IBM, this dataset represents a sample of a telecommunications company offering different services to approximately 7,000 customers in California. It includes 19 predictors related to the services customers have subscribed to with their values as shown,
The purpose of this dataset is to predict whether a customer will churn or not. It Includes 2 quantitative and 17 qualitative predictors.
There are several questions that needs investigations like,
We explored all the answers collectively in the below sections
Below the Dataset View and the summary of the Dataset.
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:7043 | Length:7043 | Min. :0.0000 | Length:7043 | Length:7043 | Min. : 0.00 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Min. : 18.25 | Min. : 18.8 | Length:7043 | |
| Class :character | Class :character | 1st Qu.:0.0000 | Class :character | Class :character | 1st Qu.: 9.00 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 35.50 | 1st Qu.: 401.4 | Class :character | |
| Mode :character | Mode :character | Median :0.0000 | Mode :character | Mode :character | Median :29.00 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 70.35 | Median :1397.5 | Mode :character | |
| NA | NA | Mean :0.1621 | NA | NA | Mean :32.37 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 64.76 | Mean :2283.3 | NA | |
| NA | NA | 3rd Qu.:0.0000 | NA | NA | 3rd Qu.:55.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 89.85 | 3rd Qu.:3794.7 | NA | |
| NA | NA | Max. :1.0000 | NA | NA | Max. :72.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :118.75 | Max. :8684.8 | NA | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :11 | NA |
Below shows the visualization of the correlation matrics and some relationships between predictors. We can see that not much of predictors are correlated.
Telco.Churn.Dataset.numeric$tenure <- as.numeric(Telco.Churn.Dataset.numeric$tenure)
options(repr.plot.width = 50, repr.plot.height = 30)
corr <-cor(Telco.Churn.Dataset.numeric[,-c(1,21)])
ggcorrplot(corr, hc.order = TRUE, lab = FALSE, colors = c("#6D9EC1", "white", "#cb1f1f"))
Next, we create a x-y graph to compare between Tenure and Total charges.
ggplot(
data = Telco.Churn.Dataset.numeric,
mapping = aes(x = tenure, y = TotalCharges, color = Churn)) +
geom_point(size = 1) +
labs(title = "Tenure Vs Total Charges",x = "Tenure",y = "Total Charges")+
scale_color_manual(values = c("0" = "#0072B2", "1" = "#D55E00")) +
theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"), panel.background = element_rect(fill = "#f5eee1")) +
scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 10)) +
geom_smooth(color = "#f5af2e")
From above plot we see that the Churned customers decrease with their increase in the tenure and hence as described by the correlation matrix.
ggplot(
data = Telco.Churn.Dataset.numeric, aes(x = tenure, fill = as.factor(tenure))) +
geom_bar(color = "black") +
labs(title = "Distribution of Tenure", x = "Tenure", y = "Count") +
theme_minimal() +
scale_fill_manual(values = colrs)+
theme(legend.position = "none") +
scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 5)) +
theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f6f2f7"))
Stratifying the data: It involves dividing a dataset into distinct subgroups (also called strata), based on a specific characteristic before performing analysis or training. This technique ensures that each subgroup is adequately represented, sometimes reducing bias and improving statistical reliability. Stratification is particularly useful in machine learning ensuring that the model has a fair chance to get trained on a particular characteristic or even output, ensuring training and test datasets maintain the same distribution of key variables for better model. In my case, I am stratifying the data based upon the output(Churn).
In my stratified sampling, I am using 75% training and 25% testing data while maintaing the ratio as shown below.
\[ Ratio = \frac{\text{Number of "1/Yes" Churn examples in Training Set}}{\text{Number of "0/No" Churn examples in Training Set}} = \frac{\text{Number of "1/Yes" Churn Examples in Testing Set}}{\text{Number of "0/No" Churn Examples in Testing Set}} \]
Below is the Code for the same where the function ‘stratify.data.train.indicies’ stratify the data based on the training split ratio.
customers.left <- vector(mode = "numeric", length = 0)
customers.detained <- vector(mode = "numeric", length = 0)
for (index in seq_along(Churn)){
if (Churn[index] == 'Yes')
customers.left <- c(customers.left, index)
else
customers.detained <- c(customers.detained, index)
}
cat("Total number of customers Left:", length(customers.left)," and ", "Total number of customers detained:", length(customers.detained))
## Total number of customers Left: 1869 and Total number of customers detained: 5163
stratify.data.train.indicies <- function(data.train.ratio, ...){
number.of.left.train.samples <- data.train.ratio * length(customers.left)
number.of.detained.train.samples <- data.train.ratio * length(customers.detained)
total.train.samples.indicies <- vector(mode = "numeric", length = 0)
total.train.samples.indicies <- c(total.train.samples.indicies, sample(customers.left, number.of.left.train.samples), sample(customers.detained, number.of.detained.train.samples))
return (total.train.samples.indicies)
}
Since, most of the variables involved are categorical, I am representing it using some most commonly used categorical encoders for a financial dataset namely,
For Bayesian Target encoding, I am creating the encoder on my own since, I didn’t found any suitable R dependency to do so. Mathematically, the formula for Bayesian Target encoder looks like,
\[ E(x \mid c) = \frac{N_c \cdot \mu_c + m \cdot \mu}{N_c + m} \]
where, \(E(x \mid c)\) = Encoded value for category \(c\), \(N_c\) = Number of occurrences of category c in the dataset, \(\mu_c\) = Mean target value for category c, \(\mu\) = Global mean of the target variable across all categories, \(m\) = Smoothing factor (a hyperparameter). In my I am taking the smoothing factor to be 2.
Below is the code of Bayesian Target encoder where the function “bayesian_encoding()” takes a single category at a time and encode it using the target which is Churn.
bayesian_encoding <- function(data, category, target, column.index){ #bayesian.mean.target.encoding
#Weighted Mean = (n * Option Mean + m * Overall mean)/n+m
#Option Mean = No of Instances of (Yes/1/postive) of a particular category / Total Instances of a particular category
#Overall mean = Total Instances of a particular category / Total Instances
#n = Weight of Option Mean (usually number of rows of the category)
#m = Weight of Overall Mean (user defined)
total.rows <- length(category)
num.categories <- length(unique(category))
names.of.categories <- unique(category)
categories.count <- rep(0, num.categories)
cotegories.count.yes <- rep(0, num.categories)
m <- 2
total.count.yes <- sum(target == 'Yes')
for (index in seq_along(names.of.categories)){
x <- sum(category == names.of.categories[index])
categories.count[index] <- x
cotegories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')
#n = Weight of Option Mean (number of rows of the category)
n <- categories.count[index]
#m = Weight of Overall Mean (smoothing factor)
m <-2
#Option Mean = No of Instances of (Yes/1/postive) of a particular category / Total Instances of a particular category
option.mean <- cotegories.count.yes[index]/categories.count[index]
#Overall mean = Total Instances of a (Yes/1/postive)/ Total Instances
overall.mean <- total.count.yes/total.rows
#Weighted Mean
weighted.mean <- (n * option.mean + m * overall.mean)/n+m
data[, column.index:column.index] <- replace(data[, column.index:column.index], category == names.of.categories[index], weighted.mean)
}
return (data)
}
Similarly, for the Weight of Evidence which is a statistical technique used primarily in credit scoring and binary classification to measure the predictive power of an independent variable in relation to a binary target. Mathematically, it is defined as the logarithm of the ratio between the percentage of goods and bads in each bin as stated,
\[ a = \frac{\text{(Postives|Yes|1) of the Group}}{\text{Total Postivies}} \]
\[ b = \frac{\text{(Negative|No|O) of the Group}}{\text{Total Negatives}} \]
\[ \text{WoE} = \ln \left( \frac{a}{b} \right) \]
And the code for the same is shown below where the function “woe_encoding” takes a single category at a time and encode it using the target which is Churn.
woe_encoding <- function(data, category, target, column.index){
#WOE
# a = (Postives|Yes|1) of the Group)/(Total Postivies)
# b = (Negative|No|O) of the Group)/(Total Negatives)
# WOE = log(a/b)
total.rows <- length(category)
num.categories <- length(unique(category))
names.of.categories <- unique(category)
categories.count <- rep(0, num.categories)
categories.count.yes <- rep(0, num.categories)
categories.count.no <- rep(0, num.categories)
total.count.yes <- sum(target == 'Yes')
total.count.no <- sum(target == 'No')
for (index in seq_along(names.of.categories)){
categories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')
categories.count.no[index] <- sum(target[category == names.of.categories[index]] == 'No')
a = categories.count.yes[index] / total.count.yes
b = categories.count.no[index] / total.count.no
woe <- log(a/b)
#cat('WOE:', woe,'\n')
data[, column.index:column.index] <- replace(data[, column.index:column.index], category == names.of.categories[index], woe)
}
return (data)
}
Also, for the Leave One Out encoding which helps reducing data leakage. Mathematically it is described as,
\[ \text{LOO}_i = \frac{\sum_{j \ne i} y_j \cdot \mathbb{1}(x_j = c)}{\sum_{j \ne i} \mathbb{1}(x_j = c)} \]
where, \(\text{LOO}_i\) is the encoded value for row \(i\), \(y_j\) is the target value for row \(j\), \(x_j\) is the category of row \(j\), and \(c\) is the category being encoded, \(\mathbb{1}(x_j = c)\) is an indicator function that equals 1 if \(x_j = c\), else 0.
And the code for the same is shown below where the function “loo_encoding” takes a single category at a time and encode them into leave one out encoding.
loo_encoding <- function(data, category_col, target_col) {
data_encoded <- data
unique_categories <- unique(data[[category_col]])
encoded_values <- numeric(nrow(data))
# Loop over each row and apply LOO Encoding
for (i in 1:nrow(data)) {
subset_data <- data[-i, ]
# Calculate mean of target variable excluding current row
mean_value <- mean(subset_data[subset_data[[category_col]] == data[i, category_col], target_col], na.rm = TRUE)
encoded_values[i] <- mean_value
}
data_encoded[[paste0("loo_", category_col)]] <- encoded_values
# Drop the original categorical column
data_encoded[[category_col]] <- NULL
return(data_encoded)
}
With simple Logistic Regression we observed significant changes for z-stat and p-values for ‘Payment Method’, ‘Phone Service’, Tech Support when trained the model with the all the predictors in all cases.
| Predictor | Encoding | p - value |
|---|---|---|
| Payment Method | N/A | 0.000811 (Lowest among all) |
| Payment Method | Bayesian | 1.78e-06 |
| Payment Method | WOE | 9.87e-07 |
| Payment Method | LOO | 7.47e-06 |
| Phone Service | N/A | 0.43 (Lowest among two) |
| Phone Service | Bayesian | 2.21e-06 |
| Phone Service | WOE | 1.61e-06 |
| Phone Service | LOO | 1.64e-11 |
| Tech Support | N/A | 0.798317 |
| Tech Support | WOE | 5.77e-05 |
| Tech Support | WOE | 6.02e-05 |
| Tech Support | LOO | 2.04e-05 |
Also, some other predictors also showed more significance with the different encodings.
## True Value: Not Churned Churned
## Predicted (by Simple):
## Not Churned 1154 202
## Churned 137 266
## Accuracy for Simple Logistic Regression: 80.728 %
## Akaike Information Criterion(AIC) for the simple logistic regression is 4423.434
## Bayesian Information Criterion(BIC) for the simple logistic regression is 4581.123
## True Value: Not Churned Churned
## Predicted (by Bayesian):
## Not Churned 1148 202
## Churned 143 266
## Accuracy for Bayesian Encoding: 80.387 %
## Akaike Information Criterion(AIC) for the Bayesian Encoded logistic regression is 4423.48
## Bayesian Information Criterion(BIC) for the Bayesian Encoded logistic regression is 4554.887
## True Value: Not Churned Churned
## Predicted (by WOE):
## Not Churned 1150 203
## Churned 141 265
## Accuracy for WOE encoding: 80.44343 %
## Akaike Information Criterion(AIC) for the WOE Encoded logistic regression is 4418.805
## Bayesian Information Criterion(BIC) for the WOE Encoded logistic regression is 4550.212
## True Value: Not Churned Churned
## Predicted (by LOO):
## Not Churned 1154 198
## Churned 137 270
## Accuracy for Leave One Out: 80.955 %
## Akaike Information Criterion(AIC) for the LOO Encoded logistic regression is 4418.805
## Bayesian Information Criterion(BIC) for the LOO Encoded logistic regression is 4550.212
pred <- prediction(logistic.regression.train.fit.prob, Y.test)
# Calculate performance (True Positive Rate and False Positive Rate)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "darkgreen", lwd = 2, main = "ROC Curve")
abline(a = 0, b = 1, col = "red", lty = 2)
auc_perf <- performance(pred, measure = "auc")
auc_value <- auc_perf@y.values[[1]]
grid(col = "lightgray", lty = "dotted")
legend(
"bottomright",
legend = paste("AUC =", round(auc_value,3)),
col = "#1f77b4",
lwd = 2.5,
bty = "n",
cex = 1.1
)
cat("AUC Score for all:\033[34m", round(auc_value, 3), "\033[0m")
## AUC Score for all:[34m 0.847 [0m
## Best Accuracy in Simple KNN 79.193 % with a k-value of 16
## Best Accuracy of KNN in Baysain Target Encoder 79.022 % with a k-value of 17
## Best Accuracy of KNN in Weight of Evidence 79.022 % with a k-value of 17
## Best Accuracy of KNN in Leave One Out Encoding 80.728 % with a k-value of 15
knn.data <- data.frame(
k = k.values,
KNN.Simple.Accuracies = k.accuracies,
KNN.Bayesian.accuracies = k.bayesian.accuracies,
KNN.WOE.accuracies = k.woe.accuracies,
KNN.LOO.accuracies = k.loo.accuracies
)
knn.data.all <- knn.data %>%
pivot_longer(
cols = -k,
names_to = "Method",
values_to = "Accuracy"
)
ggplot(knn.data.all, aes(x = k, y = Accuracy, color = Method)) +
geom_line(size = 0.5) +
geom_point(size = 1) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "K-Values vs. Accuracy for Different Encodings",
x = "K-Value",
y = "Accuracy"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.background = element_rect(fill = "#f6f2f7"),panel.grid = element_blank()) +
scale_color_manual(values = c("KNN.Simple.Accuracies" = "#38d6e3", "KNN.Bayesian.accuracies" = "#b0d83e", "KNN.WOE.accuracies" = "#E2D947", "KNN.LOO.accuracies" ="#D55E00" )) +
theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 886 86
## Yes 405 382
##
## Accuracy : 0.7209
## 95% CI : (0.6993, 0.7417)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 0.8972
##
## Kappa : 0.4128
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8162
## Specificity : 0.6863
## Pos Pred Value : 0.4854
## Neg Pred Value : 0.9115
## Prevalence : 0.2661
## Detection Rate : 0.2172
## Detection Prevalence : 0.4474
## Balanced Accuracy : 0.7513
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 852 80
## 1 439 388
##
## Accuracy : 0.7049
## 95% CI : (0.683, 0.7262)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 0.997
##
## Kappa : 0.3929
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8291
## Specificity : 0.6600
## Pos Pred Value : 0.4692
## Neg Pred Value : 0.9142
## Prevalence : 0.2661
## Detection Rate : 0.2206
## Detection Prevalence : 0.4702
## Balanced Accuracy : 0.7445
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 810 70
## 1 481 398
##
## Accuracy : 0.6868
## 95% CI : (0.6645, 0.7084)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3733
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8504
## Specificity : 0.6274
## Pos Pred Value : 0.4528
## Neg Pred Value : 0.9205
## Prevalence : 0.2661
## Detection Rate : 0.2263
## Detection Prevalence : 0.4997
## Balanced Accuracy : 0.7389
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 849 79
## 1 442 389
##
## Accuracy : 0.7038
## 95% CI : (0.6819, 0.7251)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 0.9979
##
## Kappa : 0.3919
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8312
## Specificity : 0.6576
## Pos Pred Value : 0.4681
## Neg Pred Value : 0.9149
## Prevalence : 0.2661
## Detection Rate : 0.2211
## Detection Prevalence : 0.4724
## Balanced Accuracy : 0.7444
##
## 'Positive' Class : 1
##
plot(
ROCurve4,
response = Y.test.naive.bayes.loo, predictor = as.numeric(naive.bayes.loo.class),
plot = TRUE, legacy.axes = TRUE,
xlim = c(1, 0), ylim = c(0, 1),
col = "#008D10", lwd = 1,
main = "ROC Curve for Naive Bayes",
axes = FALSE
)
axis(1, at = seq(0, 1, by = 0.2), labels = seq(1, 0, by = -0.2)) # Specificity axis
axis(2, at = seq(0, 1, by = 0.2), labels = seq(0, 1, by = 0.2), pos = 1) # Sensitivity axis
lines(ROCurve1, col = "#FE0D2D", lwd = 1)
lines(ROCurve2, col = "#46CCEA", lwd = 1)
lines(ROCurve3, col = "#C1D634", lwd = 1)
par(bg = "#e2f5ec")
legend("right", legend = c("LOO", "Simple", "Bayesian", "WOE"), col = c("#008D10", "#FE0D2D", "#46CCEA","#C1D634"), lwd = 2)
## The AUC Score for Simpler Version of Naive Bayes obeserved was 0.751 ,for Bayesian encoding it was 0.745 ,for WOE encoding it was 0.739 and for the LOO encoding it was 0.744
-For LDA with p=1, we are taking the tenure since it was most statistically significant in all previous models.
## Y.test.lda
## No Yes
## No 1291 468
## Yes 0 0
## The accuracy for the LDA with p = 1 fit is 72.086 %
ggplot(data.frame(tenure), aes(x = tenure)) +
geom_density(fill = "#f2bfb9", alpha = 0.4) +
xlim(-40, max(tenure)) +
ylim(0,0.05) +
stat_function(fun = dnorm, args = list(mean = 5, sd = 16), color = "red", linetype = "dashed", size = 1.2) +
labs(title = "Probability Distribution for Tenure",x = "Tenure",y = "Probability Density") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())
For LDA with p=2, we are taking the tenure and Contract since they turned out to be most statistically significant.
ggplot(data.frame(Telco.Churn.Dataset.numeric$Contract), aes(x = Telco.Churn.Dataset.numeric$Contract)) +
geom_density(fill = "#b4cef5", alpha = 0.4) +
xlim(-10,10) +
ylim(0,2) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 0.25), color = "blue", linetype = "dashed", size = 1.2) +
labs(title = "Probability Distribution for Contract",x = "Contract",y = "Probability Density") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())
So we state our Assumption: By combining these we may expect to have an joint probability density almost same as of a multivariate normal distribution.
## Y.test.lda
## No Yes
## No 1209 348
## Yes 82 120
## The accuracy for the LDA with p = 2 (using tenure and Contract) fit is 72.086 %
Result: The accuracy doesn’t showed a increment in accuracy, hence their is some evidence that our assumption is wrong about two predictors having almost normal distribution is expected to give a multivariate normal distribution but not totally correct. Moreover, the two predictors are highly correlated (by 0.68). Hence a absolute bell-shaped curve was not possible.
Confsion Matrix and Error rate with number of predictors being 12 using no encoder with 500 number of trees in the forest.
## True Value: Not Churned Churned
## Predicted (by Random Forest):
## Not Churned 1140 238
## Churned 151 230
## Error Rate for Random Forest is: 22.11 % and maximumn Out-of-Bag Error is 0.551
** We note the error the rate is approximately 22% with 12 predictors. Now we will analyse the testing accuracies with the encoders taking different number of predictors.**
ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
#ylim(75,100) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "Number of Predictors vs. RF Accuracy for Different Encodings",
x = "Predictors",
y = "Accuracy %"
) +
theme_minimal() +
scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947", "RF.LOO.accuracies" ="#D55E00"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))+
theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))
Plot without LOO encoder
ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
ylim(76,81) +
#geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "Number of Predictors vs. RF Accuracy for Different Encodings",
x = "Predictors",
y = "Accuracy %"
) +
theme_minimal() +
scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
theme(legend.background = element_rect(fill = "#f7efed", color = NA))
sum(
Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.2364 &
Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
## [1] 350
sum(
Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.4163 &
Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
## [1] 118
sum(Random.Forest.X.loo.encoding.Y.test$Churn == "1")
## [1] 468
This questions the credibility of the dataset because usually it doesn’t turns out to be the case where one encoder gets 100% accuracy, although it was downloaded from the Kaggle website and this particular dataset by contributed by a famous tech firm IBM. Thus, for the further analysis we will omit the use of this LOO encodings.
Next, we analysed the accuracy with respect the change in its number of trees from 1 to 1000 in a Simple Random Forest taking 4 predictors and it turned out to be the most accurate/stable number of predictor from our above analysis. Since we do not have an GPU, we would take jump of 100 steps for traversing from 1 to 7000.
rf.number.of.trees <- seq(1, 7000, by = 100)
rf.trees.accuracies <- rep(0, length(rf.number.of.trees))
for (index in seq_along(rf.number.of.trees)){
rf.model.fit <- randomForest(Churn ~ ., data = Random.Forest.X.Y.train, mtry = 4, importance = TRUE, ntree = index)
predictions <- predict(rf.model.fit , newdata = Random.Forest.X.Y.test)
confusion_matrix <- table(predictions, Random.Forest.X.Y.test$Churn)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
rf.trees.accuracies[index] <- accuracy
}
rf.trees.accuracies <- rf.trees.accuracies * 100
rf.trees.data <- data.frame(mtry = rf.number.of.trees, accuracy = rf.trees.accuracies)
# Create the plot
ggplot(rf.trees.data, aes(x = mtry, y = accuracy)) +
#ylim(68,82) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "Random Forest Accuracy vs Number of Trees", x = "Number of Trees", y = "Test Accuracy %") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3") +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
- We can see from the above graph that the accuracy increases with the
increase in number of trees with some fluctuations and after 5000, it
almost becomes steady.
X.dataset.Cross.Validation <- Telco.Churn.Dataset[,-c(1)]
X.dataset.Cross.Validation$Churn <- as.factor(X.dataset.Cross.Validation$Churn)
train_control <- trainControl(method = "cv", number = 10)
rf_cv <- train(Churn ~ tenure + PaymentMethod + PaperlessBilling + TotalCharges + PhoneService + Contract, data = X.dataset.Cross.Validation, method = "rf", trControl = train_control)
print(rf_cv)
## Random Forest
##
## 7032 samples
## 6 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6328, 6329, 6329, 6329, 6328, 6330, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7780170 0.3470699
## 5 0.7792956 0.4046720
## 9 0.7555496 0.3454467
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
ggplot(rf_cv$results, aes(x = mtry, y = Accuracy)) +
geom_line(color = "#1f77b4", size = 1) +
geom_point(size = 2, color = "#ff7f0e") +
geom_errorbar(aes(ymin = Accuracy - 1 * AccuracySD, ymax = Accuracy + 1 * AccuracySD),
width = 0.2, color = "#1f77b4", alpha = 0.6) +
labs(
title = "Cross-Validated Accuracy vs. mtry (Random Forest)",
x = "Number of Predictors Sampled (mtry)",
y = "Accuracy"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
- Hence, we 5 metrics we can choose the Random Forest Model
But, before that we ran training of a simple XGBoost model to get the estimate of the accuracy. Here we considered maximum depth of tree to be 6 and number of boosting rounds to be 100.
## True Values Not Churned Churned
## Predicted (by XGBoost)
## Not Churned 1149 231
## Churned 142 237
## Accuracy: 0.7879477
xgboost.accuracies <- xgboost.accuracies * 100
xgboost.bayesian.accuracies <- xgboost.bayesian.accuracies * 100
xgboost.woe.accuracies <- xgboost.woe.accuracies * 100
ggplot(XGBoost.data.all, aes(x = Depth, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
#coord_cartesian(ylim = c(65, 82)) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(title = "XGBoost Accuracy vs. Depth of Tree", x = "Depth", y = "Test Accuracy %") +
theme_minimal() +
scale_color_manual(values = c("XGBoost.Simple.Accuracies" = "#38d6e3", "XGBoost.Bayesian.accuracies" = "#b0d83e", "XGBoost.WOE.accuracies" = "#E2D947"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
theme(legend.background = element_rect(fill = "#ecf8f6", color = NA))
cv <- xgb.cv(
data = X.train.xg.matrix,
label = Y.train.xg.matrix,
max_depth = 55,
eta = 0.1,
nrounds = 500,
nfold = 10,
objective = "binary:logistic",
eval_metric = "logloss",
early_stopping_rounds = 10,
verbose = 0
)
best.round <- cv$best_iteration
cat("Best round turns out to be",best.round,"th")
## Best round turns out to be 24 th
Before that we take do quick analysis by taking the interaction depth as 4 and number of trees as 1500 with different encoders.
| var | rel.inf | |
|---|---|---|
| TotalCharges | TotalCharges | 29.1053032 |
| MonthlyCharges | MonthlyCharges | 24.4155948 |
| tenure | tenure | 12.2500672 |
| Contract | Contract | 11.5236500 |
| OnlineSecurity | OnlineSecurity | 3.4202746 |
| PaymentMethod | PaymentMethod | 2.9221656 |
| TechSupport | TechSupport | 2.7904124 |
| gender | gender | 1.7202024 |
| PaperlessBilling | PaperlessBilling | 1.6276123 |
| OnlineBackup | OnlineBackup | 1.5575649 |
| SeniorCitizen | SeniorCitizen | 1.4474142 |
| DeviceProtection | DeviceProtection | 1.2657916 |
| StreamingTV | StreamingTV | 1.0762316 |
| MultipleLines | MultipleLines | 1.0420367 |
| Dependents | Dependents | 0.9315417 |
| StreamingMovies | StreamingMovies | 0.9200107 |
| InternetService | InternetService | 0.7660315 |
| Partner | Partner | 0.6786471 |
| PhoneService | PhoneService | 0.5394474 |
colnames(importance_df) <- c("Predictor", "RelativeInfluence")
ggplot(importance_df, aes(x = reorder(Predictor, RelativeInfluence), y = RelativeInfluence)) +
geom_bar(stat = "identity", fill = "#e8532e") +
coord_flip() +
labs(title = "Predictor Importance in Simple GBM (Boosting)",
x = "Predictor",
y = "Relative Influence %") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))
- From the above plot, we observe that Total Charges
and Monthly Charges which are numerical predictor has
much higher influence than tenure. - Similarly, Payment
Method, Online Security, Tech
Support are highly influential with Contract -
So, can select take 6 main predictors as, Total
Charges, Monthly Charges,
Contract, tenure, Payment
Method and Online Security.
Also, the Partial Dependence of Churn on Total Charges, Monthly Charges and tenure are shown below.
pdp.totalCharges <- plot(boost.model.fit, i.var = "TotalCharges", return.grid = TRUE)
ggplot(pdp.totalCharges, aes(x = TotalCharges, y = y)) +
geom_line(color = "#38d6e3", size = 1.2) +
geom_point(color = "#0073c2ff", size = 1.5) +
labs(title = "Partial Dependence of Churn on Total Charges", x = "Total Charges", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
) +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
pdp.tenure<- plot(boost.model.fit , i = "tenure", return.grid = TRUE)
ggplot(pdp.tenure, aes(x = tenure, y = y)) +
geom_line(color = "#9ff09b", size = 1.2) +
geom_point(color = "#2aca22", size = 1.5) +
labs(title = "Partial Dependence of Churn on Tenure", x = "Tenure", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
pdp.MonthlyCharges <- plot(boost.model.fit, i.var = "MonthlyCharges", return.grid = TRUE)
ggplot(pdp.MonthlyCharges, aes(x = MonthlyCharges, y = y)) +
geom_line(color = "#eacd92", size = 1.2) +
geom_point(color = "#d59f0e", size = 1.5) +
labs(title = "Partial Dependence of Churn on Monthly Charges", x = "Monthly Charges", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
) +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f9f9ea"))
# Create the plot
ggplot(boosting.data, aes(x = num.trees, y = boost.accuracies)) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "Boosting Accuracy vs. Number of Trees", x = "Trees", y = "Accuracy") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3")+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
-Firstly we fit the model using weak learner as 50 with different encoders. This would give us Confusion Matrix and Accuracies as,
## True Values Not Churned Churned
## Predicted (by ADABoost)
## Not Churned 1145 220
## Churned 146 248
## [1] "Accuracy: 79.19 %"
## True Values Not Churned Churned
## Predicted (by ADABoost Bayesian)
## Not Churned 1142 217
## Churned 149 251
## [1] "Accuracy: 79.19 %"
## True Values Not Churned Churned
## Predicted (by ADABoost WOE)
## Not Churned 1140 218
## Churned 151 250
## [1] "Accuracy: 79.02 %"
Now we will vary the weak classifiers from 1 to 100. Extending beyond this value we would be more prune to overfitting.
m.final.values <- seq(1, 100, by = 1)
m.accuracies <- rep(0, 100)
for (index in seq_along(m.final.values)){
adaboost.model.fit <- boosting(Churn ~ ., data = X.Y.train.AdaBoost, boos = TRUE, mfinal = index)
adaboost.predictions <- predict(adaboost.model.fit, X.Y.test.AdaBoost)
confusion.matrix <- table(Predicted = adaboost.predictions$class, Actual = X.Y.test.AdaBoost$Churn)
accuracy <- sum(diag(confusion.matrix)) / sum(confusion.matrix)
m.accuracies[index] <- accuracy
}
m.accuracies <- m.accuracies * 100
AdaBoost.data <- data.frame(Number.of.Weak.Classifiers = m.final.values, ADAboost.Simple.Accuracies = m.accuracies)
# Create the plot
ggplot(AdaBoost.data, aes(x = Number.of.Weak.Classifiers, y = m.accuracies)) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "ADABoost Accuracy vs Weak Classifier", x = "Weak Classifier", y = "Accuracy") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3")+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
Since, we now have some meaningful predictors which are predictors as, Total Charges, Monthly Charges, Contract, tenure, Payment Method and Online Security.
We apply the Logistic and XGBoost model again with their respective inferred values.
X.train <- Telco.Churn.Dataset[train.X.vector.indicies.unmasked,-c(1,21)]
X.test <- Telco.Churn.Dataset[!train.X.vector.indicies.unmasked,-c(1,21)]
Y.test <- Telco.Churn.Dataset$Churn[!train.X.vector.indicies.unmasked]
logistic.regression.train.fit <- glm(Churn ~ TotalCharges + MonthlyCharges + Contract + tenure + PaymentMethod + OnlineSecurity, data = New.Telco.Churn.Dataset[,-c(1)], family = binomial, subset = train.X.vector.indicies.unmasked)
#summary(logistic.regression.train.fit)
logistic.regression.train.fit.prob <- predict(logistic.regression.train.fit, X.test, type = "response")
logistic.regression.train.fit.pred <- rep("No", length(logistic.regression.train.fit.prob))
logistic.regression.train.fit.pred[logistic.regression.train.fit.prob > 0.5] = "Yes"
tab <- table(logistic.regression.train.fit.pred, Y.test)
dimnames(tab) <- list(
"Predicted (by Simple):" = c("Not Churned", "Churned"),
"True Value:" = c("Not Churned", "Churned")
)
print(ftable(tab))
## True Value: Not Churned Churned
## Predicted (by Simple):
## Not Churned 1126 220
## Churned 165 248
accuracy <- mean(logistic.regression.train.fit.pred == Y.test) * 100
cat('Accuracy for Simple Logistic Regression:', round (accuracy, 3),'%','\n\n')
## Accuracy for Simple Logistic Regression: 78.113 %
cat("Akaike Information Criterion(AIC) for the simple logistic regression is",AIC(logistic.regression.train.fit),'\n') # Akaike Information Criterion
## Akaike Information Criterion(AIC) for the simple logistic regression is 4501.192
cat("Bayesian Information Criterion(BIC) for the simple logistic regression is",BIC(logistic.regression.train.fit)) # Bayesian Information Criterion
## Bayesian Information Criterion(BIC) for the simple logistic regression is 4573.466
xgboost.model <- xgboost(
data = X.train.xg.matrix,
label = Y.train.xg.matrix,
max_depth = 6,
eta = 0.1,
nrounds = 100,
objective = "binary:logistic",
eval_metric = "logloss",
verbose = 0,
)
y.pred <- predict(xgboost.model, X.test.xg.matrix)
y.pred.class <- ifelse(y.pred > 0.5, 1, 0) # Convert probabilities to 0/1
conf_matrix <- table(Predicted = y.pred.class, Actual = Y.test.xg.matrix)
dimnames(conf_matrix) <- list(
"Predicted (by XGBoost)" = c("Not Churned", "Churned"),
"True Values" = c("Not Churned", "Churned")
)
print(ftable(conf_matrix))
## True Values Not Churned Churned
## Predicted (by XGBoost)
## Not Churned 1142 229
## Churned 149 239
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat('Accuracy for XGBoost:', round (accuracy * 100, 3),'%','\n\n')
## Accuracy for XGBoost: 78.511 %
In this project, we wanted to predict whether a customer churns or not. From the start, we had 19 predictors and most of them were not looking statistically significant. So, we applied different types of encoding techniques like Bayesian Target Encoding, Weight of Evidence, and Leave One out. We did our analysis by choosing different techniques using different encoding to infer the suitable predictors. While predictors like Payment Method, Phone Service, and Tech Support showed low p-values for the encoding. Hence, a total of 12 to 13 predictors seems statistically significant in total with different encoding. Further, we ensure at least tenure is highly influential to the output because of its low p-value in most of the cases. The same thing we showed in the LDA with p=1. Furthermore, we performed and analyzed tree-based methods where in Boosting we saw that Total charges, Monthly Charges, and Contact Period are having high relative influence as compared to others. Similarly, the 10-fold cross-validation in Random Forest also suggested that near about 5-6 predictors are sufficient for a significantly meaningful model. Hence, our observations picked up six predictors namely, Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security. Also, we verified that with these predictors we obtained 78% accuracy with Simple Logistic Regression and XGBoost and the highest accuracy detected was (80.5%) by using various techniques which included all the predictors. Also, the AIC and BIC scores were higher for a model with six predictors and hence, it gave us strong evidence of our assumption of having six to seven influential predictors.
Also, in our analysis, we didn’t include Backward or Forward propagation methods that would help us identify more about these predictors. Similarly, we didn’t include any removal of outliers or leverages that could help us gain accuracy and hence more insight into the model. In our case, the data is limited, so we can consider including information/data across other parts of the US by using Cluster or Systematic Sampling. Hence in general, more data would give us more security that we are not underfitting our inferred model. So the best model we can use as of now are:
For the future impact, if the above methods are deployed in the real-world scenario, we can expect to get the right prediction for the churn 80% of the time (except Tree-Methods with Leave One Out Encoder), which is probably better than a random guess. Since it contained most of the predictors that don’t contribute to the outcome as several model analyses, it suffers from the curse of dimensionality. Also, there were not many interactions observed during my analysis. If we try to reduce the model cost further it will suffer overfitting, so increased bias would have a negative impact on the model. Hence, it is suggested that more data be collected since only 7k can’t guarantee an accuracy that is 80% achievable in practice most of the time. If some “outliers” were observed in the future related to this dataset, especially with tenure, then many techniques used would suffer. Putting the extreme cases aside, the company can hugely benefit if, they are doubtful about a customer being churned in the future. The firm can focus on these categories of customers to maintain their profits. Also, the model is trained using the dataset of a specified region, namely California, which could not match the other regions. In general, not all of the features of customers for one region matches with the feature of customers in other region. Hence, the model is not giving a holistic view of the churn prediction across the US. In these cases, our model will yield low accuracy and is expected to give the firm some benefit in their churn prediction.