Churn Prediction Dataset: Data Science Project
Comprehending Meaningful Predictors for Churn Prediction
By: Kushal Devgun
Data Introduction
In this project, I analyzed the Telco Customer Churn dataset. This dataset represents a sample from a telecommunications company offering different services to approximately 7,000 customers. It includes 19 predictors related to the services customers have subscribed to, with their possible values as shown below:
- Gender: Whether the customer is a male or a female
- Senior Citizen Status: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- Tenure: Number of months the customer has stayed with the company
- Phone Service: Whether the customer has a phone service or not (Yes, No)
- Multiple Lines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- Internet Service: Customer’s internet service provider (DSL, Fiber optic, No)
- Online Security: Whether the customer has online security or not (Yes, No, No internet service)
- Online Backup: Whether the customer has online backup or not (Yes, No, No internet service)
- Device Protection: Whether the customer has device protection or not (Yes, No, No internet service)
- Tech Support: Whether the customer has tech support or not (Yes, No, No internet service)
- Streaming TV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- Streaming Movies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- Paperless Billing: Whether the customer has paperless billing or not (Yes, No)
- Payment Method: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- Monthly Charges: The amount charged to the customer monthly
- Total Charges: The total amount charged to the customer
The purpose of this dataset is to predict whether a customer will churn or not. It includes 3 quantitative predictors (tenure, Monthly Charges, Total Charges) and 16 qualitative ones.
Research Questions
There are several questions that call for investigation:
- Which predictors are correlated?
- Which model will guarantee a good relationship for future predictions?
- What are the predictors that influence the outcome the most and should be used for future predictions?
- Does encoding the categorical variables with different encoding techniques make the model more meaningful?
- Can tree-based methods model the outcome well? If so, what should the depth, number of trees, and other settings be?
- Which validation method is trustworthy, and does the given model suffer from overfitting?
- Are any synergy (interaction) effects or other transformation effects observed in the dataset?
We explore all these answers collectively in the sections below.
Data Visualization
Below are the first few rows of the dataset, followed by its summary.

customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:7043 | Length:7043 | Min. :0.0000 | Length:7043 | Length:7043 | Min. : 0.00 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Min. : 18.25 | Min. : 18.8 | Length:7043 | |
Class :character | Class :character | 1st Qu.:0.0000 | Class :character | Class :character | 1st Qu.: 9.00 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 35.50 | 1st Qu.: 401.4 | Class :character | |
Mode :character | Mode :character | Median :0.0000 | Mode :character | Mode :character | Median :29.00 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 70.35 | Median :1397.5 | Mode :character | |
NA | NA | Mean :0.1621 | NA | NA | Mean :32.37 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 64.76 | Mean :2283.3 | NA | |
NA | NA | 3rd Qu.:0.0000 | NA | NA | 3rd Qu.:55.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 89.85 | 3rd Qu.:3794.7 | NA | |
NA | NA | Max. :1.0000 | NA | NA | Max. :72.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :118.75 | Max. :8684.8 | NA | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :11 | NA |
Below is a visualization of the correlation matrix and some relationships between predictors. We can see that few predictors are strongly correlated.
Telco.Churn.Dataset.numeric$tenure <- as.numeric(Telco.Churn.Dataset.numeric$tenure)
options(repr.plot.width = 50, repr.plot.height = 30)

corr <- cor(Telco.Churn.Dataset.numeric[,-c(1,21)])
ggcorrplot(corr, hc.order = TRUE, lab = FALSE, colors = c("#6D9EC1", "white", "#cb1f1f"))
Next, we create a scatter plot comparing Tenure and Total Charges.
ggplot(
data = Telco.Churn.Dataset.numeric,
mapping = aes(x = tenure, y = TotalCharges, color = Churn)) +
geom_point(size = 1) +
labs(title = "Tenure Vs Total Charges",x = "Tenure",y = "Total Charges")+
scale_color_manual(values = c("0" = "#0072B2", "1" = "#D55E00")) +
theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"), panel.background = element_rect(fill = "#f5eee1")) +
scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 10)) +
geom_smooth(color = "#f5af2e")
From the above plot, we see that churned customers become rarer as tenure increases, consistent with the correlation matrix.
ggplot(
data = Telco.Churn.Dataset.numeric, aes(x = tenure, fill = as.factor(tenure))) +
geom_bar(color = "black") +
labs(title = "Distribution of Tenure", x = "Tenure", y = "Count") +
theme_minimal() +
scale_fill_manual(values = colrs)+
theme(legend.position = "none") +
scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 5)) +
theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f6f2f7"))
Working Methodology
Stratifying the Dataset
Stratifying the data involves dividing a dataset into distinct subgroups (strata) based on a specific characteristic before performing analysis or training. This technique ensures that each subgroup is adequately represented, which can reduce bias and improve statistical reliability. Stratification is particularly useful in machine learning: it gives the model a fair chance to train on each characteristic or outcome, and it keeps the training and test datasets on the same distribution of key variables, which makes for a better model. In my case, I stratify the data based upon the output (Churn).
In my stratified sampling, I use 75% of the data for training and 25% for testing while maintaining the class ratio, as shown below.
\[ \text{Ratio} = \frac{\text{Churned}_{\text{train}}}{\text{Total}_{\text{train}}} = \frac{\text{Churned}}{\text{Total}} \]
Below is the code, where the function 'stratify.data.train.indicies' stratifies the data based on the training split ratio.
customers.left <- vector(mode = "numeric", length = 0)
customers.detained <- vector(mode = "numeric", length = 0)

for (index in seq_along(Churn)){
  if (Churn[index] == 'Yes')
    customers.left <- c(customers.left, index)
  else
    customers.detained <- c(customers.detained, index)
}
cat("Total number of customers Left:", length(customers.left)," and ", "Total number of customers detained:", length(customers.detained))
Total number of customers Left: 1869 and Total number of customers detained: 5163
stratify.data.train.indicies <- function(data.train.ratio, ...){
  number.of.left.train.samples <- data.train.ratio * length(customers.left)
  number.of.detained.train.samples <- data.train.ratio * length(customers.detained)

  total.train.samples.indicies <- vector(mode = "numeric", length = 0)
  total.train.samples.indicies <- c(total.train.samples.indicies, sample(customers.left, number.of.left.train.samples), sample(customers.detained, number.of.detained.train.samples))

  return (total.train.samples.indicies)
}
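As a quick usage sketch (the seed and the names `train.indices`, `train.set`, and `test.set` are hypothetical, not taken from the original run), the 75/25 split can then be built as:

```r
# Hypothetical usage of the stratified splitter defined above.
set.seed(1)  # any seed; chosen here only for reproducibility
train.indices <- stratify.data.train.indicies(0.75)  # 75% of each stratum
train.set <- Telco.Churn.Dataset[train.indices, ]
test.set  <- Telco.Churn.Dataset[-train.indices, ]

# Both subsets keep roughly the same Yes/No churn ratio as the full data:
prop.table(table(train.set$Churn))
prop.table(table(test.set$Churn))
```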
Using Different Encoders:
Since most of the variables involved are categorical, I represent them using some of the most commonly used categorical encoders for a financial dataset, namely:
- Bayesian Target Encoding
- Weight of Evidence
- Leave One Out Encoding
For Bayesian target encoding, I wrote the encoder myself, since I did not find a suitable R package for it. Mathematically, the formula for the Bayesian target encoder is:
\[ E(x \mid c) = \frac{N_c \, \bar{y}_c + m \, \bar{y}}{N_c + m} \]
where \( E(x \mid c) \) is the encoded value for category \( c \), \( N_c \) is the number of occurrences of category \( c \) in the dataset, \( \bar{y}_c \) is the mean target value for category \( c \), \( \bar{y} \) is the global mean of the target variable across all categories, and \( m \) is a smoothing factor (a hyperparameter). In my case, I take the smoothing factor to be 2.
Below is the code for the Bayesian target encoder, where the function "bayesian_encoding()" takes a single category at a time and encodes it using the target, Churn.
bayesian_encoding <- function(data, category, target, column.index){ #bayesian.mean.target.encoding

  #Weighted Mean = (n * Option Mean + m * Overall Mean)/(n + m)
  #Option Mean = No of instances of (Yes/1/positive) in a particular category / Total instances of that category
  #Overall Mean = Total instances of (Yes/1/positive) / Total instances
  #n = Weight of Option Mean (usually number of rows of the category)
  #m = Weight of Overall Mean (user defined)

  total.rows <- length(category)
  num.categories <- length(unique(category))
  names.of.categories <- unique(category)
  categories.count <- rep(0, num.categories)
  categories.count.yes <- rep(0, num.categories)
  m <- 2
  total.count.yes <- sum(target == 'Yes')

  for (index in seq_along(names.of.categories)){
    x <- sum(category == names.of.categories[index])
    categories.count[index] <- x
    categories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')

    #n = Weight of Option Mean (number of rows of the category)
    n <- categories.count[index]

    #m = Weight of Overall Mean (smoothing factor)
    m <- 2

    #Option Mean = No of instances of (Yes/1/positive) in the category / Total instances of the category
    option.mean <- categories.count.yes[index]/categories.count[index]

    #Overall Mean = Total instances of (Yes/1/positive) / Total instances
    overall.mean <- total.count.yes/total.rows

    #Weighted Mean (note the parentheses around n + m, so we divide by the full weight)
    weighted.mean <- (n * option.mean + m * overall.mean)/(n + m)

    data[, column.index:column.index] <- replace(data[, column.index:column.index], category == names.of.categories[index], weighted.mean)
  }
  return (data)
}
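To sanity-check the encoder, here is a hypothetical toy run; the `toy` data frame and its values are invented purely for illustration:

```r
# Invented toy data to verify bayesian_encoding() by hand.
toy <- data.frame(Contract = c("M2M", "M2M", "1yr", "1yr", "1yr"),
                  Churn    = c("Yes", "No", "No", "No", "Yes"),
                  stringsAsFactors = FALSE)
toy <- bayesian_encoding(toy, toy$Contract, toy$Churn, 1)

# For "M2M": n = 2, option mean = 1/2, overall mean = 2/5, m = 2, so the
# encoded value is (2*0.5 + 2*0.4)/(2 + 2) = 0.45.
# replace() coerces the column to character, so convert before modelling:
toy$Contract <- as.numeric(toy$Contract)
```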
Next is the Weight of Evidence, a statistical technique used primarily in credit scoring and binary classification to measure the predictive power of an independent variable in relation to a binary target. Mathematically, it is defined as the logarithm of the ratio between the proportion of positives ("goods") and negatives ("bads") in each bin:
\[ a = \frac{\text{Positives in the category}}{\text{Total positives}} \]
\[ b = \frac{\text{Negatives in the category}}{\text{Total negatives}} \]
\[ \text{WOE} = \log\left(\frac{a}{b}\right) \]
The code is shown below, where the function "woe_encoding" takes a single category at a time and encodes it using the target, Churn.
woe_encoding <- function(data, category, target, column.index){

  #WOE
  # a = (Positives|Yes|1 of the group)/(Total positives)
  # b = (Negatives|No|0 of the group)/(Total negatives)
  # WOE = log(a/b)

  total.rows <- length(category)
  num.categories <- length(unique(category))
  names.of.categories <- unique(category)
  categories.count <- rep(0, num.categories)
  categories.count.yes <- rep(0, num.categories)
  categories.count.no <- rep(0, num.categories)

  total.count.yes <- sum(target == 'Yes')
  total.count.no <- sum(target == 'No')

  for (index in seq_along(names.of.categories)){
    categories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')
    categories.count.no[index] <- sum(target[category == names.of.categories[index]] == 'No')

    a <- categories.count.yes[index] / total.count.yes
    b <- categories.count.no[index] / total.count.no
    woe <- log(a/b)
    #cat('WOE:', woe,'\n')

    data[, column.index:column.index] <- replace(data[, column.index:column.index], category == names.of.categories[index], woe)
  }
  return (data)
}
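Again as a hypothetical worked example (invented toy data, for illustration only):

```r
# Invented toy data to verify woe_encoding() by hand.
toy <- data.frame(Contract = c("M2M", "M2M", "1yr", "1yr"),
                  Churn    = c("Yes", "No", "No", "No"),
                  stringsAsFactors = FALSE)
toy <- woe_encoding(toy, toy$Contract, toy$Churn, 1)

# "M2M": a = 1/1, b = 1/3, WOE = log(3) ~ 1.099
# "1yr": a = 0/1, so WOE = log(0) = -Inf; in practice every category should
# contain both classes, or a small smoothing term must be added.
```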
Finally, Leave One Out encoding helps reduce data leakage, because each row is encoded without using its own target value. Mathematically it is described as,
\[ \hat{x}_i = \frac{\sum_{j \neq i} y_j \, \mathbb{1}(x_j = c)}{\sum_{j \neq i} \mathbb{1}(x_j = c)} \]
where \( \hat{x}_i \) is the encoded value for row \( i \), \( y_j \) is the target value for row \( j \), \( x_j \) is the category of row \( j \), \( c \) is the category being encoded, and \( \mathbb{1}(x_j = c) \) is an indicator function that equals 1 if \( x_j = c \), else 0.
The code is shown below, where the function "loo_encoding" takes a single categorical column at a time and applies Leave One Out encoding to it.
loo_encoding <- function(data, category_col, target_col) {
  data_encoded <- data

  unique_categories <- unique(data[[category_col]])
  encoded_values <- numeric(nrow(data))

  # Loop over each row and apply LOO Encoding
  for (i in 1:nrow(data)) {
    subset_data <- data[-i, ]

    # Calculate mean of target variable excluding current row
    mean_value <- mean(subset_data[subset_data[[category_col]] == data[i, category_col], target_col], na.rm = TRUE)
    encoded_values[i] <- mean_value
  }
  data_encoded[[paste0("loo_", category_col)]] <- encoded_values

  # Drop the original categorical column
  data_encoded[[category_col]] <- NULL

  return(data_encoded)
}
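A hypothetical usage sketch follows; note that this implementation expects a numeric 0/1 target column (invented toy data):

```r
# Invented toy data to verify loo_encoding() by hand (numeric 0/1 target).
toy <- data.frame(Contract = c("M2M", "M2M", "M2M", "1yr", "1yr"),
                  Churn    = c(1, 0, 1, 0, 1))
loo_encoding(toy, "Contract", "Churn")

# Row 1 ("M2M") is encoded as the mean target of the *other* M2M rows:
# (0 + 1)/2 = 0.5, so a row's own label never leaks into its encoding.
# A singleton category would yield NaN, since no peer rows remain.
```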
Modelling:
- Logistic Regression (with different encodings)
- KNN (with different k and different encoding)
- QDA, LDA, Naive Bayes
- Random Forest and Bagging, Boosting, XGBoost, ADABoost, SVM (with different encodings, and different numbers and depths of trees)
With simple logistic regression, we observed significant changes in the z-statistics and p-values for Payment Method, Phone Service, and Tech Support when the model was trained with all the predictors in each case.
Predictor | Encoding | p - value |
---|---|---|
Payment Method | N/A | 0.000811 (Lowest among all) |
Payment Method | Bayesian | 1.78e-06 |
Payment Method | WOE | 9.87e-07 |
Payment Method | LOO | 7.47e-06 |
Phone Service | N/A | 0.43 (Lowest among two) |
Phone Service | Bayesian | 2.21e-06 |
Phone Service | WOE | 1.61e-06 |
Phone Service | LOO | 1.64e-11 |
Tech Support | N/A | 0.798317 |
Tech Support | Bayesian | 5.77e-05 |
Tech Support | WOE | 6.02e-05 |
Tech Support | LOO | 2.04e-05 |
Some other predictors also showed greater significance with the different encodings.
Logistic Regression accuracies, ROC Curve and AUC scores.
- We observe different accuracies for the different encodings when using all the predictors, varying slightly with the random seed across encodings.
- It is important to note that, regardless of the encoding, the ROC curves and AUC scores were the same for all of them.
- Also, the slightly lower AIC and BIC scores for the WOE and LOO encoders suggest that these encoders might be good candidates for this dataset; a sketch of how each fit and its scores can be computed follows.
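For reference, a minimal sketch of such a fit; `train.data` and `test.data` are placeholder names for one of the encoded train/test splits, not the exact objects used above:

```r
# Minimal sketch; train.data / test.data are hypothetical placeholders.
fit  <- glm(Churn ~ ., data = train.data, family = binomial)
prob <- predict(fit, newdata = test.data, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)            # 0.5 probability cut-off

mean(pred == test.data$Churn) * 100         # test accuracy in %
AIC(fit)                                    # Akaike Information Criterion
BIC(fit)                                    # Bayesian Information Criterion
```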
Simple Logistic Regression
True Value: Not Churned Churned
Predicted (by Simple):
Not Churned 1154 202
Churned 137 266
Accuracy for Simple Logistic Regression: 80.728 %
Akaike Information Criterion(AIC) for the simple logistic regression is 4423.434
Bayesian Information Criterion(BIC) for the simple logistic regression is 4581.123
Bayesian Target Encoding
True Value: Not Churned Churned
Predicted (by Bayesian):
Not Churned 1148 202
Churned 143 266
Accuracy for Bayesian Encoding: 80.387 %
Akaike Information Criterion(AIC) for the Bayesian Encoded logistic regression is 4423.48
Bayesian Information Criterion(BIC) for the Bayesian Encoded logistic regression is 4554.887
Weight of Evidence
True Value: Not Churned Churned
Predicted (by WOE):
Not Churned 1150 203
Churned 141 265
Accuracy for WOE encoding: 80.44343 %
Akaike Information Criterion(AIC) for the WOE Encoded logistic regression is 4418.805
Bayesian Information Criterion(BIC) for the WOE Encoded logistic regression is 4550.212
Leave One Out Encoding
True Value: Not Churned Churned
Predicted (by LOO):
Not Churned 1154 198
Churned 137 270
Accuracy for Leave One Out: 80.955 %
Akaike Information Criterion(AIC) for the LOO Encoded logistic regression is 4418.805
Bayesian Information Criterion(BIC) for the LOO Encoded logistic regression is 4550.212
pred <- prediction(logistic.regression.train.fit.prob, Y.test)

# Calculate performance (True Positive Rate and False Positive Rate)
perf <- performance(pred, "tpr", "fpr")

plot(perf, col = "darkgreen", lwd = 2, main = "ROC Curve")
abline(a = 0, b = 1, col = "red", lty = 2)

auc_perf <- performance(pred, measure = "auc")
auc_value <- auc_perf@y.values[[1]]

grid(col = "lightgray", lty = "dotted")
legend(
  "bottomright",
  legend = paste("AUC =", round(auc_value,3)),
  col = "#1f77b4",
  lwd = 2.5,
  bty = "n",
  cex = 1.1
)
cat("AUC Score for all:\033[34m", round(auc_value, 3), "\033[0m")
AUC Score for all: 0.847
KNN accuracies, best K- value and trend of accuracy with k-values
- We observe that Leave One Out encoding tends to give slightly better accuracy almost every time compared to the rest.
- Nearly all the encoders give their maximum accuracy for k-values between 9 and 18.
- Interestingly, for the Bayesian and Weight of Evidence encodings, the maximum accuracy is sometimes observed at k = 71 and k = 72 respectively, which is quite surprising.
Best Accuracy in Simple KNN 79.193 % with a k-value of 16
Best Accuracy of KNN in Bayesian Target Encoder 79.022 % with a k-value of 17
Best Accuracy of KNN in Weight of Evidence 79.022 % with a k-value of 17
Best Accuracy of KNN in Leave One Out Encoding 80.728 % with a k-value of 15
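The k sweep behind these figures can be reproduced along the following lines; `X.train.scaled`, `X.test.scaled`, `Y.train`, and `Y.test` are placeholder names for the numeric (ideally standardized) feature matrices and label vectors:

```r
library(class)  # provides knn()

# Hypothetical sketch of the k sweep; feature/label names are placeholders.
k.values <- 1:100
k.accuracies <- sapply(k.values, function(k) {
  pred <- knn(train = X.train.scaled, test = X.test.scaled,
              cl = Y.train, k = k)
  mean(pred == Y.test) * 100
})
cat("Best accuracy:", max(k.accuracies), "% at k =", which.max(k.accuracies))
```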
knn.data <- data.frame(
  k = k.values,
  KNN.Simple.Accuracies = k.accuracies,
  KNN.Bayesian.accuracies = k.bayesian.accuracies,
  KNN.WOE.accuracies = k.woe.accuracies,
  KNN.LOO.accuracies = k.loo.accuracies
)

knn.data.all <- knn.data %>%
  pivot_longer(
    cols = -k,
    names_to = "Method",
    values_to = "Accuracy"
  )
ggplot(knn.data.all, aes(x = k, y = Accuracy, color = Method)) +
geom_line(size = 0.5) +
geom_point(size = 1) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "K-Values vs. Accuracy for Different Encodings",
x = "K-Value",
y = "Accuracy"
  ) +
  theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.background = element_rect(fill = "#f6f2f7"),panel.grid = element_blank()) +
scale_color_manual(values = c("KNN.Simple.Accuracies" = "#38d6e3", "KNN.Bayesian.accuracies" = "#b0d83e", "KNN.WOE.accuracies" = "#E2D947", "KNN.LOO.accuracies" ="#D55E00" )) +
theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))
Analysis with Naive Bayes With Different Encoders
- We observe that the accuracies and AUC scores differ slightly across encoders.
- Similarly, we see different 95% confidence intervals, McNemar's test p-values, and Kappa values for each encoder.
- Among all, WOE encoding has the best sensitivity, meaning it catches more of the positive class.
- All models have Kappa scores in the 0.37-0.41 range, with simple Naive Bayes best at 0.4128.
- Overall, simple Naive Bayes gives the best balance; a sketch of how these fits can be produced follows.
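A minimal sketch of how each confusion matrix below could be produced, assuming the e1071 and caret packages and placeholder split names:

```r
library(e1071)  # naiveBayes()
library(caret)  # confusionMatrix()

# Hypothetical sketch; train.data / test.data are placeholder splits with
# Churn stored as a factor.
nb.fit  <- naiveBayes(Churn ~ ., data = train.data)
nb.pred <- predict(nb.fit, newdata = test.data)
confusionMatrix(data = nb.pred, reference = test.data$Churn, positive = "Yes")
```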
For Simple Naive Bayes:
Confusion Matrix and Statistics
Actual
Predicted No Yes
No 886 86
Yes 405 382
Accuracy : 0.7209
95% CI : (0.6993, 0.7417)
No Information Rate : 0.7339
P-Value [Acc > NIR] : 0.8972
Kappa : 0.4128
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8162
Specificity : 0.6863
Pos Pred Value : 0.4854
Neg Pred Value : 0.9115
Prevalence : 0.2661
Detection Rate : 0.2172
Detection Prevalence : 0.4474
Balanced Accuracy : 0.7513
'Positive' Class : Yes
For Bayesian encoding:
Confusion Matrix and Statistics
actual
predicted 0 1
0 852 80
1 439 388
Accuracy : 0.7049
95% CI : (0.683, 0.7262)
No Information Rate : 0.7339
P-Value [Acc > NIR] : 0.997
Kappa : 0.3929
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8291
Specificity : 0.6600
Pos Pred Value : 0.4692
Neg Pred Value : 0.9142
Prevalence : 0.2661
Detection Rate : 0.2206
Detection Prevalence : 0.4702
Balanced Accuracy : 0.7445
'Positive' Class : 1
For WOE encoding:
Confusion Matrix and Statistics
actual
predicted 0 1
0 810 70
1 481 398
Accuracy : 0.6868
95% CI : (0.6645, 0.7084)
No Information Rate : 0.7339
P-Value [Acc > NIR] : 1
Kappa : 0.3733
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8504
Specificity : 0.6274
Pos Pred Value : 0.4528
Neg Pred Value : 0.9205
Prevalence : 0.2661
Detection Rate : 0.2263
Detection Prevalence : 0.4997
Balanced Accuracy : 0.7389
'Positive' Class : 1
For LOO encoding:
Confusion Matrix and Statistics
actual
predicted 0 1
0 849 79
1 442 389
Accuracy : 0.7038
95% CI : (0.6819, 0.7251)
No Information Rate : 0.7339
P-Value [Acc > NIR] : 0.9979
Kappa : 0.3919
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8312
Specificity : 0.6576
Pos Pred Value : 0.4681
Neg Pred Value : 0.9149
Prevalence : 0.2661
Detection Rate : 0.2211
Detection Prevalence : 0.4724
Balanced Accuracy : 0.7444
'Positive' Class : 1
plot(
ROCurve4,response = Y.test.naive.bayes.loo, predictor = as.numeric(naive.bayes.loo.class),
plot = TRUE, legacy.axes = TRUE,
xlim = c(1, 0), ylim = c(0, 1),
col = "#008D10", lwd = 1,
main = "ROC Curve for Naive Bayes",
axes = FALSE
)
axis(1, at = seq(0, 1, by = 0.2), labels = seq(1, 0, by = -0.2)) # Specificity axis
axis(2, at = seq(0, 1, by = 0.2), labels = seq(0, 1, by = 0.2), pos = 1) # Sensitivity axis
lines(ROCurve1, col = "#FE0D2D", lwd = 1)
lines(ROCurve2, col = "#46CCEA", lwd = 1)
lines(ROCurve3, col = "#C1D634", lwd = 1)
par(bg = "#e2f5ec")
legend("right", legend = c("LOO", "Simple", "Bayesian", "WOE"), col = c("#008D10", "#FE0D2D", "#46CCEA","#C1D634"), lwd = 2)
The AUC score observed for the simple version of Naive Bayes was 0.751; for Bayesian encoding it was 0.745, for WOE encoding 0.739, and for LOO encoding 0.744.
Analysis for LDA and QDA
- For LDA with p = 1, we take tenure, since it was the most statistically significant predictor in all previous models.
Y.test.lda
No Yes
No 1291 468
Yes 0 0
The accuracy for the LDA with p = 1 fit is 72.086 %
- It is important to note that tenure alone incurs almost a 30% error rate, because it does not align well with the normal distribution assumption, yet it still achieves about 70% accuracy.
- It is also important to note that LDA predicted all the true negatives in the testing data accurately, although it did so by classifying every observation as Not Churned. The same findings are shown below, where the dashed line is the normal density with mean 5 and SD 16.
ggplot(data.frame(tenure), aes(x = tenure)) +
geom_density(fill = "#f2bfb9", alpha = 0.4) +
xlim(-40, max(tenure)) +
ylim(0,0.05) +
stat_function(fun = dnorm, args = list(mean = 5, sd = 16), color = "red", linetype = "dashed", size = 1.2) +
labs(title = "Probability Distribution for Tenure",x = "Tenure",y = "Probability Density") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())
For LDA with p = 2, we take tenure and Contract, since they turned out to be the most statistically significant.
- The probability distribution of Contract is shown below, where the dashed line is the normal density with mean 0 and SD 0.25.
ggplot(data.frame(Telco.Churn.Dataset.numeric$Contract), aes(x = Telco.Churn.Dataset.numeric$Contract)) +
geom_density(fill = "#b4cef5", alpha = 0.4) +
xlim(-10,10) +
ylim(0,2) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 0.25), color = "blue", linetype = "dashed", size = 1.2) +
labs(title = "Probability Distribution for Contract",x = "Contract",y = "Probability Density") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())
So we state our assumption: by combining these, we may expect a joint probability density close to that of a multivariate normal distribution.
Y.test.lda
No Yes
No 1209 348
Yes 82 120
The accuracy for the LDA with p = 2 (using tenure and Contract) fit is 72.086 %
Result: The accuracy did not improve, so there is some evidence against our assumption that two approximately normal predictors would yield a multivariate normal joint distribution. Moreover, the two predictors are highly correlated (0.68), so a clean bell-shaped joint density was not possible.
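For reference, a minimal sketch of how the p = 2 LDA fit could be produced with MASS; `train.data` and `test.data` are placeholder names for numeric-encoded splits:

```r
library(MASS)  # lda()

# Hypothetical sketch; train.data / test.data are placeholder splits with
# numeric tenure and encoded Contract columns.
lda.fit  <- lda(Churn ~ tenure + Contract, data = train.data)
lda.pred <- predict(lda.fit, newdata = test.data)$class

table(Predicted = lda.pred, Actual = test.data$Churn)
mean(lda.pred == test.data$Churn) * 100  # accuracy in %
```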
Analysis of Random Forest with Encoders
Confusion matrix and error rate with 12 predictors, no encoder, and 500 trees in the forest.
True Value: Not Churned Churned
Predicted (by Random Forest):
Not Churned 1140 238
Churned 151 230
Error Rate for Random Forest is: 22.11 % and maximum Out-of-Bag Error is 0.551
**We note the error rate is approximately 22% with 12 predictors. Next, we analyse the testing accuracies with the encoders, taking different numbers of predictors.**
ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
#ylim(75,100) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "Number of Predictors vs. RF Accuracy for Different Encodings",
x = "Predictors",
y = "Accuracy %"
  ) +
  theme_minimal() +
scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947", "RF.LOO.accuracies" ="#D55E00"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))+
theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))
Plot without LOO encoder
ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
ylim(76,81) +
#geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(
title = "Number of Predictors vs. RF Accuracy for Different Encodings",
x = "Predictors",
y = "Accuracy %"
  ) +
  theme_minimal() +
scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
theme(legend.background = element_rect(fill = "#f7efed", color = NA))
- We observe that the accuracy remains roughly constant after reaching its maximum at 3 or 4 predictors for all the encoders except the Leave One Out encoder.
- The reason Leave One Out shows 100% accuracy is that it separates the two outcomes into two distinct encoded values for the Senior Citizen predictor. The same phenomenon was observed for other predictors; each of them alone can deliver 100% accuracy. Below is the evaluation of why this happens.
sum(
  Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.2364 &
  Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
[1] 350
sum(
  Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.4163 &
  Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
[1] 118
sum(Random.Forest.X.loo.encoding.Y.test$Churn == "1")
[1] 468
This questions the credibility of the result: it is rarely the case that a single encoder yields 100% accuracy, and such behaviour typically points to target leakage in the encoding, even though the dataset was downloaded from Kaggle and contributed by the well-known tech firm IBM. Thus, for further analysis we omit the LOO encoding.
Simple Random Forest with Number of Trees
Next, we analysed accuracy as the number of trees varies in a simple random forest with 4 predictors, which our analysis above suggested is the most accurate/stable number of predictors. Since we do not have a GPU, we traverse from 1 to 7,000 trees in jumps of 100.
rf.number.of.trees <- seq(1, 7000, by = 100)
rf.trees.accuracies <- rep(0, length(rf.number.of.trees))

for (index in seq_along(rf.number.of.trees)){
  # ntree must index into the sequence of tree counts, not the loop counter
  rf.model.fit <- randomForest(Churn ~ ., data = Random.Forest.X.Y.train, mtry = 4, importance = TRUE, ntree = rf.number.of.trees[index])
  predictions <- predict(rf.model.fit, newdata = Random.Forest.X.Y.test)
  confusion_matrix <- table(predictions, Random.Forest.X.Y.test$Churn)
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  rf.trees.accuracies[index] <- accuracy
}

rf.trees.accuracies <- rf.trees.accuracies * 100

rf.trees.data <- data.frame(mtry = rf.number.of.trees, accuracy = rf.trees.accuracies)
# Create the plot
ggplot(rf.trees.data, aes(x = mtry, y = accuracy)) +
#ylim(68,82) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "Random Forest Accuracy vs Number of Trees", x = "Number of Trees", y = "Test Accuracy %") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3") +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
We can see from the above graph that the accuracy increases with the number of trees, with some fluctuations, and becomes almost steady after 5,000. Hence a random forest with four or five predictors and roughly 1,500 trees looks sufficient, as sketched below.
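A minimal sketch of that suggested final fit, reusing the training frame from the loop above (mtry = 4 and ntree = 1500 per the conclusion; an illustration, not the exact run):

```r
library(randomForest)

# Sketch of the suggested final model from the analysis above.
rf.final <- randomForest(Churn ~ ., data = Random.Forest.X.Y.train,
                         mtry = 4, ntree = 1500, importance = TRUE)
rf.pred  <- predict(rf.final, newdata = Random.Forest.X.Y.test)
mean(rf.pred == Random.Forest.X.Y.test$Churn) * 100  # test accuracy in %
```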
10-fold Cross-Validation
- We next perform 10-fold cross validation on the Random Forest to check the number of predictors required.
X.dataset.Cross.Validation <- Telco.Churn.Dataset[,-c(1)]
X.dataset.Cross.Validation$Churn <- as.factor(X.dataset.Cross.Validation$Churn)

train_control <- trainControl(method = "cv", number = 10)

rf_cv <- train(Churn ~ tenure + PaymentMethod + PaperlessBilling + TotalCharges + PhoneService + Contract, data = X.dataset.Cross.Validation, method = "rf", trControl = train_control)
print(rf_cv)
Random Forest
7032 samples
6 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6328, 6329, 6329, 6329, 6328, 6330, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7780170 0.3470699
5 0.7792956 0.4046720
9 0.7555496 0.3454467
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
ggplot(rf_cv$results, aes(x = mtry, y = Accuracy)) +
geom_line(color = "#1f77b4", size = 1) +
geom_point(size = 2, color = "#ff7f0e") +
geom_errorbar(aes(ymin = Accuracy - 1 * AccuracySD, ymax = Accuracy + 1 * AccuracySD),
width = 0.2, color = "#1f77b4", alpha = 0.6) +
labs(
title = "Cross-Validated Accuracy vs. mtry (Random Forest)",
x = "Number of Predictors Sampled (mtry)",
y = "Accuracy"
  ) +
  theme_minimal(base_size = 13) +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
- Hence, with mtry = 5 we can choose the random forest model.
XGBoost with Encoders
- We next analysed XGBoost with the different encoders, varying tree depth (1 up to 100) and using all the predictors.
But before that, we trained a simple XGBoost model to get an estimate of the accuracy, taking a maximum tree depth of 6 and 100 boosting rounds; a sketch of such a fit follows.
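A sketch of a baseline fit that could produce the confusion matrix below, assuming the numeric train/test matrices built elsewhere in the notebook (the learning rate eta = 0.1 is an assumption carried over from the later cross-validation call):

```r
library(xgboost)

# Hypothetical baseline fit: depth 6, 100 boosting rounds.
xgb.base <- xgboost(data = X.train.xg.matrix, label = Y.train.xg.matrix,
                    max_depth = 6, eta = 0.1, nrounds = 100,
                    objective = "binary:logistic", eval_metric = "logloss",
                    verbose = 0)
xgb.prob <- predict(xgb.base, X.test.xg.matrix)
mean(ifelse(xgb.prob > 0.5, 1, 0) == Y.test.xg.matrix) * 100  # accuracy %
```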
True Values Not Churned Churned
Predicted (by XGBoost)
Not Churned 1149 231
Churned 142 237
Accuracy: 0.7879477
- Now we analyse different depths, with the number of rounds fixed at 100.
xgboost.accuracies <- xgboost.accuracies * 100
xgboost.bayesian.accuracies <- xgboost.bayesian.accuracies * 100
xgboost.woe.accuracies <- xgboost.woe.accuracies * 100
ggplot(XGBoost.data.all, aes(x = Depth, y = Accuracy, color = Encodings)) +
geom_line(size = 0.5) +
geom_point(size = 0.5) +
#coord_cartesian(ylim = c(65, 82)) +
geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
labs(title = "XGBoost Accuracy vs. Depth of Tree", x = "Depth", y = "Test Accuracy %") +
theme_minimal() +
scale_color_manual(values = c("XGBoost.Simple.Accuracies" = "#38d6e3", "XGBoost.Bayesian.accuracies" = "#b0d83e", "XGBoost.WOE.accuracies" = "#E2D947"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
theme(legend.background = element_rect(fill = "#ecf8f6", color = NA))
- Now we analyse the number of rounds using 10-fold cross-validation, with depth 55 (where test accuracy reaches equilibrium).
cv <- xgb.cv(
  data = X.train.xg.matrix,
  label = Y.train.xg.matrix,
  max_depth = 55,
  eta = 0.1,
  nrounds = 500,
  nfold = 10,
  objective = "binary:logistic",
  eval_metric = "logloss",
  early_stopping_rounds = 10,
  verbose = 0
)

best.round <- cv$best_iteration
cat("Best round turns out to be", best.round, "th")
Best round turns out to be 24 th
- From the above plots we can see that XGBoost does not perform well with Weight of Evidence and shows a lot of fluctuation with it.
- All the XGBoost encodings tend to reach equilibrium once the depth goes beyond approximately 55.
- The simple version of XGBoost was the quickest to converge, followed by the XGBoost WOE model.
- Also, across different depth values we observed the best round to lie between 20 and 40.
Boosting
- We next analyse boosting with different numbers of trees and interaction depths.
Before that, we do a quick analysis with interaction depth 4 and 1,500 trees with the different encoders; a sketch of this fit follows.
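A sketch of that quick fit with gbm (the training-frame name is a placeholder, and Churn is assumed to be coded 0/1 for the bernoulli distribution); the resulting `boost.model.fit` is the object summarized and plotted below:

```r
library(gbm)

# Hypothetical sketch of the quick boosting fit described above.
boost.model.fit <- gbm(Churn ~ ., data = train.data,
                       distribution = "bernoulli",
                       n.trees = 1500, interaction.depth = 4)
summary(boost.model.fit)  # relative influence, as tabulated below
```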
Summary, Relative Importance and Partial Dependence Plots with Simple Boosting
Predictor | Relative Influence (%) |
---|---|
TotalCharges | 29.1053032 |
MonthlyCharges | 24.4155948 |
tenure | 12.2500672 |
Contract | 11.5236500 |
OnlineSecurity | 3.4202746 |
PaymentMethod | 2.9221656 |
TechSupport | 2.7904124 |
gender | 1.7202024 |
PaperlessBilling | 1.6276123 |
OnlineBackup | 1.5575649 |
SeniorCitizen | 1.4474142 |
DeviceProtection | 1.2657916 |
StreamingTV | 1.0762316 |
MultipleLines | 1.0420367 |
Dependents | 0.9315417 |
StreamingMovies | 0.9200107 |
InternetService | 0.7660315 |
Partner | 0.6786471 |
PhoneService | 0.5394474 |
colnames(importance_df) <- c("Predictor", "RelativeInfluence")
ggplot(importance_df, aes(x = reorder(Predictor, RelativeInfluence), y = RelativeInfluence)) +
geom_bar(stat = "identity", fill = "#e8532e") +
coord_flip() +
labs(title = "Predictor Importance in Simple GBM (Boosting)",
x = "Predictor",
y = "Relative Influence %") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))
- From the above plot, we observe that Total Charges and Monthly Charges, which are numerical predictors, have much higher influence than tenure.
- Similarly, Payment Method, Online Security, and Tech Support are highly influential along with Contract.
- So we can select six main predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security.
The partial dependence of Churn on Total Charges, Monthly Charges, and tenure is shown below.
pdp.totalCharges <- plot(boost.model.fit, i.var = "TotalCharges", return.grid = TRUE)
ggplot(pdp.totalCharges, aes(x = TotalCharges, y = y)) +
geom_line(color = "#38d6e3", size = 1.2) +
geom_point(color = "#0073c2ff", size = 1.5) +
labs(title = "Partial Dependence of Churn on Total Charges", x = "Total Charges", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))
pdp.tenure <- plot(boost.model.fit, i = "tenure", return.grid = TRUE)
ggplot(pdp.tenure, aes(x = tenure, y = y)) +
geom_line(color = "#9ff09b", size = 1.2) +
geom_point(color = "#2aca22", size = 1.5) +
labs(title = "Partial Dependence of Churn on Tenure", x = "Tenure", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
pdp.MonthlyCharges <- plot(boost.model.fit, i.var = "MonthlyCharges", return.grid = TRUE)
ggplot(pdp.MonthlyCharges, aes(x = MonthlyCharges, y = y)) +
geom_line(color = "#eacd92", size = 1.2) +
geom_point(color = "#d59f0e", size = 1.5) +
labs(title = "Partial Dependence of Churn on Monthly Charges", x = "Monthly Charges", y = "Partial Dependence (log-odds)") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(face = "bold", size = 12),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f9f9ea"))
Summary of Boosting with other Encoders
- Interestingly, we observed the same feature importances and PDPs in boosting regardless of the encoding we used!
Boosting Test Accuracy with the number of trees
- Next we analyse boosting with up to 5,000 trees, stepping by 10 from 1 to 5,000.
# Create the plot
ggplot(boosting.data, aes(x = num.trees, y = boost.accuracies)) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "Boosting Accuracy vs. Number of Trees", x = "Trees", y = "Accuracy") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3")+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
- From the above plot we can say that roughly 100 to 500 trees is a good interval, because accuracy is still increasing over that range.
ADABoost
- We next analyse AdaBoost, which focuses more on the hard-to-classify examples. We will vary the number of weak learners (typically trees) trained in AdaBoost, using all the predictors in this case.
- First, we fit the model with 50 weak learners under the different encoders. This gives the confusion matrices and accuracies below.
True Values Not Churned Churned
Predicted (by ADABoost)
Not Churned 1145 220
Churned 146 248
[1] "Accuracy: 79.19 %"
True Values Not Churned Churned
Predicted (by ADABoost Bayesian)
Not Churned 1142 217
Churned 149 251
[1] "Accuracy: 79.19 %"
True Values Not Churned Churned
Predicted (by ADABoost WOE)
Not Churned 1140 218
Churned 151 250
[1] "Accuracy: 79.02 %"
- We observed that the Bayesian encoder does slightly better in this case.
Now we vary the number of weak classifiers from 1 to 100; extending beyond this value, we would be more prone to overfitting.
m.final.values <- seq(1, 100, by = 1)
m.accuracies <- rep(0, 100)

for (index in seq_along(m.final.values)){
  adaboost.model.fit <- boosting(Churn ~ ., data = X.Y.train.AdaBoost, boos = TRUE, mfinal = index)
  adaboost.predictions <- predict(adaboost.model.fit, X.Y.test.AdaBoost)
  confusion.matrix <- table(Predicted = adaboost.predictions$class, Actual = X.Y.test.AdaBoost$Churn)

  accuracy <- sum(diag(confusion.matrix)) / sum(confusion.matrix)

  m.accuracies[index] <- accuracy
}

m.accuracies <- m.accuracies * 100
AdaBoost.data <- data.frame(Number.of.Weak.Classifiers = m.final.values, ADAboost.Simple.Accuracies = m.accuracies)
# Create the plot
ggplot(AdaBoost.data, aes(x = Number.of.Weak.Classifiers, y = ADAboost.Simple.Accuracies)) +
geom_line(color = "#38d6e3") + # Line plot
geom_point(color = "#b0d83e") + # Points on the plot
labs(title = "ADABoost Accuracy vs Weak Classifier", x = "Weak Classifier", y = "Accuracy") +
theme_minimal() + # Aesthetic theme
geom_smooth(color = "#38d6e3")+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))
Training Models with meaningful predictors
We now have a set of meaningful predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security.
We apply the logistic regression and XGBoost models again with these inferred predictors.
X.Y.Mixed <- Telco.Churn.Dataset

encoded.categories <- list('Contract', 'PaymentMethod', 'OnlineSecurity')
for (category in encoded.categories){
  column.index <- which(colnames(X.Y.Mixed) == category)
  X.Y.Mixed <- bayesian_encoding(X.Y.Mixed, X.Y.Mixed[,column.index:column.index], X.Y.Mixed[,21:21], column.index)

  if(category == 'Contract')
    X.Y.Mixed <- woe_encoding(X.Y.Mixed, X.Y.Mixed[,column.index:column.index], Telco.Churn.Dataset[,21:21], column.index)

  X.Y.Mixed[[category]] <- as.numeric(X.Y.Mixed[[category]])
}
X.Y.Mixed$Churn <- as.numeric(as.factor(X.Y.Mixed$Churn)) - 1

X.train <- X.Y.Mixed[train.X.vector.indicies.unmasked,-c(1,21)]
X.test <- X.Y.Mixed[!train.X.vector.indicies.unmasked,-c(1,21)]
Y.test <- X.Y.Mixed$Churn[!train.X.vector.indicies.unmasked]

logistic.regression.train.fit <- glm(Churn ~ TotalCharges + MonthlyCharges + Contract + tenure + PaymentMethod + OnlineSecurity, data = X.Y.Mixed[,-c(1)], family = binomial, subset = train.X.vector.indicies.unmasked)

summary(logistic.regression.train.fit)
Call:
glm(formula = Churn ~ TotalCharges + MonthlyCharges + Contract +
tenure + PaymentMethod + OnlineSecurity, family = binomial,
data = X.Y.Mixed[, -c(1)], subset = train.X.vector.indicies.unmasked)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.055e+01 8.519e-01 -12.382 < 2e-16 ***
TotalCharges 3.058e-04 7.839e-05 3.901 9.59e-05 ***
MonthlyCharges 1.434e-02 2.285e-03 6.275 3.50e-10 ***
Contract 5.300e-01 5.064e-02 10.465 < 2e-16 ***
tenure -5.555e-02 6.880e-03 -8.074 6.82e-16 ***
PaymentMethod 1.781e+00 2.738e-01 6.506 7.72e-11 ***
OnlineSecurity 2.299e+00 3.019e-01 7.617 2.61e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6105.4 on 5272 degrees of freedom
Residual deviance: 4480.5 on 5266 degrees of freedom
AIC: 4494.5
Number of Fisher Scoring iterations: 6
logistic.regression.train.fit.prob <- predict(logistic.regression.train.fit, X.test, type = "response")

logistic.regression.train.fit.pred <- rep(0, length(logistic.regression.train.fit.prob))
logistic.regression.train.fit.pred[logistic.regression.train.fit.prob > 0.5] = 1

tab <- table(logistic.regression.train.fit.pred, Y.test)

dimnames(tab) <- list(
  "Predicted (by Simple):" = c("Not Churned", "Churned"),
  "True Value:" = c("Not Churned", "Churned")
)
print(ftable(tab))
True Value: Not Churned Churned
Predicted (by Simple):
Not Churned 1126 218
Churned 165 250
accuracy <- mean(logistic.regression.train.fit.pred == Y.test) * 100
cat('Accuracy for Simple Logistic Regression:', round(accuracy, 3),'%','\n\n')
Accuracy for Simple Logistic Regression: 78.226 %
cat("Akaike Information Criterion(AIC) for the simple logistic regression is",AIC(logistic.regression.train.fit),'\n') # Akaike Information Criterion
Akaike Information Criterion(AIC) for the simple logistic regression is 4494.482
cat("Bayesian Information Criterion(BIC) for the simple logistic regression is",BIC(logistic.regression.train.fit)) # Bayesian Information Criterion
Bayesian Information Criterion(BIC) for the simple logistic regression is 4540.475
xgboost.model <- xgboost(
  data = X.train.xg.matrix,
  label = Y.train.xg.matrix,
  max_depth = 7,
  eta = 0.1,
  nrounds = 24,
  objective = "binary:logistic",
  eval_metric = "logloss",
  verbose = 0
)

importance_matrix <- xgb.importance(model = xgboost.model)
print(importance_matrix)
Feature Gain Cover Frequency
<char> <num> <num> <num>
1: Contract 0.42949843 0.20502537 0.02753304
2: MonthlyCharges 0.22225678 0.31848539 0.43392070
3: TotalCharges 0.20611161 0.27845407 0.39812775
4: OnlineSecurity 0.08957889 0.09750718 0.03083700
5: PaymentMethod 0.02754038 0.05309149 0.07929515
6: TechSupport 0.02501391 0.04743650 0.03028634
y.pred <- predict(xgboost.model, X.test.xg.matrix)
y.pred.class <- ifelse(y.pred > 0.5, 1, 0) # Convert probabilities to 0/1

conf_matrix <- table(Predicted = y.pred.class, Actual = Y.test.xg.matrix)
dimnames(conf_matrix) <- list(
  "Predicted (by XGBoost)" = c("Not Churned", "Churned"),
  "True Values" = c("Not Churned", "Churned")
)
print(ftable(conf_matrix))
True Values Not Churned Churned
Predicted (by XGBoost)
Not Churned 1150 232
Churned 141 236
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat('Accuracy for XGBoost:', round(accuracy * 100, 3),'%','\n\n')
Accuracy for XGBoost: 78.795 %
Conclusion
In this project, we wanted to predict whether a customer churns or not. From the start, we had 19 predictors, and most of them did not look statistically significant. So we applied different encoding techniques: Bayesian Target Encoding, Weight of Evidence, and Leave One Out. We carried out our analysis with different modelling techniques and different encodings to infer the suitable predictors; predictors like Payment Method, Phone Service, and Tech Support showed low p-values under the encodings. In total, about 12 to 13 predictors seemed statistically significant across the different encodings. Further, we confirmed that tenure is highly influential on the output because of its low p-value in most cases, which we also showed in the LDA with p = 1. Furthermore, we performed and analyzed tree-based methods, where boosting showed that Total Charges, Monthly Charges, and Contract have high relative influence compared to the others. Similarly, the 10-fold cross-validation on the random forest also suggested that about 5-6 predictors are sufficient for a significantly meaningful model. Hence, our observations picked six predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security. We verified that with these predictors we obtained 78% accuracy with simple logistic regression and XGBoost, while the highest accuracy observed was 80.5%, achieved by techniques that used all the predictors. Also, the AIC and BIC scores were slightly lower for the model with six predictors, giving strong evidence for our assumption of six to seven influential predictors.
Our analysis did not include backward or forward selection methods, which would help us learn more about these predictors. Similarly, we did not remove outliers or high-leverage points, which could improve accuracy and give more insight into the model. In our case the data is limited, so we can consider including data from other parts of the US via cluster or systematic sampling; in general, more data would give us more confidence that we are not underfitting the inferred model. So the best models we can use as of now are:
- Simple Logistic Regression, Random Forest, and XGBoost with about six to seven predictors.
Future Impact
If the above methods are deployed in a real-world scenario, we can expect to predict churn correctly about 80% of the time (except for tree methods with the Leave One Out encoder), which is clearly better than a random guess. Since the dataset contains many predictors that do not contribute to the outcome, as several model analyses showed, it suffers from the curse of dimensionality. Also, not many interactions were observed during the analysis. If we try to reduce the model cost further, the model will tend to overfit, which would hurt it. Hence, it is suggested that more data be collected, since only about 7,000 records cannot guarantee that 80% accuracy is achievable in practice most of the time. If outliers related to this dataset were observed in the future, especially in tenure, then many of the techniques used here would suffer. Extreme cases aside, the company can benefit greatly when it is unsure whether a customer will churn: the firm can focus on these categories of customers to maintain its profits. Also, the model is trained on a dataset from one region, namely California, which may not match other regions; in general, the features of customers in one region do not all match those in another, so the model does not give a holistic view of churn prediction across the US. In those cases our model may yield lower accuracy, but it is still expected to give the firm some benefit in its churn prediction.