Churn Prediction

Author

Kushal Devgun

Published

April 11, 2025

Churn Prediction Dataset: Data Science Project

Understanding Meaningful Predictors for Churn Prediction

By: Kushal Devgun


Data Introduction

In this project, I analyzed the Telco Customer Churn dataset. It represents a sample telecommunications company offering different services to approximately 7,000 customers, and it includes 19 predictors describing the services each customer has subscribed to, with their values as follows:

  • Gender: Whether the customer is a male or a female
  • Senior Citizen Status: Whether the customer is a senior citizen or not (1, 0)
  • Partner: Whether the customer has a partner or not (Yes, No)
  • Dependents: Whether the customer has dependents or not (Yes, No)
  • Tenure: Number of months the customer has stayed with the company
  • Phone Service: Whether the customer has a phone service or not (Yes, No)
  • Multiple Lines: Whether the customer has multiple lines or not (Yes, No, No phone service)
  • Internet Service: Customer’s internet service provider (DSL, Fiber optic, No)
  • Online Security: Whether the customer has online security or not (Yes, No, No internet service)
  • Online Backup: Whether the customer has online backup or not (Yes, No, No internet service)
  • Device Protection: Whether the customer has device protection or not (Yes, No, No internet service)
  • Tech Support: Whether the customer has tech support or not (Yes, No, No internet service)
  • Streaming TV: Whether the customer has streaming TV or not (Yes, No, No internet service)
  • Streaming Movies: Whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract: The contract term of the customer (Month-to-month, One year, Two year)
  • Paperless Billing: Whether the customer has paperless billing or not (Yes, No)
  • Payment Method: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • Monthly Charges: The amount charged to the customer monthly
  • Total Charges: The total amount charged to the customer

The purpose of this dataset is to predict whether a customer will churn or not. Of the 19 predictors, tenure, Monthly Charges, and Total Charges are quantitative and the rest are qualitative.


Research Questions

There are several questions that need investigation:

  • Which of the predictors are correlated?
  • Which model captures the relationship well enough to generalize to future data?
  • Which predictors influence the outcome the most and should be used for future predictions?
  • Does encoding the categorical variables with different encoding techniques make the model more meaningful?
  • Can a tree-based method model the outcome well? If yes, what should the depth, number of trees, and other settings be?
  • Which validation method is trustworthy, and does the given model suffer from overfitting?
  • Are there any synergy (interaction) effects or other transformation effects in the dataset?

We explore the answers to all of these collectively in the sections below.

Data Visualization

Below are a view of the dataset and a summary of its columns.
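For reference, here is a minimal sketch of how this view and summary could be produced; the CSV file name is an assumption (the standard Kaggle download name).

# Minimal sketch; the file name below is an assumption (standard Kaggle name)
Telco.Churn.Dataset <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv",
                                stringsAsFactors = FALSE)

head(Telco.Churn.Dataset, 5)    # the five-customer view shown below
summary(Telco.Churn.Dataset)    # the per-column summary shown below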

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
Summary of the numeric columns (every other column is a character column of length 7,043):

            SeniorCitizen   tenure   MonthlyCharges   TotalCharges
  Min.         0.0000         0.00        18.25            18.8
  1st Qu.      0.0000         9.00        35.50           401.4
  Median       0.0000        29.00        70.35          1397.5
  Mean         0.1621        32.37        64.76          2283.3
  3rd Qu.      0.0000        55.00        89.85          3794.7
  Max.         1.0000        72.00       118.75          8684.8
  NA's              -            -            -              11

Below is a visualization of the correlation matrix along with some relationships between predictors. We can see that few of the predictors are strongly correlated.

Telco.Churn.Dataset.numeric$tenure <- as.numeric(Telco.Churn.Dataset.numeric$tenure)
options(repr.plot.width = 50, repr.plot.height = 30)
corr <-cor(Telco.Churn.Dataset.numeric[,-c(1,21)])
ggcorrplot(corr, hc.order = TRUE, lab = FALSE, colors = c("#6D9EC1", "white", "#cb1f1f"))

Next, we create a scatter (x-y) plot to compare Tenure and Total Charges.

ggplot(
    data = Telco.Churn.Dataset.numeric,
    mapping = aes(x = tenure, y = TotalCharges, color = Churn)) + 
    geom_point(size = 1) + 
    labs(title = "Tenure Vs Total Charges",x = "Tenure",y = "Total Charges")+
    scale_color_manual(values = c("0" = "#0072B2", "1" = "#D55E00")) +
    theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"), panel.background = element_rect(fill = "#f5eee1")) +
    scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 10)) +
    geom_smooth(color = "#f5af2e")

From the above plot, we see that churned customers decrease as tenure increases, which is consistent with the correlation matrix.

ggplot(
    data = Telco.Churn.Dataset.numeric, aes(x = tenure, fill = as.factor(tenure))) +
    geom_bar(color = "black") +
    labs(title = "Distribution of Tenure", x = "Tenure", y = "Count") +
    theme_minimal() + 
    scale_fill_manual(values = colrs)+
    theme(legend.position = "none") + 
    scale_x_continuous(breaks = seq(0, max(Telco.Churn.Dataset.numeric$tenure), by = 5)) + 
    theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f6f2f7")) 


Working Methodology


Stratifying the Dataset

Stratifying the data involves dividing a dataset into distinct subgroups (also called strata) based on a specific characteristic before performing analysis or training. This technique ensures that each subgroup is adequately represented, often reducing bias and improving statistical reliability. Stratification is particularly useful in machine learning: it gives the model a fair chance to train on each characteristic (or even each output class), and it ensures the training and test datasets maintain the same distribution of key variables, leading to a better model. In my case, I am stratifying the data based upon the output (Churn).

In my stratified sampling, I use 75% of the data for training and 25% for testing, while maintaining the churn ratio as shown below.

\[ \text{Ratio} = \frac{N_{\text{churned}}}{N_{\text{retained}}} = \frac{1869}{5163} \approx 0.362 \]

Below is the code for this, where the function ‘stratify.data.train.indicies’ stratifies the data based on the training split ratio.

customers.left <- vector(mode = "numeric", length = 0)
customers.detained <- vector(mode = "numeric", length = 0)

for (index in seq_along(Churn)){
    if (Churn[index] == 'Yes')
        customers.left <- c(customers.left, index)
    else
        customers.detained <- c(customers.detained, index)
}

cat("Total number of customers Left:", length(customers.left)," and ", "Total number of customers detained:", length(customers.detained))
Total number of customers Left: 1869  and  Total number of customers detained: 5163
stratify.data.train.indicies <- function(data.train.ratio, ...){
    # floor() guards against non-integer sample sizes from the ratio
    number.of.left.train.samples <- floor(data.train.ratio * length(customers.left))
    number.of.detained.train.samples <- floor(data.train.ratio * length(customers.detained))
    
    # Sample the training indices from each stratum separately
    total.train.samples.indicies <- c(
        sample(customers.left, number.of.left.train.samples),
        sample(customers.detained, number.of.detained.train.samples)
    )
    
    return (total.train.samples.indicies)
}
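
A short usage sketch follows: draw a 75% stratified split and check that the churn proportion is preserved in both halves. The seed value and the derived data-frame names are assumptions.

set.seed(42)   # reproducibility; the actual seed used is not shown (assumption)
train.indicies <- stratify.data.train.indicies(0.75)

# The complement of the training indices forms the test set
train.data <- Telco.Churn.Dataset[train.indicies, ]
test.data  <- Telco.Churn.Dataset[-train.indicies, ]

# Both splits should show roughly the same churn proportion (~26.5%)
mean(train.data$Churn == "Yes")
mean(test.data$Churn == "Yes")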

Using Different Encoders:

Since most of the variables involved are categorical, I represent them using some of the most commonly used categorical encoders for a financial dataset, namely:

  • Bayesian Target Encoding
  • Weight of Evidence
  • Leave One Out Encoding

For Bayesian Target encoding, I am writing the encoder myself, since I didn't find a suitable R package for it. Mathematically, the formula for the Bayesian Target encoder is:

\[ E(x \mid c) = \frac{N_c \, \bar{y}_c + m \, \bar{y}}{N_c + m} \]

where \( E(x \mid c) \) is the encoded value for category \( c \), \( N_c \) is the number of occurrences of category \( c \) in the dataset, \( \bar{y}_c \) is the mean target value for category \( c \), \( \bar{y} \) is the global mean of the target variable across all categories, and \( m \) is a smoothing factor (a hyperparameter). In my case, I take the smoothing factor to be 2.

Below is the code for the Bayesian Target encoder, where the function “bayesian_encoding()” takes a single categorical column at a time and encodes it using the target, Churn.

bayesian_encoding <- function(data, category, target, column.index){ #bayesian.mean.target.encoding
    
    #Weighted Mean = (n * Option Mean + m * Overall Mean)/(n + m)
    #Option Mean = No of instances of (Yes/1/positive) in a particular category / Total instances of that category
    #Overall Mean = Total instances of (Yes/1/positive) / Total instances
    #n = Weight of Option Mean (number of rows of the category)
    #m = Weight of Overall Mean (user-defined smoothing factor)
  
    total.rows <- length(category)
    names.of.categories <- unique(category)
    num.categories <- length(names.of.categories)
    categories.count <- rep(0, num.categories)
    categories.count.yes <- rep(0, num.categories)
    total.count.yes <- sum(target == 'Yes')
    
    #m = Weight of Overall Mean (smoothing factor)
    m <- 2
    
    for (index in seq_along(names.of.categories)){
        categories.count[index] <- sum(category == names.of.categories[index])
        categories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')

        #n = Weight of Option Mean (number of rows of the category)
        n <- categories.count[index]
        
        #Option Mean = positives in this category / rows in this category
        option.mean <- categories.count.yes[index]/categories.count[index]

        #Overall Mean = total positives / total rows
        overall.mean <- total.count.yes/total.rows

        #Weighted Mean; note the parentheses around (n + m) in the denominator
        weighted.mean <- (n * option.mean + m * overall.mean)/(n + m)
        
        data[, column.index] <- replace(data[, column.index], category == names.of.categories[index], weighted.mean)
    }
    
    return (data)
}
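
As a usage sketch, here is how the Contract column could be encoded against Churn; the result object name is an assumption, and the column position is looked up by name rather than hard-coded.

# Usage sketch: encode Contract in place using Churn as the target
contract.col <- which(colnames(Telco.Churn.Dataset) == "Contract")

Telco.Bayesian <- bayesian_encoding(
    data         = Telco.Churn.Dataset,
    category     = Telco.Churn.Dataset[, contract.col],
    target       = Telco.Churn.Dataset$Churn,
    column.index = contract.col
)

head(Telco.Bayesian$Contract)   # the column now holds smoothed target means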

Next is Weight of Evidence (WOE), a statistical technique used primarily in credit scoring and binary classification to measure the predictive power of an independent variable with respect to a binary target. Mathematically, it is defined as the logarithm of the ratio between the percentage of goods (positives) and bads (negatives) in each bin:

\[ a = \frac{\text{positives in the group}}{\text{total positives}} \]

\[ b = \frac{\text{negatives in the group}}{\text{total negatives}} \]

\[ \mathrm{WOE} = \log\!\left( \frac{a}{b} \right) \]

And the code for this is shown below, where the function “woe_encoding” takes a single categorical column at a time and encodes it using the target, Churn.

woe_encoding <- function(data, category, target, column.index){
    
    #WOE
    # a = (positives | Yes | 1 of the group) / (total positives)
    # b = (negatives | No | 0 of the group) / (total negatives)
    # WOE = log(a/b)
  
    total.rows <- length(category)
    num.categories <- length(unique(category))
    names.of.categories <- unique(category)
    categories.count <- rep(0, num.categories)
    categories.count.yes <- rep(0, num.categories)
    categories.count.no <- rep(0, num.categories)
    
    total.count.yes <- sum(target == 'Yes')
    total.count.no <- sum(target == 'No')

    for (index in seq_along(names.of.categories)){
        
        categories.count.yes[index] <- sum(target[category == names.of.categories[index]] == 'Yes')
        categories.count.no[index] <- sum(target[category == names.of.categories[index]] == 'No')

        a = categories.count.yes[index] / total.count.yes
        b = categories.count.no[index] / total.count.no
        woe <- log(a/b)
        #cat('WOE:', woe,'\n')
        data[, column.index:column.index] <- replace(data[, column.index:column.index], category == names.of.categories[index], woe)
    }
    return (data)
}

Finally, Leave One Out (LOO) encoding helps reduce target leakage by excluding the current row from its own encoding. Mathematically, it is described as:

\[ \hat{x}_i = \frac{\sum_{j \neq i} y_j \, \mathbb{1}(x_j = c)}{\sum_{j \neq i} \mathbb{1}(x_j = c)} \]

where \( \hat{x}_i \) is the encoded value for row \( i \), \( y_j \) is the target value for row \( j \), \( x_j \) is the category of row \( j \), \( c \) is the category being encoded, and \( \mathbb{1}(x_j = c) \) is an indicator function that equals 1 if \( x_j = c \) and 0 otherwise.

And the code for this is shown below, where the function “loo_encoding” takes a single categorical column at a time and applies leave-one-out encoding to it.

loo_encoding <- function(data, category_col, target_col) {
    data_encoded <- data
    encoded_values <- numeric(nrow(data))

    # Loop over each row and apply LOO encoding
    for (i in 1:nrow(data)) {
        subset_data <- data[-i, ]

        # Mean of the target over the same category, excluding the current row
        mean_value <- mean(subset_data[subset_data[[category_col]] == data[i, category_col], target_col], na.rm = TRUE)
        encoded_values[i] <- mean_value
    }

    data_encoded[[paste0("loo_", category_col)]] <- encoded_values

    # Drop the original categorical column
    data_encoded[[category_col]] <- NULL

    return(data_encoded)
}
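
A usage sketch follows. Since loo_encoding averages a numeric target and takes column names, Churn is first converted to 0/1; the helper column and object names are assumptions.

# Usage sketch: loo_encoding needs a numeric target, so convert Churn to 0/1
Telco.loo <- Telco.Churn.Dataset
Telco.loo$ChurnNumeric <- as.numeric(Telco.loo$Churn == "Yes")

Telco.loo <- loo_encoding(Telco.loo, "Contract", "ChurnNumeric")

head(Telco.loo$loo_Contract)   # per-row leave-one-out target means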

Modelling:

  • Logistic Regression (with different encodings)
  • KNN (with different k values and different encodings)
  • QDA, LDA, Naive Bayes
  • Random Forest and Bagging, Boosting, XGBoost, AdaBoost, SVM (with different encodings and different numbers and depths of trees)

With simple Logistic Regression, we observed significant changes in the z-statistics and p-values for ‘Payment Method’, ‘Phone Service’, and ‘Tech Support’ when the model was trained with all the predictors in every case.

Predictor        Encoding   p-value
Payment Method   N/A        0.000811 (lowest among all)
Payment Method   Bayesian   1.78e-06
Payment Method   WOE        9.87e-07
Payment Method   LOO        7.47e-06
Phone Service    N/A        0.43 (lowest among two)
Phone Service    Bayesian   2.21e-06
Phone Service    WOE        1.61e-06
Phone Service    LOO        1.64e-11
Tech Support     N/A        0.798317
Tech Support     Bayesian   5.77e-05
Tech Support     WOE        6.02e-05
Tech Support     LOO        2.04e-05

Some other predictors also showed greater significance with the different encodings.

Logistic Regression accuracies, ROC Curve and AUC scores.

  • We observe different accuracies for the different encodings when using all the predictors, varying slightly with the random seed across encodings.
  • It is important to note that, regardless of the encoding, the ROC curves and AUC scores were the same for all of them.
  • Also, the slightly lower AIC and BIC scores for the WOE and LOO encoders suggest that these encoders might be good candidates for this dataset; a sketch of one such fit follows.
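
For reference, here is a minimal sketch of how each logistic fit and its metrics could be produced; the train/test object names are assumptions, with one encoded dataset swapped in per run.

# Minimal sketch for one encoding; X.Y.train, X.Y.test, Y.test are assumptions
logistic.fit <- glm(Churn ~ ., data = X.Y.train, family = binomial)

probs <- predict(logistic.fit, newdata = X.Y.test, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

cat("Accuracy:", round(mean(preds == Y.test) * 100, 3), "%\n")
cat("AIC:", AIC(logistic.fit), " BIC:", BIC(logistic.fit))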

Simple Logistic Regression

                       True Value: Not Churned Churned
Predicted (by Simple):                                
Not Churned                               1154     202
Churned                                    137     266
Accuracy for Simple Logistic Regression: 80.728 % 
Akaike Information Criterion(AIC) for the simple logistic regression is 4423.434 
Bayesian Information Criterion(BIC) for the simple logistic regression is 4581.123

Bayesian Target Encoding

                         True Value: Not Churned Churned
Predicted (by Bayesian):                                
Not Churned                                 1148     202
Churned                                      143     266
Accuracy for Bayesian Encoding: 80.387 % 
Akaike Information Criterion(AIC) for the Bayesian Encoded logistic regression is 4423.48 
Bayesian Information Criterion(BIC) for the Bayesian Encoded logistic regression is 4554.887

Weight of Evidence

                    True Value: Not Churned Churned
Predicted (by WOE):                                
Not Churned                            1150     203
Churned                                 141     265
Accuracy for WOE encoding: 80.44343 % 
Akaike Information Criterion(AIC) for the WOE Encoded logistic regression is 4418.805 
Bayesian Information Criterion(BIC) for the WOE Encoded logistic regression is 4550.212

Leave One Out Encoding

                    True Value: Not Churned Churned
Predicted (by LOO):                                
Not Churned                            1154     198
Churned                                 137     270
Accuracy for Leave One Out: 80.955 % 
Akaike Information Criterion(AIC) for the LOO Encoded logistic regression is 4418.805 
Bayesian Information Criterion(BIC) for the LOO Encoded logistic regression is 4550.212
library(ROCR)   # prediction() and performance() come from ROCR

pred <- prediction(logistic.regression.train.fit.prob, Y.test)

# Calculate performance (True Positive Rate and False Positive Rate)
perf <- performance(pred, "tpr", "fpr")

plot(perf, col = "darkgreen", lwd = 2, main = "ROC Curve")
abline(a = 0, b = 1, col = "red", lty = 2)
auc_perf <- performance(pred, measure = "auc")
auc_value <- auc_perf@y.values[[1]]
grid(col = "lightgray", lty = "dotted")
legend(
  "bottomright",
  legend = paste("AUC =", round(auc_value,3)),
  col = "#1f77b4",
  lwd = 2.5,
  bty = "n",
  cex = 1.1
)

cat("AUC Score for all:\033[34m", round(auc_value, 3), "\033[0m")
AUC Score for all: 0.847 

KNN accuracies, best k-value, and the trend of accuracy with k-values

  • We observe that Leave One Out encoding tends to give slightly better accuracy almost every time compared to the rest.
  • Nearly all the encoders give their maximum accuracy for k-values ranging between 9 and 18.
  • Interestingly, for Bayesian and Weight of Evidence encoding, the maximum accuracy is sometimes observed at k = 71 and k = 72 respectively, which is quite surprising!
Best Accuracy in Simple KNN 79.193 % with a k-value of 16 
Best Accuracy of KNN in Bayesian Target Encoder 79.022 % with a k-value of 17 
Best Accuracy of KNN in Weight of Evidence 79.022 % with a k-value of 17 
Best Accuracy of KNN in Leave One Out Encoding 80.728 % with a k-value of 15
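A sketch of how the k sweep behind these numbers could be run follows, assuming class::knn, numeric (encoded) train/test matrices with the names below, and a 1 to 100 sweep range; the resulting accuracy vectors feed the data frame that comes next.

library(class)   # knn()

k.values     <- 1:100
k.accuracies <- rep(0, length(k.values))

for (index in seq_along(k.values)) {
    # X.train.knn / X.test.knn: numeric predictor matrices (assumed names);
    # Y.train.knn / Y.test.knn: class labels for each split (assumed names)
    knn.pred <- knn(train = X.train.knn, test = X.test.knn,
                    cl = Y.train.knn, k = k.values[index])
    k.accuracies[index] <- mean(knn.pred == Y.test.knn) * 100
}

cat("Best Accuracy in Simple KNN", max(k.accuracies),
    "% with a k-value of", k.values[which.max(k.accuracies)])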
knn.data <- data.frame(
  k = k.values,
  KNN.Simple.Accuracies = k.accuracies,
  KNN.Bayesian.accuracies = k.bayesian.accuracies,
  KNN.WOE.accuracies = k.woe.accuracies,
  KNN.LOO.accuracies = k.loo.accuracies
)



knn.data.all <- knn.data %>%
  pivot_longer(
    cols = -k,
    names_to = "Method",
    values_to = "Accuracy"
  )


ggplot(knn.data.all, aes(x = k, y = Accuracy, color = Method)) +
  geom_line(size = 0.5) +
  geom_point(size = 1) +
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
  labs(
    title = "K-Values vs. Accuracy for Different Encodings",
    x = "K-Value",
    y = "Accuracy"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.background = element_rect(fill = "#f6f2f7"),panel.grid = element_blank()) +
  scale_color_manual(values = c("KNN.Simple.Accuracies" = "#38d6e3", "KNN.Bayesian.accuracies" = "#b0d83e", "KNN.WOE.accuracies" = "#E2D947", "KNN.LOO.accuracies" ="#D55E00" )) +
 theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))

Analysis with Naive Bayes With Different Encoders

  • We observe that the accuracies and AUC scores differ slightly across the encoders.
  • Similarly, we see different 95% confidence intervals, McNemar's test p-values, and Kappa values for each encoder.
  • Among them all, WOE encoding has the best sensitivity, meaning it catches more of the positive class.
  • All models have Kappa scores in the 0.37–0.41 range, with simple Naive Bayes best at 0.4128.
  • Also, simple Naive Bayes gives the best overall balance; one such fit is sketched below.
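
A minimal sketch of one Naive Bayes fit, assuming e1071 for the model and caret for the statistics; the train/test object names are assumptions.

library(e1071)   # naiveBayes()
library(caret)   # confusionMatrix()

nb.fit   <- naiveBayes(Churn ~ ., data = X.Y.train)
nb.preds <- predict(nb.fit, newdata = X.Y.test)

# positive = "Yes" matches the 'Positive' Class line in the output below
confusionMatrix(nb.preds, as.factor(X.Y.test$Churn), positive = "Yes")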

For Simple Naive Bayes:

Confusion Matrix and Statistics

         Actual
Predicted  No Yes
      No  886  86
      Yes 405 382
                                          
               Accuracy : 0.7209          
                 95% CI : (0.6993, 0.7417)
    No Information Rate : 0.7339          
    P-Value [Acc > NIR] : 0.8972          
                                          
                  Kappa : 0.4128          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8162          
            Specificity : 0.6863          
         Pos Pred Value : 0.4854          
         Neg Pred Value : 0.9115          
             Prevalence : 0.2661          
         Detection Rate : 0.2172          
   Detection Prevalence : 0.4474          
      Balanced Accuracy : 0.7513          
                                          
       'Positive' Class : Yes             
                                          

For Bayesian encoding:

Confusion Matrix and Statistics

         actual
predicted   0   1
        0 852  80
        1 439 388
                                         
               Accuracy : 0.7049         
                 95% CI : (0.683, 0.7262)
    No Information Rate : 0.7339         
    P-Value [Acc > NIR] : 0.997          
                                         
                  Kappa : 0.3929         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8291         
            Specificity : 0.6600         
         Pos Pred Value : 0.4692         
         Neg Pred Value : 0.9142         
             Prevalence : 0.2661         
         Detection Rate : 0.2206         
   Detection Prevalence : 0.4702         
      Balanced Accuracy : 0.7445         
                                         
       'Positive' Class : 1              
                                         

For WOE encoding:

Confusion Matrix and Statistics

         actual
predicted   0   1
        0 810  70
        1 481 398
                                          
               Accuracy : 0.6868          
                 95% CI : (0.6645, 0.7084)
    No Information Rate : 0.7339          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3733          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8504          
            Specificity : 0.6274          
         Pos Pred Value : 0.4528          
         Neg Pred Value : 0.9205          
             Prevalence : 0.2661          
         Detection Rate : 0.2263          
   Detection Prevalence : 0.4997          
      Balanced Accuracy : 0.7389          
                                          
       'Positive' Class : 1               
                                          

For LOO encoding:

Confusion Matrix and Statistics

         actual
predicted   0   1
        0 849  79
        1 442 389
                                          
               Accuracy : 0.7038          
                 95% CI : (0.6819, 0.7251)
    No Information Rate : 0.7339          
    P-Value [Acc > NIR] : 0.9979          
                                          
                  Kappa : 0.3919          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8312          
            Specificity : 0.6576          
         Pos Pred Value : 0.4681          
         Neg Pred Value : 0.9149          
             Prevalence : 0.2661          
         Detection Rate : 0.2211          
   Detection Prevalence : 0.4724          
      Balanced Accuracy : 0.7444          
                                          
       'Positive' Class : 1               
                                          
plot(
    ROCurve4,
    response = Y.test.naive.bayes.loo, predictor = as.numeric(naive.bayes.loo.class),
    plot = TRUE, legacy.axes = TRUE,
    xlim = c(1, 0), ylim = c(0, 1),    
    col = "#008D10", lwd = 1,
    main = "ROC Curve for Naive Bayes",
    axes = FALSE
)

axis(1, at = seq(0, 1, by = 0.2), labels = seq(1, 0, by = -0.2))  # Specificity axis
axis(2, at = seq(0, 1, by = 0.2), labels = seq(0, 1, by = 0.2), pos = 1)  # Sensitivity axis
lines(ROCurve1, col = "#FE0D2D", lwd = 1)
lines(ROCurve2, col = "#46CCEA", lwd = 1)
lines(ROCurve3, col = "#C1D634", lwd = 1) 
par(bg = "#e2f5ec")  
legend("right", legend = c("LOO", "Simple", "Bayesian", "WOE"), col = c("#008D10", "#FE0D2D", "#46CCEA","#C1D634"), lwd = 2)

The AUC scores observed for Naive Bayes were 0.751 for the simple version, 0.745 for Bayesian encoding, 0.739 for WOE encoding, and 0.744 for LOO encoding.

Analysis for LDA and QDA

For LDA with p = 1, we take tenure, since it was the most statistically significant predictor in all previous models.
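
A minimal sketch of this one-predictor fit, assuming MASS::lda and the stratified train/test objects (the object names are assumptions):

library(MASS)   # lda()

# LDA with a single predictor (p = 1): tenure only
lda.fit    <- lda(Churn ~ tenure, data = X.Y.train)
lda.pred   <- predict(lda.fit, newdata = X.Y.test)
Y.test.lda <- X.Y.test$Churn

table(lda.pred$class, Y.test.lda)   # confusion matrix, as shown below
cat("Accuracy:", round(mean(lda.pred$class == Y.test.lda) * 100, 3), "%")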

     Y.test.lda
        No  Yes
  No  1291  468
  Yes    0    0
The accuracy for the LDA with p = 1 fit is 72.086 %
  • It is important to note that tenure alone gives almost a 30% error rate, largely because it does not align closely with a normal distribution, yet it still delivers about 72% accuracy.
  • It is also important to note that LDA predicted every test observation as 'No': it captured all the true negatives while missing every churner. The distribution in question is shown below, where the dashed line is a normal density with mean 5 and SD 16.
ggplot(data.frame(tenure), aes(x = tenure)) +
  geom_density(fill = "#f2bfb9", alpha = 0.4) + 
  xlim(-40, max(tenure)) +
  ylim(0,0.05) +
  stat_function(fun = dnorm, args = list(mean = 5, sd = 16), color = "red", linetype = "dashed", size = 1.2) +
  labs(title = "Probability Distribution for Tenure",x = "Tenure",y = "Probability Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())

For LDA with p = 2, we take tenure and Contract, since they turned out to be the most statistically significant.

  • The probability distribution of Contract is shown below, where the dashed line is a normal density with mean 0 and SD 0.25.
ggplot(data.frame(Telco.Churn.Dataset.numeric$Contract), aes(x = Telco.Churn.Dataset.numeric$Contract)) +
  geom_density(fill = "#b4cef5", alpha = 0.4) + 
  xlim(-10,10) +
  ylim(0,2) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 0.25), color = "blue", linetype = "dashed", size = 1.2) +
  labs(title = "Probability Distribution for Contract",x = "Contract",y = "Probability Density") +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), panel.grid = element_blank())

So we state our assumption: by combining these two predictors, we may expect a joint probability density close to a multivariate normal distribution.

     Y.test.lda
        No  Yes
  No  1209  348
  Yes   82  120
The accuracy for the LDA with p = 2 (using tenure and Contract) fit is 72.086 %

Result: The accuracy did not improve, so there is some evidence that our assumption was wrong: two predictors that are each roughly normal are not guaranteed to give a multivariate normal joint distribution. Moreover, the two predictors are highly correlated (0.68), so an absolutely bell-shaped joint density was not possible.

Analysis of Random Forest with Encoders

Confusion matrix and error rate with 12 predictors, using no encoder, and 500 trees in the forest:

                              True Value: Not Churned Churned
Predicted (by Random Forest):                                
Not Churned                                      1140     238
Churned                                           151     230
Error Rate for Random Forest is: 22.11 % and maximum Out-of-Bag Error is 0.551
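
A minimal sketch of the fit behind this matrix, assuming the randomForest package and the stratified train/test frames named as in the later code:

library(randomForest)

# 12 predictors tried at each split (mtry), 500 trees, as stated above
rf.fit  <- randomForest(Churn ~ ., data = Random.Forest.X.Y.train,
                        mtry = 12, ntree = 500, importance = TRUE)
rf.pred <- predict(rf.fit, newdata = Random.Forest.X.Y.test)
conf    <- table(rf.pred, Random.Forest.X.Y.test$Churn)

cat("Error Rate:", round((1 - sum(diag(conf)) / sum(conf)) * 100, 2), "%\n")
max(rf.fit$err.rate[, "OOB"])   # maximum Out-of-Bag error over the run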

We note that the error rate is approximately 22% with 12 predictors. Now we will analyse the testing accuracies with the encoders, taking different numbers of predictors.

ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
  geom_line(size = 0.5) +
  geom_point(size = 0.5) +
  #ylim(75,100) + 
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
  labs(
    title = "Number of Predictors vs. RF Accuracy for Different Encodings",
    x = "Predictors",
    y = "Accuracy %"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947", "RF.LOO.accuracies" ="#D55E00"))+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))+
  theme(legend.background = element_rect(fill = "#e2f5ec", color = NA))

Plot without LOO encoder

ggplot(RD.data.all, aes(x = Predictors, y = Accuracy, color = Encodings)) +
  geom_line(size = 0.5) +
  geom_point(size = 0.5) +
  ylim(76,81) + 
  #geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
  labs(
    title = "Number of Predictors vs. RF Accuracy for Different Encodings",
    x = "Predictors",
    y = "Accuracy %"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("RF.Simple.Accuracies" = "#38d6e3", "RF.Bayesian.accuracies" = "#b0d83e", "RF.WOE.accuracies" = "#E2D947"))+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
  theme(legend.background = element_rect(fill = "#f7efed", color = NA))

  • We observe that the accuracy remains roughly constant after reaching its maximum at 3 or 4 predictors for all the encoders except the Leave One Out encoder.
  • The reason Leave One Out shows 100% accuracy is that its encoded values split the two outcomes perfectly; for the Senior Citizen predictor, the churned customers fall entirely on two encoded values. The same phenomenon was observed for other predictors, so each alone can contribute 100% accuracy. Below is the evaluation of why this happens.
sum(
  Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.2364 &
  Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
[1] 350
sum(
  Random.Forest.X.loo.encoding.Y.test$loo_SeniorCitizen == 0.4163 &
  Random.Forest.X.loo.encoding.Y.test$Churn == "1"
)
[1] 118
sum(Random.Forest.X.loo.encoding.Y.test$Churn == "1")
[1] 468

This questions the credibility of the dataset, because it rarely turns out that a single encoder achieves 100% accuracy, even though the dataset was downloaded from Kaggle and was contributed by the well-known tech firm IBM. Thus, for further analysis we omit the LOO encoding.

Simple Random Forest with Number of Trees

Next, we analysed the accuracy of a simple Random Forest with respect to the number of trees, from 1 to 7000, taking 4 predictors, since that turned out to be the most accurate/stable number of predictors in our analysis above. Since we do not have a GPU, we traverse from 1 to 7000 in jumps of 100.

rf.number.of.trees <- seq(1, 7000, by = 100)
rf.trees.accuracies <- rep(0, length(rf.number.of.trees))

for (index in seq_along(rf.number.of.trees)){
    # ntree must come from rf.number.of.trees, not the loop counter itself
    rf.model.fit <- randomForest(Churn ~ ., data = Random.Forest.X.Y.train, mtry = 4, importance = TRUE, ntree = rf.number.of.trees[index])
    predictions <- predict(rf.model.fit , newdata = Random.Forest.X.Y.test)
    confusion_matrix <- table(predictions, Random.Forest.X.Y.test$Churn)
    accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
    rf.trees.accuracies[index] <- accuracy
}

rf.trees.accuracies <- rf.trees.accuracies * 100

rf.trees.data <- data.frame(ntree = rf.number.of.trees, accuracy = rf.trees.accuracies)
# Create the plot
ggplot(rf.trees.data, aes(x = ntree, y = accuracy)) +
  #ylim(68,82) + 
  geom_line(color = "#38d6e3") +  # Line plot
  geom_point(color = "#b0d83e") +  # Points on the plot
  labs(title = "Random Forest Accuracy vs Number of Trees", x = "Number of Trees", y = "Test Accuracy %") +
  theme_minimal() +  # Aesthetic theme 
  geom_smooth(color = "#38d6e3") + 
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))

  • We can see from the above graph that the accuracy increases with the number of trees, with some fluctuations, and after about 5000 trees it becomes almost steady.

  • Hence a Random Forest with four or five predictors and around 1500 trees looks sufficient.

10-fold Cross-Validation

  • We next perform 10-fold cross-validation on the Random Forest to check the number of predictors required.
X.dataset.Cross.Validation <- Telco.Churn.Dataset[,-c(1)]
X.dataset.Cross.Validation$Churn <- as.factor(X.dataset.Cross.Validation$Churn)

train_control <- trainControl(method = "cv", number = 10)

rf_cv <- train(Churn ~ tenure + PaymentMethod + PaperlessBilling + TotalCharges + PhoneService + Contract, data = X.dataset.Cross.Validation, method = "rf", trControl = train_control)
print(rf_cv)
Random Forest 

7032 samples
   6 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 6328, 6329, 6329, 6329, 6328, 6330, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.7780170  0.3470699
  5     0.7792956  0.4046720
  9     0.7555496  0.3454467

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
ggplot(rf_cv$results, aes(x = mtry, y = Accuracy)) +
  geom_line(color = "#1f77b4", size = 1) +
  geom_point(size = 2, color = "#ff7f0e") +
  geom_errorbar(aes(ymin = Accuracy - 1 * AccuracySD, ymax = Accuracy + 1 * AccuracySD),
                width = 0.2, color = "#1f77b4", alpha = 0.6) +
  labs(
    title = "Cross-Validated Accuracy vs. mtry (Random Forest)",
    x = "Number of Predictors Sampled (mtry)",
    y = "Accuracy"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))

  • Hence, with mtry = 5 we can choose the Random Forest model.

XGBoost with Encoders

  • We next analysed XGBoost with different encoders and different depths (1 up to 100), using all the predictors.

But before that, we trained a simple XGBoost model to get an estimate of the accuracy, as sketched below. Here we consider a maximum tree depth of 6 and 100 boosting rounds.
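
A minimal sketch of this baseline, assuming the xgboost package and numeric design matrices named as in the later code (built, for example, with data.matrix() from the encoded frames); the eta value is an assumption chosen to match the later runs.

library(xgboost)

# Baseline fit: depth 6 and 100 rounds, as stated above
xgb.baseline <- xgboost(
  data = X.train.xg.matrix,
  label = Y.train.xg.matrix,
  max_depth = 6,
  eta = 0.1,                       # learning rate (assumed)
  nrounds = 100,
  objective = "binary:logistic",
  eval_metric = "logloss",
  verbose = 0
)

xgb.prob  <- predict(xgb.baseline, X.test.xg.matrix)
xgb.class <- ifelse(xgb.prob > 0.5, 1, 0)
cat("Accuracy:", mean(xgb.class == Y.test.xg.matrix))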

                       True Values Not Churned Churned
Predicted (by XGBoost)                                
Not Churned                               1149     231
Churned                                    142     237
Accuracy: 0.7879477 
  • Now, we analyse different depths with the number of rounds fixed at 100.
xgboost.accuracies <- xgboost.accuracies * 100
xgboost.bayesian.accuracies <- xgboost.bayesian.accuracies * 100
xgboost.woe.accuracies <- xgboost.woe.accuracies * 100
ggplot(XGBoost.data.all, aes(x = Depth, y = Accuracy, color = Encodings)) +
  geom_line(size = 0.5) +
  geom_point(size = 0.5) +
  #coord_cartesian(ylim = c(65, 82))  + 
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed") +
 labs(title = "XGBoost Accuracy vs. Depth of Tree", x = "Depth", y = "Test Accuracy %") +
  theme_minimal() +
  scale_color_manual(values = c("XGBoost.Simple.Accuracies" = "#38d6e3", "XGBoost.Bayesian.accuracies" = "#b0d83e", "XGBoost.WOE.accuracies" = "#E2D947"))+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))+
  theme(legend.background = element_rect(fill = "#ecf8f6", color = NA))

  • Now we analyse the number of rounds using 10-fold cross-validation, with depth 55 (where the test accuracy reaches equilibrium).
cv <- xgb.cv(
  data = X.train.xg.matrix,
  label = Y.train.xg.matrix,
  max_depth = 55,
  eta = 0.1,
  nrounds = 500,
  nfold = 10,
  objective = "binary:logistic",
  eval_metric = "logloss",
  early_stopping_rounds = 10,
  verbose = 0
)

best.round <- cv$best_iteration
cat("Best round turns out to be",best.round,"th")
Best round turns out to be 24 th
  • From the above plots we can see that XGBoost does not perform well with Weight of Evidence and shows a lot of fluctuation.
  • All the XGBoost encodings tend to reach an equilibrium once the depth goes beyond approximately 55.
  • Also, the simpler version of XGBoost was the quickest to converge, followed by the XGBoost WOE model.
  • Also, across different depth values we observed the best round to lie between 20 and 40.

Boosting

  • We next analyse boosting with different numbers of trees and interaction depths.

Before that, we do a quick analysis, taking the interaction depth as 4 and the number of trees as 1500, with the different encoders.

Summary, Relative Importance and Partial Dependence Plots with Simple Boosting
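
The relative influence table below could come from a fit along these lines: a sketch assuming the gbm package, a 0/1 numeric Churn in the training frame, and an assumed shrinkage value.

library(gbm)

# Bernoulli boosting needs a numeric 0/1 response; depth 4 and 1500 trees
# as stated above, shrinkage is an assumption
boost.model.fit <- gbm(Churn ~ ., data = X.Y.train.boost,
                       distribution = "bernoulli",
                       n.trees = 1500,
                       interaction.depth = 4,
                       shrinkage = 0.01)

# Relative influence of each predictor, printed as the table below
summary(boost.model.fit, plotit = FALSE)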

var                 rel.inf
TotalCharges       29.1053032
MonthlyCharges     24.4155948
tenure             12.2500672
Contract           11.5236500
OnlineSecurity      3.4202746
PaymentMethod       2.9221656
TechSupport         2.7904124
gender              1.7202024
PaperlessBilling    1.6276123
OnlineBackup        1.5575649
SeniorCitizen       1.4474142
DeviceProtection    1.2657916
StreamingTV         1.0762316
MultipleLines       1.0420367
Dependents          0.9315417
StreamingMovies     0.9200107
InternetService     0.7660315
Partner             0.6786471
PhoneService        0.5394474
colnames(importance_df) <- c("Predictor", "RelativeInfluence")
ggplot(importance_df, aes(x = reorder(Predictor, RelativeInfluence), y = RelativeInfluence)) +
  geom_bar(stat = "identity", fill = "#e8532e") +
  coord_flip() +
  labs(title = "Predictor Importance in Simple GBM (Boosting)",
       x = "Predictor",
       y = "Relative Influence %") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f7efed"))

  • From the above plot, we observe that Total Charges and Monthly Charges, which are numerical predictors, have much higher influence than tenure.
  • Similarly, Payment Method, Online Security, and Tech Support are highly influential alongside Contract.
  • So we can select six main predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security.

Also, the partial dependence of Churn on Total Charges, Monthly Charges, and tenure is shown below.

pdp.totalCharges <- plot(boost.model.fit, i.var = "TotalCharges", return.grid = TRUE)

ggplot(pdp.totalCharges, aes(x = TotalCharges, y = y)) +
  geom_line(color = "#38d6e3", size = 1.2) +
  geom_point(color = "#0073c2ff", size = 1.5) +
  labs(title = "Partial Dependence of Churn on Total Charges", x = "Total Charges", y = "Partial Dependence (log-odds)") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title = element_text(face = "bold", size = 12),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank()
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#ecf8f6"))

pdp.tenure<- plot(boost.model.fit , i = "tenure", return.grid = TRUE)

ggplot(pdp.tenure, aes(x = tenure, y = y)) +
  geom_line(color = "#9ff09b", size = 1.2) +
  geom_point(color = "#2aca22", size = 1.5) +
  labs(title = "Partial Dependence of Churn on Tenure", x = "Tenure", y = "Partial Dependence (log-odds)") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title = element_text(face = "bold", size = 12),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank()
  )+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))

pdp.MonthlyCharges <- plot(boost.model.fit, i.var = "MonthlyCharges", return.grid = TRUE)

ggplot(pdp.MonthlyCharges, aes(x = MonthlyCharges, y = y)) +
  geom_line(color = "#eacd92", size = 1.2) +
  geom_point(color = "#d59f0e", size = 1.5) +
  labs(title = "Partial Dependence of Churn on Monthly Charges", x = "Monthly Charges", y = "Partial Dependence (log-odds)") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title = element_text(face = "bold", size = 12),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank()
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f9f9ea"))

Summary of Boosting with other Encoders

  • Interestingly, we observed the same feature importances and PDPs in Boosting regardless of the encoding we used!

Boosting Test Accuracy with the number of trees

  • Next we analyse boosting with a maximum of 5000 trees, stepping by 10 from 1 to 5000.
# Create the plot
ggplot(boosting.data, aes(x = num.trees, y = boost.accuracies)) +
  geom_line(color = "#38d6e3") +  # Line plot
  geom_point(color = "#b0d83e") +  # Points on the plot
  labs(title = "Boosting Accuracy vs. Number of Trees", x = "Trees", y = "Accuracy") +
  theme_minimal() +  # Aesthetic theme 
  geom_smooth(color = "#38d6e3")+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))

  • From the above plot we can say that around 100 to 500 trees would be a good interval, because accuracy is still increasing there.

ADABoost

  • We next analyse AdaBoost, which focuses more on the hard-to-classify examples. We will vary the number of weak learners (typically trees) trained in AdaBoost, and we use all the predictors in this case.

First, we fit the model with 50 weak learners using the different encoders. This gives the confusion matrices and accuracies shown below.

                        True Values Not Churned Churned
Predicted (by ADABoost)                                
Not Churned                                1145     220
Churned                                     146     248
[1] "Accuracy: 79.19 %"
                                 True Values Not Churned Churned
Predicted (by ADABoost Bayesian)                                
Not Churned                                         1142     217
Churned                                              149     251
[1] "Accuracy: 79.19 %"
                            True Values Not Churned Churned
Predicted (by ADABoost WOE)                                
Not Churned                                    1140     218
Churned                                         151     250
[1] "Accuracy: 79.02 %"
  • We observed that the Bayesian encoder does slightly better in this case.

Now we vary the number of weak classifiers from 1 to 100. Extending beyond this value, we would be more prone to overfitting.

m.final.values <- seq(1, 100, by = 1)
m.accuracies <- rep(0, 100)

for (index in seq_along(m.final.values)){
    adaboost.model.fit <- boosting(Churn ~ ., data = X.Y.train.AdaBoost, boos = TRUE, mfinal = index)
    adaboost.predictions <- predict(adaboost.model.fit, X.Y.test.AdaBoost)
    confusion.matrix <- table(Predicted = adaboost.predictions$class, Actual = X.Y.test.AdaBoost$Churn)
    
    accuracy <- sum(diag(confusion.matrix)) / sum(confusion.matrix)
    
    m.accuracies[index] <- accuracy
}

m.accuracies <- m.accuracies * 100
AdaBoost.data <- data.frame(Number.of.Weak.Classifiers  = m.final.values, ADAboost.Simple.Accuracies = m.accuracies)
# Create the plot
ggplot(AdaBoost.data, aes(x = Number.of.Weak.Classifiers, y = ADAboost.Simple.Accuracies)) +
  geom_line(color = "#38d6e3") +  # Line plot
  geom_point(color = "#b0d83e") +  # Points on the plot
  labs(title = "ADABoost Accuracy vs Weak Classifier", x = "Weak Classifier", y = "Accuracy") +
  theme_minimal() +  # Aesthetic theme 
  geom_smooth(color = "#38d6e3")+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), panel.grid = element_blank(), panel.background = element_rect(fill = "#f3f9ea"))

Training Models with meaningful predictors

  • We now have six meaningful predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security.

  • We apply the Logistic Regression and XGBoost models again using these predictors with their respective encoded values.

X.Y.Mixed <- Telco.Churn.Dataset

encoded.categories <- list('Contract', 'PaymentMethod', 'OnlineSecurity')
for (category in encoded.categories){
    column.index <- which(colnames(X.Y.Mixed)== category)
    X.Y.Mixed <- bayesian_encoding(X.Y.Mixed, X.Y.Mixed[,column.index:column.index], X.Y.Mixed[,21:21], column.index)
    
    if(category == 'Contract')
      X.Y.Mixed <- woe_encoding(X.Y.Mixed, X.Y.Mixed[,column.index:column.index], Telco.Churn.Dataset[,21:21], column.index)
    
    X.Y.Mixed[[category]] <- as.numeric(X.Y.Mixed[[category]])
}

X.Y.Mixed$Churn <- as.numeric(as.factor(X.Y.Mixed$Churn)) -1

X.train <- X.Y.Mixed[train.X.vector.indicies.unmasked,-c(1,21)]
X.test <- X.Y.Mixed[!train.X.vector.indicies.unmasked,-c(1,21)]
Y.test <- X.Y.Mixed$Churn[!train.X.vector.indicies.unmasked]

logistic.regression.train.fit <- glm(Churn ~ TotalCharges + MonthlyCharges + Contract + tenure + PaymentMethod + OnlineSecurity, data = X.Y.Mixed[,-c(1)], family = binomial, subset = train.X.vector.indicies.unmasked)    

summary(logistic.regression.train.fit)

Call:
glm(formula = Churn ~ TotalCharges + MonthlyCharges + Contract + 
    tenure + PaymentMethod + OnlineSecurity, family = binomial, 
    data = X.Y.Mixed[, -c(1)], subset = train.X.vector.indicies.unmasked)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.055e+01  8.519e-01 -12.382  < 2e-16 ***
TotalCharges    3.058e-04  7.839e-05   3.901 9.59e-05 ***
MonthlyCharges  1.434e-02  2.285e-03   6.275 3.50e-10 ***
Contract        5.300e-01  5.064e-02  10.465  < 2e-16 ***
tenure         -5.555e-02  6.880e-03  -8.074 6.82e-16 ***
PaymentMethod   1.781e+00  2.738e-01   6.506 7.72e-11 ***
OnlineSecurity  2.299e+00  3.019e-01   7.617 2.61e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6105.4  on 5272  degrees of freedom
Residual deviance: 4480.5  on 5266  degrees of freedom
AIC: 4494.5

Number of Fisher Scoring iterations: 6
logistic.regression.train.fit.prob <- predict(logistic.regression.train.fit, X.test, type = "response")

logistic.regression.train.fit.pred <- rep(0, length(logistic.regression.train.fit.prob))
logistic.regression.train.fit.pred[logistic.regression.train.fit.prob > 0.5] = 1

tab <- table(logistic.regression.train.fit.pred, Y.test)

dimnames(tab) <- list(
    "Predicted (by Simple):" = c("Not Churned", "Churned"),
    "True Value:" = c("Not Churned", "Churned")
)

print(ftable(tab))
                       True Value: Not Churned Churned
Predicted (by Simple):                                
Not Churned                               1126     218
Churned                                    165     250
accuracy <- mean(logistic.regression.train.fit.pred == Y.test) * 100
cat('Accuracy for Simple Logistic Regression:', round (accuracy, 3),'%','\n\n')   
Accuracy for Simple Logistic Regression: 78.226 % 
cat("Akaike Information Criterion(AIC) for the simple logistic regression is",AIC(logistic.regression.train.fit),'\n')  # Akaike Information Criterion
Akaike Information Criterion(AIC) for the simple logistic regression is 4494.482 
cat("Bayesian Information Criterion(BIC) for the simple logistic regression is",BIC(logistic.regression.train.fit)) # Bayesian Information Criterion
Bayesian Information Criterion(BIC) for the simple logistic regression is 4540.475
xgboost.model <- xgboost(
  data = X.train.xg.matrix,
  label = Y.train.xg.matrix,
  max_depth = 7,         
  eta = 0.1,             
  nrounds = 24,         
  objective = "binary:logistic", 
  eval_metric = "logloss",
  verbose = 0
)

importance_matrix <- xgb.importance(model = xgboost.model)
print(importance_matrix)
          Feature       Gain      Cover  Frequency
           <char>      <num>      <num>      <num>
1:       Contract 0.42949843 0.20502537 0.02753304
2: MonthlyCharges 0.22225678 0.31848539 0.43392070
3:   TotalCharges 0.20611161 0.27845407 0.39812775
4: OnlineSecurity 0.08957889 0.09750718 0.03083700
5:  PaymentMethod 0.02754038 0.05309149 0.07929515
6:    TechSupport 0.02501391 0.04743650 0.03028634
y.pred <- predict(xgboost.model, X.test.xg.matrix)
y.pred.class <- ifelse(y.pred > 0.5, 1, 0)  # Convert probabilities to 0/1

conf_matrix <- table(Predicted = y.pred.class, Actual = Y.test.xg.matrix)
    dimnames(conf_matrix) <- list(
        "Predicted (by XGBoost)" = c("Not Churned", "Churned"),
        "True Values" = c("Not Churned", "Churned")
    )
    print(ftable(conf_matrix))
                       True Values Not Churned Churned
Predicted (by XGBoost)                                
Not Churned                               1150     232
Churned                                    141     236
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat('Accuracy for XGBoost:', round (accuracy * 100, 3),'%','\n\n')
Accuracy for XGBoost: 78.795 % 

Conclusion

In this project, we wanted to predict whether a customer churns or not. From the start, we had 19 predictors, and most of them did not look statistically significant. So we applied different encoding techniques: Bayesian Target Encoding, Weight of Evidence, and Leave One Out. We did our analysis by fitting different modelling techniques with different encodings to infer the suitable predictors. Predictors like Payment Method, Phone Service, and Tech Support showed low p-values under the encodings, and in total about 12 to 13 predictors appeared statistically significant across the different encodings. Further, we established that at least tenure is highly influential on the output because of its low p-value in most cases; we showed the same thing with the LDA fit at p = 1. Furthermore, we performed and analyzed tree-based methods, where Boosting showed that Total Charges, Monthly Charges, and Contract have high relative influence compared to the others. Similarly, the 10-fold cross-validation on the Random Forest suggested that around 5 to 6 predictors are sufficient for a significantly meaningful model. Hence, from our observations we picked six predictors: Total Charges, Monthly Charges, Contract, tenure, Payment Method, and Online Security. We verified that with these predictors we obtained about 78% accuracy with simple Logistic Regression and XGBoost, while the highest accuracy observed was 80.5%, achieved by various techniques that used all the predictors. Also, the BIC score was slightly lower for the six-predictor model (4540 versus 4581 for the full model), which gives supporting evidence for our assumption of six to seven influential predictors.

Also, in our analysis we did not include backward or forward selection methods, which would help us learn more about these predictors. Similarly, we did not remove outliers or high-leverage points, which could improve accuracy and hence give more insight into the model. In our case the data is limited, so we could consider including data from other parts of the US using cluster or systematic sampling; in general, more data would give us more confidence that we are not underfitting the inferred model. So the best models we can use as of now are:

  • Simple Logistic Regression, Random Forest, and XGBoost with about six to seven predictors.

Future Impact

For the future impact, if the above methods are deployed in a real-world scenario, we can expect to get the right churn prediction about 80% of the time (except for tree-based methods with the Leave One Out encoder), which is clearly better than a random guess. Since the dataset contains many predictors that do not contribute to the outcome, as several model analyses showed, it suffers from the curse of dimensionality. Also, not many interactions were observed during my analysis. If we try to push the model's training cost lower it will start overfitting, while simplifying it further would increase bias and also hurt performance. Hence, it is suggested that more data be collected, since only about 7,000 rows cannot guarantee that 80% accuracy is achievable in practice most of the time. If outliers related to this dataset appear in the future, especially in tenure, then many of the techniques used here would suffer. Putting the extreme cases aside, the company can benefit greatly when it is unsure whether a customer will churn in the future: the firm can focus on those categories of customers to maintain its profits. Also, the model is trained on data from one specified region, namely California, which may not match other regions; in general, not all features of customers in one region match those of customers in another. Hence, the model does not give a holistic view of churn prediction across the US. In those cases our model will yield lower accuracy, but it is still expected to give the firm some benefit in its churn prediction.