Customer churn is a critical issue for businesses, as it can lead to a loss of revenue and market share. Identifying customers who are likely to churn can help businesses take proactive measures to retain them. In this project, we will analyze a dataset of customer interactions with a service provider to predict customer churn. The dataset contains information about customer demographics, usage patterns, payment behavior, and interactions with customer support. Our goal is to build a predictive model that can accurately identify customers who are likely to churn, based on their historical data.
The dataset consists of the following columns:
- CustomerID: a unique identifier for each customer.
- Age: the customer's age.
- Gender: the customer's gender (Female/Male).
- Tenure: the number of months the customer has been with the service.
- Usage.Frequency: how frequently the customer uses the service.
- Support.Calls: the number of calls the customer has made to customer support.
- Payment.Delay: the number of days the customer has delayed payments.
- Subscription.Type: the customer's subscription plan (Basic, Standard, or Premium).
- Contract.Length: the length of the customer's contract (Monthly, Quarterly, or Annual).
- Total.Spend: the total amount the customer has spent on the service.
- Last.Interaction: the number of days since the customer's last interaction with the service provider.
- Churn: the target variable, where 1 indicates the customer churned and 0 indicates they did not.
# Project Description
The problem I am trying to solve is predicting customer churn for a telecom company, which is a critical business issue. Churn prediction helps the company identify which customers are likely to leave, enabling the business to take proactive steps to retain them. By identifying high-risk customers, the company can offer targeted promotions, better customer support, or other incentives to reduce churn, ultimately improving customer retention and increasing revenue.
The dataset used for this analysis contains various customer details, such as demographic information, usage patterns, and service-related metrics. The key features in the dataset include Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction. Before conducting the analysis, I prepared the data by handling missing values, either imputing or removing records with excessive missing data. I also encoded categorical variables such as Gender and Subscription Type into dummy variables. For numerical features with large scale differences, such as Total Spend, I standardized the data to ensure that no feature had undue influence on the model due to its scale. Additionally, I checked for outliers in variables like Age and Tenure to ensure they did not skew the results. I also assessed multicollinearity using the Variance Inflation Factor (VIF) and ensured that all predictors had acceptable values below the threshold of 5, meaning multicollinearity was not a concern.
For the analysis, I used logistic regression, focusing on two models. The first model included all predictors: Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction. The second model was a reduced version, which only included the most impactful features: Age, Tenure, Support Calls, Payment Delay, and Total Spend. In both models, I applied logistic regression to predict the likelihood of churn based on customer attributes. I evaluated model performance using metrics like Accuracy, Sensitivity, Specificity, and the Area Under the Curve (AUC), with the AUC serving as a key indicator of how well the model distinguished between churn and non-churn customers.
The purpose of this analysis was to identify the key factors influencing customer churn and to build a reliable predictive model. By understanding which characteristics increase the likelihood of churn, the telecom company can intervene before customers decide to leave. This allows the business to improve retention, allocate resources more effectively, and tailor marketing strategies to high-risk customers, ultimately leading to increased revenue and customer loyalty.
The results from the analysis showed that several factors significantly influence churn. Customers' Age, Support Calls, Tenure, and Total Spend were found to be key predictors of churn. Specifically, older customers and those who made more support calls were more likely to churn. Customers with shorter tenures were also at higher risk, while those with higher total spending were less likely to leave. Both the full model and the reduced model showed reasonable discriminative performance, with AUC values of roughly 0.77, indicating that the models could rank customers who would churn above those who would stay.
The business impact of this analysis is substantial. By identifying customers who are most at risk of churning, the company can focus its retention efforts on those individuals, offering them personalized incentives and improving their experience. The insights gained from this analysis can also help optimize marketing campaigns by targeting high-risk customers with specific offers. Additionally, resources can be allocated more efficiently, concentrating on the customers who need the most attention to reduce churn. Overall, this analysis equips the company with valuable tools to improve customer retention, reduce churn rates, and ultimately enhance profitability.
# Exploratory Data Analysis (EDA)
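The code in this report relies on several packages; a minimal setup chunk, assuming they are all installed, would be:
# Packages used throughout this report
library(dplyr)        # sample_n() and general data manipulation
library(ggplot2)      # histograms of the numeric columns
library(corrplot)     # correlation heatmap
library(car)          # vif() for multicollinearity checks
library(caret)        # confusionMatrix(), preProcess(), rfe(), train()
library(pROC)         # ROC curves and AUC
library(e1071)        # svm()
library(randomForest) # randomForest(), tuneRF()
library(gbm)          # gradient boosting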
# Train Set
train_set <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Final%20Project/customer_churn_dataset-training-master.csv")
# Test Set
test_set <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Final%20Project/customer_churn_dataset-testing-master.csv")
head(train_set)
## CustomerID Age Gender Tenure Usage.Frequency Support.Calls Payment.Delay
## 1 2 30 Female 39 14 5 18
## 2 3 65 Female 49 1 10 8
## 3 4 55 Female 14 4 6 18
## 4 5 58 Male 38 21 7 7
## 5 6 23 Male 32 20 5 8
## 6 8 51 Male 33 25 9 26
## Subscription.Type Contract.Length Total.Spend Last.Interaction Churn
## 1 Standard Annual 932 17 1
## 2 Basic Monthly 557 6 1
## 3 Basic Quarterly 185 3 1
## 4 Standard Monthly 396 29 1
## 5 Basic Monthly 617 20 1
## 6 Premium Annual 129 8 1
To speed up the analysis, I sample only 10,000 rows from each of the train and test sets.
# Set a seed so the sampling is reproducible across runs
set.seed(123)
# Sample 10,000 rows from train_set
train_set <- sample_n(train_set, 10000)
# Sample 10,000 rows from test_set
test_set <- sample_n(test_set, 10000)
# Train Set Summary
summary(train_set)
## CustomerID Age Gender Tenure
## Min. : 59 Min. :18.00 Length:10000 Min. : 1.00
## 1st Qu.:112883 1st Qu.:29.00 Class :character 1st Qu.:17.00
## Median :224271 Median :39.00 Mode :character Median :32.00
## Mean :224777 Mean :39.41 Mean :31.46
## 3rd Qu.:338566 3rd Qu.:48.00 3rd Qu.:46.00
## Max. :449206 Max. :65.00 Max. :60.00
## Usage.Frequency Support.Calls Payment.Delay Subscription.Type
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Length:10000
## 1st Qu.: 8.00 1st Qu.: 1.000 1st Qu.: 6.00 Class :character
## Median :16.00 Median : 3.000 Median :12.00 Mode :character
## Mean :15.74 Mean : 3.572 Mean :12.89
## 3rd Qu.:23.00 3rd Qu.: 6.000 3rd Qu.:19.00
## Max. :30.00 Max. :10.000 Max. :30.00
## Contract.Length Total.Spend Last.Interaction Churn
## Length:10000 Min. : 100.0 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 482.0 1st Qu.: 7.00 1st Qu.:0.0000
## Mode :character Median : 660.0 Median :14.00 Median :1.0000
## Mean : 632.8 Mean :14.56 Mean :0.5683
## 3rd Qu.: 831.3 3rd Qu.:22.00 3rd Qu.:1.0000
## Max. :1000.0 Max. :30.00 Max. :1.0000
# Test Set Summary
summary(test_set)
## CustomerID Age Gender Tenure
## Min. : 14 Min. :18.00 Length:10000 Min. : 1.0
## 1st Qu.:16848 1st Qu.:30.00 Class :character 1st Qu.:18.0
## Median :32872 Median :42.00 Mode :character Median :33.0
## Mean :32681 Mean :42.09 Mean :32.2
## 3rd Qu.:48676 3rd Qu.:54.00 3rd Qu.:47.0
## Max. :64368 Max. :65.00 Max. :60.0
## Usage.Frequency Support.Calls Payment.Delay Subscription.Type
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Length:10000
## 1st Qu.: 7.00 1st Qu.: 3.000 1st Qu.:10.00 Class :character
## Median :15.00 Median : 6.000 Median :19.00 Mode :character
## Mean :15.06 Mean : 5.375 Mean :17.25
## 3rd Qu.:23.00 3rd Qu.: 8.000 3rd Qu.:25.00
## Max. :30.00 Max. :10.000 Max. :30.00
## Contract.Length Total.Spend Last.Interaction Churn
## Length:10000 Min. : 100.0 Min. : 1.0 Min. :0.0000
## Class :character 1st Qu.: 310.0 1st Qu.: 8.0 1st Qu.:0.0000
## Mode :character Median : 531.0 Median :15.0 Median :0.0000
## Mean : 540.0 Mean :15.5 Mean :0.4828
## 3rd Qu.: 768.2 3rd Qu.:23.0 3rd Qu.:1.0000
## Max. :1000.0 Max. :30.0 Max. :1.0000
# Inspect the structure of the train set
str(train_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 267575 285634 189084 257046 213620 94212 86546 272672 26057 411735 ...
## $ Age : int 43 50 40 42 33 61 42 32 34 19 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Tenure : int 58 7 52 39 5 54 12 60 57 29 ...
## $ Usage.Frequency : int 3 23 19 9 29 15 24 8 21 2 ...
## $ Support.Calls : int 2 2 6 7 1 0 7 2 7 0 ...
## $ Payment.Delay : int 20 1 8 13 18 7 28 14 20 14 ...
## $ Subscription.Type: chr "Basic" "Basic" "Standard" "Premium" ...
## $ Contract.Length : chr "Quarterly" "Quarterly" "Quarterly" "Monthly" ...
## $ Total.Spend : num 530 618 692 355 360 ...
## $ Last.Interaction : int 3 7 24 21 5 27 19 16 12 6 ...
## $ Churn : int 0 0 1 1 1 1 1 0 1 0 ...
# Inspect the structure of the test set
str(test_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 8530 48267 30440 6391 25607 38882 33621 49883 2219 52794 ...
## $ Age : int 35 32 41 63 49 47 49 58 53 63 ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Tenure : int 30 55 13 15 4 8 5 24 10 20 ...
## $ Usage.Frequency : int 21 16 12 26 7 20 6 3 28 6 ...
## $ Support.Calls : int 4 5 5 7 6 5 6 10 9 10 ...
## $ Payment.Delay : int 30 29 20 21 20 17 20 12 8 24 ...
## $ Subscription.Type: chr "Standard" "Standard" "Basic" "Premium" ...
## $ Contract.Length : chr "Quarterly" "Monthly" "Quarterly" "Monthly" ...
## $ Total.Spend : int 884 386 439 533 846 732 622 749 809 965 ...
## $ Last.Interaction : int 26 24 29 9 17 29 24 1 24 30 ...
## $ Churn : int 0 1 0 0 0 0 0 1 0 1 ...
# Select only numeric variables
numeric_vars <- train_set %>% select_if(is.numeric)
# Calculate correlation matrix
correlation_matrix <- cor(numeric_vars, use = "complete.obs")
# Plot correlation heatmap
corrplot(correlation_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
main = "Correlation Matrix")
# Identify numeric and categorical columns
numeric_cols <- names(train_set)[sapply(train_set, is.numeric)]
numeric_cols <- numeric_cols[!numeric_cols %in% c("CustomerID", "Churn")]
categorical_cols <- names(train_set)[sapply(train_set, is.character) | sapply(train_set, is.factor)]
# Create distributions for numeric columns
cat("\nPlotting Distributions for Numeric Columns...\n")
##
## Plotting Distributions for Numeric Columns...
for (col in numeric_cols) {
  # aes_string() is deprecated; use tidy evaluation with .data instead
  print(ggplot(train_set, aes(x = .data[[col]])) +
    geom_histogram(binwidth = 10, fill = "blue", color = "white", alpha = 0.7) +
    labs(title = paste("Distribution of", col), x = col, y = "Frequency") +
    theme_minimal())
}
cat("\nCreating Boxplots for Numeric Columns...\n")
##
## Creating Boxplots for Numeric Columns...
for (col in numeric_cols) {
# Create a boxplot for the numeric column
boxplot(
train_set[[col]],
main = paste("Boxplot of", col),
ylab = col,
col = "lightblue",
border = "black",
outline = TRUE # Show outliers
)
}
The dataset arrives essentially clean: only one row with missing values was present in the original train set, and the 10,000-row sample contains none, so no rows needed to be removed.
# Removing missing values is not needed here, since the sampled train set has none
# train_set <- na.omit(train_set)
sum(is.na(train_set))
## [1] 0
# Convert categorical variables to factors
train_set$Gender <- as.factor(train_set$Gender)
train_set$Subscription.Type <- as.factor(train_set$Subscription.Type)
train_set$Contract.Length <- as.factor(train_set$Contract.Length)
train_set$Churn <- as.factor(train_set$Churn)
test_set$Gender <- as.factor(test_set$Gender)
test_set$Subscription.Type <- as.factor(test_set$Subscription.Type)
test_set$Contract.Length <- as.factor(test_set$Contract.Length)
test_set$Churn <- as.factor(test_set$Churn)
# View the structure of the data
str(train_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 267575 285634 189084 257046 213620 94212 86546 272672 26057 411735 ...
## $ Age : int 43 50 40 42 33 61 42 32 34 19 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 2 2 ...
## $ Tenure : int 58 7 52 39 5 54 12 60 57 29 ...
## $ Usage.Frequency : int 3 23 19 9 29 15 24 8 21 2 ...
## $ Support.Calls : int 2 2 6 7 1 0 7 2 7 0 ...
## $ Payment.Delay : int 20 1 8 13 18 7 28 14 20 14 ...
## $ Subscription.Type: Factor w/ 3 levels "Basic","Premium",..: 1 1 3 2 2 2 3 1 3 3 ...
## $ Contract.Length : Factor w/ 3 levels "Annual","Monthly",..: 3 3 3 2 1 1 1 1 1 3 ...
## $ Total.Spend : num 530 618 692 355 360 ...
## $ Last.Interaction : int 3 7 24 21 5 27 19 16 12 6 ...
## $ Churn : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
sum(is.na(train_set))
## [1] 0
In this analysis, I built two logistic regression models to predict customer churn. The first model includes nine predictors: Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction, while the second model simplifies the predictors to just Age, Tenure, Support Calls, Payment Delay, and Total Spend. I began by checking for multicollinearity using Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to correlations with other predictors. All VIF values were below 5, suggesting that multicollinearity was not a significant issue.
When interpreting the coefficients, I found that Age had a small but significant positive relationship with churn, meaning that older customers are slightly more likely to churn. Gender had a strong negative coefficient for male customers, indicating that men were less likely to churn. Support Calls had a significant positive effect, suggesting that customers who made more support calls were more likely to churn. Subscription Type also played a role, with both Premium and Standard subscriptions showing negative coefficients, meaning customers with these subscriptions were less likely to churn compared to other types.
The second model, which excluded some predictors, showed similar trends. Age and Support Calls remained significant, both increasing the likelihood of churn. Tenure and Total Spend had negative coefficients, while Payment Delay's coefficient was positive: longer tenure and higher spending were associated with lower churn, and longer payment delays with higher churn.
For model evaluation, I looked at several performance metrics. On the test set, the full model's accuracy was 59.73%, with a sensitivity of 23.45% and a specificity of 98.59%. Note that caret treats class 0 (no churn) as the positive class here, so sensitivity is the recall of non-churners and specificity the recall of churners: the model catches nearly all churners but also flags most loyal customers as churn risks. The AUC (Area Under the Curve) was 0.7761, indicating a good ability to rank churners above non-churners. The second model had a similar accuracy of 59.83% and a slightly lower AUC of 0.7675. Although accuracy is moderate, the AUC values suggest the models are reasonably good at discriminating churn, with Support Calls, Age, and Total Spend being key factors.
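To make the positive-class convention concrete, the headline metrics can be reproduced by hand from the full model's test confusion matrix (counts taken from the caret output shown later in this section):
# Reproduce caret's headline metrics from the confusion-matrix counts,
# remembering that class 0 (no churn) is the positive class
tp <- 1213; fn <- 3959  # actual non-churners, split by predicted class
fp <- 68;   tn <- 4760  # actual churners, split by predicted class
accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # 0.5973
sensitivity <- tp / (tp + fn)                   # 0.2345, recall of non-churners
specificity <- tn / (tn + fp)                   # 0.9859, recall of churners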
set.seed(123)
# Train a Logistic Regression Model
logistic_model <- glm(Churn ~ Age + Gender + Tenure + Usage.Frequency + Support.Calls +
Payment.Delay + Subscription.Type + Total.Spend +
Last.Interaction,
data = train_set,
family = binomial,
control = list(maxit = 1000))
# Looking at multicollinearity
vif(logistic_model)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.013470 1 1.006712
## Gender 1.109134 1 1.053154
## Tenure 1.011901 1 1.005933
## Usage.Frequency 1.004468 1 1.002232
## Support.Calls 1.181901 1 1.087153
## Payment.Delay 1.075382 1 1.037006
## Subscription.Type 1.009377 2 1.002336
## Total.Spend 1.146048 1 1.070536
## Last.Interaction 1.098976 1 1.048321
# Summary of the Logistic Regression Model
summary(logistic_model)
##
## Call:
## glm(formula = Churn ~ Age + Gender + Tenure + Usage.Frequency +
## Support.Calls + Payment.Delay + Subscription.Type + Total.Spend +
## Last.Interaction, family = binomial, data = train_set, control = list(maxit = 1000))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1023871 0.1931708 -0.530 0.596089
## Age 0.0332458 0.0027344 12.159 < 2e-16 ***
## GenderMale -1.1422204 0.0668059 -17.098 < 2e-16 ***
## Tenure -0.0079908 0.0018159 -4.401 1.08e-05 ***
## Usage.Frequency -0.0143120 0.0036593 -3.911 9.19e-05 ***
## Support.Calls 0.6911225 0.0170988 40.419 < 2e-16 ***
## Payment.Delay 0.1049944 0.0043333 24.230 < 2e-16 ***
## Subscription.TypePremium -0.2871643 0.0762683 -3.765 0.000166 ***
## Subscription.TypeStandard -0.1840742 0.0762139 -2.415 0.015725 *
## Total.Spend -0.0056174 0.0001704 -32.958 < 2e-16 ***
## Last.Interaction 0.0586276 0.0038866 15.085 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675.8 on 9999 degrees of freedom
## Residual deviance: 6534.7 on 9989 degrees of freedom
## AIC: 6556.7
##
## Number of Fisher Scoring iterations: 6
Next, I fit a reduced logistic regression model containing five of the strongest predictors of churn: Age, Tenure, Support Calls, Payment Delay, and Total Spend. Since every predictor in the full model was statistically significant, the goal of this reduction was parsimony: a simpler model focused on the most impactful drivers, which is easier to act on for churn prediction and retention strategies.
# Refitting the model with a reduced set of the strongest predictors
significant_model <- glm(Churn ~ Age + Tenure + Support.Calls + Payment.Delay + Total.Spend,
data = train_set,
family = binomial,
control = list(maxit = 1000))
# Looking at multicollinearity
vif(significant_model)
## Age Tenure Support.Calls Payment.Delay Total.Spend
## 1.009464 1.002272 1.143582 1.066958 1.126317
# Summary of the new model
summary(significant_model)
##
## Call:
## glm(formula = Churn ~ Age + Tenure + Support.Calls + Payment.Delay +
## Total.Spend, family = binomial, data = train_set, control = list(maxit = 1000))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3584532 0.1610753 -2.225 0.026056 *
## Age 0.0320370 0.0026172 12.241 < 2e-16 ***
## Tenure -0.0064212 0.0017366 -3.698 0.000218 ***
## Support.Calls 0.6675793 0.0162171 41.165 < 2e-16 ***
## Payment.Delay 0.1030072 0.0041463 24.843 < 2e-16 ***
## Total.Spend -0.0054530 0.0001621 -33.647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675.8 on 9999 degrees of freedom
## Residual deviance: 7026.5 on 9994 degrees of freedom
## AIC: 7038.5
##
## Number of Fisher Scoring iterations: 6
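Logistic regression coefficients are on the log-odds scale; exponentiating turns them into odds ratios, which are easier to interpret. For example, each additional support call multiplies the odds of churn by about exp(0.668) ≈ 1.95:
# Coefficients are log-odds; exponentiate to read them as odds ratios
round(exp(coef(significant_model)), 3)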
# Predict the probability of churn (0 or 1) on the test set
predictions_prob <- predict(logistic_model, test_set, type = "response")
sig_predictions_prob <- predict(significant_model, test_set, type = "response")
# Convert probabilities to 0 or 1 based on a threshold (e.g., 0.5)
predictions <- ifelse(predictions_prob > 0.5, 1, 0)
sig_predictions <- ifelse(sig_predictions_prob > 0.5, 1, 0)
# Evaluate the Performance of the model
conf_matrix <- confusionMatrix(factor(predictions), factor(test_set$Churn))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1213 68
## 1 3959 4760
##
## Accuracy : 0.5973
## 95% CI : (0.5876, 0.6069)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2147
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.2345
## Specificity : 0.9859
## Pos Pred Value : 0.9469
## Neg Pred Value : 0.5459
## Prevalence : 0.5172
## Detection Rate : 0.1213
## Detection Prevalence : 0.1281
## Balanced Accuracy : 0.6102
##
## 'Positive' Class : 0
##
# Evaluate the Performance of the significant model
sig_conf_matrix <- confusionMatrix(factor(sig_predictions), factor(test_set$Churn))
print(sig_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1253 98
## 1 3919 4730
##
## Accuracy : 0.5983
## 95% CI : (0.5886, 0.6079)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2163
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.2423
## Specificity : 0.9797
## Pos Pred Value : 0.9275
## Neg Pred Value : 0.5469
## Prevalence : 0.5172
## Detection Rate : 0.1253
## Detection Prevalence : 0.1351
## Balanced Accuracy : 0.6110
##
## 'Positive' Class : 0
##
# ROC Curve and AUC
roc_curve <- roc(test_set$Churn, predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve", col = "blue")
auc(roc_curve)
## Area under the curve: 0.7761
# ROC Curve and AUC for the significant model
sig_roc_curve <- roc(test_set$Churn, sig_predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(sig_roc_curve, main = "ROC Curve for Significant Model", col = "red")
auc(sig_roc_curve)
## Area under the curve: 0.7675
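Given the lopsided sensitivity and specificity at the default 0.5 cutoff, one optional refinement, not applied in this report, is to choose the probability threshold that maximizes Youden's J statistic directly from the ROC object:
# Optional: find the cutoff maximizing sensitivity + specificity - 1
# instead of using the default 0.5 threshold
best_cut <- coords(roc_curve, "best", best.method = "youden",
                   ret = c("threshold", "sensitivity", "specificity"))
print(best_cut)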
The next set of models uses a Support Vector Machine (SVM) to classify whether a customer will churn based on features such as Age, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Contract Length, Total Spend, and Last Interaction. Before that, it is worth recapping the logistic regression baseline. The full model, in which Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction were all significant predictors, reached a test accuracy of approximately 59.73%; with class 0 (no churn) as the positive class, sensitivity was 23.45% and specificity 98.59%, meaning the model caught nearly all churners while misclassifying most non-churners as churn risks. Its AUC of 0.7761 indicates decent overall discrimination, though with room for improvement. The reduced model with only five predictors (Age, Tenure, Support Calls, Payment Delay, and Total Spend) provided a similar accuracy of 59.83% and a balanced accuracy of 61.10%, further validating the importance of these features. Despite the moderate performance, the analysis indicates that Support Calls, Payment Delay, and Total Spend are key factors influencing churn, and the models could potentially be improved by fine-tuning hyperparameters or applying other techniques like cross-validation.
RFE (Recursive Feature Elimination) repeatedly removes the weakest features and rebuilds the model to identify the most informative subset. Here is how it can be applied to select the best features:
# Preprocess: center and scale the predictors (scaling parameters are learned without the Churn column)
preProcess_data <- preProcess(train_set[, -which(names(train_set) == "Churn")],
method = c("center", "scale"))
train_scaled <- predict(preProcess_data, train_set)
test_scaled <- predict(preProcess_data, test_set)
# RFE feature selection
control <- rfeControl(functions=rfFuncs, method="cv", number=10) # Using Random Forest with 10-fold cross-validation
rfe_result <- rfe(train_scaled[, -which(names(train_scaled) == "Churn")],
train_scaled$Churn,
sizes=c(1:10), # Adjust the size range to your dataset's context
rfeControl=control)
# Get the best features selected by RFE
best_features <- rfe_result$optVariables
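Before training the SVMs, it can be worth inspecting which subset size the RFE search preferred; caret provides print and plot methods for rfe objects:
# Inspect the RFE results: accuracy for each subset size and the winning variables
print(rfe_result)
plot(rfe_result, type = c("g", "o"))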
# Train SVM models with different kernels using the selected features
# Linear Kernel
svm_linear_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "linear",
cost = 100,
scale = TRUE)
# Radial Kernel
svm_radial_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "radial",
cost = 100,
scale = TRUE)
# Polynomial Kernel
svm_polynomial_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "polynomial",
cost = 100,
scale = TRUE)
# Sigmoid Kernel
svm_sigmoid_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "sigmoid",
cost = 100,
scale = TRUE)
# Predict on Test Set using the selected features for each kernel
# Linear Kernel Predictions
svm_linear_predictions <- predict(svm_linear_model, test_scaled[, best_features])
# Radial Kernel Predictions
svm_radial_predictions <- predict(svm_radial_model, test_scaled[, best_features])
# Polynomial Kernel Predictions
svm_polynomial_predictions <- predict(svm_polynomial_model, test_scaled[, best_features])
# Sigmoid Kernel Predictions
svm_sigmoid_predictions <- predict(svm_sigmoid_model, test_scaled[, best_features])
# Evaluate Performance for each kernel
cat("Performance for SVM with Linear Kernel:\n")
## Performance for SVM with Linear Kernel:
print(confusionMatrix(svm_linear_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 5172 4828
##
## Accuracy : 0.4828
## 95% CI : (0.473, 0.4926)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.4828
## Prevalence : 0.5172
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Radial Kernel:\n")
##
## Performance for SVM with Radial Kernel:
print(confusionMatrix(svm_radial_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 224 7
## 1 4948 4821
##
## Accuracy : 0.5045
## 95% CI : (0.4947, 0.5143)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.9946
##
## Kappa : 0.0405
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.04331
## Specificity : 0.99855
## Pos Pred Value : 0.96970
## Neg Pred Value : 0.49350
## Prevalence : 0.51720
## Detection Rate : 0.02240
## Detection Prevalence : 0.02310
## Balanced Accuracy : 0.52093
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Polynomial Kernel:\n")
##
## Performance for SVM with Polynomial Kernel:
print(confusionMatrix(svm_polynomial_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 203 5
## 1 4969 4823
##
## Accuracy : 0.5026
## 95% CI : (0.4928, 0.5124)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.9983
##
## Kappa : 0.037
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.03925
## Specificity : 0.99896
## Pos Pred Value : 0.97596
## Neg Pred Value : 0.49254
## Prevalence : 0.51720
## Detection Rate : 0.02030
## Detection Prevalence : 0.02080
## Balanced Accuracy : 0.51911
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Sigmoid Kernel:\n")
##
## Performance for SVM with Sigmoid Kernel:
print(confusionMatrix(svm_sigmoid_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 5172 4828
##
## Accuracy : 0.4828
## 95% CI : (0.473, 0.4926)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.4828
## Prevalence : 0.5172
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
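All four SVM fits above used a fixed cost of 100. A natural next step, not run in this report, is e1071's tune() helper, which grid-searches hyperparameters with built-in 10-fold cross-validation; the grid values below are illustrative, and this is slow on 10,000 rows:
# Illustrative grid search over cost and gamma for the radial kernel,
# using e1071's built-in cross-validation
svm_tune <- tune(svm, Churn ~ .,
                 data = train_scaled[, c(best_features, "Churn")],
                 kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100),
                               gamma = c(0.01, 0.1, 1)))
summary(svm_tune)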
In my analysis, I applied a random forest model to predict customer churn, first training it on the full set of features. This produced an out-of-bag (OOB) error rate of just 0.45% on the training data, yet test-set performance collapsed to an accuracy of 49.22%. With class 0 (no churn) as the positive class, specificity was 1.00000 but sensitivity only 0.01817: the model predicted churn for almost every test customer. The gulf between near-perfect OOB error and near-chance test accuracy suggests the model memorized the training sample rather than learning generalizable patterns, most likely helped by CustomerID, an arbitrary identifier that nonetheless dominated the importance rankings.
Next, I retrained the model on the five features with the highest mean decrease in accuracy: CustomerID, Total.Spend, Support.Calls, Contract.Length, and Age. The OOB error rate rose slightly to 0.94%, while on the test set sensitivity fell to 0.007347 with a detection rate of 0.003800 and an accuracy of 0.4866, confirming that the strong training fit did not carry over.
To push the model toward the churn class, I applied class weights, assigning a higher weight to churn. This produced an OOB error rate of 0.95%, but the test-set confusion matrix was essentially identical to the unweighted top-five model (sensitivity 0.007347), so reweighting alone did not change the model's tendency to predict churn for nearly everyone.
In conclusion, the random forest runs highlight which attributes the trees lean on, but test performance was poor. Dropping leakage-prone columns such as CustomerID, accounting for the distribution differences between the train and test sets, and applying resampling or feature engineering are the most promising ways to improve sensitivity and detection rates.
# Train Random Forest model on all columns (note: the formula includes the scaled CustomerID)
rf_model <- randomForest(Churn ~ ., data = train_scaled, importance = TRUE)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.45%
## Confusion matrix:
## 0 1 class.error
## 0 4272 45 0.01042391
## 1 0 5683 0.00000000
# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_scaled)
# Evaluate performance using confusion matrix
library(caret)
confusionMatrix(rf_predictions, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 0
## 1 5078 4828
##
## Accuracy : 0.4922
## 95% CI : (0.4824, 0.502)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0176
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01817
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48738
## Prevalence : 0.51720
## Detection Rate : 0.00940
## Detection Prevalence : 0.00940
## Balanced Accuracy : 0.50909
##
## 'Positive' Class : 0
##
# Feature importance table (varImpPlot below draws the plot)
importance(rf_model)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 89.688056 39.417032 63.976439 2892.609677
## Age 16.120199 37.230447 18.120451 233.390907
## Gender 9.638174 27.623485 11.738817 39.207900
## Tenure 3.222604 1.787152 3.493085 13.696222
## Usage.Frequency 2.270625 0.434517 1.982608 10.297279
## Support.Calls 15.414775 60.376977 19.706360 669.872441
## Payment.Delay 15.182943 45.717833 17.529104 222.951377
## Subscription.Type 2.240037 2.482856 3.247690 3.754543
## Contract.Length 16.857257 43.808748 18.751483 268.335506
## Total.Spend 19.071894 53.296989 21.515974 490.249717
## Last.Interaction 9.219669 18.664482 10.333623 62.248186
varImpPlot(rf_model)
# Tune hyperparameters
tuned_rf <- tuneRF(train_scaled[, -which(names(train_scaled) == "Churn")],
train_scaled$Churn,
stepFactor = 1.5,
improve = 0.01,
trace = TRUE)
## mtry = 3 OOB error = 0.49%
## Searching left ...
## mtry = 2 OOB error = 0.51%
## -0.04081633 0.01
## Searching right ...
## mtry = 4 OOB error = 0.38%
## 0.2244898 0.01
## mtry = 6 OOB error = 0.42%
## -0.1052632 0.01
# Re-display the baseline model's confusion matrix for comparison
# (tuneRF above only explores mtry; rf_predictions are unchanged)
confusionMatrix(rf_predictions, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 0
## 1 5078 4828
##
## Accuracy : 0.4922
## 95% CI : (0.4824, 0.502)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0176
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01817
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48738
## Prevalence : 0.51720
## Detection Rate : 0.00940
## Detection Prevalence : 0.00940
## Balanced Accuracy : 0.50909
##
## 'Positive' Class : 0
##
Now that I have a baseline model, I will rebuild it using the best subset of features.
# Create a data frame of feature importance
importance_scores <- data.frame(
Feature = rownames(importance(rf_model)),
MeanDecreaseAccuracy = importance(rf_model)[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = importance(rf_model)[, "MeanDecreaseGini"]
)
# Sort by MeanDecreaseAccuracy (or MeanDecreaseGini) to select the most important features
top_features <- importance_scores[order(-importance_scores$MeanDecreaseAccuracy), "Feature"]
# Select the top 5 features (You can adjust this number)
top_5_features <- top_features[1:5]
# Train Random Forest model using the top 5 features
rf_model_top <- randomForest(Churn ~ .,
data = train_scaled[, c(top_5_features, "Churn")],
importance = TRUE)
# Print the model summary
print(rf_model_top)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled[, c(top_5_features, "Churn")], importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.94%
## Confusion matrix:
## 0 1 class.error
## 0 4245 72 0.016678249
## 1 22 5661 0.003871195
# Make predictions on the test data using the top 5 features
rf_predictions_top <- predict(rf_model_top, newdata = test_scaled[, top_5_features])
# Evaluate performance using confusion matrix
library(caret)
confusionMatrix(rf_predictions_top, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 0
## 1 5134 4828
##
## Accuracy : 0.4866
## 95% CI : (0.4768, 0.4964)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0071
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.007347
## Specificity : 1.000000
## Pos Pred Value : 1.000000
## Neg Pred Value : 0.484642
## Prevalence : 0.517200
## Detection Rate : 0.003800
## Detection Prevalence : 0.003800
## Balanced Accuracy : 0.503674
##
## 'Positive' Class : 0
##
# Feature importance table for the top-five model
importance(rf_model_top)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 495.82597 46.64745 103.75761 3222.8634
## Total.Spend 16.79722 48.61767 18.44506 502.7290
## Support.Calls 14.61203 64.95974 17.18003 730.8382
## Contract.Length 15.67537 39.89741 16.97312 274.1084
## Age 11.79699 37.76864 13.54505 171.0931
varImpPlot(rf_model_top)
# Train Random Forest model with class weights to push predictions toward churn
rf_model_balanced <- randomForest(Churn ~ .,
                                  data = train_scaled[, c(top_5_features, "Churn")],
                                  importance = TRUE,
                                  classwt = c(0.2, 0.8)) # Weight 0.2 for class 0, 0.8 for the churn class
# Print the model summary
print(rf_model_balanced)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled[, c(top_5_features, "Churn")], importance = TRUE, classwt = c(0.2, 0.8))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.95%
## Confusion matrix:
## 0 1 class.error
## 0 4242 75 0.017373176
## 1 20 5663 0.003519268
# Make predictions on the test data
rf_predictions_balanced <- predict(rf_model_balanced, newdata = test_scaled[, top_5_features])
# Evaluate performance using confusion matrix
confusionMatrix(rf_predictions_balanced, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 0
## 1 5134 4828
##
## Accuracy : 0.4866
## 95% CI : (0.4768, 0.4964)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0071
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.007347
## Specificity : 1.000000
## Pos Pred Value : 1.000000
## Neg Pred Value : 0.484642
## Prevalence : 0.517200
## Detection Rate : 0.003800
## Detection Prevalence : 0.003800
## Balanced Accuracy : 0.503674
##
## 'Positive' Class : 0
##
# Feature importance table for the weighted model
importance(rf_model_balanced)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 496.33662 50.27460 107.33262 2385.9252
## Total.Spend 17.37626 49.00818 19.12205 236.6473
## Support.Calls 11.84786 68.11028 14.73909 323.3242
## Contract.Length 16.21887 40.26281 17.45866 133.9138
## Age 13.01834 38.19712 14.70748 116.2204
varImpPlot(rf_model_balanced)
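Class weights barely moved the test-set results. An alternative I did not run is to force each tree's bootstrap sample to be class-balanced via randomForest's strata and sampsize arguments; a minimal sketch, where the equal-draw choice is an assumption rather than something tried above:
# Draw an equal number of rows from each class for every tree
n_min <- min(table(train_scaled$Churn))
rf_strat <- randomForest(Churn ~ .,
                         data = train_scaled[, c(top_5_features, "Churn")],
                         strata = train_scaled$Churn,
                         sampsize = c(n_min, n_min),
                         importance = TRUE)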
For my Gradient Boosting Machine (GBM) model, I first set up 10-fold cross-validation for model training using the trainControl function from the caret package. The goal was to evaluate the model's performance with more reliable validation. I used this control setup to train the model, suppressing verbose output to keep the results clean.
Once the model was trained, I moved on to making predictions on the test set. The predict function was applied to the test data, and I evaluated the model's performance by calculating the confusion matrix with confusionMatrix, which reports accuracy, sensitivity, specificity, and other important metrics. I also plotted the ROC curve using the pROC package, which is helpful for evaluating the model across different thresholds, and calculated the AUC (Area Under the Curve) to gauge how well the model distinguishes between the classes.
Next, I removed the CustomerID column from both the train and test sets, as an identifier provides no meaningful contribution to the prediction. I also converted the Churn variable to a numeric binary variable, as required by the gbm package.
I then trained the GBM model again with specific parameters: n.trees = 500, interaction.depth = 3, and shrinkage = 0.01. These settings help control the complexity of the model and mitigate overfitting. I used 5-fold cross-validation and restricted training to a single core.
After training, I evaluated the model's performance once again using a confusion matrix. The accuracy was 53.41%, slightly above the no-information rate of 51.72%. With class 0 (no churn) as the positive class, sensitivity was 10.09% and specificity 99.81%: the model identified nearly all churners but correctly recognized only about a tenth of non-churners. No ROC curve was computed for this refit, but the caret-tuned model's AUC of 0.804 suggests reasonable ranking ability despite the poor thresholded accuracy.
The final step involved analyzing the model's feature importance, which provides insight into which variables were most influential in the prediction. The relative-influence summary showed that Support.Calls, Total.Spend, and Contract.Length were the top predictors.
Overall, the model's thresholded performance was modest, and there is clear room for improvement, especially in correctly identifying the non-churn class. I might experiment with further tuning the hyperparameters, addressing the train/test mismatch, or trying other models to improve performance.
# Set up cross-validation for model training
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the GBM model using caret's train function
gbm_model <- train(Churn ~ ., data = train_set, method = "gbm",
trControl = control,
verbose = FALSE) # Suppress verbose output
# Print the model results
print(gbm_model)
## Stochastic Gradient Boosting
##
## 10000 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 9000, 9000, 8999, 9000, 8999, 9000, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9908999 0.9814062
## 1 100 0.9908999 0.9814062
## 1 150 0.9908999 0.9814062
## 2 50 0.9908999 0.9814062
## 2 100 0.9914998 0.9826345
## 2 150 0.9936999 0.9871357
## 3 50 0.9909999 0.9816112
## 3 100 0.9934995 0.9867281
## 3 150 0.9947997 0.9893866
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Predict on Test Set
gbm_predictions <- predict(gbm_model, newdata = test_set)
# Evaluate the performance using confusion matrix
conf_matrix <- confusionMatrix(gbm_predictions, test_set$Churn)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 0
## 1 5084 4828
##
## Accuracy : 0.4916
## 95% CI : (0.4818, 0.5014)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0164
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01701
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48709
## Prevalence : 0.51720
## Detection Rate : 0.00880
## Detection Prevalence : 0.00880
## Balanced Accuracy : 0.50851
##
## 'Positive' Class : 0
##
# Plot ROC Curve
library(pROC)
gbm_prob <- predict(gbm_model, newdata = test_set, type = "prob")[,2] # Get probabilities
roc_curve <- roc(test_set$Churn, gbm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for GBM", col = "red")
print(paste("AUC: ", auc(roc_curve)))
## [1] "AUC: 0.804496809344306"
# Remove 'CustomerID' column from both train and test sets
train_no_customerid <- train_set[, -which(names(train_set) == "CustomerID")]
test_no_customerid <- test_set[, -which(names(test_set) == "CustomerID")]
# Convert 'Churn' to numeric binary for both datasets
train_no_customerid$Churn <- as.numeric(as.factor(train_no_customerid$Churn)) - 1
test_no_customerid$Churn <- as.numeric(as.factor(test_no_customerid$Churn)) - 1
# Train the Gradient Boosting Machine model
gbm_model <- gbm(Churn ~ .,
data = train_no_customerid,
distribution = "bernoulli",
n.trees = 500,
interaction.depth = 3,
shrinkage = 0.01,
cv.folds = 5,
n.cores = 1,
verbose = TRUE)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.3533 nan 0.0100 0.0071
## 2 1.3394 nan 0.0100 0.0070
## 3 1.3255 nan 0.0100 0.0069
## 4 1.3119 nan 0.0100 0.0068
## 5 1.2985 nan 0.0100 0.0067
## 6 1.2854 nan 0.0100 0.0065
## 7 1.2727 nan 0.0100 0.0064
## 8 1.2601 nan 0.0100 0.0063
## 9 1.2476 nan 0.0100 0.0062
## 10 1.2354 nan 0.0100 0.0061
## 20 1.1242 nan 0.0100 0.0051
## 40 0.9482 nan 0.0100 0.0038
## 60 0.8174 nan 0.0100 0.0030
## 80 0.7137 nan 0.0100 0.0023
## 100 0.6325 nan 0.0100 0.0017
## 120 0.5649 nan 0.0100 0.0015
## 140 0.5085 nan 0.0100 0.0013
## 160 0.4627 nan 0.0100 0.0009
## 180 0.4248 nan 0.0100 0.0009
## 200 0.3931 nan 0.0100 0.0008
## 220 0.3659 nan 0.0100 0.0005
## 240 0.3428 nan 0.0100 0.0005
## 260 0.3214 nan 0.0100 0.0005
## 280 0.3033 nan 0.0100 0.0004
## 300 0.2864 nan 0.0100 0.0002
## 320 0.2700 nan 0.0100 0.0004
## 340 0.2559 nan 0.0100 0.0003
## 360 0.2423 nan 0.0100 0.0002
## 380 0.2301 nan 0.0100 0.0003
## 400 0.2188 nan 0.0100 0.0003
## 420 0.2083 nan 0.0100 0.0002
## 440 0.1988 nan 0.0100 0.0002
## 460 0.1903 nan 0.0100 0.0002
## 480 0.1826 nan 0.0100 0.0001
## 500 0.1752 nan 0.0100 0.0002
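Because the model was fit with cv.folds = 5, gbm.perf can report the cross-validation-optimal number of boosting iterations; this refinement was not applied in the original run:
# Cross-validation estimate of the optimal number of boosting iterations
best_iter <- gbm.perf(gbm_model, method = "cv")
print(best_iter)
The prediction call below could then pass n.trees = best_iter rather than the full 500.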
# Print the summary of the model
summary(gbm_model)
## var rel.inf
## Support.Calls Support.Calls 38.5707820
## Total.Spend Total.Spend 24.3989000
## Contract.Length Contract.Length 12.2535538
## Age Age 11.8358052
## Payment.Delay Payment.Delay 11.6733543
## Last.Interaction Last.Interaction 0.7565079
## Gender Gender 0.5110968
## Tenure Tenure 0.0000000
## Usage.Frequency Usage.Frequency 0.0000000
## Subscription.Type Subscription.Type 0.0000000
# Predict on the test set
gbm_predictions <- predict(gbm_model, test_no_customerid, n.trees = gbm_model$n.trees, type = "response")
# Convert continuous predictions to binary outcomes
gbm_predictions_binary <- ifelse(gbm_predictions > 0.5, 1, 0)
# Evaluate the model's performance
cat("Performance for Gradient Boosting Machine (GBM):\n")
## Performance for Gradient Boosting Machine (GBM):
print(confusionMatrix(factor(gbm_predictions_binary), factor(test_no_customerid$Churn)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 522 9
## 1 4650 4819
##
## Accuracy : 0.5341
## 95% CI : (0.5243, 0.5439)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.0003711
##
## Kappa : 0.096
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.1009
## Specificity : 0.9981
## Pos Pred Value : 0.9831
## Neg Pred Value : 0.5089
## Prevalence : 0.5172
## Detection Rate : 0.0522
## Detection Prevalence : 0.0531
## Balanced Accuracy : 0.5495
##
## 'Positive' Class : 0
##
# Plot the partial dependence of one predictor (plot.gbm shows marginal effects;
# relative influence was already reported by summary() above)
plot(gbm_model, i.var = 1) # Adjust the variable index as needed
After looking at the results, I can see that the model performed okay, but not as well as I hoped. The reduced model, which included only the key factors Age, Tenure, Support Calls, Payment Delay, and Total Spend, had a test accuracy of about 59.8% with an AUC of 0.77. While this is a decent starting point for predicting churn, the results show there are some reasons why the model didn't perform better. One major reason could be the dataset itself. While the data contains useful features, there may be other important drivers of churn that we didn't capture, such as customer behavior or external factors like competitor offers. These could have influenced churn but were not included in the model.
Another possible reason is the mismatch between the train and test distributions: churners make up about 57% of the sampled train set but only 48% of the test set, and the test set also shows noticeably higher average Support Calls and Payment Delay. A model fit to one distribution will struggle on the other, which can inflate false alarms and depress accuracy. Resampling techniques, such as oversampling churn cases or undersampling non-churn cases, or reweighting the loss, may still help, but aligning the training data with the population the model will score matters more here.
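If resampling is attempted, caret ships simple helpers for it. A hypothetical rebalancing step before refitting, where downsampling is just one of several options, might look like:
# Hypothetical: downsample the majority class so both classes have equal counts
# before refitting; caret::downSample returns a data frame with the outcome
# stored under the name given by yname
balanced_train <- downSample(x = train_set[, setdiff(names(train_set), "Churn")],
                             y = train_set$Churn,
                             yname = "Churn")
table(balanced_train$Churn)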
Also, I might not have gotten the most out of the more complex models. Random forest, SVM, and GBM were all tried, and although the tree-based models fit the training data almost perfectly, they generalized worse than logistic regression on the test set, a classic sign of overfitting, likely aggravated by including CustomerID as a predictor. More careful tuning, regularization, or alternatives like XGBoost might capture complex patterns without memorizing the training sample.
The choice of features used in the model also impacted performance. Some features, like Gender and Subscription Type, were left out of the reduced model. These might still have valuable information that could help improve predictions. Trying different feature selection methods or creating new features could help improve the model.
Lastly, there may be room to improve the data cleaning and preprocessing steps. I handled missing values and encoded categorical variables, but I did not try more advanced imputation methods or alternative ways of scaling the data. And although the VIF checks showed little multicollinearity among the predictors, regularization or dimensionality reduction could still help the models generalize.
In conclusion, while the model gave us some insights, there are several reasons why it didn’t perform as well as expected. Improving how we handle class imbalance, using more advanced models, including additional features, and refining the data preprocessing steps are some areas to explore to improve the model’s performance.