Customer churn is a critical issue for businesses, as it can lead to a loss of revenue and market share. Identifying customers who are likely to churn can help businesses take proactive measures to retain them. In this project, we will analyze a dataset of customer interactions with a service provider to predict customer churn. The dataset contains information about customer demographics, usage patterns, payment behavior, and interactions with customer support. Our goal is to build a predictive model that can accurately identify customers who are likely to churn, based on their historical data.
The dataset consists of the following columns:
- CustomerID: a unique identifier for each customer.
- Age: the customer's age.
- Gender: the customer's gender (Female/Male).
- Tenure: the number of months the customer has been with the service.
- Usage.Frequency: how frequently the customer uses the service.
- Support.Calls: the number of calls the customer has made to customer support.
- Payment.Delay: the number of days the customer has delayed payments.
- Subscription.Type: the customer's subscription plan (Basic, Standard, or Premium).
- Contract.Length: the length of the customer's contract (Monthly, Quarterly, or Annual).
- Total.Spend: the total amount the customer has spent on the service.
- Last.Interaction: the number of days since the customer's last interaction with the service provider.
- Churn: the target variable, where 1 indicates the customer churned and 0 indicates they did not.
# Project Description
The problem I am trying to solve is predicting customer churn for a telecom company, which is a critical business issue. Churn prediction helps the company identify which customers are likely to leave, enabling the business to take proactive steps to retain them. By identifying high-risk customers, the company can offer targeted promotions, better customer support, or other incentives to reduce churn, ultimately improving customer retention and increasing revenue.
The dataset used for this analysis contains various customer details, such as demographic information, usage patterns, and service-related metrics. The key features in the dataset include Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction. Before conducting the analysis, I prepared the data by handling missing values, either imputing or removing records with excessive missing data. I also encoded categorical variables such as Gender and Subscription Type into dummy variables. For numerical features with large scale differences, such as Total Spend, I standardized the data to ensure that no feature had undue influence on the model due to its scale. Additionally, I checked for outliers in variables like Age and Tenure to ensure they did not skew the results. I also assessed multicollinearity using the Variance Inflation Factor (VIF) and ensured that all predictors had acceptable values below the threshold of 5, meaning multicollinearity was not a concern.
For the analysis, I used logistic regression, focusing on two models. The first model included all predictors: Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction. The second model was a reduced version, which only included the most impactful features: Age, Tenure, Support Calls, Payment Delay, and Total Spend. In both models, I applied logistic regression to predict the likelihood of churn based on customer attributes. I evaluated model performance using metrics like Accuracy, Sensitivity, Specificity, and the Area Under the Curve (AUC), with the AUC serving as a key indicator of how well the model distinguished between churn and non-churn customers.
The purpose of this analysis was to identify the key factors influencing customer churn and to build a reliable predictive model. By understanding which characteristics increase the likelihood of churn, the telecom company can intervene before customers decide to leave. This allows the business to improve retention, allocate resources more effectively, and tailor marketing strategies to high-risk customers, ultimately leading to increased revenue and customer loyalty.
The results from the analysis showed that several factors significantly influence churn. Customers' Age, Support Calls, Tenure, and Total Spend were found to be key predictors of churn. Specifically, older customers and those who made more support calls were more likely to churn. Customers with shorter tenures were also at higher risk, while those with higher total spending were less likely to leave. Both the full model and the reduced model showed reasonable discriminative performance, with AUC values of roughly 0.77, indicating that the models could rank customers who would churn above those who would stay.
The business impact of this analysis is substantial. By identifying customers who are most at risk of churning, the company can focus its retention efforts on those individuals, offering them personalized incentives and improving their experience. The insights gained from this analysis can also help optimize marketing campaigns by targeting high-risk customers with specific offers. Additionally, resources can be allocated more efficiently, concentrating on the customers who need the most attention to reduce churn. Overall, this analysis equips the company with valuable tools to improve customer retention, reduce churn rates, and ultimately enhance profitability.
# Exploratory Data Analysis (EDA)
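The code in this report relies on several packages; a minimal setup chunk, assuming they are all installed, would be:
# Packages used throughout this report
library(dplyr)        # sample_n() and general data manipulation
library(ggplot2)      # histograms of the numeric columns
library(corrplot)     # correlation heatmap
library(car)          # vif() for multicollinearity checks
library(caret)        # confusionMatrix(), preProcess(), rfe(), train()
library(pROC)         # ROC curves and AUC
library(e1071)        # svm()
library(randomForest) # randomForest(), tuneRF()
library(gbm)          # gradient boosting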
# Train Set
train_set <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Final%20Project/customer_churn_dataset-training-master.csv")
# Test Set
test_set <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Final%20Project/customer_churn_dataset-testing-master.csv")
head(train_set)
## CustomerID Age Gender Tenure Usage.Frequency Support.Calls Payment.Delay
## 1 2 30 Female 39 14 5 18
## 2 3 65 Female 49 1 10 8
## 3 4 55 Female 14 4 6 18
## 4 5 58 Male 38 21 7 7
## 5 6 23 Male 32 20 5 8
## 6 8 51 Male 33 25 9 26
## Subscription.Type Contract.Length Total.Spend Last.Interaction Churn
## 1 Standard Annual 932 17 1
## 2 Basic Monthly 557 6 1
## 3 Basic Quarterly 185 3 1
## 4 Standard Monthly 396 29 1
## 5 Basic Monthly 617 20 1
## 6 Premium Annual 129 8 1
To speed up the analysis, I sample only 10,000 rows from each of the train and test sets.
# Set a seed so the sampling is reproducible across runs
set.seed(123)
# Sample 10,000 rows from train_set
train_set <- sample_n(train_set, 10000)
# Sample 10,000 rows from test_set
test_set <- sample_n(test_set, 10000)
# Train Set Summary
summary(train_set)
## CustomerID Age Gender Tenure
## Min. : 59 Min. :18.00 Length:10000 Min. : 1.00
## 1st Qu.:112883 1st Qu.:29.00 Class :character 1st Qu.:17.00
## Median :224271 Median :39.00 Mode :character Median :32.00
## Mean :224777 Mean :39.41 Mean :31.46
## 3rd Qu.:338566 3rd Qu.:48.00 3rd Qu.:46.00
## Max. :449206 Max. :65.00 Max. :60.00
## Usage.Frequency Support.Calls Payment.Delay Subscription.Type
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Length:10000
## 1st Qu.: 8.00 1st Qu.: 1.000 1st Qu.: 6.00 Class :character
## Median :16.00 Median : 3.000 Median :12.00 Mode :character
## Mean :15.74 Mean : 3.572 Mean :12.89
## 3rd Qu.:23.00 3rd Qu.: 6.000 3rd Qu.:19.00
## Max. :30.00 Max. :10.000 Max. :30.00
## Contract.Length Total.Spend Last.Interaction Churn
## Length:10000 Min. : 100.0 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 482.0 1st Qu.: 7.00 1st Qu.:0.0000
## Mode :character Median : 660.0 Median :14.00 Median :1.0000
## Mean : 632.8 Mean :14.56 Mean :0.5683
## 3rd Qu.: 831.3 3rd Qu.:22.00 3rd Qu.:1.0000
## Max. :1000.0 Max. :30.00 Max. :1.0000
# Test Set Summary
summary(test_set)
## CustomerID Age Gender Tenure
## Min. : 14 Min. :18.00 Length:10000 Min. : 1.0
## 1st Qu.:16848 1st Qu.:30.00 Class :character 1st Qu.:18.0
## Median :32872 Median :42.00 Mode :character Median :33.0
## Mean :32681 Mean :42.09 Mean :32.2
## 3rd Qu.:48676 3rd Qu.:54.00 3rd Qu.:47.0
## Max. :64368 Max. :65.00 Max. :60.0
## Usage.Frequency Support.Calls Payment.Delay Subscription.Type
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Length:10000
## 1st Qu.: 7.00 1st Qu.: 3.000 1st Qu.:10.00 Class :character
## Median :15.00 Median : 6.000 Median :19.00 Mode :character
## Mean :15.06 Mean : 5.375 Mean :17.25
## 3rd Qu.:23.00 3rd Qu.: 8.000 3rd Qu.:25.00
## Max. :30.00 Max. :10.000 Max. :30.00
## Contract.Length Total.Spend Last.Interaction Churn
## Length:10000 Min. : 100.0 Min. : 1.0 Min. :0.0000
## Class :character 1st Qu.: 310.0 1st Qu.: 8.0 1st Qu.:0.0000
## Mode :character Median : 531.0 Median :15.0 Median :0.0000
## Mean : 540.0 Mean :15.5 Mean :0.4828
## 3rd Qu.: 768.2 3rd Qu.:23.0 3rd Qu.:1.0000
## Max. :1000.0 Max. :30.0 Max. :1.0000
# Inspect the structure of the train set
str(train_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 267575 285634 189084 257046 213620 94212 86546 272672 26057 411735 ...
## $ Age : int 43 50 40 42 33 61 42 32 34 19 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Tenure : int 58 7 52 39 5 54 12 60 57 29 ...
## $ Usage.Frequency : int 3 23 19 9 29 15 24 8 21 2 ...
## $ Support.Calls : int 2 2 6 7 1 0 7 2 7 0 ...
## $ Payment.Delay : int 20 1 8 13 18 7 28 14 20 14 ...
## $ Subscription.Type: chr "Basic" "Basic" "Standard" "Premium" ...
## $ Contract.Length : chr "Quarterly" "Quarterly" "Quarterly" "Monthly" ...
## $ Total.Spend : num 530 618 692 355 360 ...
## $ Last.Interaction : int 3 7 24 21 5 27 19 16 12 6 ...
## $ Churn : int 0 0 1 1 1 1 1 0 1 0 ...
# Inspect the structure of the test set
str(test_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 8530 48267 30440 6391 25607 38882 33621 49883 2219 52794 ...
## $ Age : int 35 32 41 63 49 47 49 58 53 63 ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Tenure : int 30 55 13 15 4 8 5 24 10 20 ...
## $ Usage.Frequency : int 21 16 12 26 7 20 6 3 28 6 ...
## $ Support.Calls : int 4 5 5 7 6 5 6 10 9 10 ...
## $ Payment.Delay : int 30 29 20 21 20 17 20 12 8 24 ...
## $ Subscription.Type: chr "Standard" "Standard" "Basic" "Premium" ...
## $ Contract.Length : chr "Quarterly" "Monthly" "Quarterly" "Monthly" ...
## $ Total.Spend : int 884 386 439 533 846 732 622 749 809 965 ...
## $ Last.Interaction : int 26 24 29 9 17 29 24 1 24 30 ...
## $ Churn : int 0 1 0 0 0 0 0 1 0 1 ...
# Select only numeric variables
numeric_vars <- train_set %>% select_if(is.numeric)
# Calculate correlation matrix
correlation_matrix <- cor(numeric_vars, use = "complete.obs")
# Plot correlation heatmap
corrplot(correlation_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
main = "Correlation Matrix")
# Identify numeric and categorical columns
numeric_cols <- names(train_set)[sapply(train_set, is.numeric)]
numeric_cols <- numeric_cols[!numeric_cols %in% c("CustomerID", "Churn")]
categorical_cols <- names(train_set)[sapply(train_set, is.character) | sapply(train_set, is.factor)]
# Create distributions for numeric columns
cat("\nPlotting Distributions for Numeric Columns...\n")
##
## Plotting Distributions for Numeric Columns...
for (col in numeric_cols) {
  # aes_string() is deprecated; use tidy evaluation with .data instead
  print(ggplot(train_set, aes(x = .data[[col]])) +
    geom_histogram(binwidth = 10, fill = "blue", color = "white", alpha = 0.7) +
    labs(title = paste("Distribution of", col), x = col, y = "Frequency") +
    theme_minimal())
}
cat("\nCreating Boxplots for Numeric Columns...\n")
##
## Creating Boxplots for Numeric Columns...
for (col in numeric_cols) {
# Create a boxplot for the numeric column
boxplot(
train_set[[col]],
main = paste("Boxplot of", col),
ylab = col,
col = "lightblue",
border = "black",
outline = TRUE # Show outliers
)
}
The dataset arrives essentially clean: only one row with missing values was present in the original train set, and the 10,000-row sample contains none, so no rows needed to be removed.
# Removing missing values is not needed here, since the sampled train set has none
# train_set <- na.omit(train_set)
sum(is.na(train_set))
## [1] 0
# Convert categorical variables to factors
train_set$Gender <- as.factor(train_set$Gender)
train_set$Subscription.Type <- as.factor(train_set$Subscription.Type)
train_set$Contract.Length <- as.factor(train_set$Contract.Length)
train_set$Churn <- as.factor(train_set$Churn)
test_set$Gender <- as.factor(test_set$Gender)
test_set$Subscription.Type <- as.factor(test_set$Subscription.Type)
test_set$Contract.Length <- as.factor(test_set$Contract.Length)
test_set$Churn <- as.factor(test_set$Churn)
# View the structure of the data
str(train_set)
## 'data.frame': 10000 obs. of 12 variables:
## $ CustomerID : int 267575 285634 189084 257046 213620 94212 86546 272672 26057 411735 ...
## $ Age : int 43 50 40 42 33 61 42 32 34 19 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 2 2 ...
## $ Tenure : int 58 7 52 39 5 54 12 60 57 29 ...
## $ Usage.Frequency : int 3 23 19 9 29 15 24 8 21 2 ...
## $ Support.Calls : int 2 2 6 7 1 0 7 2 7 0 ...
## $ Payment.Delay : int 20 1 8 13 18 7 28 14 20 14 ...
## $ Subscription.Type: Factor w/ 3 levels "Basic","Premium",..: 1 1 3 2 2 2 3 1 3 3 ...
## $ Contract.Length : Factor w/ 3 levels "Annual","Monthly",..: 3 3 3 2 1 1 1 1 1 3 ...
## $ Total.Spend : num 530 618 692 355 360 ...
## $ Last.Interaction : int 3 7 24 21 5 27 19 16 12 6 ...
## $ Churn : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
sum(is.na(train_set))
## [1] 0
In this analysis, I built two logistic regression models to predict customer churn. The first model includes nine predictors: Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction, while the second model simplifies the predictors to just Age, Tenure, Support Calls, Payment Delay, and Total Spend. I began by checking for multicollinearity using Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated due to correlations with other predictors. All VIF values were below 5, suggesting that multicollinearity was not a significant issue.
When interpreting the coefficients, I found that Age had a small but significant positive relationship with churn, meaning that older customers are slightly more likely to churn. Gender had a strong negative coefficient for male customers, indicating that men were less likely to churn. Support Calls had a significant positive effect, suggesting that customers who made more support calls were more likely to churn. Subscription Type also played a role, with both Premium and Standard subscriptions showing negative coefficients, meaning customers with these subscriptions were less likely to churn compared to other types.
The second model, which excluded some predictors, showed similar trends. Age and Support Calls remained significant, both increasing the likelihood of churn. Tenure and Total Spend had negative coefficients, while Payment Delay's coefficient was positive: longer tenure and higher spending were associated with lower churn, and longer payment delays with higher churn.
For model evaluation, I looked at several performance metrics. On the test set, the full model's accuracy was 59.73%, with a sensitivity of 23.45% and a specificity of 98.59%. Note that caret treats class 0 (no churn) as the positive class here, so sensitivity is the recall of non-churners and specificity the recall of churners: the model catches nearly all churners but also flags most loyal customers as churn risks. The AUC (Area Under the Curve) was 0.7761, indicating a good ability to rank churners above non-churners. The second model had a similar accuracy of 59.83% and a slightly lower AUC of 0.7675. Although accuracy is moderate, the AUC values suggest the models are reasonably good at discriminating churn, with Support Calls, Age, and Total Spend being key factors.
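To make the positive-class convention concrete, the headline metrics can be reproduced by hand from the full model's test confusion matrix (counts taken from the caret output shown later in this section):
# Reproduce caret's headline metrics from the confusion-matrix counts,
# remembering that class 0 (no churn) is the positive class
tp <- 1213; fn <- 3959  # actual non-churners, split by predicted class
fp <- 68;   tn <- 4760  # actual churners, split by predicted class
accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # 0.5973
sensitivity <- tp / (tp + fn)                   # 0.2345, recall of non-churners
specificity <- tn / (tn + fp)                   # 0.9859, recall of churners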
set.seed(123)
# Train a Logistic Regression Model
logistic_model <- glm(Churn ~ Age + Gender + Tenure + Usage.Frequency + Support.Calls +
Payment.Delay + Subscription.Type + Total.Spend +
Last.Interaction,
data = train_set,
family = binomial,
control = list(maxit = 1000))
# Looking at multicollinearity
vif(logistic_model)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.013470 1 1.006712
## Gender 1.109134 1 1.053154
## Tenure 1.011901 1 1.005933
## Usage.Frequency 1.004468 1 1.002232
## Support.Calls 1.181901 1 1.087153
## Payment.Delay 1.075382 1 1.037006
## Subscription.Type 1.009377 2 1.002336
## Total.Spend 1.146048 1 1.070536
## Last.Interaction 1.098976 1 1.048321
# Summary of the Logistic Regression Model
summary(logistic_model)
##
## Call:
## glm(formula = Churn ~ Age + Gender + Tenure + Usage.Frequency +
## Support.Calls + Payment.Delay + Subscription.Type + Total.Spend +
## Last.Interaction, family = binomial, data = train_set, control = list(maxit = 1000))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1023871 0.1931708 -0.530 0.596089
## Age 0.0332458 0.0027344 12.159 < 2e-16 ***
## GenderMale -1.1422204 0.0668059 -17.098 < 2e-16 ***
## Tenure -0.0079908 0.0018159 -4.401 1.08e-05 ***
## Usage.Frequency -0.0143120 0.0036593 -3.911 9.19e-05 ***
## Support.Calls 0.6911225 0.0170988 40.419 < 2e-16 ***
## Payment.Delay 0.1049944 0.0043333 24.230 < 2e-16 ***
## Subscription.TypePremium -0.2871643 0.0762683 -3.765 0.000166 ***
## Subscription.TypeStandard -0.1840742 0.0762139 -2.415 0.015725 *
## Total.Spend -0.0056174 0.0001704 -32.958 < 2e-16 ***
## Last.Interaction 0.0586276 0.0038866 15.085 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675.8 on 9999 degrees of freedom
## Residual deviance: 6534.7 on 9989 degrees of freedom
## AIC: 6556.7
##
## Number of Fisher Scoring iterations: 6
Next, I fit a reduced logistic regression model containing five of the strongest predictors of churn: Age, Tenure, Support Calls, Payment Delay, and Total Spend. Since every predictor in the full model was statistically significant, the goal of this reduction was parsimony: a simpler model focused on the most impactful drivers, which is easier to act on for churn prediction and retention strategies.
# Refitting the model with a reduced set of the strongest predictors
significant_model <- glm(Churn ~ Age + Tenure + Support.Calls + Payment.Delay + Total.Spend,
data = train_set,
family = binomial,
control = list(maxit = 1000))
# Looking at multicollinearity
vif(significant_model)
## Age Tenure Support.Calls Payment.Delay Total.Spend
## 1.009464 1.002272 1.143582 1.066958 1.126317
# Summary of the new model
summary(significant_model)
##
## Call:
## glm(formula = Churn ~ Age + Tenure + Support.Calls + Payment.Delay +
## Total.Spend, family = binomial, data = train_set, control = list(maxit = 1000))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3584532 0.1610753 -2.225 0.026056 *
## Age 0.0320370 0.0026172 12.241 < 2e-16 ***
## Tenure -0.0064212 0.0017366 -3.698 0.000218 ***
## Support.Calls 0.6675793 0.0162171 41.165 < 2e-16 ***
## Payment.Delay 0.1030072 0.0041463 24.843 < 2e-16 ***
## Total.Spend -0.0054530 0.0001621 -33.647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13675.8 on 9999 degrees of freedom
## Residual deviance: 7026.5 on 9994 degrees of freedom
## AIC: 7038.5
##
## Number of Fisher Scoring iterations: 6
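Logistic regression coefficients are on the log-odds scale; exponentiating turns them into odds ratios, which are easier to interpret. For example, each additional support call multiplies the odds of churn by about exp(0.668) ≈ 1.95:
# Coefficients are log-odds; exponentiate to read them as odds ratios
round(exp(coef(significant_model)), 3)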
# Predict the probability of churn (0 or 1) on the test set
predictions_prob <- predict(logistic_model, test_set, type = "response")
sig_predictions_prob <- predict(significant_model, test_set, type = "response")
# Convert probabilities to 0 or 1 based on a threshold (e.g., 0.5)
predictions <- ifelse(predictions_prob > 0.5, 1, 0)
sig_predictions <- ifelse(sig_predictions_prob > 0.5, 1, 0)
# Evaluate the Performance of the model
conf_matrix <- confusionMatrix(factor(predictions), factor(test_set$Churn))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1213 68
## 1 3959 4760
##
## Accuracy : 0.5973
## 95% CI : (0.5876, 0.6069)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2147
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.2345
## Specificity : 0.9859
## Pos Pred Value : 0.9469
## Neg Pred Value : 0.5459
## Prevalence : 0.5172
## Detection Rate : 0.1213
## Detection Prevalence : 0.1281
## Balanced Accuracy : 0.6102
##
## 'Positive' Class : 0
##
# Evaluate the Performance of the significant model
sig_conf_matrix <- confusionMatrix(factor(sig_predictions), factor(test_set$Churn))
print(sig_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1253 98
## 1 3919 4730
##
## Accuracy : 0.5983
## 95% CI : (0.5886, 0.6079)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2163
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.2423
## Specificity : 0.9797
## Pos Pred Value : 0.9275
## Neg Pred Value : 0.5469
## Prevalence : 0.5172
## Detection Rate : 0.1253
## Detection Prevalence : 0.1351
## Balanced Accuracy : 0.6110
##
## 'Positive' Class : 0
##
# ROC Curve and AUC
roc_curve <- roc(test_set$Churn, predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve", col = "blue")
auc(roc_curve)
## Area under the curve: 0.7761
# ROC Curve and AUC for the significant model
sig_roc_curve <- roc(test_set$Churn, sig_predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(sig_roc_curve, main = "ROC Curve for Significant Model", col = "red")
auc(sig_roc_curve)
## Area under the curve: 0.7675
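Given the lopsided sensitivity and specificity at the default 0.5 cutoff, one optional refinement, not applied in this report, is to choose the probability threshold that maximizes Youden's J statistic directly from the ROC object:
# Optional: find the cutoff maximizing sensitivity + specificity - 1
# instead of using the default 0.5 threshold
best_cut <- coords(roc_curve, "best", best.method = "youden",
                   ret = c("threshold", "sensitivity", "specificity"))
print(best_cut)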
The next set of models uses a Support Vector Machine (SVM) to classify whether a customer will churn based on features such as Age, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Contract Length, Total Spend, and Last Interaction. Before that, it is worth recapping the logistic regression baseline. The full model, in which Age, Gender, Tenure, Usage Frequency, Support Calls, Payment Delay, Subscription Type, Total Spend, and Last Interaction were all significant predictors, reached a test accuracy of approximately 59.73%; with class 0 (no churn) as the positive class, sensitivity was 23.45% and specificity 98.59%, meaning the model caught nearly all churners while misclassifying most non-churners as churn risks. Its AUC of 0.7761 indicates decent overall discrimination, though with room for improvement. The reduced model with only five predictors (Age, Tenure, Support Calls, Payment Delay, and Total Spend) provided a similar accuracy of 59.83% and a balanced accuracy of 61.10%, further validating the importance of these features. Despite the moderate performance, the analysis indicates that Support Calls, Payment Delay, and Total Spend are key factors influencing churn, and the models could potentially be improved by fine-tuning hyperparameters or applying other techniques like cross-validation.
RFE (Recursive Feature Elimination) repeatedly removes the weakest features and rebuilds the model to identify the most informative subset. Here is how it can be applied to select the best features:
# Preprocess: center and scale the predictors (scaling parameters are learned without the Churn column)
preProcess_data <- preProcess(train_set[, -which(names(train_set) == "Churn")],
method = c("center", "scale"))
train_scaled <- predict(preProcess_data, train_set)
test_scaled <- predict(preProcess_data, test_set)
# RFE feature selection
control <- rfeControl(functions=rfFuncs, method="cv", number=10) # Using Random Forest with 10-fold cross-validation
rfe_result <- rfe(train_scaled[, -which(names(train_scaled) == "Churn")],
train_scaled$Churn,
sizes=c(1:10), # Adjust the size range to your dataset's context
rfeControl=control)
# Get the best features selected by RFE
best_features <- rfe_result$optVariables
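Before training the SVMs, it can be worth inspecting which subset size the RFE search preferred; caret provides print and plot methods for rfe objects:
# Inspect the RFE results: accuracy for each subset size and the winning variables
print(rfe_result)
plot(rfe_result, type = c("g", "o"))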
# Train SVM models with different kernels using the selected features
# Linear Kernel
svm_linear_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "linear",
cost = 100,
scale = TRUE)
# Radial Kernel
svm_radial_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "radial",
cost = 100,
scale = TRUE)
# Polynomial Kernel
svm_polynomial_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "polynomial",
cost = 100,
scale = TRUE)
# Sigmoid Kernel
svm_sigmoid_model <- svm(Churn ~ .,
data = train_scaled[, c(best_features, "Churn")],
kernel = "sigmoid",
cost = 100,
scale = TRUE)
# Predict on Test Set using the selected features for each kernel
# Linear Kernel Predictions
svm_linear_predictions <- predict(svm_linear_model, test_scaled[, best_features])
# Radial Kernel Predictions
svm_radial_predictions <- predict(svm_radial_model, test_scaled[, best_features])
# Polynomial Kernel Predictions
svm_polynomial_predictions <- predict(svm_polynomial_model, test_scaled[, best_features])
# Sigmoid Kernel Predictions
svm_sigmoid_predictions <- predict(svm_sigmoid_model, test_scaled[, best_features])
# Evaluate Performance for each kernel
cat("Performance for SVM with Linear Kernel:\n")
## Performance for SVM with Linear Kernel:
print(confusionMatrix(svm_linear_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 5172 4828
##
## Accuracy : 0.4828
## 95% CI : (0.473, 0.4926)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.4828
## Prevalence : 0.5172
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Radial Kernel:\n")
##
## Performance for SVM with Radial Kernel:
print(confusionMatrix(svm_radial_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 224 7
## 1 4948 4821
##
## Accuracy : 0.5045
## 95% CI : (0.4947, 0.5143)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.9946
##
## Kappa : 0.0405
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.04331
## Specificity : 0.99855
## Pos Pred Value : 0.96970
## Neg Pred Value : 0.49350
## Prevalence : 0.51720
## Detection Rate : 0.02240
## Detection Prevalence : 0.02310
## Balanced Accuracy : 0.52093
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Polynomial Kernel:\n")
##
## Performance for SVM with Polynomial Kernel:
print(confusionMatrix(svm_polynomial_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 203 5
## 1 4969 4823
##
## Accuracy : 0.5026
## 95% CI : (0.4928, 0.5124)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.9983
##
## Kappa : 0.037
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.03925
## Specificity : 0.99896
## Pos Pred Value : 0.97596
## Neg Pred Value : 0.49254
## Prevalence : 0.51720
## Detection Rate : 0.02030
## Detection Prevalence : 0.02080
## Balanced Accuracy : 0.51911
##
## 'Positive' Class : 0
##
cat("\nPerformance for SVM with Sigmoid Kernel:\n")
##
## Performance for SVM with Sigmoid Kernel:
print(confusionMatrix(svm_sigmoid_predictions, test_scaled$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 5172 4828
##
## Accuracy : 0.4828
## 95% CI : (0.473, 0.4926)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.4828
## Prevalence : 0.5172
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
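All four SVM fits above used a fixed cost of 100. A natural next step, not run in this report, is e1071's tune() helper, which grid-searches hyperparameters with built-in 10-fold cross-validation; the grid values below are illustrative, and this is slow on 10,000 rows:
# Illustrative grid search over cost and gamma for the radial kernel,
# using e1071's built-in cross-validation
svm_tune <- tune(svm, Churn ~ .,
                 data = train_scaled[, c(best_features, "Churn")],
                 kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100),
                               gamma = c(0.01, 0.1, 1)))
summary(svm_tune)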
In my analysis, I applied a random forest model to predict customer churn, first training it on the full set of features. This produced an out-of-bag (OOB) error rate of just 0.45% on the training data, yet test-set performance collapsed to an accuracy of 49.22%. With class 0 (no churn) as the positive class, specificity was 1.00000 but sensitivity only 0.01817: the model predicted churn for almost every test customer. The gulf between near-perfect OOB error and near-chance test accuracy suggests the model memorized the training sample rather than learning generalizable patterns, most likely helped by CustomerID, an arbitrary identifier that nonetheless dominated the importance rankings.
Next, I retrained the model on the five features with the highest mean decrease in accuracy: CustomerID, Total.Spend, Support.Calls, Contract.Length, and Age. The OOB error rate rose slightly to 0.94%, while on the test set sensitivity fell to 0.007347 with a detection rate of 0.003800 and an accuracy of 0.4866, confirming that the strong training fit did not carry over.
To push the model toward the churn class, I applied class weights, assigning a higher weight to churn. This produced an OOB error rate of 0.95%, but the test-set confusion matrix was essentially identical to the unweighted top-five model (sensitivity 0.007347), so reweighting alone did not change the model's tendency to predict churn for nearly everyone.
In conclusion, the random forest runs highlight which attributes the trees lean on, but test performance was poor. Dropping leakage-prone columns such as CustomerID, accounting for the distribution differences between the train and test sets, and applying resampling or feature engineering are the most promising ways to improve sensitivity and detection rates.
# Train Random Forest model on all columns (note: the formula includes the scaled CustomerID)
rf_model <- randomForest(Churn ~ ., data = train_scaled, importance = TRUE)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.45%
## Confusion matrix:
## 0 1 class.error
## 0 4272 45 0.01042391
## 1 0 5683 0.00000000
# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_scaled)
# Evaluate performance using confusion matrix
library(caret)
confusionMatrix(rf_predictions, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 0
## 1 5078 4828
##
## Accuracy : 0.4922
## 95% CI : (0.4824, 0.502)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0176
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01817
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48738
## Prevalence : 0.51720
## Detection Rate : 0.00940
## Detection Prevalence : 0.00940
## Balanced Accuracy : 0.50909
##
## 'Positive' Class : 0
##
# Feature importance table (varImpPlot below draws the plot)
importance(rf_model)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 89.688056 39.417032 63.976439 2892.609677
## Age 16.120199 37.230447 18.120451 233.390907
## Gender 9.638174 27.623485 11.738817 39.207900
## Tenure 3.222604 1.787152 3.493085 13.696222
## Usage.Frequency 2.270625 0.434517 1.982608 10.297279
## Support.Calls 15.414775 60.376977 19.706360 669.872441
## Payment.Delay 15.182943 45.717833 17.529104 222.951377
## Subscription.Type 2.240037 2.482856 3.247690 3.754543
## Contract.Length 16.857257 43.808748 18.751483 268.335506
## Total.Spend 19.071894 53.296989 21.515974 490.249717
## Last.Interaction 9.219669 18.664482 10.333623 62.248186
varImpPlot(rf_model)
# Tune hyperparameters
tuned_rf <- tuneRF(train_scaled[, -which(names(train_scaled) == "Churn")],
train_scaled$Churn,
stepFactor = 1.5,
improve = 0.01,
trace = TRUE)
## mtry = 3 OOB error = 0.49%
## Searching left ...
## mtry = 2 OOB error = 0.51%
## -0.04081633 0.01
## Searching right ...
## mtry = 4 OOB error = 0.38%
## 0.2244898 0.01
## mtry = 6 OOB error = 0.42%
## -0.1052632 0.01
# Re-display the baseline model's confusion matrix for comparison
# (tuneRF above only explores mtry; rf_predictions are unchanged)
confusionMatrix(rf_predictions, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 0
## 1 5078 4828
##
## Accuracy : 0.4922
## 95% CI : (0.4824, 0.502)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0176
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01817
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48738
## Prevalence : 0.51720
## Detection Rate : 0.00940
## Detection Prevalence : 0.00940
## Balanced Accuracy : 0.50909
##
## 'Positive' Class : 0
##
Now that I have a baseline model, I will rebuild it using the best subset of features.
# Create a data frame of feature importance
importance_scores <- data.frame(
Feature = rownames(importance(rf_model)),
MeanDecreaseAccuracy = importance(rf_model)[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = importance(rf_model)[, "MeanDecreaseGini"]
)
# Sort by MeanDecreaseAccuracy (or MeanDecreaseGini) to select the most important features
top_features <- importance_scores[order(-importance_scores$MeanDecreaseAccuracy), "Feature"]
# Select the top 5 features (You can adjust this number)
top_5_features <- top_features[1:5]
# Train Random Forest model using the top 5 features
rf_model_top <- randomForest(Churn ~ .,
data = train_scaled[, c(top_5_features, "Churn")],
importance = TRUE)
# Print the model summary
print(rf_model_top)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled[, c(top_5_features, "Churn")], importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.94%
## Confusion matrix:
## 0 1 class.error
## 0 4245 72 0.016678249
## 1 22 5661 0.003871195
# Make predictions on the test data using the top 5 features
rf_predictions_top <- predict(rf_model_top, newdata = test_scaled[, top_5_features])
# Evaluate performance using confusion matrix
library(caret)
confusionMatrix(rf_predictions_top, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 0
## 1 5134 4828
##
## Accuracy : 0.4866
## 95% CI : (0.4768, 0.4964)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0071
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.007347
## Specificity : 1.000000
## Pos Pred Value : 1.000000
## Neg Pred Value : 0.484642
## Prevalence : 0.517200
## Detection Rate : 0.003800
## Detection Prevalence : 0.003800
## Balanced Accuracy : 0.503674
##
## 'Positive' Class : 0
##
# Feature importance table for the top-five model
importance(rf_model_top)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 495.82597 46.64745 103.75761 3222.8634
## Total.Spend 16.79722 48.61767 18.44506 502.7290
## Support.Calls 14.61203 64.95974 17.18003 730.8382
## Contract.Length 15.67537 39.89741 16.97312 274.1084
## Age 11.79699 37.76864 13.54505 171.0931
varImpPlot(rf_model_top)
# Train Random Forest model with class weights to push predictions toward churn
rf_model_balanced <- randomForest(Churn ~ .,
                                  data = train_scaled[, c(top_5_features, "Churn")],
                                  importance = TRUE,
                                  classwt = c(0.2, 0.8)) # Weight 0.2 for class 0, 0.8 for the churn class
# Print the model summary
print(rf_model_balanced)
##
## Call:
## randomForest(formula = Churn ~ ., data = train_scaled[, c(top_5_features, "Churn")], importance = TRUE, classwt = c(0.2, 0.8))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.95%
## Confusion matrix:
## 0 1 class.error
## 0 4242 75 0.017373176
## 1 20 5663 0.003519268
# Make predictions on the test data
rf_predictions_balanced <- predict(rf_model_balanced, newdata = test_scaled[, top_5_features])
# Evaluate performance using confusion matrix
confusionMatrix(rf_predictions_balanced, test_scaled$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 0
## 1 5134 4828
##
## Accuracy : 0.4866
## 95% CI : (0.4768, 0.4964)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0071
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.007347
## Specificity : 1.000000
## Pos Pred Value : 1.000000
## Neg Pred Value : 0.484642
## Prevalence : 0.517200
## Detection Rate : 0.003800
## Detection Prevalence : 0.003800
## Balanced Accuracy : 0.503674
##
## 'Positive' Class : 0
##
# Feature importance table for the weighted model
importance(rf_model_balanced)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## CustomerID 496.33662 50.27460 107.33262 2385.9252
## Total.Spend 17.37626 49.00818 19.12205 236.6473
## Support.Calls 11.84786 68.11028 14.73909 323.3242
## Contract.Length 16.21887 40.26281 17.45866 133.9138
## Age 13.01834 38.19712 14.70748 116.2204
varImpPlot(rf_model_balanced)
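Class weights barely moved the test-set results. An alternative I did not run is to force each tree's bootstrap sample to be class-balanced via randomForest's strata and sampsize arguments; a minimal sketch, where the equal-draw choice is an assumption rather than something tried above:
# Draw an equal number of rows from each class for every tree
n_min <- min(table(train_scaled$Churn))
rf_strat <- randomForest(Churn ~ .,
                         data = train_scaled[, c(top_5_features, "Churn")],
                         strata = train_scaled$Churn,
                         sampsize = c(n_min, n_min),
                         importance = TRUE)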
For my Gradient Boosting Machine (GBM) model, I first set up 10-fold cross-validation for model training using the trainControl function from the caret package. The goal was to evaluate the model's performance with more reliable validation. I used this control setup to train the model, suppressing verbose output to keep the results clean.
Once the model was trained, I moved on to making predictions on the test set. The predict function was applied to the test data, and I evaluated the model's performance by calculating the confusion matrix with confusionMatrix, which reports accuracy, sensitivity, specificity, and other important metrics. I also plotted the ROC curve using the pROC package, which is helpful for evaluating the model across different thresholds, and calculated the AUC (Area Under the Curve) to gauge how well the model distinguishes between the classes.
Next, I removed the CustomerID column from both the train and test sets, as an identifier provides no meaningful contribution to the prediction. I also converted the Churn variable to a numeric binary variable, as required by the gbm package.
I then trained the GBM model again with specific parameters: n.trees = 500, interaction.depth = 3, and shrinkage = 0.01. These settings help control the complexity of the model and mitigate overfitting. I used 5-fold cross-validation and restricted training to a single core.
After training, I evaluated the model's performance once again using a confusion matrix. The accuracy was 53.41%, slightly above the no-information rate of 51.72%. With class 0 (no churn) as the positive class, sensitivity was 10.09% and specificity 99.81%: the model identified nearly all churners but correctly recognized only about a tenth of non-churners. No ROC curve was computed for this refit, but the caret-tuned model's AUC of 0.804 suggests reasonable ranking ability despite the poor thresholded accuracy.
The final step involved analyzing the model's feature importance, which provides insight into which variables were most influential in the prediction. The relative-influence summary showed that Support.Calls, Total.Spend, and Contract.Length were the top predictors.
Overall, the model's thresholded performance was modest, and there is clear room for improvement, especially in correctly identifying the non-churn class. I might experiment with further tuning the hyperparameters, addressing the train/test mismatch, or trying other models to improve performance.
# Set up cross-validation for model training
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the GBM model using caret's train function
gbm_model <- train(Churn ~ ., data = train_set, method = "gbm",
trControl = control,
verbose = FALSE) # Suppress verbose output
# Print the model results
print(gbm_model)
## Stochastic Gradient Boosting
##
## 10000 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 9000, 9000, 8999, 9000, 8999, 9000, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9908999 0.9814062
## 1 100 0.9908999 0.9814062
## 1 150 0.9908999 0.9814062
## 2 50 0.9908999 0.9814062
## 2 100 0.9914998 0.9826345
## 2 150 0.9936999 0.9871357
## 3 50 0.9909999 0.9816112
## 3 100 0.9934995 0.9867281
## 3 150 0.9947997 0.9893866
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Predict on Test Set
gbm_predictions <- predict(gbm_model, newdata = test_set)
# Evaluate the performance using confusion matrix
conf_matrix <- confusionMatrix(gbm_predictions, test_set$Churn)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 0
## 1 5084 4828
##
## Accuracy : 0.4916
## 95% CI : (0.4818, 0.5014)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0164
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.01701
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.48709
## Prevalence : 0.51720
## Detection Rate : 0.00880
## Detection Prevalence : 0.00880
## Balanced Accuracy : 0.50851
##
## 'Positive' Class : 0
##
# Plot ROC Curve
library(pROC)
gbm_prob <- predict(gbm_model, newdata = test_set, type = "prob")[,2] # Get probabilities
roc_curve <- roc(test_set$Churn, gbm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for GBM", col = "red")
print(paste("AUC: ", auc(roc_curve)))
## [1] "AUC: 0.804496809344306"
# Remove 'CustomerID' column from both train and test sets
train_no_customerid <- train_set[, -which(names(train_set) == "CustomerID")]
test_no_customerid <- test_set[, -which(names(test_set) == "CustomerID")]
# Convert 'Churn' to numeric binary for both datasets
train_no_customerid$Churn <- as.numeric(as.factor(train_no_customerid$Churn)) - 1
test_no_customerid$Churn <- as.numeric(as.factor(test_no_customerid$Churn)) - 1
# Train the Gradient Boosting Machine model
gbm_model <- gbm(Churn ~ .,
data = train_no_customerid,
distribution = "bernoulli",
n.trees = 500,
interaction.depth = 3,
shrinkage = 0.01,
cv.folds = 5,
n.cores = 1,
verbose = TRUE)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.3533 nan 0.0100 0.0071
## 2 1.3394 nan 0.0100 0.0070
## 3 1.3255 nan 0.0100 0.0069
## 4 1.3119 nan 0.0100 0.0068
## 5 1.2985 nan 0.0100 0.0067
## 6 1.2854 nan 0.0100 0.0065
## 7 1.2727 nan 0.0100 0.0064
## 8 1.2601 nan 0.0100 0.0063
## 9 1.2476 nan 0.0100 0.0062
## 10 1.2354 nan 0.0100 0.0061
## 20 1.1242 nan 0.0100 0.0051
## 40 0.9482 nan 0.0100 0.0038
## 60 0.8174 nan 0.0100 0.0030
## 80 0.7137 nan 0.0100 0.0023
## 100 0.6325 nan 0.0100 0.0017
## 120 0.5649 nan 0.0100 0.0015
## 140 0.5085 nan 0.0100 0.0013
## 160 0.4627 nan 0.0100 0.0009
## 180 0.4248 nan 0.0100 0.0009
## 200 0.3931 nan 0.0100 0.0008
## 220 0.3659 nan 0.0100 0.0005
## 240 0.3428 nan 0.0100 0.0005
## 260 0.3214 nan 0.0100 0.0005
## 280 0.3033 nan 0.0100 0.0004
## 300 0.2864 nan 0.0100 0.0002
## 320 0.2700 nan 0.0100 0.0004
## 340 0.2559 nan 0.0100 0.0003
## 360 0.2423 nan 0.0100 0.0002
## 380 0.2301 nan 0.0100 0.0003
## 400 0.2188 nan 0.0100 0.0003
## 420 0.2083 nan 0.0100 0.0002
## 440 0.1988 nan 0.0100 0.0002
## 460 0.1903 nan 0.0100 0.0002
## 480 0.1826 nan 0.0100 0.0001
## 500 0.1752 nan 0.0100 0.0002
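Because the model was fit with cv.folds = 5, gbm.perf can report the cross-validation-optimal number of boosting iterations; this refinement was not applied in the original run:
# Cross-validation estimate of the optimal number of boosting iterations
best_iter <- gbm.perf(gbm_model, method = "cv")
print(best_iter)
The prediction call below could then pass n.trees = best_iter rather than the full 500.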
# Print the summary of the model
summary(gbm_model)
## var rel.inf
## Support.Calls Support.Calls 38.5707820
## Total.Spend Total.Spend 24.3989000
## Contract.Length Contract.Length 12.2535538
## Age Age 11.8358052
## Payment.Delay Payment.Delay 11.6733543
## Last.Interaction Last.Interaction 0.7565079
## Gender Gender 0.5110968
## Tenure Tenure 0.0000000
## Usage.Frequency Usage.Frequency 0.0000000
## Subscription.Type Subscription.Type 0.0000000
# Predict on the test set
gbm_predictions <- predict(gbm_model, test_no_customerid, n.trees = gbm_model$n.trees, type = "response")
# Convert continuous predictions to binary outcomes
gbm_predictions_binary <- ifelse(gbm_predictions > 0.5, 1, 0)
# Evaluate the model's performance
cat("Performance for Gradient Boosting Machine (GBM):\n")
## Performance for Gradient Boosting Machine (GBM):
print(confusionMatrix(factor(gbm_predictions_binary), factor(test_no_customerid$Churn)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 522 9
## 1 4650 4819
##
## Accuracy : 0.5341
## 95% CI : (0.5243, 0.5439)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : 0.0003711
##
## Kappa : 0.096
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.1009
## Specificity : 0.9981
## Pos Pred Value : 0.9831
## Neg Pred Value : 0.5089
## Prevalence : 0.5172
## Detection Rate : 0.0522
## Detection Prevalence : 0.0531
## Balanced Accuracy : 0.5495
##
## 'Positive' Class : 0
##
# Plot the partial dependence of one predictor (plot.gbm shows marginal effects;
# relative influence was already reported by summary() above)
plot(gbm_model, i.var = 1) # Adjust the variable index as needed
After looking at the results, I can see that the model performed okay, but not as well as I hoped. The reduced model, which included only the key factors Age, Tenure, Support Calls, Payment Delay, and Total Spend, had a test accuracy of about 59.8% with an AUC of 0.77. While this is a decent starting point for predicting churn, the results show there are some reasons why the model didn't perform better. One major reason could be the dataset itself. While the data contains useful features, there may be other important drivers of churn that we didn't capture, such as customer behavior or external factors like competitor offers. These could have influenced churn but were not included in the model.
Another possible reason is the mismatch between the train and test distributions: churners make up about 57% of the sampled train set but only 48% of the test set, and the test set also shows noticeably higher average Support Calls and Payment Delay. A model fit to one distribution will struggle on the other, which can inflate false alarms and depress accuracy. Resampling techniques, such as oversampling churn cases or undersampling non-churn cases, or reweighting the loss, may still help, but aligning the training data with the population the model will score matters more here.
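If resampling is attempted, caret ships simple helpers for it. A hypothetical rebalancing step before refitting, where downsampling is just one of several options, might look like:
# Hypothetical: downsample the majority class so both classes have equal counts
# before refitting; caret::downSample returns a data frame with the outcome
# stored under the name given by yname
balanced_train <- downSample(x = train_set[, setdiff(names(train_set), "Churn")],
                             y = train_set$Churn,
                             yname = "Churn")
table(balanced_train$Churn)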
Also, I might not have gotten the most out of the more complex models. Random forest, SVM, and GBM were all tried, and although the tree-based models fit the training data almost perfectly, they generalized worse than logistic regression on the test set, a classic sign of overfitting, likely aggravated by including CustomerID as a predictor. More careful tuning, regularization, or alternatives like XGBoost might capture complex patterns without memorizing the training sample.
The choice of features used in the model also impacted performance. Some features, like Gender and Subscription Type, were left out of the reduced model. These might still have valuable information that could help improve predictions. Trying different feature selection methods or creating new features could help improve the model.
Lastly, there may be room to improve the data cleaning and preprocessing steps. I handled missing values and encoded categorical variables, but I did not try more advanced imputation methods or alternative ways of scaling the data. And although the VIF checks showed little multicollinearity among the predictors, regularization or dimensionality reduction could still help the models generalize.
In conclusion, while the model gave us some insights, there are several reasons why it didn’t perform as well as expected. Improving how we handle class imbalance, using more advanced models, including additional features, and refining the data preprocessing steps are some areas to explore to improve the model’s performance.