Customer Churn Prediction

1 Introduction
2 Data Loading
3 Understanding Key Variables and Handling Missing Values
4 Exploratory Data Analysis (EDA)
5 Data Preparation for Machine Learning
6 Data Splitting
7 Model Selection and Training with Gradient Boosting
8 Model Evaluation

1 Introduction

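Building a machine learning model in R follows a structured workflow: load the data, handle missing values, explore the data, prepare the features, split the data, train a model, and evaluate it.
We will use the Telco customer churn dataset, a widely used example dataset for churn prediction.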

# Install the necessary packages if you haven't already
# install.packages("xgboost")
# install.packages("caret")   # For data splitting and training
# install.packages("dplyr")   # For data manipulation
# install.packages("ggplot2") # For visualization
# install.packages("pROC")    # For ROC curves (loaded in the evaluation section)

# Load the libraries
library(xgboost)
library(caret)
library(dplyr)
library(ggplot2)

1.0.1 Explanation:

xgboost: implements the gradient boosting algorithm.
caret: data splitting, model training, and evaluation.
dplyr: data manipulation.
ggplot2: data visualization.

2 Data Loading

# Load the dataset
telco_data <- read.csv("https://raw.githubusercontent.com/DataGuy-Kariuki/Customer.Churn-Project/refs/heads/main/Telco-Customer-Churn.csv")

# View the first few rows and summary statistics
head(telco_data)
str(telco_data)
summary(telco_data)

telco_data <- telco_data[, -1]  # Remove the first column (likely an ID column)

2.0.1 Explanation:

The Telco customer churn dataset is read directly from a URL into a data frame.
head() displays the first few rows of the dataset.
str() gives an overview of the structure, including variable types and dimensions.
summary() gives summary statistics for each variable, which helps in understanding the data's characteristics.
Finally, the first column is removed, since it holds an ID or index that is not needed for analysis.

3 Understanding Key Variables and Handling Missing Values

# Handling missing values
# Go through as.character() so the conversion is also correct if the column was read as a factor
telco_data$TotalCharges <- as.numeric(as.character(telco_data$TotalCharges))
telco_data$TotalCharges[is.na(telco_data$TotalCharges)] <- median(telco_data$TotalCharges, na.rm = TRUE)
sum(is.na(telco_data$TotalCharges))  # Should return 0: no missing values remain

3.0.1 Explanation:

The dataset's variables cover customer demographics, account information, service information, and charges.
TotalCharges is converted to numeric, and its missing values are replaced with the column median, so that the column is complete before modeling.
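
As a small illustration of the imputation step, the sketch below applies the same coerce-then-impute pattern to a made-up character vector. The values are hypothetical and not taken from the Telco data. Blank entries become NA when coerced to numeric and are then replaced by the median of the observed values.

```r
# Hypothetical raw values; blank strings stand in for missing charges
raw_charges <- c("29.85", " ", "108.15", "", "1889.50")

# Blank strings cannot be parsed as numbers, so they become NA
charges <- suppressWarnings(as.numeric(raw_charges))
n_missing <- sum(is.na(charges))

# Replace each NA with the median of the observed values
charges[is.na(charges)] <- median(charges, na.rm = TRUE)

print(n_missing)
print(charges)
```

The median is used rather than the mean because it is less affected by a few very large charges.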

4 Exploratory Data Analysis (EDA)

# EDA Using ggplot2
# Histogram of tenure
ggplot(telco_data, aes(x= tenure)) +
  geom_histogram(fill = "skyblue", bins = 30) +
  labs(title = "Distribution of Tenure", x = "Tenure", y = "Count")

# Histogram for MonthlyCharges
ggplot(telco_data, aes(x= MonthlyCharges)) +
  geom_histogram(fill = "salmon", bins = 30) +
  labs(title = "Distribution of Monthly Charges", x = "Monthly Charges", y = "Count")

# Histogram of the TotalCharges
ggplot(telco_data, aes(x= TotalCharges)) +
  geom_histogram(fill = "lightgreen", bins = 30) +
  labs(title = "Distribution of Total Charges", x = "Total Charges", y = "Count")

# Categorical Variables vs. Target (Churn)
# Churn by gender
ggplot(telco_data, aes(x = gender, fill = factor(Churn))) +
  geom_bar(position = "fill") + 
  labs(title = "Churn Rate by Gender", x = "Gender", y = "Proportion") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("No" = "skyblue", "Yes" = "salmon"), name = "Churn")

# Churn by Contract Type
ggplot(telco_data, aes(x = Contract, fill = factor(Churn))) +
  geom_bar(position = "fill") +
  labs(title = "Churn by Contract Type", x = "Contract Type", y = "Proportion")

4.0.1 Explanation:

This section explores the dataset's distributions and the relationships between predictors and churn.
Three histograms show the distributions of tenure, MonthlyCharges, and TotalCharges, which helps reveal skewness and outliers.
The bar plots of gender and Contract against Churn show how the churn rate differs across a demographic attribute and across contract types. With position = "fill", each bar is rescaled to the same height, so the colored segments show the proportion of churners within each category rather than raw counts.
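
The proportions that position = "fill" draws can also be computed as numbers. The sketch below is a minimal example on a hypothetical toy data frame (not the Telco data), using base R's table() and prop.table() to get the share of churners within each contract type.

```r
# Hypothetical toy data, not the Telco dataset
toy <- data.frame(
  Contract = c("Month-to-month", "Month-to-month", "Month-to-month",
               "Month-to-month", "One year", "One year", "Two year", "Two year"),
  Churn    = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No")
)

# Cross-tabulate counts, then normalize each row so it sums to 1
counts <- table(toy$Contract, toy$Churn)
churn_share <- prop.table(counts, margin = 1)

print(round(churn_share, 2))
```

Each row of churn_share corresponds to one bar in the filled bar plot, and the "Yes" column is the churn rate for that category.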

5 Data Preparation for Machine Learning

# Convert relevant columns to factors
categorical_vars <- c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")
telco_data[categorical_vars] <- lapply(telco_data[categorical_vars], as.factor)

# Convert Churn to numeric (0 and 1)
telco_data$Churn <- ifelse(telco_data$Churn == "Yes", 1, 0)

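# Normalize/Scale numerical features
# scale() returns a one-column matrix; as.numeric() keeps the columns as plain vectors
telco_data$tenure <- as.numeric(scale(telco_data$tenure))
telco_data$MonthlyCharges <- as.numeric(scale(telco_data$MonthlyCharges))
telco_data$TotalCharges <- as.numeric(scale(telco_data$TotalCharges))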

5.0.1 Explanation:

The categorical variables are converted to factors so that model.matrix() can later expand them into dummy (one-hot) columns, since xgboost requires a numeric matrix.
Churn is recoded as 0 and 1, the label format expected by the binary:logistic objective.
The numerical features (tenure, MonthlyCharges, TotalCharges) are standardized to mean 0 and standard deviation 1. Tree-based models such as gradient boosting are insensitive to this kind of rescaling, so the step matters mainly if the same data is later fed to scale-sensitive models.
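
To make the effect of these steps concrete, here is a minimal sketch on a hypothetical toy data frame (the values are made up). It standardizes a numeric column and then uses model.matrix() to expand a factor into dummy columns, the same mechanism the training section relies on.

```r
# Hypothetical toy data with one factor and one numeric column
toy <- data.frame(
  Contract = factor(c("Month-to-month", "One year", "Two year", "One year")),
  tenure   = c(1, 12, 48, 24)
)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
toy$tenure <- as.numeric(scale(toy$tenure))

# "- 1" removes the intercept, so the factor gets one column per level
design <- model.matrix(~ . - 1, data = toy)

print(colnames(design))
print(design)
```

Each row has exactly one 1 among the Contract columns, and the tenure column keeps its standardized values.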

6 Data Splitting

# Data Splitting
# Set a seed for reproducibility
set.seed(123)

# Split data into 70% training and 30% for test and validation
# Passing Churn as a factor makes createDataPartition stratify the split by class
trainIndex <- createDataPartition(factor(telco_data$Churn), p = 0.7, list = FALSE)
train_data <- telco_data[trainIndex, ]
temp_data <- telco_data[-trainIndex, ]

# Split remaining data into test (15%) and validation (15%)
testIndex <- createDataPartition(factor(temp_data$Churn), p = 0.5, list = FALSE)
test_data <- temp_data[testIndex, ]
validation_data <- temp_data[-testIndex, ]

6.0.1 Explanation:

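Evaluating on data the model has not seen during training gives an unbiased estimate of its performance.
set.seed(123) makes the random split reproducible.
The data is divided into training (70%), test (15%), and validation (15%) sets. The model is fitted on the training set and its performance is checked on held-out data. Note that validation_data is created here but is not used in the sections that follow.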
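
The idea behind a class-stratified 70/15/15 split can be sketched in base R. This is an illustration on hypothetical labels, not caret's implementation: rows are sampled within each class so that every subset keeps the overall churn rate.

```r
set.seed(123)

# Hypothetical labels: 1000 customers, 26% churners
churn <- rep(c(0, 1), times = c(740, 260))

# Sample 70% of the row indices within each class, then pool them
idx_by_class <- split(seq_along(churn), churn)
train_idx <- unlist(lapply(idx_by_class, function(i) sample(i, round(0.7 * length(i)))))
rest_idx <- setdiff(seq_along(churn), train_idx)

# Halve the remainder, again class by class, into test and validation
rest_by_class <- split(rest_idx, churn[rest_idx])
test_idx <- unlist(lapply(rest_by_class, function(i) sample(i, round(0.5 * length(i)))))
valid_idx <- setdiff(rest_idx, test_idx)

# The churn rate in each subset should match the overall rate
print(c(train = mean(churn[train_idx]),
        test = mean(churn[test_idx]),
        validation = mean(churn[valid_idx])))
```

Checking the class balance of each subset like this is a quick sanity test after any split, whichever function produced it.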

7 Model Selection and Training with Gradient Boosting

# Model Selection and Training with Gradient Boosting
# Prepare Data for xgboost
train_matrix <- model.matrix(Churn ~ . - 1, data = train_data)
train_label <- train_data$Churn
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)

# Prepare test data for early stopping and evaluation
test_matrix <- model.matrix(Churn ~ . - 1, data = test_data)
test_label <- test_data$Churn
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

# Define model parameters
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.1,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train the model
set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, eval = dtest),
  early_stopping_rounds = 10,
  print_every_n = 10
)

7.0.1 Explanation:

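Gradient boosting builds an ensemble of decision trees sequentially, with each new tree fitted to correct the errors of the trees before it.
The training and test sets are converted to numeric matrices with model.matrix() and wrapped in xgb.DMatrix objects, the input format xgboost expects.
The parameters configure a binary classifier: binary:logistic outputs churn probabilities, AUC is the evaluation metric, eta is the learning rate, max_depth limits tree depth, and subsample and colsample_bytree set the fraction of rows and columns sampled for each tree.
Training stops early if the AUC on the eval set does not improve for 10 consecutive rounds, which limits overfitting. Because the eval set here is the test set, the test metrics reported in the next section are somewhat optimistic; the unused validation_data could serve as the eval set instead.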
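
The early stopping rule itself is simple enough to sketch in base R. The example below is a simplified illustration of the rule, not xgboost's internal code, and the AUC values and the patience of 3 are hypothetical (the training call above uses 10).

```r
# Hypothetical evaluation AUC after each boosting round
eval_auc <- c(0.70, 0.74, 0.77, 0.79, 0.80, 0.80, 0.79, 0.80, 0.79, 0.78)
patience <- 3  # rounds to wait for an improvement before stopping

best_round <- 1
stop_round <- length(eval_auc)
for (i in seq_along(eval_auc)) {
  if (eval_auc[i] > eval_auc[best_round]) {
    # Strict improvement: remember this round as the best so far
    best_round <- i
  } else if (i - best_round >= patience) {
    # No improvement for `patience` rounds: stop training here
    stop_round <- i
    break
  }
}

print(c(best_round = best_round, stop_round = stop_round))
```

Training halts a few rounds after the best score, and the best round is the one worth keeping for prediction.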

8 Model Evaluation

# Model Evaluation
# Predictions on the test set
predictions <- predict(xgb_model, dtest)

# Convert probabilities to binary outcome using a threshold (0.5)
predicted_labels <- ifelse(predictions > 0.5, 1, 0)

# Confusion matrix, with churn (1) as the positive class
confusion_matrix <- confusionMatrix(
  factor(predicted_labels, levels = c(0, 1)),
  factor(test_label, levels = c(0, 1)),
  positive = "1"
)
print(confusion_matrix)

8.0.1 Explanation:

predict() returns the predicted churn probability for each customer in the test set.
Probabilities above the 0.5 threshold are labeled 1 (churn), the rest 0.
confusionMatrix() compares the predicted labels with the actual labels and reports accuracy, sensitivity, specificity, and related statistics. It treats the first factor level (here 0) as the positive class unless the positive argument is set.
The ROC curve plotted further down shows how well the model separates the two classes across all thresholds.
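
The headline statistics can be reproduced by hand from the four cells of the confusion matrix. The sketch below uses hypothetical probabilities and labels (not model output) and treats churn (1) as the positive class.

```r
# Hypothetical predicted probabilities and true labels (1 = churn)
probs  <- c(0.91, 0.62, 0.45, 0.30, 0.77, 0.12, 0.55, 0.08)
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)

predicted <- ifelse(probs > 0.5, 1, 0)

# Fixing the levels keeps the table 2 x 2 even if one class is never predicted
cm <- table(Actual = factor(actual, levels = c(0, 1)),
            Predicted = factor(predicted, levels = c(0, 1)))

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["0", "1"]; fn <- cm["1", "0"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual churners that were flagged
specificity <- tn / (tn + fp)  # share of non-churners correctly not flagged

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

For churn prediction, sensitivity is often the number to watch, since a missed churner usually costs more than a false alarm.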

# Recompute the predicted classes at the 0.5 threshold
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Create a confusion matrix
confusion_matrix <- table(Actual = test_label, Predicted = predicted_classes)

# Convert the confusion matrix to a data frame
confusion_df <- as.data.frame(confusion_matrix)

# Load necessary libraries
library(pROC)

# Compute the ROC curve from the test labels and predicted probabilities
# Levels and direction are stated explicitly: 0 = control, 1 = case, higher scores indicate churn
roc_curve <- roc(test_label, predictions, levels = c(0, 1), direction = "<")

# Plot the ROC curve
plot(roc_curve, col = "blue", lwd = 2, main = "ROC Curve")

ggplot(confusion_df, aes(x = Predicted, y = Actual)) +
  geom_tile(aes(fill = Freq), color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  geom_text(aes(label = Freq), vjust = 1) +
  labs(title = "Confusion Matrix", x = "Predicted", y = "Actual") +
  theme_minimal()

# Calculate accuracy for various thresholds
thresholds <- seq(0, 1, by = 0.05)
accuracy_values <- sapply(thresholds, function(thresh) {
  predicted_classes <- ifelse(predictions > thresh, 1, 0)
  sum(predicted_classes == test_label) / length(test_label)
})

# Create a data frame for ggplot
accuracy_df <- data.frame(Threshold = thresholds, Accuracy = accuracy_values)

# Plot accuracy vs threshold
ggplot(accuracy_df, aes(x = Threshold, y = Accuracy)) +
  geom_line(color = "blue") +
  labs(title = "Accuracy vs. Threshold", x = "Threshold", y = "Accuracy") +
  theme_minimal()

Together, the ROC curve, the confusion matrix heatmap, and the accuracy-versus-threshold plot evaluate the binary classifier from complementary angles, using the pROC and ggplot2 packages.
The ROC curve summarizes class separation across all thresholds, the confusion matrix shows the error types at the 0.5 threshold, and the threshold plot shows how classification accuracy changes as the cutoff moves.
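
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The base R sketch below computes it two ways on hypothetical scores and labels, as a cross-check of that interpretation rather than a replacement for pROC.

```r
# Hypothetical predicted probabilities and true labels (1 = churn)
probs  <- c(0.91, 0.62, 0.45, 0.30, 0.77, 0.12, 0.55, 0.08)
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-based (Mann-Whitney) form of the AUC
r <- rank(probs)
auc_rank <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: share of (churner, non-churner) pairs
# in which the churner has the higher score; ties count one half
score_diff <- outer(probs[actual == 1], probs[actual == 0], "-")
auc_pairs <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Unlike accuracy, this measure does not depend on any single threshold, which makes it a useful companion to the accuracy-versus-threshold plot.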

# Save the model to a file
# saveRDS() stores the full R object; xgboost also provides xgb.save() for its own model format
saveRDS(xgb_model, file = "customer_churn_model.rds")