Introduction

Customer churn, or customer attrition, is a significant loss for businesses. This metric is especially important for subscription-based services, such as telecommunications companies. In this project, we aim to conduct a churn analysis using data sourced from IBM sample data sets. We will leverage the R programming language to identify key variables associated with customer churn.

Tasks

To achieve this objective, we will undertake the following tasks:

Load the data and relevant R libraries.
Preprocess the data using various cleaning and recoding techniques.
Generate descriptive statistics and visualizations to explore the data.
Fit a random forest model, a commonly used classification method for churn analysis.
Visualize selected variables based on our modeling techniques to gain additional insights.

Load Necessary Packages

We’ll load the packages needed for data cleaning, visualization, and modeling.

library(dplyr)
library(tidyr)
library(stringr)
library(readxl)
library(ggplot2)
library(gridExtra)
library(tidyverse)
library(tibble)
library(caret)
library(randomForest)

Loading data

dat <- read_excel("C:/Users/asus/Downloads/Customer churn dataset/Telecom Churn Rate Dataset.xlsx")
head(dat)

Checking data

str(dat)

## tibble [7,043 × 23] (S3: tbl_df/tbl/data.frame)
##  $ customerID      : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr [1:7043] "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr [1:7043] "Yes" "No" "No" "No" ...
##  $ Dependents      : chr [1:7043] "No" "No" "No" "No" ...
##  $ tenure          : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr [1:7043] "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr [1:7043] "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr [1:7043] "No" "No" "No" "No" ...
##  $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
##  $ Contract        : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ numAdminTickets : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
##  $ numTechTickets  : num [1:7043] 0 0 0 3 0 0 0 0 2 0 ...
##  $ Churn           : chr [1:7043] "No" "No" "Yes" "No" ...

The dataset contains information about various customer attributes and their subscription services, which can be used to predict customer churn.

Objective: Predict customer churn based on customer attributes and their subscription services.

Problem type: Classification (churn: Yes or No)

Let’s break down the features in the dataset:

Churn: Whether the customer churned or not (Yes or No) - Target variable

Categorical Variables
SeniorCitizen: Whether the customer is a senior citizen or not (coded as 0 or 1 in the raw data)
Partner: Whether the customer has a partner or not (Yes or No)
Dependents: Whether the customer has dependents or not (Yes or No)
PhoneService: Whether the customer has a phone service or not (Yes or No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, or No phone service)
OnlineSecurity: Whether the customer has online security or not (Yes, No, or No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, or No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, or No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, or No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, or No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, or No internet service)
PaperlessBilling: Whether the customer has paperless billing or not (Yes or No)
gender: The gender of the customer (Male or Female)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
InternetService: The customer’s internet service provider (DSL, Fiber optic, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

Numerical Variables
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
tenure: Number of months the customer has stayed with the company
numAdminTickets: Number of administrative tickets raised by the customer
numTechTickets: Number of technical tickets raised by the customer

Data Cleaning

# customerID has no significance so we are dropping this column

dat1 <- subset(dat, select = -customerID)
head(dat1)

# Handling missing values and duplicate rows

sum(duplicated(dat1))

## [1] 17

colSums(is.na(dat1) | dat1 == "")

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges  numAdminTickets 
##                0                0               11                0 
##   numTechTickets            Churn 
##                0                0

# There are 17 duplicate rows and 11 missing values in the TotalCharges column.
# Both counts are small compared to the size of the dataset (7,043 rows), so we
# remove the duplicate rows and the rows with missing values from our analysis.
# Note that duplicates are detected after dropping customerID, so some of them
# may be distinct customers who happen to share identical attributes.

# Let's call the cleaned data datc.

datc <- dat1 %>% 
  na.omit() %>% 
  distinct()
sum(duplicated(datc))

## [1] 0

colSums(is.na(datc) | datc == "")

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges  numAdminTickets 
##                0                0                0                0 
##   numTechTickets            Churn 
##                0                0
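
As a sanity check on the cleaning step, it can help to count how many rows each operation removes. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not rows from the real dataset) and base R equivalents of the pipeline above: `na.omit()` for missing values and `duplicated()` in place of `distinct()`.

```r
# Toy data standing in for dat1 (hypothetical values, not the real dataset)
toy <- data.frame(
  tenure       = c(1, 34, 34, 0, 2),
  TotalCharges = c(29.85, 1889.5, 1889.5, NA, 108.15),
  Churn        = c("No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

n_start   <- nrow(toy)
toy_no_na <- na.omit(toy)                         # drop rows with any missing value
toy_clean <- toy_no_na[!duplicated(toy_no_na), ]  # drop exact duplicate rows

removed_na   <- n_start - nrow(toy_no_na)
removed_dups <- nrow(toy_no_na) - nrow(toy_clean)

cat("Rows removed for NA:", removed_na, "\n")
cat("Rows removed as duplicates:", removed_dups, "\n")
cat("Rows remaining:", nrow(toy_clean), "\n")
```

Running the same kind of count on the real data would show whether the number of rows dropped matches the 17 duplicates and 11 missing values reported above (the two can overlap, so the totals need not simply add up).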

# Summary statistics of the original data (before cleaning)

summary(dat)

##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##  numAdminTickets  numTechTickets      Churn          
##  Min.   :0.0000   Min.   :0.0000   Length:7043       
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :0.5157   Mean   :0.4196                     
##  3rd Qu.:0.0000   3rd Qu.:0.0000                     
##  Max.   :5.0000   Max.   :9.0000                     
##

# The "SeniorCitizen" variable in the dataset is currently coded as a binary variable with
# values of 0 or 1, unlike the other binary variables, which use "Yes" and "No".

# To make our analysis easier to read, we recode this variable with the same
# labeling scheme, "Yes" and "No".

# The "MultipleLines" variable is related to the "PhoneService" variable.

# If "PhoneService" is "No", then "MultipleLines" is always "No phone service".

# To simplify our graphics and modeling, we change the "No phone service" response in the 
# "MultipleLines" variable to "No".

# The same goes for "No internet service" in the internet add-on variables.

datrecode <- datc %>% 
  mutate(
    SeniorCitizen = if_else(SeniorCitizen == 0, "No", "Yes"),
    MultipleLines = if_else(MultipleLines == "No phone service", "No", MultipleLines),
    across(starts_with("Online") | starts_with("Device") | starts_with("Tech") | 
             starts_with("Streaming"), 
           ~ if_else(.x == "No internet service", "No", .x))
  )
head(datrecode)
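
A quick way to confirm that a recoding like this worked is to check that no "No phone service" or "No internet service" labels remain. The following is a minimal base R sketch on a hypothetical three-row data frame (`toy` and the helper `recode_no` are illustrative names, not part of the analysis above); it uses `ifelse()` rather than dplyr so that it runs without extra packages.

```r
# Toy data with the same kinds of labels as the real columns (hypothetical values)
toy <- data.frame(
  SeniorCitizen  = c(0, 1, 0),
  MultipleLines  = c("No phone service", "Yes", "No"),
  OnlineSecurity = c("No internet service", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace one specific label with "No", keep everything else
recode_no <- function(x, label) ifelse(x == label, "No", x)

toy$SeniorCitizen  <- ifelse(toy$SeniorCitizen == 0, "No", "Yes")
toy$MultipleLines  <- recode_no(toy$MultipleLines, "No phone service")
toy$OnlineSecurity <- recode_no(toy$OnlineSecurity, "No internet service")

# Count leftover "No phone service" / "No internet service" labels per column
leftover <- sapply(toy, function(col) sum(grepl("^No (phone|internet) service$", col)))
print(leftover)
```

If every entry of `leftover` is zero, each recoded column contains only "Yes" and "No".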

Visualizations

Let’s take a look at the numerical features.

# Histogram of Tenure:

p1 <- ggplot(datrecode, aes(x = tenure)) +
  geom_histogram(binwidth = 5, color = "white", fill = "#FFA07A") +
  labs(title = "Distribution of Tenure", 
       x = "Tenure (Months)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
  theme_minimal()

# Histogram of TotalCharges
p2 <- ggplot(datrecode, aes(x = TotalCharges)) + 
  geom_histogram(binwidth = 200, fill = "#69b3a2", color = "black") +
  scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
  labs(x = "Total Charges", y = "Frequency", 
       title = "Histogram of Total Charges")

# Histogram of MonthlyCharges
p3 <- ggplot(data = datrecode, aes(x = MonthlyCharges)) +
  geom_histogram(binwidth = 10, color = "white", fill = "#FFA07A") +
  labs(title = "Distribution of Monthly Charges", 
       x = "Monthly Charges ($)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 2)

The tenure histogram peaks in the 0-5 and 65-70 month ranges, while each of the intermediate ranges holds roughly 250 to 400 customers.
The “TotalCharges” histogram is right-skewed: the number of customers declines steadily as total charges increase.
The histogram of “Monthly Charges” shows its largest peak between 20 and 40 dollars, with a second, roughly bell-shaped cluster centered around 80 dollars.
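
The bar heights in a histogram can also be tabulated directly, which makes it easier to quote exact counts per range. Below is a minimal sketch that bins a made-up `tenure` vector (hypothetical values, not the real column) into 5-month ranges with `cut()` and `table()`.

```r
# Hypothetical tenure values in months (not taken from the real dataset)
tenure <- c(1, 2, 4, 8, 12, 30, 66, 68, 70, 72)

# Bin into 5-month ranges: (0,5], (5,10], ..., (70,75]
bins <- cut(tenure, breaks = seq(0, 75, by = 5))
bin_counts <- table(bins)

# Show only the ranges that contain at least one customer
print(bin_counts[bin_counts > 0])
```

Note that `cut()` uses right-closed intervals by default, so a tenure of exactly 5 falls in the (0,5] range; the bins drawn by `geom_histogram()` may be aligned differently depending on its `boundary` and `center` settings.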

# Let's look at numAdminTickets and numTechTickets 

# Histogram of numAdminTickets

p4 <- ggplot(data = datrecode, aes(x = numAdminTickets)) +
  geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")),
            stat = "count") +
  labs(title = "Distribution of numAdminTickets", 
       x = "numAdminTickets", 
       y = "Frequency") +
  scale_x_continuous(limits = c(-1, 6), breaks = seq(0, 5, 1)) +
  theme_minimal()

# Histogram of numTechTickets

p5 <- ggplot(data = datrecode, aes(x = numTechTickets)) +
  geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")), 
            stat = "count") +
  labs(title = "Distribution of numTechTickets", 
       x = "numTechTickets", 
       y = "Frequency") +
  scale_x_continuous(limits = c(-1, 10), breaks = seq(0, 10, 1)) +
  theme_minimal()

grid.arrange(p4,p5,ncol = 1)

In the numAdminTickets graph, most customers have not raised any tickets.
The highest number of admin tickets raised by a single customer is 5, but such cases are rare.
Similarly, in the numTechTickets graph, the maximum number of tech tickets raised by a single customer is 9, which is also far from typical.

Let’s take a look at the categorical features.

# Bar graph of Contract

p9 <- ggplot(datrecode, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")), 
            stat = "count") +
  labs(title = "Distribution of Contract Type",
       x = "Contract Type",
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


# Bar graph of gender

colors <- c("#0072B2", "#E69F00")

p10 <- ggplot(data = datrecode, aes(x = gender)) + 
  geom_bar(aes(fill = gender)) + 
  geom_text(aes(y = ..count.. - 500, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")), 
            stat = "count", position = position_stack(vjust = 0.5)) + 
  scale_fill_manual(values = colors) +
  xlab("Gender") + ylab("Count") + ggtitle("Customer Gender Distribution")


# Arrange plots in a grid
grid.arrange(p9, p10)

The most common contract type is Month-to-Month, followed by Two Year and One Year contracts.
The bar graph indicates an almost equal distribution between genders.

# Bar graph of InternetService 

p11 <- ggplot(datrecode, aes(x = InternetService, fill = InternetService)) +
  geom_bar() +
  labs(title = "Distribution of Internet Service", 
       x = "Internet Service", 
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar graph of PaymentMethod 

p12 <- ggplot(datrecode, aes(x = PaymentMethod, fill = PaymentMethod)) +
  geom_bar() +
  labs(title = "Distribution of Payment Method", 
       x = "Payment Method", 
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange plots in a grid
grid.arrange(p11, p12, nrow = 1,ncol = 2)

Looking at the bar graph for “Internet Service,” we can see that Fiber optic is the top choice among internet users, followed by DSL, with a smaller group of customers not using any internet service.
Electronic check is the most popular payment method, clearly ahead of the other three options, which are used by similar numbers of customers.

# Plotting Bar graphs of SeniorCitizen, Partner, Dependents, PhoneService,MultipleLines 
# and OnlineSecurity.

p14 <- ggplot(datrecode, aes(x = SeniorCitizen)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("SeniorCitizen")
 

p15 <- ggplot(datrecode, aes(x = Partner)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("Partner")

p16 <- ggplot(datrecode, aes(x = Dependents)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("Dependents")

p17 <- ggplot(datrecode, aes(x = PhoneService)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("PhoneService")

p18 <- ggplot(datrecode, aes(x = MultipleLines)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("MultipleLines")

p19 <- ggplot(datrecode, aes(x = OnlineSecurity)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("OnlineSecurity")


# Arrange plots in a grid
grid.arrange(p14,p15,p16,p17,p18,p19, nrow = 2, ncol = 3)

The majority of customers are not senior citizens.
The numbers of customers with and without partners are almost the same.
About 70% of customers do not have dependents.
More than 90% of customers have phone service, and the majority of them have a single line.
Approximately 70% of customers have no online security (after recoding, this group includes customers without internet service).

# Bar graphs of OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
# StreamingMovies and PaperlessBilling.

p20 <- ggplot(datrecode, aes(x = OnlineBackup)) + geom_bar() +
  geom_text(aes(y = ..count.. -1000, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("OnlineBackup")


p21 <- ggplot(datrecode, aes(x = DeviceProtection)) + geom_bar()+
  geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")), 
                           stat = "count") +ggtitle("DeviceProtection")



p22 <- ggplot(datrecode, aes(x = TechSupport)) + geom_bar()+ 
   geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")), 
                           stat = "count")  +ggtitle("TechSupport")


p23 <- ggplot(datrecode, aes(x = StreamingTV)) + geom_bar() +
  geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
            stat = "count") + ggtitle("StreamingTV")

p24 <- ggplot(datrecode, aes(x = StreamingMovies)) +
  geom_bar() +geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
                        stat = "count") + ggtitle("StreamingMovies")


p25 <- ggplot(datrecode, aes(x = PaperlessBilling)) + 
  geom_bar() +geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
                        stat = "count") + ggtitle("PaperlessBilling")


grid.arrange(p20,p21,p22,p23,p24,p25, nrow = 2, ncol = 3)

After recoding, most customers (roughly 60% or more) do not have OnlineBackup, DeviceProtection, TechSupport, StreamingTV, or StreamingMovies. Each of these add-on services is used by a minority of the customer base.

# How many customers have churned in our data?

p13 <- ggplot(datrecode, aes(x = Churn)) +
  geom_bar()+ geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count') + ggtitle("Churn")+
  theme(plot.title = element_text(hjust = .5))
p13

The graph shows the distribution of customer churn, with approximately 73% of customers remaining active and 27% of customers churning.

Statistical Modeling

Supervised

Classification

Random Forest

Note: When the classes are imbalanced, as they are here with fewer customers churning, a model can reach high accuracy simply by predicting the majority class (in this instance, “No” for not churned) more often, so accuracy alone can be misleading.
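
To make the imbalance concern concrete, the sketch below computes the accuracy of a "model" that always predicts the majority class. It uses a made-up vector with a 73/27 split, chosen only to mirror the approximate proportions in the churn bar chart; it is not the real data.

```r
# Hypothetical labels with roughly the same class balance as the churn plot
churn <- c(rep("No", 73), rep("Yes", 27))

class_share       <- prop.table(table(churn))
majority_class    <- names(class_share)[which.max(class_share)]
baseline_accuracy <- max(class_share)

cat("Majority class:", majority_class, "\n")
cat("Accuracy of always predicting it:", baseline_accuracy, "\n")
```

Any classifier should be judged against this baseline, and per-class error rates are more informative than overall accuracy when one class is much rarer.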

library(tibble)
library(caret)
set.seed(123)
trainIndex <- createDataPartition(datrecode$Churn, p = .7, list=FALSE, times=1)
trainSet <- datrecode[trainIndex,]
testSet <- datrecode[-trainIndex,]


Model1 <- randomForest(as.factor(Churn) ~ ., data = trainSet,mtry = 4, ntree = 400, 
                       importance= TRUE)

Model1

## 
## Call:
##  randomForest(formula = as.factor(Churn) ~ ., data = trainSet,      mtry = 4, ntree = 400, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 14.5%
## Confusion matrix:
##       No Yes class.error
## No  3322 287  0.07952341
## Yes  425 877  0.32642089

importance(Model1)

##                          No         Yes MeanDecreaseAccuracy MeanDecreaseGini
## gender           -2.8828634  -2.4246713           -3.7345378         30.52423
## SeniorCitizen    10.8580800   0.1094152            9.3583026         27.79091
## Partner           3.3982833   0.4407542            3.3032271         27.84312
## Dependents       -2.9749091   9.7342889            5.3204954         25.36001
## tenure           26.0181782  31.4209342           44.7802826        252.70322
## PhoneService      1.8197101  11.7297989           10.6453395         11.01264
## MultipleLines     3.3933439  12.2979055           11.8788514         28.74293
## InternetService  17.5444257  22.8373573           26.4965962         73.15133
## OnlineSecurity    4.5295764  18.8511354           15.2826782         37.10449
## OnlineBackup      8.0222559  14.7390057           14.7406169         31.76444
## DeviceProtection  9.7705416   0.2064048            9.9093998         26.01393
## TechSupport       1.0548766  20.1539708           14.8994975         30.40627
## StreamingTV      11.4499756   9.3796617           16.4384529         27.74302
## StreamingMovies  12.1572572   9.4221865           17.0537916         27.72715
## Contract         -5.0131795  30.0177061           28.6978169        140.53040
## PaperlessBilling  6.3663897   5.9612556            9.1536206         38.31996
## PaymentMethod     4.7276611   9.7927882           10.9666208         68.68784
## MonthlyCharges   24.9445291  29.2168896           38.7892048        247.03385
## TotalCharges     24.9824655  25.5758890           35.9320283        253.77535
## numAdminTickets  -0.6444313   2.0932069            0.7567089         38.38885
## numTechTickets   51.0852906 107.6261879          106.0643216        299.85094

Regarding the class imbalance concern, the class error for “No” (not churned) is 0.0795 (7.95%), while the class error for “Yes” (churned) is 0.3264 (32.64%). This indicates that the model is indeed better at predicting the majority class (not churned), but it’s still able to predict the minority class (churned) to some extent.
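
The class errors and the OOB error rate can be reproduced by hand from the confusion matrix printed above. In this sketch the four counts are copied from that output; the variable names are our own.

```r
# OOB confusion matrix from the Model1 output (rows = actual, columns = predicted)
oob <- matrix(c(3322, 287,
                 425, 877),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# Class error = share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error = share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(oob_error, 4))
```

The "Yes" row shows why the churn class is the harder one: 425 of the 1,302 churned customers in the training set were predicted as "No".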

feature_importance <- importance(Model1)
top_10_features <- order(feature_importance[, "MeanDecreaseAccuracy"], 
                         decreasing = TRUE)[1:10]
feature_names <- rownames(feature_importance)[top_10_features]
top_10_importance_values <- feature_importance[top_10_features, 
              c("No", "Yes", "MeanDecreaseAccuracy", "MeanDecreaseGini")]
data.frame(Feature = feature_names, top_10_importance_values)
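
The ranking step can be illustrated without the fitted model. The sketch below builds a small matrix from four rows of the importance table above (values rounded to two decimals) and sorts it by MeanDecreaseAccuracy with `order()`; the object names are illustrative.

```r
# Four rows copied (rounded) from the importance(Model1) output above
imp <- matrix(c( 44.78, 252.70,
                 28.70, 140.53,
                106.06, 299.85,
                 -3.73,  30.52),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("tenure", "Contract", "numTechTickets", "gender"),
                              c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# Sort rows from most to least important by permutation importance
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
top_2  <- head(ranked, 2)
print(top_2)
```

A negative MeanDecreaseAccuracy, as for gender, means that permuting the variable did not hurt (and slightly helped) out-of-bag accuracy, which suggests it carries little predictive signal in this model.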

Now let’s test our model using the test set.

1. Confusion Matrix

prediction <- predict(Model1,testSet)

confusion_matrix <- confusionMatrix(as.factor(prediction), as.factor(testSet$Churn))
confusion_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1418  173
##        Yes  128  385
##                                           
##                Accuracy : 0.8569          
##                  95% CI : (0.8412, 0.8716)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.01121         
##                                           
##             Sensitivity : 0.9172          
##             Specificity : 0.6900          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.7505          
##              Prevalence : 0.7348          
##          Detection Rate : 0.6740          
##    Detection Prevalence : 0.7562          
##       Balanced Accuracy : 0.8036          
##                                           
##        'Positive' Class : No              
##

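Based on the confusion matrix and the performance metrics, the random forest model exhibits an accuracy of 85.69% on the test set, with a sensitivity of 91.72% and specificity of 69.00%. Because ‘No’ is treated as the positive class here, the sensitivity is the share of non-churners correctly identified, and the specificity is the share of churners correctly identified.
The model therefore performs better in identifying the ‘No’ class compared to the ‘Yes’ class. The balanced accuracy of 80.36%, the average of sensitivity and specificity, indicates that the model has a relatively good overall performance.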
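
The headline metrics can be recomputed from the four cells of the test-set confusion matrix, which also makes explicit which class each metric refers to. The counts below are copied from the output above; "No" is the positive class, as in that output, and the variable names are our own.

```r
# Test-set confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(1418, 173,
                128, 385),
             nrow = 2, byrow = TRUE,
             dimnames = list(prediction = c("No", "Yes"), reference = c("No", "Yes")))

tp <- cm["No", "No"]    # non-churners predicted as non-churners
fp <- cm["No", "Yes"]   # churners predicted as non-churners
fn <- cm["Yes", "No"]   # non-churners predicted as churners
tn <- cm["Yes", "Yes"]  # churners predicted as churners

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)  # recall for the "No" class
specificity       <- tn / (tn + fp)  # recall for the "Yes" (churn) class
balanced_accuracy <- (sensitivity + specificity) / 2
churn_precision   <- tn / (tn + fn)  # reported above as Neg Pred Value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy,
              churn_precision = churn_precision), 4))
```

For a retention campaign the churn-class figures matter most: about 69% of actual churners are caught, and about 75% of the customers flagged as churners actually churned.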

Let’s plot the variables highlighted by our model

Categorical Variables vs Churn

# Bar Graph of Churn VS Contract type 

p26 <- ggplot(datrecode, aes(x = Churn, fill = Contract)) +
  geom_bar(position = "dodge") +
  geom_text(aes(y = ..count.., 
                label = paste0(round(..count.. / sum(..count..), 4) * 100, "%")),
            stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Churn vs Contract", x = "Churn", y = "Count")
p26

From this plot it’s clear that most churned customers were on month-to-month contracts, followed by one-year contracts, with only a very small share on two-year contracts.

# bar graph of Internet Service Vs Churn

p27 <- ggplot(datrecode, aes(x = Churn, fill = InternetService)) +
  geom_bar(position = "dodge") + 
  geom_text(aes(y = ..count.., 
                label = paste0(round(..count.. / sum(..count..), 4) * 100, "%")),
            stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Churn vs Internet Service", x = "Churn", y = "Count")
p27

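Fiber optic customers who churned make up about 18% of all customers, followed by DSL customers who churned at 6.5%. The labels are shares of the whole customer base, not churn rates within each service type.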
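
The labels in these plots are computed as count / sum(count), so each one is a share of all customers rather than a churn rate within a group. The difference is easy to see with `prop.table()` on a small made-up table (the `toy` data below is hypothetical and only illustrates the two calculations).

```r
# Hypothetical customers: 10 on fiber (4 churned) and 10 on DSL (2 churned)
toy <- data.frame(
  InternetService = c(rep("Fiber optic", 10), rep("DSL", 10)),
  Churn           = c(rep("Yes", 4), rep("No", 6), rep("Yes", 2), rep("No", 8)),
  stringsAsFactors = FALSE
)

tab <- table(toy$InternetService, toy$Churn)

share_of_all <- prop.table(tab)              # each cell divided by the grand total
rate_within  <- prop.table(tab, margin = 1)  # each cell divided by its row total

print(share_of_all)
print(rate_within)
```

In this toy example, fiber customers who churned are 20% of all customers, but the churn rate among fiber customers is 40%. The within-group rate is usually the more useful number when comparing service types.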

# Bar graphs of Churn vs Streaming Movies, Streaming TV, Online Security and Tech Support

p28 <- ggplot(datrecode, aes(x = Churn, fill = StreamingMovies)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Streaming Movies", x = "Churn", y = "Count")

p29 <- ggplot(datrecode, aes(x = Churn, fill = StreamingTV)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Streaming TV", x = "Churn", y = "Count")

p30 <- ggplot(datrecode, aes(x = Churn, fill = OnlineSecurity)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Online Security", x = "Churn", y = "Count")

p31 <- ggplot(datrecode, aes(x = Churn, fill = TechSupport)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Tech Support", x = "Churn", y = "Count")

grid.arrange(p28,p29,p30,p31, ncol = 2, nrow =2)

Across Streaming Movies, Streaming TV, Online Security, and Tech Support, most churned customers fall in the “No” category, that is, they did not have the service.

Numerical Variables VS Churn

# Now we'll plot the numerical variables against Churn to look for further insights

# Tenure & Churn:

p6 <- ggplot(datrecode, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 5, color = "white", position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Tenure by Churn", 
       x = "Tenure (Months)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))


# TotalCharges & Churn :
  
p7 <- ggplot(datrecode, aes(x = TotalCharges, fill = Churn)) + 
  geom_histogram(binwidth = 200, color = "white", position = "identity", alpha = 0.7) +
  scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
  labs(x = "Total Charges", y = "Frequency", 
       title = "Histogram of Total Charges by Churn") +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))

# MonthlyCharges & Churn :
  
p8 <- ggplot(data = datrecode, aes(x = MonthlyCharges, fill = Churn)) +
  geom_histogram(binwidth = 10, color = "white", position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Monthly Charges by Churn", 
       x = "Monthly Charges ($)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))



grid.arrange(p6,p7,p8, ncol = 1)

Churn decreases with increasing tenure, suggesting that long-term customers are more satisfied and less likely to churn.
Churn is concentrated among customers with total charges below 1,000 and is negligible above 3,000, suggesting greater commitment and satisfaction among higher-spending customers.
Customers with monthly charges between 60 and 100 churn more often, possibly due to unmet value expectations or competition in this price range.
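
The histograms above show counts, so a claim about churn rate is easier to verify by computing the proportion of churners within each band. Below is a minimal sketch using `cut()` and `tapply()` on a made-up data frame; the band boundaries (12 and 48 months) and all values are assumptions chosen for illustration.

```r
# Hypothetical customers (not rows from the real dataset)
toy <- data.frame(
  tenure = c(1, 3, 4, 10, 14, 30, 45, 60, 70, 72),
  Churn  = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Assumed tenure bands: up to 12 months, 13-48 months, 49-72 months
toy$band <- cut(toy$tenure, breaks = c(0, 12, 48, 72),
                labels = c("0-12", "13-48", "49-72"))

# Churn rate per band = share of customers in the band with Churn == "Yes"
churn_rate <- tapply(toy$Churn == "Yes", toy$band, mean)
print(round(churn_rate, 2))
```

The same pattern works for TotalCharges or MonthlyCharges by changing the column and the breaks.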

# Churn vs numTechTickets
p32 <- ggplot(datrecode, aes(x = Churn, y = numTechTickets)) +
  geom_boxplot() + scale_y_continuous(limits = c(0,9), breaks = seq(0, 9, 1))+
  theme_minimal() +
  labs(title = "Churn vs NumTechTickets", x = "Churn", y = "Number of Technical Tickets")
p32

The boxplot suggests that customers who churned generally had more technical tickets than those who did not churn, and the distribution of the number of technical tickets for churned customers is more diverse.

library(ggplot2)

# Scatter plot of tenure vs numTechTickets, colored by Churn

p34 <- ggplot(data = datrecode, aes(x = tenure, y = numTechTickets, color = Churn)) +
  geom_point() +
  labs(title = "Tenure vs. Number of Tech Tickets",
       x = "Tenure",
       y = "Number of Tech Tickets") +
  scale_x_continuous(breaks = seq(0, 100, 10)) + # Updated interval for x-axis
  scale_y_continuous(breaks = seq(0, 10, 1)) + # Updated interval and range for y-axis
  theme_minimal()
p34

Customers with a short tenure who raise fewer tech tickets are more likely to churn. Similarly, customers with a longer tenure who raise more than three tech tickets are also more likely to churn.
Customers with a tenure greater than 30 and who raise fewer than three tech tickets are much less likely to churn.

In conclusion, the analysis of customer churn reveals several key findings:

The box plot reveals that customers who churned generally had more technical tickets than those who did not churn, and the distribution of technical tickets for churned customers is more diverse, highlighting the impact of technical issues on customer retention.
Churn rate decreases with increasing tenure, implying that long-term customers are more satisfied and less likely to churn. This emphasizes the importance of nurturing long-term relationships with customers.
Short-tenured customers with fewer tech tickets (1 or 2) and long-tenured customers with more than three tech tickets are more likely to churn. This suggests that both initial service issues and ongoing problems contribute to customer attrition.
Customers with a tenure greater than 30 months and fewer than three tech tickets are much less likely to churn, indicating that long-term satisfaction and minimal service issues help retain customers.
Monthly charges between 60 and 100 are associated with higher churn, possibly due to unmet value expectations or competition in this price range. This highlights the need for businesses to focus on providing competitive pricing and value.
A higher churn rate is observed for customers with total charges below 1,000, while negligible churn is seen for customers with charges above 3,000. This implies that higher-spending customers tend to be more committed and satisfied with the service.
Month-to-month contract types contribute to the highest churn rates, followed by one-year contracts, with a very small proportion of churn for two-year contracts. This suggests that customers with longer contracts may feel more secure and satisfied with the service.
Fiber optic customers who churned account for approximately 18% of all customers, compared with 6.5% for DSL customers who churned (these are shares of the whole customer base, not within-group churn rates), indicating that internet service type is also related to customer retention.
Customers who do not have streaming movies, streaming TV, online security, and tech support services are more likely to churn. This suggests that the absence of these features may lead to dissatisfaction and ultimately, customer attrition.

To reduce churn rates, businesses should focus on improving customer service, offering competitive pricing and value, providing desired features in service packages, and fostering long-term customer relationships.