Introduction

Customer churn, or customer attrition, is a significant loss for businesses. This metric is especially important for subscription-based services, such as telecommunications companies. In this project, we aim to conduct a churn analysis using data sourced from IBM sample data sets. We will leverage the R programming language to identify key variables associated with customer churn.

Tasks

To achieve this objective, we will undertake the following tasks:

Load Necessary Packages

We’ll load the packages needed for data cleaning, visualization, and modeling.

library(dplyr)
library(tidyr)
library(stringr)
library(readxl)
library(ggplot2)
library(gridExtra)
library(tidyverse)
library(tibble)
library(caret)
library(randomForest)

Loading data

dat <- read_excel("C:/Users/asus/Downloads/Customer churn dataset/Telecom Churn Rate Dataset.xlsx")
head(dat)

Checking data

str(dat)
## tibble [7,043 × 23] (S3: tbl_df/tbl/data.frame)
##  $ customerID      : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr [1:7043] "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr [1:7043] "Yes" "No" "No" "No" ...
##  $ Dependents      : chr [1:7043] "No" "No" "No" "No" ...
##  $ tenure          : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr [1:7043] "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr [1:7043] "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr [1:7043] "No" "No" "No" "No" ...
##  $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
##  $ Contract        : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ numAdminTickets : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
##  $ numTechTickets  : num [1:7043] 0 0 0 3 0 0 0 0 2 0 ...
##  $ Churn           : chr [1:7043] "No" "No" "Yes" "No" ...

The dataset contains information about various customer attributes and their subscription services, which can be used to predict customer churn.

Objective: Predict customer churn based on customer attributes and their subscription services.

Problem type: Classification (churn: Yes or No)

Let’s break down the features in the dataset:

  1. Churn: Whether the customer churned or not (Yes or No) - Target variable

    Categorical Variables

  2. SeniorCitizen: Whether the customer is a senior citizen or not (stored as 1 or 0 in the raw data)

  3. Partner: Whether the customer has a partner or not (Yes or No)

  4. Dependents: Whether the customer has dependents or not (Yes or No)

  5. PhoneService: Whether the customer has a phone service or not (Yes or No)

  6. MultipleLines: Whether the customer has multiple lines or not (Yes, No, or No phone service)

  7. OnlineSecurity: Whether the customer has online security or not (Yes, No, or No internet service)

  8. OnlineBackup: Whether the customer has online backup or not (Yes, No, or No internet service)

  9. DeviceProtection: Whether the customer has device protection or not (Yes, No, or No internet service)

  10. TechSupport: Whether the customer has tech support or not (Yes, No, or No internet service)

  11. StreamingTV: Whether the customer has streaming TV or not (Yes, No, or No internet service)

  12. StreamingMovies: Whether the customer has streaming movies or not (Yes, No, or No internet service)

  13. PaperlessBilling: Whether the customer has paperless billing or not (Yes or No)

  14. gender: The gender of the customer (Male or Female)

  15. Contract: The contract term of the customer (Month-to-month, One year, Two year)

  16. InternetService: The customer’s internet service type (DSL, Fiber optic, No)

  17. PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

    Numerical Variables

  18. MonthlyCharges: The amount charged to the customer monthly

  19. TotalCharges: The total amount charged to the customer

  20. tenure: Number of months the customer has stayed with the company

  21. numAdminTickets: Number of administrative tickets raised by the customer

  22. numTechTickets: Number of technical tickets raised by the customer
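
As a quick check on the grouping above, the columns can be split by storage type. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not taken from the dataset). Because SeniorCitizen is stored as a number in the raw data even though it is categorical, a purely type-based split needs a manual correction for it.

```r
# Hypothetical toy data frame mirroring a few of the columns described above
toy <- data.frame(
  gender         = c("Female", "Male", "Male"),
  SeniorCitizen  = c(0, 0, 1),
  Contract       = c("Month-to-month", "One year", "Two year"),
  tenure         = c(1, 34, 2),
  MonthlyCharges = c(29.85, 56.95, 53.85),
  Churn          = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

target <- "Churn"

# Split the column names by storage type, leaving the target out
is_num <- sapply(toy, is.numeric)
numeric_cols <- names(toy)[is_num]
categorical_cols <- setdiff(names(toy)[!is_num], target)

# SeniorCitizen is a 0/1 flag, so move it to the categorical group by hand
numeric_cols <- setdiff(numeric_cols, "SeniorCitizen")
categorical_cols <- c(categorical_cols, "SeniorCitizen")

numeric_cols
categorical_cols
```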

Data Cleaning

# customerID has no significance so we are dropping this column

dat1 <- subset(dat, select = -customerID)
head(dat1)
# Handling missing values and duplicate rows

sum(duplicated(dat1))
## [1] 17
colSums(is.na(dat1) | dat1 == "")
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges  numAdminTickets 
##                0                0               11                0 
##   numTechTickets            Churn 
##                0                0
# There are 17 duplicate rows and 11 missing values in the TotalCharges column.
# Both counts are small relative to the size of the dataset (7,043 rows),
# so we remove the rows with missing values and the duplicate rows
# from our analysis.

# Let's call it datc.

datc <- dat1 %>% 
  na.omit() %>% 
  distinct()
sum(duplicated(datc))
## [1] 0
colSums(is.na(datc) | datc == "")
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges  numAdminTickets 
##                0                0                0                0 
##   numTechTickets            Churn 
##                0                0
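
To keep track of how many rows each cleaning step removes, the two steps can be counted separately. This is a minimal base R sketch on a hypothetical toy data frame (the values are invented); it uses `complete.cases()` and `duplicated()` as stand-ins for `na.omit()` and `distinct()`. One caveat worth keeping in mind: once customerID is dropped, two different customers with identical values in every remaining column are treated as duplicates, which is an assumption rather than a certainty.

```r
# Hypothetical toy data: one row with a missing TotalCharges, one repeated row
toy <- data.frame(
  tenure       = c(1, 34, 34, 0, 2),
  TotalCharges = c(29.85, 1889.5, 1889.5, NA, 108.15),
  Churn        = c("No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

n_raw <- nrow(toy)

# Step 1: drop rows that contain any missing value
no_na <- toy[complete.cases(toy), ]

# Step 2: drop rows that repeat an earlier row exactly
cleaned <- no_na[!duplicated(no_na), ]

removed_na  <- n_raw - nrow(no_na)
removed_dup <- nrow(no_na) - nrow(cleaned)

c(raw = n_raw, removed_na = removed_na, removed_dup = removed_dup, kept = nrow(cleaned))
```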
# Summary statistics of the raw data (before cleaning)

summary(dat)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##  numAdminTickets  numTechTickets      Churn          
##  Min.   :0.0000   Min.   :0.0000   Length:7043       
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :0.5157   Mean   :0.4196                     
##  3rd Qu.:0.0000   3rd Qu.:0.0000                     
##  Max.   :5.0000   Max.   :9.0000                     
## 
# The "SeniorCitizen" variable in the dataset is currently coded as a binary variable with
# values of "0" or "1". This coding scheme is not intuitive and can be difficult to interpret. 

# To make our analysis easier and more understandable, we can recode this variable using a more   
# intuitive labeling scheme, such as "yes" and "no".

# The "MultipleLines" variable depends on the "PhoneService" variable.

# If "PhoneService" is "No", then "MultipleLines" is recorded as "No phone service". 

# To simplify our graphics and modeling, we change the "No phone service" response in the 
# "MultipleLines" variable to "No".

# The same goes for the "No internet service" response in the internet add-on variables.

datrecode <- datc %>% 
  mutate(SeniorCitizen = if_else(SeniorCitizen == 0, "No", "Yes")) %>% 
  mutate(MultipleLines = if_else(MultipleLines == "No phone service", "No", MultipleLines)) %>% 
  mutate(across(starts_with("Online") | starts_with("Device") | starts_with("Tech") | 
                  starts_with("Streaming"), 
                ~ if_else(. == "No internet service", "No", .)))
head(datrecode)
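
One way to confirm that a recoding like this leaves only "Yes" and "No" in the affected columns is to list the distinct values afterwards. The sketch below repeats the same logic in base R on a hypothetical toy data frame (the column values are invented, and `internet_cols` is a helper name introduced here for illustration).

```r
# Hypothetical toy data containing the three kinds of values to recode
toy <- data.frame(
  SeniorCitizen  = c(0, 1, 0),
  MultipleLines  = c("No phone service", "Yes", "No"),
  OnlineSecurity = c("No internet service", "Yes", "No"),
  StreamingTV    = c("No", "No internet service", "Yes"),
  stringsAsFactors = FALSE
)

recoded <- toy

# 0/1 flag to "No"/"Yes"
recoded$SeniorCitizen <- ifelse(recoded$SeniorCitizen == 0, "No", "Yes")

# "No phone service" collapses into "No"
recoded$MultipleLines[recoded$MultipleLines == "No phone service"] <- "No"

# "No internet service" collapses into "No" in the internet add-on columns
internet_cols <- grep("^(Online|Device|Tech|Streaming)", names(recoded), value = TRUE)
recoded[internet_cols] <- lapply(recoded[internet_cols], function(x) {
  x[x == "No internet service"] <- "No"
  x
})

# Distinct values per column after recoding
lapply(recoded, function(x) sort(unique(x)))
```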

Visualizations

Let’s take a look at the numerical features

# Histogram of Tenure:

p1 <- ggplot(datrecode, aes(x = tenure)) +
  geom_histogram(binwidth = 5, color = "white", fill = "#FFA07A") +
  labs(title = "Distribution of Tenure", 
       x = "Tenure (Months)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
  theme_minimal()

# Histogram of TotalCharges
p2 <- ggplot(datrecode, aes(x = TotalCharges)) + 
  geom_histogram(binwidth = 200, fill = "#69b3a2", color = "black") +
  scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
  labs(x = "Total Charges", y = "Frequency", 
       title = "Histogram of Total Charges")

# Histogram of MonthlyCharges
p3 <- ggplot(data = datrecode, aes(x = MonthlyCharges)) +
  geom_histogram(binwidth = 10, color = "white", fill = "#FFA07A") +
  labs(title = "Distribution of Monthly Charges", 
       x = "Monthly Charges ($)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 2)

# Let's look at numAdminTickets and numTechTickets 

# histogram of numAdminTickets

p4 <- ggplot(data = datrecode, aes(x = numAdminTickets)) +
  geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")),
            stat = "count") +
  labs(title = "Distribution of numAdminTickets", 
       x = "numAdminTickets", 
       y = "Frequency") +
  scale_x_continuous(limits = c(-1, 6), breaks = seq(0, 5, 1)) +
  theme_minimal()

# histogram of numTechtickets

p5 <- ggplot(data = datrecode, aes(x = numTechTickets)) +
  geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")), 
            stat = "count") +
  labs(title = "Distribution of numTechTickets", 
       x = "numTechTickets", 
       y = "Frequency") +
  scale_x_continuous(limits = c(-1, 10), breaks = seq(0, 10, 1)) +
  theme_minimal()

grid.arrange(p4,p5,ncol = 1)
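
The percentage labels drawn by `geom_text()` in these plots come from `prop.table()` applied to the bar counts, so each label is that bar's share of all rows. A minimal base R sketch of the same calculation, using a made-up vector of ticket counts (`tickets` is hypothetical):

```r
# Hypothetical ticket counts for five customers
tickets <- c(0, 0, 0, 1, 2)

# Count rows per value, convert to shares, and format as percentage labels
counts <- table(tickets)
labels <- paste0(round(prop.table(counts), 4) * 100, "%")
names(labels) <- names(counts)

labels
```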

Let’s take a look at the categorical features

# Bar graph of Contract

p9 <- ggplot(datrecode, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 400, 
                label = paste0(round(prop.table(..count..), 1) * 100, "%")), 
            stat = "count") +
  labs(title = "Distribution of Contract Type",
       x = "Contract Type",
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


# Bar graph of gender

colors <- c("#0072B2", "#E69F00")

p10 <- ggplot(data = datrecode, aes(x = gender)) + 
  geom_bar(aes(fill = gender)) + 
  geom_text(aes(y = ..count.. - 500, 
                label = paste0(round(prop.table(..count..), 4) * 100, "%")), 
            stat = "count", position = position_stack(vjust = 0.5)) + 
  scale_fill_manual(values = colors) +  # apply the palette defined above
  xlab("Gender") + ylab("Count") + ggtitle("Customer Gender Distribution")


# Arrange plots in a grid
grid.arrange(p9, p10)

# Bar graph of InternetService 

p11 <- ggplot(datrecode, aes(x = InternetService, fill = InternetService)) +
  geom_bar() +
  labs(title = "Distribution of Internet Service", 
       x = "Internet Service", 
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar graph of PaymentMethod 

p12 <- ggplot(datrecode, aes(x = PaymentMethod, fill = PaymentMethod)) +
  geom_bar() +
  labs(title = "Distribution of Payment Method", 
       x = "Payment Method", 
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange plots in a grid
grid.arrange(p11, p12, nrow = 1,ncol = 2)

# Plotting Bar graphs of SeniorCitizen, Partner, Dependents, PhoneService,MultipleLines 
# and OnlineSecurity.

p14 <- ggplot(datrecode, aes(x = SeniorCitizen)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("SeniorCitizen")
 

p15 <- ggplot(datrecode, aes(x = Partner)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("Partner")

p16 <- ggplot(datrecode, aes(x = Dependents)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("Dependents")

p17 <- ggplot(datrecode, aes(x = PhoneService)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("PhoneService")

p18 <- ggplot(datrecode, aes(x = MultipleLines)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("MultipleLines")

p19 <- ggplot(datrecode, aes(x = OnlineSecurity)) + geom_bar() +
  geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("OnlineSecurity")


# Arrange plots in a grid
grid.arrange(p14,p15,p16,p17,p18,p19, nrow = 2, ncol = 3)

# Bar graphs of OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
# StreamingMovies and PaperlessBilling.

p20 <- ggplot(datrecode, aes(x = OnlineBackup)) + geom_bar() +
  geom_text(aes(y = ..count.. -1000, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count')+ ggtitle("OnlineBackup")


p21 <- ggplot(datrecode, aes(x = DeviceProtection)) + geom_bar()+
  geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")), 
                           stat = "count") +ggtitle("DeviceProtection")



p22 <- ggplot(datrecode, aes(x = TechSupport)) + geom_bar()+ 
   geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")), 
                           stat = "count")  +ggtitle("TechSupport")


p23 <- ggplot(datrecode, aes(x = StreamingTV)) + geom_bar() +
  geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
            stat = "count") + ggtitle("StreamingTV")

p24 <- ggplot(datrecode, aes(x = StreamingMovies)) +
  geom_bar() +geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
                        stat = "count") + ggtitle("StreamingMovies")


p25 <- ggplot(datrecode, aes(x = PaperlessBilling)) + 
  geom_bar() +geom_text(aes(y= ..count.. -400, 
                           label = paste0(round(prop.table(..count..),4)*100, "%")),
                        stat = "count") + ggtitle("PaperlessBilling")


grid.arrange(p20,p21,p22,p23,p24,p25, nrow = 2, ncol = 3)

# How many customers have churned in our data?

p13 <- ggplot(datrecode, aes(x = Churn)) +
  geom_bar()+ geom_text(aes(y = ..count.. -400, 
                            label = paste0(round(prop.table(..count..),4)*100, '%')), 
                        stat = 'count') + ggtitle("Churn")+
  theme(plot.title = element_text(hjust = .5))
p13

Statistical Modeling

Supervised

Classification

Random Forest

  • Note: When the classes are imbalanced, as here where fewer customers churn than stay, the model may tend to predict the majority class (“No”, not churned) more often, because doing so yields higher accuracy.
library(tibble)
library(caret)
set.seed(123)
trainIndex <- createDataPartition(datrecode$Churn, p = .7, list=FALSE, times=1)
trainSet <- datrecode[trainIndex,]
testSet <- datrecode[-trainIndex,]
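
caret's `createDataPartition()` is documented to sample within each class of the outcome, which keeps the churn proportion similar in the training and test sets. The idea can be sketched in base R on hypothetical labels (the 70/30 class mix below is invented for illustration). The sketch also computes the share of the majority class in the test set, which is the accuracy a model would reach by always predicting "No".

```r
set.seed(123)

# Hypothetical outcome labels: 70 non-churners and 30 churners
churn <- c(rep("No", 70), rep("Yes", 30))

# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(churn), churn), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train <- churn[train_idx]
test  <- churn[-train_idx]

# Class proportions are preserved in both parts
prop.table(table(train))
prop.table(table(test))

# Accuracy of always predicting the majority class on the test set
baseline_accuracy <- max(prop.table(table(test)))
baseline_accuracy
```

Any model accuracy should be read against this baseline rather than against 50%.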


Model1 <- randomForest(as.factor(Churn) ~ ., data = trainSet,mtry = 4, ntree = 400, 
                       importance= TRUE)

Model1
## 
## Call:
##  randomForest(formula = as.factor(Churn) ~ ., data = trainSet,      mtry = 4, ntree = 400, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 14.5%
## Confusion matrix:
##       No Yes class.error
## No  3322 287  0.07952341
## Yes  425 877  0.32642089
importance(Model1)
##                          No         Yes MeanDecreaseAccuracy MeanDecreaseGini
## gender           -2.8828634  -2.4246713           -3.7345378         30.52423
## SeniorCitizen    10.8580800   0.1094152            9.3583026         27.79091
## Partner           3.3982833   0.4407542            3.3032271         27.84312
## Dependents       -2.9749091   9.7342889            5.3204954         25.36001
## tenure           26.0181782  31.4209342           44.7802826        252.70322
## PhoneService      1.8197101  11.7297989           10.6453395         11.01264
## MultipleLines     3.3933439  12.2979055           11.8788514         28.74293
## InternetService  17.5444257  22.8373573           26.4965962         73.15133
## OnlineSecurity    4.5295764  18.8511354           15.2826782         37.10449
## OnlineBackup      8.0222559  14.7390057           14.7406169         31.76444
## DeviceProtection  9.7705416   0.2064048            9.9093998         26.01393
## TechSupport       1.0548766  20.1539708           14.8994975         30.40627
## StreamingTV      11.4499756   9.3796617           16.4384529         27.74302
## StreamingMovies  12.1572572   9.4221865           17.0537916         27.72715
## Contract         -5.0131795  30.0177061           28.6978169        140.53040
## PaperlessBilling  6.3663897   5.9612556            9.1536206         38.31996
## PaymentMethod     4.7276611   9.7927882           10.9666208         68.68784
## MonthlyCharges   24.9445291  29.2168896           38.7892048        247.03385
## TotalCharges     24.9824655  25.5758890           35.9320283        253.77535
## numAdminTickets  -0.6444313   2.0932069            0.7567089         38.38885
## numTechTickets   51.0852906 107.6261879          106.0643216        299.85094
  • Regarding the class imbalance concern, the class error for “No” (not churned) is 0.0795 (7.95%), while the class error for “Yes” (churned) is 0.3264 (32.64%). This indicates that the model is indeed better at predicting the majority class (not churned), but it’s still able to predict the minority class (churned) to some extent.
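
The class errors and the OOB error rate quoted above follow directly from the OOB confusion matrix printed by the model, where rows are the actual classes and columns are the predicted classes. A small sketch that re-derives them from the four counts shown in the output:

```r
# OOB confusion matrix copied from the model output (rows = actual, cols = predicted)
oob <- matrix(c(3322, 287,
                 425, 877),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(oob_error, 4)
```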
feature_importance <- importance(Model1)
top_10_features <- sort(feature_importance[, "MeanDecreaseAccuracy"], 
                        decreasing = TRUE, index.return = TRUE)$ix[1:10]
feature_names <- rownames(feature_importance)[top_10_features]
top_10_importance_values <- feature_importance[top_10_features, 
              c("No", "Yes", "MeanDecreaseAccuracy", "MeanDecreaseGini")]
data.frame(Feature = feature_names, top_10_importance_values)

Now let’s evaluate our model on the test set.

1. Confusion Matrix

prediction <- predict(Model1,testSet)

confusion_matrix <- confusionMatrix(as.factor(prediction), as.factor(testSet$Churn))
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1418  173
##        Yes  128  385
##                                           
##                Accuracy : 0.8569          
##                  95% CI : (0.8412, 0.8716)
##     No Information Rate : 0.7348          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.01121         
##                                           
##             Sensitivity : 0.9172          
##             Specificity : 0.6900          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.7505          
##              Prevalence : 0.7348          
##          Detection Rate : 0.6740          
##    Detection Prevalence : 0.7562          
##       Balanced Accuracy : 0.8036          
##                                           
##        'Positive' Class : No              
## 
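
Because the output above treats "No" as the positive class, the reported sensitivity describes how well non-churners are identified. For a churn analysis it can be more informative to read the same table with "Yes" as the class of interest. The sketch below recomputes precision, recall and F1 for churners from the four counts in the confusion matrix; `confusionMatrix()` also accepts a `positive` argument that should give an equivalent view.

```r
# Test-set confusion matrix copied from the output (rows = predicted, cols = actual)
cm <- matrix(c(1418, 173,
                128, 385),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"), actual = c("No", "Yes")))

tp <- cm["Yes", "Yes"]  # churners correctly flagged
fp <- cm["Yes", "No"]   # non-churners flagged as churners
fn <- cm["No", "Yes"]   # churners that were missed

recall    <- tp / (tp + fn)  # share of actual churners that were caught
precision <- tp / (tp + fp)  # share of flagged customers that really churned
f1        <- 2 * precision * recall / (precision + recall)

round(c(recall = recall, precision = precision, f1 = f1), 4)
```

The recall for churners equals the Specificity line in the output, and the precision equals the Neg Pred Value line, since both are defined relative to "No" as the positive class.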

Let’s plot the features the model ranks as important against churn

# Bar Graph of Churn VS Contract type 

p26 <- ggplot(datrecode, aes(x = Churn, fill = Contract)) +
  geom_bar(position = "dodge") +
  geom_text(aes(y = ..count.., 
                label = paste0(round(..count.. / sum(..count..), 4) * 100, "%")),
            stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Churn vs Contract", x = "Churn", y = "Count")
p26

# bar graph of Internet Service Vs Churn

p27 <- ggplot(datrecode, aes(x = Churn, fill = InternetService)) +
  geom_bar(position = "dodge") + 
  geom_text(aes(y = ..count.., 
                label = paste0(round(..count.. / sum(..count..), 4) * 100, "%")),
            stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Churn vs Internet Service", x = "Churn", y = "Count")
p27
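
The labels in the two plots above divide each bar's count by the total count, so they show each bar's share of all customers rather than the churn rate within a contract or service type. The difference can be sketched with `prop.table()` on a hypothetical toy table (the counts are invented): without a margin it gives shares of the whole, and with `margin = 1` it gives the rate within each row group.

```r
# Hypothetical toy data: 10 DSL customers (2 churned), 10 fiber optic customers (4 churned)
toy <- data.frame(
  InternetService = c(rep("DSL", 10), rep("Fiber optic", 10)),
  Churn = c(rep("Yes", 2), rep("No", 8), rep("Yes", 4), rep("No", 6)),
  stringsAsFactors = FALSE
)

counts <- table(toy$InternetService, toy$Churn)

# Share of ALL customers in each (service, churn) cell: what the plot labels show
share_of_all <- prop.table(counts)

# Churn rate WITHIN each service type: each row sums to 1
rate_within <- prop.table(counts, margin = 1)

share_of_all["Fiber optic", "Yes"]
rate_within["Fiber optic", "Yes"]
```

In this toy example churned fiber optic customers are 20% of all customers, while the churn rate among fiber optic customers is 40%.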

# Bar graphs of Churn vs Streaming Movies, Streaming TV, Online Security and Tech Support

p28 <- ggplot(datrecode, aes(x = Churn, fill = StreamingMovies)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Streaming Movies", x = "Churn", y = "Count")

p29 <- ggplot(datrecode, aes(x = Churn, fill = StreamingTV)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Streaming TV", x = "Churn", y = "Count")

p30 <- ggplot(datrecode, aes(x = Churn, fill = OnlineSecurity)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Online Security", x = "Churn", y = "Count")

p31 <- ggplot(datrecode, aes(x = Churn, fill = TechSupport)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn vs Tech Support", x = "Churn", y = "Count")

grid.arrange(p28,p29,p30,p31, ncol = 2, nrow =2)

Numerical Variables vs Churn

# Now we'll compare them against Churn to look for useful insights

# Tenure & Churn:

p6 <- ggplot(datrecode, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 5, color = "white", position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Tenure by Churn", 
       x = "Tenure (Months)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))


# TotalCharges & Churn :
  
p7 <- ggplot(datrecode, aes(x = TotalCharges, fill = Churn)) + 
  geom_histogram(binwidth = 200, color = "white", position = "identity", alpha = 0.7) +
  scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
  labs(x = "Total Charges", y = "Frequency", 
       title = "Histogram of Total Charges by Churn") +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))

# MonthlyCharges & Churn :
  
p8 <- ggplot(data = datrecode, aes(x = MonthlyCharges, fill = Churn)) +
  geom_histogram(binwidth = 10, color = "white", position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Monthly Charges by Churn", 
       x = "Monthly Charges ($)", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
  theme_minimal() +
  scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))



grid.arrange(p6,p7,p8, ncol = 1)

  • Churn rate decreases with increasing tenure, implying long-term customers are more satisfied and less likely to churn.

  • Higher churn rate for customers with total charges < 1,000, while negligible for > 3,000, suggesting greater commitment and satisfaction for higher-spending customers.

  • Customers with monthly charges between 60 and 100 exhibit higher churn rates, possibly due to unmet value expectations or competition in this price range.
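
The histograms above show counts, so statements about churn rates are read off by comparing the two overlaid distributions. To put a number on a rate, the share of churners can be computed per bin. A minimal sketch on a hypothetical toy data frame (the values and the bin edges are invented for illustration):

```r
# Hypothetical toy data: tenure in months and churn outcome
toy <- data.frame(
  tenure = c(1, 3, 5, 8, 14, 20, 30, 45, 60, 70),
  Churn  = c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Group tenure into bins (edges chosen for illustration only)
toy$tenure_bin <- cut(toy$tenure, breaks = c(0, 12, 24, 48, 72), include.lowest = TRUE)

# Churn rate per bin: mean of a logical vector is the share of TRUE values
churn_rate <- tapply(toy$Churn == "Yes", toy$tenure_bin, mean)

round(churn_rate, 2)
```

The same pattern applies to TotalCharges or MonthlyCharges by swapping the column and the bin edges.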

# Churn vs numTechTickets
p32 <- ggplot(datrecode, aes(x = Churn, y = numTechTickets)) +
  geom_boxplot() + scale_y_continuous(limits = c(0,9), breaks = seq(0, 9, 1))+
  theme_minimal() +
  labs(title = "Churn vs NumTechTickets", x = "Churn", y = "Number of Technical Tickets")
p32

  • The boxplot suggests that customers who churned generally had more technical tickets than those who did not churn, and the distribution of the number of technical tickets for churned customers is more diverse.
library(ggplot2)

# Scatter plot of tenure vs numTechTickets, colored by Churn

p34 <- ggplot(data = datrecode, aes(x = tenure, y = numTechTickets, color = Churn)) +
  geom_point() +
  labs(title = "Tenure vs. Number of Tech Tickets",
       x = "Tenure",
       y = "Number of Tech Tickets") +
  scale_x_continuous(breaks = seq(0, 100, 10)) + # x-axis ticks every 10 months
  scale_y_continuous(breaks = seq(0, 10, 1)) + # y-axis ticks every 1 ticket
  theme_minimal()
p34

  • Customers with a short tenure who raise fewer tech tickets are more likely to churn. Similarly, customers with a longer tenure who raise more than three tech tickets are also more likely to churn.

  • Customers with a tenure greater than 30 and who raise fewer than three tech tickets are much less likely to churn.

In conclusion, the analysis of customer churn reveals several key findings:

  1. The box plot reveals that customers who churned generally had more technical tickets than those who did not churn, and the distribution of technical tickets for churned customers is more diverse, highlighting the impact of technical issues on customer retention.

  2. Churn rate decreases with increasing tenure, implying that long-term customers are more satisfied and less likely to churn. This emphasizes the importance of nurturing long-term relationships with customers.

  3. Short-tenured customers with fewer tech tickets (1 or 2) and long-tenured customers with more than three tech tickets are more likely to churn. This suggests that both initial service issues and ongoing problems contribute to customer attrition.

  4. Customers with a tenure greater than 30 months and fewer than three tech tickets are much less likely to churn, indicating that long-term satisfaction and minimal service issues help retain customers.

  5. Monthly charges between 60 and 100 result in higher churn rates, possibly due to unmet value expectations or competition in this price range. This highlights the need for businesses to focus on providing competitive pricing and value.

  6. A higher churn rate is observed for customers with total charges below 1,000, while negligible churn is seen for customers with charges above 3,000. This implies that higher-spending customers tend to be more committed and satisfied with the service.

  7. Month-to-month contract types contribute to the highest churn rates, followed by one-year contracts, with a very small proportion of churn for two-year contracts. This suggests that customers with longer contracts may feel more secure and satisfied with the service.

  8. Fiber optic customers who churned account for approximately 18% of all customers, compared with about 6.5% for DSL customers who churned, indicating that internet service type can also influence customer satisfaction and retention.

  9. Customers who do not have streaming movies, streaming TV, online security, and tech support services are more likely to churn. This suggests that the absence of these features may lead to dissatisfaction and ultimately, customer attrition.

To reduce churn rates, businesses should focus on improving customer service, offering competitive pricing and value, providing desired features in service packages, and fostering long-term customer relationships.