Customer churn, or customer attrition, is a significant loss for businesses. This metric is especially important for subscription-based services, such as telecommunications companies. In this project, we aim to conduct a churn analysis using data sourced from IBM sample data sets. We will leverage the R programming language to identify key variables associated with customer churn.
To achieve this objective, we will undertake the following tasks:
We’ll load the packages needed for data cleaning, visualization, and modeling.
library(dplyr)
library(tidyr)
library(stringr)
library(readxl)
library(ggplot2)
library(gridExtra)
library(tidyverse)
library(tibble)
library(caret)
library(randomForest)
dat <- read_excel("C:/Users/asus/Downloads/Customer churn dataset/Telecom Churn Rate Dataset.xlsx")
head(dat)
str(dat)
## tibble [7,043 × 23] (S3: tbl_df/tbl/data.frame)
## $ customerID : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr [1:7043] "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr [1:7043] "Yes" "No" "No" "No" ...
## $ Dependents : chr [1:7043] "No" "No" "No" "No" ...
## $ tenure : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr [1:7043] "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr [1:7043] "No" "No" "No" "Yes" ...
## $ StreamingTV : chr [1:7043] "No" "No" "No" "No" ...
## $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
## $ Contract : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
## $ numAdminTickets : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
## $ numTechTickets : num [1:7043] 0 0 0 3 0 0 0 0 2 0 ...
## $ Churn : chr [1:7043] "No" "No" "Yes" "No" ...
Let’s break down the features in the dataset:
Churn: Whether the customer churned or not (Yes or No) - Target variable
Categorical Variables
SeniorCitizen: Whether the customer is a senior citizen or not (coded 0 or 1 in the raw data)
Partner: Whether the customer has a partner or not (Yes or No)
Dependents: Whether the customer has dependents or not (Yes or No)
PhoneService: Whether the customer has a phone service or not (Yes or No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, or No phone service)
OnlineSecurity: Whether the customer has online security or not (Yes or No)
OnlineBackup: Whether the customer has online backup or not (Yes or No)
DeviceProtection: Whether the customer has device protection or not (Yes or No)
TechSupport: Whether the customer has tech support or not (Yes or No)
StreamingTV: Whether the customer has streaming TV or not (Yes or No)
StreamingMovies: Whether the customer has streaming movies or not (Yes or No)
PaperlessBilling: Whether the customer has paperless billing or not (Yes or No)
gender: The gender of the customer (Male or Female)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
InternetService: The customer’s type of internet service (DSL, Fiber optic, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
Numerical Variables
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
tenure: Number of months the customer has stayed with the company
numAdminTickets: Number of administrative tickets raised by the customer
numTechTickets: Number of technical tickets raised by the customer
# customerID has no significance so we are dropping this column
dat1 <- subset(dat, select = -customerID)
head(dat1)
# Handling Missing, Duplicate Data and null data
sum(duplicated(dat1))
## [1] 17
colSums(is.na(dat1) | dat1 == "")
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges numAdminTickets
## 0 0 11 0
## numTechTickets Churn
## 0 0
# There are 17 duplicate rows and 11 missing values in the TotalCharges column.
# Both counts are small relative to the size of the dataset (7,043 rows),
# so we remove the rows with missing values as well as the duplicate rows.
# Let's call the cleaned data datc.
datc <- dat1 %>%
  na.omit() %>%
  distinct()
sum(duplicated(datc))
## [1] 0
colSums(is.na(datc) | datc == "")
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges numAdminTickets
## 0 0 0 0
## numTechTickets Churn
## 0 0
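The cleaning step above can be tried on a small made-up data frame to see what each call removes. This is a minimal base-R sketch, not the original pipeline: `toy` and its values are invented for illustration, and `unique()` stands in for dplyr's `distinct()`.

```r
# Toy data (invented): one missing TotalCharges and one exact duplicate row
toy <- data.frame(
  tenure       = c(1, 34, 34, 0, 2),
  TotalCharges = c(29.85, 1889.5, 1889.5, NA, 108.15),
  Churn        = c("No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

n_missing   <- sum(is.na(toy$TotalCharges))  # rows with a missing TotalCharges
n_duplicate <- sum(duplicated(toy))          # rows identical to an earlier row

# Drop incomplete rows first, then exact duplicates
toy_clean <- unique(na.omit(toy))

print(c(missing = n_missing, duplicates = n_duplicate, rows_left = nrow(toy_clean)))
```

Note that once `customerID` is dropped, two different customers with identical attributes also count as duplicates, so removing them is a judgment call rather than a pure data-quality fix.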
# Summary statistics of the raw data (before cleaning)
summary(dat)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## numAdminTickets numTechTickets Churn
## Min. :0.0000 Min. :0.0000 Length:7043
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :0.0000 Median :0.0000 Mode :character
## Mean :0.5157 Mean :0.4196
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :5.0000 Max. :9.0000
##
# The "SeniorCitizen" variable is coded as 0 or 1, while the other binary
# variables use "Yes" and "No". To keep the labels consistent and easier to read,
# we recode it as "No" and "Yes".
# The "MultipleLines" variable depends on the "PhoneService" variable:
# when "PhoneService" is "No", "MultipleLines" takes the value "No phone service".
# To simplify our graphics and modeling, we change the "No phone service" response in the
# "MultipleLines" variable to "No".
# The same goes for "No internet service" in the internet add-on variables.
datrecode <- datc %>%
mutate(SeniorCitizen = if_else(SeniorCitizen == 0, "No", "Yes")) %>%
mutate(MultipleLines = if_else(MultipleLines == "No phone service", "No", MultipleLines)) %>%
mutate(across(starts_with("Online") | starts_with("Device") | starts_with("Tech") |
starts_with("Streaming"), ~if_else(. == "No internet service", "No", .)))
head(datrecode)
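To make the effect of the recoding concrete, here is a small base-R sketch on invented values. It mirrors the logic of the `mutate()` calls above but uses `ifelse()` instead of dplyr's `if_else()`; the `toy` data frame is hypothetical.

```r
# Toy columns (invented values) mirroring the recoding step
toy <- data.frame(
  SeniorCitizen  = c(0, 1, 0),
  MultipleLines  = c("No phone service", "Yes", "No"),
  OnlineSecurity = c("No internet service", "No", "Yes"),
  stringsAsFactors = FALSE
)

# 0/1 flag -> "No"/"Yes"
toy$SeniorCitizen <- ifelse(toy$SeniorCitizen == 0, "No", "Yes")

# Collapse the "no service" levels into plain "No"
toy$MultipleLines  <- ifelse(toy$MultipleLines == "No phone service",
                             "No", toy$MultipleLines)
toy$OnlineSecurity <- ifelse(toy$OnlineSecurity == "No internet service",
                             "No", toy$OnlineSecurity)

print(toy)
```

After this step every service column has only two levels, which keeps the bar charts and the model inputs simpler at the cost of no longer distinguishing "declined the add-on" from "could not have it".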
Let’s take a look at numerical Features
# Histogram of Tenure:
p1 <- ggplot(datrecode, aes(x = tenure)) +
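  # boundary = 0 aligns the bins with the axis breaks, so the first bar is not dropped by the x-axis limits
  geom_histogram(binwidth = 5, boundary = 0, color = "white", fill = "#FFA07A") +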
labs(title = "Distribution of Tenure",
x = "Tenure (Months)",
y = "Frequency") +
scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
theme_minimal()
# Histogram of TotalCharges
p2 <- ggplot(datrecode, aes(x = TotalCharges)) +
geom_histogram(binwidth = 200, fill = "#69b3a2", color = "black") +
scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
labs(x = "Total Charges", y = "Frequency",
title = "Histogram of Total Charges")
# Histogram of MonthlyCharges
p3 <- ggplot(data = datrecode, aes(x = MonthlyCharges)) +
geom_histogram(binwidth = 10, color = "white", fill = "#FFA07A") +
labs(title = "Distribution of Monthly Charges",
x = "Monthly Charges ($)",
y = "Frequency") +
scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 2)
The highest frequencies (numbers of customers) occur in the 0-5 and 65-70 month tenure ranges, whereas the intermediate ranges each hold roughly 250 to 400 customers.
The “TotalCharges” histogram is right-skewed: the number of customers falls steadily as total charges increase.
The histogram of “MonthlyCharges” shows a large concentration of customers paying between 20 and 40 dollars and a second, roughly bell-shaped cluster centered near 80 dollars.
# Let's look at numAdminTickets and numTechTickets
# histogram of numAdminTickets
p4 <- ggplot(data = datrecode, aes(x = numAdminTickets)) +
geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
geom_text(aes(y = ..count.. -400, label = paste0(round(prop.table(..count..), 4)*100, "%")),
stat = "count")+labs(title = "Distribution numAdminTickets",
x = "numAdminTickets",
y = "Frequency") +
scale_x_continuous(limits = c(-1, 6), breaks = seq(0, 5, 1)) +
theme_minimal()
# histogram of numTechTickets
p5 <- ggplot(data = datrecode, aes(x = numTechTickets)) +
geom_histogram(binwidth = 1, color = "white", fill = "#FFA07A") +
geom_text(aes(y = ..count.. -400, label = paste0(round(prop.table(..count..), 4)*100, "%")),
stat = "count")+labs(title = "Distribution numTechTickets",
x = "numTechTickets",
y = "Frequency") +
scale_x_continuous(limits = c(-1, 10), breaks = seq(0, 10, 1)) +
theme_minimal()
grid.arrange(p4,p5,ncol = 1)
In the numAdminTickets graph, most customers have not raised any tickets.
The highest number of admin tickets raised by a single customer is 5, but such counts are rare: the third quartile is 0.
Similarly, in the numTechTickets graph, the maximum number of tech tickets raised by a single customer is 9, while at least three quarters of customers raised none.
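The percentage labels on these plots come from the expression `paste0(round(prop.table(..count..), 4) * 100, "%")`. The sketch below reproduces that calculation outside ggplot2 on an invented vector of ticket counts, with `table()` playing the role of `stat = "count"`; the `tickets` values are made up.

```r
# Invented ticket counts for ten customers
tickets <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 5)

# One count per distinct value, as stat = "count" computes per bar
counts <- table(tickets)

# Share of all customers in each bar, formatted like the plot labels
labels <- paste0(round(prop.table(counts), 4) * 100, "%")
names(labels) <- names(counts)
print(labels)

# Share of customers who raised no ticket at all
share_zero <- mean(tickets == 0)
print(share_zero)
```

The same idea applied to `datrecode$numAdminTickets` or `datrecode$numTechTickets` gives the exact shares behind the statement that most customers raised no tickets.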
Let’s take a look at Categorical Features
# Bar graph of Contract
p9 <- ggplot(datrecode, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  # round to 4 decimals, as in the other plots, so labels are not rounded to the nearest 10%
  geom_text(aes(y = ..count.. - 400,
                label = paste0(round(prop.table(..count..), 4) * 100, "%")),
            stat = "count") +
  labs(title = "Distribution of Contract Type",
x = "Contract Type",
y = "Count") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bar graph of gender
colors <- c("#0072B2", "#E69F00")
p10 <- ggplot(data = datrecode, aes(x = gender))+
geom_bar(aes(fill = gender))+
geom_text(aes(y = ..count.. -500, label = paste0(round(prop.table(..count..),4)* 100, '%')),
stat = "count", position = position_stack(vjust = 0.5))+ xlab("Gender") + ylab("Count") + ggtitle("Customer Gender Distribution")+
  scale_fill_manual(values = colors)  # apply the palette defined above
# Arrange plots in a grid
grid.arrange(p9, p10)
The most common contract type is Month-to-Month, followed by Two Year and One Year contracts.
The bar graph indicates an almost equal distribution between genders.
# Bar graph of InternetService
p11 <- ggplot(datrecode, aes(x = InternetService, fill = InternetService)) +
geom_bar() +
labs(title = "Distribution of Internet Service",
x = "Internet Service",
y = "Count") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bar graph of PaymentMethod
p12 <- ggplot(datrecode, aes(x = PaymentMethod, fill = PaymentMethod)) +
geom_bar() +
labs(title = "Distribution of Payment Method",
x = "Payment Method",
y = "Count") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Arrange plots in a grid
grid.arrange(p11, p12, nrow = 1,ncol = 2)
Looking at the bar graph for “Internet Service,” we can see that Fiber optic is the most common choice, followed by DSL, with the remaining customers not using any internet service.
Check it out - Electronic Check is by far the most popular payment method, with the other three options lagging behind by almost 25%. Looks like we’re moving towards a more digital world!
# Plotting Bar graphs of SeniorCitizen, Partner, Dependents, PhoneService,MultipleLines
# and OnlineSecurity.
p14 <- ggplot(datrecode, aes(x = SeniorCitizen)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("SeniorCitizen")
p15 <- ggplot(datrecode, aes(x = Partner)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("Partner")
p16 <- ggplot(datrecode, aes(x = Dependents)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("Dependents")
p17 <- ggplot(datrecode, aes(x = PhoneService)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("PhoneService")
p18 <- ggplot(datrecode, aes(x = MultipleLines)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("MultipleLines")
p19 <- ggplot(datrecode, aes(x = OnlineSecurity)) + geom_bar() +
geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("OnlineSecurity")
# Arrange plots in a grid
grid.arrange(p14,p15,p16,p17,p18,p19, nrow = 2, ncol = 3)
# Bar graphs of OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
# StreamingMovies and PaperlessBilling.
p20 <- ggplot(datrecode, aes(x = OnlineBackup)) + geom_bar() +
geom_text(aes(y = ..count.. -1000,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count')+ ggtitle("OnlineBackup")
p21 <- ggplot(datrecode, aes(x = DeviceProtection)) + geom_bar()+
geom_text(aes(y= ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, "%")),
stat = "count") +ggtitle("DeviceProtection")
p22 <- ggplot(datrecode, aes(x = TechSupport)) + geom_bar()+
geom_text(aes(y= ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, "%")),
stat = "count") +ggtitle("TechSupport")
p23 <- ggplot(datrecode, aes(x = StreamingTV)) + geom_bar() +
geom_text(aes(y= ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, "%")),
stat = "count") + ggtitle("StreamingTV")
p24 <- ggplot(datrecode, aes(x = StreamingMovies)) +
geom_bar() +geom_text(aes(y= ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, "%")),
stat = "count") + ggtitle("StreamingMovies")
p25 <- ggplot(datrecode, aes(x = PaperlessBilling)) +
geom_bar() +geom_text(aes(y= ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, "%")),
stat = "count") + ggtitle("PaperlessBilling")
grid.arrange(p20,p21,p22,p23,p24,p25, nrow = 2, ncol = 3)
# How many customers have churned in our data
p13 <- ggplot(datrecode, aes(x = Churn)) +
geom_bar()+ geom_text(aes(y = ..count.. -400,
label = paste0(round(prop.table(..count..),4)*100, '%')),
stat = 'count') + ggtitle("Churn")+
theme(plot.title = element_text(hjust = .5))
p13
# Split the data into a 70% training set and a 30% test set, stratified by Churn
# (caret was already loaded at the top)
set.seed(123)
trainIndex <- createDataPartition(datrecode$Churn, p = .7, list=FALSE, times=1)
trainSet <- datrecode[trainIndex,]
testSet <- datrecode[-trainIndex,]
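`createDataPartition()` samples within each level of the outcome so that the training and test sets keep roughly the same churn proportion. The base-R sketch below illustrates that idea on invented labels; it approximates the concept and is not caret's actual implementation. The vector `churn` and its 73/27 class balance are assumptions chosen to resemble the data.

```r
set.seed(123)

# Invented labels with roughly the class balance seen in the data
churn <- rep(c("No", "Yes"), times = c(73, 27))

# Sample 70% of the row indices within each class, then pool them
idx_by_class <- split(seq_along(churn), churn)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

train <- churn[train_idx]
test  <- churn[-train_idx]

# Both subsets keep a similar share of churners
print(prop.table(table(train)))
print(prop.table(table(test)))
```

A plain random split would usually give similar proportions on a dataset this size, but stratifying removes the chance of an unlucky split with too few churners in the test set.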
Model1 <- randomForest(as.factor(Churn) ~ ., data = trainSet,mtry = 4, ntree = 400,
importance= TRUE)
Model1
##
## Call:
## randomForest(formula = as.factor(Churn) ~ ., data = trainSet, mtry = 4, ntree = 400, importance = TRUE)
## Type of random forest: classification
## Number of trees: 400
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.5%
## Confusion matrix:
## No Yes class.error
## No 3322 287 0.07952341
## Yes 425 877 0.32642089
importance(Model1)
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## gender -2.8828634 -2.4246713 -3.7345378 30.52423
## SeniorCitizen 10.8580800 0.1094152 9.3583026 27.79091
## Partner 3.3982833 0.4407542 3.3032271 27.84312
## Dependents -2.9749091 9.7342889 5.3204954 25.36001
## tenure 26.0181782 31.4209342 44.7802826 252.70322
## PhoneService 1.8197101 11.7297989 10.6453395 11.01264
## MultipleLines 3.3933439 12.2979055 11.8788514 28.74293
## InternetService 17.5444257 22.8373573 26.4965962 73.15133
## OnlineSecurity 4.5295764 18.8511354 15.2826782 37.10449
## OnlineBackup 8.0222559 14.7390057 14.7406169 31.76444
## DeviceProtection 9.7705416 0.2064048 9.9093998 26.01393
## TechSupport 1.0548766 20.1539708 14.8994975 30.40627
## StreamingTV 11.4499756 9.3796617 16.4384529 27.74302
## StreamingMovies 12.1572572 9.4221865 17.0537916 27.72715
## Contract -5.0131795 30.0177061 28.6978169 140.53040
## PaperlessBilling 6.3663897 5.9612556 9.1536206 38.31996
## PaymentMethod 4.7276611 9.7927882 10.9666208 68.68784
## MonthlyCharges 24.9445291 29.2168896 38.7892048 247.03385
## TotalCharges 24.9824655 25.5758890 35.9320283 253.77535
## numAdminTickets -0.6444313 2.0932069 0.7567089 38.38885
## numTechTickets 51.0852906 107.6261879 106.0643216 299.85094
feature_importance <- importance(Model1)
top_10_features <- sort(feature_importance[, "MeanDecreaseAccuracy"],
decreasing = TRUE, index.return = TRUE)$ix[1:10]
feature_names <- rownames(feature_importance)[top_10_features]
top_10_importance_values <- feature_importance[top_10_features,
c("No", "Yes", "MeanDecreaseAccuracy", "MeanDecreaseGini")]
data.frame(Feature = feature_names, top_10_importance_values)
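The ranking produced by the code above can be checked by hand. The sketch below copies the MeanDecreaseAccuracy column from the `importance(Model1)` output (rounded to two decimals) into a named vector and sorts it; it uses `sort()` and `head()` directly instead of the index-based approach, which is a stylistic alternative rather than the original method.

```r
# MeanDecreaseAccuracy values copied (rounded) from the importance() output
mda <- c(
  gender = -3.73, SeniorCitizen = 9.36, Partner = 3.30, Dependents = 5.32,
  tenure = 44.78, PhoneService = 10.65, MultipleLines = 11.88,
  InternetService = 26.50, OnlineSecurity = 15.28, OnlineBackup = 14.74,
  DeviceProtection = 9.91, TechSupport = 14.90, StreamingTV = 16.44,
  StreamingMovies = 17.05, Contract = 28.70, PaperlessBilling = 9.15,
  PaymentMethod = 10.97, MonthlyCharges = 38.79, TotalCharges = 35.93,
  numAdminTickets = 0.76, numTechTickets = 106.06
)

# Ten features whose permutation hurts accuracy the most
top10 <- head(sort(mda, decreasing = TRUE), 10)
print(top10)
```

On these values, numTechTickets stands far ahead of the rest, followed by tenure, MonthlyCharges, TotalCharges, Contract and InternetService.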
Evaluating the model on the test set: confusion matrix
prediction <- predict(Model1,testSet)
confusion_matrix <- confusionMatrix(as.factor(prediction), as.factor(testSet$Churn))
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1418 173
## Yes 128 385
##
## Accuracy : 0.8569
## 95% CI : (0.8412, 0.8716)
## No Information Rate : 0.7348
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : 0.01121
##
## Sensitivity : 0.9172
## Specificity : 0.6900
## Pos Pred Value : 0.8913
## Neg Pred Value : 0.7505
## Prevalence : 0.7348
## Detection Rate : 0.6740
## Detection Prevalence : 0.7562
## Balanced Accuracy : 0.8036
##
## 'Positive' Class : No
##
Based on the confusion matrix and the performance metrics, the random forest model achieves an accuracy of 85.69% on the test set. Because caret treats ‘No’ as the positive class here, the sensitivity of 91.72% is the share of non-churners correctly identified, and the specificity of 69.00% is the share of actual churners the model catches.
The model therefore identifies the ‘No’ class better than the ‘Yes’ class. The balanced accuracy of 80.36% indicates reasonably good overall performance, although roughly 3 in 10 churners are missed.
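Since churners are the class of interest, it can help to recompute the metrics with ‘Yes’ as the positive class. The sketch below does this by hand from the four counts in the confusion matrix above; the F1 score is an extra metric not reported in the original output.

```r
# Counts copied from the test-set confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(1418, 128, 173, 385), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # churners caught / all actual churners
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # churners caught / all predicted churners
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, recall = recall,
              precision = precision, f1 = f1), 4))
```

The recall and precision here match the Specificity and Neg Pred Value lines of the caret output, as expected when the positive class is flipped. In caret, passing `positive = "Yes"` to `confusionMatrix()` should report the same quantities directly.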
# Bar graph of Churn vs Contract type
# (each label is that bar's share of all customers, not a within-group churn rate)
p26 <- ggplot(datrecode, aes(x = Churn, fill = Contract)) +
geom_bar(position = "dodge") +
geom_text(aes(y = ..count.., label = paste0(round(prop.table
(..count.. / sum(..count..)), 4) * 100, "%")),
stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
labs(title = "Churn vs Contract", x = "Churn", y = "Count")
p26
# Bar graph of Churn vs Internet Service
# (each label is that bar's share of all customers, not a within-group churn rate)
p27 <- ggplot(datrecode, aes(x = Churn, fill = InternetService)) +
geom_bar(position = "dodge") +
geom_text(aes(y = ..count.., label = paste0(round(prop.table
(..count.. / sum(..count..)), 4) * 100, "%")),
stat = "count", position = position_dodge(width = 0.9), vjust = -0.5) +
labs(title = "Churn vs Internet Service", x = "Churn", y = "Count")
p27
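The labels in the two plots above divide each bar's count by the total number of customers, so they appear to show shares of the whole customer base rather than churn rates within each contract or service type. The base-R sketch below contrasts the two normalizations on invented counts; the `toy` data frame and its numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Invented counts, chosen only to illustrate the two ways of normalizing
toy <- data.frame(
  InternetService = rep(c("DSL", "Fiber optic", "No"), times = c(40, 40, 20)),
  Churn = c(rep(c("No", "Yes"), times = c(32, 8)),    # DSL
            rep(c("No", "Yes"), times = c(24, 16)),   # Fiber optic
            rep(c("No", "Yes"), times = c(18, 2))),   # No internet service
  stringsAsFactors = FALSE
)
tab <- table(toy$InternetService, toy$Churn)

share_of_all <- prop.table(tab)              # cell / grand total (what the plot labels show)
rate_within  <- prop.table(tab, margin = 1)  # cell / row total (churn rate per service type)

print(round(share_of_all[, "Yes"], 3))
print(round(rate_within[, "Yes"], 3))
```

In this toy example, fiber-optic churners are 16% of all customers, but the churn rate among fiber-optic customers is 40%. Running the same two `prop.table()` calls on `datrecode` would give the within-group rates for the real data.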
# Bar graphs of Churn vs StreamingMovies, StreamingTV, OnlineSecurity and TechSupport
p28 <- ggplot(datrecode, aes(x = Churn, fill = StreamingMovies)) +
geom_bar(position = "dodge") +
labs(title = "Churn vs Streaming Movies", x = "Churn", y = "Count")
p29 <- ggplot(datrecode, aes(x = Churn, fill = StreamingTV)) +
geom_bar(position = "dodge") +
labs(title = "Churn vs Streaming TV", x = "Churn", y = "Count")
p30 <- ggplot(datrecode, aes(x = Churn, fill = OnlineSecurity)) +
geom_bar(position = "dodge") +
labs(title = "Churn vs Online Security", x = "Churn", y = "Count")
p31 <- ggplot(datrecode, aes(x = Churn, fill = TechSupport)) +
geom_bar(position = "dodge") +
labs(title = "Churn vs Tech Support", x = "Churn", y = "Count")
grid.arrange(p28,p29,p30,p31, ncol = 2, nrow =2)
# Now we'll compare the numerical features against Churn to look for useful insights
# Tenure & Churn:
p6 <- ggplot(datrecode, aes(x = tenure, fill = Churn)) +
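  # boundary = 0 aligns the bins with the axis breaks, so the first bar is not dropped by the x-axis limits
  geom_histogram(binwidth = 5, boundary = 0, color = "white", position = "identity", alpha = 0.7) +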
labs(title = "Distribution of Tenure by Churn",
x = "Tenure (Months)",
y = "Frequency") +
scale_x_continuous(limits = c(0, 80), breaks = seq(0, 80, 5)) +
theme_minimal() +
scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))
# TotalCharges & Churn :
p7 <- ggplot(datrecode, aes(x = TotalCharges, fill = Churn)) +
geom_histogram(binwidth = 200, color = "white", position = "identity", alpha = 0.7) +
scale_x_continuous(breaks = seq(0, 9000, by = 1000)) +
labs(x = "Total Charges", y = "Frequency",
title = "Histogram of Total Charges by Churn") +
theme_minimal() +
scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))
# MonthlyCharges & Churn :
p8 <- ggplot(data = datrecode, aes(x = MonthlyCharges, fill = Churn)) +
geom_histogram(binwidth = 10, color = "white", position = "identity", alpha = 0.7) +
labs(title = "Distribution of Monthly Charges by Churn",
x = "Monthly Charges ($)",
y = "Frequency") +
scale_x_continuous(limits = c(0, 130), breaks = seq(0, 150, 20)) +
theme_minimal() +
scale_fill_manual(values = c("No" = "#FFA07A", "Yes" = "#69b3a2"))
grid.arrange(p6,p7,p8, ncol = 1)
Churn becomes less frequent as tenure increases, suggesting that long-term customers are more satisfied and less likely to churn.
Churn is concentrated among customers with total charges below 1,000 and is negligible above 3,000, suggesting greater commitment and satisfaction among higher-spending customers.
Customers with monthly charges between 60 and 100 exhibit higher churn rates, possibly due to unmet value expectations or competition in this price range.
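The overlaid histograms show counts, so statements about churn rates are read off by eye. One way to check them directly is to bucket a numeric feature and compute the share of churners in each bucket. The sketch below does this for tenure on invented data; the `toy` values and the 24-month band width are assumptions made for illustration.

```r
# Invented tenure/churn pairs for illustration only
toy <- data.frame(
  tenure = c(1, 2, 3, 5, 8, 14, 20, 26, 33, 40, 47, 55, 60, 66, 70, 72),
  Churn  = c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No",
             "No", "Yes", "No", "No", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Bucket tenure into 24-month bands and compute the share of churners in each
toy$tenure_band <- cut(toy$tenure, breaks = c(0, 24, 48, 72))
churn_rate <- tapply(toy$Churn == "Yes", toy$tenure_band, mean)

print(round(churn_rate, 2))
```

Applied to `datrecode`, the same `cut()` and `tapply()` pattern works for `TotalCharges` and `MonthlyCharges` with suitable breaks.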
# Churn vs numTechTickets
p32 <- ggplot(datrecode, aes(x = Churn, y = numTechTickets)) +
geom_boxplot() + scale_y_continuous(limits = c(0,9), breaks = seq(0, 9, 1))+
theme_minimal() +
labs(title = "Churn vs NumTechTickets", x = "Churn", y = "Number of Technical Tickets")
p32
# Scatter plot of tenure vs numTechTickets, colored by Churn
p34 <- ggplot(data = datrecode, aes(x = tenure, y = numTechTickets, color = Churn)) +
geom_point() +
labs(title = "Tenure vs. Number of Tech Tickets",
x = "Tenure",
y = "Number of Tech Tickets") +
scale_x_continuous(breaks = seq(0, 100, 10)) + # Updated interval for x-axis
scale_y_continuous(breaks = seq(0, 10, 1)) + # Updated interval and range for y-axis
theme_minimal()
p34
Customers with a short tenure who raise only a few tech tickets (1 or 2) are more likely to churn. Similarly, customers with a longer tenure who raise more than three tech tickets are also more likely to churn.
Customers with a tenure greater than 30 months who raise fewer than three tech tickets are much less likely to churn.
The box plot reveals that customers who churned generally had more technical tickets than those who did not churn, and the distribution of technical tickets for churned customers is more diverse, highlighting the impact of technical issues on customer retention.
Churn rate decreases with increasing tenure, implying that long-term customers are more satisfied and less likely to churn. This emphasizes the importance of nurturing long-term relationships with customers.
Short-tenured customers with fewer tech tickets (1 or 2) and long-tenured customers with more than three tech tickets are more likely to churn. This suggests that both initial service issues and ongoing problems contribute to customer attrition.
Customers with a tenure greater than 30 months and fewer than three tech tickets are much less likely to churn, indicating that long-term satisfaction and minimal service issues help retain customers.
Monthly charges between 60 and 100 result in higher churn rates, possibly due to unmet value expectations or competition in this price range. This highlights the need for businesses to focus on providing competitive pricing and value.
A higher churn rate is observed for customers with total charges below 1,000, while negligible churn is seen for customers with charges above 3,000. This implies that higher-spending customers tend to be more committed and satisfied with the service.
Month-to-month contract types contribute to the highest churn rates, followed by one-year contracts, with a very small proportion of churn for two-year contracts. This suggests that customers with longer contracts may feel more secure and satisfied with the service.
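Fiber-optic customers who churned account for approximately 18% of all customers, compared with 6.5% for DSL customers who churned (these plot labels are shares of the full customer base, not churn rates within each service type). This indicates that internet service type can also influence customer satisfaction and retention.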
Customers who do not have streaming movies, streaming TV, online security, and tech support services are more likely to churn. This suggests that the absence of these features may lead to dissatisfaction and ultimately, customer attrition.
To reduce churn rates, businesses should focus on improving customer service, offering competitive pricing and value, providing desired features in service packages, and fostering long-term customer relationships.