The dataset used for this project is a dataset from a fictional telecommunications company. This dataset contains information about approximately six thousand of their users, the services that they are using, and their payment details. Similar to any other company, this company aims to maximise the number of users by attracting new users while more importantly, retaining the existing ones. To achieve this goal, we need to analyze the behaviour pattern of the existing users provided by this dataset to predict the churn in time. Ideally, an in-depth exploratory data analysis and data visualization would help the company to make a business decision that will benefit both the company and its users.
Data source: https://www.kaggle.com/radmirzosimov/telecom-users-dataset
Before reading the data, let us prepare the packages that we will be using
library(ggplot2)
library(hrbrthemes)## Warning: package 'hrbrthemes' was built under R version 4.0.5
library(dplyr)
library(tidyr)
library(viridis)
library(ggExtra)## Warning: package 'ggExtra' was built under R version 4.0.5
The data used, “telecom_users.csv”, can be found iunder the sub folder “data_input” and must be put in the same folder. Read the data by running the code below:
telecom <- read.csv("data_input/telecom_users.csv")Let us take a closer look at the data:
head(telecom)dim(telecom)## [1] 5986 22
names(telecom)## [1] "X" "customerID" "gender" "SeniorCitizen"
## [5] "Partner" "Dependents" "tenure" "PhoneService"
## [9] "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup"
## [13] "DeviceProtection" "TechSupport" "StreamingTV" "StreamingMovies"
## [17] "Contract" "PaperlessBilling" "PaymentMethod" "MonthlyCharges"
## [21] "TotalCharges" "Churn"
The explanation regarding each column is shown below:
customerID: customer id
gender: client gender (male / female)
SeniorCitizen: is the client retired (1, 0)
Partner: is the client married (Yes, No)
tenure: how many months a person has been a client of the company
PhoneService: is the telephone service connected (Yes, No)
MultipleLines: are multiple phone lines connected (Yes, No, No phone service)
InternetService: client’s Internet service provider (DSL, Fiber optic, No)
OnlineSecurity: is the online security service connected (Yes, No, No internet service)
OnlineBackup: is the online backup service activated (Yes, No, No internet service)
DeviceProtection: does the client have equipment insurance (Yes, No, No internet service)
TechSupport: is the technical support service connected (Yes, No, No internet service)
StreamingTV: is the streaming TV service connected (Yes, No, No internet service)
StreamingMovies: is the streaming cinema service activated (Yes, No, No internet service)
Contract: type of customer contract (Month-to-month, One year, Two year)
PaperlessBilling: whether the client uses paperless billing (Yes, No)
PaymentMethod: payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: current monthly payment
TotalCharges: the total amount that the client paid for the services for the entire time
Churn: whether there was a churn (Yes or No)
Next, we will check the data frame structure
str(telecom)## 'data.frame': 5986 obs. of 22 variables:
## $ X : int 1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
## $ customerID : chr "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
## $ gender : chr "Male" "Female" "Female" "Male" ...
## $ SeniorCitizen : int 0 0 1 0 0 0 0 0 0 1 ...
## $ Partner : chr "Yes" "No" "Yes" "No" ...
## $ Dependents : chr "Yes" "No" "No" "No" ...
## $ tenure : int 72 44 38 4 2 70 33 1 39 55 ...
## $ PhoneService : chr "Yes" "Yes" "Yes" "Yes" ...
## $ MultipleLines : chr "Yes" "No" "Yes" "No" ...
## $ InternetService : chr "No" "Fiber optic" "Fiber optic" "DSL" ...
## $ OnlineSecurity : chr "No internet service" "No" "No" "No" ...
## $ OnlineBackup : chr "No internet service" "Yes" "No" "No" ...
## $ DeviceProtection: chr "No internet service" "Yes" "No" "No" ...
## $ TechSupport : chr "No internet service" "No" "No" "No" ...
## $ StreamingTV : chr "No internet service" "Yes" "No" "No" ...
## $ StreamingMovies : chr "No internet service" "No" "No" "Yes" ...
## $ Contract : chr "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
## $ PaperlessBilling: chr "No" "Yes" "Yes" "Yes" ...
## $ PaymentMethod : chr "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
## $ MonthlyCharges : num 24.1 88.2 75 55.9 53.5 ...
## $ TotalCharges : num 1735 3973 2870 238 120 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
We need to convert each column into its correct data type
telecom[,c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")] <- lapply(telecom[,c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")],as.factor)
telecom$SeniorCitizen <- as.logical(telecom$SeniorCitizen)We will once again check the data frame structure
str(telecom)## 'data.frame': 5986 obs. of 22 variables:
## $ X : int 1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
## $ customerID : chr "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 1 1 1 2 2 ...
## $ SeniorCitizen : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 1 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ tenure : int 72 44 38 4 2 70 33 1 39 55 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 1 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 3 1 3 1 1 2 3 2 2 3 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 3 2 2 1 1 1 2 1 1 2 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 2 1 1 1 3 3 3 1 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 1 1 1 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 3 3 1 1 3 3 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 2 1 1 1 1 3 1 1 3 3 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 1 1 1 1 1 3 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 2 1 1 3 1 3 3 1 1 3 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 3 1 1 1 1 3 1 1 2 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 1 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 2 2 1 3 3 1 3 4 4 3 ...
## $ MonthlyCharges : num 24.1 88.2 75 55.9 53.5 ...
## $ TotalCharges : num 1735 3973 2870 238 120 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
We will now check for missing values
colSums(is.na(telecom))## X customerID gender SeniorCitizen
## 0 0 0 0
## Partner Dependents tenure PhoneService
## 0 0 0 0
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 0 0 0 0
## DeviceProtection TechSupport StreamingTV StreamingMovies
## 0 0 0 0
## Contract PaperlessBilling PaymentMethod MonthlyCharges
## 0 0 0 0
## TotalCharges Churn
## 10 0
Oh no! Since we have 10 missing values out of 6000, it is okay to omit them all as their deletion is insignificant
telecom <- na.omit(telecom)
dim(telecom)## [1] 5976 22
We have 10 less rows of data and there is no missing value in it
anyNA(telecom)## [1] FALSE
summary(telecom)## X customerID gender SeniorCitizen Partner
## Min. : 0 Length:5976 Female:2932 Mode :logical No :3080
## 1st Qu.:1780 Class :character Male :3044 FALSE:5010 Yes:2896
## Median :3548 Mode :character TRUE :966
## Mean :3535
## 3rd Qu.:5294
## Max. :7042
## Dependents tenure PhoneService MultipleLines
## No :4195 Min. : 1.00 No : 588 No :2843
## Yes:1781 1st Qu.: 9.00 Yes:5388 No phone service: 588
## Median :29.00 Yes :2545
## Mean :32.52
## 3rd Qu.:56.00
## Max. :72.00
## InternetService OnlineSecurity OnlineBackup
## DSL :2064 No :2982 No :2604
## Fiber optic:2627 No internet service:1285 No internet service:1285
## No :1285 Yes :1709 Yes :2087
##
##
##
## DeviceProtection TechSupport
## No :2639 No :2959
## No internet service:1285 No internet service:1285
## Yes :2052 Yes :1732
##
##
##
## StreamingTV StreamingMovies Contract
## No :2388 No :2353 Month-to-month:3269
## No internet service:1285 No internet service:1285 One year :1275
## Yes :2303 Yes :2338 Two year :1432
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2451 Bank transfer (automatic):1306 Min. : 18.25
## Yes:3525 Credit card (automatic) :1302 1st Qu.: 35.75
## Electronic check :2006 Median : 70.42
## Mailed check :1362 Mean : 64.85
## 3rd Qu.: 89.90
## Max. :118.75
## TotalCharges Churn
## Min. : 18.8 No :4389
## 1st Qu.: 404.3 Yes:1587
## Median :1412.2
## Mean :2298.1
## 3rd Qu.:3847.0
## Max. :8684.8
From the summary above, it is possible to conclude that:
The 5986 users is made up of almost the same number of females as males. There are 114 more male users than female users.
Less than 1000 users are retired.
Almost half of all the users are married, while the rest are not married.
By average, the users have been a client to the company for 32 months. The longest user has been a client for 72 months.
Around 10% of the users do not have phone service.
More than 20% of the users do not have internet service.
Most users use the month-to-month contract, taking up more than half the number of users.
The number of users that pay via electronic check are the most among other payment method. Meanwhile, there are roughly the same amount of users that pay via bank transfer, credit card, and mailed check.
In the current month, the users pay an average of 64.80 dollars. The least amount of monthly payment is 18.25 dollars while the most amount is 118.75 dollars.
The average total charge among all users is 2298.1 dollars, while the least is 18.8 dollars and the most is 8648.8 dollars.
Although most users choose not to churn, there is a considerable amount of users that choose to cancel or not renew their subscription.
Relationship between payment method and monthly charges
ggplot(data = telecom, mapping = aes(x = PaymentMethod, y = MonthlyCharges, colour = PaymentMethod)) +
geom_boxplot(outlier.shape = NA, fill = "white")+
theme(legend.position= "none")+
labs(title = "Users' Payment Method and Monthly Charges",
x = "Payment Method",
y = "Month Charges") From the visualization above, it is clear that users that pay via bank transfer, credit card, and electronic check has approximately the same monthly charge value for its average. On the other hand, users that pay via mailed check are more likely to have lower monthly charge value.
Distribution of churn in relation to the users’ tenure and monthly charges
ggplot(data=telecom, aes(x=tenure, group=Churn, fill=Churn)) +
geom_density(adjust=1.5, alpha=.4) +
theme_ipsum()+
labs(title = "Distribution of Churn Based On Tenure",
x = "Tenure(in months)",
y = NULL) In this case, the number of tenure that is considering a churn is found the most when their tenure value is low. This means that most users that do not wish to continue the service are new users.
ggplot(data=telecom, aes(x=MonthlyCharges, group=Churn, fill=Churn)) +
geom_density(adjust=1.5, alpha=.4) +
theme_ipsum()+
labs(title = "Distribution of Churn Based On Monthly Charges",
x = "Monthly Charges",
y = NULL) However, for this case, it is shown that most users that has a high monthly charges usually have a bad user experience due to their high bill and may possibly be looking for other cheaper alternatives. This is shown by the high churn rate for users with high monthly charges. Hence, the company can go and try to retain their high-billed customers.
In conclusion, the distribution of churn is heavily linked with users that are new to the service or users that have high values of monthly charges. In addition, users that pay via mailed check usually have low monthly charges. Hence, the company can take action to retain its customers by bringing in more incentives that may seem appealing to the high monthly charge users and new users.