1 Introduction

The dataset used for this project is a dataset from a fictional telecommunications company. This dataset contains information about approximately six thousand of their users, the services that they are using, and their payment details. Similar to any other company, this company aims to maximise the number of users by attracting new users while more importantly, retaining the existing ones. To achieve this goal, we need to analyze the behaviour pattern of the existing users provided by this dataset to predict the churn in time. Ideally, an in-depth exploratory data analysis and data visualization would help the company to make a business decision that will benefit both the company and its users.

Data source: https://www.kaggle.com/radmirzosimov/telecom-users-dataset

2 Set Up

Before reading the data, let us prepare the packages that we will be using

library(ggplot2)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.0.5
library(dplyr)
library(tidyr)
library(viridis)
library(ggExtra)
## Warning: package 'ggExtra' was built under R version 4.0.5

3 Data Preprocessing

3.1 Read Data

The data used, “telecom_users.csv”, can be found iunder the sub folder “data_input” and must be put in the same folder. Read the data by running the code below:

telecom <- read.csv("data_input/telecom_users.csv")

3.2 Data Inspection

Let us take a closer look at the data:

head(telecom)
dim(telecom)
## [1] 5986   22
names(telecom)
##  [1] "X"                "customerID"       "gender"           "SeniorCitizen"   
##  [5] "Partner"          "Dependents"       "tenure"           "PhoneService"    
##  [9] "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"    
## [13] "DeviceProtection" "TechSupport"      "StreamingTV"      "StreamingMovies" 
## [17] "Contract"         "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"  
## [21] "TotalCharges"     "Churn"

The explanation regarding each column is shown below:

  • customerID: customer id

  • gender: client gender (male / female)

  • SeniorCitizen: is the client retired (1, 0)

  • Partner: is the client married (Yes, No)

  • tenure: how many months a person has been a client of the company

  • PhoneService: is the telephone service connected (Yes, No)

  • MultipleLines: are multiple phone lines connected (Yes, No, No phone service)

  • InternetService: client’s Internet service provider (DSL, Fiber optic, No)

  • OnlineSecurity: is the online security service connected (Yes, No, No internet service)

  • OnlineBackup: is the online backup service activated (Yes, No, No internet service)

  • DeviceProtection: does the client have equipment insurance (Yes, No, No internet service)

  • TechSupport: is the technical support service connected (Yes, No, No internet service)

  • StreamingTV: is the streaming TV service connected (Yes, No, No internet service)

  • StreamingMovies: is the streaming cinema service activated (Yes, No, No internet service)

  • Contract: type of customer contract (Month-to-month, One year, Two year)

  • PaperlessBilling: whether the client uses paperless billing (Yes, No)

  • PaymentMethod: payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

  • MonthlyCharges: current monthly payment

  • TotalCharges: the total amount that the client paid for the services for the entire time

  • Churn: whether there was a churn (Yes or No)

Next, we will check the data frame structure

str(telecom)
## 'data.frame':    5986 obs. of  22 variables:
##  $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
##  $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
##  $ gender          : chr  "Male" "Female" "Female" "Male" ...
##  $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
##  $ Partner         : chr  "Yes" "No" "Yes" "No" ...
##  $ Dependents      : chr  "Yes" "No" "No" "No" ...
##  $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
##  $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
##  $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
##  $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
##  $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
##  $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
##  $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
##  $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
##  $ StreamingMovies : chr  "No internet service" "No" "No" "Yes" ...
##  $ Contract        : chr  "Two year" "Month-to-month" "Month-to-month" "Month-to-month" ...
##  $ PaperlessBilling: chr  "No" "Yes" "Yes" "Yes" ...
##  $ PaymentMethod   : chr  "Credit card (automatic)" "Credit card (automatic)" "Bank transfer (automatic)" "Electronic check" ...
##  $ MonthlyCharges  : num  24.1 88.2 75 55.9 53.5 ...
##  $ TotalCharges    : num  1735 3973 2870 238 120 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

3.3 Change Column Type

We need to convert each column into its correct data type

telecom[,c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")] <- lapply(telecom[,c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")],as.factor)

telecom$SeniorCitizen <- as.logical(telecom$SeniorCitizen)

We will once again check the data frame structure

str(telecom)
## 'data.frame':    5986 obs. of  22 variables:
##  $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
##  $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 1 1 1 2 2 ...
##  $ SeniorCitizen   : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 1 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 1 1 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 3 1 3 1 1 2 3 2 2 3 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 3 2 2 1 1 1 2 1 1 2 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 2 1 1 1 3 3 3 1 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 1 1 1 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 3 3 1 1 3 3 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 2 1 1 1 1 3 1 1 3 3 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 2 3 1 1 1 1 1 1 1 3 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 2 1 1 3 1 3 3 1 1 3 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 3 1 1 1 1 3 1 1 2 1 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 1 2 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 2 2 1 3 3 1 3 4 4 3 ...
##  $ MonthlyCharges  : num  24.1 88.2 75 55.9 53.5 ...
##  $ TotalCharges    : num  1735 3973 2870 238 120 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...

3.4 Check Missing Values

We will now check for missing values

colSums(is.na(telecom))
##                X       customerID           gender    SeniorCitizen 
##                0                0                0                0 
##          Partner       Dependents           tenure     PhoneService 
##                0                0                0                0 
##    MultipleLines  InternetService   OnlineSecurity     OnlineBackup 
##                0                0                0                0 
## DeviceProtection      TechSupport      StreamingTV  StreamingMovies 
##                0                0                0                0 
##         Contract PaperlessBilling    PaymentMethod   MonthlyCharges 
##                0                0                0                0 
##     TotalCharges            Churn 
##               10                0

Oh no! Since we have 10 missing values out of 6000, it is okay to omit them all as their deletion is insignificant

telecom <- na.omit(telecom)
dim(telecom)
## [1] 5976   22

We have 10 less rows of data and there is no missing value in it

anyNA(telecom)
## [1] FALSE

4 Practical Statistics

summary(telecom)
##        X         customerID           gender     SeniorCitizen   Partner   
##  Min.   :   0   Length:5976        Female:2932   Mode :logical   No :3080  
##  1st Qu.:1780   Class :character   Male  :3044   FALSE:5010      Yes:2896  
##  Median :3548   Mode  :character                 TRUE :966                 
##  Mean   :3535                                                              
##  3rd Qu.:5294                                                              
##  Max.   :7042                                                              
##  Dependents     tenure      PhoneService          MultipleLines 
##  No :4195   Min.   : 1.00   No : 588     No              :2843  
##  Yes:1781   1st Qu.: 9.00   Yes:5388     No phone service: 588  
##             Median :29.00                Yes             :2545  
##             Mean   :32.52                                       
##             3rd Qu.:56.00                                       
##             Max.   :72.00                                       
##     InternetService             OnlineSecurity              OnlineBackup 
##  DSL        :2064   No                 :2982   No                 :2604  
##  Fiber optic:2627   No internet service:1285   No internet service:1285  
##  No         :1285   Yes                :1709   Yes                :2087  
##                                                                          
##                                                                          
##                                                                          
##             DeviceProtection              TechSupport  
##  No                 :2639    No                 :2959  
##  No internet service:1285    No internet service:1285  
##  Yes                :2052    Yes                :1732  
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2388   No                 :2353   Month-to-month:3269  
##  No internet service:1285   No internet service:1285   One year      :1275  
##  Yes                :2303   Yes                :2338   Two year      :1432  
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2451         Bank transfer (automatic):1306   Min.   : 18.25  
##  Yes:3525         Credit card (automatic)  :1302   1st Qu.: 35.75  
##                   Electronic check         :2006   Median : 70.42  
##                   Mailed check             :1362   Mean   : 64.85  
##                                                    3rd Qu.: 89.90  
##                                                    Max.   :118.75  
##   TotalCharges    Churn     
##  Min.   :  18.8   No :4389  
##  1st Qu.: 404.3   Yes:1587  
##  Median :1412.2             
##  Mean   :2298.1             
##  3rd Qu.:3847.0             
##  Max.   :8684.8

From the summary above, it is possible to conclude that:

  1. The 5986 users is made up of almost the same number of females as males. There are 114 more male users than female users.

  2. Less than 1000 users are retired.

  3. Almost half of all the users are married, while the rest are not married.

  4. By average, the users have been a client to the company for 32 months. The longest user has been a client for 72 months.

  5. Around 10% of the users do not have phone service.

  6. More than 20% of the users do not have internet service.

  7. Most users use the month-to-month contract, taking up more than half the number of users.

  8. The number of users that pay via electronic check are the most among other payment method. Meanwhile, there are roughly the same amount of users that pay via bank transfer, credit card, and mailed check.

  9. In the current month, the users pay an average of 64.80 dollars. The least amount of monthly payment is 18.25 dollars while the most amount is 118.75 dollars.

  10. The average total charge among all users is 2298.1 dollars, while the least is 18.8 dollars and the most is 8648.8 dollars.

  11. Although most users choose not to churn, there is a considerable amount of users that choose to cancel or not renew their subscription.

5 Data Plotting

Relationship between payment method and monthly charges

ggplot(data = telecom, mapping = aes(x = PaymentMethod, y = MonthlyCharges, colour = PaymentMethod)) +
  geom_boxplot(outlier.shape = NA, fill = "white")+
   theme(legend.position= "none")+
  labs(title = "Users' Payment Method and Monthly Charges",
       x = "Payment Method",
       y = "Month Charges")

From the visualization above, it is clear that users that pay via bank transfer, credit card, and electronic check has approximately the same monthly charge value for its average. On the other hand, users that pay via mailed check are more likely to have lower monthly charge value.

Distribution of churn in relation to the users’ tenure and monthly charges

ggplot(data=telecom, aes(x=tenure, group=Churn, fill=Churn)) +
    geom_density(adjust=1.5, alpha=.4) +
    theme_ipsum()+
  labs(title = "Distribution of Churn Based On Tenure",
       x = "Tenure(in months)",
       y = NULL)

In this case, the number of tenure that is considering a churn is found the most when their tenure value is low. This means that most users that do not wish to continue the service are new users.

ggplot(data=telecom, aes(x=MonthlyCharges, group=Churn, fill=Churn)) +
    geom_density(adjust=1.5, alpha=.4) +
    theme_ipsum()+
  labs(title = "Distribution of Churn Based On Monthly Charges",
       x = "Monthly Charges",
       y = NULL)

However, for this case, it is shown that most users that has a high monthly charges usually have a bad user experience due to their high bill and may possibly be looking for other cheaper alternatives. This is shown by the high churn rate for users with high monthly charges. Hence, the company can go and try to retain their high-billed customers.

6 Conclusion

In conclusion, the distribution of churn is heavily linked with users that are new to the service or users that have high values of monthly charges. In addition, users that pay via mailed check usually have low monthly charges. Hence, the company can take action to retain its customers by bringing in more incentives that may seem appealing to the high monthly charge users and new users.