1 Introduction

This data set contains customer level information for a telecommunication company. Various attributes related to the services used are recorded for each customer.

With the rapid development of the telecommunication industry, the service providers are inclined more towards expansion of the subscriber base. To meet the need of surviving in the competitive environment, the retention of existing customers has become a huge challenge. It is stated that the cost of acquiring a new customer is far more than that for retaining the existing one.

Therefore, it is imperative for the telecommunication industries to use advanced analytics to understand consumer behavior and in turn predict the association of the customers as whether or not they will leave the company.

2 Description of the Data

The total number of records in this data set is 1000. It consists of 14 variables including the response variable with the name Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:

Sex: Sex of the customer - Categorical var

Marital_status: Marital status of the customer - Categorical var

Term: Term (Displayed in months) - Numerical var

Phone_service: Phone service - Categorical var

international_plan: International plan - Categorical var

Voice_mail_plan: Voice mail plan - Categorical var

Multiple_line: Multiple line - Categorical var

Internet_service: Internet service - Categorical var

Technical_support: Technical support - Categorical var

Streaming_videos: Streaming Videos - Categorical var

Agreement_period: Agreement period - Categorical var

Monthly_charges: Monthly Charges - Numerical var

Phone_service: Total Charges - Numerical var

Churn: Churn (Yes or No)

A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt

##      Sex Marital_Status Term Phone_service International_plan Voice_mail_plan
## 1 Female        Married   16           Yes                Yes             Yes
## 2   Male        Married   70           Yes                 No             Yes
## 3 Female        Married   36           Yes                 No             Yes
## 4 Female        Married   72           Yes                 No              No
## 5 Female        Married   40           Yes                Yes              No
## 6 Female         Single   15           Yes                Yes             Yes
##   Multiple_line Internet_service Technical_support Streaming_Videos
## 1            No            Cable               Yes               No
## 2            No            Cable               Yes              Yes
## 3            No            Cable               Yes              Yes
## 4           Yes            Cable               Yes              Yes
## 5           Yes            Cable                No              Yes
## 6            No      No Internet      No internet      No internet 
##    Agreement_period Monthly_Charges Total_Charges Churn
## 1  Monthly contract           98.05       1410.25   Yes
## 2 One year contract           75.25       5023.00    No
## 3  Monthly contract           73.35       2379.10    No
## 4 One year contract          112.60       7882.25    No
## 5  Monthly contract           95.05       3646.80    No
## 6  Monthly contract           19.85        255.35    No

3 EDA for Feature Engineering

The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. The results were as follows:

##      Sex            Marital_Status          Term      Phone_service     
##  Length:1000        Length:1000        Min.   : 0.0   Length:1000       
##  Class :character   Class :character   1st Qu.: 8.0   Class :character  
##  Mode  :character   Mode  :character   Median :30.0   Mode  :character  
##                                        Mean   :32.8                     
##                                        3rd Qu.:57.0                     
##                                        Max.   :72.0                     
##  International_plan Voice_mail_plan    Multiple_line      Internet_service  
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Technical_support  Streaming_Videos   Agreement_period   Monthly_Charges 
##  Length:1000        Length:1000        Length:1000        Min.   : 18.95  
##  Class :character   Class :character   Class :character   1st Qu.: 40.16  
##  Mode  :character   Mode  :character   Mode  :character   Median : 74.72  
##                                                           Mean   : 66.64  
##                                                           3rd Qu.: 90.88  
##                                                           Max.   :116.25  
##  Total_Charges       Churn          
##  Min.   :   5.0   Length:1000       
##  1st Qu.: 334.8   Class :character  
##  Median :1442.3   Mode  :character  
##  Mean   :2351.6                     
##  3rd Qu.:4016.8                     
##  Max.   :8476.5

3.1 Missing Values

The above summary table indicates that there are no missing values in all the variables.

4 Assess Distributions

Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.

4.1 Categorical variables

The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.

From the above plots, it can be seen that 51.4% of the customers in this study are male. Majority of customers are married, have a phone service and a voice mail plan. For the rest of the categorical variables, their distributions are as follows:

44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use Fiber Optic and 20.8% had none. About a third of the customers had technical support and 41% had video streaming.

4.1.1 Regrouping of categorical variables

One of the categorical variables, International Plan, had 3 groups: No, Yes and yes. This was an input error that happened when this data was collected. Below is the table showing the frequency of these groups.

Groups Freq
No 429
yes 262
Yes 309

In other to rectify this, it was decided to create a new variable called grp.IP that will contain only 2 distinct groups of the International plan variable: No and Yes. Below are the distributions of the new International plan, and the remaining variables: Agreement period and the Churn response variable.

57.1% of the customers had an international plan and majority were on a monthly contract. For this data, 74.1% had not left the company.

4.2 Numerical Variables

There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:

The plot of the histogram showing the distribution of Term shows a non-symmetric pattern with the highest frequency between 0 and 5 months and lowest between 35 and 40 months.

This is quite different from the distribution of the total charges which was positively skewed (right skewed). For total charges, the highest frequency was between 0 and 1000 and the lowest was between 8000 and 9000. The distribution appears to have a step wise pattern (That is smaller amounts have higher frequency and larger amounts have lower frequency). The density plot is similar too with a right skew which illustrates that the mean is greater than the median.

The plot of the histogram showing monthly charges is also non-symmetric, with the highest frequency for monthly charge between 20 and 30 and the lowest between 30 and 40. It can also be seen here in the histogram plot that there is a second peak (slightly lower than the first) between 80 and 90. The density plot of the monthly charges is similar to the distribution as represented by the histogram. It shows 2 peaks, the first approximately at 20 and the other (higher peak) approximately at 90.

The boxplot of the distribution of the total charges shows that the median lies between 1700 and 1800. From the plot, there are no visible outliers.

5 Pairwise Associations

Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.

5.1 2 categorical variables

This was done to determine whether the response variable (churn - which is binary) is independent of the categorical variables. Categorical variables found to be independent of the response variable will be excluded in any of the subsequent models and algorithms. Mosaic plots are convenient to show whether two categorical variables are dependent. When they are independent, all proportions are the same and so the boxes line up in a grid.

5.1.1 Mosaic Plots

From the above mosaic plots, it can be seen that sex, phone service, voicemail plan, and multiple line appear to be independent of the response variable, churn. This is because the proportion of churn cases in the individual categories of these variables appear to be identical. Churn is not independent of marital status and International plan. The other mosaic plots are shown below:

In addition to marital status and International plan, Agreement period, Internet service, technical support and streaming videos are not independent of the response variable, Churn.

5.1.2 Pearson Chi-Square Test

A pearson Chi-square test was carried out to confirm the independence of Sex, Phone service, voice mail plan and multiple line with the binary response variable, Churn. It was found that there was no significant association between each one of them and the response variable at the 0.05 significance level. Below are the results of the chi-square p-values for each of the variables:

Chisq.sex.p.value Chisq.Phoneservice.p.value Chisq.Voicemail.p.value Chisq.multipleline.p.value
0.1248683 0.3680155 0.6651237 0.3384263

5.2 Two Numerical Variables

The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.

The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.

The main diagonal stacked density curves show the potential difference in the distribution of the underlying numeric variable in Churn and non-Churn groups. This means that the stacked density curves show the relationship between numeric and categorical variables. These stacked density curves are not completely overlapped indicating somewhat correlation between each of these numeric variables and the binary response variable, Churn.

Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.

6 Conclusion

Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and multiple line were dropped because of their independence with the response variable, Churn. International plan was also dropped and the new variable, Grp.IP was kept instead. The number of variables in the final dataset was 10.

The following are the variables that will be used for subsequent modelling. Marital_Status, Term, Internet_service, Technical_support, Streaming_Videos, Agreement_period, Monthly_Charges, Total_Charges, grp.IP and Churn