This data set contains customer level information for a telecommunication company. Various attributes related to the services used are recorded for each customer.
With the rapid development of the telecommunication industry, the service providers are inclined more towards expansion of the subscriber base. To meet the need of surviving in the competitive environment, the retention of existing customers has become a huge challenge. It is stated that the cost of acquiring a new customer is far more than that for retaining the existing one.
Therefore, it is imperative for the telecommunication industries to use advanced analytics to understand consumer behavior and in turn predict the association of the customers as whether or not they will leave the company.
The total number of records in this data set is 1000. It consists of
14 variables including the response variable with the name
Churn. There are 3 numerical variables and 11 categorical
variables. The predictor variables include sex, marital status, term,
phone service and others. A detailed description of the variables is
given below:
Sex: Sex of the customer - Categorical var
Marital_status: Marital status of the customer -
Categorical var
Term: Term (Displayed in months) - Numerical var
Phone_service: Phone service - Categorical var
international_plan: International plan - Categorical
var
Voice_mail_plan: Voice mail plan - Categorical var
Multiple_line: Multiple line - Categorical var
Internet_service: Internet service - Categorical var
Technical_support: Technical support - Categorical
var
Streaming_videos: Streaming Videos - Categorical var
Agreement_period: Agreement period - Categorical var
Monthly_charges: Monthly Charges - Numerical var
Phone_service: Total Charges - Numerical var
Churn: Churn (Yes or No)
A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt
## Sex Marital_Status Term Phone_service International_plan Voice_mail_plan
## 1 Female Married 16 Yes Yes Yes
## 2 Male Married 70 Yes No Yes
## 3 Female Married 36 Yes No Yes
## 4 Female Married 72 Yes No No
## 5 Female Married 40 Yes Yes No
## 6 Female Single 15 Yes Yes Yes
## Multiple_line Internet_service Technical_support Streaming_Videos
## 1 No Cable Yes No
## 2 No Cable Yes Yes
## 3 No Cable Yes Yes
## 4 Yes Cable Yes Yes
## 5 Yes Cable No Yes
## 6 No No Internet No internet No internet
## Agreement_period Monthly_Charges Total_Charges Churn
## 1 Monthly contract 98.05 1410.25 Yes
## 2 One year contract 75.25 5023.00 No
## 3 Monthly contract 73.35 2379.10 No
## 4 One year contract 112.60 7882.25 No
## 5 Monthly contract 95.05 3646.80 No
## 6 Monthly contract 19.85 255.35 No
The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. The results were as follows:
## Sex Marital_Status Term Phone_service
## Length:1000 Length:1000 Min. : 0.0 Length:1000
## Class :character Class :character 1st Qu.: 8.0 Class :character
## Mode :character Mode :character Median :30.0 Mode :character
## Mean :32.8
## 3rd Qu.:57.0
## Max. :72.0
## International_plan Voice_mail_plan Multiple_line Internet_service
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Technical_support Streaming_Videos Agreement_period Monthly_Charges
## Length:1000 Length:1000 Length:1000 Min. : 18.95
## Class :character Class :character Class :character 1st Qu.: 40.16
## Mode :character Mode :character Mode :character Median : 74.72
## Mean : 66.64
## 3rd Qu.: 90.88
## Max. :116.25
## Total_Charges Churn
## Min. : 5.0 Length:1000
## 1st Qu.: 334.8 Class :character
## Median :1442.3 Mode :character
## Mean :2351.6
## 3rd Qu.:4016.8
## Max. :8476.5
The above summary table indicates that there are no missing values in all the variables.
Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.
The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.
From the above plots, it can be seen that 51.4% of the customers in this
study are male. Majority of customers are married, have a phone service
and a voice mail plan. For the rest of the categorical variables, their
distributions are as follows:
44.4% of the customers have multiple lines. Under the Internet service
category, 17.1% use cable, 28.0% use DSL, 34.1% use Fiber Optic and
20.8% had none. About a third of the customers had technical support and
41% had video streaming.
One of the categorical variables, International Plan, had 3 groups: No, Yes and yes. This was an input error that happened when this data was collected. Below is the table showing the frequency of these groups.
| Groups | Freq |
|---|---|
| No | 429 |
| yes | 262 |
| Yes | 309 |
In other to rectify this, it was decided to create a new variable called grp.IP that will contain only 2 distinct groups of the International plan variable: No and Yes. Below are the distributions of the new International plan, and the remaining variables: Agreement period and the Churn response variable.
57.1% of the customers had an international plan and majority were on a
monthly contract. For this data, 74.1% had not left the company.
There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:
The plot of the histogram showing the distribution of Term shows a
non-symmetric pattern with the highest frequency between 0 and 5 months
and lowest between 35 and 40 months.
This is quite different from the distribution of the total charges which was positively skewed (right skewed). For total charges, the highest frequency was between 0 and 1000 and the lowest was between 8000 and 9000. The distribution appears to have a step wise pattern (That is smaller amounts have higher frequency and larger amounts have lower frequency). The density plot is similar too with a right skew which illustrates that the mean is greater than the median.
The plot of the histogram showing monthly charges is also non-symmetric, with the highest frequency for monthly charge between 20 and 30 and the lowest between 30 and 40. It can also be seen here in the histogram plot that there is a second peak (slightly lower than the first) between 80 and 90. The density plot of the monthly charges is similar to the distribution as represented by the histogram. It shows 2 peaks, the first approximately at 20 and the other (higher peak) approximately at 90.
The boxplot of the distribution of the total charges shows that the median lies between 1700 and 1800. From the plot, there are no visible outliers.
Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.
This was done to determine whether the response variable (churn - which is binary) is independent of the categorical variables. Categorical variables found to be independent of the response variable will be excluded in any of the subsequent models and algorithms. Mosaic plots are convenient to show whether two categorical variables are dependent. When they are independent, all proportions are the same and so the boxes line up in a grid.
From the above mosaic plots, it can be seen that sex, phone service,
voicemail plan, and multiple line appear to be independent of the
response variable, churn. This is because the proportion of churn cases
in the individual categories of these variables appear to be identical.
Churn is not independent of marital status and International plan. The
other mosaic plots are shown below:
In addition to marital status and International plan, Agreement period,
Internet service, technical support and streaming videos are not
independent of the response variable, Churn.
A pearson Chi-square test was carried out to confirm the independence of Sex, Phone service, voice mail plan and multiple line with the binary response variable, Churn. It was found that there was no significant association between each one of them and the response variable at the 0.05 significance level. Below are the results of the chi-square p-values for each of the variables:
| Chisq.sex.p.value | Chisq.Phoneservice.p.value | Chisq.Voicemail.p.value | Chisq.multipleline.p.value |
|---|---|---|---|
| 0.1248683 | 0.3680155 | 0.6651237 | 0.3384263 |
The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.
The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.
The main diagonal stacked density curves show the potential
difference in the distribution of the underlying numeric variable in
Churn and non-Churn groups. This means that the stacked density curves
show the relationship between numeric and categorical variables. These
stacked density curves are not completely overlapped indicating somewhat
correlation between each of these numeric variables and the binary
response variable, Churn.
Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.
Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and multiple line were dropped because of their independence with the response variable, Churn. International plan was also dropped and the new variable, Grp.IP was kept instead. The number of variables in the final dataset was 10.
The following are the variables that will be used for subsequent
modelling. Marital_Status, Term,
Internet_service, Technical_support,
Streaming_Videos, Agreement_period,
Monthly_Charges, Total_Charges,
grp.IP and Churn