This data set contains customer-level information for a telecommunications company. Each customer has a unique set of characteristics relating to the services they have used.
The telecommunications sector is growing quickly, and service providers are increasingly focused on expanding their subscriber bases. In such a competitive industry, retaining current customers has become one of the biggest challenges. Acquiring a new customer is widely reported to cost significantly more than keeping an existing one.
Therefore, it is crucial for the telecommunications sector to employ advanced analytics to understand customer behavior and predict whether customers are likely to leave the business.
The following are the questions and objectives that describe and explain the main purpose of this project.
The response variable is Churn which is a binary
variable with two values, yes and no. The
value yes means that a customer left the company while
no means that a customer is still active.
A logistic regression model will be used to analyse the relationship between the binary response variable, Churn, and the predictor variables, that is, how the predictors affect the probability of observing the larger value of the response (Churn = yes).
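In symbols, the model form can be sketched as follows, where \(p\) denotes the probability that a customer churns and \(x_1, \dots, x_k\) stand for the predictors. These symbols are introduced here for illustration and do not appear in the original analysis.

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]

Under this form, a positive coefficient raises the odds of churn and a negative coefficient lowers them.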
The total number of records in this data set is 1000. It consists of
14 variables including the response variable with the name
Churn. There are 3 numerical variables and 11 categorical
variables. The predictor variables include sex, marital status, term,
phone service and others. A detailed description of the variables is
given below:
Sex: Sex of the customer - Categorical variable
Marital_Status: Marital status of the customer - Categorical variable
Term: Term (displayed in months) - Numerical variable
Phone_service: Phone service - Categorical variable
International_plan: International plan - Categorical variable
Voice_mail_plan: Voice mail plan - Categorical variable
Multiple_line: Multiple line - Categorical variable
Internet_service: Internet service - Categorical variable
Technical_support: Technical support - Categorical variable
Streaming_Videos: Streaming videos - Categorical variable
Agreement_period: Agreement period - Categorical variable
Monthly_Charges: Monthly charges - Numerical variable
Total_Charges: Total charges - Numerical variable
Churn: Churn (Yes or No) - Categorical variable (response)
A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt
## Sex Marital_Status Term Phone_service International_plan Voice_mail_plan
## 1 Female Married 16 Yes Yes Yes
## 2 Male Married 70 Yes No Yes
## 3 Female Married 36 Yes No Yes
## 4 Female Married 72 Yes No No
## 5 Female Married 40 Yes Yes No
## 6 Female Single 15 Yes Yes Yes
## Multiple_line Internet_service Technical_support Streaming_Videos
## 1 No Cable Yes No
## 2 No Cable Yes Yes
## 3 No Cable Yes Yes
## 4 Yes Cable Yes Yes
## 5 Yes Cable No Yes
## 6 No No Internet No internet No internet
## Agreement_period Monthly_Charges Total_Charges Churn
## 1 Monthly contract 98.05 1410.25 Yes
## 2 One year contract 75.25 5023.00 No
## 3 Monthly contract 73.35 2379.10 No
## 4 One year contract 112.60 7882.25 No
## 5 Monthly contract 95.05 3646.80 No
## 6 Monthly contract 19.85 255.35 No
The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. The results were as follows:
## Sex Marital_Status Term Phone_service
## Length:1000 Length:1000 Min. : 0.0 Length:1000
## Class :character Class :character 1st Qu.: 8.0 Class :character
## Mode :character Mode :character Median :30.0 Mode :character
## Mean :32.8
## 3rd Qu.:57.0
## Max. :72.0
## International_plan Voice_mail_plan Multiple_line Internet_service
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Technical_support Streaming_Videos Agreement_period Monthly_Charges
## Length:1000 Length:1000 Length:1000 Min. : 18.95
## Class :character Class :character Class :character 1st Qu.: 40.16
## Mode :character Mode :character Mode :character Median : 74.72
## Mean : 66.64
## 3rd Qu.: 90.88
## Max. :116.25
## Total_Charges Churn
## Min. : 5.0 Length:1000
## 1st Qu.: 334.8 Class :character
## Median :1442.3 Mode :character
## Mean :2351.6
## 3rd Qu.:4016.8
## Max. :8476.5
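The above summary table reports no NA counts for the numeric variables, indicating no missing values there. The summary of a character variable shows only its length, class and mode, so missing values in the categorical variables are not ruled out by this output alone.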
Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.
The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.
From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.
44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use fiber optic and 20.8% have no internet service. The majority are on a monthly contract. In this data, 74.1% of customers had not left the company.
One of the categorical variables, International Plan, had 3 groups: No, Yes and yes. This was an input error that happened when this data was collected. Below is the table showing the frequency of these groups.
| Groups | Freq |
|---|---|
| No | 429 |
| yes | 262 |
| Yes | 309 |
In order to rectify this, a new variable called grp.IP was created that contains only 2 distinct groups of the International plan variable: No and Yes.
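The recoding step can be illustrated with a small sketch. It is written in Python purely for illustration and is not the code used in the report. The helper name `normalise_label` is hypothetical; the counts come from the frequency table above.

```python
from collections import Counter

# Frequencies of the raw International plan labels, taken from the table above.
raw_counts = {"No": 429, "yes": 262, "Yes": 309}


def normalise_label(label: str) -> str:
    """Map labels that differ only in case or padding to one canonical form."""
    return label.strip().capitalize()


# Add up the frequencies of labels that normalise to the same group.
merged = Counter()
for label, freq in raw_counts.items():
    merged[normalise_label(label)] += freq

print(dict(merged))
```

After merging, the Yes group holds 262 + 309 = 571 customers out of 1000.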
Also, Technical support and Streaming videos each had 3 groups: Yes, No and No internet. No and No internet were combined into a single group because in both cases the customer does not have the service.
57.1% of the customers had an international plan. About a third of the
customers had technical support and 41% had video streaming.
There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:
The plot of the histogram showing the distribution of Term shows a
non-symmetric pattern with the highest frequency between 0 and 5 months
and lowest between 35 and 40 months.
This is quite different from the distribution of total charges, which is right-skewed, so the mean is greater than the median. The highest frequency is between 0 and 1000 and the lowest is between 8000 and 9000. The distribution has a stepwise pattern: smaller amounts have higher frequencies and larger amounts have lower frequencies.
The distribution of monthly charges is represented by the density plot. It is bimodal, with a first peak at approximately 20 (in the 20 to 30 range) and a second, higher peak at approximately 90 (in the 80 to 90 range). The lowest density between the two peaks is at approximately 40.
Because of this bimodal shape, monthly charges will be discretized into a grouped variable for future models and algorithms. The variable ranges from 18.95 to 116.25 and was grouped as follows:
less than 30: low charges
30 to 80 : moderate charges
greater than 80: high charges
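The grouping rule can be sketched as a small function. This is an illustrative Python sketch, not the code used in the report. The function name `group_monthly_charges` is hypothetical, and putting the boundary values 30 and 80 in the moderate group is an assumption, since the rule above does not say which side they fall on.

```python
def group_monthly_charges(charge: float) -> str:
    """Assign a monthly charge to the Low / Moderate / High group.

    Assumption: the boundary values 30 and 80 fall in the Moderate group;
    the report does not say which side the boundaries belong to.
    """
    if charge < 30:
        return "Low"
    if charge <= 80:
        return "Moderate"
    return "High"


# Monthly charges of the first six customers shown in the data preview.
charges = [98.05, 75.25, 73.35, 112.60, 95.05, 19.85]
groups = [group_monthly_charges(c) for c in charges]
print(groups)
```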
The following table shows the frequency of the grouped variable, grp.month
| Var1 | Freq |
|---|---|
| High | 406 |
| Low | 223 |
| Moderate | 371 |
Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.
This was done to determine whether the response variable (Churn, which is binary) is independent of the categorical variables. Categorical variables found to be independent of the response variable will be excluded from the subsequent models and algorithms. Mosaic plots are a convenient way to show whether two categorical variables are dependent. When they are independent, all proportions are the same and so the boxes line up in a grid.
From the above mosaic plots, it can be seen that sex, phone service, voicemail plan, and multiple line appear to be independent of the response variable, churn. This is because the proportion of churn cases is nearly identical across the categories of each of these variables. Churn is not independent of marital status and International plan. The other mosaic plots are shown below:
In addition to marital status and International plan, Agreement period,
Internet service, monthly charges (grouped), technical support and
streaming videos are not independent of the response variable,
Churn.
A Pearson chi-square test was carried out to check the independence of Sex, Phone service, Voice mail plan and Multiple line from the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. Below are the chi-square p-values for each of the variables:
| Chisq.sex.p.value | Chisq.Phoneservice.p.value | Chisq.Voicemail.p.value | Chisq.multipleline.p.value |
|---|---|---|---|
| 0.1248683 | 0.3680155 | 0.6651237 | 0.3384263 |
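For readers who want to see how such a test works, here is a minimal Python sketch of the Pearson chi-square test for a 2x2 table. It is not the code used in the report. The counts are hypothetical, because the underlying cross-tabulations are not shown, and no continuity correction is applied, so the values need not match what a given statistical package reports.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts (rows: two categories, columns: churn No / Yes).
hypothetical = [[370, 130], [371, 129]]
stat, p = chi_square_2x2(hypothetical)
print(round(stat, 4), round(p, 4))
```

A p-value above 0.05, as in the table above, means the hypothesis of independence is not rejected.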
The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.
The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.
The stacked density curves on the main diagonal show the potential difference in the distribution of the underlying numeric variable between the Churn and non-Churn groups. In other words, they show the relationship between each numeric variable and the categorical response. The curves for the two groups do not completely overlap, which indicates some association between each of these numeric variables and the binary response variable, Churn.
Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.
Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and multiple line were dropped because they appeared to be independent of the response variable, Churn.
International plan was also dropped and the new variable, grp.IP, was kept instead. grp.month was also kept in the dataset as an alternative to its numerical counterpart, monthly charges, for modelling. The number of variables in the final dataset was 11.
The following are the variables that will be used for subsequent
modelling. Marital_Status, Term,
Internet_service, tech_support,
stream_videos, Agreement_period,
Monthly_Charges, grp.month,
Total_Charges, grp.IP and
Churn
In building a logistic model for this analysis, it is necessary to make sure that all assumptions are satisfied. The following are the assumptions of a logistic model:
The response variable must be binary. This is true for this data. The values for the response variable, churn, are yes and no.
The predictor variables are assumed not to be highly correlated with one another. Since the primary aim of this analysis is to predict which customers are more likely to churn, rather than to isolate the role of each predictor variable, no steps were taken to reduce multicollinearity.
The functional form of the predictor variables is correctly specified.
Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are Marital status, Streaming videos, technical support, and international plan. Agreement period and grp.month have 3 groups each while Internet service has 4 groups. All the character variables were changed to factors with different levels.
The numeric variables are Term, Monthly charges and Total charges. In total, there are 9 predictor variables (each model can only contain either the continuous monthly charges or the grouped variable).
The predictors for our model are:
Marital.status: Marital status of the customer - factor
with 2 levels
Term: Term (Displayed in months) - Numerical
variable
International.plan: International plan - factor with 2
levels
Internet.service: Internet service - factor with 4
levels
technical.support: Technical support - factor with 2
levels
streaming.videos: Streaming Videos - factor with 2
levels
Agreement_period: Agreement period - factor with 3
levels
Monthly_Charges: Monthly Charges - Numerical variable
OR grpd.month : Monthly charges - factor
with 3 levels
Total_Charges: Total charges - numerical variable
First, a logistic regression model was built that contains all the predictor variables in the data set, with monthly charges as a numeric variable. This is called the first model.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.8334170 | 0.8265047 | -1.0083633 | 0.3132801 |
| marital.statusSingle | 0.5372165 | 0.3585663 | 1.4982347 | 0.1340723 |
| Term | -0.0306097 | 0.0135583 | -2.2576403 | 0.0239681 |
| technical.supportYes | -0.7406359 | 0.2139030 | -3.4624854 | 0.0005352 |
| internet.serviceDSL | -0.4170019 | 0.3358900 | -1.2414836 | 0.2144272 |
| internet.serviceFiber optic | -0.3456259 | 0.2248026 | -1.5374642 | 0.1241797 |
| internet.serviceNo Internet | -2.0864226 | 0.6082339 | -3.4302965 | 0.0006029 |
| streaming.videosYes | 0.0601511 | 0.2523427 | 0.2383705 | 0.8115937 |
| agreement.periodOne year contract | -1.5935134 | 0.3026089 | -5.2659178 | 0.0000001 |
| agreement.periodTwo year contract | -1.7055952 | 0.3929081 | -4.3409525 | 0.0000142 |
| International.planYes | 0.1558211 | 0.2008870 | 0.7756652 | 0.4379467 |
| Monthly_Charges | 0.0173013 | 0.0108412 | 1.5958830 | 0.1105149 |
| Total_Charges | 0.0001754 | 0.0001538 | 1.1406038 | 0.2540349 |
The AIC of the first model is 881.7286. It has 12 coefficients besides the intercept, arising from 9 predictors. The terms significant at the .05 level are: Term (p=0.0239681), Technical support (p=0.0005352), Internet service-no internet (p=0.0006029), one year agreement period (p=0.0000001), and two year agreement period (p=0.0000142).
Then another model containing all the predictors, but this time with monthly charges as a factor, was built. This is called the second model.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.5920773 | 0.3527946 | 1.6782492 | 0.0932985 |
| marital.statusSingle | 0.1919100 | 0.3146996 | 0.6098196 | 0.5419813 |
| Term | -0.0336493 | 0.0131886 | -2.5513906 | 0.0107294 |
| technical.supportYes | -0.7098222 | 0.2122457 | -3.3443414 | 0.0008248 |
| internet.serviceDSL | -0.6291153 | 0.2958855 | -2.1262121 | 0.0334856 |
| internet.serviceFiber optic | -0.3212021 | 0.2230693 | -1.4399211 | 0.1498897 |
| internet.serviceNo Internet | -3.0963883 | 0.7207580 | -4.2960167 | 0.0000174 |
| streaming.videosYes | 0.1840132 | 0.2325419 | 0.7913119 | 0.4287620 |
| agreement.periodOne year contract | -1.5746912 | 0.3024023 | -5.2072730 | 0.0000002 |
| agreement.periodTwo year contract | -1.6864130 | 0.3949791 | -4.2696260 | 0.0000196 |
| International.planYes | 0.2394340 | 0.2038002 | 1.1748465 | 0.2400561 |
| grpd.monthLow | 0.1892848 | 0.6708755 | 0.2821459 | 0.7778316 |
| grpd.monthModerate | -0.3335832 | 0.2777887 | -1.2008523 | 0.2298085 |
| Total_Charges | 0.0002338 | 0.0001424 | 1.6412233 | 0.1007511 |
The AIC of the second model is 883.9549. It has 13 coefficients besides the intercept. Here, the terms that were significant at the .05 level are: Term (p=0.0107294), Technical support (p=0.0008248), Internet service-DSL (p=0.0334856), Internet service-no internet (p=0.0000174), one year agreement period (p=0.0000002), and two year agreement period (p=0.0000196).
When the two models are compared on AIC, the first model has the lower value (881.7286 versus 883.9549). This suggests that monthly charges works slightly better as a numerical variable than as a grouped variable. Therefore, subsequent modelling will be carried out with the first model.
Based on results from other studies and analyses, the important variables that must be included in the model are agreement period, term and monthly charges. The reduced model was built with these three variables.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.8270051 | 0.2414150 | -7.567903 | 0.0000000 |
| Term | -0.0174912 | 0.0051167 | -3.418431 | 0.0006298 |
| agreement.periodOne year contract | -1.8564358 | 0.2919578 | -6.358575 | 0.0000000 |
| agreement.periodTwo year contract | -2.1559901 | 0.3681332 | -5.856549 | 0.0000000 |
| Monthly_Charges | 0.0266624 | 0.0034836 | 7.653632 | 0.0000000 |
The AIC of the reduced model is 896.8588. It has 4 coefficients besides the intercept, arising from 3 predictors. All of them are significant at the .05 level.
All the significant variables from the first model were added to the reduced model to build a fourth model. These are: Term, Technical support, Internet service, and agreement period. Since Term and agreement period were already present in the reduced model, just technical support and internet service were added from the first model.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.6846363 | 0.5267513 | -1.299734 | 0.1936923 |
| Term | -0.0176784 | 0.0052027 | -3.397891 | 0.0006791 |
| technical.supportYes | -0.7625008 | 0.2130108 | -3.579634 | 0.0003441 |
| internet.serviceDSL | -0.3204252 | 0.3014467 | -1.062958 | 0.2878010 |
| internet.serviceFiber optic | -0.3313126 | 0.2246921 | -1.474519 | 0.1403420 |
| internet.serviceNo Internet | -1.7390432 | 0.5341455 | -3.255748 | 0.0011309 |
| agreement.periodOne year contract | -1.5743930 | 0.2996931 | -5.253350 | 0.0000001 |
| agreement.periodTwo year contract | -1.6702441 | 0.3865429 | -4.320979 | 0.0000155 |
| Monthly_Charges | 0.0180282 | 0.0064770 | 2.783392 | 0.0053794 |
The AIC of the fourth model is 878.4816. It has 8 coefficients besides the intercept, arising from 5 predictors. The intercept and 2 dummy variables for internet service (fiber optic and DSL) were not significant at the .05 significance level.
The next step was to use an automatic variable selection procedure to find the best model.
This was done using the automatic variable selection function, step(), to search for the final model. Starting from the first model, variables were dropped using AIC as the inclusion/exclusion criterion.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.2265316 | 0.6170311 | -1.987795 | 0.0468343 |
| marital.statusSingle | 0.5806634 | 0.3403543 | 1.706055 | 0.0879977 |
| Term | -0.0181675 | 0.0052304 | -3.473470 | 0.0005138 |
| technical.supportYes | -0.7525514 | 0.2136209 | -3.522836 | 0.0004270 |
| internet.serviceDSL | -0.3298774 | 0.3039017 | -1.085474 | 0.2777118 |
| internet.serviceFiber optic | -0.3261646 | 0.2244343 | -1.453275 | 0.1461476 |
| internet.serviceNo Internet | -1.8273452 | 0.5366413 | -3.405152 | 0.0006613 |
| agreement.periodOne year contract | -1.5894411 | 0.3009554 | -5.281318 | 0.0000001 |
| agreement.periodTwo year contract | -1.6724833 | 0.3871723 | -4.319739 | 0.0000156 |
| Monthly_Charges | 0.0241694 | 0.0074372 | 3.249803 | 0.0011548 |
The best and final model is model A, which has the smallest AIC, 877.5205, and 9 coefficients besides the intercept (from 6 predictors). Marital status and 2 dummy variables, DSL and fiber optic, were not significant at the .05 significance level.
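The AIC comparison across the five candidate models can be summarised in a few lines. This is an illustrative Python sketch, not the code used in the report. The AIC values are those reported in the text, and the dictionary keys are just labels chosen here.

```python
# AIC values reported in the text for the five candidate models.
aic = {
    "first model": 881.7286,
    "second model": 883.9549,
    "reduced model": 896.8588,
    "fourth model": 878.4816,
    "model A": 877.5205,
}

# The preferred model is the one with the smallest AIC.
best = min(aic, key=aic.get)

# Difference in AIC between each model and the best one.
delta = {name: value - aic[best] for name, value in aic.items()}

for name in sorted(aic, key=aic.get):
    print(f"{name:14s} AIC={aic[name]:.4f}  delta={delta[name]:.4f}")
```

The gap between model A and the fourth model is under one AIC unit, so the preference between those two is slight.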
The summary table for the best model, model A, contains the important variables from the reduced model: Term, one year agreement period, two year agreement period and Monthly_Charges. All 4 terms are statistically significant at the 0.05 level (Term: p=0.0005138; one year agreement period: p=0.0000001; two year agreement period: p=0.0000156; Monthly_Charges: p=0.0011548). Term and the one year and two year agreement periods (each vs. a monthly contract) are negatively associated with the response variable, churn, while monthly charges is positively associated with it.
Fiber optic, DSL and No internet, when compared to the reference category, cable, all have negative coefficients. The estimated odds of churning for customers who use fiber optic (p=0.1461476) or DSL (p=0.2777118) are lower than for those who use cable, although neither difference is significant at the .05 level. Customers who do not have internet (p=0.0006613) have significantly lower odds of churning than those who use cable.
Single customers have higher estimated odds of churning than married customers, but this is not significant at the .05 level (p=0.0879977). One and two year agreement periods are negatively associated with churn: the odds of churning for a customer with a one year (p=0.0000001) or two year agreement period (p=0.0000156) are lower than for a customer on a monthly contract.
The odds of churning increase as monthly charges increase (p=0.0011548) and decrease as the term (time spent with the company in months) increases (p=0.0005138). Customers with technical support (p=0.0004270) have lower odds of churning than those without it.
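The direction and size of each effect can be read from the odds ratio, which is the exponential of the coefficient. The Python sketch below is illustrative and is not the code used in the report. It copies the model A coefficients from the summary table and assumes the modelled event is Churn = Yes, which is consistent with the sign pattern of the estimates.

```python
import math

# Model A coefficients (log-odds of churn) copied from the summary table.
coefficients = {
    "marital.statusSingle": 0.5806634,
    "Term": -0.0181675,
    "technical.supportYes": -0.7525514,
    "internet.serviceDSL": -0.3298774,
    "internet.serviceFiber optic": -0.3261646,
    "internet.serviceNo Internet": -1.8273452,
    "agreement.periodOne year contract": -1.5894411,
    "agreement.periodTwo year contract": -1.6724833,
    "Monthly_Charges": 0.0241694,
}

# exp(coefficient) is the multiplicative change in the odds of churn
# (assumption: the modelled event is Churn = Yes).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:36s} OR={ratio:.3f} ({direction} odds of churn)")
```

An odds ratio below 1 means lower odds of churn relative to the reference category, or per one-unit increase for a numeric predictor.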
The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.
The predicted response is compared to the original response. This is shown in the following table.
| Mar.status | Term | Internet.service | Tech.support | Agr.period | Month.charges | churn | Predicted |
|---|---|---|---|---|---|---|---|
| Married | 16 | Cable | Yes | Monthly contract | 98.05 | Yes | Yes |
| Married | 70 | Cable | Yes | One year contract | 75.25 | No | No |
| Married | 36 | Cable | Yes | Monthly contract | 73.35 | No | No |
| Married | 72 | Cable | Yes | One year contract | 112.60 | No | No |
| Married | 40 | Cable | No | Monthly contract | 95.05 | No | Yes |
| Single | 15 | No Internet | No | Monthly contract | 19.85 | No | No |
| Married | 1 | Cable | No | Monthly contract | 89.20 | No | Yes |
| Married | 36 | Cable | No | Monthly contract | 94.65 | Yes | Yes |
| Married | 5 | Cable | No | Monthly contract | 97.10 | Yes | Yes |
| Married | 57 | Cable | Yes | One year contract | 113.25 | No | No |
The predicted responses for the first 10 observations in the dataset are quite similar to the original responses: 8 of the 10 agree.
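To make the prediction step concrete, the sketch below computes a churn probability by hand from the model A coefficients. It is illustrative Python rather than the code used in the report. The function name `churn_probability` and the 0.5 cut-off are assumptions, as the text does not state which cut-off produced the table above.

```python
import math

# Model A coefficients (log-odds of churn), copied from the summary table.
INTERCEPT = -1.2265316
BETA = {
    "single": 0.5806634,
    "term": -0.0181675,
    "tech_support": -0.7525514,
    "DSL": -0.3298774,
    "Fiber optic": -0.3261646,
    "No Internet": -1.8273452,
    "One year contract": -1.5894411,
    "Two year contract": -1.6724833,
    "monthly_charges": 0.0241694,
}


def churn_probability(single, term, internet, tech_support, agreement, monthly):
    """Predicted churn probability; Cable and Monthly contract are baselines."""
    eta = INTERCEPT + BETA["term"] * term + BETA["monthly_charges"] * monthly
    eta += BETA["single"] * single + BETA["tech_support"] * tech_support
    # Baseline categories are not in BETA, so they contribute 0.
    eta += BETA.get(internet, 0.0) + BETA.get(agreement, 0.0)
    return 1.0 / (1.0 + math.exp(-eta))


# First and third customers from the table above.
p1 = churn_probability(False, 16, "Cable", True, "Monthly contract", 98.05)
p3 = churn_probability(False, 36, "Cable", True, "Monthly contract", 73.35)

CUTOFF = 0.5  # assumed cut-off; the report does not state the one used here
print(round(p1, 3), "Yes" if p1 > CUTOFF else "No")
print(round(p3, 3), "Yes" if p3 > CUTOFF else "No")
```

Under these assumptions the first customer is classified as Yes and the third as No, which agrees with the Predicted column.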
The following tables show the frequency of the predicted response variable (first table) and of the original response variable (second table). In the original variable, 74.1% of customers are still with the company, while the predicted response variable gives this as 77.7%. The two marginal distributions are close, which is consistent with an acceptable model. Per-customer accuracy is assessed later using cross-validation and a holdout test set.
| Var1 | Freq |
|---|---|
| No | 777 |
| Yes | 223 |
| Var1 | Freq |
|---|---|
| No | 741 |
| Yes | 259 |
A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:
| Mar.status | Term | Tech.support | Internet.service | Agr.period | Monthly.charges | Predicted |
|---|---|---|---|---|---|---|
| Married | 38 | Yes | Cable | Monthly contract | 100.5 | No |
| Single | 50 | No | No Internet | Monthly contract | 87.6 | No |
| Single | 14 | No | No Internet | One year contract | 110.5 | No |
| Single | 4 | Yes | Cable | Monthly contract | 90.0 | Yes |
At the end of the logistic regression modelling, only the variables used in the final model were retained in the dataset. They are: marital status, term, internet service, technical support, agreement period, monthly charges and the response variable, churn.
Since the sample size is large, the data was split randomly 70%:30%, with 70% of the data for training and validating models and 30% for testing. The total number of observations was 1000. The training data set had 700 observations and the testing data set had 300.
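A random 70%:30% split of the row indices can be sketched as follows. This is illustrative Python, not the code used in the report. The function name `train_test_split` and the seed value are assumptions made here for reproducibility.

```python
import random


def train_test_split(n_rows, train_fraction=0.7, seed=123):
    """Randomly partition row indices into training and testing sets."""
    indices = list(range(n_rows))
    rng = random.Random(seed)  # seed is arbitrary, fixed for reproducibility
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split(1000)
print(len(train_idx), len(test_idx))
```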
The three best models built in the previous section were used: ModelA, the reduced model and the fourth model. The best model was selected using cross-validation, conducted on all three models, and the ROC curve. For each model, a sequence of 20 candidate cut-off probabilities was evaluated with 5-fold cross-validation to identify the optimal cut-off probability for the final detection model.
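The cut-off search can be sketched as below. This Python sketch covers only the selection step: it assumes that, for each fold, a model has already been fitted on the other folds and has produced predicted probabilities for the held-out fold. The fold data are hypothetical and the evenly spaced grid of 20 cut-offs is an assumption, since the exact grid is not stated.

```python
def accuracy(probs, labels, cutoff):
    """Share of observations whose thresholded prediction matches the label."""
    hits = sum((p > cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def best_cutoff(folds, n_cutoffs=20):
    """Pick the cut-off with the highest accuracy averaged over the folds.

    `folds` is a list of (probabilities, labels) pairs, one per validation
    fold, produced by a model fitted on the remaining folds.
    """
    # Assumed grid: n_cutoffs evenly spaced values strictly inside (0, 1).
    cutoffs = [(i + 1) / (n_cutoffs + 1) for i in range(n_cutoffs)]
    mean_acc = [
        sum(accuracy(p, y, c) for p, y in folds) / len(folds) for c in cutoffs
    ]
    best = max(range(n_cutoffs), key=mean_acc.__getitem__)
    return cutoffs[best], mean_acc[best]


# Hypothetical validation predictions for two tiny folds (illustration only).
folds = [
    ([0.10, 0.35, 0.62, 0.80], [0, 0, 1, 1]),
    ([0.20, 0.45, 0.55, 0.90], [0, 0, 1, 1]),
]
cutoff, acc = best_cutoff(folds)
print(round(cutoff, 3), acc)
```

Averaging over folds keeps the chosen cut-off from depending on a single validation subset.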
5-fold CV performance plot
ModelA was the model generated from automatic variable selection. The above figure indicates that the optimal cut-off probability that yields the best accuracy for ModelA is 0.52.
The model was fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:
| test.accuracy |
|---|
| 0.79 |
The accuracy was found to be 79%. Provided this is in line with the cross-validated accuracy from the training data, it suggests that the model is neither under-fitting nor over-fitting.
5-fold CV performance plot
The reduced model was made up of term, agreement period and monthly charges. The above figure indicates that the optimal cut-off probability that yields the best accuracy for this model is 0.57.
The regression coefficients obtained by fitting the reduced model on the training data were used to obtain the accuracy on the test data. The results are as follows:
| test.accuracy3 |
|---|
| 0.7833333 |
The fourth model was made up of term, technical support, internet
service, agreement period and monthly charges. The above figure
indicates that the optimal cut-off probability that yields the best
accuracy for this model is 0.52.
| test.accuracy2 |
|---|
| 0.79 |
The ROC curve is the plot of the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR, 1 - specificity), calculated at each decision boundary, here the cut-off probability. To create an ROC curve for each model, a sequence of decision thresholds is needed. The interval (0, 1) was split into 20 subintervals, and the sensitivity and specificity of each model were calculated at each of these cut-offs. Below is the plot of the ROC curve (1 - specificity, sensitivity).
The above ROC curves plot pairs of the true positive rate vs. the false positive rate at each decision threshold for the three models: modelA, the reduced model and the fourth model. The true positive rate is the proportion of actual positives that are predicted to be positive, while the false positive rate is the proportion of actual negatives that are predicted to be positive.
From the plot, it can be seen that modelA has the highest AUC, 0.8542, which indicates good discrimination between customers who churn and those who do not. Its curve also lies close to the top-left corner, which is a further indication of good performance.
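The construction of an ROC curve from a grid of cut-offs, and of the AUC by the trapezoidal rule, can be sketched as follows. This is illustrative Python, not the code used in the report. The predicted probabilities and labels are hypothetical, and the function names `roc_points` and `auc` are chosen here.

```python
def roc_points(probs, labels, n_intervals=20):
    """(FPR, TPR) pairs at evenly spaced cut-offs from 0 to 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for k in range(n_intervals + 1):
        cutoff = k / n_intervals
        # Predict "churn" when the probability is at least the cut-off.
        tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    points.append((0.0, 0.0))  # close the curve at the strictest threshold
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities and true labels (1 = churn).
probs = [0.05, 0.20, 0.30, 0.45, 0.60, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(round(auc(roc_points(probs, labels)), 4))
```

With only 20 subintervals the curve is a piecewise-linear approximation, so the resulting AUC can differ slightly from one computed over every distinct threshold.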
Of the 3 models evaluated above using cross-validation and KPI measures, ModelA had the best overall performance: the highest AUC and a test accuracy of 79%, equal to that of the fourth model and slightly above the reduced model's 78.3%. It fits the data well. Below is the table showing the local performance metrics:
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.4473684 | 0.90625 | 0.6181818 | 0.4473684 | 0.519084 |
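As a check on the table above, the F1 score follows from the reported precision and recall using the standard harmonic-mean definition:

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2(0.6181818)(0.4473684)}{0.6181818 + 0.4473684} \approx 0.5191 \]

Sensitivity and recall are the same quantity, which is why those two columns coincide.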
Below is the optimal cut-off probability for the final model.
5-fold CV performance plot
Neural network models require all feature variables to be in numeric form. Categorical variables were converted to dummy variables and numerical variables were scaled.
First, all 3 numeric variables to be included in this model were scaled according to the following formula: \[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var})-\min(\text{orig.var})} \]
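The scaling formula can be sketched as a short function. This is illustrative Python, not the code used in the report, and the function name `min_max_scale` is chosen here.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Term takes values from 0 to 72 months in this data set.
scaled = min_max_scale([0, 18, 36, 72])
print(scaled)
```

This sketch assumes the variable is not constant, since otherwise the denominator would be zero.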
All feature variables were extracted and the categorical variables converted to dummy variables using model.matrix(). They are:
## [1] "(Intercept)" "Marital_StatusSingle"
## [3] "Term" "Internet_serviceDSL"
## [5] "Internet_serviceFiber optic" "Internet_serviceNo Internet"
## [7] "tech_supportYes" "stream_videosYes"
## [9] "Agreement_periodOne year contract" "Agreement_periodTwo year contract"
## [11] "Monthly_Charges" "Total_Charges"
## [13] "grp.IPYes" "ChurnYes"
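The conversion of a categorical variable into dummy variables can be sketched as follows. This is illustrative Python showing treatment coding with a reference level, and it is not a reimplementation of model.matrix(). The function name `dummy_code` is hypothetical, and Cable is taken as the reference level because it has no column in the output above.

```python
def dummy_code(values, levels):
    """Treatment-code a categorical column; levels[0] is the reference level."""
    columns = {level: [] for level in levels[1:]}
    for value in values:
        for level in levels[1:]:
            columns[level].append(1 if value == level else 0)
    return columns


# Internet service of the first six customers in the data preview.
internet = ["Cable", "Cable", "Cable", "Cable", "Cable", "No Internet"]
levels = ["Cable", "DSL", "Fiber optic", "No Internet"]
dummies = dummy_code(internet, levels)
for level, column in dummies.items():
    print(f"Internet_service{level}: {column}")
```

A variable with four levels yields three dummy columns; a customer in the reference level has zeros in all three.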
Then the variables were renamed using simpler names.
## [1] "(Intercept)" "maritalstat" "Term" "DSL"
## [5] "fiber" "nointernet" "techsupport" "streamvideos"
## [9] "oneyear" "twoyears" "monthlycharges" "totalcharges"
## [13] "IP" "Churn"
The following is the model formula:
## Churn ~ maritalstat + Term + DSL + fiber + nointernet + techsupport +
## streamvideos + oneyear + twoyears + monthlycharges + totalcharges +
## IP
This follows the usual steps for building a neural network model to predict customer churn. First, the data was split into two: Training and testing data. Cross-validation was done with the training data and the model tested with the testing data.
The data was split into 70% for training the neural network and 30% for testing. There were 700 observations in the training dataset and 300 observations in the testing data.
Below is the neural network model obtained using the training data:
| Quantity | Value |
|---|---|
| error | 47.3790324 |
| reached.threshold | 0.0096885 |
| steps | 1175.0000000 |
| Intercept.to.1layhid1 | -1.1157300 |
| maritalstat.to.1layhid1 | 0.4172192 |
| Term.to.1layhid1 | -6.3052454 |
| DSL.to.1layhid1 | -0.2939402 |
| fiber.to.1layhid1 | -0.2356842 |
| nointernet.to.1layhid1 | -2.8342897 |
| techsupport.to.1layhid1 | -0.8026092 |
| streamvideos.to.1layhid1 | 0.3203401 |
| oneyear.to.1layhid1 | -1.6723427 |
| twoyears.to.1layhid1 | -2.4868972 |
| monthlycharges.to.1layhid1 | 1.0020602 |
| totalcharges.to.1layhid1 | 6.2390948 |
| IP.to.1layhid1 | 0.0823311 |
| Intercept.to.Churn | 0.0433270 |
| 1layhid1.to.Churn | 1.6424948 |
Single-layer backpropagation Neural network model for Customer Churn
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.5047 | 0.6383 | -0.7907 | 0.4291 |
| Marital_StatusSingle | 0.5372 | 0.3586 | 1.498 | 0.1341 |
| Term | -2.204 | 0.9762 | -2.258 | 0.02397 |
| Internet_serviceDSL | -0.417 | 0.3359 | -1.241 | 0.2144 |
| Internet_serviceFiber optic | -0.3456 | 0.2248 | -1.537 | 0.1242 |
| Internet_serviceNo Internet | -2.086 | 0.6082 | -3.43 | 0.0006029 |
| tech_supportYes | -0.7406 | 0.2139 | -3.462 | 0.0005352 |
| stream_videosYes | 0.06015 | 0.2523 | 0.2384 | 0.8116 |
| Agreement_periodOne year contract | -1.594 | 0.3026 | -5.266 | 1.395e-07 |
| Agreement_periodTwo year contract | -1.706 | 0.3929 | -4.341 | 1.419e-05 |
| Monthly_Charges | 1.683 | 1.055 | 1.596 | 0.1105 |
| Total_Charges | 1.486 | 1.303 | 1.141 | 0.254 |
| grp.IPYes | 0.1558 | 0.2009 | 0.7757 | 0.4379 |
The cross validation in the neural network was carried out with the
training dataset.
The optimal cut off probability obtained for the neural network model
was 0.43.
The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.
| | 0 | 1 |
|---|---|---|
| FALSE | 186 | 33 |
| TRUE | 32 | 49 |
| x |
|---|
| 0.7833333 |
The accuracy was found to be 78.33%.
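The accuracy and related metrics can be recomputed from the confusion matrix above. This is an illustrative Python sketch, not the code used in the report. It assumes that the rows are the predicted class (FALSE/TRUE) and the columns are the actual class (0 = stayed, 1 = churned), which the table does not state explicitly.

```python
# Confusion matrix from the table above. Assumption: rows are the predicted
# class (FALSE/TRUE) and columns are the actual class (0 = stayed, 1 = churned).
tn, fn = 186, 33   # predicted FALSE
fp, tp = 32, 49    # predicted TRUE

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall: churners correctly flagged
specificity = tn / (tn + fp)   # non-churners correctly passed
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f} F1={f1:.4f}")
```

The accuracy does not depend on the row/column assumption, because it uses only the diagonal counts and the total.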
The ROC curve is the plot of sensitivity against 1 - specificity calculated from the confusion matrix based on a sequence of selected cut-off scores.
An ROC curve is shown for the above neural network model based on the training data set.
ROC Curve of the neural network model
The above ROC curve indicates that the neural network performs better than random guessing, since the area under the curve is well above 0.5. Also, since the AUC is greater than 0.65, the neural network model is acceptable.
Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.
The neural network model had a test accuracy of 78.33% and an AUC of 0.8587, while the final logistic regression model had a test accuracy of 79% and an AUC of 0.8542. Both models are acceptable, and with such similar values it is difficult to determine which model is better based solely on this information.