1 Introduction

This data set contains customer-level information for a telecommunications company. Each record describes one customer and the services that customer has used.

The telecommunications sector is growing quickly, and service providers are increasingly focused on growing their subscriber bases. In such a competitive industry, retaining current customers has become one of the biggest challenges. Acquiring a new customer is generally considered to be significantly more expensive than keeping an existing one.

Therefore, it is crucial for telecommunications companies to employ advanced analytics to understand customer behavior and to predict whether or not customers are going to leave the business.

1.1 Objectives of this study

The following are the questions and objectives that describe and explain the main purpose of this project.

  1. To predict which customers are more likely to churn.
  2. What percentage of the company's customers have churned?
  3. Are there any notable patterns in terms of customer churn based on gender and marital status?
  4. Are there any notable patterns in terms of customer churn based on the amount spent by the customer and type of service provided?
  5. Which services are the most profitable?

1.2 Response variable

The response variable is Churn, a binary variable with two values, yes and no. The value yes means that the customer left the company, while no means that the customer is still active.

1.3 Model

A logistic regression model is used to analyse the relationship between the binary response variable, Churn, and the predictor variables. The model describes how the predictor variables affect the probability of observing the larger value of the response variable, that is, Churn = yes.

2 Description of the Data

The total number of records in this data set is 1000. It consists of 14 variables including the response variable with the name Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:

Sex: Sex of the customer - categorical variable

Marital_Status: Marital status of the customer - categorical variable

Term: Term (in months) - numerical variable

Phone_service: Phone service - categorical variable

International_plan: International plan - categorical variable

Voice_mail_plan: Voice mail plan - categorical variable

Multiple_line: Multiple line - categorical variable

Internet_service: Internet service - categorical variable

Technical_support: Technical support - categorical variable

Streaming_Videos: Streaming videos - categorical variable

Agreement_period: Agreement period - categorical variable

Monthly_Charges: Monthly charges - numerical variable

Total_Charges: Total charges - numerical variable

Churn: Churn (Yes or No) - categorical response variable

A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt

##      Sex Marital_Status Term Phone_service International_plan Voice_mail_plan
## 1 Female        Married   16           Yes                Yes             Yes
## 2   Male        Married   70           Yes                 No             Yes
## 3 Female        Married   36           Yes                 No             Yes
## 4 Female        Married   72           Yes                 No              No
## 5 Female        Married   40           Yes                Yes              No
## 6 Female         Single   15           Yes                Yes             Yes
##   Multiple_line Internet_service Technical_support Streaming_Videos
## 1            No            Cable               Yes               No
## 2            No            Cable               Yes              Yes
## 3            No            Cable               Yes              Yes
## 4           Yes            Cable               Yes              Yes
## 5           Yes            Cable                No              Yes
## 6            No      No Internet      No internet      No internet 
##    Agreement_period Monthly_Charges Total_Charges Churn
## 1  Monthly contract           98.05       1410.25   Yes
## 2 One year contract           75.25       5023.00    No
## 3  Monthly contract           73.35       2379.10    No
## 4 One year contract          112.60       7882.25    No
## 5  Monthly contract           95.05       3646.80    No
## 6  Monthly contract           19.85        255.35    No

3 EDA for Feature Engineering

The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. The results were as follows:

##      Sex            Marital_Status          Term      Phone_service     
##  Length:1000        Length:1000        Min.   : 0.0   Length:1000       
##  Class :character   Class :character   1st Qu.: 8.0   Class :character  
##  Mode  :character   Mode  :character   Median :30.0   Mode  :character  
##                                        Mean   :32.8                     
##                                        3rd Qu.:57.0                     
##                                        Max.   :72.0                     
##  International_plan Voice_mail_plan    Multiple_line      Internet_service  
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Technical_support  Streaming_Videos   Agreement_period   Monthly_Charges 
##  Length:1000        Length:1000        Length:1000        Min.   : 18.95  
##  Class :character   Class :character   Class :character   1st Qu.: 40.16  
##  Mode  :character   Mode  :character   Mode  :character   Median : 74.72  
##                                                           Mean   : 66.64  
##                                                           3rd Qu.: 90.88  
##                                                           Max.   :116.25  
##  Total_Charges       Churn          
##  Min.   :   5.0   Length:1000       
##  1st Qu.: 334.8   Class :character  
##  Median :1442.3   Mode  :character  
##  Mean   :2351.6                     
##  3rd Qu.:4016.8                     
##  Max.   :8476.5

3.1 Missing Values

The above summary table indicates that there are no missing values in any of the variables.

3.2 Assess Distributions

Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.

3.2.1 Categorical variables

The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.

From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.

44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use fiber optic and 20.8% have no internet service. The majority are on a monthly contract. In this data, 74.1% of the customers had not left the company.

3.2.2 Regrouping of categorical variables

One of the categorical variables, International plan, had 3 groups: No, Yes and yes. The lowercase yes is an input error made when the data was collected. The table below shows the frequency of these groups.

Groups  Freq
No       429
yes      262
Yes      309

In order to rectify this, a new variable called grp.IP was created that contains only 2 distinct groups of the International plan variable: No and Yes.

Technical support and Streaming videos each had 3 groups: Yes, No and No internet. For both variables, No and No internet were combined into a single group because they are close in meaning.

After regrouping, 57.1% of the customers had an international plan. About a third of the customers had technical support and 41% had video streaming.
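The regrouping described above can be sketched in a few lines. This is only an illustration in Python using the frequencies reported in the table; the helper name `to_yes_no` is hypothetical and is not taken from the original analysis.

```python
from collections import Counter

# Raw International_plan frequencies as reported in the table above.
raw_counts = {"No": 429, "yes": 262, "Yes": 309}

def to_yes_no(label):
    """Hypothetical helper: any spelling of 'yes' becomes 'Yes';
    everything else (including 'No internet') becomes 'No'."""
    return "Yes" if label.strip().lower() == "yes" else "No"

# Rebuild the two-group variable (called grp.IP in the report).
grp_ip = Counter()
for label, freq in raw_counts.items():
    grp_ip[to_yes_no(label)] += freq

# Share of customers with an international plan after regrouping.
share_yes = grp_ip["Yes"] / sum(grp_ip.values())
print(dict(grp_ip), round(share_yes, 3))
```

The same helper also collapses No internet into No, which is the rule applied to Technical support and Streaming videos.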

3.3 Numerical Variables

There are 3 numerical variables: Term, Monthly charges and Total charges. Their distributions are as follows:

The histogram of Term shows a non-symmetric pattern, with the highest frequency between 0 and 5 months and the lowest between 35 and 40 months.

This is quite different from the distribution of Total charges, which is right skewed, so that the mean is greater than the median. Here, the highest frequency is between 0 and 1000 and the lowest is between 8000 and 9000. The distribution has a step-wise pattern: smaller amounts have higher frequencies and larger amounts have lower frequencies.

The distribution of Monthly charges is shown as a density plot. It is bimodal, with the first peak at approximately 20 and the second, higher peak at approximately 90. The lowest point between the two peaks is at approximately 40.

3.4 Discretizing Continuous Variables

From the above density plot of monthly charges, it can be seen that the distribution is bimodal, with peaks in the ranges 20 to 30 and 80 to 90. Therefore, this variable was discretized for use in later models and algorithms. Monthly charges range from 18.95 to 116.25 and were grouped as follows:

less than 30: low charges

30 to 80: moderate charges

greater than 80: high charges

The following table shows the frequency of the grouped variable, grp.month:

grp.month  Freq
High        406
Low         223
Moderate    371
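As an illustration of the grouping rule above, the following Python sketch assigns a group to each monthly charge. The function name is hypothetical, and the treatment of values exactly equal to 30 or 80 (placed in the moderate group) is an assumption based on the wording "30 to 80".

```python
def group_monthly_charge(charge):
    """Hypothetical helper implementing the three groups described above."""
    if charge < 30:
        return "Low"
    if charge > 80:
        return "High"
    return "Moderate"  # assumption: 30 and 80 themselves count as moderate

# Monthly charges of the first six customers shown in Section 2.
sample = [98.05, 75.25, 73.35, 112.60, 95.05, 19.85]
groups = [group_monthly_charge(c) for c in sample]
print(groups)
```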

3.5 Pairwise Associations

Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.

3.5.1 Two categorical variables

This was done to determine whether the binary response variable, Churn, is independent of each categorical variable. Categorical variables found to be independent of the response variable are excluded from the subsequent models and algorithms. Mosaic plots are a convenient way to show whether two categorical variables are dependent. When they are independent, all proportions are the same and the boxes line up in a grid.

From the above mosaic plots, it can be seen that sex, phone service, voice mail plan and multiple line appear to be independent of the response variable, Churn, because the proportion of churn cases is nearly identical across the categories of each of these variables. Churn is not independent of marital status and International plan. The other mosaic plots are shown below:

In addition to marital status and International plan, Agreement period, Internet service, monthly charges (grouped), technical support and streaming videos are not independent of the response variable, Churn.

3.5.2 Pearson Chi-Square Test

A Pearson chi-square test was carried out to test whether Sex, Phone service, Voice mail plan and Multiple line are each independent of the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. The chi-square p-values for each variable are given below:

Variable         Chi-square p-value
Sex              0.1248683
Phone service    0.3680155
Voice mail plan  0.6651237
Multiple line    0.3384263
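To make the test concrete, here is a minimal Python sketch of the Pearson chi-square statistic for a 2 x 2 table. The cell counts are hypothetical (only the margins, 514 male customers and 259 churned customers, come from this report), so the result will not reproduce the p-values above. The sketch also omits the continuity correction that R's `chisq.test()` applies to 2 x 2 tables by default.

```python
import math

def pearson_chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2 x 2 table, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = Male, Female; columns = Churn yes, Churn no.
hypothetical = [[120, 394], [139, 347]]
stat, p = pearson_chi_square_2x2(hypothetical)
print(stat, p)
```

A p-value above 0.05, as reported for all four variables, means the hypothesis of independence is not rejected.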

3.5.3 Two Numerical Variables

A pair-wise scatter plot was used to assess the linear association between each pair of numeric variables.

The off-diagonal plots and numbers show the correlation between each pair of numeric variables. Total charges and Term are strongly correlated, while Total charges and Monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between Monthly charges and Term.

The stacked density curves on the main diagonal show the distribution of each numeric variable in the Churn and non-Churn groups, and therefore show the relationship between the numeric variables and the categorical response. The curves do not completely overlap, which indicates some association between each of these numeric variables and the binary response variable, Churn.

Because the density curves already describe the relationship between the numeric variables and the binary Churn variable, no separate subsection is needed for the case of one numeric and one categorical variable.

3.6 Conclusion

Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voice mail plan and Multiple line were dropped because they appear to be independent of the response variable, Churn.

International plan was also dropped and the new variable, grp.IP, was kept instead. grp.month was also kept in the dataset as an alternative to its numerical counterpart, Monthly charges. The number of variables in the final dataset was 11.

The following variables will be used for subsequent modelling: Marital_Status, Term, Internet_service, tech_support, stream_videos, Agreement_period, Monthly_Charges, grp.month, Total_Charges, grp.IP and Churn.

4 Logistic Predictive Modelling

4.1 Assumptions

In building a logistic model for this analysis, it is necessary to make sure that all assumptions are satisfied. The following are the assumptions of a logistic model:

  1. The response variable must be binary. This is true for this data. The values for the response variable, churn, are yes and no.

  2. The predictor variables are assumed not to be highly correlated with each other. Because the primary aim of this analysis is to predict which customers are more likely to churn, rather than to isolate the role of each predictor variable, no step was taken to reduce multicollinearity.

  3. The functional form of the predictor variables is correctly specified.

4.2 Model building

Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are Marital status, Streaming videos, technical support, and international plan. Agreement period and grp.month have 3 groups each while Internet service has 4 groups. All the character variables were changed to factors with different levels.

The numeric variables are Term, Monthly charges and Total charges. In total, there are 9 predictor variables (each model can only contain either the continuous monthly charges or the grouped variable).

The predictors for our model are:

Marital.status: Marital status of the customer - factor with 2 levels

Term: Term (Displayed in months) - Numerical variable

International.plan: International plan - factor with 2 levels

Internet.service: Internet service - factor with 4 levels

technical.support: Technical support - factor with 2 levels

streaming.videos: Streaming Videos - factor with 2 levels

Agreement_period: Agreement period - factor with 3 levels

Monthly_Charges: Monthly Charges - Numerical variable OR grpd.month : Monthly charges - factor with 3 levels

Total_Charges: Total charges - numerical variable

First, a logistic regression model containing all predictor variables, with monthly charges as a numeric variable, was built. This is called the first model.

Significance Tests for the First Model
                                      Estimate  Std. Error     z value   Pr(>|z|)
(Intercept)                         -0.8334170   0.8265047  -1.0083633  0.3132801
marital.statusSingle                 0.5372165   0.3585663   1.4982347  0.1340723
Term                                -0.0306097   0.0135583  -2.2576403  0.0239681
technical.supportYes                -0.7406359   0.2139030  -3.4624854  0.0005352
internet.serviceDSL                 -0.4170019   0.3358900  -1.2414836  0.2144272
internet.serviceFiber optic         -0.3456259   0.2248026  -1.5374642  0.1241797
internet.serviceNo Internet         -2.0864226   0.6082339  -3.4302965  0.0006029
streaming.videosYes                  0.0601511   0.2523427   0.2383705  0.8115937
agreement.periodOne year contract   -1.5935134   0.3026089  -5.2659178  0.0000001
agreement.periodTwo year contract   -1.7055952   0.3929081  -4.3409525  0.0000142
International.planYes                0.1558211   0.2008870   0.7756652  0.4379467
Monthly_Charges                      0.0173013   0.0108412   1.5958830  0.1105149
Total_Charges                        0.0001754   0.0001538   1.1406038  0.2540349

The AIC of the first model is 881.7286. The model has 12 coefficients in addition to the intercept, because each factor is expanded into dummy variables. The following were significant at the .05 level: Term (p=0.0239681), Technical support (p=0.0005352), Internet service-no internet (p=0.0006029), one year agreement period (p=0.0000001), and two year agreement period (p=0.0000142).

Then another model containing all the predictors, but this time with monthly charges as a factor, was built.

Significance Tests for the Second Model
                                      Estimate  Std. Error     z value   Pr(>|z|)
(Intercept)                          0.5920773   0.3527946   1.6782492  0.0932985
marital.statusSingle                 0.1919100   0.3146996   0.6098196  0.5419813
Term                                -0.0336493   0.0131886  -2.5513906  0.0107294
technical.supportYes                -0.7098222   0.2122457  -3.3443414  0.0008248
internet.serviceDSL                 -0.6291153   0.2958855  -2.1262121  0.0334856
internet.serviceFiber optic         -0.3212021   0.2230693  -1.4399211  0.1498897
internet.serviceNo Internet         -3.0963883   0.7207580  -4.2960167  0.0000174
streaming.videosYes                  0.1840132   0.2325419   0.7913119  0.4287620
agreement.periodOne year contract   -1.5746912   0.3024023  -5.2072730  0.0000002
agreement.periodTwo year contract   -1.6864130   0.3949791  -4.2696260  0.0000196
International.planYes                0.2394340   0.2038002   1.1748465  0.2400561
grpd.monthLow                        0.1892848   0.6708755   0.2821459  0.7778316
grpd.monthModerate                  -0.3335832   0.2777887  -1.2008523  0.2298085
Total_Charges                        0.0002338   0.0001424   1.6412233  0.1007511

The AIC of the second model is 883.9549. The model has 13 coefficients in addition to the intercept. Here, the following were significant at the .05 level: Term (p=0.0107294), Technical support (p=0.0008248), Internet service-DSL (p=0.0334856), Internet service-no internet (p=0.0000174), one year agreement period (p=0.0000002), and two year agreement period (p=0.0000196).

The first model has the lower AIC (881.7286 compared to 883.9549). This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling was carried out with the first model.

Based on results from other studies and analyses, agreement period, term and monthly charges are important variables that must be included in the model. The reduced model was built with these three variables.

Significance Tests for Reduced Model
                                      Estimate  Std. Error    z value   Pr(>|z|)
(Intercept)                         -1.8270051   0.2414150  -7.567903  0.0000000
Term                                -0.0174912   0.0051167  -3.418431  0.0006298
agreement.periodOne year contract   -1.8564358   0.2919578  -6.358575  0.0000000
agreement.periodTwo year contract   -2.1559901   0.3681332  -5.856549  0.0000000
Monthly_Charges                      0.0266624   0.0034836   7.653632  0.0000000

The AIC of the reduced model is 896.8588. The model has 4 coefficients in addition to the intercept (3 predictors). All of them are significant at the .05 level.

The variables with significant coefficients in the first model (Term, Technical support, Internet service and Agreement period) were then combined with the reduced model to build a fourth model. Since Term and Agreement period were already present in the reduced model, only Technical support and Internet service were added.

Significance Tests for Fourth Model
                                      Estimate  Std. Error    z value   Pr(>|z|)
(Intercept)                         -0.6846363   0.5267513  -1.299734  0.1936923
Term                                -0.0176784   0.0052027  -3.397891  0.0006791
technical.supportYes                -0.7625008   0.2130108  -3.579634  0.0003441
internet.serviceDSL                 -0.3204252   0.3014467  -1.062958  0.2878010
internet.serviceFiber optic         -0.3313126   0.2246921  -1.474519  0.1403420
internet.serviceNo Internet         -1.7390432   0.5341455  -3.255748  0.0011309
agreement.periodOne year contract   -1.5743930   0.2996931  -5.253350  0.0000001
agreement.periodTwo year contract   -1.6702441   0.3865429  -4.320979  0.0000155
Monthly_Charges                      0.0180282   0.0064770   2.783392  0.0053794

The AIC of the fourth model is 878.4816. The model has 8 coefficients in addition to the intercept (5 predictors). The intercept and 2 dummy variables of Internet service (fiber optic and DSL) were not significant at the .05 significance level.

The next step is to use an automatic variable selection procedure to find the best model.

4.2.1 Automatic Variable Selection

The automatic variable selection function, step(), was used to search for the final model. Starting from the first model, variables were dropped using AIC as the inclusion/exclusion criterion.

Summary Table of Significance Tests for the Final Model - Model A
                                      Estimate  Std. Error    z value   Pr(>|z|)
(Intercept)                         -1.2265316   0.6170311  -1.987795  0.0468343
marital.statusSingle                 0.5806634   0.3403543   1.706055  0.0879977
Term                                -0.0181675   0.0052304  -3.473470  0.0005138
technical.supportYes                -0.7525514   0.2136209  -3.522836  0.0004270
internet.serviceDSL                 -0.3298774   0.3039017  -1.085474  0.2777118
internet.serviceFiber optic         -0.3261646   0.2244343  -1.453275  0.1461476
internet.serviceNo Internet         -1.8273452   0.5366413  -3.405152  0.0006613
agreement.periodOne year contract   -1.5894411   0.3009554  -5.281318  0.0000001
agreement.periodTwo year contract   -1.6724833   0.3871723  -4.319739  0.0000156
Monthly_Charges                      0.0241694   0.0074372   3.249803  0.0011548

The best and final model is model A, which has the smallest AIC, 877.5205, and 9 coefficients in addition to the intercept (6 predictors). Marital status and 2 dummy variables of Internet service, DSL and fiber optic, were not significant at the .05 significance level.
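The model comparison above can be summarised with a short Python sketch. The AIC values are those reported in this section; the parameter counts (coefficients plus intercept) are read off the coefficient tables. The implied deviances rely on an assumption: that AIC was computed as residual deviance + 2k, which holds for a logistic regression fitted to an ungrouped 0/1 response.

```python
# name: (reported AIC, number of estimated parameters including the intercept)
models = {
    "first model": (881.7286, 13),
    "second model": (883.9549, 14),
    "reduced model": (896.8588, 5),
    "fourth model": (878.4816, 9),
    "model A": (877.5205, 10),
}

# The preferred model is the one with the smallest AIC.
best = min(models, key=lambda name: models[name][0])

# Assumption: AIC = deviance + 2k, so the deviance can be recovered.
deviance = {name: aic - 2 * k for name, (aic, k) in models.items()}

for name, (aic, k) in sorted(models.items(), key=lambda item: item[1][0]):
    print(f"{name:14s} AIC={aic:9.4f}  k={k:2d}  implied deviance={deviance[name]:9.4f}")
print("smallest AIC:", best)
```

Under that assumption, the first model has the smallest deviance, as expected for the largest of the nested models with numeric monthly charges, but model A achieves a lower AIC because it uses fewer parameters.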

4.3 Interpretation - Association Analysis

The summary table for the best model, model A, contains the key variables of the reduced model: Term, one year agreement period, two year agreement period and Monthly_Charges. All four are statistically significant at the 0.05 level (Term: p=0.0005138; one year agreement period: p=0.0000001; two year agreement period: p=0.0000156; Monthly_Charges: p=0.0011548). Because the model describes the odds that Churn takes the value yes, a negative coefficient means lower odds of churn. Term and the one year and two year agreement periods (each compared to a monthly contract) are negatively associated with churn, while monthly charges is positively associated with churn.

Fiber optic, DSL and No internet, each compared to the reference level, Cable, have negative coefficients. The estimated odds of churn for customers who use fiber optic (p=0.1461476) or DSL (p=0.2777118) are lower than for those who use Cable, but neither difference is significant at the .05 level. Customers who do not have internet (p=0.0006613) have significantly lower odds of churn than those who use Cable.

Single customers have higher estimated odds of churn than married customers, although this is not significant at the .05 level (p=0.0879977). Customers with a one year (p=0.0000001) or two year agreement period (p=0.0000156) have lower odds of churn than those whose agreement period is monthly.

The odds of churn increase as monthly charges increase (p=0.0011548) and decrease as the term (time spent with the company, in months) increases (p=0.0005138). Customers who have technical support (p=0.0004270) have lower odds of churn than those who do not.
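The direction and size of these associations are easier to read as odds ratios, obtained by exponentiating the model A coefficients. The Python sketch below uses the estimates from the model A summary table; the dictionary layout is only for illustration.

```python
import math

# Model A estimates (log-odds of Churn = yes), copied from the summary table.
model_a = {
    "marital.statusSingle": 0.5806634,
    "Term": -0.0181675,
    "technical.supportYes": -0.7525514,
    "internet.serviceDSL": -0.3298774,
    "internet.serviceFiber optic": -0.3261646,
    "internet.serviceNo Internet": -1.8273452,
    "agreement.periodOne year contract": -1.5894411,
    "agreement.periodTwo year contract": -1.6724833,
    "Monthly_Charges": 0.0241694,
}

# exp(coefficient) is the multiplicative change in the odds of churn.
odds_ratios = {term: math.exp(beta) for term, beta in model_a.items()}

for term, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{term:36s} odds ratio = {ratio:.3f} ({direction} odds of churn)")
```

An odds ratio below 1 (for example for the one year and two year contracts) corresponds to lower odds of churn relative to the reference level, and an odds ratio above 1 (Monthly_Charges, per unit of charge) to higher odds.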

4.4 Prediction Analysis

The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.

4.4.1 Predict already existing data

The predicted response is compared to the original response. This is shown in the following table.

Dataset with model predicted response
Mar.status  Term  Internet.service  Tech.support  Agr.period         Month.charges  churn  Predicted
Married       16  Cable             Yes           Monthly contract           98.05  Yes    Yes
Married       70  Cable             Yes           One year contract          75.25  No     No
Married       36  Cable             Yes           Monthly contract           73.35  No     No
Married       72  Cable             Yes           One year contract         112.60  No     No
Married       40  Cable             No            Monthly contract           95.05  No     Yes
Single        15  No Internet       No            Monthly contract           19.85  No     No
Married        1  Cable             No            Monthly contract           89.20  No     Yes
Married       36  Cable             No            Monthly contract           94.65  Yes    Yes
Married        5  Cable             No            Monthly contract           97.10  Yes    Yes
Married       57  Cable             Yes           One year contract         113.25  No     No

The predicted response agrees with the original response for 8 of the first 10 observations in the dataset.

The following tables show the frequency of the predicted response variable and of the original response variable. In the original variable, 74.1% of customers are still with the company, while the predicted response gives 77.7%. The two proportions are close, which is encouraging, although agreement in overall frequencies does not by itself measure predictive accuracy; accuracy is assessed in Section 5.

Frequency of Predicted Response Variable
Predicted  Freq
No          777
Yes         223

Frequency of Original Response Variable
Churn  Freq
No      741
Yes     259

4.4.2 Predict New Data

A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:

Predicted Values of New Data
Mar.status  Term  Tech.support  Internet.service  Agr.period         Monthly.charges  Predicted
Married       38  Yes           Cable             Monthly contract             100.5  No
Single        50  No            No Internet       Monthly contract              87.6  No
Single        14  No            No Internet       One year contract            110.5  No
Single         4  Yes           Cable             Monthly contract              90.0  Yes
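The predictions for the hypothetical customers can be reproduced by hand from the model A coefficients. The Python sketch below computes the linear predictor and applies the logistic function. Two things are assumptions: the cut-off of 0.5 used to turn a probability into Yes or No, and the reading of the third and fourth columns of the table as technical support and internet service respectively.

```python
import math

# Model A estimates; Cable, Monthly contract, Married and no technical
# support are the reference levels and contribute 0.
COEF = {
    "intercept": -1.2265316, "single": 0.5806634, "term": -0.0181675,
    "tech_support": -0.7525514, "DSL": -0.3298774, "Fiber optic": -0.3261646,
    "No Internet": -1.8273452, "One year contract": -1.5894411,
    "Two year contract": -1.6724833, "monthly": 0.0241694,
}

def churn_probability(marital, term, tech_support, internet, agreement, monthly):
    """Estimated probability that Churn = yes under model A."""
    eta = COEF["intercept"] + COEF["term"] * term + COEF["monthly"] * monthly
    if marital == "Single":
        eta += COEF["single"]
    if tech_support == "Yes":
        eta += COEF["tech_support"]
    eta += COEF.get(internet, 0.0)   # reference level "Cable" adds nothing
    eta += COEF.get(agreement, 0.0)  # reference level "Monthly contract" adds nothing
    return 1.0 / (1.0 + math.exp(-eta))

new_customers = [
    ("Married", 38, "Yes", "Cable", "Monthly contract", 100.5),
    ("Single", 50, "No", "No Internet", "Monthly contract", 87.6),
    ("Single", 14, "No", "No Internet", "One year contract", 110.5),
    ("Single", 4, "Yes", "Cable", "Monthly contract", 90.0),
]

CUTOFF = 0.5  # assumption: the report does not state the cut-off used here
predicted = ["Yes" if churn_probability(*row) > CUTOFF else "No" for row in new_customers]
print(predicted)
```

With these assumptions the sketch gives the same Yes/No labels as the table above.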

At the end of the logistic regression modelling, only the variables used in the final model were retained in the dataset. They are: marital status, term, internet service, technical support, agreement period, monthly charges and the response variable, churn.

5 Cross Validation and Performance Measures

5.1 Data Partition

Since the sample size is large, the data was split randomly 70%:30%, with 70% of the data used for training and validating models and 30% for testing. Of the 1000 observations, 700 were in the training data set and 300 in the testing data set.
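A random 70%:30% partition of this kind can be sketched as follows in Python. The function name and the seed are hypothetical; the seed is fixed only so that the split is reproducible.

```python
import random

def train_test_split(n_rows, train_fraction=0.7, seed=123):
    """Return shuffled row indices split into training and testing sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # hypothetical seed for reproducibility
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split(1000)
print(len(train_idx), len(test_idx))
```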

5.2 Finding Optimal Cut-off Probability

The three best models built in the previous section were used: ModelA, the reduced model and the fourth model. Each was evaluated with cross-validation and with its ROC curve, and the best model was then selected. For each model, a sequence of 20 candidate cut-off probabilities was evaluated with 5-fold cross-validation to identify the optimal cut-off probability.
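The core of the cut-off search is a scan over candidate probabilities, keeping the one with the best accuracy. The Python sketch below shows only that scan on a small set of made-up predicted probabilities; it does not refit the model on each fold as a full 5-fold cross-validation would, and the evenly spaced grid of 20 candidates is an assumption.

```python
def accuracy_at(cutoff, probs, labels):
    """Share of cases where (probability > cutoff) matches the 0/1 label."""
    correct = sum((p > cutoff) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

def best_cutoff(probs, labels, n_cutoffs=20):
    """Scan n_cutoffs evenly spaced candidates strictly inside (0, 1)."""
    candidates = [(i + 1) / (n_cutoffs + 1) for i in range(n_cutoffs)]
    return max(candidates, key=lambda c: accuracy_at(c, probs, labels))

# Made-up predicted probabilities and true labels (1 = churn), for illustration.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cutoff = best_cutoff(probs, labels)
print(cutoff, accuracy_at(cutoff, probs, labels))
```

In a cross-validated version, the accuracy at each candidate would be averaged over the validation folds before the maximum is taken.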

5.2.1 Cross Validation and Cut-off Probability for Model A

Figure: 5-fold CV performance plot

ModelA was the model generated from automatic variable selection. The above figure indicates that the optimal cut-off probability that yields the best accuracy for ModelA is 0.52.

5.2.1.1 Test Accuracy for ModelA

The model was fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:

Test Accuracy of ModelA
test.accuracy
0.79

The test accuracy was found to be 79%. On its own this figure gives no indication of serious under-fitting or over-fitting in this model.

5.2.2 Cross Validation and Optimal Cut-off Probability for the Reduced Model

Figure: 5-fold CV performance plot

The reduced model was made up of term, agreement period and monthly charges. The above figure indicates that the optimal cut-off probability that yields the best accuracy for this model is 0.57.

5.2.2.1 Test Accuracy for the Reduced Model

The regression coefficients obtained by fitting the reduced model on the training data were used to obtain the accuracy on the test data. The results are as follows:

Test Accuracy of the Reduced Model
test.accuracy3
0.7833333

5.2.3 Cross Validation and Optimal Cut-off Probability for the Fourth Model

The fourth model was made up of term, technical support, internet service, agreement period and monthly charges. The above figure indicates that the optimal cut-off probability that yields the best accuracy for this model is 0.52.

5.2.3.1 Test Accuracy for the fourth Model

Test Accuracy of the Fourth Model
test.accuracy2
0.79

The test accuracy of the fourth model was 79%, the same as for ModelA.

5.3 Global Measure: ROC and AUC

The ROC curve is the plot of the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR, 1 - specificity), calculated at each decision boundary, here the cut-off probability. To create an ROC curve for each model, a sequence of decision thresholds is needed, and the corresponding sensitivity and specificity are calculated for each threshold. The interval (0, 1) was split into 20 subintervals, and sensitivity and specificity were calculated at each of these cut-offs. Below is the plot of the ROC curves (1 - specificity, sensitivity).

The above ROC curves plot the true positive rate against the false positive rate at each of these decision thresholds for the three models: modelA, the reduced model and the fourth model. The true positive rate is the proportion of actual positives that are predicted to be positive, while the false positive rate is the proportion of actual negatives that are predicted to be positive.

From the plot, it can be seen that the AUC for modelA, 0.8542, is the highest of the three, which indicates good discrimination between customers who churn and those who do not. The curve also lies close to the top-left corner, which is consistent with good performance.
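The construction of an ROC curve from a grid of cut-offs, and the area under it, can be sketched as below in Python. The predicted probabilities are made up for illustration, the function names are hypothetical, and the trapezoidal rule is one common way to approximate the AUC; the report does not state how its AUC values were computed.

```python
def roc_points(probs, labels, n_intervals=20):
    """(FPR, TPR) pairs at cut-offs 0, 1/n, ..., 1, plus the origin."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0)}
    for i in range(n_intervals + 1):
        cutoff = i / n_intervals
        tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
        points.add((fp / negatives, tp / positives))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up, perfectly separated example: every churner scores above every non-churner.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
auc = auc_trapezoid(roc_points(probs, labels))
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.8542 should be read.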

5.4 The Final Model

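Of the 3 models evaluated above using cross-validation and performance measures, ModelA had the highest AUC (0.8542) and a test accuracy of 79%, which was matched only by the fourth model. It fits the data well. The table below shows the local performance metrics. The model is much better at identifying customers who stay (specificity 90.6%) than customers who leave (sensitivity 44.7%).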

Local performance metrics for Final Model
sensitivity  specificity  precision  recall     F1
0.4473684    0.90625      0.6181818  0.4473684  0.519084
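As a consistency check on the table above, the F1 score is the harmonic mean of precision and recall, and recall is the same quantity as sensitivity. A short Python sketch with the reported values:

```python
# Values copied from the local performance metrics table.
precision = 0.6181818
recall = 0.4473684  # identical to the reported sensitivity

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```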

The plot below shows the optimal cut-off probability for the final model.

Figure: 5-fold CV performance plot

6 Neural Network Model

6.1 Feature Conversion for Neural Network

Neural network models require all feature variables to be in numeric form. Categorical variables were converted to dummy variables and numerical variables were scaled.

6.1.1 Numeric Feature Scaling

First, the 3 numeric variables to be included in this model were scaled according to the following formula: \[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var}) - \min(\text{orig.var})} \]
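The scaling formula maps the smallest value of a variable to 0 and the largest to 1. A minimal Python sketch, applied to a few Term values chosen for illustration so that they include the reported minimum (0) and maximum (72):

```python
def min_max_scale(values):
    """Rescale values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative Term values (months); 0 and 72 are the reported min and max.
terms = [0, 16, 36, 72]
scaled_terms = min_max_scale(terms)
print(scaled_terms)
```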

6.1.2 Categorical Feature Conversion

All feature variables were extracted and the categorical variables converted to dummy variables using model.matrix(). They are:

##  [1] "(Intercept)"                       "Marital_StatusSingle"             
##  [3] "Term"                              "Internet_serviceDSL"              
##  [5] "Internet_serviceFiber optic"       "Internet_serviceNo Internet"      
##  [7] "tech_supportYes"                   "stream_videosYes"                 
##  [9] "Agreement_periodOne year contract" "Agreement_periodTwo year contract"
## [11] "Monthly_Charges"                   "Total_Charges"                    
## [13] "grp.IPYes"                         "ChurnYes"

Then the variables were renamed using simpler names.

##  [1] "(Intercept)"    "maritalstat"    "Term"           "DSL"           
##  [5] "fiber"          "nointernet"     "techsupport"    "streamvideos"  
##  [9] "oneyear"        "twoyears"       "monthlycharges" "totalcharges"  
## [13] "IP"             "Churn"
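The dummy variables listed above follow treatment coding: a factor with k levels becomes k - 1 indicator columns, and the omitted level is the reference. The Python sketch below mimics this for Internet_service, with Cable as the reference level as in the coefficient tables; the function name is hypothetical and the sketch is not a substitute for model.matrix().

```python
def dummy_code(value, levels, reference):
    """One 0/1 indicator per non-reference level (treatment coding)."""
    return {level: int(value == level) for level in levels if level != reference}

internet_levels = ["Cable", "DSL", "Fiber optic", "No Internet"]

print(dummy_code("DSL", internet_levels, reference="Cable"))
print(dummy_code("Cable", internet_levels, reference="Cable"))
```

A customer at the reference level has all indicators equal to 0, which is why Cable does not appear among the column names.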

6.2 The Model Formula

The following is the model formula:

## Churn ~ maritalstat + Term + DSL + fiber + nointernet + techsupport + 
##     streamvideos + oneyear + twoyears + monthlycharges + totalcharges + 
##     IP

6.3 Training and Testing NN Model

This follows the usual steps for building a neural network model to predict customer churn. First, the data was split into two: Training and testing data. Cross-validation was done with the training data and the model tested with the testing data.

6.3.1 Data Splitting

The data was split into 70% for training the neural network and 30% for testing. There were 700 observations in the training dataset and 300 observations in the testing data.

6.3.2 Neural network Model Building

Below is the neural network model obtained using the training data:

error                           47.3790324
reached.threshold                0.0096885
steps                         1175.0000000
Intercept.to.1layhid1           -1.1157300
maritalstat.to.1layhid1          0.4172192
Term.to.1layhid1                -6.3052454
DSL.to.1layhid1                 -0.2939402
fiber.to.1layhid1               -0.2356842
nointernet.to.1layhid1          -2.8342897
techsupport.to.1layhid1         -0.8026092
streamvideos.to.1layhid1         0.3203401
oneyear.to.1layhid1             -1.6723427
twoyears.to.1layhid1            -2.4868972
monthlycharges.to.1layhid1       1.0020602
totalcharges.to.1layhid1         6.2390948
IP.to.1layhid1                   0.0823311
Intercept.to.Churn               0.0433270
1layhid1.to.Churn                1.6424948
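The weights above describe a network with a single hidden unit, so a prediction can be traced by hand. The Python sketch below is a rough illustration under stated assumptions: the hidden unit uses the logistic activation, and the output is either left linear or passed through the logistic function (the report does not say which, so both are offered through the hypothetical `linear_output` flag). The example customer's scaled feature values are also illustrative.

```python
import math

HIDDEN_BIAS = -1.1157300
HIDDEN_WEIGHTS = {
    "maritalstat": 0.4172192, "Term": -6.3052454, "DSL": -0.2939402,
    "fiber": -0.2356842, "nointernet": -2.8342897, "techsupport": -0.8026092,
    "streamvideos": 0.3203401, "oneyear": -1.6723427, "twoyears": -2.4868972,
    "monthlycharges": 1.0020602, "totalcharges": 6.2390948, "IP": 0.0823311,
}
OUTPUT_BIAS = 0.0433270
OUTPUT_WEIGHT = 1.6424948

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def network_output(features, linear_output=True):
    """Forward pass; features missing from the dict are treated as 0."""
    z = HIDDEN_BIAS + sum(w * features.get(name, 0.0) for name, w in HIDDEN_WEIGHTS.items())
    hidden = sigmoid(z)  # assumption: logistic activation in the hidden unit
    raw = OUTPUT_BIAS + OUTPUT_WEIGHT * hidden
    return raw if linear_output else sigmoid(raw)

# Illustrative customer: married, cable, technical support, international plan,
# monthly contract, with min-max scaled numeric features.
customer = {
    "Term": 16 / 72,
    "techsupport": 1.0,
    "monthlycharges": (98.05 - 18.95) / (116.25 - 18.95),
    "totalcharges": (1410.25 - 5.0) / (8476.5 - 5.0),
    "IP": 1.0,
}
score = network_output(customer)
print(score)
```

The score is then compared with a cut-off probability to obtain a Yes or No prediction.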

6.3.3 Neural Network Plot

The plot of the single-layer neural network model of customer churn:

Figure: Single-layer backpropagation neural network model for customer churn

6.3.4 Logistic Model

The table below shows a logistic regression model with all predictors fitted on the scaled features. The coefficients of the dummy variables are the same as in the first model of Section 4.2, while the intercept and the coefficients of Term, Monthly_Charges and Total_Charges differ only because of the min-max scaling.

                                    Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                          -0.5047      0.6383  -0.7907    0.4291
Marital_StatusSingle                  0.5372      0.3586    1.498    0.1341
Term                                  -2.204      0.9762   -2.258   0.02397
Internet_serviceDSL                   -0.417      0.3359   -1.241    0.2144
Internet_serviceFiber optic          -0.3456      0.2248   -1.537    0.1242
Internet_serviceNo Internet           -2.086      0.6082    -3.43 0.0006029
tech_supportYes                      -0.7406      0.2139   -3.462 0.0005352
stream_videosYes                     0.06015      0.2523   0.2384    0.8116
Agreement_periodOne year contract     -1.594      0.3026   -5.266 1.395e-07
Agreement_periodTwo year contract     -1.706      0.3929   -4.341 1.419e-05
Monthly_Charges                        1.683       1.055    1.596    0.1105
Total_Charges                          1.486       1.303    1.141     0.254
grp.IPYes                             0.1558      0.2009   0.7757    0.4379
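The coefficients of the numeric variables in this table are consistent with the first model in Section 4.2 once the min-max scaling is taken into account: scaling a predictor multiplies its coefficient by the predictor's range and shifts the intercept. The Python sketch below checks this with the reported estimates and the minimum and maximum values from the summary table in Section 3; small differences are due to rounding in the printed output.

```python
# First-model estimates on the original scale (Section 4.2).
first_model = {"Term": -0.0306097, "Monthly_Charges": 0.0173013, "Total_Charges": 0.0001754}
first_intercept = -0.8334170

# (min, max) of each numeric variable, from the summary table in Section 3.
ranges = {"Term": (0.0, 72.0), "Monthly_Charges": (18.95, 116.25), "Total_Charges": (5.0, 8476.5)}

# Coefficient on the scaled variable = original coefficient * (max - min).
scaled_coef = {name: beta * (ranges[name][1] - ranges[name][0])
               for name, beta in first_model.items()}

# Intercept on the scaled variables = original intercept + sum(beta * min).
scaled_intercept = first_intercept + sum(beta * ranges[name][0]
                                         for name, beta in first_model.items())

print(scaled_coef, scaled_intercept)
```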

6.3.5 Cross-validation in Neural Network

The cross validation in the neural network was carried out with the training dataset. The optimal cut off probability obtained for the neural network model was 0.43.

6.3.6 Testing Model Performance

The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.

Confusion Matrix
          0    1
FALSE   186   33
TRUE     32   49

Test Accuracy of Neural Network
0.7833333

The accuracy was found to be 78.33%.
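The reported accuracy can be recomputed from the confusion matrix, together with the other local performance measures. The Python sketch below assumes that the rows (FALSE, TRUE) are the predicted classes and the columns (0, 1) are the actual classes, with 1 meaning churn; the report does not label the axes, so sensitivity, specificity and precision depend on that assumption, while accuracy does not.

```python
# Cell counts from the confusion matrix, under the assumed orientation.
tn, fn = 186, 33   # predicted FALSE: actual 0, actual 1
fp, tp = 32, 49    # predicted TRUE:  actual 0, actual 1

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of actual churners detected
specificity = tn / (tn + fp)   # share of actual non-churners detected
precision = tp / (tp + fp)     # share of predicted churners who churned
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(accuracy, 7), round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(f1, 4))
```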

6.4 ROC Analysis

The ROC curve is the plot of sensitivity against 1 - specificity calculated from the confusion matrix based on a sequence of selected cut-off scores.

An ROC curve is shown for the above neural network model based on the training data set.

Figure: ROC curve of the neural network model

The above ROC curve indicates that the neural network performs better than random guessing, since the area under the curve is well above 0.5. Also, since the AUC is greater than 0.65, the neural network model is acceptable.

6.5 Comparison of predictive performance between the logistic model and the neural network model

Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.

The neural network model had a test accuracy of 78.33% and an AUC of 0.8587, while the final logistic regression model had a test accuracy of 79% and an AUC of 0.8542. Both models are acceptable, and the differences are too small to determine which model is better based solely on this information.