1 Introduction

This data set contains customer level information for a telecommunication company. Each customer has a unique set of characteristics relating to the services they have used.

The telecommunications sector is growing quickly, and service providers are increasingly focused on expanding their subscriber bases. In such a competitive industry, retaining current customers has become one of the biggest challenges. It is widely reported that acquiring a new customer costs significantly more than keeping an existing one.

Therefore, it is crucial for telecommunications companies to employ advanced analytics to understand customer behavior and predict whether customers are likely to leave.

1.1 Objectives of this study

The following are the questions and objectives that describe and explain the main purpose of this project.

To predict which customers are more likely to churn.
What is the percentage of churn customers in the company?
Are there any notable patterns in terms of customer churn based on gender and marital status?
Are there any notable patterns in terms of customer churn based on the amount spent by the customer and type of service provided?
Which services are the most profitable?

1.2 Response variable

The response variable is Churn which is a binary variable with two values, yes and no. The value yes means that a customer left the company while no means that a customer is still active.

1.3 Model

A logistic regression model will be used to analyse the relationship between the binary response variable, Churn, and the predictor variables. The model describes how the predictors affect the probability of observing the larger value of the response variable, that is, Churn = yes.

2 Description of the Data

The total number of records in this data set is 1000. It consists of 14 variables including the response variable with the name Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:

Sex: Sex of the customer - Categorical var

Marital_status: Marital status of the customer - Categorical var

Term: Term (Displayed in months) - Numerical var

Phone_service: Phone service - Categorical var

international_plan: International plan - Categorical var

Voice_mail_plan: Voice mail plan - Categorical var

Multiple_line: Multiple line - Categorical var

Internet_service: Internet service - Categorical var

Technical_support: Technical support - Categorical var

Streaming_videos: Streaming Videos - Categorical var

Agreement_period: Agreement period - Categorical var

Monthly_charges: Monthly Charges - Numerical var

Total charges: Total Charges - Numerical var

Churn: Churn (Yes or No)

A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt

3 EDA for Feature Engineering

The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. All the numerical and categorical variables were examined closely and there were no missing values found.

3.1 Missing Values

The above summary table indicates that there are no missing values in any of the variables.

3.2 Assess Distributions

Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.

3.2.1 Categorical variables

The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.

From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.

44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use fiber optic and 20.8% have no internet service. The majority were on a monthly contract. In this data set, 74.1% of customers had not left the company.

3.2.2 Regrouping of categorical variables

One of the categorical variables, International Plan, had 3 groups: No, Yes and yes. This was an input error that happened when this data was collected. Below is the table showing the frequency of these groups.

Groups	Freq
No	429
yes	262
Yes	309

In order to rectify this, a new variable called grp.IP was created that contains only 2 distinct groups of the International plan variable: No and Yes.

Similarly, Technical support and Streaming videos each had 3 groups: Yes, No and No internet. No and No internet were combined into a single group because they are close in meaning.

57.1% of the customers had an international plan. About a third of the customers had technical support and 41% had video streaming.
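The International plan regrouping described above can be sketched in a few lines. This is a minimal illustration in Python (the original analysis appears to have been done in R); the function name `regroup_label` is hypothetical, and the counts are the ones reported in the frequency table.

```python
from collections import Counter

# Frequencies reported in the table for the raw International plan labels.
raw_counts = {"No": 429, "yes": 262, "Yes": 309}

def regroup_label(label: str) -> str:
    """Map case variants such as 'yes' and 'Yes' onto one canonical label."""
    return label.strip().capitalize()

# Accumulate the raw frequencies under the canonical labels.
grp_ip = Counter()
for label, freq in raw_counts.items():
    grp_ip[regroup_label(label)] += freq

total = sum(grp_ip.values())
share_yes = grp_ip["Yes"] / total

print(dict(grp_ip))
print(f"Share with an international plan: {share_yes:.1%}")
```

Merging the two spellings gives 571 customers with an international plan out of 1000, which agrees with the 57.1% quoted above.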

3.3 Numerical Variables

There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:

The plot of the histogram showing the distribution of Term shows a non-symmetric pattern with the highest frequency between 0 and 5 months and lowest between 35 and 40 months.

This is quite different from the distribution of total charges, which is right skewed, so the mean is greater than the median. Here, the highest frequency is between 0 and 1000 and the lowest is between 8000 and 9000. The distribution has a stepwise pattern: smaller amounts have higher frequencies and larger amounts have lower frequencies.

The distribution of monthly charges is represented by the density plot. It is bimodal, with the first peak at approximately 20 and the second (higher) peak at approximately 90. The lowest point on the plot, corresponding to the lowest frequency, is at approximately 40.

3.4 Discretizing Continuous Variables

From the above density plot of monthly charges, it can be seen that the distribution is bimodal, with peaks at 20 to 30 and 80 to 90. Therefore, this variable will be discretized for use in future models and algorithms. Monthly charges range from 18.95 to 116.25 and are grouped as follows:

less than 30: low charges

30 to 80 : moderate charges

greater than 80: high charges

The following table shows the frequency of the grouped variable, grp.month

Var1	Freq
High	406
Low	223
Moderate	371
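The grouping rule above can be written as a small function. The sketch below is in Python for illustration only; the function name `group_monthly_charge` is hypothetical, and the handling of the exact boundary values (30 and 80 both falling in the moderate group) follows the wording of the rule as stated.

```python
def group_monthly_charge(charge: float) -> str:
    """Assign a monthly charge to the Low / Moderate / High group."""
    if charge < 30:
        return "Low"        # less than 30
    if charge <= 80:
        return "Moderate"   # 30 to 80, both ends included
    return "High"           # greater than 80

# Monthly charges taken from rows of the prediction table later in the report.
examples = [98.05, 75.25, 19.85]
groups = [group_monthly_charge(c) for c in examples]
print(list(zip(examples, groups)))
```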

3.5 Pairwise Associations

Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.

3.5.1 Two categorical variables

This was done to determine whether the response variable (Churn, which is binary) is independent of each categorical variable. Categorical variables found to be independent of the response variable will be excluded from all subsequent models and algorithms. Mosaic plots are a convenient way to show whether two categorical variables are dependent. When they are independent, all proportions are the same and so the boxes line up in a grid.

From the above mosaic plots, it can be seen that sex, phone service, voicemail plan, and multiple line appear to be independent of the response variable, churn. This is because the proportion of churn cases appears to be nearly identical across the individual categories of these variables. Churn is not independent of marital status and International plan. The other mosaic plots are shown below:

In addition to marital status and International plan, Agreement period, Internet service, monthly charges (grouped), technical support and streaming videos are not independent of the response variable, Churn.

3.5.2 Pearson Chi-Square Test

A Pearson chi-square test was carried out to check whether Sex, Phone service, Voice mail plan and Multiple line are independent of the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. The chi-square p-values for each of the variables are given below:

Chisq.sex.p.value	Chisq.Phoneservice.p.value	Chisq.Voicemail.p.value	Chisq.multipleline.p.value
0.1248683	0.3680155	0.6651237	0.3384263
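To make the test concrete, the following is a minimal Python sketch of the Pearson chi-square statistic for a 2x2 table. The cell counts are invented for illustration and are not taken from the churn data, and the function name `chi_square_2x2` is hypothetical. No continuity correction is applied here, whereas R's `chisq.test()` applies Yates' correction to 2x2 tables by default, so results from the two would differ slightly.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are two groups, columns are churn = No / Yes.
independent = [[30, 20], [30, 20]]   # identical proportions in both rows
associated = [[40, 10], [20, 30]]    # clearly different proportions

stat_ind, p_ind = chi_square_2x2(independent)
stat_assoc, p_assoc = chi_square_2x2(associated)
print(stat_ind, p_ind)
print(stat_assoc, p_assoc)
```

A large p-value, as in the first table, is what "no significant association" means in the results above.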

3.5.3 Two Numerical Variables

The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.

The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.

The stacked density curves on the main diagonal show the potential difference in the distribution of the underlying numeric variable between the Churn and non-Churn groups. In other words, they show the relationship between each numeric variable and the categorical response. These stacked density curves do not completely overlap, indicating some association between each of these numeric variables and the binary response variable, Churn.

Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.

3.6 Conclusion

Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and Multiple line were dropped because they were found to be independent of the response variable, Churn.

International plan was also dropped and the new variable, Grp.IP was kept instead. Grp.month will also be kept in the dataset, as an alternative to its numerical counterpart, monthly charges for modelling. The number of variables in the final dataset was 11.

The following are the variables that will be used for subsequent modelling. Marital_Status, Term, Internet_service, tech_support, stream_videos, Agreement_period, Monthly_Charges, grp.month, Total_Charges, grp.IP and Churn

4 Logistic Predictive Modelling

4.1 Assumptions

In building a logistic model for this analysis, it is necessary to make sure that all assumptions are satisfied. The following are the assumptions of a logistic model:

The response variable must be binary. This is true for this data. The values for the response variable, churn, are yes and no.
The predictor variables are assumed not to be highly correlated with one another. However, since the primary aim of this analysis is to predict which customers are more likely to churn, rather than to understand the role of each predictor variable, there is no need to reduce severe multicollinearity.
The functional form of the predictor variables is correctly specified.

4.2 Model building

Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are Marital status, Streaming videos, technical support, and international plan. Agreement period and grp.month have 3 groups each while Internet service has 4 groups. All the character variables were changed to factors with different levels.

The numeric variables are Term, Monthly charges and Total charges. In total, there are 9 predictor variables (each model can only contain either the continuous monthly charges or the grouped variable).

The predictors for our model are:

Marital.status: Marital status of the customer - factor with 2 levels

Term: Term (Displayed in months) - Numerical variable

International.plan: International plan - factor with 2 levels

Internet.service: Internet service - factor with 4 levels

technical.support: Technical support - factor with 2 levels

streaming.videos: Streaming Videos - factor with 2 levels

Agreement_period: Agreement period - factor with 3 levels

Monthly_Charges: Monthly Charges - Numerical variable OR grpd.month : Monthly charges - factor with 3 levels

Total_Charges: Total charges - numerical variable

First, a logistic regression model that contains all predictor variables in the data set, with monthly charges as a numeric variable, was built. This is called the first model.

Significance Tests for the First Model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-0.8334170	0.8265047	-1.0083633	0.3132801
marital.statusSingle	0.5372165	0.3585663	1.4982347	0.1340723
Term	-0.0306097	0.0135583	-2.2576403	0.0239681
technical.supportYes	-0.7406359	0.2139030	-3.4624854	0.0005352
internet.serviceDSL	-0.4170019	0.3358900	-1.2414836	0.2144272
internet.serviceFiber optic	-0.3456259	0.2248026	-1.5374642	0.1241797
internet.serviceNo Internet	-2.0864226	0.6082339	-3.4302965	0.0006029
streaming.videosYes	0.0601511	0.2523427	0.2383705	0.8115937
agreement.periodOne year contract	-1.5935134	0.3026089	-5.2659178	0.0000001
agreement.periodTwo year contract	-1.7055952	0.3929081	-4.3409525	0.0000142
International.planYes	0.1558211	0.2008870	0.7756652	0.4379467
Monthly_Charges	0.0173013	0.0108412	1.5958830	0.1105149
Total_Charges	0.0001754	0.0001538	1.1406038	0.2540349

The AIC of the first model is 881.7286. It is made up of 12 coefficient terms (excluding the intercept). Some of these were significant at the .05 level: Term (p=0.0239681), Technical support (p=0.0005352), Internet service-no internet (p=0.0006029), one year agreement period (p=0.0000001), and two year agreement period (p=0.0000142).

Then another model containing all the predictors, but this time with monthly charges as a factor, is built.

Significance Tests for the Second Model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.5920773	0.3527946	1.6782492	0.0932985
marital.statusSingle	0.1919100	0.3146996	0.6098196	0.5419813
Term	-0.0336493	0.0131886	-2.5513906	0.0107294
technical.supportYes	-0.7098222	0.2122457	-3.3443414	0.0008248
internet.serviceDSL	-0.6291153	0.2958855	-2.1262121	0.0334856
internet.serviceFiber optic	-0.3212021	0.2230693	-1.4399211	0.1498897
internet.serviceNo Internet	-3.0963883	0.7207580	-4.2960167	0.0000174
streaming.videosYes	0.1840132	0.2325419	0.7913119	0.4287620
agreement.periodOne year contract	-1.5746912	0.3024023	-5.2072730	0.0000002
agreement.periodTwo year contract	-1.6864130	0.3949791	-4.2696260	0.0000196
International.planYes	0.2394340	0.2038002	1.1748465	0.2400561
grpd.monthLow	0.1892848	0.6708755	0.2821459	0.7778316
grpd.monthModerate	-0.3335832	0.2777887	-1.2008523	0.2298085
Total_Charges	0.0002338	0.0001424	1.6412233	0.1007511

The AIC of the second model is 883.9549. It is made up of 13 coefficient terms (excluding the intercept). Here, the terms that were significant at the .05 level are: Term (p=0.0107294), Technical support (p=0.0008248), Internet service-DSL (p=0.0334856), Internet service-no internet (p=0.0000174), one year agreement period (p=0.0000002), and two year agreement period (p=0.0000196).

When the two models are compared on AIC, the first model has the lower value, 881.7286. This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling will be carried out with the first model.

Important variables which must be included in the model, based on results from other studies and analyses, are agreement period, term and monthly charges. With these three variables, the reduced model is built.

Significance Tests for Reduced Model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-1.8270051	0.2414150	-7.567903	0.0000000
Term	-0.0174912	0.0051167	-3.418431	0.0006298
agreement.periodOne year contract	-1.8564358	0.2919578	-6.358575	0.0000000
agreement.periodTwo year contract	-2.1559901	0.3681332	-5.856549	0.0000000
Monthly_Charges	0.0266624	0.0034836	7.653632	0.0000000

The AIC of the reduced model is 896.8588. It is made up of 4 coefficient terms (excluding the intercept), all of which are significant at the .05 level.

All the significant variables from the first model were added to the reduced model to build a fourth model. These are: Term, Technical support, Internet service, and agreement period. Since Term and agreement period were already present in the reduced model, just technical support and internet service were added from the first model.

Significance Tests for Fourth Model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-0.6846363	0.5267513	-1.299734	0.1936923
Term	-0.0176784	0.0052027	-3.397891	0.0006791
technical.supportYes	-0.7625008	0.2130108	-3.579634	0.0003441
internet.serviceDSL	-0.3204252	0.3014467	-1.062958	0.2878010
internet.serviceFiber optic	-0.3313126	0.2246921	-1.474519	0.1403420
internet.serviceNo Internet	-1.7390432	0.5341455	-3.255748	0.0011309
agreement.periodOne year contract	-1.5743930	0.2996931	-5.253350	0.0000001
agreement.periodTwo year contract	-1.6702441	0.3865429	-4.320979	0.0000155
Monthly_Charges	0.0180282	0.0064770	2.783392	0.0053794

The AIC of the fourth model is 878.4816. It is made up of 8 coefficient terms (excluding the intercept). The intercept and 2 of the internet service dummy variables (fiber optic and DSL) were not significant at the .05 significance level.

The next step is to use an automatic variable selection procedure to find the best model.

4.2.1 Automatic Variable Selection

This is done using the automatic variable selection function, step(), to search for the final model. Starting from the first model, terms are dropped using AIC as the inclusion/exclusion criterion.

Summary Table of Significant Tests for final model - model A
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-1.2265316	0.6170311	-1.987795	0.0468343
marital.statusSingle	0.5806634	0.3403543	1.706055	0.0879977
Term	-0.0181675	0.0052304	-3.473470	0.0005138
technical.supportYes	-0.7525514	0.2136209	-3.522836	0.0004270
internet.serviceDSL	-0.3298774	0.3039017	-1.085474	0.2777118
internet.serviceFiber optic	-0.3261646	0.2244343	-1.453275	0.1461476
internet.serviceNo Internet	-1.8273452	0.5366413	-3.405152	0.0006613
agreement.periodOne year contract	-1.5894411	0.3009554	-5.281318	0.0000001
agreement.periodTwo year contract	-1.6724833	0.3871723	-4.319739	0.0000156
Monthly_Charges	0.0241694	0.0074372	3.249803	0.0011548

The best and final model is model A, which has the smallest AIC, 877.5205, and 9 coefficient terms (excluding the intercept). Marital status and 2 of the internet service dummy variables, DSL and fiber optic, were not significant at the .05 significance level.
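The AIC-based comparison across the five fitted models can be summarised with a short Python sketch. The AIC values are the ones reported in this section; the dictionary keys are informal labels chosen here for illustration.

```python
# AIC values as reported for each fitted logistic regression model.
aic = {
    "first model": 881.7286,
    "second model": 883.9549,
    "reduced model": 896.8588,
    "fourth model": 878.4816,
    "model A": 877.5205,
}

# The preferred model is the one with the smallest AIC.
best_name = min(aic, key=aic.get)

# Difference of each model's AIC from the best one.
delta_aic = {name: value - aic[best_name] for name, value in aic.items()}

for name, delta in sorted(delta_aic.items(), key=lambda item: item[1]):
    print(f"{name:14s} AIC = {aic[name]:9.4f}   delta = {delta:7.4f}")
```

The gap between model A and the fourth model is under 1 AIC unit. A commonly quoted rule of thumb treats differences below about 2 as weak evidence, so the two models are close competitors on this criterion.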

4.3 Interpretation - Association Analysis

The summary table for the best model, model A, contains the important variables from the reduced model: Term, one year agreement period, two year agreement period and Monthly_Charges. All 4 terms were statistically significant at the 0.05 significance level (Term: p=0.0005138, one year agreement period: p=0.0000001, two year agreement period: p=0.0000156 and Monthly_Charges: p=0.0011548). Term, one year agreement period (vs monthly period) and two year agreement period (vs monthly period) are negatively associated with the response variable, churn, while monthly charges is positively associated with it.

Fiber optic, DSL and No internet, when compared to the reference level, Cable, are negatively associated with the response variable. The estimated odds of churning for customers who use fiber optic (p=0.1461476) or DSL (p=0.2777118) are lower than for those who use Cable, although neither difference is significant at the .05 level. Customers who do not have internet (p=0.0006613) have significantly lower odds of churning than those who use Cable.

Single customers have higher estimated odds of churning than married customers, but this is not significant at the .05 level (p=0.0879977). One and two year agreement periods are negatively associated with churn: the odds of churning for a customer with a one year (p=0.0000001) or two year agreement period (p=0.0000156) are lower than for a customer whose agreement period is monthly. In other words, customers on longer contracts are more likely to remain with the company.

The odds of churning increase as monthly charges increase (p=0.0011548) and decrease as the term (time spent with the company in months) increases (p=0.0005138). Customers who have technical support (p=0.0004270) have lower odds of churning, and therefore higher odds of remaining with the company.
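Because the model is fitted on the log-odds scale, exponentiating a coefficient gives an odds ratio for churn, which makes the direction and size of each association easier to read. The Python sketch below applies this to the model A estimates from the summary table; the dictionary name is chosen here for illustration.

```python
import math

# Model A estimates (log-odds of churn) copied from the summary table.
model_a_coefficients = {
    "marital.statusSingle": 0.5806634,
    "Term": -0.0181675,
    "technical.supportYes": -0.7525514,
    "internet.serviceDSL": -0.3298774,
    "internet.serviceFiber optic": -0.3261646,
    "internet.serviceNo Internet": -1.8273452,
    "agreement.periodOne year contract": -1.5894411,
    "agreement.periodTwo year contract": -1.6724833,
    "Monthly_Charges": 0.0241694,
}

# exp(beta) is the multiplicative change in the odds of churn:
# per one-unit increase for numeric terms, versus the reference level for dummies.
odds_ratios = {term: math.exp(beta) for term, beta in model_a_coefficients.items()}

for term, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{term:36s} OR = {ratio:.3f}  ({direction} odds of churn)")
```

An odds ratio below 1 (for example, roughly 0.2 for a one year contract versus a monthly contract) corresponds to lower odds of churning, holding the other predictors fixed.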

4.4 Prediction Analysis

The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.

4.4.1 Predict already existing data

The predicted response is compared to the original response. This is shown in the following table.

Dataset with model predicted response
Mar.status	Term	Internet.service	Tech.support	Agr.period	Month.charges	churn	Predicted
Married	16	Cable	Yes	Monthly contract	98.05	Yes	Yes
Married	70	Cable	Yes	One year contract	75.25	No	No
Married	36	Cable	Yes	Monthly contract	73.35	No	No
Married	72	Cable	Yes	One year contract	112.60	No	No
Married	40	Cable	No	Monthly contract	95.05	No	Yes
Single	15	No Internet	No	Monthly contract	19.85	No	No
Married	1	Cable	No	Monthly contract	89.20	No	Yes
Married	36	Cable	No	Monthly contract	94.65	Yes	Yes
Married	5	Cable	No	Monthly contract	97.10	Yes	Yes
Married	57	Cable	Yes	One year contract	113.25	No	No

For the first 10 observations in the dataset, the predicted response matches the original response in 8 of the 10 cases.

The following tables show the frequency of the original response variable and the frequency of the predicted response variable. In the original variable, 74.1% of customers are still with the company, while the predicted response variable gives this as 77.7%. The two proportions are close, which is consistent with an acceptable model, although agreement in overall frequencies does not by itself measure case-by-case accuracy; that is assessed in Section 5.

Frequency of Predicted Response Variable
Var1	Freq
No	777
Yes	223

Frequency of Original Response Variable
Var1	Freq
No	741
Yes	259

4.4.2 Predict New Data

A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:

Predicted Values of New Data
Mar.status	Term	Tech.support	Internet.service	Agr.period	Monthly.charges	Predicted
Married	38	Yes	Cable	Monthly contract	100.5	No
Single	50	No	No Internet	Monthly contract	87.6	No
Single	14	No	No Internet	One year contract	110.5	No
Single	4	Yes	Cable	Monthly contract	90.0	Yes
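The predictions in this table can be reproduced by hand from the model A coefficients. The Python sketch below computes the linear predictor and the inverse-logit probability for each hypothetical customer, reading the Yes/No entry as technical support and the Cable/No Internet entry as internet service. The 0.5 cut-off is an assumption made here; the report does not state which cut-off was used for this table.

```python
import math

# Model A estimates copied from the summary table. Reference levels
# (Married, no technical support, Cable, Monthly contract) contribute 0.
INTERCEPT = -1.2265316
SINGLE = 0.5806634
TERM = -0.0181675
TECH_SUPPORT_YES = -0.7525514
INTERNET = {"Cable": 0.0, "DSL": -0.3298774,
            "Fiber optic": -0.3261646, "No Internet": -1.8273452}
AGREEMENT = {"Monthly contract": 0.0, "One year contract": -1.5894411,
             "Two year contract": -1.6724833}
MONTHLY_CHARGES = 0.0241694

def churn_probability(marital, term, tech_support, internet, agreement, charges):
    """Inverse-logit of the model A linear predictor for one customer."""
    eta = (INTERCEPT
           + SINGLE * (marital == "Single")
           + TERM * term
           + TECH_SUPPORT_YES * (tech_support == "Yes")
           + INTERNET[internet]
           + AGREEMENT[agreement]
           + MONTHLY_CHARGES * charges)
    return 1.0 / (1.0 + math.exp(-eta))

# The four hypothetical customers from the table above.
new_customers = [
    ("Married", 38, "Yes", "Cable", "Monthly contract", 100.5),
    ("Single", 50, "No", "No Internet", "Monthly contract", 87.6),
    ("Single", 14, "No", "No Internet", "One year contract", 110.5),
    ("Single", 4, "Yes", "Cable", "Monthly contract", 90.0),
]

CUTOFF = 0.5  # assumed cut-off, not stated in the report
probabilities = [churn_probability(*row) for row in new_customers]
predicted = ["Yes" if p > CUTOFF else "No" for p in probabilities]
print(predicted)
```

Under these assumptions the sketch gives the same four predictions as the table.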

At the end of the logistic regression modelling, only the variables used in the final model were retained in the dataset. They are: marital status, term, internet service, technical support, agreement period, monthly charges and the response variable, churn.

5 Cross Validation and Performance Measures

5.1 Data Partition

Since the sample size is large, the data was split randomly 70%:30%, with 70% of the data used for training and validating models and 30% for testing. The total number of observations was 1000. The training data set contained 700 observations and the testing data set contained 300.
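A random 70%:30% partition of row indices can be sketched as follows. This is an illustrative Python version; the function name `train_test_split_indices` and the seed value are hypothetical and are not taken from the original analysis.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.7, seed=123):
    """Randomly partition row indices 0..n_rows-1 into training and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed means every model is trained and tested on the same partition, which keeps the later accuracy comparisons like-for-like.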

5.2 Cross-Validation

The three best models built in the previous section were used: ModelA, the reduced model and the fourth model. Cross-validation was done on all 3 models using the training dataset. This involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set).

5.2.1 Cross-validation and Test Accuracy for the ModelA

The model was fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:

Test Accuracy of ModelA
test.accuracy
0.79

The accuracy was found to be 79%. This test accuracy gives no indication of serious under-fitting or over-fitting, although a comparison with the training accuracy would be needed to confirm this.

5.2.2 Cross-validation and Test Accuracy for the Reduced Model

The regression coefficients obtained by fitting the reduced model on the training data were used to obtain the accuracy on the test data. The results are as follows:

Test Accuracy of the Reduced Model
test.accuracy3
0.7833333

5.2.3 Cross-validation and Test Accuracy for the fourth Model

Test Accuracy of the Fourth Model
test.accuracy2
0.79

5.3 Optimal Cut-off Probability for all Models

A sequence of 20 candidate cut-off probabilities was evaluated with 5-fold cross-validation to identify the optimal cut-off probability for each model.

5-fold CV performance plot

ModelA was the model generated from automatic variable selection. The optimal cut-off probability that yields the best accuracy for ModelA is 0.52. The reduced model was made up of term, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for this model is 0.57. The fourth model was made up of term, technical support, internet service, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for the fourth model is 0.52.
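The cut-off search can be sketched as a grid evaluation averaged over folds. The Python code below is a simplified illustration: it takes predicted probabilities as given, whereas a full cross-validation would refit the model on the other four folds before scoring each held-out fold. The function names, the toy scores and the choice of 20 evenly spaced interior cut-offs are all assumptions made here.

```python
import random

def accuracy_at_cutoff(probs, labels, cutoff):
    """Share of cases where (prob > cutoff) agrees with the 0/1 label."""
    hits = sum((p > cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def cv_best_cutoff(probs, labels, n_cutoffs=20, k=5, seed=1):
    """Return the cut-off with the highest mean accuracy across k folds."""
    cutoffs = [(i + 1) / (n_cutoffs + 1) for i in range(n_cutoffs)]
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]    # k disjoint validation folds
    mean_acc = []
    for cutoff in cutoffs:
        fold_acc = [
            accuracy_at_cutoff([probs[i] for i in fold],
                               [labels[i] for i in fold], cutoff)
            for fold in folds
        ]
        mean_acc.append(sum(fold_acc) / k)
    best = max(range(n_cutoffs), key=mean_acc.__getitem__)
    return cutoffs[best], mean_acc[best]

# Toy scores (hypothetical): non-churners score 0.05-0.29, churners 0.60-0.74.
toy_probs = [0.05 + 0.01 * i for i in range(25)] + [0.60 + 0.01 * i for i in range(15)]
toy_labels = [0] * 25 + [1] * 15

best_cutoff, best_accuracy = cv_best_cutoff(toy_probs, toy_labels)
print(best_cutoff, best_accuracy)
```

When several cut-offs tie for the best mean accuracy, this sketch returns the smallest of them; other tie-breaking rules, such as averaging the tied cut-offs, are equally defensible.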

5.4 Global Measure: ROC and AUC

The ROC curve is the plot of the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR, 1 - specificity), calculated at each decision boundary such as the cut-off probability. To create an ROC curve for each model, a sequence of decision thresholds is needed, and the corresponding sensitivity and specificity are calculated at each one. The interval (0, 1) was split into 20 subintervals, and specificity and sensitivity were calculated at each of these cut-offs. Below is the plot of the ROC curves (1 - specificity, sensitivity).

The above ROC curves plot the true positive rate against the false positive rate at each decision threshold for the three models: modelA, the reduced model and the fourth model. The true positive rate represents the proportion of truly positive observations that are predicted to be positive, while the false positive rate represents the proportion of truly negative observations that are predicted to be positive.

From the plot, it can be seen that modelA has the highest AUC, 0.8542, which indicates good discriminative ability. Its curve also lies closest to the top-left corner, which is a further indication of good performance.
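The construction of an ROC curve from a grid of cut-offs, and the AUC obtained from it by the trapezoidal rule, can be sketched as follows. This Python illustration uses hypothetical function names and toy scores; it is not the code used to produce the plot.

```python
def roc_points(probs, labels, n_cutoffs=20):
    """(FPR, TPR) pairs at evenly spaced cut-offs on [0, 1], sorted by FPR."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0)}                      # cut-off above every score
    for i in range(n_cutoffs + 1):
        cutoff = i / n_cutoffs
        tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
        points.add((fp / negatives, tp / positives))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example (hypothetical scores): the two classes are perfectly separated.
toy_probs = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

curve = roc_points(toy_probs, toy_labels)
auc = auc_trapezoid(curve)
print(curve)
print(auc)
```

With only 20 cut-offs the curve is a coarse approximation, so an AUC computed this way can differ slightly from one based on every distinct predicted probability.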

5.5 The Final Model

Of the 3 models evaluated above using cross-validation and performance measures, ModelA had the highest AUC (0.8542) and a test accuracy of 79%, equal to that of the fourth model and slightly above that of the reduced model (78.33%). It fits the data well. Below is the table showing the local performance metrics:

Local performance metrics for Final Model
sensitivity	specificity	precision	recall	F1
0.4473684	0.90625	0.6181818	0.4473684	0.519084
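As a quick consistency check on the table above, the F1 score is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the reported precision and recall; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the final model.
reported_precision = 0.6181818
reported_recall = 0.4473684

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 6))
```

The result agrees with the reported F1 value. Note also that recall and sensitivity are the same quantity, which is why the two columns in the table are identical.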

Below is the optimal cut-off probability for the final model.

5-fold CV performance plot

6 Neural Network Model

6.1 Feature Conversion for Neural Network

Neural network models require all feature variables to be in the numeric form. Categorical variables will be converted to dummy variables and numerical variables are scaled.

6.1.1 Numeric Feature Scaling

First, all 3 numeric variables to be included in this model are scaled according to the following formula: \[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var}) - \min(\text{orig.var})} \]
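The min-max formula above maps each numeric variable onto the interval [0, 1]. A minimal Python sketch follows; the function name `min_max_scale` is hypothetical, and the sample values are monthly charges taken from the prediction table earlier in the report.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant variable")
    return [(v - lo) / (hi - lo) for v in values]

# A few monthly charges from the first rows of the data.
monthly_charges = [98.05, 75.25, 73.35, 112.60, 95.05, 19.85]
scaled = min_max_scale(monthly_charges)
print([round(s, 3) for s in scaled])
```

In practice the minimum and maximum would be taken from the training data and reused when scaling the test data, so that both sets are on the same scale.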

6.1.2 Categorical Feature Conversion

All feature variables were extracted and the categorical variables converted to dummy variables using model.matrix().

Then the variables were renamed using simpler names.
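The dummy-variable conversion can be illustrated with a small Python sketch of treatment coding, in which the first level is the reference and receives no column. This mirrors the kind of coding that R's `model.matrix()` produces by default for a factor, but the function `dummy_code` is a hypothetical helper written for this illustration, and the output column names are the raw level names rather than the simplified names used in the report.

```python
def dummy_code(values, levels):
    """One 0/1 column per non-reference level; levels[0] is the reference."""
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        rows.append({level: int(value == level) for level in levels[1:]})
    return rows

internet_levels = ["Cable", "DSL", "Fiber optic", "No Internet"]
sample = ["Cable", "DSL", "No Internet"]

coded = dummy_code(sample, internet_levels)
for value, row in zip(sample, coded):
    print(value, row)
```

A customer on the reference level (Cable) has zeros in all three columns, which is why Internet service contributes only three terms (DSL, fiber, nointernet) to the model formula.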

6.2 The Model Formula

The following is the model formula:

Churn ~ maritalstat + Term + DSL + fiber + nointernet + techsupport + streamvideos + oneyear + twoyears + monthlycharges + totalcharges + IP

6.3 Training and Testing NN Model

This follows the usual steps for building a neural network model to predict customer churn. First, the data was split into two: Training and testing data. Cross-validation was done with the training data and the model tested with the testing data.

6.3.1 Data Splitting

The data was split into 70% for training the neural network and 30% for testing. There were 700 observations in the training dataset and 300 observations in the testing data.

6.3.2 Neural network Model Building

Below is the neural network model obtained using the training data:

error	47.3790324
reached.threshold	0.0096885
steps	1175.0000000
Intercept.to.1layhid1	-1.1157300
maritalstat.to.1layhid1	0.4172192
Term.to.1layhid1	-6.3052454
DSL.to.1layhid1	-0.2939402
fiber.to.1layhid1	-0.2356842
nointernet.to.1layhid1	-2.8342897
techsupport.to.1layhid1	-0.8026092
streamvideos.to.1layhid1	0.3203401
oneyear.to.1layhid1	-1.6723427
twoyears.to.1layhid1	-2.4868972
monthlycharges.to.1layhid1	1.0020602
totalcharges.to.1layhid1	6.2390948
IP.to.1layhid1	0.0823311
Intercept.to.Churn	0.0433270
1layhid1.to.Churn	1.6424948
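With a single hidden neuron, the fitted network is small enough to evaluate by hand. The Python sketch below performs the forward pass using the weights listed above. Two things are assumptions made here and are not stated in the report: that the hidden neuron uses a logistic activation, and that the output node is linear (no activation). Inputs are expected on the scaled [0, 1] and dummy-coded form described in Section 6.1.

```python
import math

# Weights into the single hidden neuron, copied from the listing above.
HIDDEN_WEIGHTS = {
    "maritalstat": 0.4172192, "Term": -6.3052454, "DSL": -0.2939402,
    "fiber": -0.2356842, "nointernet": -2.8342897, "techsupport": -0.8026092,
    "streamvideos": 0.3203401, "oneyear": -1.6723427, "twoyears": -2.4868972,
    "monthlycharges": 1.0020602, "totalcharges": 6.2390948, "IP": 0.0823311,
}
HIDDEN_INTERCEPT = -1.1157300
OUTPUT_INTERCEPT = 0.0433270
OUTPUT_WEIGHT = 1.6424948

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nn_score(features):
    """Forward pass; features omitted from the dict are treated as 0."""
    z = HIDDEN_INTERCEPT + sum(HIDDEN_WEIGHTS[name] * value
                               for name, value in features.items())
    hidden = sigmoid(z)                              # assumed logistic activation
    return OUTPUT_INTERCEPT + OUTPUT_WEIGHT * hidden  # assumed linear output

baseline_score = nn_score({})                  # every input at 0
two_year_score = nn_score({"twoyears": 1})     # same, but on a two year contract
print(baseline_score, two_year_score)
```

The signs of the hidden-layer weights follow the same pattern as the logistic regression coefficients: longer contracts, technical support and having no internet service all lower the score.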

6.3.3 Neural Network Plot

The plot of a single layer neural network model of customer churn:

Single-layer backpropagation Neural network model for Customer Churn

6.3.4 Logistic Model

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-0.5047	0.6383	-0.7907	0.4291
Marital_StatusSingle	0.5372	0.3586	1.498	0.1341
Term	-2.204	0.9762	-2.258	0.02397
Internet_serviceDSL	-0.417	0.3359	-1.241	0.2144
Internet_serviceFiber optic	-0.3456	0.2248	-1.537	0.1242
Internet_serviceNo Internet	-2.086	0.6082	-3.43	0.0006029
tech_supportYes	-0.7406	0.2139	-3.462	0.0005352
stream_videosYes	0.06015	0.2523	0.2384	0.8116
Agreement_periodOne year contract	-1.594	0.3026	-5.266	1.395e-07
Agreement_periodTwo year contract	-1.706	0.3929	-4.341	1.419e-05
Monthly_Charges	1.683	1.055	1.596	0.1105
Total_Charges	1.486	1.303	1.141	0.254
grp.IPYes	0.1558	0.2009	0.7757	0.4379

6.3.5 Cross-validation in Neural Network

The cross validation in the neural network was carried out with the training dataset. The optimal cut off probability obtained for the neural network model was 0.43.

6.3.6 Testing Model Performance

The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.

Confusion Matrix
	0	1
FALSE	186	33
TRUE	32	49

Test Accuracy of Neural Network
x
0.7833333

The accuracy was found to be 78.33%.
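The reported accuracy can be recovered directly from the confusion matrix, along with the other local performance measures. In the Python sketch below, the rows of the matrix (FALSE/TRUE) are taken to be the predicted class and the columns (0/1) the actual class; that orientation is an assumption, although the accuracy is the same either way.

```python
# Cell counts from the confusion matrix above,
# assuming rows = predicted (FALSE/TRUE) and columns = actual (0/1).
tn, fn = 186, 33    # predicted FALSE: actual 0, actual 1
fp, tp = 32, 49     # predicted TRUE:  actual 0, actual 1

n = tn + fn + fp + tp
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)     # share of actual churners correctly flagged
specificity = tn / (tn + fp)     # share of actual non-churners correctly cleared
precision = tp / (tp + fp)       # share of flagged customers who actually churned

print(f"n = {n}")
print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```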

6.4 ROC Analysis

The ROC curve is the plot of sensitivity against 1 - specificity calculated from the confusion matrix based on a sequence of selected cut-off scores.

An ROC curve is shown for the above neural network model based on the training data set.

ROC Curve of the neural network model

The above ROC curve indicates that the underlying neural network performs better than random guessing, since the area under the curve is well above 0.5. Also, since the AUC is greater than 0.65, the neural network model is acceptable.

6.5 Comparison of predictive performance between the logistic model and the neural network model

Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.

The neural network model had a test accuracy of 78.33% and an AUC of 0.8587, while the final logistic regression model had a test accuracy of 79% and an AUC of 0.8542. Both models are good and acceptable, and it is difficult to determine which model is better based solely on this information.

7 Decision Tree Algorithm

7.1 Description of the Algorithm

In this section, a decision tree was used as a predictive model to draw conclusions about customer churn data. The main goal of this model is to predict the value of a response variable based on several input variables. The Decision Tree (DT) algorithm is based on conditional probabilities and it generates rules.

A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes.

The decision tree is appropriate for this data because it is easy to interpret, very flexible and relatively insensitive to relationships between attributes. For example, if 2 variables in this dataset are highly correlated, the algorithm will typically split on only one of them.

To build the different decision tree models, the data set was first split randomly into two datasets, training and testing. There are 700 observations in the training dataset and 300 in the testing dataset. Cross-validation analysis will be done and the optimal cut-off score calculated using the training dataset. ROC analysis will also be carried out, and the decision tree with the largest AUC will be identified as the best.

7.2 rpart

Here, a wrapper is written so that different decision trees can be built conveniently.

Using the function, tree.builder = function(in.data, fp, fn, purity), six different decision tree models were defined below:

Model 1: gini.tree.11 is based on the Gini index without penalizing false positives and false negatives.

Model 2: info.tree.11 is based on entropy without penalizing false positives and false negatives.

Model 3: gini.tree.110 is based on the Gini index: the cost of a false negative is 10 times that of a false positive.

Model 4: info.tree.110 is based on entropy: the cost of a false negative is 10 times that of a false positive.

Model 5: gini.tree.101 is based on the Gini index: the cost of a false positive is 10 times that of a false negative.

Model 6: info.tree.101 is based on entropy: the cost of a false positive is 10 times that of a false negative.
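The two purity measures used by the six models, the Gini index and entropy, can be computed directly from the class counts in a node. The Python sketch below is illustrative only; the function names are chosen here, and the example uses the churn counts of the full data set (741 No, 259 Yes) rather than those of the training set, which are not reported.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

root_counts = [741, 259]      # churn = No, churn = Yes in the full data set
root_gini = gini(root_counts)
root_entropy = entropy(root_counts)

print(f"Gini impurity = {root_gini:.4f}")
print(f"Entropy       = {root_entropy:.4f} bits")
```

Both measures are 0 for a pure node and largest when the classes are evenly split (0.5 for Gini and 1 bit for entropy with two classes). A tree chooses the split that reduces the chosen measure the most.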

The tree diagrams of the first 4 decision tree models are given below.

Non-penalized decision tree models using Gini index (left) and entropy (right).

Penalized decision tree models using Gini index (left) and entropy (right).

7.3 ROC for Model Selection

ROC analysis was then used to select the best among all models. The function SenSpe = function(in.data, fp, fn, purity) is defined and used to build the 6 different trees and plot their corresponding ROC curves so that the global performance of these tree algorithms can be seen and compared. In addition to the input data, this function takes 3 arguments: the false positive cost, the false negative cost and the purity measure.

7.4 ROC Curves for the Different Tree Models

This shows the ROC curves with their corresponding AUCs (Area under the curve) for individual decision tree models.

Comparison of ROC curves

The above ROC curves represent the various decision trees and their corresponding AUCs. The model info.1.10 has the largest AUC, 0.85, and its curve extends farthest toward the upper left corner. Therefore, it is considered the best of the decision trees.

7.5 Optimal Cut-off Score Determination

The optimal cut-off score is needed for reporting the predictive performance of the final model with the test data. The optimal cut-off determination through cross-validation was based on the training data set. The function Optm.cutoff = function(in.data, fp, fn, purity) was first created and then used to calculate the optimal cut-off for the 6 decision trees shown earlier above.

Plot of optimal cut-off determination

In the above figure, there are multiple optimal cut-offs in each plot. Therefore, the final cut-off for the best model is taken as the average of the multiple cut-offs for that model. The average cut-off for the best model, gini 1.10, is 0.4475.

7.6 Discussions and Conclusions

At the end of the decision tree modelling, the best decision tree was identified with the optimal cut-off score. The following is the diagram of the best decision tree and the cut off score.

Best decision tree model using Gini index (left) and optimal cut-off (right).

Customer Churn Analysis and Prediction - A Case Study

Echefu Chinwendu

2023-07-24