This data set contains customer level information for a telecommunication company. Each customer has a unique set of characteristics relating to the services they have used.
The telecommunications sector is growing quickly, and service providers are increasingly focused on growing their subscriber bases. Retaining current customers has become one of the biggest challenges of surviving in this competitive industry, since the expense of acquiring a new customer is widely said to be significantly higher than the expense of keeping an existing one.
It is therefore crucial for telecommunications companies to employ advanced analytics to understand customer behavior and predict whether or not customers are going to leave the business.
The following are the questions and objectives that describe and explain the main purpose of this project.
The response variable is Churn which is a binary
variable with two values, yes and no. The
value yes means that a customer left the company while
no means that a customer is still active.
A logistic regression model will be used to analyse the relationship between the binary response variable, Churn, and the predictor variables, by examining how the predictors affect the probability of observing the larger value of the response (yes, meaning that the customer left).
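In standard notation, the logistic regression model described above can be written as follows. The symbols are introduced here for illustration only: \(p\) is the probability that Churn is yes, \(x_1, \dots, x_k\) are the predictors and \(\beta_0, \dots, \beta_k\) are the coefficients to be estimated.

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} \]

A positive coefficient therefore raises the odds of churn, and a negative coefficient lowers them.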
The total number of records in this data set is 1000. It consists of
14 variables including the response variable with the name
Churn. There are 3 numerical variables and 11 categorical
variables. The predictor variables include sex, marital status, term,
phone service and others. A detailed description of the variables is
given below:
Sex: Sex of the customer - Categorical variable
Marital_status: Marital status of the customer - Categorical variable
Term: Term (in months) - Numerical variable
Phone_service: Phone service - Categorical variable
international_plan: International plan - Categorical variable
Voice_mail_plan: Voice mail plan - Categorical variable
Multiple_line: Multiple line - Categorical variable
Internet_service: Internet service - Categorical variable
Technical_support: Technical support - Categorical variable
Streaming_videos: Streaming videos - Categorical variable
Agreement_period: Agreement period - Categorical variable
Monthly_charges: Monthly charges - Numerical variable
Total charges: Total charges - Numerical variable
Churn: Churn (Yes or No) - Categorical variable (response)
A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt
The entire data set was scanned to decide which Exploratory Data Analysis (EDA) tools to use for feature engineering. All the numerical and categorical variables were examined closely and no missing values were found.
The above summary table indicates that there are no missing values in any of the variables.
Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.
The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.
From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.
44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use Fiber Optic and 20.8% have none. The majority were on a monthly contract. In this data set, 74.1% of customers had not left the company.
One of the categorical variables, International Plan, had 3 groups: No, Yes and yes. This was an input error that happened when this data was collected. Below is the table showing the frequency of these groups.
| Groups | Freq |
|---|---|
| No | 429 |
| yes | 262 |
| Yes | 309 |
In order to rectify this, a new variable called grp.IP was created that contains only 2 distinct groups of the International plan variable: No and Yes (with yes merged into Yes).
Similarly, Technical support and Streaming videos each had 3 groups: Yes, No and No internet. No and No internet were combined into a single group because they are close in meaning.
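The recoding described above can be sketched in a few lines. This is a minimal Python illustration rather than the code used for the analysis; the helper names are hypothetical, and the raw column is rebuilt from the frequencies in the table above.

```python
from collections import Counter

def recode_international_plan(value):
    """Merge the case variants 'yes' and 'Yes' into a single 'Yes' group."""
    return "Yes" if value.strip().lower() == "yes" else "No"

def recode_internet_dependent(value):
    """Collapse 'No' and 'No internet' into one 'No' group."""
    return "Yes" if value.strip().lower() == "yes" else "No"

# Rebuild the raw International plan column from the reported frequencies.
raw_plan = ["No"] * 429 + ["yes"] * 262 + ["Yes"] * 309
grp_ip = [recode_international_plan(v) for v in raw_plan]
counts = Counter(grp_ip)
print(counts)
```

After recoding, 571 of the 1000 customers fall in the Yes group, which matches the 57.1% quoted for the international plan.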
57.1% of the customers had an international plan. About a third of the
customers had technical support and 41% had video streaming.
There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:
The plot of the histogram showing the distribution of Term shows a
non-symmetric pattern with the highest frequency between 0 and 5 months
and lowest between 35 and 40 months.
This is quite different from the distribution of the total charges, which is right skewed, so that the mean is greater than the median. Here, the highest frequency is between 0 and 1000 and the lowest is between 8000 and 9000. The distribution appears to have a stepwise pattern (that is, smaller amounts have higher frequencies and larger amounts have lower frequencies).
The distribution of monthly charges is represented by the density plot. It shows a bimodal distribution at 2 points; the first approximately at 20 and the other (higher peak) approximately at 90. The lowest point on the plot corresponding to the lowest frequency is approximately at 40.
From the above density plot of monthly charges, it can be seen that the distribution is bimodal, with modes at 20 to 30 and 80 to 90. Therefore, this variable will be discretized for use in future models and algorithms.
The variable, monthly charges, ranges from 18.95 to 116.25 and was grouped into three categories:
less than 30: low charges
30 to 80 : moderate charges
greater than 80: high charges
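As a minimal sketch of this grouping rule in Python: the function name is hypothetical, and placing the boundary values 30 and 80 in the moderate group is an assumption based on the wording "30 to 80".

```python
def group_monthly_charges(charge):
    """Map a monthly charge to the Low / Moderate / High groups described above."""
    if charge < 30:
        return "Low"
    if charge <= 80:  # assumption: both 30 and 80 count as moderate
        return "Moderate"
    return "High"

# The two extremes of the observed range plus the two boundary values.
for charge in (18.95, 30.0, 80.0, 116.25):
    print(charge, group_monthly_charges(charge))
```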
The following table shows the frequency of the grouped variable, grp.month
| grp.month | Freq |
|---|---|
| High | 406 |
| Low | 223 |
| Moderate | 371 |
Pairwise associations between variables were assessed graphically under three scenarios: 2 categorical variables, 2 numerical variables, and one categorical and one numerical variable.
This was done to determine whether the response variable (churn, which is binary) is independent of the categorical variables. Categorical variables found to be independent of the response variable will be excluded from all subsequent models and algorithms. Mosaic plots are a convenient way to show whether two categorical variables are dependent. When they are independent, all proportions are the same and so the boxes line up in a grid.
From the above mosaic plots, it can be seen that sex, phone service,
voicemail plan, and multiple line appear to be independent of the
response variable, churn. This is because the proportion of churn cases
in the individual categories of these variables appear to be identical.
Churn is not independent of marital status and International plan. The
other mosaic plots are shown below:
In addition to marital status and International plan, Agreement period,
Internet service, monthly charges (grouped), technical support and
streaming videos are not independent of the response variable,
Churn.
A Pearson chi-square test was carried out to confirm the independence of Sex, Phone service, Voice mail plan and Multiple line from the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. Below are the chi-square p-values for each of the variables:
| Chisq.sex.p.value | Chisq.Phoneservice.p.value | Chisq.Voicemail.p.value | Chisq.multipleline.p.value |
|---|---|---|---|
| 0.1248683 | 0.3680155 | 0.6651237 | 0.3384263 |
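
For readers who want to see how such a p-value arises, the following is a minimal Python sketch of the Pearson chi-square test for a 2x2 table. The counts are hypothetical and are not taken from the churn data. The sketch also omits the continuity correction that some software applies to 2x2 tables by default, so it is not guaranteed to reproduce the p-values above.

```python
import math

def pearson_chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows are two categories of a predictor, columns are churn No / Yes.
hypothetical = [[60, 40], [50, 50]]
stat, p = pearson_chi_square_2x2(hypothetical)
print(round(stat, 4), round(p, 4))
```

A p-value above 0.05, as for all four variables in the table above, means the hypothesis of independence is not rejected.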
The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.
The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.
The stacked density curves on the main diagonal show the potential difference in the distribution of the underlying numeric variable between the Churn and non-Churn groups. In other words, they show the relationship between each numeric variable and the categorical response. These stacked density curves do not completely overlap, which indicates some association between each of these numeric variables and the binary response variable, Churn.
Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.
Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and Multiple line were dropped because they were found to be independent of the response variable, Churn.
International plan was also dropped and the new variable, grp.IP, was kept instead. grp.month was also kept in the dataset as an alternative to its numerical counterpart, monthly charges, for modelling. The number of variables in the final dataset was 11.
The following are the variables that will be used for subsequent
modelling. Marital_Status, Term,
Internet_service, tech_support,
stream_videos, Agreement_period,
Monthly_Charges, grp.month,
Total_Charges, grp.IP and
Churn
In building a logistic model for this analysis, it is necessary to check that its assumptions are satisfied. The assumptions of a logistic model are as follows:
The response variable must be binary. This is true for this data: the values of the response variable, churn, are yes and no.
The predictor variables are assumed not to be strongly correlated with one another. Since the primary aim of this analysis is to predict which customers are more likely to churn, rather than to understand the role of each predictor variable, no steps were taken to reduce multicollinearity.
The functional form of the predictor variables is correctly specified.
Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are Marital status, Streaming videos, technical support, and international plan. Agreement period and grp.month have 3 groups each while Internet service has 4 groups. All the character variables were changed to factors with different levels.
The numeric variables are Term, Monthly charges and Total charges. In total, there are 9 predictor variables (each model can only contain either the continuous monthly charges or the grouped variable).
The predictors for our model are:
Marital.status: Marital status of the customer - factor
with 2 levels
Term: Term (Displayed in months) - Numerical
variable
International.plan: International plan - factor with 2
levels
Internet.service: Internet service - factor with 4
levels
technical.support: Technical support - factor with 2
levels
streaming.videos: Streaming Videos - factor with 2
levels
Agreement_period: Agreement period - factor with 3
levels
Monthly_Charges: Monthly Charges - Numerical variable
OR grpd.month : Monthly charges - factor
with 3 levels
Total_Charges: Total charges - numerical variable
First, a logistic regression model that contains all predictor variables with monthly variable as numeric in the data set was built. This is called the first model.
| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.8334170 | 0.8265047 | -1.0083633 | 0.3132801 |
| marital.statusSingle | 0.5372165 | 0.3585663 | 1.4982347 | 0.1340723 |
| Term | -0.0306097 | 0.0135583 | -2.2576403 | 0.0239681 |
| technical.supportYes | -0.7406359 | 0.2139030 | -3.4624854 | 0.0005352 |
| internet.serviceDSL | -0.4170019 | 0.3358900 | -1.2414836 | 0.2144272 |
| internet.serviceFiber optic | -0.3456259 | 0.2248026 | -1.5374642 | 0.1241797 |
| internet.serviceNo Internet | -2.0864226 | 0.6082339 | -3.4302965 | 0.0006029 |
| streaming.videosYes | 0.0601511 | 0.2523427 | 0.2383705 | 0.8115937 |
| agreement.periodOne year contract | -1.5935134 | 0.3026089 | -5.2659178 | 0.0000001 |
| agreement.periodTwo year contract | -1.7055952 | 0.3929081 | -4.3409525 | 0.0000142 |
| International.planYes | 0.1558211 | 0.2008870 | 0.7756652 | 0.4379467 |
| Monthly_Charges | 0.0173013 | 0.0108412 | 1.5958830 | 0.1105149 |
| Total_Charges | 0.0001754 | 0.0001538 | 1.1406038 | 0.2540349 |
The AIC of the first model is 881.7286. The model contains 12 coefficient terms in addition to the intercept. Some of these terms were significant at the 0.05 level: Term (p=0.0239681), Technical support (p=0.0005352), Internet service-no internet (p=0.0006029), one year agreement period (p=0.0000001), and two year agreement period (p=0.0000142).
Then another model containing all the predictors, but this time with monthly charges as a factor, is built.
| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.5920773 | 0.3527946 | 1.6782492 | 0.0932985 |
| marital.statusSingle | 0.1919100 | 0.3146996 | 0.6098196 | 0.5419813 |
| Term | -0.0336493 | 0.0131886 | -2.5513906 | 0.0107294 |
| technical.supportYes | -0.7098222 | 0.2122457 | -3.3443414 | 0.0008248 |
| internet.serviceDSL | -0.6291153 | 0.2958855 | -2.1262121 | 0.0334856 |
| internet.serviceFiber optic | -0.3212021 | 0.2230693 | -1.4399211 | 0.1498897 |
| internet.serviceNo Internet | -3.0963883 | 0.7207580 | -4.2960167 | 0.0000174 |
| streaming.videosYes | 0.1840132 | 0.2325419 | 0.7913119 | 0.4287620 |
| agreement.periodOne year contract | -1.5746912 | 0.3024023 | -5.2072730 | 0.0000002 |
| agreement.periodTwo year contract | -1.6864130 | 0.3949791 | -4.2696260 | 0.0000196 |
| International.planYes | 0.2394340 | 0.2038002 | 1.1748465 | 0.2400561 |
| grpd.monthLow | 0.1892848 | 0.6708755 | 0.2821459 | 0.7778316 |
| grpd.monthModerate | -0.3335832 | 0.2777887 | -1.2008523 | 0.2298085 |
| Total_Charges | 0.0002338 | 0.0001424 | 1.6412233 | 0.1007511 |
The AIC of the second model is 883.9549. The model contains 13 coefficient terms in addition to the intercept. Here, the terms that were significant at the 0.05 level are: Term (p=0.0107294), Technical support (p=0.0008248), Internet service-DSL (p=0.0334856), Internet service-no internet (p=0.0000174), one year agreement period (p=0.0000002), and two year agreement period (p=0.0000196).
Compared on the basis of AIC, the first model had the lower value, 881.7286. This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling will be carried out with the first model.
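For reference, the AIC used in this comparison has the standard general definition below, where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood (these symbols are not from the original analysis). Smaller values indicate a better trade-off between fit and complexity.

\[ \mathrm{AIC} = 2k - 2\log\hat{L}, \qquad \Delta\mathrm{AIC} = 883.9549 - 881.7286 = 2.2263 \]

The difference of about 2.2 in favour of the first model is modest, so the preference for the numerical form of monthly charges is best read as a mild one.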
Based on results from other studies and analyses, the important variables that must be included in the model are agreement period, term and monthly charges. The reduced model is built with these three variables.
| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.8270051 | 0.2414150 | -7.567903 | 0.0000000 |
| Term | -0.0174912 | 0.0051167 | -3.418431 | 0.0006298 |
| agreement.periodOne year contract | -1.8564358 | 0.2919578 | -6.358575 | 0.0000000 |
| agreement.periodTwo year contract | -2.1559901 | 0.3681332 | -5.856549 | 0.0000000 |
| Monthly_Charges | 0.0266624 | 0.0034836 | 7.653632 | 0.0000000 |
The AIC of the reduced model is 896.8588. The model contains 4 coefficient terms in addition to the intercept. Here, all the terms are significant at the 0.05 level.
All the significant variables from the first model were added to the reduced model to build a fourth model. These are: Term, Technical support, Internet service, and agreement period. Since Term and agreement period were already present in the reduced model, just technical support and internet service were added from the first model.
| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.6846363 | 0.5267513 | -1.299734 | 0.1936923 |
| Term | -0.0176784 | 0.0052027 | -3.397891 | 0.0006791 |
| technical.supportYes | -0.7625008 | 0.2130108 | -3.579634 | 0.0003441 |
| internet.serviceDSL | -0.3204252 | 0.3014467 | -1.062958 | 0.2878010 |
| internet.serviceFiber optic | -0.3313126 | 0.2246921 | -1.474519 | 0.1403420 |
| internet.serviceNo Internet | -1.7390432 | 0.5341455 | -3.255748 | 0.0011309 |
| agreement.periodOne year contract | -1.5743930 | 0.2996931 | -5.253350 | 0.0000001 |
| agreement.periodTwo year contract | -1.6702441 | 0.3865429 | -4.320979 | 0.0000155 |
| Monthly_Charges | 0.0180282 | 0.0064770 | 2.783392 | 0.0053794 |
The AIC of the fourth model is 878.4816. The model contains 8 coefficient terms in addition to the intercept. The intercept and 2 of the internet service dummy variables (fiber optic and DSL) were not significant at the 0.05 significance level.
The next step is to use an automatic variable selection procedure to find the best model.
This is done with the automatic variable selection function, step(), which searches for the final model. Starting from the first model, variables are dropped using AIC as the inclusion/exclusion criterion.
| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.2265316 | 0.6170311 | -1.987795 | 0.0468343 |
| marital.statusSingle | 0.5806634 | 0.3403543 | 1.706055 | 0.0879977 |
| Term | -0.0181675 | 0.0052304 | -3.473470 | 0.0005138 |
| technical.supportYes | -0.7525514 | 0.2136209 | -3.522836 | 0.0004270 |
| internet.serviceDSL | -0.3298774 | 0.3039017 | -1.085474 | 0.2777118 |
| internet.serviceFiber optic | -0.3261646 | 0.2244343 | -1.453275 | 0.1461476 |
| internet.serviceNo Internet | -1.8273452 | 0.5366413 | -3.405152 | 0.0006613 |
| agreement.periodOne year contract | -1.5894411 | 0.3009554 | -5.281318 | 0.0000001 |
| agreement.periodTwo year contract | -1.6724833 | 0.3871723 | -4.319739 | 0.0000156 |
| Monthly_Charges | 0.0241694 | 0.0074372 | 3.249803 | 0.0011548 |
The best and final model is Model A, the model selected by step(), with the smallest AIC (877.5205) and 9 coefficient terms. Marital status and 2 of the internet service dummy variables (DSL and fiber optic) were not significant at the 0.05 significance level.
The summary table for Model A contains the important variables from the reduced model: Term, one year agreement period, two year agreement period and Monthly_Charges. All 4 were statistically significant at the 0.05 level (Term: p=0.0005138, one year agreement period: p=0.0000001, two year agreement period: p=0.0000156, Monthly_Charges: p=0.0011548). Term and the one year and two year agreement periods (each vs the monthly period) are negatively associated with the response variable, churn, while monthly charges is positively associated with it.
Fiber optic, DSL and No internet, when compared to the reference level, cable, all have negative coefficients. The estimated odds of churn for customers who use fiber optic (p=0.1461476) or DSL (p=0.2777118) are lower than for those who use cable, although neither difference is significant at the 0.05 level. Customers who do not have internet (p=0.0006613) have significantly lower odds of churn, and hence higher odds of remaining with the company, than those who use cable.
Single customers have higher estimated odds of churn than married customers, but this is not significant at the 0.05 level (p=0.0879977). The odds of churn for customers with a one year (p=0.0000001) or two year agreement period (p=0.0000156) are lower than for those whose agreement period is monthly, so customers on longer contracts are more likely to stay.
The odds of churn increase as monthly charges increase (p=0.0011548) and decrease as the term (time spent with the company, in months) increases (p=0.0005138). Customers who have technical support (p=0.0004270) have lower odds of churn, and hence higher odds of remaining with the company.
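Because the model is fitted on the log-odds of churn (Churn = yes), exponentiating a coefficient gives an odds ratio for churn. As a worked illustration using two of the Model A estimates (values rounded):

\[ e^{-1.5894} \approx 0.204, \qquad e^{0.0241694 \times 10} \approx 1.27 \]

Holding the other predictors fixed, a one year contract is therefore associated with roughly 80% lower odds of churn than a monthly contract, and a 10-unit increase in monthly charges with about 27% higher odds of churn.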
The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.
The predicted response is compared to the original response. This is shown in the following table.
| Mar.status | Term | Internet.service | Tech.support | Agr.period | Month.charges | churn | Predicted |
|---|---|---|---|---|---|---|---|
| Married | 16 | Cable | Yes | Monthly contract | 98.05 | Yes | Yes |
| Married | 70 | Cable | Yes | One year contract | 75.25 | No | No |
| Married | 36 | Cable | Yes | Monthly contract | 73.35 | No | No |
| Married | 72 | Cable | Yes | One year contract | 112.60 | No | No |
| Married | 40 | Cable | No | Monthly contract | 95.05 | No | Yes |
| Single | 15 | No Internet | No | Monthly contract | 19.85 | No | No |
| Married | 1 | Cable | No | Monthly contract | 89.20 | No | Yes |
| Married | 36 | Cable | No | Monthly contract | 94.65 | Yes | Yes |
| Married | 5 | Cable | No | Monthly contract | 97.10 | Yes | Yes |
| Married | 57 | Cable | Yes | One year contract | 113.25 | No | No |
The predicted response for the first 10 observations in the dataset agrees with the original response in 8 of the 10 cases.
The following tables show the frequency of the predicted response variable and of the original response variable. In the original variable, 74.1% of customers are still with the company, while the predicted response variable gives this as 77.7%. These proportions are close, which suggests that the model reproduces the overall churn rate reasonably well.

| Predicted Churn | Freq |
|---|---|
| No | 777 |
| Yes | 223 |

| Original Churn | Freq |
|---|---|
| No | 741 |
| Yes | 259 |
A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:
| Mar.status | Term | Tech.support | Internet.service | Agr.period | Monthly.charges | Predicted |
|---|---|---|---|---|---|---|
| Married | 38 | Yes | Cable | Monthly contract | 100.5 | No |
| Single | 50 | No | No Internet | Monthly contract | 87.6 | No |
| Single | 14 | No | No Internet | One year contract | 110.5 | No |
| Single | 4 | Yes | Cable | Monthly contract | 90.0 | Yes |
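
The predictions in this table can be checked by hand from the Model A coefficients. The following Python sketch does so under two assumptions: the Yes/No column is read as technical support and the Cable/No Internet column as internet service, and a cut-off of 0.5 is used to turn probabilities into labels. The function name is hypothetical.

```python
import math

# Model A coefficients copied from the summary table (log-odds of Churn = Yes).
COEF = {
    "(Intercept)": -1.2265316,
    "marital.statusSingle": 0.5806634,
    "Term": -0.0181675,
    "technical.supportYes": -0.7525514,
    "internet.serviceDSL": -0.3298774,
    "internet.serviceFiber optic": -0.3261646,
    "internet.serviceNo Internet": -1.8273452,
    "agreement.periodOne year contract": -1.5894411,
    "agreement.periodTwo year contract": -1.6724833,
    "Monthly_Charges": 0.0241694,
}

def churn_probability(marital, term, tech, internet, agreement, monthly):
    """Return the fitted probability of churn for one customer."""
    eta = COEF["(Intercept)"] + COEF["Term"] * term + COEF["Monthly_Charges"] * monthly
    # Reference levels (Married, No, Cable, Monthly contract) have no entry and add 0.
    eta += COEF.get("marital.status" + marital, 0.0)
    eta += COEF.get("technical.support" + tech, 0.0)
    eta += COEF.get("internet.service" + internet, 0.0)
    eta += COEF.get("agreement.period" + agreement, 0.0)
    return 1.0 / (1.0 + math.exp(-eta))

# The four hypothetical customers from the table above.
customers = [
    ("Married", 38, "Yes", "Cable", "Monthly contract", 100.5),
    ("Single", 50, "No", "No Internet", "Monthly contract", 87.6),
    ("Single", 14, "No", "No Internet", "One year contract", 110.5),
    ("Single", 4, "Yes", "Cable", "Monthly contract", 90.0),
]
CUTOFF = 0.5  # assumed classification threshold
predicted = ["Yes" if churn_probability(*c) > CUTOFF else "No" for c in customers]
print(predicted)
```

Under these assumptions the sketch gives the same four labels as the Predicted column.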
At the end of the logistic regression modelling, only the variables used in the final model were retained in the dataset. They are: marital status, term, technical support, internet service, agreement period, monthly charges and the response variable, churn.
Since the sample size is large, the data was split randomly 70%:30%, with 70% of the data used for training and validating models and 30% for testing. The total number of observations was 1000: 700 in the training data set and 300 in the testing data set.
The three best models built in the previous section were used: ModelA, the reduced model and the fourth model. Cross-validation was done on all 3 models using the training dataset. This involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set).
The model was fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:
| test.accuracy |
|---|
| 0.79 |
The accuracy on the holdout data was found to be 79%, compared with the 74.1% of customers in the full data set who did not churn. This suggests that the model generalizes to unseen data without obvious over-fitting.
The regression coefficients obtained by fitting the reduced model on the training data were used to obtain the accuracy on the test data. The results are as follows:

| test.accuracy3 |
|---|
| 0.7833333 |

| test.accuracy2 |
|---|
| 0.79 |
A sequence of 20 candidate cut-off probabilities and 5-fold cross-validation were used to identify the optimal cut-off probability for each model.
5-fold CV performance plot
ModelA was the model generated from automatic variable selection. The optimal cut-off probability that yields the best accuracy for ModelA is 0.52. The reduced model was made up of term, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for this model is 0.57. The fourth model was made up of term, technical support, internet service, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for the fourth model is 0.52.
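The cut-off search described above can be sketched as follows. This is a simplified Python illustration, not the original procedure: it assumes the predicted probabilities are already out-of-fold predictions (in a full implementation the model would be refitted on the other four folds each time), the helper names are hypothetical, and the toy probabilities and labels are invented.

```python
def accuracy_at_cutoff(probabilities, labels, cutoff):
    """Share of observations classified correctly when predicting churn above the cutoff."""
    correct = sum((p > cutoff) == bool(y) for p, y in zip(probabilities, labels))
    return correct / len(labels)

def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k validation folds of nearly equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cv_accuracy(probabilities, labels, cutoff, k=5):
    """Average validation accuracy over the k folds for one candidate cutoff."""
    per_fold = [
        accuracy_at_cutoff([probabilities[i] for i in fold],
                           [labels[i] for i in fold], cutoff)
        for fold in k_fold_indices(len(labels), k)
    ]
    return sum(per_fold) / k

def best_cutoff(probabilities, labels, n_cutoffs=20, k=5):
    """Return the first candidate cutoff with the highest cross-validated accuracy."""
    candidates = [(i + 1) / (n_cutoffs + 1) for i in range(n_cutoffs)]
    return max(candidates, key=lambda c: cv_accuracy(probabilities, labels, c, k))

# Invented out-of-fold probabilities and true labels (1 = churn), for illustration only.
probs = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
chosen = best_cutoff(probs, labels)
print(chosen, cv_accuracy(probs, labels, chosen))
```

When several cut-offs tie on accuracy, this sketch keeps the smallest one; other tie-breaking rules, such as averaging the tied values, are equally defensible.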
The ROC curve is the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR), calculated at each decision boundary, such as a cut-off probability. In order to create an ROC curve for each model, a sequence of decision thresholds is needed, and the corresponding sensitivity and specificity were calculated for each model. The interval (0, 1) was split into 20 subintervals, and specificity and sensitivity were calculated at each of these cut-offs. Below is the plot of the ROC curve (1-specificity, sensitivity).
The above ROC curves plot pairs of the true positive rate vs. the false positive rate over the decision thresholds of the three models: modelA, the reduced model and the fourth model. The true positive rate is the proportion of actual positives that are predicted to be positive, while the false positive rate is the proportion of actual negatives that are predicted to be positive.
From the plot, it can be seen that the AUC for modelA, 0.8542, is the highest, which indicates good discriminative ability. The curve also lies close to the top-left corner, which is a further indication of good performance.
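The ROC construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: cut-offs are taken at the 21 endpoints of 20 equal subintervals of [0, 1], the AUC is approximated with the trapezoidal rule, and the scores and labels are invented rather than taken from the churn models.

```python
def roc_points(scores, labels, n_intervals=20):
    """Return sorted (FPR, TPR) pairs for equally spaced cut-offs between 0 and 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0), (1.0, 1.0)}  # endpoints of every ROC curve
    for k in range(n_intervals + 1):
        cutoff = k / n_intervals
        tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
        points.add((fp / negatives, tp / positives))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented scores: the first set separates the classes perfectly, the second not at all.
separated = roc_points([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9], [0, 0, 0, 0, 1, 1, 1, 1])
uninformative = roc_points([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
print(auc(separated), auc(uninformative))
```

An AUC near 1 corresponds to a curve hugging the top-left corner, while an AUC of 0.5 corresponds to the diagonal of random guessing.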
Of the 3 models evaluated above using cross-validation and performance measures, ModelA had the best performance metrics and accuracy (79%). It fits the data well. Below is the table showing the local performance metrics:
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.4473684 | 0.90625 | 0.6181818 | 0.4473684 | 0.519084 |
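
The F1 score in this table is the harmonic mean of precision and recall (standard definition). Substituting the reported values, rounded to four decimals, reproduces it:

\[ F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \times 0.6182 \times 0.4474}{0.6182 + 0.4474} \approx 0.519 \]

Recall and sensitivity are the same quantity, which is why both columns show 0.4473684.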
Below is the optimal cut-off probability for the final model.
5-fold CV performance plot
Neural network models require all feature variables to be in numeric form. Categorical variables are converted to dummy variables and numerical variables are scaled.
First, the 3 numeric variables to be included in this model are scaled according to this formula: \[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var})-\min(\text{orig.var})} \]
All feature variables were then extracted and the categorical variables converted to dummy variables using model.matrix().
Finally, the variables were renamed using simpler names.
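The two preprocessing steps can be sketched in Python as follows. This is only an illustration of the idea: the helper names are hypothetical, the dummy coding imitates reference-level coding with cable as the reference, and the numeric values are the reported extremes of monthly charges plus their midpoint.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def dummy_columns(values, reference):
    """Build one 0/1 indicator column per level other than the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Reported minimum and maximum of monthly charges, plus their midpoint.
monthly = [18.95, 67.60, 116.25]
scaled = min_max_scale(monthly)
print(scaled)

internet = ["Cable", "DSL", "Fiber optic", "No Internet"]
dummies = dummy_columns(internet, reference="Cable")
print(dummies)
```

After this step every input lies between 0 and 1, so no single variable dominates the network weights purely because of its units.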
The following is the model formula:
Churn ~ maritalstat + Term + DSL + fiber + nointernet + techsupport + streamvideos + oneyear + twoyears + monthlycharges + totalcharges + IP
This follows the usual steps for building a neural network model to predict customer churn. First, the data was split into two: Training and testing data. Cross-validation was done with the training data and the model tested with the testing data.
The data was split into 70% for training the neural network and 30% for testing. There were 700 observations in the training dataset and 300 observations in the testing data.
Below is the neural network model obtained using the training data:
| Quantity | Value |
|---|---|
| error | 47.3790324 |
| reached.threshold | 0.0096885 |
| steps | 1175.0000000 |
| Intercept.to.1layhid1 | -1.1157300 |
| maritalstat.to.1layhid1 | 0.4172192 |
| Term.to.1layhid1 | -6.3052454 |
| DSL.to.1layhid1 | -0.2939402 |
| fiber.to.1layhid1 | -0.2356842 |
| nointernet.to.1layhid1 | -2.8342897 |
| techsupport.to.1layhid1 | -0.8026092 |
| streamvideos.to.1layhid1 | 0.3203401 |
| oneyear.to.1layhid1 | -1.6723427 |
| twoyears.to.1layhid1 | -2.4868972 |
| monthlycharges.to.1layhid1 | 1.0020602 |
| totalcharges.to.1layhid1 | 6.2390948 |
| IP.to.1layhid1 | 0.0823311 |
| Intercept.to.Churn | 0.0433270 |
| 1layhid1.to.Churn | 1.6424948 |
Single-layer backpropagation Neural network model for Customer Churn
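To make the fitted weights concrete, the following Python sketch runs a forward pass through a network with one hidden unit using the reported values. Two points are assumptions rather than facts from the source: the hidden unit is taken to use a logistic activation, and the output node is taken to be linear. The customer profiles are hypothetical and expressed on the scaled inputs.

```python
import math

def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

# Weights copied from the fitted-network table (inputs -> hidden unit -> output).
HIDDEN_BIAS = -1.1157300
HIDDEN_WEIGHTS = {
    "maritalstat": 0.4172192, "Term": -6.3052454, "DSL": -0.2939402,
    "fiber": -0.2356842, "nointernet": -2.8342897, "techsupport": -0.8026092,
    "streamvideos": 0.3203401, "oneyear": -1.6723427, "twoyears": -2.4868972,
    "monthlycharges": 1.0020602, "totalcharges": 6.2390948, "IP": 0.0823311,
}
OUTPUT_BIAS = 0.0433270
OUTPUT_WEIGHT = 1.6424948

def network_output(features):
    """Forward pass; features maps input names to scaled or 0/1 values (missing = 0)."""
    hidden_input = HIDDEN_BIAS + sum(
        weight * features.get(name, 0.0) for name, weight in HIDDEN_WEIGHTS.items()
    )
    hidden = logistic(hidden_input)  # assumed logistic hidden activation
    return OUTPUT_BIAS + OUTPUT_WEIGHT * hidden  # assumed linear output node

# Hypothetical customers that differ only in contract type.
monthly_contract = {"Term": 0.2, "monthlycharges": 0.8, "totalcharges": 0.15}
two_year_contract = dict(monthly_contract, twoyears=1.0)
print(network_output(monthly_contract), network_output(two_year_contract))
```

The negative weight on twoyears lowers the hidden unit's activation, so the two year profile receives the lower churn score, in line with the logistic regression results.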
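The following table shows the full logistic regression model refitted with the scaled numeric variables:

| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |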
|---|---|---|---|---|
| (Intercept) | -0.5047 | 0.6383 | -0.7907 | 0.4291 |
| Marital_StatusSingle | 0.5372 | 0.3586 | 1.498 | 0.1341 |
| Term | -2.204 | 0.9762 | -2.258 | 0.02397 |
| Internet_serviceDSL | -0.417 | 0.3359 | -1.241 | 0.2144 |
| Internet_serviceFiber optic | -0.3456 | 0.2248 | -1.537 | 0.1242 |
| Internet_serviceNo Internet | -2.086 | 0.6082 | -3.43 | 0.0006029 |
| tech_supportYes | -0.7406 | 0.2139 | -3.462 | 0.0005352 |
| stream_videosYes | 0.06015 | 0.2523 | 0.2384 | 0.8116 |
| Agreement_periodOne year contract | -1.594 | 0.3026 | -5.266 | 1.395e-07 |
| Agreement_periodTwo year contract | -1.706 | 0.3929 | -4.341 | 1.419e-05 |
| Monthly_Charges | 1.683 | 1.055 | 1.596 | 0.1105 |
| Total_Charges | 1.486 | 1.303 | 1.141 | 0.254 |
| grp.IPYes | 0.1558 | 0.2009 | 0.7757 | 0.4379 |
The cross validation in the neural network was carried out with the
training dataset.
The optimal cut off probability obtained for the neural network model
was 0.43.
The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.
| | 0 | 1 |
|---|---|---|
| FALSE | 186 | 33 |
| TRUE | 32 | 49 |

| Test accuracy |
|---|
| 0.7833333 |
The accuracy was found to be 78.33%.
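The accuracy can be reproduced directly from the confusion matrix above, as in this short Python sketch. The accuracy does not depend on how the table is oriented. The sensitivity and specificity lines, however, assume that the rows (FALSE/TRUE) are the predictions and the columns (0/1) are the actual classes, which the source does not state.

```python
# Counts copied from the confusion matrix above: rows FALSE/TRUE, columns 0/1.
table = {("FALSE", 0): 186, ("FALSE", 1): 33, ("TRUE", 0): 32, ("TRUE", 1): 49}

total = sum(table.values())
# Cells where the two classifications agree: (FALSE, 0) and (TRUE, 1).
accuracy = (table[("FALSE", 0)] + table[("TRUE", 1)]) / total

# Assumption: rows are predictions (TRUE = predicted churn), columns are actual classes.
sensitivity = table[("TRUE", 1)] / (table[("TRUE", 1)] + table[("FALSE", 1)])
specificity = table[("FALSE", 0)] / (table[("FALSE", 0)] + table[("TRUE", 0)])
print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```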
The ROC curve is the plot of sensitivity against 1 - specificity calculated from the confusion matrix based on a sequence of selected cut-off scores.
An ROC curve is shown for the above neural network model based on the training data set.
ROC Curve of the neural network model
The above ROC curve indicates that the neural network performs better than random guessing, since the area under the curve is well above 0.5. Since the AUC is also greater than 0.65, the neural network model is acceptable.
Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.
The neural network model had a test accuracy of 78.33% and an AUC of 0.8587, while the final logistic regression model had a test accuracy of 79% and an AUC of 0.8542. Both models are good and acceptable, and it is difficult to determine which model is better based solely on this information.
In this section, a decision tree was used as a predictive model to draw conclusions about customer churn data. The main goal of this model is to predict the value of a response variable based on several input variables. The Decision Tree (DT) algorithm is based on conditional probabilities and it generates rules.
A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes.
The decision tree is appropriate for this data because it is easy to interpret, very flexible and also insensitive to underlying relationships between attributes. This means that if there are 2 variables in this dataset that are highly correlated, the algorithm will only choose one of the variables to split on.
To build the different decision tree models, the data set was first split randomly into two data sets, training and testing. There are 700 observations in the training dataset and 300 in the testing dataset. Cross-validation analysis will be done and the optimal cut-off score calculated using the training dataset. ROC analysis will also be carried out and the best decision tree, the one with the largest AUC, will be identified.
Here, a wrapper is written so that different decision trees can be built conveniently.
Using the function tree.builder = function(in.data, fp, fn, purity), six different decision tree models were defined:
Model 1: gini.tree.11 is based on the Gini index, without penalizing false positives or false negatives.
Model 2: info.tree.11 is based on entropy, without penalizing false positives or false negatives.
Model 3: gini.tree.110 is based on the Gini index, with the cost of a false negative set to 10 times the cost of a false positive.
Model 4: info.tree.110 is based on entropy, with the cost of a false negative set to 10 times the cost of a false positive.
Model 5: gini.tree.101 is based on the Gini index, with the cost of a false positive set to 10 times the cost of a false negative.
Model 6: info.tree.101 is based on entropy, with the cost of a false positive set to 10 times the cost of a false negative.
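The two purity measures that distinguish these models can be written down in a few lines. The Python sketch below uses the standard definitions of the Gini index and entropy. It is not the tree.builder code itself, and it uses the overall class proportions reported earlier (74.1% stayed, 25.9% churned) purely as an example node.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Example node with the overall class proportions of the data set.
node = [0.741, 0.259]
print(round(gini(node), 4), round(entropy(node), 4))
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed. A tree chooses the split that reduces the chosen measure the most, and the false positive and false negative costs then shift which class each leaf predicts.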
The tree diagrams of the first 4 decision models are given below.
Non-penalized decision tree models using Gini index (left) and entropy (right).
Penalized decision tree models using Gini index (left) and entropy (right).
ROC analysis was then used to select the best among all models. The function SenSpe = function(in.data, fp, fn, purity) is defined and used to build the 6 different trees and plot their corresponding ROC curves, so that the global performance of these tree algorithms can be seen and compared. In addition to the input data, this function takes 3 arguments: the false positive cost, the false negative cost and the purity measure.
This shows the ROC curves with their corresponding AUCs (Area under the curve) for individual decision tree models.
Comparison of ROC curves
The above ROC curves represent the various decision trees and their corresponding AUCs. The model info.1.10 has the largest AUC, 0.85, and its curve extends farthest towards the upper left corner. Therefore, it is considered the best of the decision trees.
The optimal cut-off score is needed for reporting the predictive
performance of the final model with the test data. The optimal cut-off
determination through cross-validation was based on the training data
set. The function
Optm.cutoff = function(in.data, fp, fn, purity) was first
created and then used to calculate the optimal cut-off for the 6
decision trees shown earlier above.
Plot of optimal cut-off determination
In the above figure, there are multiple optimal cut-offs in each plot. Therefore, the final cut-off for the best model is taken as the average of the multiple cut-offs for that model. The average cut-off for the best model, gini 1.10, is 0.4475.
At the end of the decision tree modelling, the best decision tree was identified with the optimal cut-off score. The following is the diagram of the best decision tree and the cut off score.
Best decision tree model using Gini index (left) and optimal cut-off (right).