1 Introduction

This data set contains customer level information for a telecommunication company. Each customer has a unique set of characteristics relating to the services they have used.

The telecommunications sector is growing quickly, and service providers are increasingly focused on expanding their subscriber bases. In such a competitive industry, retaining existing customers has become one of the biggest challenges. Acquiring a new customer is widely reported to cost significantly more than keeping an existing one.

Therefore, it is crucial for the telecommunications sector to employ advanced analytics to understand customer behavior and predict whether customers are likely to leave the business.

1.1 Objectives of this study

The following questions and objectives describe the main purpose of this project.

  1. Predict which customers are more likely to churn.
  2. What percentage of the company's customers have churned?
  3. Are there any notable patterns in customer churn based on gender and marital status?
  4. Are there any notable patterns in customer churn based on the amount spent by the customer and the type of service provided?

1.2 Response variable

The response variable is Churn which is a binary variable with two values, yes and no. The value yes means that a customer left the company while no means that a customer is still active.

1.3 Model

Three models will be built to examine how the predictor variables affect the probability of churn, and the best of them will be used to analyse the relationship between the binary response variable, Churn, and the predictor variables. The three models are: Logistic Regression, Neural Network and Decision Tree.

2 Description of the Data

The total number of records in this data set is 1000. It consists of 14 variables including the response variable with the name Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:

  Sex: Sex of the customer - Categorical var
  Marital_status: Marital status of the customer - Categorical var
  Term: Term (displayed in months) - Numerical var
  Phone_service: Phone service - Categorical var
  international_plan: International plan - Categorical var
  Voice_mail_plan: Voice mail plan - Categorical var
  Multiple_line: Multiple line - Categorical var
  Internet_service: Internet service - Categorical var
  Technical_support: Technical support - Categorical var
  Streaming_videos: Streaming Videos - Categorical var
  Agreement_period: Agreement period - Categorical var
  Monthly_charges: Monthly Charges - Numerical var
  Total charges: Total Charges - Numerical var
  Churn: Churn (Yes or No)

A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt

3 EDA for Feature Engineering

The entire data set was scanned to determine the Exploratory Data Analysis (EDA) tools to use for feature engineering. All the numerical and categorical variables were examined closely and there were no missing values found.

3.1 Missing Values

The above summary table indicates that there are no missing values in any of the variables.

3.2 Assess Distributions

Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.

3.2.1 Categorical variables

The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.

From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.

44.4% of the customers have multiple lines. Under the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use Fiber Optic and 20.8% have none. The majority are on a monthly contract. For this data, 74.1% of customers had not left the company.

3.2.2 Regrouping of categorical variables

One of the categorical variables, International Plan, had 3 groups: No, Yes and yes, with counts of 429, 309 and 262 respectively. The separate Yes and yes groups are the result of an input error made when the data was collected.

In order to rectify this, a new variable called grp.IP was created that contains only 2 distinct groups of the International plan variable: No and Yes.

Technical support and streaming videos each had 3 groups: Yes, No and No internet. Because No and No internet are close in meaning, they were combined into a single group.

57.1% of the customers had an international plan. About a third of the customers had technical support and 41% had video streaming.
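
The regrouping described above can be sketched in a few lines. This is a minimal Python illustration, not the code used for the analysis (which appears to have been done in R); the list `raw_plan` is a hypothetical stand-in whose counts mirror those reported for International Plan, and the function names are made up for this example.

```python
from collections import Counter

# Hypothetical raw labels; only the counts (429, 309, 262) come from the report.
raw_plan = ["No"] * 429 + ["Yes"] * 309 + ["yes"] * 262

def regroup_plan(label):
    """Collapse case variants such as 'yes' and 'Yes' into a single level."""
    return label.strip().capitalize()

def regroup_no_internet(label):
    """Merge 'No internet' into 'No'; any other label is treated as 'Yes'."""
    return "No" if label.strip().lower() in {"no", "no internet"} else "Yes"

grp_ip = Counter(regroup_plan(x) for x in raw_plan)
share_yes = grp_ip["Yes"] / sum(grp_ip.values())
print(grp_ip, round(100 * share_yes, 1))
```

After regrouping, the Yes group holds 309 + 262 = 571 customers, which is where the 57.1% figure comes from.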

3.3 Numerical Variables

There are 3 numerical variables and they are: Term, monthly charges and Total charges. Their distributions are as follows:

The plot of the histogram showing the distribution of Term shows a non-symmetric pattern with the highest frequency between 0 and 5 months and lowest between 35 and 40 months.

This is quite different from the distribution of the total charges, which is right skewed, implying that the mean is greater than the median. Here, the highest frequency is between 0 and 1000 and the lowest is between 8000 and 9000. The distribution appears to have a stepwise pattern (that is, smaller amounts have higher frequency and larger amounts have lower frequency).

The distribution of monthly charges is represented by the density plot. It shows a bimodal distribution at 2 points; the first approximately at 20 and the other (higher peak) approximately at 90. The lowest point on the plot corresponding to the lowest frequency is approximately at 40.

3.4 Discretizing Continuous Variables

From the above density plot of monthly charges, it can be seen that the distribution is bimodal, with peaks at 20 to 30 and 80 to 90. Therefore, this variable will be discretized for future models and algorithms. The variable, monthly charges, ranges from 18.95 to 116.25 and is grouped as follows:

  less than 30: low charges
  30 to 80: moderate charges
  greater than 80: high charges

The following table shows the frequency of the grouped variable, grp.month

Monthly Charges Frequency
High 406
Low 223
Moderate 371
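
The grouping rule for monthly charges can be written as a small function. This is a hedged Python sketch rather than the original code; the function name is hypothetical, and treating the boundaries 30 and 80 as belonging to the moderate group is an assumption based on the wording "30 to 80".

```python
def group_monthly_charges(charge):
    """Map a monthly charge to the Low / Moderate / High groups.

    Assumption: exactly 30 and exactly 80 both fall in 'Moderate'.
    """
    if charge < 30:
        return "Low"
    if charge <= 80:
        return "Moderate"
    return "High"

# Illustrative values only: the reported minimum and maximum plus two boundaries.
examples = [18.95, 30.0, 80.0, 116.25]
print([group_monthly_charges(c) for c in examples])
```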

3.5 Pairwise Associations

Pairwise associations between two variables were assessed graphically based on three scenarios which were: 2 categorical variables, 2 numerical variables, one categorical and one numerical variable.

3.5.1 Two categorical variables

This was done to determine whether the response is independent of the categorical variables. Variables found to be independent of the response will be excluded from the subsequent models. Mosaic plots are a convenient way to show whether two categorical variables are dependent. When they are independent, all proportions are the same and the boxes line up in a grid.

From the above mosaic plots, it can be seen that sex, phone service, voicemail plan, and multiple line appear to be independent of the response variable, churn. This is because the proportion of churn cases appears to be nearly identical across the individual categories of these variables.

In addition to marital status and International plan, Agreement period, Internet service, monthly charges (grouped), technical support and streaming videos are not independent of the response variable, Churn.

3.5.2 Pearson Chi-Square Test

A Pearson chi-square test was carried out to check the independence of Sex, Phone service, voice mail plan and multiple line from the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. The chi-square p-values for each of the variables are given below:

Chisq.sex.p.value Chisq.Phoneservice.p.value Chisq.Voicemail.p.value Chisq.multipleline.p.value
0.1248683 0.3680155 0.6651237 0.3384263
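
For readers who want to see how such a p-value arises, the following is a minimal Python sketch of the Pearson chi-square test for a 2x2 table. It is not the code behind the table above: the counts are toy values chosen for illustration, and no continuity correction is applied (R's `chisq.test()` applies Yates' correction to 2x2 tables by default, so its p-values can differ slightly).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table, without continuity correction.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy counts (hypothetical): rows = two categories, columns = Churn No / Yes.
dependent = [[10, 20], [20, 10]]
independent = [[20, 30], [40, 60]]
stat_dep, p_dep = chi_square_2x2(dependent)
stat_ind, p_ind = chi_square_2x2(independent)
print(stat_dep, p_dep, stat_ind, p_ind)
```

A p-value above 0.05, as for all four variables in the table, means the test does not reject independence at that level.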

3.5.3 Two Numerical Variables

The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.

The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.

The stacked density curves on the main diagonal show the potential difference in the distribution of the underlying numeric variable between the Churn and non-Churn groups. In other words, they show the relationship between a numeric variable and a categorical variable. These stacked density curves do not overlap completely, indicating some association between each of these numeric variables and the binary response variable, Churn.

Because of the above interpretation between numeric variables and the binary Churn variable, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.

3.6 KMeans Clustering

This clustering algorithm divides a set of n observations into k clusters. In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set.

A subset of the original data is created containing only the 3 numeric variables: term, monthly charges and total charges. This subset is then scaled.

3.6.1 Heat Map Representation of Clusters

Heatmap representation of potential clusters

The above heatmap indicates that different clusters exist in this data (based on the three numerical variables).

3.6.2 Determination of the Optimal Number of Clusters

The elbow method is used to find the optimal number of clusters, using the fviz_nbclust() function to create a plot of the number of clusters vs. the total within sum of squares. In this plot there appears to be a bit of an elbow or “bend” at k = 5 clusters, so this was taken as the optimal number of clusters.
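
The quantity plotted in the elbow method, the total within-cluster sum of squares, can be computed with a short k-means routine. The sketch below is a plain Python version of Lloyd's algorithm on toy two-dimensional points; it is not the `fviz_nbclust()` workflow used in the report, the data are hypothetical, and the deterministic initialisation (evenly spaced rows) is a simplification of the random restarts a real implementation would use.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, total within-cluster sum of squares)."""
    # Simplified deterministic start: k evenly spaced rows of the data.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, centroids are already up to date
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wss

# Toy data (hypothetical): two well-separated groups of four points each.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wss_by_k = {k: kmeans(toy, k)[1] for k in range(1, 5)}
print(wss_by_k)
```

On this toy data the sum of squares drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the "bend" the elbow method looks for.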

3.6.3 K-Means Clustering with Optimal K

k-means clustering is performed on the dataset using the optimal value for k of 5:

The 5 clusters contained 204, 252, 191, 221 and 132 customers (clusters 1 to 5 respectively).

3.6.4 Scatter plot of the clusters

The clusters are visualized on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:
Scatter Plot of Final cluster results

At the end of the exploratory data analysis, the new cluster variable will be added into the original dataset.

3.7 Conclusion

Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, Phone service, Voicemail plan and multiple line were dropped because they were found to be independent of the response variable, Churn.

International plan was also dropped and the new variable, grp.IP, was kept instead. grp.month will also be kept in the dataset as an alternative to its numerical counterpart, monthly charges, for modelling. The cluster variable was also added. The number of variables in the final dataset was 12.

The following are the variables that will be used for subsequent modelling. Marital_Status, Term, Internet_service, tech_support, stream_videos, Agreement_period, Monthly_Charges, grp.month, Total_Charges, grp.IP, cluster, and Churn

4 Logistic Predictive Modelling

4.1 Assumptions

In building a logistic model for this analysis, it is necessary to check that its assumptions are satisfied. The first assumption is that the response variable is binary. This is true for this data: the values of the response variable, churn, are yes and no. The second is that the predictor variables are uncorrelated. Since the primary aim of this analysis is to predict which customers are more likely to churn, rather than to understand the role of each predictor variable, there is no need to reduce severe multicollinearity. The third is that the functional form of the predictor variables is correctly specified.

4.2 Model building

Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are Marital status, Streaming videos, technical support, and international plan. Agreement period and grp.month have 3 groups each while Internet service has 4 groups. All the character variables were changed to factors with different levels.

The numeric variables are Term, Monthly charges and Total charges. The cluster variable has 5 values, which signify the 5 different groups (1-5). It was also changed to a factor with 5 levels. In total, there are 10 predictor variables (each model can only contain either the continuous monthly charges or the grouped variable).

First, a logistic regression model that contains all predictor variables with monthly charges variable as numeric in the data set was built. This is called the first model.

Significance Tests for the First Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.3607015 0.9453676 -2.4971254 0.0125205
marital.statusSingle -0.1758372 0.3828222 -0.4593183 0.6460056
Term -0.0341104 0.0188862 -1.8061068 0.0709017
technical.supportYes -0.7053120 0.2225359 -3.1694304 0.0015274
internet.serviceDSL -0.5606394 0.3616395 -1.5502715 0.1210764
internet.serviceFiber optic -0.3355105 0.2291304 -1.4642778 0.1431181
internet.serviceNo Internet -1.9911186 0.6167162 -3.2285818 0.0012441
streaming.videosYes -0.0514545 0.2608122 -0.1972855 0.8436041
agreement.periodOne year contract -1.8023818 0.3242817 -5.5580753 0.0000000
agreement.periodTwo year contract -1.9417424 0.4175161 -4.6507007 0.0000033
International.planYes 0.1843095 0.2074822 0.8883150 0.3743713
cluster2 -0.2465339 0.3868904 -0.6372190 0.5239822
cluster3 1.1873077 0.7592222 1.5638475 0.1178534
cluster4 1.5778116 0.4738521 3.3297551 0.0008692
cluster5 2.6631985 0.8390290 3.1741436 0.0015028
Monthly_Charges 0.0383224 0.0121460 3.1551530 0.0016041
Total_Charges 0.0000022 0.0002110 0.0104330 0.9916758

The AIC of the first model is 859.1497. It is made up of 16 terms in addition to the intercept. The terms significant at the .05 level were technical support, no internet, one-year agreement, two-year agreement, cluster 4, cluster 5 and monthly charges; the intercept was also significant. The one-year and two-year agreements were significantly different from the monthly contract (the reference level). Clusters 4 and 5 were significantly different from cluster 1, and no internet was significantly different from cable.

Then another model containing all the predictors, but this time with monthly charges as a factor, was built.

The AIC of the second model is 868.3467. It was made up of 17 variables. Here, the variables that were significant at the .05 level are: Term, Technical support, Internet service-DSL, Internet service-no internet, one year agreement period, two year agreement, cluster 4 and cluster 5.

Compared on AIC, the first model is lower (859.1497 versus 868.3467). This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling will be carried out with the first model. Based on results from other studies and analyses, agreement period, term and monthly charges are important variables that must be included in the model. The reduced model is built with these three variables.

Significance Tests for Reduced Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8270051 0.2414150 -7.567903 0.0000000
Term -0.0174912 0.0051167 -3.418431 0.0006298
agreement.periodOne year contract -1.8564358 0.2919578 -6.358575 0.0000000
agreement.periodTwo year contract -2.1559901 0.3681332 -5.856549 0.0000000
Monthly_Charges 0.0266624 0.0034836 7.653632 0.0000000

The AIC of the reduced model is 896.8588. It is made up of 4 terms, all of which are significant at the .05 level. Next, the variables that were significant in the first model (technical support, internet service, agreement period, cluster and monthly charges) were combined with the reduced model to build a fourth model. Since monthly charges and agreement period were already present in the reduced model, only technical support, internet service and cluster were added.
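
To make the reduced-model table concrete, its coefficients can be turned into a predicted churn probability through the logistic function. The Python sketch below copies the estimates from that table; the function name and the example customer are hypothetical, and it assumes that the model predicts the probability of Churn = Yes with the monthly contract as the reference level.

```python
import math

# Estimates copied from the reduced-model table.
COEF = {
    "intercept": -1.8270051,
    "term": -0.0174912,
    "one_year": -1.8564358,
    "two_year": -2.1559901,
    "monthly_charges": 0.0266624,
}

def churn_probability(term, agreement, monthly_charges):
    """Predicted probability of churn.

    agreement is 'monthly', 'one_year' or 'two_year'; 'monthly' is assumed
    to be the reference level and therefore adds nothing to the logit.
    """
    logit = (COEF["intercept"]
             + COEF["term"] * term
             + COEF["monthly_charges"] * monthly_charges)
    if agreement != "monthly":
        logit += COEF[agreement]
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical customer: 16 months, charges of 98.05, under two contract types.
p_monthly = churn_probability(16, "monthly", 98.05)
p_two_year = churn_probability(16, "two_year", 98.05)
print(round(p_monthly, 3), round(p_two_year, 3))
```

The negative contract coefficients mean that, for otherwise identical customers, a longer agreement lowers the predicted churn probability.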

The AIC of the fourth model was 852.3914. It was made up of 12 terms. Two dummy variables for internet service (fiber optic and DSL) and two for cluster (2 and 3) were not significant at the .05 significance level. The results are not shown here because they are identical to those of the next model below. The next step is to use an automatic variable selection procedure to find the best model.
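
The AIC comparison across the four models can be summarised in a few lines. This is a small Python illustration using only the AIC values reported above; the dictionary keys are informal labels, not object names from the original analysis.

```python
# AIC values as reported in this section (lower is better).
aic = {
    "first model": 859.1497,
    "second model": 868.3467,
    "reduced model": 896.8588,
    "fourth model": 852.3914,
}

best = min(aic, key=aic.get)
# Difference of each model's AIC from the best one.
delta = {name: round(value - aic[best], 4) for name, value in aic.items()}
print(best, delta)
```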

4.2.1 Automatic Variable Selection

This is done using the automatic variable selection function, step(), to search for the final model. Starting from the first model, variables are dropped using AIC as the inclusion/exclusion criterion.

Summary Table of Significance Tests for Model A
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2249631 0.7230624 -3.0771384 0.0020900
Term -0.0346433 0.0126062 -2.7481259 0.0059937
technical.supportYes -0.7045058 0.2208271 -3.1903049 0.0014212
internet.serviceDSL -0.5500562 0.3356312 -1.6388709 0.1012401
internet.serviceFiber optic -0.3118063 0.2279093 -1.3681160 0.1712758
internet.serviceNo Internet -1.9455052 0.5565853 -3.4954308 0.0004733
agreement.periodOne year contract -1.8075414 0.3242020 -5.5753562 0.0000000
agreement.periodTwo year contract -1.9237553 0.4128618 -4.6595626 0.0000032
cluster2 -0.2987356 0.3767943 -0.7928348 0.4278741
cluster3 1.0961043 0.7008642 1.5639326 0.1178334
cluster4 1.4185471 0.4051955 3.5008950 0.0004637
cluster5 2.4506129 0.7801954 3.1410247 0.0016836
Monthly_Charges 0.0382299 0.0086272 4.4312985 0.0000094

This model is the same as the fourth model, with an AIC of 852.3914 and 12 terms. Two dummy variables for internet service (fiber optic and DSL) and two for cluster (2 and 3) were again not significant at the .05 significance level.

4.3 Prediction Analysis

The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.

4.3.1 Predict already existing data

The predicted response is compared to the original response. This is shown in the following table.

Dataset with model predicted response
Mar.status Term Int.service Tech.support Agr.period Month.charges cluster churn Predicted
Married 16 Cable Yes Monthly contract 98.05 1 Yes Yes
Married 70 Cable Yes One year contract 75.25 3 No No
Married 36 Cable Yes Monthly contract 73.35 2 No No
Married 72 Cable Yes One year contract 112.60 3 No No

The predicted response for the first 4 observations in the dataset matches the original response.

The frequencies of the original and predicted response variables were then compared. In the original variable, 74.1% of customers are still with the company, while the predicted response variable gives this as 77.4%. These proportions are close, so the model was considered acceptable.

4.3.2 Predict New Data

A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:

Predicted Values of New Data
Term Tech.support Internet.service Agr.period Monthly.charges cluster Predicted
38 Yes Cable Monthly contract 100.5 3 Yes
50 No No Internet Monthly contract 87.6 1 No
14 No No Internet One year contract 110.5 2 No
4 Yes Cable Monthly contract 90.0 1 Yes

5 Cross Validation and Performance Measures

5.1 Data Partition

Since the sample size is large, the data was split randomly 70%:30%, with 70% of the data for training and validating models and 30% for testing purposes. The total number of observations was 1000. The number in the training data set was 700 while the number in the testing data set was 300.
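
A random 70%:30% partition of this kind can be sketched as follows. This is an illustrative Python version, not the original code; the function name and the seed value are arbitrary assumptions, and only the sizes (1000 rows, 700 and 300) come from the report.

```python
import random

def train_test_split(n_rows, train_fraction=0.7, seed=123):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split(1000)
print(len(train_idx), len(test_idx))
```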

5.2 Cross-Validation

The three best models built in the previous section were used: ModelA, the reduced model and the first model. Cross-validation was done on all 3 models using the training dataset. This involves partitioning the data into complementary subsets, fitting the model on one subset (called the training set), and validating it on the other subset (called the validation set).
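
The partitioning step of k-fold cross-validation can be illustrated with a short generator. This is a hedged Python sketch with a hypothetical function name; it only produces the index sets for 5 folds over 700 training rows and leaves model fitting aside.

```python
def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(list(range(700)), k=5))
print([(len(tr), len(va)) for tr, va in splits])
```

Each row is used for validation exactly once and for fitting in the other four rounds.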

5.2.1 Cross-validation and Optimal Cut Off For All Models

5-fold CV performance plot

ModelA was the model generated from automatic variable selection. The optimal cut-off probability that yields the best accuracy for ModelA is 0.52. The reduced model was made up of term, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for this model is 0.57. The fourth model was made up of term, technical support, internet service, agreement period and monthly charges. The optimal cut-off probability that yields the best accuracy for the fourth model is 0.52.
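
The search for an optimal cut-off probability can be sketched as a scan over candidate cut-offs, keeping the one with the highest accuracy. The Python example below is a simplification: it uses a handful of hypothetical predicted probabilities and a single data set, whereas the analysis above averaged accuracy over 5 cross-validation folds.

```python
def accuracy_at(cutoff, probs, labels):
    """Share of cases where (prob >= cutoff) agrees with the 0/1 label."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def best_cutoff(probs, labels, grid=None):
    """Scan cut-offs 0.01..0.99 and return (cutoff, accuracy) with the best accuracy.

    Ties are resolved in favour of the smallest cut-off.
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(((c, accuracy_at(c, probs, labels)) for c in grid),
               key=lambda pair: pair[1])

# Hypothetical predicted probabilities and true labels (1 = churn).
toy_probs = [0.1, 0.3, 0.45, 0.6, 0.8]
toy_labels = [0, 0, 1, 1, 1]
cutoff, acc = best_cutoff(toy_probs, toy_labels)
print(cutoff, acc, accuracy_at(0.5, toy_probs, toy_labels))
```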

The models were fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:

Model Test Accuracy
ModelA 0.7933333
Reduced Model 0.7566667
First Model 0.7800000

The regression coefficients obtained by fitting the models on the training data were used to compute accuracy on the test data. The accuracy was found to be 79.33% for ModelA, 75.67% for the reduced model and 78.0% for the first model, so ModelA had the best test accuracy. Its test accuracy is close to its training accuracy (79.14%), so no underfitting or overfitting was seen.

5.3 Global Measure: ROC and AUC

Below is the plot of the ROC curve (1-specificity, sensitivity).

From the plot, it can be seen that ModelA has the highest AUC, 0.8639, which is evidence of good discriminative ability. Its curve also lies close to the top-left corner, which is a further indication of good performance.
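
The AUC has a direct interpretation: it is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The Python sketch below computes it that way on hypothetical scores; it is meant to illustrate the definition, not to reproduce the ROC code used for the plot.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: the first two cases churned, the last two did not.
toy_auc = auc([0.8, 0.4, 0.6, 0.2], [1, 1, 0, 0])
print(toy_auc)
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, so 0.8639 means that about 86% of churner/non-churner pairs are ordered correctly.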

5.4 The Best Logistic Model

Of the 3 logistic models evaluated above using cross-validation and performance measures, ModelA had the best test accuracy (79.33%) and the highest AUC (0.8639). It fits the data well. The table below shows its local performance metrics:

Local performance metrics for the Best Logistic Model
sensitivity specificity precision recall F1
0.6931818 0.8349057 0.6354167 0.6931818 0.6630435

Below is the optimal cut-off probability for the best logistic model.

5-fold CV performance plot

At the end of the cross validation and testing, a subset of the main dataset containing the variables included in ModelA was created and ready to be used to build subsequent models.

6 Neural Network Model

6.1 Feature Conversion and Scaling for Neural Network

Neural network models require all feature variables to be in the numeric form. Categorical variables will be converted to dummy variables using model.matrix() and numerical variables are scaled.
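
The two preprocessing steps can be illustrated in plain Python. This is only a sketch of the idea: `dummy_encode` mimics treatment coding with the first level as reference (the default behaviour of `model.matrix()` when an intercept is present), and `standardize` uses z-scores, which is an assumption because the report does not say which scaling was applied.

```python
import statistics

def dummy_encode(values, levels):
    """Treatment coding: one 0/1 column per level except the first (reference)."""
    return [[1 if v == level else 0 for level in levels[1:]] for v in values]

def standardize(values):
    """Centre to mean 0 and scale to sample standard deviation 1 (assumed method)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Level names follow the internet service variable; the rows are hypothetical.
levels = ["Cable", "DSL", "Fiber optic", "No Internet"]
encoded = dummy_encode(["Cable", "DSL", "No Internet"], levels)
scaled = standardize([1.0, 2.0, 3.0])
print(encoded, scaled)
```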

6.2 Training and Testing NN Model

This follows the usual steps for building a neural network model to predict customer churn. First, the data was split into training (70%) and testing (30%) sets. Cross-validation was done with the training data, and the model was then tested with the testing data.

6.2.1 Neural network Model Building and Plot

The plot of a single layer neural network model of customer churn is shown.

6.2.2 Cross-validation in Neural Network

The cross validation in the neural network was carried out with the training dataset. The optimal cut off probability obtained for the neural network model was 0.48 and the training accuracy was 0.8214.

6.2.3 Testing Model Performance

The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.

Confusion Matrix
0 1
FALSE 184 32
TRUE 28 56
Test Accuracy of Neural Network
Test Accuracy
0.8

The test accuracy was found to be 80%.
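
The accuracy and the local performance metrics follow directly from the four cells of the confusion matrix. The Python sketch below uses the counts shown above and assumes that rows are predictions (FALSE/TRUE) and columns are the actual class (0 = stayed, 1 = churned); under that reading the results agree with the neural network column reported in Section 8.3.

```python
# Cell counts from the confusion matrix above.
# Assumed layout: rows = predicted (FALSE, TRUE), columns = actual (0, 1).
tn, fn = 184, 32   # predicted FALSE
fp, tp = 28, 56    # predicted TRUE

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)        # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, sensitivity, specificity, precision, f1)
```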

6.3 ROC Analysis

An ROC curve is shown for the above neural network model based on the training data set.

ROC Curve of the neural network model

The above ROC curve indicates that the underlying neural network is better than the random guess since the area under the curve is significantly greater than 0.5. Also, since the AUC is greater than 0.65, the neural network model is acceptable. The AUC was 0.9039.

6.4 Comparison of predictive performance between the logistic model and the neural network model

Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.

The neural network model had a test accuracy of 80% and an AUC of 0.9039, while the best logistic regression model had a test accuracy of 79.33% and an AUC of 0.8639. Both models are acceptable, and the neural network is the better of the two on both measures, although the difference in test accuracy is small.

7 Decision Tree Algorithm

7.1 Description of the Algorithm

In this section, a decision tree was used as a predictive model to draw conclusions about customer churn data. The main goal of this model is to predict the value of a response variable based on several input variables. A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. The decision tree is appropriate for this data because it is easy to interpret, very flexible and also insensitive to underlying relationships between attributes. This means that if there are 2 variables in this dataset that are highly correlated, the algorithm will only choose one of the variables to split on.

To build the different decision tree models, the data set was first split randomly into a training and a testing dataset. There are 700 observations in the training dataset and 300 in the testing dataset. Cross-validation analysis will be done and the optimal cut-off score calculated using the training dataset. ROC analysis will also be carried out and the decision tree with the largest AUC will be identified as the best.

7.2 Tree Models

Here, a wrapper is written so that 6 different decision trees can be built conveniently.

The tree diagrams of two decision models are given below.

Penalized decision tree models using Gini index (left) and entropy (right).
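
The two impurity measures named in the caption can be computed for a binary node as follows. This Python sketch is illustrative only; the 25.9% churn share used in the example is the overall proportion implied by the report (74.1% had not left), not the value at any particular node of the fitted trees.

```python
import math

def gini(p_yes):
    """Gini index of a binary node with churn proportion p_yes."""
    return 1 - p_yes ** 2 - (1 - p_yes) ** 2

def entropy(p_yes):
    """Entropy (in bits) of a binary node; terms with zero proportion are skipped."""
    return -sum(p * math.log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

# Impurity of a node holding the overall churn share of 25.9%.
root_gini = gini(0.259)
root_entropy = entropy(0.259)
print(round(root_gini, 4), round(root_entropy, 4))
```

Both measures are 0 for a pure node and largest for a 50:50 node; a split is chosen to reduce the weighted impurity of the child nodes as much as possible.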

7.3 ROC for Model Selection

ROC analysis was then used to select the best among all models. The function SenSpe = function(in.data, fp, fn, purity) is defined and used to build 6 different trees and plot their corresponding ROC curves so that the global performance of these tree algorithms can be seen and compared. Besides the input data, this function has 3 arguments: false positive, false negative and purity.

7.3.1 ROC Curves for the Different Tree Models

The ROC curves represent the various decision trees and their corresponding AUC values. The model gini.1.10 has the largest AUC, 0.864, and its curve extends farthest towards the upper-left corner. Therefore, it is considered the best of the decision trees.

Comparison of ROC curves

7.4 Optimal Cut-off Score Determination

The optimal cut-off determination through cross-validation was based on the training data set. The function Optm.cutoff = function(in.data, fp, fn, purity) was first created and then used to calculate the optimal cut-off for the first 4 decision trees shown above.

Plot of optimal cut-off determination

In the above figure, there are multiple cut-offs for each plot. Therefore, the final cut-off for the best model will be the average of the multiple cut-offs for that particular model. The average cut-off for the best model, gini.1.10, is 0.4475.

7.5 Discussions and Conclusions

At the end of the decision tree modelling, the best decision tree was identified with the optimal cut-off score. The following is the diagram of the best decision tree and the cut off score.

Best decision tree model (left) and optimal cut-off (right).

The optimal cut-off was 0.4475 and the training accuracy was 0.8071.

7.5.1 Best Decision Tree with Test Accuracy

Test Accuracy of Best Decision Tree
Test Accuracy
0.7633333

The accuracy for the best decision tree was 0.7633.

8 Final Model

At the end of this analysis, the following models were built to analyse customer churn: Logistic Regression, Neural Network Model, and Decision Trees. These three models were compared using their ROC curves, test accuracy and local performance metrics.

8.1 ROC Curve Comparison

Below is the plot of the ROC curves for these 3 models and the respective AUC values.

ROC Curve of all Models

From the above plot, the Decision Tree model appears to be the best of the three models based on the area under the curve, which was 0.9153.

8.2 Model Test Accuracy

Below is a table showing the type of model, training accuracy, test accuracy and AUC value.

Comparison Between All The Models
Model Training Accuracy Test Accuracy AUC
Logistic Regression 0.7914 0.7933 0.8639
Neural Network 0.8214 0.8000 0.9039
Decision Tree 0.8071 0.7633 0.9153

8.3 Local Performance Metrics For All Models

The following performance metrics were computed for all 3 models (Logistic regression, Neural network and Decision Tree): Specificity, Sensitivity, Recall, Precision and F1. The results are as follows:

Local Performance Metrics For All Models
Logistic regression Neural Network Decision Tree
sensitivity 0.6931818 0.6363636 0.7500000
specificity 0.8349057 0.8679245 0.8254717
precision 0.6354167 0.6666667 0.6407767
recall 0.6931818 0.6363636 0.7500000
F1 0.6630435 0.6511628 0.6910995

8.4 The Best Model

In this dataset, identifying the customers that are leaving the company is the ultimate goal. For this data, positive values are customers that are likely to churn. Recall (sensitivity) measures the proportion of actual churners that the model detects, while precision measures the proportion of predicted churners that actually churn. When the positive class is smaller and correctly detecting positive cases is the main focus (as in this study), precision and recall are the most relevant performance metrics. The decision tree had the highest recall (0.750) and F1 score (0.691), while the neural network had slightly higher precision (0.667 versus 0.641). On this basis, the decision tree was selected as the best model.

9 Discussion

Churn prediction identifies customers that are likely to leave or remain with a company. The response variable, churn, is a binary variable with values of yes and no, and it indicates whether a customer will churn.

9.1 Model Implementation

After model building, the next thing is to use it to achieve the expected goal. Using the diagram of the best decision tree above, it is easy to show how this model can be used to predict customers that are likely to churn. A decision tree is a type of flow-chart that simplifies the decision making process by breaking down the different available paths of actions and their potential outcomes.

Key points:
  Agreement period: 0 - Monthly; 1 - One Year; 2 - Two Years
  Internet service: 0 - Cable; 1 - DSL; 2 - Fiber optic; 3 - No Internet
  Tech support: 0 - No; 1 - Yes

For example, if a customer’s agreement period is not equal to one or two years, the tree follows the right branch and arrives at internet service. If the customer has no internet, the tree leads to a node that represents Term. If the customer’s term is greater than or equal to 14 months, the decision tree follows the right branch, which leads to a leaf node predicting that the customer is likely to churn. On the other hand, if the customer’s internet service is cable, DSL or Fiber optic, the decision tree follows the left branch and gets to Term. If the customer’s term is less than 11 months, the tree follows the left branch to a leaf node predicting that the customer is not likely to churn.
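
The two paths described above can be written as explicit rules. The Python sketch below encodes only those two paths, using the numeric codes from the key points; the function name is hypothetical, and every branch the text does not describe returns None rather than a guessed prediction.

```python
def tree_prediction(agreement, internet, term):
    """Follow the two tree paths described in the text.

    agreement: 0 = Monthly, 1 = One Year, 2 = Two Years
    internet:  0 = Cable, 1 = DSL, 2 = Fiber optic, 3 = No Internet
    Returns 'churn', 'no churn', or None for paths the text does not describe.
    """
    if agreement != 0:
        return None  # one- and two-year contracts: path not described
    if internet == 3:
        # No internet: a term of 14 months or more leads to a churn leaf.
        return "churn" if term >= 14 else None
    # Cable, DSL or Fiber optic: a term under 11 months leads to a no-churn leaf.
    return "no churn" if term < 11 else None

print(tree_prediction(0, 3, 20), tree_prediction(0, 1, 5), tree_prediction(2, 0, 30))
```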

9.2 Limitations

Using a decision tree to predict the likelihood of a customer leaving the company has some limitations. One is that a decision tree provides less information on the relationship between the predictors and the response. Therefore it will be difficult for this model to explain the relationships between the response, Churn and any of the predictors: term, monthly charges, agreement period and so on. Another limitation is that the probabilities provided by this model are only estimates which are prone to error.

9.3 Recommendations

From the model, customers with no internet, a term longer than 14 months and lower monthly charges (about 80 and lower) appear more likely to churn. Therefore, it is recommended that both new and old customers be encouraged to use internet services. Additionally, in order to keep customers who have been with the business for 12 months or longer, the business needs to pay closer attention to them and promptly address any issues they may have.

Customers who pay smaller monthly fees should also receive the same treatment from the business as those paying higher amounts. It is possible that high-paying customers tend to stay because they get better service, are treated with more respect, and earn bonuses for their subscriptions, while customers who pay less each month do not get the same level of treatment.

10 Conclusion

Advanced analytics are crucial for telecommunication businesses that want to predict whether or not customers will leave the company. With such predictions, they will be better able to recognize the different customer groups that exist and know how to communicate with each group.

At the end of this study, various factors have been analysed and the best model put forward. The proper way to use this model has been discussed, as well as its limitations. From this model, key factors that lead to customer churn have been identified and various recommendations suggested in order to prevent customer churn.