Before fitting the model, 10 observations with an NA value in TotalCharges were removed. Additionally, the extra service feature variables had the factor levels No internet service and No phone service folded into the No category. These levels are redundant: the information is already contained in other variables, and including them causes issues when estimating the model parameters.
The first model used to predict customer churn is a logistic regression model. The data were split into a testing set consisting of approximately 20% of the data (1196 observations) and a training set consisting of the remaining 80% (4780 observations).
Variable selection was performed using the stepAIC() function from the MASS library. One hundred iterations of a cross-validated 80-20 split were performed for both the full model and the model using variable selection. Training accuracy results are as follows:
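The cleaning steps described above can be sketched in a few lines. This is only an illustration in plain Python, not the code used for the analysis; the three rows are made-up examples, and only the column names and level names come from the text.

```python
# Hypothetical rows for illustration; only the column and level names
# are taken from the analysis.
rows = [
    {"TotalCharges": 29.85, "OnlineSecurity": "No", "MultipleLines": "No phone service"},
    {"TotalCharges": None, "OnlineSecurity": "Yes", "MultipleLines": "No"},
    {"TotalCharges": 108.15, "OnlineSecurity": "No internet service", "MultipleLines": "Yes"},
]

REDUNDANT_LEVELS = {"No internet service", "No phone service"}


def clean(rows):
    """Drop rows with a missing TotalCharges and fold redundant levels into "No"."""
    cleaned = []
    for row in rows:
        if row["TotalCharges"] is None:
            continue  # remove observations with a missing value
        cleaned.append({
            key: "No" if value in REDUNDANT_LEVELS else value
            for key, value in row.items()
        })
    return cleaned


cleaned_rows = clean(rows)
```

After cleaning, each service variable has only the levels Yes and No, so it no longer duplicates information already carried by the internet and phone service variables.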
| Model | Training Accuracy | Standard Deviation |
|---|---|---|
| Full | 62.1% | 1.34% |
| Selected | 79.85% | 1.07% |
The selected model is clearly more accurate than the full model (79.85% versus 62.1%), and it is used for the rest of this analysis.
One of the advantages of the logistic regression model is that the cutoff probability can be selected based on a cost function. For example, if the company were more concerned about reaching customers who are about to churn, the cutoff probability could be adjusted downwards. The ROC curve demonstrates the relationship between sensitivity (the correct identification of churned customers) and specificity (the correct identification of non-churned customers) as the probability cutoff is adjusted.
For this model the cutoff is 53%. This means that customers for whom the model calculates a churn probability of 53% or greater are classified as “Churned”.
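The repeated 80-20 procedure behind the table above can be sketched as follows. This is a hedged illustration in plain Python, not the original code: `repeated_holdout` and `fit_majority` are hypothetical names, the labels are made up, and a majority-class baseline stands in for the logistic regression. Scoring accuracy on the 20% portion of each split is also an assumption about how the reported figures were computed.

```python
import random
import statistics


def repeated_holdout(labels, fit, n_iter=100, test_frac=0.2, seed=0):
    """Return the mean and standard deviation of accuracy over random splits.

    `fit` takes the training indices and returns a function that predicts
    the label for a given observation index.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_test = round(test_frac * len(labels))
    accuracies = []
    for _ in range(n_iter):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predict = fit(train_idx)
        correct = sum(predict(i) == labels[i] for i in test_idx)
        accuracies.append(correct / n_test)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical churn labels (1 = churned) and a stand-in baseline model.
labels = [1] * 27 + [0] * 73


def fit_majority(train_idx):
    n_churn = sum(labels[i] for i in train_idx)
    majority = 1 if 2 * n_churn > len(train_idx) else 0
    return lambda i: majority


mean_acc, sd_acc = repeated_holdout(labels, fit_majority)
```

Reporting the standard deviation alongside the mean, as the table does, shows how much the accuracy depends on which observations land in each split.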
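To make the idea of a cost-based cutoff concrete, here is a minimal sketch in plain Python. The probabilities, labels, and cost values are invented for illustration, and `expected_cost` and `best_cutoff` are hypothetical helper names; the analysis itself does not specify a cost function.

```python
def expected_cost(probs, labels, cutoff, cost_fn, cost_fp):
    """Total cost when customers with prob >= cutoff are classified as churned."""
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted_churn = p >= cutoff
        if y == 1 and not predicted_churn:
            cost += cost_fn  # a churner the model missed
        elif y == 0 and predicted_churn:
            cost += cost_fp  # a loyal customer flagged unnecessarily
    return cost


def best_cutoff(probs, labels, cost_fn, cost_fp):
    """Search cutoffs 0.01, 0.02, ..., 0.99 and return the cheapest one."""
    candidates = [c / 100 for c in range(1, 100)]
    return min(candidates,
               key=lambda c: expected_cost(probs, labels, c, cost_fn, cost_fp))


# Made-up predicted probabilities and true labels (1 = churned).
probs = [0.9, 0.7, 0.45, 0.35, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Missing a churner is assumed to be five times as costly as a false alarm...
cutoff_when_misses_hurt = best_cutoff(probs, labels, cost_fn=5, cost_fp=1)
# ...and the reverse assumption, for comparison.
cutoff_when_alarms_hurt = best_cutoff(probs, labels, cost_fn=1, cost_fp=5)
```

When missed churners are the more expensive error, the search settles on a lower cutoff than when false alarms are, which is the adjustment described above.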
| Accuracy | Sensitivity | Specificity |
|---|---|---|
| 82.61% | 56.61% | 91.12% |
As the table shows, the model correctly classifies over 4 in 5 customers, but it identifies only a little over half of the churned customers. We can trade accuracy and specificity for increased sensitivity by lowering the probability cutoff.
Lowering the cutoff to a churn probability of 40% or more lets the model identify nearly 70% of the churned customers correctly:
| Accuracy | Sensitivity | Specificity |
|---|---|---|
| 79.1% | 69.83% | 82.13% |
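The three metrics in these tables follow directly from the confusion matrix at a given cutoff. The sketch below, in plain Python with invented probabilities and labels, shows the calculation and the trade-off at the two cutoffs discussed above; `classification_metrics` is a hypothetical helper, not part of the original analysis.

```python
def classification_metrics(probs, labels, cutoff):
    """Accuracy, sensitivity, and specificity when prob >= cutoff means churned."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1  # churner correctly identified
        elif predicted == 1 and y == 0:
            fp += 1  # non-churner flagged as churned
        elif predicted == 0 and y == 0:
            tn += 1  # non-churner correctly identified
        else:
            fn += 1  # churner missed
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up predicted probabilities and true labels (1 = churned).
probs = [0.80, 0.60, 0.45, 0.30, 0.50, 0.42, 0.20, 0.10, 0.05, 0.35]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

at_53 = classification_metrics(probs, labels, cutoff=0.53)
at_40 = classification_metrics(probs, labels, cutoff=0.40)
```

On this toy data, lowering the cutoff from 0.53 to 0.40 raises sensitivity while lowering specificity and accuracy, the same direction of change seen in the two tables.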
Variable importance is roughly estimated by calculating the absolute t-score associated with each estimated coefficient. Additionally, the sign of each coefficient is stored to determine whether, according to the model, the associated variable reduces or increases customer churn.
| Variable | Importance Score | Effect |
|---|---|---|
| tenure | 8.48 | Reduces Churn |
| InternetServiceNo | 7.42 | Reduces Churn |
| ContractTwo year | 6.71 | Reduces Churn |
| InternetServiceFiber optic | 6.57 | Increases Churn |
| ContractOne year | 4.76 | Reduces Churn |
| MonthlyCharges | 4.71 | Reduces Churn |
| StreamingMoviesYes | 4.45 | Increases Churn |
| TotalCharges | 4.42 | Increases Churn |
| StreamingTVYes | 4.33 | Increases Churn |
| PaperlessBillingYes | 3.65 | Increases Churn |
| MultipleLinesYes | 3.41 | Increases Churn |
| PaymentMethodElectronic check | 3.23 | Increases Churn |
| OnlineSecurityYes | 2.73 | Reduces Churn |
| SeniorCitizen | 2.14 | Increases Churn |
| TechSupportYes | 2.01 | Reduces Churn |
| PaymentMethodCredit card (automatic) | 0.67 | Reduces Churn |
| PaymentMethodMailed check | 0.22 | Reduces Churn |
These results indicate that the strongest variables are tenure, contract type, and type of internet service. Long tenure and long contracts reduce the probability of churn, while using fiber optic internet tends to increase the probability of churn.
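The importance calculation described above can be sketched as follows: each score is the absolute value of a coefficient divided by its standard error, and the sign of the coefficient gives the direction of the effect. This is an illustration in plain Python; the coefficient and standard error values are invented, not taken from the fitted model, and the sketch assumes that "Churned" is the positive class.

```python
def variable_importance(coefficients, standard_errors):
    """Rank variables by |coefficient / standard error|, largest first."""
    rows = []
    for name, beta in coefficients.items():
        score = abs(beta / standard_errors[name])
        # Assumes a positive coefficient raises the probability of churn.
        effect = "Increases Churn" if beta > 0 else "Reduces Churn"
        rows.append((name, round(score, 2), effect))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented values for illustration only.
coefficients = {
    "SeniorCitizen": 0.2,
    "tenure": -0.05,
    "InternetServiceFiber optic": 0.8,
}
standard_errors = {
    "SeniorCitizen": 0.1,
    "tenure": 0.01,
    "InternetServiceFiber optic": 0.2,
}

ranked = variable_importance(coefficients, standard_errors)
```

Because the score divides by the standard error, a small coefficient such as the one for tenure can still rank highly when it is estimated precisely.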
Next, a random forest model was fitted to the data. The training error rate was calculated for values of the tuning parameter mtry from 1 through 10, using a forest size of 500 trees. The lowest training error rate was found at mtry = 2.
Random forests are not as interpretable as logistic regression; however, a measure of variable importance can be calculated. Here it seems that the random forest model focused mostly on tenure and the charges a customer has paid.
The test results of the random forest are below. The model performed slightly worse than logistic regression, and it is also less usable and less interpretable.
| Accuracy | Sensitivity | Specificity |
|---|---|---|
| 80.85% | 50.51% | 90.79% |
Results for a K-nearest neighbors model are listed below. The model does not perform well and has little interpretability. K = 14 was selected as the value of the tuning parameter with the highest training accuracy.
Testing results are as follows:
| Accuracy | Sensitivity | Specificity |
|---|---|---|
| 78.76% | 43.73% | 90.23% |
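For readers unfamiliar with how K is tuned, the following is a minimal sketch of K-nearest neighbors with a simple search over candidate values of K. It is written in plain Python on invented two-dimensional points; the function names are hypothetical, and scoring each K on a separate validation set is an assumption, since the text does not say exactly how the training accuracy was measured.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], point))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Share of evaluation points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


def select_k(train_x, train_y, valid_x, valid_y, candidates):
    """Return the candidate K with the highest validation accuracy."""
    return max(candidates,
               key=lambda k: accuracy(train_x, train_y, valid_x, valid_y, k))


# Invented, well-separated points (label 1 = churned).
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]
valid_x = [(2, 2), (9, 9)]
valid_y = [0, 1]

chosen_k = select_k(train_x, train_y, valid_x, valid_y, candidates=[1, 3, 5])
```

Scoring on held-out points matters here: with K = 1, every training point is its own nearest neighbor, so accuracy measured on the training data itself would favor the smallest K regardless of how well it generalizes.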