Data Cleaning

Before fitting the model, 10 observations with an NA value in TotalCharges were removed. Additionally, for the extra-service feature variables, the factor levels No internet service and No phone service were folded into the No category. These levels are redundant: the information is already contained in other variables, and including them causes issues when estimating the model parameters.
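The two cleaning steps can be sketched as follows. This is a minimal Python illustration, not the code used for this analysis; the example rows and the choice of OnlineSecurity and MultipleLines as the service columns are hypothetical.

```python
def clean(rows, service_columns):
    """Drop rows with a missing TotalCharges and fold redundant levels into 'No'."""
    redundant = {"No internet service", "No phone service"}
    cleaned = []
    for row in rows:
        if row["TotalCharges"] is None:
            continue  # remove observations with a missing TotalCharges
        fixed = dict(row)  # copy so the input rows are left untouched
        for col in service_columns:
            if fixed[col] in redundant:
                fixed[col] = "No"
        cleaned.append(fixed)
    return cleaned


# Hypothetical rows; None stands in for NA.
rows = [
    {"TotalCharges": 29.85, "OnlineSecurity": "No internet service", "MultipleLines": "No phone service"},
    {"TotalCharges": None, "OnlineSecurity": "Yes", "MultipleLines": "No"},
    {"TotalCharges": 1889.5, "OnlineSecurity": "Yes", "MultipleLines": "Yes"},
]
cleaned = clean(rows, ["OnlineSecurity", "MultipleLines"])
print(len(cleaned), cleaned[0]["OnlineSecurity"], cleaned[0]["MultipleLines"])
```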

Logistic Regression

The first model used to predict customer churn is a logistic regression model. The data were split into a testing set of approximately 20% of the data (1196 observations) and a training set of approximately 80% of the data (4780 observations).
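A random 80-20 split of this size can be sketched as below. This is an illustrative Python sketch; the total of 5976 observations is inferred from the two set sizes above, and the function name and seed are assumptions.

```python
import random


def split_indices(n, seed=0):
    """Shuffle row indices and return (train, test) with 80% of rows in train."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = n * 4 // 5  # integer arithmetic: floor of 80% of n
    return indices[:n_train], indices[n_train:]


# 5976 = 4780 + 1196, inferred from the set sizes stated in the text.
train_idx, test_idx = split_indices(5976)
print(len(train_idx), len(test_idx))  # 4780 1196
```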

Variable selection was performed using the stepAIC() function from the MASS library. One hundred iterations of a cross-validated 80-20 split were performed for both the full model and the model using variable selection. Training accuracy results are as follows:

Model      Training Accuracy   Standard Deviation
Full       62.1%               1.34%
Selected   79.85%              1.07%

The selected model achieves a much higher mean accuracy than the full model (79.85% versus 62.1%) with a slightly lower standard deviation, so it is used for the rest of this analysis.
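The repeated 80-20 evaluation can be sketched as follows. This Python sketch assumes each iteration scores accuracy on the held-out 20%, which the text does not state explicitly. The labels and the majority-class rule are hypothetical stand-ins for fitting and scoring a real model.

```python
import random
import statistics


def repeated_holdout(n, evaluate, reps=100, seed=0):
    """Run `reps` random 80-20 splits and return (mean, standard deviation) of the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(reps):
        idx = list(range(n))
        rng.shuffle(idx)
        n_train = n * 4 // 5
        scores.append(evaluate(idx[:n_train], idx[n_train:]))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical labels: one in four customers churned (1 = churned).
labels = [1 if i % 4 == 0 else 0 for i in range(200)]


def majority_accuracy(train_idx, test_idx):
    """Stand-in model: predict the majority class of the training rows."""
    ones = sum(labels[i] for i in train_idx)
    prediction = 1 if ones * 2 > len(train_idx) else 0
    return sum(labels[i] == prediction for i in test_idx) / len(test_idx)


mean_acc, sd_acc = repeated_holdout(len(labels), majority_accuracy)
print(f"mean accuracy {mean_acc:.4f}, standard deviation {sd_acc:.4f}")
```

Passing a different `evaluate` callable for each candidate model would give one row of the table per model.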

One of the advantages of the logistic regression model is that the cutoff probability can be selected based on a cost function. For example, if the company were more concerned about reaching customers who are about to churn, the cutoff probability could be adjusted downwards. The ROC curve demonstrates the relationship between sensitivity (the correct identification of churned customers) and specificity (the correct identification of non-churned customers) as the probability cutoff is adjusted.
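The trade-off traced by the ROC curve can be illustrated by computing sensitivity and specificity at several cutoffs. The predicted probabilities and labels in this Python sketch are hypothetical, and the function name is an assumption.

```python
def roc_points(probabilities, labels, cutoffs):
    """Return (cutoff, sensitivity, specificity) per cutoff; label 1 means churned."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        predicted = [p >= cutoff for p in probabilities]
        tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
        tn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 0)
        points.append((cutoff, tp / positives, tn / negatives))
    return points


# Hypothetical churn probabilities for four churned and four retained customers.
probs = [0.9, 0.7, 0.45, 0.3, 0.5, 0.35, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
points = roc_points(probs, labels, [0.25, 0.4, 0.53])
for cutoff, sensitivity, specificity in points:
    print(cutoff, sensitivity, specificity)
```

In this toy data, raising the cutoff from 0.25 to 0.53 lowers sensitivity from 1.0 to 0.5 while raising specificity from 0.5 to 1.0.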

For this model the cutoff is 53%. This means that customers for whom the model calculates a churn probability of 53% or greater are classified as “Churned”.

Accuracy   Sensitivity   Specificity
82.61%     56.61%        91.12%

The model correctly classifies over 4 in 5 customers, but it identifies only a little over half of the churned customers. We can trade accuracy and specificity for increased sensitivity by lowering the probability cutoff.

With the cutoff lowered to a churn probability of 40% or more, the model identifies nearly 70% of the churned customers correctly:

Accuracy   Sensitivity   Specificity
79.1%      69.83%        82.13%
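The three metrics follow directly from confusion-matrix counts. The counts in this Python sketch are not reported in the analysis. They are back-calculated from the percentages at the two cutoffs, assuming the 1196-observation test set contains 295 churned and 901 retained customers, and should be read as an illustration only.

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of churned customers identified
        "specificity": tn / (tn + fp),  # share of retained customers identified
    }


# Assumed counts, inferred from the reported percentages (not source data).
cutoff_53 = metrics(tp=167, fn=128, tn=821, fp=80)
cutoff_40 = metrics(tp=206, fn=89, tn=740, fp=161)

for name, result in (("53% cutoff", cutoff_53), ("40% cutoff", cutoff_40)):
    print(name, {key: f"{100 * value:.2f}%" for key, value in result.items()})
```

Under these assumed counts, lowering the cutoff moves 39 churned customers from missed to identified, at the cost of 81 additional retained customers flagged as churned.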

Variable importance is roughly estimated by taking the absolute value of the t-score associated with each estimated coefficient. Additionally, the sign of each coefficient is stored to determine whether the associated variable reduces or increases customer churn, according to the model.

Variable                               Importance Score   Effect
tenure                                 8.48               Reduces Churn
InternetServiceNo                      7.42               Reduces Churn
ContractTwo year                       6.71               Reduces Churn
InternetServiceFiber optic             6.57               Increases Churn
ContractOne year                       4.76               Reduces Churn
MonthlyCharges                         4.71               Reduces Churn
StreamingMoviesYes                     4.45               Increases Churn
TotalCharges                           4.42               Increases Churn
StreamingTVYes                         4.33               Increases Churn
PaperlessBillingYes                    3.65               Increases Churn
MultipleLinesYes                       3.41               Increases Churn
PaymentMethodElectronic check          3.23               Increases Churn
OnlineSecurityYes                      2.73               Reduces Churn
SeniorCitizen                          2.14               Increases Churn
TechSupportYes                         2.01               Reduces Churn
PaymentMethodCredit card (automatic)   0.67               Reduces Churn
PaymentMethodMailed check              0.22               Reduces Churn

These results indicate that the most important variables are tenure, contract type, and type of internet service. Long tenure and long contracts reduce the probability of churn, while using fiber optic internet tends to increase the probability of churn.
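The importance score and effect label can be derived from each coefficient estimate and its standard error, as in the Python sketch below. The estimates and standard errors here are hypothetical, chosen only to show the calculation; they are not the fitted values from this model.

```python
def importance_table(coefficients):
    """coefficients maps a term name to (estimate, standard_error)."""
    rows = []
    for name, (estimate, std_error) in coefficients.items():
        score = abs(estimate / std_error)  # absolute test statistic
        effect = "Reduces Churn" if estimate < 0 else "Increases Churn"
        rows.append((name, round(score, 2), effect))
    # Most important terms first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical (estimate, standard error) pairs.
coefficients = {
    "TechSupportYes": (-0.5, 0.25),
    "tenure": (-0.5, 0.0625),
    "PaperlessBillingYes": (0.875, 0.25),
}
table = importance_table(coefficients)
for name, score, effect in table:
    print(f"{name:<22}{score:<8}{effect}")
```

A negative coefficient lowers the log-odds of churn, which is why the sign alone determines the effect label.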

Random Forest

Next, a random forest model was fitted to the data. The training error rate was calculated for values of the tuning parameter mtry from 1 through 10, using a forest of 500 trees. The lowest training error rate was found at mtry = 2.

Random forests are not as interpretable as logistic regression; however, a measure of variable importance can still be calculated. Here it seems that the random forest model focused mostly on tenure and the charges a customer has paid.

The test results of the random forest are below. The model performed slightly worse than logistic regression and is also less usable and less interpretable.

Accuracy   Sensitivity   Specificity
80.85%     50.51%        90.79%

K-Nearest Neighbors

Results for a K-nearest neighbors model are listed below. The model does not perform well and has little interpretability. K = 14 was selected as the value of the tuning parameter with the highest training accuracy.

Testing results are as follows:

Accuracy   Sensitivity   Specificity
78.76%     43.73%        90.23%
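For readers unfamiliar with the method, a K-nearest neighbors classifier and the selection of K can be sketched as follows. This Python sketch uses hypothetical two-dimensional points and assumes K is chosen by accuracy on a separate validation set; the exact tuning procedure used in this analysis is not described in detail.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_points, train_labels, eval_points, eval_labels, k):
    """Share of evaluation points whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_points, train_labels, point, k) == label
        for point, label in zip(eval_points, eval_labels)
    )
    return hits / len(eval_points)


# Hypothetical data; the point (0.4, 0.4) is a deliberately noisy "Churned" label.
train_points = [(0, 0), (0, 1), (1, 0), (0.4, 0.4), (5, 5), (5, 6), (6, 5)]
train_labels = ["Stayed", "Stayed", "Stayed", "Churned", "Churned", "Churned", "Churned"]
valid_points = [(0.5, 0.5), (1, 1), (5.5, 5.5)]
valid_labels = ["Stayed", "Stayed", "Churned"]

# Pick the candidate K with the highest validation accuracy (first one wins ties).
best_k = max(
    [1, 3, 5],
    key=lambda k: accuracy(train_points, train_labels, valid_points, valid_labels, k),
)
print(best_k)
```

With K = 1 the noisy point misclassifies two of the three validation points, which is why a larger K is selected here.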

Takeaways