1 Introduction

The data set used in this analysis is a subset of a larger pool of customer data collected by a telecommunications company to investigate what factors may contribute to the retention or churn (i.e., loss) of customers. This data set consists of 1000 observations and 14 variables. Though it has no missing values, the character values of some categorical variables are assigned inconsistently, which may warrant data cleaning to make sorting and analysis easier. It is available for free download on Kaggle.com.

churn <- read.csv("https://pengdsci.github.io/datasets/ChurnData/Customer-Chrn-dataset.txt")
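The inconsistencies in the categorical values can be surfaced by listing the distinct values of each character column. The following is a minimal sketch on a small made-up data frame; the column values shown are hypothetical and are not taken from the churn data, and the trim-and-lower-case step is only one possible normalization.

```r
# Toy stand-in data frame with hypothetical labeling inconsistencies
toy <- data.frame(
  Sex   = c("Male", "male", "Female", "Female"),
  Churn = c("Yes", "No", "yes ", "No"),
  Term  = c(12, 5, 40, 3),
  stringsAsFactors = FALSE
)

# List the distinct values of every character column to spot inconsistent labels
char.cols  <- names(toy)[sapply(toy, is.character)]
level.list <- lapply(toy[char.cols], function(v) sort(unique(v)))
print(level.list)

# One simple normalization: strip surrounding whitespace and lower-case the labels
toy$Churn.clean <- tolower(trimws(toy$Churn))
print(table(toy$Churn.clean))
```

The same two steps could be applied to `churn` in place of `toy` once the data are loaded.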

1.1 Variables & Descriptions

The names of the 14 variables (11 categorical, 3 numerical) are as follows:

  1. Sex: (categorical)
  2. Marital_Status: (categorical)
  3. Term: (Amount of time customer has been with company, in months) (discrete numerical)
  4. Phone_service: (“Yes” = Has phone service, “No” = Does not have phone service) (categorical)
  5. International_plan: (“Yes” = Has international plan, “No” = Does not have international plan) (categorical)
  6. Voice_mail_plan: (“Yes” = Has voice mail plan, “No” = Does not have voice mail plan) (categorical)
  7. Multiple_line: (“Yes” = Has multiple phone lines, “No” = Does not have multiple phone lines, “No phone” = Does not have phone service) (categorical)
  8. Internet_service: (Type of Internet Service: “Cable”/“Fibre optic”/“DSL”/“No internet”) (categorical)
  9. Technical_support: (“Yes” = Has technical support, “No” = Does not have technical support, “No internet” = Does not have internet service) (categorical)
  10. Streaming_Videos: (“Yes” = Has video streaming, “No” = Does not have video streaming, “No internet” = Does not have internet service) (categorical)
  11. Agreement_period: (Length of Contract: “Monthly contract”/“One year contract”/“Two year contract”) (categorical)
  12. Monthly_Charges: (Amount charged to customer for one month of services, currency not specified) (continuous numerical)
  13. Total_Charges: (Total amount charged to customer across length of term) (continuous numerical)
  14. Churn: (“Yes” = Customer churn, “No” = Customer retained) (categorical)

1.2 Research Question

In this analysis, a simple logistic regression model will be constructed to investigate a potential link between the amount of a customer’s monthly charges for telecommunication services and the likelihood of them leaving the company, i.e., customer churn, from month to month. Interpretation of the model’s regression coefficients can provide some insight into how the size of a customer’s monthly bill might influence their willingness to continue to give the company their business, and the magnitude of this influence.
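For reference, the simple logistic regression model described here can be written as follows, where \(p(x)\) is shorthand (introduced here, not used elsewhere in the report) for the probability of churn for a customer with monthly charges \(x\):

\[
\log\!\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x,
\qquad
p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
\]

Under this form, \(e^{\beta_1}\) is the factor by which the odds \(p(x)/(1-p(x))\) are multiplied for each one-unit increase in monthly charges.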

2 Exploratory Data Analysis

2.1 Distribution of Predictor Variable

# histogram of monthly charges on the density scale, with a smoothed density curve overlaid
hist(churn$Monthly_Charges, probability = TRUE, main = "Monthly Charges Distribution", xlab="",
     col = "springgreen3", border="lightsteelblue1")
lines(density(churn$Monthly_Charges, adjust=2), col="slateblue1")

This histogram suggests that the distribution of customers’ monthly charges is roughly bimodal, with one peak around 20 to 30 and the other around 80 to 90. It is possible that a transformation or discretization of the original values could produce a new predictor variable with a distribution that is closer to normal. However, logistic regression does not require a normally distributed predictor, so for ease of interpretation the original variable will be used to construct the regression model for this analysis.

2.2 Converting Response Variable to Binary Factor

Before the simple logistic regression model can be generated, a new variable which converts the “Yes” & “No” character values of the Churn variable to 1’s and 0’s must be added to the data set.

y0 <- churn$Churn
Churn.bin <- rep(0, length(y0))
Churn.bin[which(y0=="Yes")] = 1
churn$Churn.bin <- Churn.bin
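The same recoding can also be written as a single vectorized comparison, and a cross-tabulation is a quick way to confirm that every “Yes” maps to 1 and every “No” maps to 0. The sketch below uses a short made-up vector in place of `churn$Churn`; the names `y.toy` and `bin.toy` are illustrative only.

```r
# Toy stand-in for churn$Churn
y.toy <- c("Yes", "No", "No", "Yes", "No")

# Vectorized recoding: TRUE/FALSE from the comparison becomes 1/0
bin.toy <- as.integer(y.toy == "Yes")

# Cross-tabulate original labels against the recoded values as a sanity check
check <- table(original = y.toy, recoded = bin.toy)
print(check)
```

Any value that is neither exactly “Yes” nor “No” (for example, a differently capitalized label) would be coded as 0 by both this version and the loop-free code above, so the cross-tabulation is worth inspecting before modeling.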

3 Constructing the Simple Logistic Regression Model

With the response variable converted to binary form, the simple logistic regression model can be generated.

logistic.model <- glm(Churn.bin ~ Monthly_Charges, family = binomial(link = "logit"), data = churn)

From this new model the regression coefficients can be extracted and presented in table form for interpretation purposes.

library(knitr)   # provides kable()
model.coef.stats = summary(logistic.model)$coef       
conf.ci = confint(logistic.model)   # 95% confidence intervals for the coefficients
odds.ratio = exp(coef(logistic.model))
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci, Odds.Ratio = odds.ratio) 
kable(sum.stats,caption = "Summary Stats of Regression Coefficients with Odds Ratio")
Summary Stats of Regression Coefficients with Odds Ratio

                   Estimate  Std. Error     z value  Pr(>|z|)       2.5 %      97.5 %  Odds.Ratio
(Intercept)      -2.2092749   0.2075945  -10.642262         0  -2.6276461  -1.8128684   0.1097802
Monthly_Charges   0.0164779   0.0026273    6.271826         0   0.0114057   0.0217157   1.0166144

Based on this output, there is a positive association between monthly charges and customer churn: the estimated slope is \(\beta_1\) = 0.0165, with a 95% confidence interval of [0.0114, 0.0217]. The odds ratio is about 1.017, which implies that each one-unit increase in monthly charges increases the odds of customer churn by about 1.7%. Though not an overwhelming result, this value does suggest that a moderate increase in monthly charges can have an inflating effect on the odds of customer churn that shouldn’t be ignored: an increase of 15 units of currency corresponds to an odds ratio of \(e^{15 \times 0.0165} \approx 1.28\), i.e., roughly 28% higher odds. The p-value for this coefficient (displayed as 0 after rounding) indicates high statistical significance.
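The arithmetic behind these odds ratios can be reproduced directly from the reported estimates. This is a small sketch that hard-codes the two coefficients from the summary table rather than refitting the model; the charge levels of 25 and 85 are assumptions chosen for illustration because they sit near the two histogram peaks.

```r
# Coefficient estimates copied from the summary table
beta0 <- -2.2092749
beta1 <-  0.0164779

# Odds ratio for a k-unit increase in monthly charges is exp(k * beta1)
or.1  <- exp(beta1)        # one-unit increase
or.15 <- exp(15 * beta1)   # 15-unit increase

# Fitted churn probabilities at two illustrative charge levels
p.25 <- plogis(beta0 + beta1 * 25)
p.85 <- plogis(beta0 + beta1 * 85)

print(round(c(or.1 = or.1, or.15 = or.15, p.25 = p.25, p.85 = p.85), 3))
```

On the probability scale, the fitted model puts the chance of churn at roughly 0.14 for monthly charges of 25 and roughly 0.31 for monthly charges of 85.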

Graphical representations of the model output can help to better visualize the association between monthly charges and the probability of customer churn:

charges.range = range(churn$Monthly_Charges)
x = seq(charges.range[1], 200, length.out = 200)   # fine grid (extended past the data to 200) so the curves plot smoothly
beta.x = coef(logistic.model)[1] + coef(logistic.model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(logistic.model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2)) 
plot(x, success.prob, type = "l", lwd = 2, col = "springgreen3",
     main = "Probability of Customer Churn", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Monthly Charges",
     ylab = "Probability",
     axes = FALSE,
     col.main = "springgreen3",
     cex.main = 0.8)
axis(1, pos = 0)
axis(2)

y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "springgreen3",
     main = "The Rate of Change in the Probability \n  of Customer Churn", 
     xlab = "Monthly Charges",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "springgreen3",
     cex.main = 0.8
     )
axis(1, pos = 0)
axis(2)

The positive association between monthly charges and the probability of customer churn can be seen clearly from these graphs. The curves also suggest that, in theory, the rate of change of the probability of customer churn would continue to increase until monthly charges reached roughly 134, the point \(-\beta_0/\beta_1\) at which the fitted probability equals 0.5. Beyond this point, the increase in the probability of customer churn would begin to level off. However, as this value lies well beyond the range of this data set’s monthly charges, this interpretation should be taken with caution.
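The location of this turning point follows from the derivative of the fitted probability, which is the quantity plotted in the right-hand panel. Writing \(p(x)\) for the fitted churn probability at monthly charges \(x\) (notation introduced here for convenience):

\[
\frac{dp}{dx} = \frac{\beta_1 e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \beta_1\, p(x)\,\bigl(1 - p(x)\bigr)
\]

Since \(p(1-p)\) is largest when \(p = 1/2\), the rate of change peaks where \(\beta_0 + \beta_1 x = 0\):

\[
x^{*} = -\frac{\beta_0}{\beta_1} = \frac{2.2092749}{0.0164779} \approx 134.1,
\qquad
\left.\frac{dp}{dx}\right|_{x = x^{*}} = \frac{\beta_1}{4} \approx 0.0041
\]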

4 Bootstrap Simple Logistic Regression

As a supplement to the regression coefficient and odds ratio estimates provided by the model output, bootstrapping can be employed to construct confidence intervals for these values based on a non-parametric approach.

B <- 1000

boot.beta0 <- NULL 
boot.beta1 <- NULL
boot.odds.ratio <- NULL

vector.id <- 1:length(Churn.bin)   # vector of observation IDs
for(i in 1:B){ #starting loop
  
  ##creating samples of observation IDs with replacement, of same size as original sample
  boot.id <- sample(vector.id, length(Churn.bin), replace=TRUE) 
  
  #matching response and explanatory variable values to bootstrap sample observation IDs
  boot.churn <- churn$Churn.bin[boot.id]
  boot.charges <- churn$Monthly_Charges[boot.id]
  
  #generating bootstrap SLR model for each bootstrap sample
  #(boot.churn and boot.charges are already resampled, so they are not indexed by boot.id again)
  boot.log.model <- glm(boot.churn ~ boot.charges, family = binomial(link = "logit"))
  
  #storing regression coefficient values for each bootstrap SLR
  boot.beta0[i] <- coef(boot.log.model)[1]
  boot.beta1[i] <- coef(boot.log.model)[2]
  boot.odds.ratio[i] <- exp(coef(boot.log.model)[2]) 
}

boot.beta0.95 <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.95 <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.odds.ratio.95 <- quantile(boot.odds.ratio, c(0.025, 0.975), type = 2)

The 95% confidence intervals constructed via the case-resampling bootstrap for \(\beta_1\) and the odds ratio are [0.0099, 0.0237] and [1.0099, 1.0237], respectively. Both intervals contain the corresponding point estimates from the parametric model, providing further support for their validity and for the significance of \(\beta_1\): the bootstrap interval for \(\beta_1\) excludes 0, and the interval for the odds ratio excludes 1. However, given that the model’s estimate for \(\beta_1\) falls near the center of both the parametric confidence interval and the bootstrap confidence interval, it may still be preferable to report the former due to its slightly smaller width.
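As a cross-check on a hand-written resampling loop, the same kind of percentile interval can be obtained with the `boot` package. The sketch below is not the analysis reported here: it runs on simulated stand-in data (the data frame `sim`, its columns, the seed, and the number of replicates are all assumptions for illustration), so its intervals will not match the ones above.

```r
library(boot)   # recommended package that ships with standard R installations

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data (NOT the churn data): charges and a binary outcome
n <- 300
sim <- data.frame(charges = runif(n, 20, 120))
sim$churned <- rbinom(n, 1, plogis(-2.2 + 0.016 * sim$charges))

# Statistic for boot(): refit the model on the resampled rows, return coefficients
coef.fun <- function(data, idx) {
  coef(glm(churned ~ charges, family = binomial(link = "logit"), data = data[idx, ]))
}

boot.out <- boot(sim, coef.fun, R = 500)

# 95% percentile interval for the slope (index = 2), then for the odds ratio
slope.ci <- boot.ci(boot.out, type = "perc", index = 2)$percent[4:5]
or.ci    <- exp(slope.ci)
print(rbind(slope = slope.ci, odds.ratio = or.ci))
```

Because each bootstrap replicate resamples whole rows, the response and predictor stay paired, which is the same case-resampling idea used in the loop above.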

5 Conclusion

Based on the estimates of the simple logistic regression model, in conjunction with the results of the bootstrap regression analysis, it can be reasonably concluded that there is a positive association between a customer’s monthly charges and the probability of customer churn. The practical significance of this association is not entirely clear, however. While the model suggests that a considerable increase in monthly charges would produce a noticeable increase in the likelihood of customer churn, this is essentially just a confirmation of what simple intuition would dictate. It is very possible that some of the other variables in the data set have a more practically significant impact on the probability of customer churn, and therefore a multiple logistic regression model (perhaps with monthly charges as one of several predictor variables) may be a more appropriate way of understanding changes in the probability of customer churn.