1 Introduction

The data set used in this analysis is a subset of a larger pool of customer data collected by a telecommunications company to investigate what factors may contribute to the retention or churn (i.e., loss) of customers. This data set consists of 1000 observations and 14 variables. There are no missing values. It is available for free download on Kaggle.com

churn <- read.csv("https://pengdsci.github.io/datasets/ChurnData/Customer-Chrn-dataset.txt")

1.1 Variables & Descriptions

The names of the 14 variables (11 categorical, 3 numerical) are as follows:

  1. Sex: (categorical)
  2. Marital_Status: (categorical)
  3. Term: (Amount of time customer has been with company,in months) (discrete numerical)
  4. Phone_service: (“Yes”= Has phone service, “No” = Does not have phone service) (categorical)
  5. International_plan: (“Yes” = Has international plan, “No” = Does not have international plan) (categorical)
  6. Voice_mail_plan: (“Yes” = Has voice mail plan, “No” = Does not have voice mail plan) (categorical)
  7. Multiple_line: (“Yes” = Has multiple phone lines, “No” = Does not have multiple phone lines, “No phone” = Does not have phone service) (categorical)
  8. Internet_service: (Type of Internet Service: “Cable”/“Fibre optic”/“DSL”/“No internet”) (Categorical)
  9. Technical_support: (“Yes” – Has technical support, “No” – Does not have technical support, “No internet” – Does not have internet service) (Categorical)
  10. Streaming_Videos: (“Yes” = Has video streaming, “No” = Does not have video streaming, “No internet” = does not have internet service) (Categorical)
  11. Agreement_period: (Length of Contract: “Monthly contract”/“One year contract”/“Two year contract”) (Categorical)
  12. Monthly_Charges: (Amount charged to customer for one month of services, currency not specified) (Continuous numerical)
  13. Total_Charges: (Total amount charged to customer across length of term) (Continuous numerical)
  14. Churn: (“Yes” = Customer churn, “No” = Customer retained) (Categorical)

1.2 Practical Question

The purpose of this analysis is to determine how these different variables pertaining to a customer’s telecommunications subscription impact the probability of them leaving the company (i.e., customer churn) from month to month.

2 Exploratory Data Analysis

2.1 Analysis of Numerical Variables

While the majority of the variables in this data set are categorical, a pair-wise scatter plot of the few numerical variables can be generated to evaluate any potential correlations between them.

library(psych)
pairs.panels(churn[, c(3, 12, 13)], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

This scatter plot indicated that there is a very strong correlation between the variables Term & Total_Charges, as well as moderate correlation between Monthly_Charges and Total_Charges. Because Total_Charges is moderately to strongly correlated with each of the other two numerical values, it will be excluded from the regression model.

From the plot it can also be seen that neither Term nor Monthly_Charges is particularly normally distributed. More detailed histograms can give a better understanding of their distributions.

par(mfrow=c(1,2))
hist(churn$Term, xlab="Term", main = "", col = "springgreen3")
hist(churn$Monthly_Charges, xlab = "Monthly_Charges", main = "", col = "slateblue1")

The distribution of Term is significantly skewed, while Monthly_Charges exhibits a roughly bimodal distribution. Therefore, both variables will be discretized before being used in the regression model.

Term = churn$Term
grp.Term = Term
grp.Term[Term %in% c(0:20)] = "0-20"
grp.Term[Term %in% c(21:40)] = "21-40"
grp.Term[Term %in% c(41:60)] = "41-60"
grp.Term[Term %in% c(61:75)] = "65+"

Monthly = churn$Monthly_Charges
grp.Monthly = Monthly
grp.Monthly[Monthly <= 60.00] = "<=60"
grp.Monthly[Monthly > 60.00] = ">60"

churn$grp.Term <- grp.Term
churn$grp.Monthly <- grp.Monthly

2.2 Converting Response Variable Into Binary Factor Variable

The variable Churn must be converted into a binary factor variable in order to use it as the response variable in the logistic regression model.

y0 <- churn$Churn
Churn.bin <- rep(0, length(y0))
Churn.bin[which(y0=="Yes")] = 1
churn$Churn.bin <- Churn.bin

3 Building Multiple Logistic Regression Model

3.1 Full Model

To begin the multiple logistic model building process, the full model will be constructed using the binary version of Churn as the response, and for the predictors all of the other categorical variables as well as the discretized Term and Monthly_Charges variables.

full.model = glm(Churn.bin ~ Sex + Marital_Status + Phone_service + International_plan + Voice_mail_plan + Multiple_line + Internet_service + Technical_support + Streaming_Videos + Agreement_period + grp.Term + grp.Monthly, 
          family = binomial(link = "logit"),
          data = churn)  
kable(summary(full.model)$coef, 
      caption="Summary of Inferential Statistics of the Full Model")
Summary of Inferential Statistics of the Full Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5416140 0.4743373 1.1418330 0.2535234
SexMale -0.0150689 0.1755353 -0.0858453 0.9315894
Marital_StatusSingle -0.0497139 0.3542051 -0.1403535 0.8883807
Phone_serviceYes 0.8025502 0.4267435 1.8806384 0.0600211
International_planyes -0.0804928 0.2688971 -0.2993444 0.7646773
International_planYes 0.2639021 0.2287407 1.1537171 0.2486162
Voice_mail_planYes -0.0023890 0.1845780 -0.0129433 0.9896730
Multiple_lineYes -0.1661475 0.2080758 -0.7984949 0.4245833
Internet_serviceDSL -1.2114059 0.3246723 -3.7311651 0.0001906
Internet_serviceFiber optic -0.0692190 0.2381681 -0.2906310 0.7713336
Internet_serviceNo Internet -3.4283528 0.5266906 -6.5092353 0.0000000
Technical_supportYes -0.5919972 0.2165968 -2.7331767 0.0062727
Streaming_VideosYes 0.5795696 0.2055061 2.8202069 0.0047993
Agreement_periodOne year contract -1.8666291 0.3148806 -5.9280532 0.0000000
Agreement_periodTwo year contract -2.0750841 0.4195625 -4.9458280 0.0000008
grp.Term21-40 -0.9353897 0.2493380 -3.7514930 0.0001758
grp.Term41-60 -0.2147208 0.3119979 -0.6882123 0.4913191
grp.Term65+ -0.5039791 0.3988333 -1.2636335 0.2063616
grp.Monthly>60 -1.0568128 0.4004310 -2.6391886 0.0083105

3.2 Reduced Model

In the absence of any expert-provided information about which values may be practically significant, the reduced model will be constructed using only the three most statistically significant predictor variables from the full model.

reduced.model = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term, 
          family = binomial(link = "logit"),
          data = churn)  
kable(summary(reduced.model)$coef, 
      caption="Summary of Inferential Statistics of the Reduced Model")
Summary of Inferential Statistics of the Reduced Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4868430 0.1982836 2.4552859 0.0140773
Internet_serviceDSL -0.8569101 0.2428689 -3.5282823 0.0004183
Internet_serviceFiber optic -0.1156724 0.2156588 -0.5363675 0.5917046
Internet_serviceNo Internet -2.5491609 0.3835675 -6.6459257 0.0000000
Agreement_periodOne year contract -1.9937361 0.3061610 -6.5120519 0.0000000
Agreement_periodTwo year contract -2.3832920 0.3988378 -5.9755923 0.0000000
grp.Term21-40 -0.9863819 0.2323121 -4.2459335 0.0000218
grp.Term41-60 -0.1366617 0.2762953 -0.4946220 0.6208670
grp.Term65+ -0.3900424 0.3575836 -1.0907726 0.2753729

3.3 Automatic Variable Selection Model

With the full and reduced models constructed, another model can be constructed using automatic variable selection.

library(MASS)
auto.var.model = stepAIC(reduced.model, 
                      scope = list(lower=formula(reduced.model),upper=formula(full.model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )
kable(summary(auto.var.model)$coef, 
      caption="Summary of Inferential Statistics of the Automatical Variable Selection Model")
Summary of Inferential Statistics of the Automatical Variable Selection Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5146908 0.3899117 1.3200190 0.1868287
Internet_serviceDSL -1.2111657 0.3175540 -3.8140458 0.0001367
Internet_serviceFiber optic -0.1527670 0.2241249 -0.6816155 0.4954821
Internet_serviceNo Internet -3.4493605 0.5001411 -6.8967754 0.0000000
Agreement_periodOne year contract -1.8646419 0.3140225 -5.9379237 0.0000000
Agreement_periodTwo year contract -2.0509852 0.4148297 -4.9441621 0.0000008
grp.Term21-40 -0.9612877 0.2397912 -4.0088526 0.0000610
grp.Term41-60 -0.2396120 0.2862021 -0.8372126 0.4024731
grp.Term65+ -0.5529218 0.3713989 -1.4887546 0.1365520
Technical_supportYes -0.5988289 0.2136803 -2.8024522 0.0050716
Streaming_VideosYes 0.5614245 0.2022807 2.7754725 0.0055122
grp.Monthly>60 -1.0278732 0.3751703 -2.7397507 0.0061486
Phone_serviceYes 0.8451414 0.3696940 2.2860565 0.0222510

The forward selection process resulted in a new regression model with seven predictor variables: grp.Term, grp.Monthly, Agreement_period, Internet_service, Technical_Support, Streaming_Videos, & Phone_service.

3.4 Comparing Candidate Models

The full model, reduced model, and the automatic variable selection model can now be compared by multiple goodness-of-fit measures to determine which to use as the final model.

global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
      reduced.model=global.measure(reduced.model),
      auto.var.model=global.measure(auto.var.model))
row.names(goodness) = c("full.model", "reduced.model", "auto.var.model")
kable(goodness, caption ="Comparison of Global Goodness-of-Fit Statistics")
Comparison of Global Goodness-of-Fit Statistics
Deviance.residual Null.Deviance.Residual AIC
full.model 842.0548 1144.017 880.0548
reduced.model 866.0357 1144.017 884.0357
auto.var.model 845.1956 1144.017 871.1956

Because it has the smallest AIC value, the automatic variable selection model will be chosen as the final model for the analysis.

4 Final Model

Now that a final model has been chosen, the regression coefficients and corresponding odds ratios for each predictor variable can be analyzed and interpreted in a practical context.

model.coef.stats = summary(auto.var.model)$coef
odds.ratio = exp(coef(auto.var.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) 0.5146908 0.3899117 1.3200190 0.1868287 1.6731211
Internet_serviceDSL -1.2111657 0.3175540 -3.8140458 0.0001367 0.2978499
Internet_serviceFiber optic -0.1527670 0.2241249 -0.6816155 0.4954821 0.8583297
Internet_serviceNo Internet -3.4493605 0.5001411 -6.8967754 0.0000000 0.0317659
Agreement_periodOne year contract -1.8646419 0.3140225 -5.9379237 0.0000000 0.1549517
Agreement_periodTwo year contract -2.0509852 0.4148297 -4.9441621 0.0000008 0.1286081
grp.Term21-40 -0.9612877 0.2397912 -4.0088526 0.0000610 0.3824002
grp.Term41-60 -0.2396120 0.2862021 -0.8372126 0.4024731 0.7869331
grp.Term65+ -0.5529218 0.3713989 -1.4887546 0.1365520 0.5752666
Technical_supportYes -0.5988289 0.2136803 -2.8024522 0.0050716 NA
Streaming_VideosYes 0.5614245 0.2022807 2.7754725 0.0055122 0.5494547
grp.Monthly>60 -1.0278732 0.3751703 -2.7397507 0.0061486 NA
Phone_serviceYes 0.8451414 0.3696940 2.2860565 0.0222510 1.7531681

Interestingly, the regression coefficient for nearly every predictor variable is negative, which suggests that any change from than the baseline level in any of these variables (while all others remain unchanged) results in a decrease in the probability of customer churn. For instance, the model suggests that, holding all other variables equal, going from the baseline term duration of 0-20 months to 21-40 months would result in the odds of customer churn decreasing by about 62%. This value is obtained via the formula \((Odds Ratio - 1) * 100\), or in this case \((0.38 - 1)* 100 = -62\). Similarly, a change from cable internet (the baseline internet service type) to DSL (again with all other predictor variables remaining unchanged) would result in an over 70% decrease in the odds of customer churn. The only variables for which the regression coefficients are positive are Streaming_Videos and Phone_service. Since the baseline for these variables is not having these features, the positive coefficients suggest that a customer having these features as opposed to not having them increases the odds of customer churn.

Note: Unfortunately, there are some unresolved errors in the output of the odds ratios of the last four predictor variables in the table. However, the correct odds ratios can be easily calculated by exponentiating the corresponding regression coefficients. Thus, the correct odds ratio for Technical_supportYes is \(e^{-0.5988289...}\) \(\approx\) 0.5494547, and 1.7531681, 0.3577671, & 2.328307 for Streaming_VideosYes, grp.Monthly>60, & Phone_serviceYes, respectively.

5 Conclusion

The interpretation of the regression coefficients and odds ratios can provide some valuable insight into how these different factors might influence the likelihood of a customer ending their business with a telecommunications company. For example, as noted earlier, the model suggests that the odds of customer churn are somewhat higher for a customer with cable internet than a customer with a DSL connection. This may signal an increased level of frustration experienced by customers due to issues with their cable internet connection, and consequently a need for the company to investigate these potential issues. This notion applies to the company’s video streaming and phone services as well, considering the model’s suggestion that the odds of churn are higher for customers who have these services as opposed to those who don’t. However, consultation with an expert in the business might offer some alternative explanations for these phenomena.