The data set used in this analysis is a subset of a larger pool of customer data collected by a telecommunications company to investigate what factors may contribute to the retention or churn (i.e., loss) of customers. This data set consists of 1000 observations and 14 variables. There are no missing values. It is available for free download on Kaggle.com
churn <- read.csv("https://pengdsci.github.io/datasets/ChurnData/Customer-Chrn-dataset.txt")
The names of the 14 variables (11 categorical, 3 numerical) are as follows:
The purpose of this analysis is to determine how these different variables pertaining to a customer’s telecommunications subscription impact the probability of them leaving the company (i.e., customer churn) from month to month.
While the majority of the variables in this data set are categorical, a pair-wise scatter plot of the few numerical variables can be generated to evaluate any potential correlations between them.
library(psych)
pairs.panels(churn[, c(3, 12, 13)],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
This scatter plot indicated that there is a very strong correlation between the variables Term & Total_Charges, as well as moderate correlation between Monthly_Charges and Total_Charges. Because Total_Charges is moderately to strongly correlated with each of the other two numerical values, it will be excluded from the regression model.
From the plot it can also be seen that neither Term nor Monthly_Charges is particularly normally distributed. More detailed histograms can give a better understanding of their distributions.
par(mfrow=c(1,2))
hist(churn$Term, xlab="Term", main = "", col = "springgreen3")
hist(churn$Monthly_Charges, xlab = "Monthly_Charges", main = "", col = "slateblue1")
The distribution of Term is significantly skewed, while Monthly_Charges exhibits a roughly bimodal distribution. Therefore, both variables will be discretized before being used in the regression model.
Term = churn$Term
grp.Term = Term
grp.Term[Term %in% c(0:20)] = "0-20"
grp.Term[Term %in% c(21:40)] = "21-40"
grp.Term[Term %in% c(41:60)] = "41-60"
grp.Term[Term %in% c(61:75)] = "65+"
Monthly = churn$Monthly_Charges
grp.Monthly = Monthly
grp.Monthly[Monthly <= 60.00] = "<=60"
grp.Monthly[Monthly > 60.00] = ">60"
churn$grp.Term <- grp.Term
churn$grp.Monthly <- grp.Monthly
The variable Churn must be converted into a binary factor variable in order to use it as the response variable in the logistic regression model.
y0 <- churn$Churn
Churn.bin <- rep(0, length(y0))
Churn.bin[which(y0=="Yes")] = 1
churn$Churn.bin <- Churn.bin
To begin the multiple logistic model building process, the full model will be constructed using the binary version of Churn as the response, and for the predictors all of the other categorical variables as well as the discretized Term and Monthly_Charges variables.
full.model = glm(Churn.bin ~ Sex + Marital_Status + Phone_service + International_plan + Voice_mail_plan + Multiple_line + Internet_service + Technical_support + Streaming_Videos + Agreement_period + grp.Term + grp.Monthly,
family = binomial(link = "logit"),
data = churn)
kable(summary(full.model)$coef,
caption="Summary of Inferential Statistics of the Full Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 0.5416140 | 0.4743373 | 1.1418330 | 0.2535234 |
| SexMale | -0.0150689 | 0.1755353 | -0.0858453 | 0.9315894 |
| Marital_StatusSingle | -0.0497139 | 0.3542051 | -0.1403535 | 0.8883807 |
| Phone_serviceYes | 0.8025502 | 0.4267435 | 1.8806384 | 0.0600211 |
| International_planyes | -0.0804928 | 0.2688971 | -0.2993444 | 0.7646773 |
| International_planYes | 0.2639021 | 0.2287407 | 1.1537171 | 0.2486162 |
| Voice_mail_planYes | -0.0023890 | 0.1845780 | -0.0129433 | 0.9896730 |
| Multiple_lineYes | -0.1661475 | 0.2080758 | -0.7984949 | 0.4245833 |
| Internet_serviceDSL | -1.2114059 | 0.3246723 | -3.7311651 | 0.0001906 |
| Internet_serviceFiber optic | -0.0692190 | 0.2381681 | -0.2906310 | 0.7713336 |
| Internet_serviceNo Internet | -3.4283528 | 0.5266906 | -6.5092353 | 0.0000000 |
| Technical_supportYes | -0.5919972 | 0.2165968 | -2.7331767 | 0.0062727 |
| Streaming_VideosYes | 0.5795696 | 0.2055061 | 2.8202069 | 0.0047993 |
| Agreement_periodOne year contract | -1.8666291 | 0.3148806 | -5.9280532 | 0.0000000 |
| Agreement_periodTwo year contract | -2.0750841 | 0.4195625 | -4.9458280 | 0.0000008 |
| grp.Term21-40 | -0.9353897 | 0.2493380 | -3.7514930 | 0.0001758 |
| grp.Term41-60 | -0.2147208 | 0.3119979 | -0.6882123 | 0.4913191 |
| grp.Term65+ | -0.5039791 | 0.3988333 | -1.2636335 | 0.2063616 |
| grp.Monthly>60 | -1.0568128 | 0.4004310 | -2.6391886 | 0.0083105 |
In the absence of any expert-provided information about which values may be practically significant, the reduced model will be constructed using only the three most statistically significant predictor variables from the full model.
reduced.model = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term,
family = binomial(link = "logit"),
data = churn)
kable(summary(reduced.model)$coef,
caption="Summary of Inferential Statistics of the Reduced Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 0.4868430 | 0.1982836 | 2.4552859 | 0.0140773 |
| Internet_serviceDSL | -0.8569101 | 0.2428689 | -3.5282823 | 0.0004183 |
| Internet_serviceFiber optic | -0.1156724 | 0.2156588 | -0.5363675 | 0.5917046 |
| Internet_serviceNo Internet | -2.5491609 | 0.3835675 | -6.6459257 | 0.0000000 |
| Agreement_periodOne year contract | -1.9937361 | 0.3061610 | -6.5120519 | 0.0000000 |
| Agreement_periodTwo year contract | -2.3832920 | 0.3988378 | -5.9755923 | 0.0000000 |
| grp.Term21-40 | -0.9863819 | 0.2323121 | -4.2459335 | 0.0000218 |
| grp.Term41-60 | -0.1366617 | 0.2762953 | -0.4946220 | 0.6208670 |
| grp.Term65+ | -0.3900424 | 0.3575836 | -1.0907726 | 0.2753729 |
With the full and reduced models constructed, another model can be constructed using automatic variable selection.
library(MASS)
auto.var.model = stepAIC(reduced.model,
scope = list(lower=formula(reduced.model),upper=formula(full.model)),
direction = "forward", # forward selection
trace = 0 # do not show the details
)
kable(summary(auto.var.model)$coef,
caption="Summary of Inferential Statistics of the Automatical Variable Selection Model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 0.5146908 | 0.3899117 | 1.3200190 | 0.1868287 |
| Internet_serviceDSL | -1.2111657 | 0.3175540 | -3.8140458 | 0.0001367 |
| Internet_serviceFiber optic | -0.1527670 | 0.2241249 | -0.6816155 | 0.4954821 |
| Internet_serviceNo Internet | -3.4493605 | 0.5001411 | -6.8967754 | 0.0000000 |
| Agreement_periodOne year contract | -1.8646419 | 0.3140225 | -5.9379237 | 0.0000000 |
| Agreement_periodTwo year contract | -2.0509852 | 0.4148297 | -4.9441621 | 0.0000008 |
| grp.Term21-40 | -0.9612877 | 0.2397912 | -4.0088526 | 0.0000610 |
| grp.Term41-60 | -0.2396120 | 0.2862021 | -0.8372126 | 0.4024731 |
| grp.Term65+ | -0.5529218 | 0.3713989 | -1.4887546 | 0.1365520 |
| Technical_supportYes | -0.5988289 | 0.2136803 | -2.8024522 | 0.0050716 |
| Streaming_VideosYes | 0.5614245 | 0.2022807 | 2.7754725 | 0.0055122 |
| grp.Monthly>60 | -1.0278732 | 0.3751703 | -2.7397507 | 0.0061486 |
| Phone_serviceYes | 0.8451414 | 0.3696940 | 2.2860565 | 0.0222510 |
The forward selection process resulted in a new regression model with seven predictor variables: grp.Term, grp.Monthly, Agreement_period, Internet_service, Technical_Support, Streaming_Videos, & Phone_service.
The full model, reduced model, and the automatic variable selection model can now be compared by multiple goodness-of-fit measures to determine which to use as the final model.
global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
auto.var.model=global.measure(auto.var.model))
row.names(goodness) = c("full.model", "reduced.model", "auto.var.model")
kable(goodness, caption ="Comparison of Global Goodness-of-Fit Statistics")
| Deviance.residual | Null.Deviance.Residual | AIC | |
|---|---|---|---|
| full.model | 842.0548 | 1144.017 | 880.0548 |
| reduced.model | 866.0357 | 1144.017 | 884.0357 |
| auto.var.model | 845.1956 | 1144.017 | 871.1956 |
Because it has the smallest AIC value, the automatic variable selection model will be chosen as the final model for the analysis.
Now that a final model has been chosen, the regression coefficients and corresponding odds ratios for each predictor variable can be analyzed and interpreted in a practical context.
model.coef.stats = summary(auto.var.model)$coef
odds.ratio = exp(coef(auto.var.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio | |
|---|---|---|---|---|---|
| (Intercept) | 0.5146908 | 0.3899117 | 1.3200190 | 0.1868287 | 1.6731211 |
| Internet_serviceDSL | -1.2111657 | 0.3175540 | -3.8140458 | 0.0001367 | 0.2978499 |
| Internet_serviceFiber optic | -0.1527670 | 0.2241249 | -0.6816155 | 0.4954821 | 0.8583297 |
| Internet_serviceNo Internet | -3.4493605 | 0.5001411 | -6.8967754 | 0.0000000 | 0.0317659 |
| Agreement_periodOne year contract | -1.8646419 | 0.3140225 | -5.9379237 | 0.0000000 | 0.1549517 |
| Agreement_periodTwo year contract | -2.0509852 | 0.4148297 | -4.9441621 | 0.0000008 | 0.1286081 |
| grp.Term21-40 | -0.9612877 | 0.2397912 | -4.0088526 | 0.0000610 | 0.3824002 |
| grp.Term41-60 | -0.2396120 | 0.2862021 | -0.8372126 | 0.4024731 | 0.7869331 |
| grp.Term65+ | -0.5529218 | 0.3713989 | -1.4887546 | 0.1365520 | 0.5752666 |
| Technical_supportYes | -0.5988289 | 0.2136803 | -2.8024522 | 0.0050716 | NA |
| Streaming_VideosYes | 0.5614245 | 0.2022807 | 2.7754725 | 0.0055122 | 0.5494547 |
| grp.Monthly>60 | -1.0278732 | 0.3751703 | -2.7397507 | 0.0061486 | NA |
| Phone_serviceYes | 0.8451414 | 0.3696940 | 2.2860565 | 0.0222510 | 1.7531681 |
Interestingly, the regression coefficient for nearly every predictor variable is negative, which suggests that any change from than the baseline level in any of these variables (while all others remain unchanged) results in a decrease in the probability of customer churn. For instance, the model suggests that, holding all other variables equal, going from the baseline term duration of 0-20 months to 21-40 months would result in the odds of customer churn decreasing by about 62%. This value is obtained via the formula \((Odds Ratio - 1) * 100\), or in this case \((0.38 - 1)* 100 = -62\). Similarly, a change from cable internet (the baseline internet service type) to DSL (again with all other predictor variables remaining unchanged) would result in an over 70% decrease in the odds of customer churn. The only variables for which the regression coefficients are positive are Streaming_Videos and Phone_service. Since the baseline for these variables is not having these features, the positive coefficients suggest that a customer having these features as opposed to not having them increases the odds of customer churn.
Note: Unfortunately, there are some unresolved errors in the output of the odds ratios of the last four predictor variables in the table. However, the correct odds ratios can be easily calculated by exponentiating the corresponding regression coefficients. Thus, the correct odds ratio for Technical_supportYes is \(e^{-0.5988289...}\) \(\approx\) 0.5494547, and 1.7531681, 0.3577671, & 2.328307 for Streaming_VideosYes, grp.Monthly>60, & Phone_serviceYes, respectively.
The interpretation of the regression coefficients and odds ratios can provide some valuable insight into how these different factors might influence the likelihood of a customer ending their business with a telecommunications company. For example, as noted earlier, the model suggests that the odds of customer churn are somewhat higher for a customer with cable internet than a customer with a DSL connection. This may signal an increased level of frustration experienced by customers due to issues with their cable internet connection, and consequently a need for the company to investigate these potential issues. This notion applies to the company’s video streaming and phone services as well, considering the model’s suggestion that the odds of churn are higher for customers who have these services as opposed to those who don’t. However, consultation with an expert in the business might offer some alternative explanations for these phenomena.