The data set used in this analysis is a subset of a larger pool of customer data collected by a telecommunications company to investigate what factors may contribute to the retention or churn (i.e., loss) of customers. The data set consists of 1000 observations and 14 variables, with no missing values, and is available for free download on Kaggle.com.
churn <- read.csv("https://pengdsci.github.io/datasets/ChurnData/Customer-Chrn-dataset.txt")
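A quick sanity check can confirm the stated dimensions and the absence of missing values (a small verification sketch, not part of the original analysis):
## verify the data set: expect 1000 rows, 14 columns, and no NAs
dim(churn)
sum(is.na(churn))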
The names of the 14 variables (11 categorical, 3 numerical) are as follows: Sex, Marital_Status, Phone_service, International_plan, Voice_mail_plan, Multiple_line, Internet_service, Technical_support, Streaming_Videos, Agreement_period, and Churn are categorical, while Term, Monthly_Charges, and Total_Charges are numerical.
The purpose of this analysis is to construct a model which predicts whether a customer of a telecommunications company will end their business with the company (i.e., customer churn) in a given month based on several aspects of their subscription.
While the majority of the variables in this data set are categorical, a pairwise scatter plot of the three numerical variables can be generated to evaluate any potential correlations between them.
library(psych)
pairs.panels(churn[, c(3, 12, 13)],   # columns holding the three numerical variables
             method = "pearson",      # correlation method
             hist.col = "#00AFBB",
             density = TRUE,          # show density plots
             ellipses = TRUE          # show correlation ellipses
             )
This scatter plot indicates a very strong correlation between Term and Total_Charges, as well as a moderate correlation between Monthly_Charges and Total_Charges. Because Total_Charges is moderately to strongly correlated with each of the other two numerical variables, it will be excluded from the regression model.
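The exact correlation coefficients can also be printed directly, assuming the same three numeric columns used in the plot above (a quick check, not part of the original write-up):
## pairwise Pearson correlations among the numerical variables
round(cor(churn[, c(3, 12, 13)]), 2)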
The plot also shows that neither Term nor Monthly_Charges appears normally distributed. More detailed histograms can give a better understanding of their distributions.
par(mfrow=c(1,2))
hist(churn$Term, xlab="Term", main = "", col = "springgreen3")
hist(churn$Monthly_Charges, xlab = "Monthly_Charges", main = "", col = "slateblue1")
The distribution of Term is strongly skewed, while Monthly_Charges exhibits a roughly bimodal distribution. Therefore, both variables will be discretized before being used in the regression model.
Term = churn$Term
grp.Term = Term
grp.Term[Term %in% c(0:20)] = "0-20"
grp.Term[Term %in% c(21:40)] = "21-40"
grp.Term[Term %in% c(41:60)] = "41-60"
grp.Term[Term %in% c(61:75)] = "61-75"
Monthly = churn$Monthly_Charges
grp.Monthly = Monthly
grp.Monthly[Monthly <= 60.00] = "<=60"
grp.Monthly[Monthly > 60.00] = ">60"
churn$grp.Term <- grp.Term
churn$grp.Monthly <- grp.Monthly
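The resulting group sizes can be inspected to confirm that no bin is too sparse (a quick check, not in the original analysis):
## counts per discretized group
table(churn$grp.Term)
table(churn$grp.Monthly)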
With Total_Charges dropped and Term and Monthly_Charges discretized, no numerical variables remain in the data set to standardize before constructing a predictive regression model. However, the response variable Churn still needs to be converted into a binary (0/1) variable.
y0 <- churn$Churn
Churn.bin <- rep(0, length(y0))    # default to 0 (customer retained)
Churn.bin[which(y0=="Yes")] = 1    # 1 indicates churn
churn$Churn.bin <- Churn.bin
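Before modeling, it is worth checking the class balance of the new response, since a heavily imbalanced outcome would affect the choice of cutoff probability later on (a quick check, not part of the original analysis):
## distribution of the binary response
table(churn$Churn.bin)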
For this analysis, the data set will be randomly split into two groups: 80% of the observations will form a training set used to evaluate and choose among three candidate models via cross-validation, and the remaining 20% will form a testing set used to assess the chosen final model's predictive accuracy.
set.seed(2023)   # arbitrary seed, assumed here so the 80/20 split is reproducible
n <- dim(churn)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
train <- churn[train.id, ]
test <- churn[-train.id, ]
Three candidate models constructed in a previous analysis will be assessed using 10-fold cross-validation on the training set to determine which exhibits the least predictive error. The three candidate models are:
Candidate 1 (Full Model): The full model uses every categorical variable in the data set as well as the discretized Term and Monthly_Charges variables as its predictors, with the binary Churn variable as the response.
Candidate 2 (Reduced Model): The reduced model consists of only the three most statistically significant predictor variables in the full model, i.e., Internet_service, Agreement_period, & grp.Term.
Candidate 3 (Automatic Variable-Selection Model): This model was constructed using a forward stepwise variable-selection process between the reduced and full models; a sketch of this step is shown below. The selected predictor variables are Internet_service, Agreement_period, grp.Term, grp.Monthly, Technical_support, Streaming_Videos, & Phone_service.
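For reference, this kind of forward selection can be reproduced with base R's step() function. The sketch below is an assumption about how the earlier analysis proceeded, starting from the reduced model and allowing terms from the full model to enter based on AIC:
## forward stepwise selection sketch (assumed reconstruction):
## lower scope = reduced model, upper scope = full model
full.formula = Churn.bin ~ Sex + Marital_Status + Phone_service + International_plan +
  Voice_mail_plan + Multiple_line + Internet_service + Technical_support +
  Streaming_Videos + Agreement_period + grp.Term + grp.Monthly
reduced.fit = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term,
                  family = binomial(link = "logit"), data = train)
step(reduced.fit, scope = list(lower = formula(reduced.fit), upper = full.formula),
     direction = "forward")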
With the candidate models identified and the data set split into training and testing sets, cross-validation can be performed on each to estimate its predictive error. The cross-validation will be performed with k = 10 folds, and for convenience a common cutoff of 0.5 will be used to define the predicted outcome for each model.
k = 10
fold.size = round(dim(train)[1]/k)
## prediction-error (PE) vectors for the candidate models
PE1 = rep(0, k)
PE2 = rep(0, k)
PE3 = rep(0, k)
for(i in 1:k){
## validation fold and remaining training folds for iteration i
valid.id = (fold.size*(i-1)+1):(fold.size*i)
valid = train[valid.id, ]
train.dat = train[-valid.id,]
## full model (fit on the training folds only)
candidate01 = glm(Churn.bin ~ Sex + Marital_Status + Phone_service + International_plan + Voice_mail_plan + Multiple_line + Internet_service + Technical_support + Streaming_Videos + Agreement_period + grp.Term + grp.Monthly, family = binomial(link = "logit"), data = train.dat)
## reduced model
candidate02 = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term, family = binomial(link = "logit"), data = train.dat)
## automatic variable selection model
candidate03 = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term + grp.Monthly + Technical_support + Streaming_Videos + Phone_service, family = binomial(link = "logit"), data = train.dat)
## predicted probabilities of each candidate model
pred01 = predict(candidate01, newdata = valid, type="response")
pred02 = predict(candidate02, newdata = valid, type="response")
pred03 = predict(candidate03, newdata = valid, type="response")
## confusion matrices: ftable() cross-tabulates predicted vs. observed labels
confusion.matrix01 = ftable(as.vector((pred01>0.5)),(valid$Churn.bin=="1"))
confusion.matrix02 = ftable(as.vector((pred02>0.5)),(valid$Churn.bin=="1"))
confusion.matrix03 = ftable(as.vector((pred03>0.5)),(valid$Churn.bin=="1"))
PE1[i] = (confusion.matrix01[1,2] + confusion.matrix01[2,1])/length(pred01)
PE2[i] = (confusion.matrix02[1,2] + confusion.matrix02[2,1])/length(pred02)
PE3[i] = (confusion.matrix03[1,2] + confusion.matrix03[2,1])/length(pred03)
}
library(knitr)   # for kable()
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average of Prediction Errors of Candidate Models")
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.21625 | 0.24125 | 0.22125 |
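Because the three averages are close, the fold-to-fold variability of the prediction errors is worth a glance before declaring a winner (a quick check, not part of the original analysis):
## standard deviation of the prediction error across the 10 folds, per model
apply(cbind(PE1, PE2, PE3), 2, sd)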
As seen in the table above, Candidate Model 1 (i.e., the full model) has the lowest mean prediction error. Therefore, under the cross-validation criterion, this model will be chosen as the final predictive regression model. The accuracy rate of the final model can now be calculated using the testing portion of the data.
## refit the chosen full model on the entire training set before testing
candidate01 = update(candidate01, data = train)
pred01 = predict(candidate01, newdata = test, type="response")
confusion.matrix01 = ftable(as.vector((pred01>0.5)),(test$Churn.bin=="1"))
accuracy = (confusion.matrix01[1,1] + confusion.matrix01[2,2])/length(pred01)
accuracy
According to this output, the final model correctly predicts the churn status of 77.5% of the observations in the testing set.
In order to employ the ROC method for choosing a final model, the true positive rate (TPR = TP/(TP + FN)) and false positive rate (FPR = FP/(FP + TN)) of each candidate model must be calculated across a sequence of cutoff probabilities. A function computing both is defined below.
TPR.FPR = function(pred){
  prob.seq = seq(0, 1, length=15)       # sequence of cutoff probabilities
  pn = length(prob.seq)
  true.lab = as.vector(test$Churn.bin)  # observed labels from the testing set
  TPR = NULL
  FPR = NULL
  ## compute TPR and FPR at each cutoff
  for (i in 1:pn){
    pred.lab = as.vector(ifelse(pred > prob.seq[i], "1", "0"))
    TPR[i] = length(which(true.lab=="1" & pred.lab=="1"))/length(which(true.lab=="1"))
    FPR[i] = length(which(true.lab=="0" & pred.lab=="1"))/length(which(true.lab=="0"))
  }
  cbind(FPR = FPR, TPR = TPR)
}
With TPR and FPR defined, they can be calculated for each candidate model at every cutoff probability. The results can then be plotted to create ROC curves.
## full model (fit on the training set only)
candidate01 = glm(Churn.bin ~ Sex + Marital_Status + Phone_service + International_plan + Voice_mail_plan + Multiple_line + Internet_service + Technical_support + Streaming_Videos + Agreement_period + grp.Term + grp.Monthly, family = binomial(link = "logit"), data = train)
## reduced model
candidate02 = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term, family = binomial(link = "logit"), data = train)
## automatic variable selection model
candidate03 = glm(Churn.bin ~ Internet_service + Agreement_period + grp.Term + grp.Monthly + Technical_support + Streaming_Videos + Phone_service, family = binomial(link = "logit"), data = train)
## predicted probabilities
pred01 = predict(candidate01, newdata = test, type="response")
pred02 = predict(candidate02, newdata = test, type="response")
pred03 = predict(candidate03, newdata = test, type="response")
####
## ROC curves: compute the (FPR, TPR) points once per model, then plot
roc01 = TPR.FPR(pred01)
roc02 = TPR.FPR(pred02)
roc03 = TPR.FPR(pred03)
plot(roc01[,1], roc01[,2],
     type="b", col=2, lty=1, xlim=c(0,1), ylim=c(0,1),
     xlab = "FPR: 1 - specificity",
     ylab = "TPR: sensitivity",
     main = "ROC Curve",
     cex.main = 0.8,
     col.main = "midnightblue")
lines(roc02[,1], roc02[,2], type="b", col=3, lty=2)
lines(roc03[,1], roc03[,2], type="b", col=4, lty=3)
legend("bottomright", c("Candidate #1", "Candidate #2", "Candidate #3"),
       col=2:4, lty=1:3, cex = 0.8, bty="n")
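To attach a number to the visual comparison, the area under each curve (AUC) can be approximated from the same (FPR, TPR) points with the trapezoidal rule. This is a rough sketch rather than part of the original analysis; a package such as pROC would give an exact AUC.
## approximate AUC via the trapezoidal rule over the plotted (FPR, TPR) points
trap.auc = function(roc){
  ord = order(roc[, "FPR"])
  x = roc[ord, "FPR"]
  y = roc[ord, "TPR"]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}
c(AUC1 = trap.auc(roc01), AUC2 = trap.auc(roc02), AUC3 = trap.auc(roc03))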
Contrary to the results of the cross-validation procedure, the plot of the ROC curves suggests that Candidate Model 3 (i.e., the automatic variable-selection model) actually has the best global goodness of fit, slightly better than Candidate Model 1, which cross-validation selected as the best. The predictive accuracy rate of this model can also be calculated using the testing data set.
pred03 = predict(candidate03, newdata = test, type="response")
confusion.matrix03 = ftable(as.vector((pred03>0.5)),(test$Churn.bin=="1"))
accuracy = (confusion.matrix03[1,1] + confusion.matrix03[2,2])/length(pred03)
accuracy
According to this output, Candidate 3 has a slightly higher predictive accuracy rate than Candidate 1, correctly predicting the churn status of 79% of the testing-set observations. It is worth noting that across all of these metrics (mean prediction error, ROC curves, & predictive accuracy rate), the difference in apparent goodness of fit between Candidate 1 & Candidate 3 is minuscule, so an argument could be made for choosing either as the final predictive regression model. However, Candidate 3 may ultimately be preferable because it is the simpler model (7 predictors versus 12). Candidate 2, by contrast, is clearly the weakest of the three.
Again, both the full model and the automatic variable-selection model correctly predict the occurrence of customer churn at a respectable rate. However, this level of accuracy may not be sufficient for a company making important business decisions as it attempts to minimize customer churn from month to month. It may therefore be worthwhile to construct new models, and to explore cutoff probabilities other than 0.5 that make more practical sense in the context of churn prediction.