This data set contains customer-level information for a telecommunications company. Each customer has a unique set of characteristics relating to the services they have used.
The telecommunications sector is growing quickly, and service providers are increasingly focused on growing their subscriber bases. Retaining current clients has become one of the biggest challenges of surviving in this competitive industry, since the cost of acquiring a new customer is widely reported to be significantly higher than the cost of keeping an existing one.
It is therefore crucial for telecommunications companies to employ advanced analytics to understand consumer behavior and predict whether or not clients are going to leave the business.
The following are the questions and objectives that describe and explain the main purpose of this project.
The response variable is Churn, a binary variable with two values: yes and no. The value yes means that a customer left the company, while no means that the customer is still active.
Three models will be built, and the best of them used to analyse the relationship between the binary response variable, Churn, and the predictor variables, by examining how the predictors affect the likelihood of the response taking its larger value (yes). The three models are: logistic regression, neural network and decision tree.
The total number of records in this data set is 1000. It consists of 14 variables, including the response variable, Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:
- Sex: sex of the customer (categorical)
- Marital_status: marital status of the customer (categorical)
- Term: term, in months (numerical)
- Phone_service: phone service (categorical)
- International_plan: international plan (categorical)
- Voice_mail_plan: voice mail plan (categorical)
- Multiple_line: multiple line (categorical)
- Internet_service: internet service (categorical)
- Technical_support: technical support (categorical)
- Streaming_videos: streaming videos (categorical)
- Agreement_period: agreement period (categorical)
- Monthly_charges: monthly charges (numerical)
- Total_charges: total charges (numerical)
- Churn: churn, yes or no (the response)
A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt
dat <- read.table(file = "https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt", # read in the txt file
header = TRUE, # header row in the txt file
sep = ",", # separator or delimiter
dec = ".") # symbol used for decimal values
The entire data set was scanned to determine which Exploratory Data Analysis (EDA) tools to use for feature engineering. All the numerical and categorical variables were examined closely, and no missing values were found.
summary(dat) # produces summarized descriptive statistics for every
# variable in the dataset
The above summary table indicates that there are no missing values in any of the variables.
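As an additional check (a minimal sketch, not part of the original output), the missing values per column can be counted explicitly:
colSums(is.na(dat)) # number of NA values in each column; all zeros here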
Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.
The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.
par(mfrow = c(2,2))
sx <- table(dat$Sex) # frequency table for sex
ms <- table(dat$Marital_Status) # frequency table for marital status
pservice <- table(dat$Phone_service) # frequency table for phone service
vmp <- table(dat$Voice_mail_plan) # frequency table for voicemail plan
barplot(sx, xlab="Sex", ylab="counts", main="Distribution of Sex", col=c("lightgreen", "brown")) # barplot for sex
# xlab = xaxis label
# ylab = yaxis label
# col = color of the bars
# main = barplot heading
barplot(vmp, xlab="Voicemail Plan", ylab="counts", main="Distribution of Voicemail plan", col=c("purple", "red")) # barplot for voicemail plan
barplot(ms, xlab="Marital status", ylab="counts", main="Distribution of Marital status", col=c("blue", "orange")) # barplot for marital status
barplot(pservice, xlab="Phone Service", ylab="counts", main="Distribution of phone Service", col=c("pink", "yellow")) # barplot for phone service
From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.
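The percentages quoted here can be reproduced from the frequency tables; for example, for sex:
round(prop.table(sx)*100, 1) # percentage of customers in each sex category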
par(mfrow = c(2,2))
ml <- table(dat$Multiple_line) # frequency table for Multiple line
IS <- table(dat$Internet_service) # frequency table for Internet service
barplot(ml, xlab="Multiple Line", ylab="counts", main="Distribution of Multiple Line", col=c("purple", "red", "orange")) # barplot for multiple line
barplot(IS, xlab="Internet service", ylab="counts", main="Distribution of Internet service", col=c("pink", "lightgreen", "yellow", "brown"), names.arg = c("Cable", "DSL", "F.O", "None")) # barplot for Internet service
AP <- table(dat$Agreement_period) # frequency table for Agreement period
CN <- table(dat$Churn) # frequency table for Churn
barplot(CN, xlab="Churn", ylab="counts", main="Distribution of Churn", col=c("purple", "red", "orange")) # barplot for Churn
barplot(AP, xlab="Agreement period", ylab="Counts", main="Distribution of Agreement Period", col=c("green", "yellow", "blue"), names.arg = c("monthly", "1year", "2years"))
# barplot for Agreement period
44.4% of the customers have multiple lines. Within the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use fiber optic and 20.8% have none. The majority were on a monthly contract. In this data, 74.1% of customers had not left the company.
One of the categorical variables, International plan, had 3 groups: No, Yes and yes, with frequencies 429, 309 and 262 respectively. This was an input error made when the data was collected.
Groups <- c("No", "yes", "Yes") # groups from the frequency table of international plan
Freq <- c(429, 262, 309) # corresponding frequencies
AB <- data.frame(Groups, Freq) # data frame of the groups and their frequencies
kable(AB) # display as a table
In order to rectify this, a new variable called grp.IP was created, containing only 2 distinct groups of the International plan variable: No and Yes.
Also, Technical support and Streaming videos each had 3 groups: Yes, No and No internet. The No and No internet groups were combined into a single group because they are close in meaning.
par(mfrow=c(2,2))
grp.IP <- ifelse(dat$International_plan == "No", "No",
ifelse(dat$International_plan == "yes", "Yes",
ifelse(dat$International_plan == "Yes", "Yes", "NA")))
# regrouped the original variable into 2 groups
G.IP <- table(grp.IP) # frequency table for the new variable
barplot(G.IP, xlab="Grp.International plan", ylab="counts", main="Distribution of Grp.IP", col=c("blue", "orange")) # barplot of the new international plan variable
tech_support <- ifelse(dat$Technical_support=="Yes", "Yes", "No")
TS <- table(tech_support) # frequency table for Technical support
barplot(TS, xlab="Technical Support", ylab="counts", main="Distribution of Technical Support", col=c("lightblue", "red")) # barplot of Technical support
stream_videos <- ifelse(dat$Streaming_Videos=="Yes", "Yes", "No")
SV <- table(stream_videos) # frequency table for streaming videos
barplot(SV, xlab="Streaming videos", ylab="counts", main="Distribution of Streaming Videos", col=c("green", "yellow")) # barplot of streaming videos
57.1% of the customers had an international plan. About a third of the
customers had technical support and 41% had video streaming.
There are 3 numerical variables: Term, Monthly charges and Total charges. Their distributions are as follows:
par(mfrow=c(1,3))
hist(dat$Term, xlab="Term in months", ylab="counts", main="Distribution of Term", col="lightgreen") # histogram of term
den=density(dat$Monthly_Charges) # kernel density estimates for monthly charges
plot(den, xlab="monthly charges", main="Density of Monthly Charges", col="brown")
# plot of the density of monthly charges
hist(dat$Total_Charges, xlab="Total Charges", ylab="counts", main="Dist. of Total Charges", col="pink") # histogram of total charges
The histogram of Term shows a non-symmetric pattern, with the highest frequency between 0 and 5 months and the lowest between 35 and 40 months.
This is quite different from the distribution of total charges, which is right-skewed, meaning that the mean is greater than the median. Here the highest frequency is between 0 and 1000 and the lowest between 8000 and 9000. The distribution has a step-wise pattern: smaller amounts have higher frequency and larger amounts have lower frequency.
The distribution of monthly charges is represented by the density plot. It is bimodal, with the first peak at approximately 20 and the second (higher) peak at approximately 90. The lowest point on the plot, corresponding to the lowest frequency, is at approximately 40.
Since the density plot of monthly charges is bimodal, with peaks around 20-30 and 80-90, this variable will be discretized for future models and algorithms. Monthly charges ranges from 18.95 to 116.25 and is grouped as follows: less than 30, low charges; 30 to 80, moderate charges; greater than 80, high charges.
The following table shows the frequency of the grouped variable, grp.month
## convert monthly charges into a character variable with 3 groups
grp.month <- ifelse(dat$Monthly_Charges < 30, "Low",
ifelse(dat$Monthly_Charges > 80, "High", "Moderate"))
## table showing the frequency of the new grouped variable
kable(table(grp.month), col.names=c("Monthly Charges", "Frequency"))
| Monthly Charges | Frequency |
|---|---|
| High | 406 |
| Low | 223 |
| Moderate | 371 |
Pairwise associations between variables were assessed graphically in three scenarios: two categorical variables, two numerical variables, and one categorical with one numerical variable.
This was done to determine whether the response is independent of the categorical variables. Variables found to be independent of the response will be excluded from the subsequent models. Mosaic plots are a convenient way to show whether two categorical variables are dependent; when they are independent, all proportions are the same and the boxes line up in a grid.
par(mfrow=c(3,2))
# mosaic plot for sex, marital status, phone service, Grp.IP, Voicemail plan and multiple
# line with Churn)
mosaicplot(Sex ~ Churn, data=dat, col=c("pink", "orange"), main="Sex Vs Churn")
mosaicplot(Marital_Status ~ Churn, data=dat, col=c("lightblue", "purple"), main="Marital status Vs Churn")
mosaicplot(Phone_service ~ Churn, data=dat, col=c("yellow", "green"), main= "Phone Service Vs Churn")
mosaicplot(grp.IP ~ Churn, data=dat, col=c("lightgreen", "blue"), main="International plan Vs Churn")
mosaicplot(Voice_mail_plan ~ Churn, data=dat, col=c("gray", "red"), main="Voice_mail_plan Vs Churn")
mosaicplot(Multiple_line ~ Churn, data=dat, col=c("lightblue", "brown"), main="Multiple_line Vs Churn")
From the above mosaic plots, it can be seen that sex, phone service, voicemail plan and multiple line appear to be independent of the response variable, Churn, because the proportion of churn cases in the individual categories of these variables appears nearly identical.
par(mfrow=c(3,2))
## mosaic plot for Agreement period, Internet service, Technical support, Streaming videos, ## grp.month with churn
mosaicplot(Agreement_period ~ Churn, data=dat, col=c("lightgreen", "blue"), main="Agreement Period Vs Churn")
mosaicplot(Internet_service ~ Churn, data=dat, col=c("yellow", "red"), main="Internet Service Vs Churn")
mosaicplot(tech_support ~ Churn, data=dat, col=c("pink", "green"), main="Technical support Vs Churn")
mosaicplot(stream_videos ~ Churn, data=dat, col=c("orange", "purple"), main="Streaming Videos Vs Churn")
mosaicplot(grp.month ~ Churn, data=dat, col=c("brown", "gold", "lightblue"), main="monthly charges Vs Churn")
In addition to marital status and international plan, the variables agreement period, internet service, monthly charges (grouped), technical support and streaming videos are not independent of the response variable, Churn.
A Pearson chi-square test was carried out to confirm the independence of sex, phone service, voice mail plan and multiple line from the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. Below are the chi-square p-values for each of the variables:
# chi square estimates for voicemail plan, phone service, sex and multiple line with
# Churn
Chisq.Voicemail <- chisq.test(dat$Voice_mail_plan, dat$Churn)
Chisq.Phoneservice <- chisq.test(dat$Phone_service, dat$Churn)
Chisq.sex <- chisq.test(dat$Sex, dat$Churn)
Chisq.multipleline <- chisq.test(dat$Multiple_line, dat$Churn)
chisq.table <- data.frame(Chisq.sex$p.value, Chisq.Phoneservice$p.value, Chisq.Voicemail$p.value, Chisq.multipleline$p.value) # collect the chi-square p-values in a data frame
chisq.table %>% gt() # display as a table
| Chisq.sex.p.value | Chisq.Phoneservice.p.value | Chisq.Voicemail.p.value | Chisq.multipleline.p.value |
|---|---|---|---|
| 0.1248683 | 0.3680155 | 0.6651237 | 0.3384263 |
The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.
ggpairs(dat, # Data frame
columns = c(3,12,13), # numeric columns
aes(color = Churn, # Color by response variable (cat. variable)
alpha = 0.5))
The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.
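These correlations can also be computed directly (a quick check, using the same numeric columns 3, 12 and 13 passed to ggpairs() above):
cor(dat[, c(3,12,13)]) # Pearson correlation matrix of Term, Monthly_Charges and Total_Charges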
The main diagonal shows stacked density curves, which reveal potential differences in the distribution of the underlying numeric variable between the churn and non-churn groups; in other words, the stacked density curves show the relationship between a numeric and a categorical variable. These curves do not completely overlap, indicating some correlation between each of these numeric variables and the binary response variable, Churn.
Because of this, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.
This clustering algorithm divides a set of n observations into k clusters. In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set.
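Formally, k-means chooses cluster assignments $C_1,\dots,C_k$ with centers $\mu_1,\dots,\mu_k$ to minimize the total within-cluster sum of squares:

$$\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\ \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$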
## subset the data
clust.data <- dat[, c(3,12,13)]
scaled.data <- as.data.frame(scale(clust.data)[,1:3])
A subset of the original data is created containing only the 3 numeric variables: term, monthly charges and total charges. This subset is then scaled.
distance <- get_dist(scaled.data)
fviz_dist(distance, gradient = list(low = "yellow", mid = "orange", high = "darkred"), show_labels = FALSE)
Heatmap representation of potential clusters
The above heatmap indicates that different clusters exist in this data (based on the three numerical variables).
fviz_nbclust(scaled.data, kmeans, method = "wss")
The elbow method is used to find the optimal number of clusters: the fviz_nbclust() function creates a plot of the number of clusters vs. the total within sum of squares. In this plot there appears to be an elbow or "bend" at k = 5 clusters, which is taken as the optimal number of clusters.
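The same elbow computation can be done by hand (a sketch of what fviz_nbclust() computes internally):
set.seed(20)
wss <- sapply(1:10, function(k) kmeans(scaled.data, centers = k, nstart = 25)$tot.withinss)
# total within-cluster sum of squares for k = 1, ..., 10
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within sum of squares")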
k-means clustering is performed on the dataset using the optimal value for k of 5:
set.seed(20) # make this reproducible
k5 <- kmeans(x = scaled.data, # name of dataset
centers = 5, # number of clusters
iter.max = 10,
nstart = 25, # number of initial configurations
algorithm = "Hartigan-Wong",
trace = FALSE)
There were 5 clusters of sizes 204, 252, 191, 221 and 132; that is, 204 customers were assigned to cluster 1, 252 to cluster 2, 191 to cluster 3, 221 to cluster 4 and 132 to cluster 5.
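These sizes can be read directly from the fitted object:
k5$size # sizes of the five clusters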
The clusters are visualized with the fviz_cluster() function on a scatterplot whose axes are the first two principal components:
fviz_cluster(k5, data = scaled.data)
Scatter Plot of Final cluster results
kable(aggregate(clust.data, by=list(cluster=k5$cluster), mean), col.names = c("Cluster", "Term", "Monthly Charges", "Total Charges"))
At the end of the exploratory data analysis, the new cluster variable will be added into the original dataset.
cluster <- k5$cluster
Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, phone service, voicemail plan and multiple line were dropped because of their independence from the response variable, Churn.
International plan was also dropped and the new variable grp.IP was kept instead. grp.month was also kept, as an alternative to its numerical counterpart, monthly charges, for modelling. The cluster variable was added as well. The number of variables in the final dataset is 12.
## removed all the variables that are not needed for subsequent modelling
bat <- dat[,-c(1, 4, 5, 6, 7, 9, 10)]
## added the new variables to the trimmed dataset
bat.total <- cbind(bat, grp.month, tech_support, stream_videos, grp.IP, cluster)
## saved the new dataset to a local drive
write.csv(bat.total, "C:\\Users\\echef\\Documents\\sta551\\Datasets\\cx_churn.csv")
The following variables will be used for subsequent modelling: Marital_Status, Term, Internet_service, tech_support, stream_videos, Agreement_period, Monthly_Charges, grp.month, Total_Charges, grp.IP, cluster and Churn.
In building a logistic regression model for this analysis, it is necessary to make sure that all assumptions are satisfied. The first assumption is that the response variable must be binary; this holds here, since the response, Churn, takes the values yes and no. The second is that the predictor variables are uncorrelated; because the primary aim of this analysis is to predict which customers are likely to churn, rather than to interpret the role of each predictor, severe multicollinearity is of less concern. The third is that the functional form of the predictor variables is correctly specified.
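For reference, the fitted model has the standard logistic form, where $p = P(\mathrm{Churn} = \mathrm{yes} \mid x)$:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$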
Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are marital status, streaming videos, technical support and international plan; agreement period and grp.month have 3 groups each, while internet service has 4 groups. All the character variables were converted to factors with the corresponding levels.
The numeric variables are Term, Monthly charges and Total charges. The cluster variable takes 5 values corresponding to the 5 groups (1-5) and was also converted to a factor with 5 levels. In total, there are 10 predictor variables (each model can contain either the continuous monthly charges or the grouped version, but not both).
## first we read in the dataset containing all the variables that
## will be used for modelling
mat <- read.csv("https://github.com/chinwex/datasets/raw/main/cx_churn.csv")[,-1]
## converted all character variables to factor
marital.status <- as.factor(mat$Marital_Status) # has 2 levels
internet.service <- as.factor(mat$Internet_service) # has 4 levels
churn <- as.factor(mat$Churn) # has 2 levels
technical.support <- as.factor(mat$tech_support) # has 2 levels
streaming.videos <- as.factor(mat$stream_videos) # has 2 levels
agreement.period <- as.factor(mat$Agreement_period) # has 3 levels
International.plan <- as.factor(mat$grp.IP) # has 2 levels
grpd.month <- as.factor(mat$grp.month) # has 3 levels
cluster <- as.factor(mat$cluster) # has 5 levels
#created duplicates of the 3 numerical variables
Term <- mat$Term
Monthly_Charges <- mat$Monthly_Charges
Total_Charges <- mat$Total_Charges
# put all the 11 variables for modelling in one new dataset
dat.1 <- data.frame(marital.status, Term, internet.service, technical.support, streaming.videos, agreement.period, grpd.month, International.plan, Monthly_Charges, Total_Charges, cluster, churn)
First, a logistic regression model containing all predictor variables, with monthly charges kept as a numeric variable, was built. This is called the first model.
# the logistic regression model containing all predictors
first.model <- glm(churn ~ marital.status + Term + technical.support + internet.service + streaming.videos + agreement.period + International.plan + cluster + Monthly_Charges + Total_Charges, family = binomial, data = dat.1)
#summary of the regression coefficients
firstmodel.table <- summary(first.model)$coef
#table showing the significance tests in the regression model
kable(firstmodel.table, caption = "Significance Tests for the First Model")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.3607015 | 0.9453676 | -2.4971254 | 0.0125205 |
| marital.statusSingle | -0.1758372 | 0.3828222 | -0.4593183 | 0.6460056 |
| Term | -0.0341104 | 0.0188862 | -1.8061068 | 0.0709017 |
| technical.supportYes | -0.7053120 | 0.2225359 | -3.1694304 | 0.0015274 |
| internet.serviceDSL | -0.5606394 | 0.3616395 | -1.5502715 | 0.1210764 |
| internet.serviceFiber optic | -0.3355105 | 0.2291304 | -1.4642778 | 0.1431181 |
| internet.serviceNo Internet | -1.9911186 | 0.6167162 | -3.2285818 | 0.0012441 |
| streaming.videosYes | -0.0514545 | 0.2608122 | -0.1972855 | 0.8436041 |
| agreement.periodOne year contract | -1.8023818 | 0.3242817 | -5.5580753 | 0.0000000 |
| agreement.periodTwo year contract | -1.9417424 | 0.4175161 | -4.6507007 | 0.0000033 |
| International.planYes | 0.1843095 | 0.2074822 | 0.8883150 | 0.3743713 |
| cluster2 | -0.2465339 | 0.3868904 | -0.6372190 | 0.5239822 |
| cluster3 | 1.1873077 | 0.7592222 | 1.5638475 | 0.1178534 |
| cluster4 | 1.5778116 | 0.4738521 | 3.3297551 | 0.0008692 |
| cluster5 | 2.6631985 | 0.8390290 | 3.1741436 | 0.0015028 |
| Monthly_Charges | 0.0383224 | 0.0121460 | 3.1551530 | 0.0016041 |
| Total_Charges | 0.0000022 | 0.0002110 | 0.0104330 | 0.9916758 |
The AIC of the first model is 859.1497, and it contains 16 coefficient terms. Some of the terms were significant at the .05 level: the intercept, technical support, no internet, one-year agreement, two-year agreement, cluster 4, cluster 5 and monthly charges. The one-year and two-year agreements were significantly different from the baseline monthly contract, clusters 4 and 5 were significantly different from cluster 1, and no internet was significantly different from cable.
first.model$aic # the aic of the first model
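Although odds ratios are not reported in the original analysis, the coefficients can be exponentiated to obtain them, which makes interpretation easier:
round(exp(coef(first.model)), 3) # odds ratios; values below 1 correspond to reduced odds of churn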
Then another model containing all the predictors, but this time with monthly charges as a factor (grpd.month), was built.
# the logistic regression model containing all predictors, with grpd.month
second.model <- glm(churn ~ marital.status + Term + technical.support + internet.service + streaming.videos + agreement.period + International.plan + cluster + grpd.month + Total_Charges, family = binomial, data = dat.1)
#summary of the regression coefficients
secondmodel.table <- summary(second.model)$coef
#table showing the significance tests in the regression model
kable(secondmodel.table, caption = "Significance Tests for the Second Model")
second.model$aic # the aic of the second model
The AIC of the second model is 868.3467, and it contains 17 coefficient terms. Here, the variables significant at the .05 level are: Term, technical support, internet service (DSL), internet service (no internet), one-year agreement, two-year agreement, cluster 4 and cluster 5.
Compared on AIC, the first model is better, with the lower value of 859.1497. This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling is carried out starting from the first model. Variables that must be included in the model, based on results from other studies and analyses, are agreement period, term and monthly charges. With these three variables, the reduced model is built.
# the reduced model containing the 3 main variables
reduced.model <- glm(churn ~ Term + agreement.period + Monthly_Charges, family = binomial, data = dat.1)
#summary of the coefficients of the reduced model
reducedmodel.table <- summary(reduced.model)$coef
#table showing the significance tests of the reduced model
kable(reducedmodel.table, caption = "Significance Tests for Reduced Model")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -1.8270051 | 0.2414150 | -7.567903 | 0.0000000 |
| Term | -0.0174912 | 0.0051167 | -3.418431 | 0.0006298 |
| agreement.periodOne year contract | -1.8564358 | 0.2919578 | -6.358575 | 0.0000000 |
| agreement.periodTwo year contract | -2.1559901 | 0.3681332 | -5.856549 | 0.0000000 |
| Monthly_Charges | 0.0266624 | 0.0034836 | 7.653632 | 0.0000000 |
reduced.model$aic # aic of the reduced model
The AIC of the reduced model is 896.8588, and it contains 4 coefficient terms, all significant at the .05 level. All the significant variables from the first model were then added to the reduced model to build a fourth model: technical support, internet service, agreement period, cluster and monthly charges. Since monthly charges and agreement period were already present in the reduced model, only technical support, internet service and cluster were added from the first model.
# A fourth model containing all variables in the reduced model and significant variables from the first model.
fourth.model <- glm(churn ~ Term + technical.support + internet.service + cluster + agreement.period + Monthly_Charges, family = binomial, data = dat.1)
# summary of the regression coefficients of the fourth model
fourthmodel.table <- summary(fourth.model)$coef
# table showing the significance tests for the fourth model
kable(fourthmodel.table, caption = "Significance Tests for Fourth Model")
fourth.model$aic #aic for fourth model
The AIC of the fourth model was 852.3914, and it contains 12 coefficient terms. Two dummy variables of internet service (fiber optic and DSL) and two of cluster (2 and 3) were not significant at the .05 level. The results are not shown here because they are exactly those obtained for the next model below. The next step is to use an automatic variable selection procedure to find the best model.
This is done with the automatic variable selection function, step(). Starting from the first model, insignificant variables are dropped using AIC as the inclusion/exclusion criterion.
# the automatic variable selection method
modelA = step(first.model,
scope=list(lower=formula(reduced.model),upper=formula(first.model)),
data = dat.1,
direction = "backward",
trace = 0) # trace = 0: suppress the detailed selection process
# summary of the coefficient of modelA
modelA.coef = summary(modelA)$coef
#table showing the significant tests of modelA
kable(modelA.coef , caption = "Summary Table of Significant Tests for model A")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.2249631 | 0.7230624 | -3.0771384 | 0.0020900 |
| Term | -0.0346433 | 0.0126062 | -2.7481259 | 0.0059937 |
| technical.supportYes | -0.7045058 | 0.2208271 | -3.1903049 | 0.0014212 |
| internet.serviceDSL | -0.5500562 | 0.3356312 | -1.6388709 | 0.1012401 |
| internet.serviceFiber optic | -0.3118063 | 0.2279093 | -1.3681160 | 0.1712758 |
| internet.serviceNo Internet | -1.9455052 | 0.5565853 | -3.4954308 | 0.0004733 |
| agreement.periodOne year contract | -1.8075414 | 0.3242020 | -5.5753562 | 0.0000000 |
| agreement.periodTwo year contract | -1.9237553 | 0.4128618 | -4.6595626 | 0.0000032 |
| cluster2 | -0.2987356 | 0.3767943 | -0.7928348 | 0.4278741 |
| cluster3 | 1.0961043 | 0.7008642 | 1.5639326 | 0.1178334 |
| cluster4 | 1.4185471 | 0.4051955 | 3.5008950 | 0.0004637 |
| cluster5 | 2.4506129 | 0.7801954 | 3.1410247 | 0.0016836 |
| Monthly_Charges | 0.0382299 | 0.0086272 | 4.4312985 | 0.0000094 |
modelA$aic # aic of modelA
This model is identical to the fourth model, with the same AIC of 852.3914 and 12 coefficient terms. The two dummy variables of internet service (fiber optic and DSL) and two of cluster (2 and 3) were again not significant at the .05 level.
The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.
The predicted response is compared to the original response. This is shown in the following table.
#puts the predicted values in a dataset
pred.churn <- predict(modelA, type="response")
## threshold probability
cut.off.prob = 0.5
## This groups the response as 1 = Yes and 0 = No
pred.response <- ifelse(pred.churn > cut.off.prob, "Yes", "No")
## Add the new predicted response to the dataset, dat.1
dat.1$Pred.Response <- pred.response
## table showing the first 4 observations in the dataset with the predicted response column
kable(head(dat.1[,-c(5, 7, 8, 10)], 4), caption = "Dataset with model predicted response", col.names = c("Mar.status", "Term", "Int.service", "Tech.support", "Agr.period", "Month.charges", "cluster", "churn", "Predicted"))
| Mar.status | Term | Int.service | Tech.support | Agr.period | Month.charges | cluster | churn | Predicted |
|---|---|---|---|---|---|---|---|---|
| Married | 16 | Cable | Yes | Monthly contract | 98.05 | 1 | Yes | Yes |
| Married | 70 | Cable | Yes | One year contract | 75.25 | 3 | No | No |
| Married | 36 | Cable | Yes | Monthly contract | 73.35 | 2 | No | No |
| Married | 72 | Cable | Yes | One year contract | 112.60 | 3 | No | No |
The predicted response for the first 4 observations in the dataset closely matches the original response.
The frequency of the original response variable was then compared with that of the predicted response variable: in the original variable, 74.1% of customers are still with the company, while the predicted response puts this at 77.4%. Therefore, this model is acceptable.
kable(table(dat.1$Pred.Response), caption="Frequency of Predicted Response Variable")
kable(table(dat.1$churn), caption="Frequency of Original Response Variable")
A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:
## A hypothetical dataset containing new information was created
dat.2 = data.frame(Term = c(38, 50, 14, 4),
technical.support = c("Yes", "No", "No", "Yes"),
internet.service = c("Cable", "No Internet", "No Internet", "Cable"),
agreement.period = c("Monthly contract", "Monthly contract", "One year contract", "Monthly contract"),
Monthly_Charges = c(100.5, 87.6, 110.50, 90),
cluster = c("3", "1", "2", "1"))
## ModelA was used to predict the response
pred.churn.1 = predict(modelA, newdata = dat.2, type="response")
## threshold probability
cut.off.prob = 0.5
## This groups the response as 1 = Yes and 0 = No
pred.response.1 = ifelse(pred.churn.1 > cut.off.prob, "Yes", "No")
## Add the new predicted response to dat.2
dat.2$Pred.Response.1 = pred.response.1
## table showing the new dataset with the predicted response
kable(dat.2, caption = "Predicted Values of New Data", col.names = c("Term", "Tech.support", "Internet.service", "Agr.period", "Monthly.charges", "cluster", "Predicted"))
| Term | Tech.support | Internet.service | Agr.period | Monthly.charges | cluster | Predicted |
|---|---|---|---|---|---|---|
| 38 | Yes | Cable | Monthly contract | 100.5 | 3 | Yes |
| 50 | No | No Internet | Monthly contract | 87.6 | 1 | No |
| 14 | No | No Internet | One year contract | 110.5 | 2 | No |
| 4 | Yes | Cable | Monthly contract | 90.0 | 1 | Yes |
Since the sample size is large, the data was split randomly 70%:30%, with 70% for training and validating models and 30% for testing. Of the 1000 observations in total, 700 went into the training data set and 300 into the testing data set.
df <- read.csv("https://github.com/chinwex/datasets/raw/main/cx_churn.csv")[,-c(1,9)]
## convert all the character variables with 2 groups into 0 and 1 values
df$Marital_Status <- ifelse(df$Marital_Status=="Single", 1, 0)
df$tech_support <- ifelse(df$tech_support=="Yes", 1, 0)
df$Churn <- ifelse(df$Churn=="Yes", 1, 0)
df$Agreement_period <- ifelse(df$Agreement_period=="Monthly contract", 0,
ifelse(df$Agreement_period=="One year contract", 1, 2))
df$Agreement_period <- as.factor(df$Agreement_period)
df$Internet_service <- ifelse(df$Internet_service=="Cable", 0,
ifelse(df$Internet_service=="DSL", 1,
ifelse(df$Internet_service=="Fiber optic", 2,3)))
df$Internet_service <- as.factor(df$Internet_service)
df$cluster <- as.factor(df$cluster)
df$stream_videos <- ifelse(df$stream_videos=="Yes", 1, 0)
df$grp.IP <- ifelse(df$grp.IP=="Yes", 1, 0)
## gives the dimensions of a dataframe
## number of rows or columns
nn <- dim(df)[1]
set.seed(200)
## divides the data set into 70% training and 30% testing
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
training = df[train.id,]
testing = df[-train.id,]
The three best models from the previous section were used: modelA, the reduced model and the first model. Cross-validation was performed on all 3 models using the training dataset. This involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set) and validating the analysis on the other subset (the validation set).
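For each candidate cut-off probability $c$, the procedure below records the accuracy on each of the 5 validation folds and averages them:

$$\overline{\mathrm{acc}}(c) = \frac{1}{5}\sum_{k=1}^{5} \mathrm{acc}_k(c)$$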
par(mfrow=c(2,2))
ab <- dim(training)[1]/5 # total obs in training dataset is divided by 5
## for modelA
cutoff.prob <- seq(0,1,length = 22)[-c(1,22)] # 20 candidate cut-off probabilities in (0,1)
## null matrix for storing prediction accuracy (5 folds x 20 cut-offs)
pred.accuracy <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## for the reduced model
cutoff.prob3 <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy3 <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## for the first model
cutoff.prob2 <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy2 <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## Created a 5-fold Cross Validation
for (i in 1:5){
valid.id <- ((i-1)*ab + 1):(i*ab)
valid.data <- training[valid.id,]
train.data <- training[-valid.id,]
# used modelA
train.model <- glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges , family = binomial(link = logit), data = train.data)
# used the reduced model
train.model3 <- glm(Churn ~ Term + Agreement_period + Monthly_Charges, family = binomial(link = logit), data = train.data)
# used the first model
train.model2 <- glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges, family = binomial(link = logit), data = train.data)
## this gives the data set containing the predicted probabilities of the train models on
## the valid dataset
pred.prob = predict.glm(train.model, valid.data, type = "response")
pred.prob3 = predict.glm(train.model3, valid.data, type = "response")
pred.prob2 = predict.glm(train.model2, valid.data, type = "response")
## define the confusion matrix
for(j in 1:20){
pred_churn = rep(0,length(pred.prob))
pred_churn3 = rep(0,length(pred.prob3))
pred_churn2 = rep(0,length(pred.prob2))
valid.data$pred_churn = as.numeric(pred.prob >cutoff.prob[j])
valid.data$pred_churn3 = as.numeric(pred.prob3 >cutoff.prob3[j])
valid.data$pred_churn2 = as.numeric(pred.prob2 >cutoff.prob2[j])
a11 = sum(valid.data$pred_churn == valid.data$Churn)
a113 = sum(valid.data$pred_churn3 == valid.data$Churn)
a112 = sum(valid.data$pred_churn2 == valid.data$Churn)
pred.accuracy[i,j] = a11/length(pred.prob)
pred.accuracy3[i,j] = a113/length(pred.prob3)
pred.accuracy2[i,j] = a112/length(pred.prob2)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean) # calc. the mean by column of the matrix
max.id = which(avg.accuracy == max(avg.accuracy )) # maximum avg accuracy for modelA
avg.accuracy3 = apply(pred.accuracy3, 2, mean) # calc. the mean by column of the matrix
max.id3 = which(avg.accuracy3 == max(avg.accuracy3)) # maximum avg accuracy for reduced model
avg.accuracy2 = apply(pred.accuracy2, 2, mean) # calc. the mean by column of the matrix
max.id2 = which(avg.accuracy2 == max(avg.accuracy2)) # maximum avg accuracy for first model
## PLOT FOR MODELA
### visual representation of cross validation
tick.label = as.character(round(cutoff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "ModelA"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "green")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "blue", cex = 0.8)
## PLOT FOR THE REDUCED MODEL
### visual representation of cross validation
tick.label3 = as.character(round(cutoff.prob3,2))
plot(1:20, avg.accuracy3, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "Reduced model"
)
axis(1, at=1:20, label = tick.label3, las = 2)
axis(2)
segments(max.id3, 0.5, max.id3, avg.accuracy3[max.id3], col = "purple")
text(max.id3, avg.accuracy3[max.id3]+0.03, as.character(round(avg.accuracy3[max.id3],4)), col = "red", cex = 0.8)
## PLOT FOR THE FIRST MODEL
### visual representation of cross validation
tick.label2 = as.character(round(cutoff.prob2,2))
plot(1:20, avg.accuracy2, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "First Model"
)
axis(1, at=1:20, label = tick.label2, las = 2)
axis(2)
segments(max.id2, 0.5, max.id2, avg.accuracy2[max.id2], col = "brown")
text(max.id2, avg.accuracy2[max.id2]+0.03, as.character(round(avg.accuracy2[max.id2],4)), col = "blue", cex = 0.8)
5-fold CV performance plot
ModelA was the model generated by automatic variable selection; the optimal cut-off probability that yields its best accuracy is 0.52. The reduced model consisted of term, agreement period and monthly charges; its optimal cut-off probability is 0.57. The first model contained all the predictors; its optimal cut-off probability is 0.52.
The models were fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:
## ModelA was used on the training data set to generate regression coefficients
## stored on the test.model
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set
test.model3 = glm(Churn ~ Term + Agreement_period + Monthly_Charges, #reduced model
family = binomial(link = logit),
data = training) # training data set
test.model2 = glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges,
family = binomial(link = logit),
data = training) # training data set
## gives the data set containing the predicted probabilities of the test models on
## the testing dataset
pred.prob.test = predict.glm(test.model, testing, type = "response")
pred.prob.test3 = predict.glm(test.model3, testing, type = "response")
pred.prob.test2 = predict.glm(test.model2, testing, type = "response")
testing$test.churn = as.numeric(pred.prob.test > 0.38)
testing$test.churn3 = as.numeric(pred.prob.test3 > 0.52)
testing$test.churn2 = as.numeric(pred.prob.test2 > 0.52)
a11 = sum(testing$test.churn == testing$Churn)
a113 = sum(testing$test.churn3 == testing$Churn)
a112 = sum(testing$test.churn2 == testing$Churn)
test.accuracy = a11/length(pred.prob.test)
test.accuracy3 = a113/length(pred.prob.test3)
test.accuracy2 = a112/length(pred.prob.test2)
Model_type <- c("ModelA", "Reduced Model", "First Model")
Accuracy <- c(test.accuracy, test.accuracy3, test.accuracy2)
Model_Accuracy <- data.frame(Model_type,Accuracy)
kable(Model_Accuracy, col.names = c("Model", "Test Accuracy"))
| Model | Test Accuracy |
|---|---|
| ModelA | 0.7933333 |
| Reduced Model | 0.7566667 |
| First Model | 0.7800000 |
The regression coefficients obtained by fitting the models on the training data were used to obtain accuracy on the test data. The accuracy was 79.33% for modelA, 75.67% for the reduced model and 78.0% for the first model. The model with the best test accuracy was modelA. When its test accuracy is compared with its training accuracy (79.14%), no underfitting or overfitting is seen.
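A minimal sketch of how that training accuracy can be verified, assuming the same 0.38 cut-off that was applied to the test set:
pred.prob.train = predict.glm(test.model, training, type = "response") # probabilities on the training data
mean(as.numeric(pred.prob.train > 0.38) == training$Churn) # training accuracy of modelA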
Below is the plot of the ROC curve (1-specificity, sensitivity).
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, # ModelA
family = binomial(link = logit),
data = training) # training data set
test.model3 = glm(Churn ~ Term + Agreement_period + Monthly_Charges, #reduced model
family = binomial(link = logit),
data = training) # training data set
test.model2 = glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges,
family = binomial(link = logit),
data = training) # first model
## gives the predicted probabilities of the fitted models on the
## training dataset (used for the ROC analysis)
pred.prob.test = predict.glm(test.model, training, type = "response")
pred.prob.test3 = predict.glm(test.model3, training, type = "response")
pred.prob.test2 = predict.glm(test.model2, training, type = "response")
### components for defining various measures for modelA
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:20){
training$train.churn = as.numeric(pred.prob.test > cut.off.seq[i])
### components for defining various measures for modelA
p0.a0 = sum(training$train.churn ==0 & training$Churn==0)
p0.a1 = sum(training$train.churn ==0 & training$Churn ==1)
p1.a0 = sum(training$train.churn ==1 & training$Churn==0)
p1.a1 = sum(training$train.churn ==1 & training$Churn ==1)
###
sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
### components for defining various measures for first model
cut.off.seq2 = seq(0, 1, length = 20)
sensitivity.vec2 = NULL
specificity.vec2 = NULL
for (m in 1:20){
training$train.churn2 = as.numeric(pred.prob.test2 > cut.off.seq2[m])
p0.a02 = sum(training$train.churn2 ==0 & training$Churn==0)
p0.a12 = sum(training$train.churn2 ==0 & training$Churn ==1)
p1.a02 = sum(training$train.churn2 ==1 & training$Churn==0)
p1.a12 = sum(training$train.churn2 ==1 & training$Churn ==1)
sensitivity.vec2[m] = p1.a12 / (p1.a12 + p0.a12)
specificity.vec2[m] = p0.a02/ (p0.a02 + p1.a02)
}
one.minus.spec2 = c(1,1 - specificity.vec2)
sens.vec2 = c(1,sensitivity.vec2)
### components for defining various measures for reduced model
cut.off.seq3 = seq(0, 1, length = 20)
sensitivity.vec3 = NULL
specificity.vec3 = NULL
for (n in 1:20){
training$train.churn3 = as.numeric(pred.prob.test3 > cut.off.seq3[n])
p0.a03 = sum(training$train.churn3 ==0 & training$Churn==0)
p0.a13 = sum(training$train.churn3 ==0 & training$Churn ==1)
p1.a03 = sum(training$train.churn3 ==1 & training$Churn==0)
p1.a13 = sum(training$train.churn3 ==1 & training$Churn ==1)
sensitivity.vec3[n] = p1.a13 / (p1.a13 + p0.a13)
specificity.vec3[n] = p0.a03/ (p0.a03 + p1.a03)
}
one.minus.spec3 = c(1,1 - specificity.vec3)
sens.vec3 = c(1,sensitivity.vec3)
## Visualization
par(pty = "s") # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1),
xlab ="1 - specificity",
ylab = "sensitivity",
main = "ROC Curves of the 3 Models",
lwd = 2,
col = "purple", cex=0.5 )
segments(0,0,1,1, col = "orange", lty = 2, lwd = 2)
AUC = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
text(0.6, 0.3, paste("AUC = ", AUC), col = "purple", cex=0.5)
lines(one.minus.spec2,sens.vec2, col="red")
AUC2 = round(sum(sens.vec2*(one.minus.spec2[-21]-one.minus.spec2[-1])),4)
text(0.5, 0.2, paste("AUC = ", AUC2), col = "red", cex=0.5)
lines(one.minus.spec3,sens.vec3, col="green")
AUC3 = round(sum(sens.vec3*(one.minus.spec3[-21]-one.minus.spec3[-1])),4)
text(0.8, 0.5, paste("AUC = ", AUC3), col = "green", cex=0.5)
legend(0,1,legend=c("ModelA", "Reduced model", "First model"), col=c("purple","green","red"), pch=0, cex=0.4)
From the plot, it can be seen that the AUC for modelA, 0.8639, is the highest, which is evidence that it is a very good model. The curve is also very close to the top-left corner, a good indication of excellent performance.
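The AUC values printed on the plot are Riemann-sum approximations over the ROC points, of the form

$$\mathrm{AUC} \approx \sum_{i} \mathrm{sens}_i \left[(1-\mathrm{spec})_i - (1-\mathrm{spec})_{i+1}\right]$$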
Of the 3 logistic models evaluated above using cross-validation and KPI measures, modelA had the best test accuracy (79.33%) and the highest AUC (0.8639) from the ROC curve. It fits the data well. Below is the table showing the local performance metrics:
## ModelA was used on the training data set to generate regression coefficients
## stored on the test.model
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set
## gives the data set containing the predicted probabilities of the test.model
pred.prob.test = predict.glm(test.model, testing, type = "response")
testing$test.churn = as.numeric(pred.prob.test > 0.38)
### components for defining various measures
p0.a0 = sum(testing$test.churn ==0 & testing$Churn==0)
p0.a1 = sum(testing$test.churn ==0 & testing$Churn ==1)
p1.a0 = sum(testing$test.churn ==1 & testing$Churn==0)
p1.a1 = sum(testing$test.churn ==1 & testing$Churn ==1)
sensitivity <- p1.a1 / (p1.a1 + p0.a1)
specificity <- p0.a0 / (p0.a0 + p1.a0)
precision <- p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics for the Best Logistic Model")
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.6931818 | 0.8349057 | 0.6354167 | 0.6931818 | 0.6630435 |
Below is the optimal cut-off probability for the best logistic model.
ab <- dim(training)[1]/5 # total obs in training dataset is divided by 5
## a sequence of 20 candidate cut-off probabilities strictly between 0 and 1
cutoff.prob <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## Created a 5-fold Cross Validation
for (i in 1:5){
valid.id <- ((i-1)*ab + 1):(i*ab)
valid.data <- training[valid.id,]
train.data <- training[-valid.id,]
# used the final model
train.model <- glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, family = binomial(link = logit), data = train.data)
## this gives the data set containing the predicted probabilities of the train.model on
## the valid.data
pred.prob = predict.glm(train.model, valid.data, type = "response")
## define the confusion matrix
for(j in 1:20){
pred_churn = rep(0,length(pred.prob))
valid.data$pred_churn = as.numeric(pred.prob >cutoff.prob[j])
a11 = sum(valid.data$pred_churn == valid.data$Churn)
pred.accuracy[i,j] = a11/length(pred.prob)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean) # calc. the mean by column of the matrix
max.id = which(avg.accuracy == max(avg.accuracy)) # maximum avg accuracy
### visual representation of cross validation
tick.label = as.character(round(cutoff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "5-fold Cross Validation performance of ModelA"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "brown")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "blue", cex = 0.8)
5-fold CV performance plot
At the end of the cross validation and testing, a subset of the main dataset containing the variables included in ModelA was created and ready to be used to build subsequent models.
keep <- df[,-c(1,6,9,10)]
write.csv(keep,"C:\\Users\\echef\\Documents\\sta551\\Datasets\\churn_model.csv")
Neural network models require all feature variables to be numeric. Categorical variables are therefore converted to dummy variables using model.matrix() and the numerical variables are scaled.
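The numeric variables are rescaled to the unit interval with min-max scaling:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$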
## first we read in the data set containing all the variables that
## will be used for modelling
gat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
gat$Term <- (gat$Term-min(gat$Term))/(max(gat$Term)-min(gat$Term))
gat$Monthly_Charges <- (gat$Monthly_Charges-min(gat$Monthly_Charges))/(max(gat$Monthly_Charges)-min(gat$Monthly_Charges))
## convert categorical variables to factor
gat$cluster <- as.factor(gat$cluster)
gat$Internet_service <- as.factor(gat$Internet_service)
gat$Agreement_period <- as.factor(gat$Agreement_period)
gat$Churn <- as.factor(gat$Churn)
gat$tech_support <- as.factor(gat$tech_support)
gatmtx = model.matrix(~ ., data = gat)
colnames(gatmtx)
## Then the variables were renamed using simpler names.
colnames(gatmtx)[3] <- "DSL"
colnames(gatmtx)[4] <- "fiber"
colnames(gatmtx)[5] <- "nointernet"
colnames(gatmtx)[6] <- "oneyear"
colnames(gatmtx)[7] <- "twoyears"
colnames(gatmtx)[8] <- "monthlycharges"
colnames(gatmtx)[9] <- "Churn"
colnames(gatmtx)[10] <- "techsupport"
colnames(gatmtx)[11] <- "cluster2"
colnames(gatmtx)[12] <- "cluster3"
colnames(gatmtx)[13] <- "cluster4"
colnames(gatmtx)[14] <- "cluster5"
colnames(gatmtx)
The model formula is:
Churn ~ Term + DSL + fiber + nointernet + oneyear + twoyears + monthlycharges + techsupport + cluster2 + cluster3 + cluster4 + cluster5
It is assembled programmatically from the column names:
columnNames <- colnames(gatmtx)
columnList <- paste(columnNames[-c(1,9)], collapse = "+")
columnList <- paste(c(columnNames[9],"~",columnList), collapse="")
modelFormula <- formula(columnList)
modelFormula
This follows the usual steps for building a neural network model to predict customer churn. The data was split into two parts: 70% for training the neural network and 30% for testing. Cross-validation was done with the training data and the model was then tested on the testing data.
n = dim(gatmtx)[1]
set.seed(200)
trainID = sample(1:n, round(n*0.7), replace = FALSE)
traindat = gatmtx[trainID,]
testdat = gatmtx[-trainID,]
NetworkModel = neuralnet(modelFormula,
data = traindat, # training dataset
hidden = 1, # single layer NN
rep = 1, # number of replicates in training NN
threshold = 0.01, # threshold for the partial derivatives as the stopping criterion
learningrate = 0.1, # user selected rate
algorithm = "rprop+"
)
kable(NetworkModel$result.matrix)
The plot of a single layer neural network model of customer churn is shown.
plot(NetworkModel, rep="best")
n0 = dim(traindat)[1]/5
cut.off.score = seq(0,1, length = 22)[-c(1,22)] # candidate cut off prob
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T) # null vector for storing prediction accuracy
##
for (i in 1:5){
valid.id = ((i-1)*n0 + 1):(i*n0)
valid.data = traindat[valid.id,]
train.data = traindat[-valid.id,]
####
train.model = neuralnet(modelFormula,
data = train.data,
hidden = 1, # single layer NN
rep = 1, # number of replicates in training NN
threshold = 0.01, # threshold for the partial derivatives as stopping criteria.
learningrate = 0.1, # user selected rate
algorithm = "rprop+"
)
pred.nn.score = predict(train.model, valid.data) # predict on the held-out fold with the fold-trained model
for(j in 1:20){
pred.churn = as.numeric(pred.nn.score > cut.off.score[j])
a11 = sum(pred.churn == valid.data[,9])
pred.accuracy[i,j] = a11/length(pred.nn.score)
}
}
###
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
### visual representation
tick.label = as.character(round(cut.off.score,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.6,1),
axes = FALSE,
xlab = "Cut-off Score",
ylab = "Accuracy",
main = "5-fold CV performance"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "brown", cex = 0.8)
The cross validation in the neural network was carried out with the
training dataset. The optimal cut off probability obtained for the
neural network model was 0.48 and the training accuracy was 0.8214.
The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.
#Test the resulting output
nn.results <- predict(NetworkModel, testdat)
results <- data.frame(actual = testdat[,9], prediction = nn.results > 0.48)
confMatrix = table(results$prediction, results$actual) # confusion matrix
accuracy=sum(results$actual == results$prediction)/length(results$prediction)
kable(confMatrix, caption="Confusion Matrix")
| | 0 | 1 |
|---|---|---|
| FALSE | 184 | 32 |
| TRUE | 28 | 56 |
kable(accuracy, caption="Test Accuracy of Neural Network", col.names="Test Accuracy")
| Test Accuracy |
|---|
| 0.8 |
The test accuracy was (184 + 56)/300 = 80%.
An ROC curve is shown for the above neural network model based on the training data set.
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
# plotting ROC
plot(1-SenSpe[2,], SenSpe[1,], type ="b", xlim=c(0,1), ylim=c(0,1),
xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
main = "ROC Curve of the Neural Network Model", col = "purple")
abline(0,1, lty = 2, col = "orange")
legend("bottomright", c("ROC of the model", "Random guessing"), lty=c(1,2),
col = c("purple", "orange"), bty = "n", cex = 0.8)
AUC = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "purple")
ROC Curve of the neural network model
The above ROC curve indicates that the underlying neural network performs better than random guessing, since the area under the curve is significantly greater than 0.5. Also, since the AUC, 0.9039, is well above 0.65, the neural network model is acceptable.
Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.
## the neural network model
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
## The logistic regression model
nn <- dim(gat)[1]
set.seed(200) #used same seed
## divides the data set into 70% training and 30% testing
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
training = gat[train.id,]
testing = gat[-train.id,]
t.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set obtained in section 5 during
# cross-validation for the LR model
## ModelA was used on the training dataset again to obtain prob.
pred.prob.fm = predict.glm(t.model, training, type = "response")
### Sensitivity and Specificity for the LR model
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (m in 1:20){
training$train.churn = as.numeric(pred.prob.fm > cut.off.seq[m])
p0.a0 = sum(training$train.churn ==0 & training$Churn==0)
p0.a1 = sum(training$train.churn ==0 & training$Churn ==1)
p1.a0 = sum(training$train.churn ==1 & training$Churn==0)
p1.a1 = sum(training$train.churn ==1 & training$Churn ==1)
sensitivity.vec[m] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[m] = p0.a0/ (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
# plotting ROC for both models
plot(1-SenSpe[2,], SenSpe[1,], type ="b", xlim=c(0,1), ylim=c(0,1),
xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
main = "ROC Curve of LR and NN Models", col = "purple")
abline(0,1, lty = 2, col = "orange")
AUC = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "purple")
points(one.minus.spec,sens.vec, col="red")
lines(one.minus.spec,sens.vec, col="red")
AUC.LR = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
text(0.5, 0.2, paste("AUC = ", AUC.LR), col = "red")
legend("bottomright", c("NN model", "LR model", "Random guessing"), lty=c(1,1,2),
col = c("purple", "red", "orange"), bty = "n", cex = 0.8)
The neural network model had a test accuracy of 80% and an AUC of 0.9039, while the best logistic regression model had a test accuracy of 79.33% and an AUC of 0.8639. Both models are acceptable, but the neural network is clearly the better of the two.
In this section, a decision tree was used as a predictive model to draw conclusions about the customer churn data. The main goal of this model is to predict the value of a response variable based on several input variables. A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. The decision tree is appropriate for this data because it is easy to interpret, very flexible, and relatively insensitive to correlations between attributes: if two variables in this dataset are highly correlated, the algorithm will simply choose one of them to split on.
rat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
## convert churn variable to factor with yes and no
rat$Churn <- ifelse(rat$Churn==1, "Yes", "No")
## convert categorical variables to factor
rat$cluster <- as.factor(rat$cluster)
rat$Internet_service <- as.factor(rat$Internet_service)
rat$Agreement_period <- as.factor(rat$Agreement_period)
rat$Churn <- as.factor(rat$Churn)
rat$tech_support <- as.factor(rat$tech_support)
# use a random split approach
n = dim(rat)[1] # sample size
set.seed(200)
# using without replacement
train.id = sample(1:n, round(0.7*n), replace = FALSE)
train = rat[train.id, ] # training data
test = rat[-train.id, ] # testing data
To build the different decision tree models, the data set was first split randomly into a training dataset of 700 observations and a testing dataset of 300 observations. Cross-validation will be carried out and the optimal cut-off score calculated using the training dataset. ROC analysis will also be performed, and the decision tree with the largest AUC will be identified as the best.
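As a quick sanity check on the random split, one could compare the churn proportions in the two subsets; a small illustrative sketch:
## illustrative check: churn rates should be similar in both subsets
prop.table(table(train$Churn)) # proportions of No/Yes in the training data
prop.table(table(test$Churn)) # proportions of No/Yes in the testing data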
Here, a wrapper is written so that 6 different decision trees can be built conveniently.
tree.builder = function(in.data, fp, fn, purity){
  tree = rpart(Churn ~ ., # including all features
               data = in.data,
               na.action = na.rpart, # records with a missing outcome are deleted;
                                     # records with missing predictors are kept
               method = "class", # classification tree
               model = FALSE,
               x = FALSE,
               y = TRUE,
               parms = list( # loss matrix: penalize false positives or false negatives more heavily
                 loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
                 split = purity), # "gini" or "information"
               ## rpart algorithm options
               control = rpart.control(
                 minsplit = 10, # minimum number of observations required before a split
                 minbucket = 10, # minimum number of observations in any terminal node
                                 # (default = minsplit/3)
                 cp = 0.01, # complexity parameter for the stopping rule (e.g. 0.02 -> smaller tree)
                 xval = 10 # number of cross-validation folds
               )
  )
  tree # return the fitted tree
}
Using the function `tree.builder(in.data, fp, fn, purity)`, six different decision tree models were defined:
Model 1: `gini.tree.1.1` is based on the Gini index without penalizing false positives or false negatives.
Model 2: `info.tree.1.1` is based on entropy without penalizing false positives or false negatives.
Model 3: `gini.tree.1.10` is based on the Gini index, with the cost of a false negative set to 10 times that of a false positive.
Model 4: `info.tree.1.10` is based on entropy, with the cost of a false negative set to 10 times that of a false positive.
Model 5: `gini.tree.10.1` is based on the Gini index, with the cost of a false positive set to 10 times that of a false negative.
Model 6: `info.tree.10.1` is based on entropy, with the cost of a false positive set to 10 times that of a false negative.
The tree diagrams of two of these models are given below. The first four trees are built explicitly; the two fp = 10 trees are constructed inside the ROC function in the next subsection (see the note after the code).
## Call the tree model wrapper.
gini.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "information")
gini.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "information")
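## Note: the remaining two trees (with fp = 10) are not stored as objects here;
## they are built internally by the ROC function in the next subsection. If
## needed, they could be created the same way, e.g.
## gini.tree.10.1 = tree.builder(in.data = train, fp = 10, fn = 1, purity = "gini")
## info.tree.10.1 = tree.builder(in.data = train, fp = 10, fn = 1, purity = "information")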
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
rpart.plot(info.tree.1.10, main = "Tree with entropy: penalization")
Penalized decision tree models using Gini index (left) and entropy (right).
ROC analysis was then used to select the best among all the models. The function `SenSpe(in.data, fp, fn, purity)` is defined and used to build the 6 different trees and plot their corresponding ROC curves so that the global performance of these tree algorithms can be seen and compared. The function has four arguments: the input data, the false positive cost, the false negative cost, and the purity measure.
# function returning a sensitivity/specificity matrix and the AUC
# (uses roc() and auc() from the pROC package)
SenSpe = function(in.data, fp, fn, purity){
cutoff = seq(0,1, length = 20) # 20 cut-offs including 0 and 1.
model = tree.builder(in.data, fp, fn, purity)
## decision tree returns both "success" and "failure" probabilities.
## only "success" probability is needed to define sensitivity and specificity
pred = predict(model, newdata = in.data, type = "prob") # two-column matrix.
senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
for (i in 1:length(cutoff)){
# Yes and No are values of the response variable in this data set
# The following line uses only the probability of Yes: pred[, "Yes"]
pred.out = ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")
TP = sum(pred.out =="Yes" & in.data$Churn == "Yes")
TN = sum(pred.out =="No" & in.data$Churn == "No")
FP = sum(pred.out =="Yes" & in.data$Churn == "No")
FN = sum(pred.out =="No" & in.data$Churn == "Yes")
senspe.mtx[1,i] = TP/(TP + FN)
senspe.mtx[2,i] = TN/(TN + FP)
}
prediction = pred[,"Yes"]
category = in.data$Churn == "Yes"
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)
##
list(senspe.mtx= senspe.mtx, AUC = round(AUC,3))
}
giniROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="information")
giniROC101 = SenSpe(in.data = train, fp=10, fn=1, purity="gini")
infoROC101 = SenSpe(in.data = train, fp=10, fn=1, purity="information")
The ROC curves represent the various decision trees and their corresponding AUC values. The model gini.1.10 has the largest AUC of 0.864 and its curve extends farthest toward the upper left corner. Therefore, it is considered the best decision tree among the six.
par(pty="s") # set up square plot through graphic parameter
plot(1-giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1),
xlab="1 - specificity: FPR", ylab="Sensitivity: TPR", col = "blue", lwd = 2,
main="ROC Curves of Decision Trees", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)
lines(1-infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,], col = "firebrick2", lwd = 2, lty=2)
lines(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], col = "olivedrab", lwd = 2)
lines(1-infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,], col = "skyblue", lwd = 2)
lines(1-giniROC101$senspe.mtx[2,], giniROC101$senspe.mtx[1,], col = "red", lwd = 2, lty = 4)
lines(1-infoROC101$senspe.mtx[2,], infoROC101$senspe.mtx[1,], col = "sienna3", lwd = 2)
legend("bottomright", c(paste("gini.1.1, AUC =", giniROC11$AUC),
paste("info.1.1, AUC =",infoROC11$AUC),
paste("gini.1.10, AUC =",giniROC110$AUC),
paste("info.1.10, AUC =",infoROC110$AUC),
paste("gini.10.1, AUC =",giniROC101$AUC),
paste("info.10.1, AUC =",infoROC101$AUC)),
col=c("blue","firebrick2","olivedrab","skyblue","red","sienna3"),
lty=c(1,2,1,1,4,1), lwd=rep(2,6), cex = 0.5, bty = "n") # line types match the curves drawn above
Comparison of ROC curves
The optimal cut-off determination through cross-validation was based on the training data set. The function `Optm.cutoff(in.data, fp, fn, purity)` was first created and then used to calculate the optimal cut-off for the penalized decision trees shown above.
Optm.cutoff = function(in.data, fp, fn, purity){
n0 = dim(in.data)[1]/5 # fold size for 5-fold cross-validation
cutoff = seq(0,1, length = 20) # candidate cut-off probabilities
## accuracy for each candidate cut-off
accuracy.mtx = matrix(0, ncol=20, nrow=5) # 5 folds by 20 candidate cut-offs
##
for (k in 1:5){
valid.id = ((k-1)*n0 + 1):(k*n0)
valid.dat = in.data[valid.id,]
train.dat = in.data[-valid.id,]
## tree model fitted on the remaining four folds
tree.model = tree.builder(train.dat, fp, fn, purity)
## prediction
pred = predict(tree.model, newdata = valid.dat, type = "prob")[,2]
## for-loop
for (i in 1:20){
## predicted probabilities
pc.1 = ifelse(pred > cutoff[i], "Yes", "No")
## accuracy
a1 = mean(pc.1 == valid.dat$Churn)
accuracy.mtx[k,i] = a1
}
}
avg.acc = apply(accuracy.mtx, 2, mean)
## plots
n = length(avg.acc)
idx = which(avg.acc == max(avg.acc))
tick.label = as.character(round(cutoff,2))
##
plot(1:n, avg.acc, xlab="cut-off score", ylab="average accuracy",
ylim=c(min(avg.acc), 1),
axes = FALSE,
main=paste("5-fold CV optimal cut-off \n ",purity,"(fp, fn) = (", fp, ",", fn,")" , collapse = ""),
cex.main = 0.9,
col.main = "navy")
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
points(idx, avg.acc[idx], pch=19, col = "red")
segments(idx , min(avg.acc), idx , avg.acc[idx ], col = "red")
text(idx, avg.acc[idx]+0.03, as.character(round(avg.acc[idx],4)), col = "red", cex = 0.8)
}
par(mfrow=c(1,2))
Optm.cutoff(in.data = train, fp=1, fn=10, purity="gini")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="information")
Plot of optimal cut-off determination
In the above figure, there are multiple optimal cut-offs for each plot. Therefore, the final cut-off for the best model is taken as the average of these tied cut-offs. The average cut-off for the best model, gini.1.10, is 0.4475.
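A small sketch of this averaging step (the accuracies and cut-offs below are hypothetical values, since Optm.cutoff above only plots the accuracy curve rather than returning the tied maxima):
## toy illustration of averaging tied optimal cut-offs (hypothetical values)
avg.acc <- c(0.79, 0.81, 0.81, 0.80) # hypothetical CV accuracies
cand.cut <- c(0.37, 0.42, 0.47, 0.53) # matching candidate cut-offs
mean(cand.cut[avg.acc == max(avg.acc)]) # average of the tied maxima: 0.445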
At the end of the decision tree modelling, the best decision tree was identified with the optimal cut-off score. The following is the diagram of the best decision tree and the cut off score.
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="gini")
Best decision tree model (left) and optimal cut-off (right).
The optimal cut-off was 0.4475 and the training accuracy was 0.8071.
best.tree = tree.builder(in.data = train, fp = 1, fn = 10, purity = "gini")
t_pred = predict(best.tree, test, type = "class") # class predictions on the testing data
confMat <- table(test$Churn, t_pred) # confusion matrix: actual vs predicted
accuracy <- sum(diag(confMat))/sum(confMat) # proportion of correct predictions
kable(accuracy, caption="Test Accuracy of Best Decision Tree", col.names="Test Accuracy")
| Test Accuracy |
|---|
| 0.7633333 |
The test accuracy of the best decision tree was 0.7633.
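For completeness, the confusion matrix computed above could be displayed in the same way as for the neural network:
kable(confMat, caption="Confusion Matrix of the Best Decision Tree")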
At the end of this analysis, the following models were built to analyse customer churn: Logistic Regression, Neural Network, and Decision Tree. These three models were compared using their ROC curves, AUC values, and training and test accuracies.
Below is the plot of the ROC curves for these 3 models and the respective AUC values.
## read in the same dataset that will contain standardized variables for the
## neural network model
gat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
gat$Term <- (gat$Term-min(gat$Term))/(max(gat$Term)-min(gat$Term))
gat$Monthly_Charges <- (gat$Monthly_Charges-min(gat$Monthly_Charges))/(max(gat$Monthly_Charges)-min(gat$Monthly_Charges))
## convert categorical variables to factor
gat$cluster <- as.factor(gat$cluster)
gat$Internet_service <- as.factor(gat$Internet_service)
gat$Agreement_period <- as.factor(gat$Agreement_period)
gat$Churn <- as.factor(gat$Churn)
gat$tech_support <- as.factor(gat$tech_support)
gatmtx = model.matrix(~ ., data = gat)
colnames(gatmtx)
colnames(gatmtx)[3] <- "DSL"
colnames(gatmtx)[4] <- "fiber"
colnames(gatmtx)[5] <- "nointernet"
colnames(gatmtx)[6] <- "oneyear"
colnames(gatmtx)[7] <- "twoyears"
colnames(gatmtx)[8] <- "monthlycharges"
colnames(gatmtx)[9] <- "Churn"
colnames(gatmtx)[10] <- "techsupport"
colnames(gatmtx)[11] <- "cluster2"
colnames(gatmtx)[12] <- "cluster3"
colnames(gatmtx)[13] <- "cluster4"
colnames(gatmtx)[14] <- "cluster5"
colnames(gatmtx)
columnNames <- colnames(gatmtx)
columnList <- paste(columnNames[-c(1,9)], collapse = "+")
columnList <- paste(c(columnNames[9],"~",columnList), collapse="")
modelFormula <- formula(columnList)
modelFormula
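## Given the renamed columns (assuming column 2 of the model matrix keeps the
## name "Term"), the printed formula should read:
## Churn ~ Term + DSL + fiber + nointernet + oneyear + twoyears + monthlycharges
##   + techsupport + cluster2 + cluster3 + cluster4 + cluster5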
n = dim(gatmtx)[1]
set.seed(200)
trainID = sample(1:n, round(n*0.7), replace = FALSE)
traindat = gatmtx[trainID,]
testdat = gatmtx[-trainID,]
NetworkModel = neuralnet(modelFormula,
data = traindat, # training dataset
hidden = 1, # single hidden layer with one neuron
rep = 1, # number of replicates in training the NN
threshold = 0.01, # threshold for the partial derivatives, used as the stopping criterion
learningrate = 0.1, # user-selected learning rate
algorithm = "rprop+" # resilient backpropagation with weight backtracking
)
kable(NetworkModel$result.matrix)
## ModelA
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, # ModelA
family = binomial(link = logit),
data = train) # training data set
## gives the data set containing the predicted probabilities of the test model on
## the training dataset
pred.prob.test = predict.glm(test.model, train, type = "response")
### components for defining various measures for modelA
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:20){
train$train.churn = as.numeric(pred.prob.test > cut.off.seq[i])
### components for defining various measures for modelA
p0.a0 = sum(train$train.churn ==0 & train$Churn=="No")
p0.a1 = sum(train$train.churn ==0 & train$Churn =="Yes")
p1.a0 = sum(train$train.churn ==1 & train$Churn=="No")
p1.a1 = sum(train$train.churn ==1 & train$Churn =="Yes")
###
sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
## Neural Network Model
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
## Decision Tree
train <- train[,-8] # drop the train.churn column added in the LR loop above
cutoff = seq(0,1, length = 20)
pred = predict(best.tree, train, type = "prob") # two-column matrix.
senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
for (i in 1:length(cutoff)){
# Yes and No are values of the response variable in this data set
# The following line uses only the probability of Yes: pred[, "Yes"]
pred.out = ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")
TP = sum(pred.out =="Yes" & train$Churn == "Yes")
TN = sum(pred.out =="No" & train$Churn == "No")
FP = sum(pred.out =="Yes" & train$Churn == "No")
FN = sum(pred.out =="No" & train$Churn == "Yes")
senspe.mtx[1,i] = TP/(TP + FN)
senspe.mtx[2,i] = TN/(TN + FP)
}
prediction = pred[,"Yes"]
category = train$Churn == "Yes"
ROCobj <- roc(category, prediction)
## Visualization
par(pty = "s") # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1),ylim = c(0,1),
xlab ="1 - specificity: FPR",
ylab = "sensitivity: TPR",
main = "ROC Curves of the Three Models",
lwd = 2,
col = "purple", )
abline(0,1, lty = 2, col = "orange")
AUC = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
lines(1-SenSpe[2,], SenSpe[1,], col = "olivedrab", lwd = 2)
lines(1-senspe.mtx[2,], senspe.mtx[1,], col = "red", lwd = 2)
AUC2 = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
AUC3= round(sum(senspe.mtx[1,]*((1-senspe.mtx[2,])[-20]-((1-senspe.mtx[2,][-1])))),4)
legend("bottomright", c(paste("LR: AUC =", AUC),
paste("NN: AUC =",AUC2),
paste("DT: AUC =",AUC3)),
col=c("purple","olivedrab","red"),
lty=rep(1,3), cex = 0.5, bty = "n")
ROC curves of all three models
From the above plot, the Decision Tree appears to be the best of the three models based on the area under the curve, which was 0.9153.
Below is a table showing the type of model, training accuracy, test accuracy and AUC value.
Model <- c("Logistic Regression", "Neural Network", "Decision Tree")
Trainingaccuracy <- c(0.7914, 0.8214, 0.8071)
Testaccuracy <- c(0.7933, 0.80, 0.7633)
AUCscore <- c(0.8639, 0.9039,0.9153)
perform <- data.frame(Model, Trainingaccuracy, Testaccuracy, AUCscore)
kable(perform, col.names=c("Model", "Training Accuracy", "Test Accuracy", "AUC"), caption="Comparison Between All The Models")
| Model | Training Accuracy | Test Accuracy | AUC |
|---|---|---|---|
| Logistic Regression | 0.7914 | 0.7933 | 0.8639 |
| Neural Network | 0.8214 | 0.8000 | 0.9039 |
| Decision Tree | 0.8071 | 0.7633 | 0.9153 |
The following performance metrics were computed for all three models (Logistic Regression, Neural Network, and Decision Tree): sensitivity, specificity, precision, recall, and F1 score. The results are as follows:
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = train) # training data set
## gives the data set containing the predicted probabilities of the test.model
pred.prob.test = predict.glm(test.model, test, type = "response")
test$test.churn = as.numeric(pred.prob.test > 0.38) # 0.38 = optimal cut-off for the LR model
### components for defining various measures
p0.a0 = sum(test$test.churn ==0 & test$Churn=="No")
p0.a1 = sum(test$test.churn ==0 & test$Churn =="Yes")
p1.a0 = sum(test$test.churn ==1 & test$Churn=="No")
p1.a1 = sum(test$test.churn ==1 & test$Churn =="Yes")
sensitivity <- p1.a1 / (p1.a1 + p0.a1)
specificity <- p0.a0 / (p0.a0 + p1.a0)
precision <- p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = rbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
## Neural Network Model
nn.results = predict(NetworkModel, testdat)
test.Churn = as.numeric(nn.results > 0.48) # 0.48 = optimal cut-off for the NN model
testdat <- cbind(testdat, test.Churn)
a = sum(testdat[,"test.Churn"] == 1 & testdat[,"Churn"]==0) # false positives
d = sum(testdat[,"test.Churn"] == 0 & testdat[,"Churn"]==1) # false negatives
b = sum(testdat[,"test.Churn"] == 0 & testdat[,"Churn"]==0) # true negatives
c = sum(testdat[,"test.Churn"] == 1 & testdat[,"Churn"]==1) # true positives
sen = c/(c + d) # tp/(tp + fn)
spe = b/(b + a) # tn/(tn + fp)
precision_nn <- c/(c + a) #tp/(tp + fp)
recall_nn <- c/(c + d) # tp/(tp + fn)
F1_nn = 2*precision_nn*recall_nn/(precision_nn + recall_nn)
metric.list.nn = rbind(sensitivity = sen,
specificity = spe,
precision = precision_nn,
recall = recall_nn,
F1 = F1_nn)
## The Best Decision tree
pred = predict(best.tree, newdata=test, type = "prob") # two-column matrix.
pred.out = ifelse(pred[,"Yes"] >= 0.4475, "Yes", "No")
TP = sum(pred.out =="Yes" & test$Churn == "Yes")
TN = sum(pred.out =="No" & test$Churn == "No")
FP = sum(pred.out =="Yes" & test$Churn == "No")
FN = sum(pred.out =="No" & test$Churn == "Yes")
sensi = TP/(TP + FN)
speci = TN/(TN + FP)
precision_dt <- TP/(TP + FP) #tp/(tp + fp)
recall_dt <- TP/(TP + FN) # tp/(tp + fn)
F1_dt = 2*precision_dt*recall_dt/(precision_dt + recall_dt)
metric.list.dt = rbind(sensitivity = sensi,
specificity = speci,
precision = precision_dt,
recall = recall_dt,
F1 = F1_dt)
Total_metric <- data.frame(metric.list, metric.list.nn, metric.list.dt)
kable(Total_metric, col.names=c("Logistic Regression", "Neural Network", "Decision Tree"), caption="Local Performance Metrics for All Models")
| | Logistic Regression | Neural Network | Decision Tree |
|---|---|---|---|
| sensitivity | 0.6931818 | 0.6363636 | 0.7500000 |
| specificity | 0.8349057 | 0.8679245 | 0.8254717 |
| precision | 0.6354167 | 0.6666667 | 0.6407767 |
| recall | 0.6931818 | 0.6363636 | 0.7500000 |
| F1 | 0.6630435 | 0.6511628 | 0.6910995 |
In this dataset, identifying the customers who are leaving the company is the ultimate goal; for this data, the positive class consists of customers who are likely to churn. Precision measures the proportion of predicted churners who actually churn, while recall (sensitivity) measures the proportion of actual churners that the model correctly detects. When the positive class is the smaller one and correctly detecting positive samples is the main focus, as in this study, precision and recall, together with their harmonic mean the F1 score, are the most informative performance metrics. The decision tree achieved the highest recall and the highest F1 score, with precision close to the best, and is therefore the preferred model for this goal.
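As a quick arithmetic check of the table, the decision tree's F1 score follows directly from its precision and recall:
## spot-check: F1 of the decision tree from the table above
precision_dt <- 0.6407767; recall_dt <- 0.7500000
2 * precision_dt * recall_dt / (precision_dt + recall_dt) # about 0.6911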
Churn prediction identifies customers that are likely to leave or remain with a company. The response variable, churn, is a binary variable with values of yes and no, and it indicates whether a customer will churn.
After a model is built, the next step is to use it to achieve the expected goal. Using the diagram of the best decision tree above, it is easy to show how this model can be used to predict which customers are likely to churn. A decision tree is a type of flow chart that simplifies the decision-making process by breaking down the different available paths of action and their potential outcomes.
Key to the coded values:
Agreement_period: 0 = Monthly, 1 = One Year, 2 = Two Years.
Internet_service: 0 = Cable, 1 = DSL, 2 = Fiber optic, 3 = No Internet.
Tech_support: 0 = No, 1 = Yes.
For example, if a customer’s agreement period is neither one year nor two years, the tree follows the right branch and arrives at internet service. If the customer has no internet, the tree leads to a node that splits on Term. If the customer’s term is greater than or equal to 14 months, the tree follows the right branch, which leads to a leaf node predicting that the customer is likely to churn. On the other hand, if the customer’s internet service is cable, DSL, or fiber optic, the tree follows the left branch to another split on Term. If the customer’s term is less than 11 months, the tree follows the left branch to a leaf node predicting that the customer is not likely to churn.
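To make this concrete, a hypothetical new customer could be scored with the fitted tree. The values and factor codes below are illustrative assumptions and must match the levels actually present in rat:
## hypothetical new customer (illustrative values only)
new.cust <- data.frame(Term = 16, # months with the company
Internet_service = factor("3", levels = levels(rat$Internet_service)), # no internet
Agreement_period = factor("0", levels = levels(rat$Agreement_period)), # monthly
Monthly_Charges = 60,
tech_support = factor("0", levels = levels(rat$tech_support)), # no tech support
cluster = factor("1", levels = levels(rat$cluster)))
predict(best.tree, newdata = new.cust, type = "prob") # estimated churn probability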
Using a decision tree to predict the likelihood of a customer leaving the company has some limitations. One is that a decision tree provides limited information about the relationship between the predictors and the response, so it is difficult for this model to quantify the relationships between Churn and individual predictors such as term, monthly charges, or agreement period. Another limitation is that the probabilities produced by the model are only estimates and are prone to error.
From the model, it is clear that customers with no internet service, a term greater than 14 months, and lower monthly charges (about 80 or lower) are more likely to churn. Therefore, it is recommended that both new and existing customers be encouraged to use internet services. Additionally, in order to retain customers who have been with the company for 12 months or longer, the business needs to pay closer attention to them and promptly address any issues they may have.
Customers who pay smaller monthly fees should receive the same treatment from the business as those paying higher amounts. It is likely that high-paying customers tend to stick around because they get better service, are treated with more respect, and earn bonuses for their subscriptions, while customers who pay less each month do not receive the same level of treatment.
It is crucial for telecommunication businesses to employ advanced analytics to predict whether or not customers will leave the company. Having done so, they will be better able to recognize the different customer groups that exist and to know how to communicate with each group.
At the end of this study, various factors have been analysed and the best model put forward. The proper way to use this model has been discussed, as well as its limitations. From this model, the key factors that lead to customer churn have been identified, and various recommendations have been suggested to help prevent customer churn.