This data set contains customer-level information for a telecommunications company. Each customer has a unique set of characteristics relating to the services they have used.
The telecommunications sector is growing quickly, and service providers are increasingly focused on growing their subscriber bases. Retaining current clients has become one of the biggest challenges of surviving in this competitive industry, since the cost of acquiring a new customer is widely reported to be significantly higher than the cost of keeping an existing one.
It is therefore crucial for telecommunications companies to employ advanced analytics to understand consumer behavior and predict whether or not clients are going to leave the business.
The following are the questions and objectives that describe and explain the main purpose of this project.
The response variable is Churn, a binary variable with two values: yes and no. The value yes means that a customer left the company, while no means that the customer is still active.
Three models will be built, and the best of them used to analyse the relationship between the binary response variable, Churn, and the predictor variables, by examining how the predictors affect the likelihood of the response taking its larger value (yes). The three models are: logistic regression, neural network and decision tree.
The total number of records in this data set is 1000. It consists of 14 variables, including the response variable, Churn. There are 3 numerical variables and 11 categorical variables. The predictor variables include sex, marital status, term, phone service and others. A detailed description of the variables is given below:
- Sex: sex of the customer (categorical)
- Marital_status: marital status of the customer (categorical)
- Term: term, in months (numerical)
- Phone_service: phone service (categorical)
- International_plan: international plan (categorical)
- Voice_mail_plan: voice mail plan (categorical)
- Multiple_line: multiple line (categorical)
- Internet_service: internet service (categorical)
- Technical_support: technical support (categorical)
- Streaming_videos: streaming videos (categorical)
- Agreement_period: agreement period (categorical)
- Monthly_charges: monthly charges (numerical)
- Total_charges: total charges (numerical)
- Churn: churn, yes or no (the response)
A copy of this publicly available data is stored at https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt
dat <- read.table(file = "https://github.com/chinwex/sta551/raw/main/Customer-Churn-dataset.txt", # read in the txt file
header = TRUE, # header row in the txt file
sep = ",", # separator or delimiter
dec = ".") # symbol used for decimal values
The entire data set was scanned to determine which Exploratory Data Analysis (EDA) tools to use for feature engineering. All the numerical and categorical variables were examined closely, and no missing values were found.
summary(dat) # produces summarized descriptive statistics for every
# variable in the dataset
The above summary table indicates that there are no missing values in any of the variables.
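As an additional check (a minimal sketch, not part of the original output), the missing values per column can be counted explicitly:
colSums(is.na(dat)) # number of NA values in each column; all zeros here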
Basic statistical graphics were used to visualize the shape of the data to discover the distributional information of variables from the data and the potential relationships between variables.
The following are the distributions of the categorical variables: Sex, Marital status, Phone service, and Voice mail plan.
par(mfrow = c(2,2))
sx <- table(dat$Sex) # frequency table for sex
ms <- table(dat$Marital_Status) # frequency table for marital status
pservice <- table(dat$Phone_service) # frequency table for phone service
vmp <- table(dat$Voice_mail_plan) # frequency table for voicemail plan
barplot(sx, xlab="Sex", ylab="counts", main="Distribution of Sex", col=c("lightgreen", "brown")) # barplot for sex
# xlab = xaxis label
# ylab = yaxis label
# col = color of the bars
# main = barplot heading
barplot(vmp, xlab="Voicemail Plan", ylab="counts", main="Distribution of Voicemail plan", col=c("purple", "red")) # barplot for voicemail plan
barplot(ms, xlab="Marital status", ylab="counts", main="Distribution of Marital status", col=c("blue", "orange")) # barplot for marital status
barplot(pservice, xlab="Phone Service", ylab="counts", main="Distribution of phone Service", col=c("pink", "yellow")) # barplot for phone service
From the above plots, it can be seen that 51.4% of the customers in this study are male. The majority of customers are married, have a phone service and have a voice mail plan.
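The percentages quoted here can be reproduced from the frequency tables; for example, for sex:
round(prop.table(sx)*100, 1) # percentage of customers in each sex category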
par(mfrow = c(2,2))
ml <- table(dat$Multiple_line) # frequency table for Multiple line
IS <- table(dat$Internet_service) # frequency table for Internet service
barplot(ml, xlab="Multiple Line", ylab="counts", main="Distribution of Multiple Line", col=c("purple", "red", "orange")) # barplot for multiple line
barplot(IS, xlab="Internet service", ylab="counts", main="Distribution of Internet service", col=c("pink", "lightgreen", "yellow", "brown"), names.arg = c("Cable", "DSL", "F.O", "None")) # barplot for Internet service
AP <- table(dat$Agreement_period) # frequency table for Agreement period
CN <- table(dat$Churn) # frequency table for Churn
barplot(CN, xlab="Churn", ylab="counts", main="Distribution of Churn", col=c("purple", "red", "orange")) # barplot for Churn
barplot(AP, xlab="Agreement period", ylab="Counts", main="Distribution of Agreement Period", col=c("green", "yellow", "blue"), names.arg = c("monthly", "1year", "2years"))
# barplot for Agreement period
44.4% of the customers have multiple lines. Within the Internet service category, 17.1% use cable, 28.0% use DSL, 34.1% use fiber optic and 20.8% have none. The majority were on a monthly contract. In this data, 74.1% of customers had not left the company.
One of the categorical variables, International plan, had 3 groups: No, Yes and yes, with frequencies 429, 309 and 262 respectively. This was an input error made when the data was collected.
Groups <- c("No", "yes", "Yes") # groups from the frequency table of international plan
Freq <- c(429, 262, 309) # corresponding frequencies
AB <- data.frame(Groups, Freq) # data frame of the groups and their frequencies
kable(AB) # display as a table
In order to rectify this, a new variable called grp.IP was created, containing only 2 distinct groups of the International plan variable: No and Yes.
Also, Technical support and Streaming videos each had 3 groups: Yes, No and No internet. The No and No internet groups were combined into a single group because they are close in meaning.
par(mfrow=c(2,2))
grp.IP <- ifelse(dat$International_plan == "No", "No",
ifelse(dat$International_plan == "yes", "Yes",
ifelse(dat$International_plan == "Yes", "Yes", "NA")))
# regrouped the original variable into 2 groups
G.IP <- table(grp.IP) # frequency table for the new variable
barplot(G.IP, xlab="Grp.International plan", ylab="counts", main="Distribution of Grp.IP", col=c("blue", "orange")) # barplot of the new international plan variable
tech_support <- ifelse(dat$Technical_support=="Yes", "Yes", "No")
TS <- table(tech_support) # frequency table for Technical support
barplot(TS, xlab="Technical Support", ylab="counts", main="Distribution of Technical Support", col=c("lightblue", "red")) # barplot of Technical support
stream_videos <- ifelse(dat$Streaming_Videos=="Yes", "Yes", "No")
SV <- table(stream_videos) # frequency table for streaming videos
barplot(SV, xlab="Streaming videos", ylab="counts", main="Distribution of Streaming Videos", col=c("green", "yellow")) # barplot of streaming videos
57.1% of the customers had an international plan. About a third of the
customers had technical support and 41% had video streaming.
There are 3 numerical variables: Term, Monthly charges and Total charges. Their distributions are as follows:
par(mfrow=c(1,3))
hist(dat$Term, xlab="Term in months", ylab="counts", main="Distribution of Term", col="lightgreen") # histogram of term
den=density(dat$Monthly_Charges) # kernel density estimates for monthly charges
plot(den, xlab="monthly charges", main="Density of Monthly Charges", col="brown")
# plot of the density of monthly charges
hist(dat$Total_Charges, xlab="Total Charges", ylab="counts", main="Dist. of Total Charges", col="pink") # histogram of total charges
The histogram of Term shows a non-symmetric pattern, with the highest frequency between 0 and 5 months and the lowest between 35 and 40 months.
This is quite different from the distribution of total charges, which is right-skewed, meaning that the mean is greater than the median. Here the highest frequency is between 0 and 1000 and the lowest between 8000 and 9000. The distribution has a step-wise pattern: smaller amounts have higher frequency and larger amounts have lower frequency.
The distribution of monthly charges is represented by the density plot. It is bimodal, with the first peak at approximately 20 and the second (higher) peak at approximately 90. The lowest point on the plot, corresponding to the lowest frequency, is at approximately 40.
Since the density plot of monthly charges is bimodal, with peaks around 20-30 and 80-90, this variable will be discretized for future models and algorithms. Monthly charges ranges from 18.95 to 116.25 and is grouped as follows: less than 30, low charges; 30 to 80, moderate charges; greater than 80, high charges.
The following table shows the frequency of the grouped variable, grp.month
## convert monthly charges into a character variable with 3 groups
grp.month <- ifelse(dat$Monthly_Charges < 30, "Low",
ifelse(dat$Monthly_Charges > 80, "High", "Moderate"))
## table showing the frequency of the new grouped variable
kable(table(grp.month), col.names=c("Monthly Charges", "Frequency"))
| Monthly Charges | Frequency |
|---|---|
| High | 406 |
| Low | 223 |
| Moderate | 371 |
Pairwise associations between variables were assessed graphically in three scenarios: two categorical variables, two numerical variables, and one categorical with one numerical variable.
This was done to determine whether the response is independent of the categorical variables. Variables found to be independent of the response will be excluded from the subsequent models. Mosaic plots are a convenient way to show whether two categorical variables are dependent; when they are independent, all proportions are the same and the boxes line up in a grid.
par(mfrow=c(3,2))
# mosaic plot for sex, marital status, phone service, Grp.IP, Voicemail plan and multiple
# line with Churn)
mosaicplot(Sex ~ Churn, data=dat, col=c("pink", "orange"), main="Sex Vs Churn")
mosaicplot(Marital_Status ~ Churn, data=dat, col=c("lightblue", "purple"), main="Marital status Vs Churn")
mosaicplot(Phone_service ~ Churn, data=dat, col=c("yellow", "green"), main= "Phone Service Vs Churn")
mosaicplot(grp.IP ~ Churn, data=dat, col=c("lightgreen", "blue"), main="International plan Vs Churn")
mosaicplot(Voice_mail_plan ~ Churn, data=dat, col=c("gray", "red"), main="Voice_mail_plan Vs Churn")
mosaicplot(Multiple_line ~ Churn, data=dat, col=c("lightblue", "brown"), main="Multiple_line Vs Churn")
From the above mosaic plots, it can be seen that sex, phone service, voicemail plan and multiple line appear to be independent of the response variable, Churn, because the proportion of churn cases in the individual categories of these variables appears nearly identical.
par(mfrow=c(3,2))
## mosaic plot for Agreement period, Internet service, Technical support, Streaming videos, ## grp.month with churn
mosaicplot(Agreement_period ~ Churn, data=dat, col=c("lightgreen", "blue"), main="Agreement Period Vs Churn")
mosaicplot(Internet_service ~ Churn, data=dat, col=c("yellow", "red"), main="Internet Service Vs Churn")
mosaicplot(tech_support ~ Churn, data=dat, col=c("pink", "green"), main="Technical support Vs Churn")
mosaicplot(stream_videos ~ Churn, data=dat, col=c("orange", "purple"), main="Streaming Videos Vs Churn")
mosaicplot(grp.month ~ Churn, data=dat, col=c("brown", "gold", "lightblue"), main="monthly charges Vs Churn")
In addition to marital status and international plan, the variables agreement period, internet service, monthly charges (grouped), technical support and streaming videos are not independent of the response variable, Churn.
A Pearson chi-square test was carried out to confirm the independence of sex, phone service, voice mail plan and multiple line from the binary response variable, Churn. No significant association was found between any of them and the response variable at the 0.05 significance level. Below are the chi-square p-values for each of the variables:
# chi square estimates for voicemail plan, phone service, sex and multiple line with
# Churn
Chisq.Voicemail <- chisq.test(dat$Voice_mail_plan, dat$Churn)
Chisq.Phoneservice <- chisq.test(dat$Phone_service, dat$Churn)
Chisq.sex <- chisq.test(dat$Sex, dat$Churn)
Chisq.multipleline <- chisq.test(dat$Multiple_line, dat$Churn)
chisq.table <- data.frame(Chisq.sex$p.value, Chisq.Phoneservice$p.value, Chisq.Voicemail$p.value, Chisq.multipleline$p.value) # collect the chi-square p-values in a data frame
chisq.table %>% gt() # display as a table
| Chisq.sex.p.value | Chisq.Phoneservice.p.value | Chisq.Voicemail.p.value | Chisq.multipleline.p.value |
|---|---|---|---|
| 0.1248683 | 0.3680155 | 0.6651237 | 0.3384263 |
The pair-wise scatter plot was used to assess the pairwise linear association between two numeric variables.
ggpairs(dat, # Data frame
columns = c(3,12,13), # numeric columns
aes(color = Churn, # Color by response variable (cat. variable)
alpha = 0.5))
The off-diagonal plots and numbers indicate the correlation between the pair-wise numeric variables. Total Charges and Term are strongly correlated while Total charges and monthly charges are moderately correlated. Both correlations are significant. A weak correlation exists between monthly charges and term.
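These correlations can also be computed directly (a quick check, using the same numeric columns 3, 12 and 13 passed to ggpairs() above):
cor(dat[, c(3,12,13)]) # Pearson correlation matrix of Term, Monthly_Charges and Total_Charges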
The main diagonal shows stacked density curves, which reveal potential differences in the distribution of the underlying numeric variable between the churn and non-churn groups; in other words, the stacked density curves show the relationship between a numeric and a categorical variable. These curves do not completely overlap, indicating some correlation between each of these numeric variables and the binary response variable, Churn.
Because of this, there was no need to open another subsection to illustrate the relationship between a numeric variable and a categorical variable.
This clustering algorithm divides a set of n observations into k clusters. In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set.
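Formally, k-means chooses cluster assignments $C_1,\dots,C_k$ with centers $\mu_1,\dots,\mu_k$ to minimize the total within-cluster sum of squares:

$$\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\ \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$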
## subset the data
clust.data <- dat[, c(3,12,13)]
scaled.data <- as.data.frame(scale(clust.data)[,1:3])
A subset of the original data is created containing only the 3 numeric variables: term, monthly charges and total charges. This subset is then scaled.
distance <- get_dist(scaled.data)
fviz_dist(distance, gradient = list(low = "yellow", mid = "orange", high = "darkred"), show_labels = FALSE)
Heatmap representation of potential clusters
The above heatmap indicates that different clusters exist in this data (based on the three numerical variables).
fviz_nbclust(scaled.data, kmeans, method = "wss")
The elbow method is used to find the optimal number of clusters: the fviz_nbclust() function creates a plot of the number of clusters vs. the total within sum of squares. In this plot there appears to be an elbow or "bend" at k = 5 clusters, which is taken as the optimal number of clusters.
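The same elbow computation can be done by hand (a sketch of what fviz_nbclust() computes internally):
set.seed(20)
wss <- sapply(1:10, function(k) kmeans(scaled.data, centers = k, nstart = 25)$tot.withinss)
# total within-cluster sum of squares for k = 1, ..., 10
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within sum of squares")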
k-means clustering is performed on the dataset using the optimal value for k of 5:
set.seed(20) # make this reproducible
k5 <- kmeans(x = scaled.data, # name of dataset
centers = 5, # number of clusters
iter.max = 10,
nstart = 25, # number of initial configurations
algorithm = "Hartigan-Wong",
trace = FALSE)
There were 5 clusters of sizes 204, 252, 191, 221 and 132; that is, 204 customers were assigned to cluster 1, 252 to cluster 2, 191 to cluster 3, 221 to cluster 4 and 132 to cluster 5.
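These sizes can be read directly from the fitted object:
k5$size # sizes of the five clusters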
The clusters are visualized with the fviz_cluster() function on a scatterplot whose axes are the first two principal components:
fviz_cluster(k5, data = scaled.data)
Scatter Plot of Final cluster results
kable(aggregate(clust.data, by=list(cluster=k5$cluster), mean), col.names = c("Cluster", "Term", "Monthly Charges", "Total Charges"))
At the end of the exploratory data analysis, the new cluster variable will be added into the original dataset.
cluster <- k5$cluster
Finally, only the variables to be used in subsequent modelling were kept in the dataset. Sex, phone service, voicemail plan and multiple line were dropped because of their independence from the response variable, Churn.
International plan was also dropped and the new variable grp.IP was kept instead. grp.month was also kept, as an alternative to its numerical counterpart, monthly charges, for modelling. The cluster variable was added as well. The number of variables in the final dataset is 12.
## removed all the variables that are not needed for subsequent modelling
bat <- dat[,-c(1, 4, 5, 6, 7, 9, 10)]
## added the new variables to the trimmed dataset
bat.total <- cbind(bat, grp.month, tech_support, stream_videos, grp.IP, cluster)
## saved the new dataset to a local drive
write.csv(bat.total, "C:\\Users\\echef\\Documents\\sta551\\Datasets\\cx_churn.csv")
The following variables will be used for subsequent modelling: Marital_Status, Term, Internet_service, tech_support, stream_videos, Agreement_period, Monthly_Charges, grp.month, Total_Charges, grp.IP, cluster and Churn.
In building a logistic regression model for this analysis, it is necessary to make sure that all assumptions are satisfied. The first assumption is that the response variable must be binary; this holds here, since the response, Churn, takes the values yes and no. The second is that the predictor variables are uncorrelated; because the primary aim of this analysis is to predict which customers are likely to churn, rather than to interpret the role of each predictor, severe multicollinearity is of less concern. The third is that the functional form of the predictor variables is correctly specified.
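For reference, the fitted model has the standard logistic form, where $p = P(\mathrm{Churn} = \mathrm{yes} \mid x)$:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$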
Seven of the variables are characters with 2, 3 or 4 groups. The variables with two groups are marital status, streaming videos, technical support and international plan; agreement period and grp.month have 3 groups each, while internet service has 4 groups. All the character variables were converted to factors with the corresponding levels.
The numeric variables are Term, Monthly charges and Total charges. The cluster variable takes 5 values corresponding to the 5 groups (1-5) and was also converted to a factor with 5 levels. In total, there are 10 predictor variables (each model can contain either the continuous monthly charges or the grouped version, but not both).
## first we read in the dataset containing all the variables that
## will be used for modelling
mat <- read.csv("https://github.com/chinwex/datasets/raw/main/cx_churn.csv")[,-1]
## converted all character variables to factor
marital.status <- as.factor(mat$Marital_Status) # has 2 levels
internet.service <- as.factor(mat$Internet_service) # has 4 levels
churn <- as.factor(mat$Churn) # has 2 levels
technical.support <- as.factor(mat$tech_support) # has 2 levels
streaming.videos <- as.factor(mat$stream_videos) # has 2 levels
agreement.period <- as.factor(mat$Agreement_period) # has 3 levels
International.plan <- as.factor(mat$grp.IP) # has 2 levels
grpd.month <- as.factor(mat$grp.month) # has 3 levels
cluster <- as.factor(mat$cluster) # has 5 levels
#created duplicates of the 3 numerical variables
Term <- mat$Term
Monthly_Charges <- mat$Monthly_Charges
Total_Charges <- mat$Total_Charges
# put all the 11 variables for modelling in one new dataset
dat.1 <- data.frame(marital.status, Term, internet.service, technical.support, streaming.videos, agreement.period, grpd.month, International.plan, Monthly_Charges, Total_Charges, cluster, churn)
First, a logistic regression model containing all predictor variables, with monthly charges kept as a numeric variable, was built. This is called the first model.
# the logistic regression model containing all predictors
first.model <- glm(churn ~ marital.status + Term + technical.support + internet.service + streaming.videos + agreement.period + International.plan + cluster + Monthly_Charges + Total_Charges, family = binomial, data = dat.1)
#summary of the regression coefficients
firstmodel.table <- summary(first.model)$coef
#table showing the significance tests in the regression model
kable(firstmodel.table, caption = "Significance Tests for the First Model")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.3607015 | 0.9453676 | -2.4971254 | 0.0125205 |
| marital.statusSingle | -0.1758372 | 0.3828222 | -0.4593183 | 0.6460056 |
| Term | -0.0341104 | 0.0188862 | -1.8061068 | 0.0709017 |
| technical.supportYes | -0.7053120 | 0.2225359 | -3.1694304 | 0.0015274 |
| internet.serviceDSL | -0.5606394 | 0.3616395 | -1.5502715 | 0.1210764 |
| internet.serviceFiber optic | -0.3355105 | 0.2291304 | -1.4642778 | 0.1431181 |
| internet.serviceNo Internet | -1.9911186 | 0.6167162 | -3.2285818 | 0.0012441 |
| streaming.videosYes | -0.0514545 | 0.2608122 | -0.1972855 | 0.8436041 |
| agreement.periodOne year contract | -1.8023818 | 0.3242817 | -5.5580753 | 0.0000000 |
| agreement.periodTwo year contract | -1.9417424 | 0.4175161 | -4.6507007 | 0.0000033 |
| International.planYes | 0.1843095 | 0.2074822 | 0.8883150 | 0.3743713 |
| cluster2 | -0.2465339 | 0.3868904 | -0.6372190 | 0.5239822 |
| cluster3 | 1.1873077 | 0.7592222 | 1.5638475 | 0.1178534 |
| cluster4 | 1.5778116 | 0.4738521 | 3.3297551 | 0.0008692 |
| cluster5 | 2.6631985 | 0.8390290 | 3.1741436 | 0.0015028 |
| Monthly_Charges | 0.0383224 | 0.0121460 | 3.1551530 | 0.0016041 |
| Total_Charges | 0.0000022 | 0.0002110 | 0.0104330 | 0.9916758 |
The AIC of the first model is 859.1497, and it contains 16 coefficient terms. Some of the terms were significant at the .05 level: the intercept, technical support, no internet, one-year agreement, two-year agreement, cluster 4, cluster 5 and monthly charges. The one-year and two-year agreements were significantly different from the baseline monthly contract, clusters 4 and 5 were significantly different from cluster 1, and no internet was significantly different from cable.
first.model$aic # the aic of the first model
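Although odds ratios are not reported in the original analysis, the coefficients can be exponentiated to obtain them, which makes interpretation easier:
round(exp(coef(first.model)), 3) # odds ratios; values below 1 correspond to reduced odds of churn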
Then another model containing all the predictors, but this time with monthly charges as a factor (grpd.month), was built.
# the logistic regression model containing all predictors, with grpd.month
second.model <- glm(churn ~ marital.status + Term + technical.support + internet.service + streaming.videos + agreement.period + International.plan + cluster + grpd.month + Total_Charges, family = binomial, data = dat.1)
#summary of the regression coefficients
secondmodel.table <- summary(second.model)$coef
#table showing the significance tests in the regression model
kable(secondmodel.table, caption = "Significance Tests for the Second Model")
second.model$aic # the aic of the second model
The AIC of the second model is 868.3467, and it contains 17 coefficient terms. Here, the variables significant at the .05 level are: Term, technical support, internet service (DSL), internet service (no internet), one-year agreement, two-year agreement, cluster 4 and cluster 5.
Compared on AIC, the first model is better, with the lower value of 859.1497. This suggests that monthly charges works better as a numerical variable than as a grouped variable. Therefore, subsequent modelling is carried out starting from the first model. Variables that must be included in the model, based on results from other studies and analyses, are agreement period, term and monthly charges. With these three variables, the reduced model is built.
# the reduced model containing the 3 main variables
reduced.model <- glm(churn ~ Term + agreement.period + Monthly_Charges, family = binomial, data = dat.1)
#summary of the coefficients of the reduced model
reducedmodel.table <- summary(reduced.model)$coef
#table showing the significance tests of the reduced model
kable(reducedmodel.table, caption = "Significance Tests for Reduced Model")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -1.8270051 | 0.2414150 | -7.567903 | 0.0000000 |
| Term | -0.0174912 | 0.0051167 | -3.418431 | 0.0006298 |
| agreement.periodOne year contract | -1.8564358 | 0.2919578 | -6.358575 | 0.0000000 |
| agreement.periodTwo year contract | -2.1559901 | 0.3681332 | -5.856549 | 0.0000000 |
| Monthly_Charges | 0.0266624 | 0.0034836 | 7.653632 | 0.0000000 |
reduced.model$aic # aic of the reduced model
The AIC of the reduced model is 896.8588, and it contains 4 coefficient terms, all significant at the .05 level. All the significant variables from the first model were then added to the reduced model to build a fourth model: technical support, internet service, agreement period, cluster and monthly charges. Since monthly charges and agreement period were already present in the reduced model, only technical support, internet service and cluster were added from the first model.
# A fourth model containing all variables in the reduced model and significant variables from the first model.
fourth.model <- glm(churn ~ Term + technical.support + internet.service + cluster + agreement.period + Monthly_Charges, family = binomial, data = dat.1)
# summary of the regression coefficients of the fourth model
fourthmodel.table <- summary(fourth.model)$coef
# table showing the significance tests for the fourth model
kable(fourthmodel.table, caption = "Significance Tests for Fourth Model")
fourth.model$aic #aic for fourth model
The AIC of the fourth model was 852.3914, and it contains 12 coefficient terms. Two dummy variables of internet service (fiber optic and DSL) and two of cluster (2 and 3) were not significant at the .05 level. The results are not shown here because they are exactly those obtained for the next model below. The next step is to use an automatic variable selection procedure to find the best model.
This is done with the automatic variable selection function, step(). Starting from the first model, insignificant variables are dropped using AIC as the inclusion/exclusion criterion.
# the automatic variable selection method
modelA = step(first.model,
scope=list(lower=formula(reduced.model),upper=formula(first.model)),
data = dat.1,
direction = "backward",
trace = 0) # trace = 0: suppress the detailed selection process
# summary of the coefficient of modelA
modelA.coef = summary(modelA)$coef
#table showing the significant tests of modelA
kable(modelA.coef , caption = "Summary Table of Significant Tests for model A")
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.2249631 | 0.7230624 | -3.0771384 | 0.0020900 |
| Term | -0.0346433 | 0.0126062 | -2.7481259 | 0.0059937 |
| technical.supportYes | -0.7045058 | 0.2208271 | -3.1903049 | 0.0014212 |
| internet.serviceDSL | -0.5500562 | 0.3356312 | -1.6388709 | 0.1012401 |
| internet.serviceFiber optic | -0.3118063 | 0.2279093 | -1.3681160 | 0.1712758 |
| internet.serviceNo Internet | -1.9455052 | 0.5565853 | -3.4954308 | 0.0004733 |
| agreement.periodOne year contract | -1.8075414 | 0.3242020 | -5.5753562 | 0.0000000 |
| agreement.periodTwo year contract | -1.9237553 | 0.4128618 | -4.6595626 | 0.0000032 |
| cluster2 | -0.2987356 | 0.3767943 | -0.7928348 | 0.4278741 |
| cluster3 | 1.0961043 | 0.7008642 | 1.5639326 | 0.1178334 |
| cluster4 | 1.4185471 | 0.4051955 | 3.5008950 | 0.0004637 |
| cluster5 | 2.4506129 | 0.7801954 | 3.1410247 | 0.0016836 |
| Monthly_Charges | 0.0382299 | 0.0086272 | 4.4312985 | 0.0000094 |
modelA$aic # aic of modelA
This model is identical to the fourth model, with the same AIC of 852.3914 and 12 coefficient terms. The two dummy variables of internet service (fiber optic and DSL) and two of cluster (2 and 3) were again not significant at the .05 level.
The final model is used to predict whether a customer will leave the company or not based on the new values of the predictor variables.
The predicted response is compared to the original response. This is shown in the following table.
#puts the predicted values in a dataset
pred.churn <- predict(modelA, type="response")
## threshold probability
cut.off.prob = 0.5
## This groups the response as 1 = Yes and 0 = No
pred.response <- ifelse(pred.churn > cut.off.prob, "Yes", "No")
## Add the new predicted response to the dataset, dat.1
dat.1$Pred.Response <- pred.response
## table showing the first 4 observations in the dataset with the predicted response column
kable(head(dat.1[,-c(5, 7, 8, 10)], 4), caption = "Dataset with model predicted response", col.names = c("Mar.status", "Term", "Int.service", "Tech.support", "Agr.period", "Month.charges", "cluster", "churn", "Predicted"))
| Mar.status | Term | Int.service | Tech.support | Agr.period | Month.charges | cluster | churn | Predicted |
|---|---|---|---|---|---|---|---|---|
| Married | 16 | Cable | Yes | Monthly contract | 98.05 | 1 | Yes | Yes |
| Married | 70 | Cable | Yes | One year contract | 75.25 | 3 | No | No |
| Married | 36 | Cable | Yes | Monthly contract | 73.35 | 2 | No | No |
| Married | 72 | Cable | Yes | One year contract | 112.60 | 3 | No | No |
The predicted response for the first 4 observations in the dataset closely matches the original response.
The frequency of the original response variable was then compared with that of the predicted response variable: in the original variable, 74.1% of customers are still with the company, while the predicted response puts this at 77.4%. Therefore, this model is acceptable.
kable(table(dat.1$Pred.Response), caption="Frequency of Predicted Response Variable")
kable(table(dat.1$churn), caption="Frequency of Original Response Variable")
A hypothetical dataset was formed and the model (modelA) was used to predict the response variable. The results are shown below:
## A hypothetical dataset containing new information was created
dat.2 = data.frame(Term = c(38, 50, 14, 4),
technical.support = c("Yes", "No", "No", "Yes"),
internet.service = c("Cable", "No Internet", "No Internet", "Cable"),
agreement.period = c("Monthly contract", "Monthly contract", "One year contract", "Monthly contract"),
Monthly_Charges = c(100.5, 87.6, 110.50, 90),
cluster = c("3", "1", "2", "1"))
## ModelA was used to predict the response
pred.churn.1 = predict(modelA, newdata = dat.2, type="response")
## threshold probability
cut.off.prob = 0.5
## This groups the response as 1 = Yes and 0 = No
pred.response.1 = ifelse(pred.churn.1 > cut.off.prob, "Yes", "No")
## Add the new predicted response to dat.2
dat.2$Pred.Response.1 = pred.response.1
## table showing the new dataset with the predicted response
kable(dat.2, caption = "Predicted Values of New Data", col.names = c("Term", "Tech.support", "Internet.service", "Agr.period", "Monthly.charges", "cluster", "Predicted"))
| Term | Tech.support | Internet.service | Agr.period | Monthly.charges | cluster | Predicted |
|---|---|---|---|---|---|---|
| 38 | Yes | Cable | Monthly contract | 100.5 | 3 | Yes |
| 50 | No | No Internet | Monthly contract | 87.6 | 1 | No |
| 14 | No | No Internet | One year contract | 110.5 | 2 | No |
| 4 | Yes | Cable | Monthly contract | 90.0 | 1 | Yes |
Since the sample size is large, the data was split randomly 70%:30%, with 70% for training and validating models and 30% for testing. Of the 1000 observations in total, 700 went into the training data set and 300 into the testing data set.
df <- read.csv("https://github.com/chinwex/datasets/raw/main/cx_churn.csv")[,-c(1,9)]
## convert all the character variables with 2 groups into 0 and 1 values
df$Marital_Status <- ifelse(df$Marital_Status=="Single", 1, 0)
df$tech_support <- ifelse(df$tech_support=="Yes", 1, 0)
df$Churn <- ifelse(df$Churn=="Yes", 1, 0)
df$Agreement_period <- ifelse(df$Agreement_period=="Monthly contract", 0,
ifelse(df$Agreement_period=="One year contract", 1, 2))
df$Agreement_period <- as.factor(df$Agreement_period)
df$Internet_service <- ifelse(df$Internet_service=="Cable", 0,
ifelse(df$Internet_service=="DSL", 1,
ifelse(df$Internet_service=="Fiber optic", 2,3)))
df$Internet_service <- as.factor(df$Internet_service)
df$cluster <- as.factor(df$cluster)
df$stream_videos <- ifelse(df$stream_videos=="Yes", 1, 0)
df$grp.IP <- ifelse(df$grp.IP=="Yes", 1, 0)
## gives the dimensions of a dataframe
## number of rows or columns
nn <- dim(df)[1]
set.seed(200)
## divides the data set into 70% training and 30% testing
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
training = df[train.id,]
testing = df[-train.id,]
The three best models from the previous section were used: modelA, the reduced model and the first model. Cross-validation was performed on all 3 models using the training dataset. This involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set) and validating the analysis on the other subset (the validation set).
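For each candidate cut-off probability $c$, the procedure below records the accuracy on each of the 5 validation folds and averages them:

$$\overline{\mathrm{acc}}(c) = \frac{1}{5}\sum_{k=1}^{5} \mathrm{acc}_k(c)$$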
par(mfrow=c(2,2))
ab <- dim(training)[1]/5 # total obs in training dataset is divided by 5
## for modelA
cutoff.prob <- seq(0,1,length = 22)[-c(1,22)] # 20 candidate cut-off probabilities in (0,1)
## null matrix for storing prediction accuracy (5 folds x 20 cut-offs)
pred.accuracy <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## for the reduced model
cutoff.prob3 <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy3 <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## for the first model
cutoff.prob2 <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy2 <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## Created a 5-fold Cross Validation
for (i in 1:5){
valid.id <- ((i-1)*ab + 1):(i*ab)
valid.data <- training[valid.id,]
train.data <- training[-valid.id,]
# used modelA
train.model <- glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges , family = binomial(link = logit), data = train.data)
# used the reduced model
train.model3 <- glm(Churn ~ Term + Agreement_period + Monthly_Charges, family = binomial(link = logit), data = train.data)
# used the first model
train.model2 <- glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges, family = binomial(link = logit), data = train.data)
## this gives the data set containing the predicted probabilities of the train models on
## the valid dataset
pred.prob = predict.glm(train.model, valid.data, type = "response")
pred.prob3 = predict.glm(train.model3, valid.data, type = "response")
pred.prob2 = predict.glm(train.model2, valid.data, type = "response")
## define the confusion matrix
for(j in 1:20){
pred_churn = rep(0,length(pred.prob))
pred_churn3 = rep(0,length(pred.prob3))
pred_churn2 = rep(0,length(pred.prob2))
valid.data$pred_churn = as.numeric(pred.prob >cutoff.prob[j])
valid.data$pred_churn3 = as.numeric(pred.prob3 >cutoff.prob3[j])
valid.data$pred_churn2 = as.numeric(pred.prob2 >cutoff.prob2[j])
a11 = sum(valid.data$pred_churn == valid.data$Churn)
a113 = sum(valid.data$pred_churn3 == valid.data$Churn)
a112 = sum(valid.data$pred_churn2 == valid.data$Churn)
pred.accuracy[i,j] = a11/length(pred.prob)
pred.accuracy3[i,j] = a113/length(pred.prob3)
pred.accuracy2[i,j] = a112/length(pred.prob2)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean) # calc. the mean by column of the matrix
max.id = which(avg.accuracy == max(avg.accuracy )) # maximum avg accuracy for modelA
avg.accuracy3 = apply(pred.accuracy3, 2, mean) # calc. the mean by column of the matrix
max.id3 = which(avg.accuracy3 == max(avg.accuracy3)) # maximum avg accuracy for reduced model
avg.accuracy2 = apply(pred.accuracy2, 2, mean) # calc. the mean by column of the matrix
max.id2 = which(avg.accuracy2 == max(avg.accuracy2)) # maximum avg accuracy for first model
## PLOT FOR MODELA
### visual representation of cross validation
tick.label = as.character(round(cutoff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "ModelA"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "green")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "blue", cex = 0.8)
## PLOT FOR THE REDUCED MODEL
### visual representation of cross validation
tick.label3 = as.character(round(cutoff.prob3,2))
plot(1:20, avg.accuracy3, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "Reduced model"
)
axis(1, at=1:20, label = tick.label3, las = 2)
axis(2)
segments(max.id3, 0.5, max.id3, avg.accuracy3[max.id3], col = "purple")
text(max.id3, avg.accuracy3[max.id3]+0.03, as.character(round(avg.accuracy3[max.id3],4)), col = "red", cex = 0.8)
## PLOT FOR THE FIRST MODEL
### visual representation of cross validation
tick.label2 = as.character(round(cutoff.prob2,2))
plot(1:20, avg.accuracy2, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "First Model"
)
axis(1, at=1:20, label = tick.label2, las = 2)
axis(2)
segments(max.id2, 0.5, max.id2, avg.accuracy2[max.id2], col = "brown")
text(max.id2, avg.accuracy2[max.id2]+0.03, as.character(round(avg.accuracy2[max.id2],4)), col = "blue", cex = 0.8)
5-fold CV performance plot
ModelA was the model generated by automatic variable selection; the optimal cut-off probability that yields its best accuracy is 0.52. The reduced model consisted of term, agreement period and monthly charges; its optimal cut-off probability is 0.57. The first model contained all the predictors; its optimal cut-off probability is 0.52.
The models were fit to the original training data to find the regression coefficients and then used on the holdout testing sample to find the accuracy. The result is shown below:
## ModelA was used on the training data set to generate regression coefficients
## stored on the test.model
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set
test.model3 = glm(Churn ~ Term + Agreement_period + Monthly_Charges, #reduced model
family = binomial(link = logit),
data = training) # training data set
test.model2 = glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges,
family = binomial(link = logit),
data = training) # training data set
## gives the data set containing the predicted probabilities of the test models on
## the testing dataset
pred.prob.test = predict.glm(test.model, testing, type = "response")
pred.prob.test3 = predict.glm(test.model3, testing, type = "response")
pred.prob.test2 = predict.glm(test.model2, testing, type = "response")
testing$test.churn = as.numeric(pred.prob.test > 0.38)
testing$test.churn3 = as.numeric(pred.prob.test3 > 0.52)
testing$test.churn2 = as.numeric(pred.prob.test2 > 0.52)
a11 = sum(testing$test.churn == testing$Churn)
a113 = sum(testing$test.churn3 == testing$Churn)
a112 = sum(testing$test.churn2 == testing$Churn)
test.accuracy = a11/length(pred.prob.test)
test.accuracy3 = a113/length(pred.prob.test3)
test.accuracy2 = a112/length(pred.prob.test2)
Model_type <- c("ModelA", "Reduced Model", "First Model")
Accuracy <- c(test.accuracy, test.accuracy3, test.accuracy2)
Model_Accuracy <- data.frame(Model_type,Accuracy)
kable(Model_Accuracy, col.names = c("Model", "Test Accuracy"))
| Model | Test Accuracy |
|---|---|
| ModelA | 0.7933333 |
| Reduced Model | 0.7566667 |
| First Model | 0.7800000 |
The regression coefficients obtained by fitting the models on the training data were used to obtain accuracy on the test data. The accuracy was 79.33% for modelA, 75.67% for the reduced model and 78.0% for the first model. The model with the best test accuracy was modelA. When its test accuracy is compared with its training accuracy (79.14%), no underfitting or overfitting is seen.
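A minimal sketch of how that training accuracy can be verified, assuming the same 0.38 cut-off that was applied to the test set:
pred.prob.train = predict.glm(test.model, training, type = "response") # probabilities on the training data
mean(as.numeric(pred.prob.train > 0.38) == training$Churn) # training accuracy of modelA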
Below is the plot of the ROC curve (1-specificity, sensitivity).
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, # ModelA
family = binomial(link = logit),
data = training) # training data set
test.model3 = glm(Churn ~ Term + Agreement_period + Monthly_Charges, #reduced model
family = binomial(link = logit),
data = training) # training data set
test.model2 = glm(Churn ~ Marital_Status + Term + tech_support + Internet_service + stream_videos + Agreement_period + grp.IP + cluster + Monthly_Charges + Total_Charges,
family = binomial(link = logit),
data = training) # first model
## gives the predicted probabilities of the fitted models on the
## training dataset (used for the ROC analysis)
pred.prob.test = predict.glm(test.model, training, type = "response")
pred.prob.test3 = predict.glm(test.model3, training, type = "response")
pred.prob.test2 = predict.glm(test.model2, training, type = "response")
### components for defining various measures for modelA
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:20){
training$train.churn = as.numeric(pred.prob.test > cut.off.seq[i])
### components for defining various measures for modelA
p0.a0 = sum(training$train.churn ==0 & training$Churn==0)
p0.a1 = sum(training$train.churn ==0 & training$Churn ==1)
p1.a0 = sum(training$train.churn ==1 & training$Churn==0)
p1.a1 = sum(training$train.churn ==1 & training$Churn ==1)
###
sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
### components for defining various measures for first model
cut.off.seq2 = seq(0, 1, length = 20)
sensitivity.vec2 = NULL
specificity.vec2 = NULL
for (m in 1:20){
training$train.churn2 = as.numeric(pred.prob.test2 > cut.off.seq2[m])
p0.a02 = sum(training$train.churn2 ==0 & training$Churn==0)
p0.a12 = sum(training$train.churn2 ==0 & training$Churn ==1)
p1.a02 = sum(training$train.churn2 ==1 & training$Churn==0)
p1.a12 = sum(training$train.churn2 ==1 & training$Churn ==1)
sensitivity.vec2[m] = p1.a12 / (p1.a12 + p0.a12)
specificity.vec2[m] = p0.a02/ (p0.a02 + p1.a02)
}
one.minus.spec2 = c(1,1 - specificity.vec2)
sens.vec2 = c(1,sensitivity.vec2)
### components for defining various measures for reduced model
cut.off.seq3 = seq(0, 1, length = 20)
sensitivity.vec3 = NULL
specificity.vec3 = NULL
for (n in 1:20){
training$train.churn3 = as.numeric(pred.prob.test3 > cut.off.seq3[n])
p0.a03 = sum(training$train.churn3 ==0 & training$Churn==0)
p0.a13 = sum(training$train.churn3 ==0 & training$Churn ==1)
p1.a03 = sum(training$train.churn3 ==1 & training$Churn==0)
p1.a13 = sum(training$train.churn3 ==1 & training$Churn ==1)
sensitivity.vec3[n] = p1.a13 / (p1.a13 + p0.a13)
specificity.vec3[n] = p0.a03/ (p0.a03 + p1.a03)
}
one.minus.spec3 = c(1,1 - specificity.vec3)
sens.vec3 = c(1,sensitivity.vec3)
## Visualization
par(pty = "s") # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1),
xlab ="1 - specificity",
ylab = "sensitivity",
main = "ROC Curves of the 3 Models",
lwd = 2,
col = "purple", cex=0.5 )
segments(0,0,1,1, col = "orange", lty = 2, lwd = 2)
AUC = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
text(0.6, 0.3, paste("AUC = ", AUC), col = "purple", cex=0.5)
lines(one.minus.spec2,sens.vec2, col="red")
AUC2 = round(sum(sens.vec2*(one.minus.spec2[-21]-one.minus.spec2[-1])),4)
text(0.5, 0.2, paste("AUC = ", AUC2), col = "red", cex=0.5)
lines(one.minus.spec3,sens.vec3, col="green")
AUC3 = round(sum(sens.vec3*(one.minus.spec3[-21]-one.minus.spec3[-1])),4)
text(0.8, 0.5, paste("AUC = ", AUC3), col = "green", cex=0.5)
legend(0,1,legend=c("ModelA", "Reduced model", "First model"), col=c("purple","green","red"), pch=0, cex=0.4)
From the plot, it can be seen that the AUC for modelA, 0.8639, is the highest, which is evidence that it is a very good model. The curve is also very close to the top-left corner, a good indication of excellent performance.
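The AUC values printed on the plot are Riemann-sum approximations over the ROC points, of the form

$$\mathrm{AUC} \approx \sum_{i} \mathrm{sens}_i \left[(1-\mathrm{spec})_i - (1-\mathrm{spec})_{i+1}\right]$$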
Of the 3 logistic models evaluated above using cross-validation and KPI measures, modelA had the best test accuracy (79.33%) and the highest AUC (0.8639) from the ROC curve. It fits the data well. Below is the table showing the local performance metrics:
## ModelA was used on the training data set to generate regression coefficients
## stored on the test.model
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set
## gives the data set containing the predicted probabilities of the test.model
pred.prob.test = predict.glm(test.model, testing, type = "response")
testing$test.churn = as.numeric(pred.prob.test > 0.38)
### components for defining various measures
p0.a0 = sum(testing$test.churn ==0 & testing$Churn==0)
p0.a1 = sum(testing$test.churn ==0 & testing$Churn ==1)
p1.a0 = sum(testing$test.churn ==1 & testing$Churn==0)
p1.a1 = sum(testing$test.churn ==1 & testing$Churn ==1)
sensitivity <- p1.a1 / (p1.a1 + p0.a1)
specificity <- p0.a0 / (p0.a0 + p1.a0)
precision <- p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics for the Best Logistic Model")
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.6931818 | 0.8349057 | 0.6354167 | 0.6931818 | 0.6630435 |
Below is the optimal cut-off probability for the best logistic model.
ab <- dim(training)[1]/5 # total obs in training dataset is divided by 5
## a sequence of 20 candidate cut-off probabilities strictly between 0 and 1
cutoff.prob <- seq(0,1,length = 22)[-c(1,22)]
## null matrix for storing prediction accuracy
pred.accuracy <- matrix(0,ncol=20, nrow=5, byrow = TRUE)
## Created a 5-fold Cross Validation
for (i in 1:5){
valid.id <- ((i-1)*ab + 1):(i*ab)
valid.data <- training[valid.id,]
train.data <- training[-valid.id,]
# used the final model
train.model <- glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, family = binomial(link = logit), data = train.data)
## this gives the data set containing the predicted probabilities of the train.model on
## the valid.data
pred.prob = predict.glm(train.model, valid.data, type = "response")
## define the confusion matrix
for(j in 1:20){
pred_churn = rep(0,length(pred.prob))
valid.data$pred_churn = as.numeric(pred.prob >cutoff.prob[j])
a11 = sum(valid.data$pred_churn == valid.data$Churn)
pred.accuracy[i,j] = a11/length(pred.prob)
}
}
avg.accuracy = apply(pred.accuracy, 2, mean) # calc. the mean by column of the matrix
max.id = which(avg.accuracy == max(avg.accuracy)) # maximum avg accuracy
### visual representation of cross validation
tick.label = as.character(round(cutoff.prob,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.4,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "5-fold Cross Validation performance of ModelA"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "brown")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "blue", cex = 0.8)
5-fold CV performance plot
At the end of the cross validation and testing, a subset of the main dataset containing the variables included in ModelA was created and ready to be used to build subsequent models.
keep <- df[,-c(1,6,9,10)]
write.csv(keep,"C:\\Users\\echef\\Documents\\sta551\\Datasets\\churn_model.csv")
Neural network models require all feature variables to be numeric. Categorical variables are therefore converted to dummy variables using model.matrix() and the numerical variables are scaled.
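The numeric variables are rescaled to the unit interval with min-max scaling:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$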
## first we read in the data set containing all the variables that
## will be used for modelling
gat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
gat$Term <- (gat$Term-min(gat$Term))/(max(gat$Term)-min(gat$Term))
gat$Monthly_Charges <- (gat$Monthly_Charges-min(gat$Monthly_Charges))/(max(gat$Monthly_Charges)-min(gat$Monthly_Charges))
## convert categorical variables to factor
gat$cluster <- as.factor(gat$cluster)
gat$Internet_service <- as.factor(gat$Internet_service)
gat$Agreement_period <- as.factor(gat$Agreement_period)
gat$Churn <- as.factor(gat$Churn)
gat$tech_support <- as.factor(gat$tech_support)
gatmtx = model.matrix(~ ., data = gat)
colnames(gatmtx)
## Then the variables were renamed using simpler names.
colnames(gatmtx)[3] <- "DSL"
colnames(gatmtx)[4] <- "fiber"
colnames(gatmtx)[5] <- "nointernet"
colnames(gatmtx)[6] <- "oneyear"
colnames(gatmtx)[7] <- "twoyears"
colnames(gatmtx)[8] <- "monthlycharges"
colnames(gatmtx)[9] <- "Churn"
colnames(gatmtx)[10] <- "techsupport"
colnames(gatmtx)[11] <- "cluster2"
colnames(gatmtx)[12] <- "cluster3"
colnames(gatmtx)[13] <- "cluster4"
colnames(gatmtx)[14] <- "cluster5"
colnames(gatmtx)
The model formula is:
Churn ~ Term + DSL + fiber + nointernet + oneyear + twoyears + monthlycharges + techsupport + cluster2 + cluster3 + cluster4 + cluster5
It is assembled programmatically from the column names:
columnNames <- colnames(gatmtx)
columnList <- paste(columnNames[-c(1,9)], collapse = "+")
columnList <- paste(c(columnNames[9],"~",columnList), collapse="")
modelFormula <- formula(columnList)
modelFormula
This follows the usual steps for building a neural network model to predict customer churn. The data was split into two parts: 70% for training the neural network and 30% for testing. Cross-validation was done with the training data and the model was then tested on the testing data.
n = dim(gatmtx)[1]
set.seed(200)
trainID = sample(1:n, round(n*0.7), replace = FALSE)
traindat = gatmtx[trainID,]
testdat = gatmtx[-trainID,]
NetworkModel = neuralnet(modelFormula,
data = traindat, # training dataset
hidden = 1, # single layer NN
rep = 1, # number of replicates in training NN
threshold = 0.01, # threshold for the partial derivatives as the stopping criterion
learningrate = 0.1, # user selected rate
algorithm = "rprop+"
)
kable(NetworkModel$result.matrix)
The plot of a single layer neural network model of customer churn is shown.
plot(NetworkModel, rep="best")
n0 = dim(traindat)[1]/5
cut.off.score = seq(0,1, length = 22)[-c(1,22)] # candidate cut off prob
pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T) # null vector for storing prediction accuracy
##
for (i in 1:5){
valid.id = ((i-1)*n0 + 1):(i*n0)
valid.data = traindat[valid.id,]
train.data = traindat[-valid.id,]
####
train.model = neuralnet(modelFormula,
data = train.data,
hidden = 1, # single layer NN
rep = 1, # number of replicates in training NN
threshold = 0.01, # threshold for the partial derivatives as stopping criteria.
learningrate = 0.1, # user selected rate
algorithm = "rprop+"
)
pred.nn.score = predict(train.model, valid.data) # predict on the held-out fold with the fold-trained model
for(j in 1:20){
pred.churn = as.numeric(pred.nn.score > cut.off.score[j])
a11 = sum(pred.churn == valid.data[,9])
pred.accuracy[i,j] = a11/length(pred.nn.score)
}
}
###
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
### visual representation
tick.label = as.character(round(cut.off.score,2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.6,1),
axes = FALSE,
xlab = "Cut-off Score",
ylab = "Accuracy",
main = "5-fold CV performance"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "brown", cex = 0.8)
The cross validation in the neural network was carried out with the
training dataset. The optimal cut off probability obtained for the
neural network model was 0.48 and the training accuracy was 0.8214.
The model was tested with the testing data made up of 300 observations and the test accuracy was obtained.
#Test the resulting output
nn.results <- predict(NetworkModel, testdat)
results <- data.frame(actual = testdat[,9], prediction = nn.results > 0.48)
confMatrix = table(results$prediction, results$actual) # confusion matrix
accuracy=sum(results$actual == results$prediction)/length(results$prediction)
kable(confMatrix, caption="Confusion Matrix")
| | 0 | 1 |
|---|---|---|
| FALSE | 184 | 32 |
| TRUE | 28 | 56 |
kable(accuracy, caption="Test Accuracy of Neural Network", col.names="Test Accuracy")
| Test Accuracy |
|---|
| 0.8 |
The test accuracy was (184 + 56)/300 = 80%.
An ROC curve is shown for the above neural network model based on the training data set.
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
# plotting ROC
plot(1-SenSpe[2,], SenSpe[1,], type ="b", xlim=c(0,1), ylim=c(0,1),
xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
main = "ROC Curve of the Neural Network Model", col = "purple")
abline(0,1, lty = 2, col = "orange")
legend("bottomright", c("ROC of the model", "Random guessing"), lty=c(1,2),
col = c("purple", "orange"), bty = "n", cex = 0.8)
AUC = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "purple")
ROC Curve of the neural network model
The above ROC curve indicates that the underlying neural network performs better than random guessing, since the area under the curve is significantly greater than 0.5. Also, since the AUC, 0.9039, is well above 0.65, the neural network model is acceptable.
Both models, the final logistic regression model and the Neural Network, were compared using their ROC curves and AUC values.
## the neural network model
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
## The logistic regression model
nn <- dim(gat)[1]
set.seed(200) #used same seed
## divides the data set into 70% training and 30% testing
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
training = gat[train.id,]
testing = gat[-train.id,]
t.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = training) # training data set obtained in section 5 during
# cross-validation for the LR model
## ModelA was used on the training dataset again to obtain prob.
pred.prob.fm = predict.glm(t.model, training, type = "response")
### Sensitivity and Specificity for the LR model
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (m in 1:20){
training$train.churn = as.numeric(pred.prob.fm > cut.off.seq[m])
p0.a0 = sum(training$train.churn ==0 & training$Churn==0)
p0.a1 = sum(training$train.churn ==0 & training$Churn ==1)
p1.a0 = sum(training$train.churn ==1 & training$Churn==0)
p1.a1 = sum(training$train.churn ==1 & training$Churn ==1)
sensitivity.vec[m] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[m] = p0.a0/ (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
# plotting ROC for both models
plot(1-SenSpe[2,], SenSpe[1,], type ="b", xlim=c(0,1), ylim=c(0,1),
xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
main = "ROC Curve of LR and NN Models", col = "purple")
abline(0,1, lty = 2, col = "orange")
AUC = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "purple")
points(one.minus.spec,sens.vec, col="red")
lines(one.minus.spec,sens.vec, col="red")
AUC.LR = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
text(0.5, 0.2, paste("AUC = ", AUC.LR), col = "red")
legend("bottomright", c("NN model", "LR model", "Random guessing"), lty=c(1,1,2),
col = c("purple", "red", "orange"), bty = "n", cex = 0.8)
The neural network model had a test accuracy of 80% and an AUC of 0.9039, while the best logistic regression model had a test accuracy of 79.33% and an AUC of 0.8639. Both models are acceptable, but the neural network is clearly the better of the two.
In this section, a decision tree was used as a predictive model to draw conclusions about the customer churn data. The main goal of this model is to predict the value of a response variable based on several input variables. A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. The decision tree is appropriate for this data because it is easy to interpret, very flexible, and relatively insensitive to correlations between attributes: if two variables in this dataset are highly correlated, the algorithm will simply choose one of them to split on.
rat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
## convert churn variable to factor with yes and no
rat$Churn <- ifelse(rat$Churn==1, "Yes", "No")
## convert categorical variables to factor
rat$cluster <- as.factor(rat$cluster)
rat$Internet_service <- as.factor(rat$Internet_service)
rat$Agreement_period <- as.factor(rat$Agreement_period)
rat$Churn <- as.factor(rat$Churn)
rat$tech_support <- as.factor(rat$tech_support)
# use a random split approach
n = dim(rat)[1] # sample size
set.seed(200)
# using without replacement
train.id = sample(1:n, round(0.7*n), replace = FALSE)
train = rat[train.id, ] # training data
test = rat[-train.id, ] # testing data
To build the different decision tree models, the data set was first split randomly into a training dataset of 700 observations and a testing dataset of 300 observations. Cross-validation will be carried out and the optimal cut-off score calculated using the training dataset. ROC analysis will also be performed, and the decision tree with the largest AUC will be identified as the best.
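As a quick sanity check on the random split, one could compare the churn proportions in the two subsets; a small illustrative sketch:
## illustrative check: churn rates should be similar in both subsets
prop.table(table(train$Churn)) # proportions of No/Yes in the training data
prop.table(table(test$Churn)) # proportions of No/Yes in the testing data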
Here, a wrapper is written so that 6 different decision trees can be built conveniently.
tree.builder = function(in.data, fp, fn, purity){
  tree = rpart(Churn ~ ., # including all features
               data = in.data,
               na.action = na.rpart, # records with a missing outcome are deleted;
                                     # records with missing predictors are kept
               method = "class", # classification tree
               model = FALSE,
               x = FALSE,
               y = TRUE,
               parms = list( # loss matrix: penalize false positives or false negatives more heavily
                 loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
                 split = purity), # "gini" or "information"
               ## rpart algorithm options
               control = rpart.control(
                 minsplit = 10, # minimum number of observations required before a split
                 minbucket = 10, # minimum number of observations in any terminal node
                                 # (default = minsplit/3)
                 cp = 0.01, # complexity parameter for the stopping rule (e.g. 0.02 -> smaller tree)
                 xval = 10 # number of cross-validation folds
               )
  )
  tree # return the fitted tree
}
Using the function `tree.builder(in.data, fp, fn, purity)`, six different decision tree models were defined:
Model 1: `gini.tree.1.1` is based on the Gini index without penalizing false positives or false negatives.
Model 2: `info.tree.1.1` is based on entropy without penalizing false positives or false negatives.
Model 3: `gini.tree.1.10` is based on the Gini index, with the cost of a false negative set to 10 times that of a false positive.
Model 4: `info.tree.1.10` is based on entropy, with the cost of a false negative set to 10 times that of a false positive.
Model 5: `gini.tree.10.1` is based on the Gini index, with the cost of a false positive set to 10 times that of a false negative.
Model 6: `info.tree.10.1` is based on entropy, with the cost of a false positive set to 10 times that of a false negative.
The tree diagrams of two of these models are given below. The first four trees are built explicitly; the two fp = 10 trees are constructed inside the ROC function in the next subsection (see the note after the code).
## Call the tree model wrapper.
gini.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(in.data = train, fp = 1, fn = 1, purity = "information")
gini.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(in.data = train, fp = 1, fn = 10, purity = "information")
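## Note: the remaining two trees (with fp = 10) are not stored as objects here;
## they are built internally by the ROC function in the next subsection. If
## needed, they could be created the same way, e.g.
## gini.tree.10.1 = tree.builder(in.data = train, fp = 10, fn = 1, purity = "gini")
## info.tree.10.1 = tree.builder(in.data = train, fp = 10, fn = 1, purity = "information")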
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
rpart.plot(info.tree.1.10, main = "Tree with entropy: penalization")
Penalized decision tree models using Gini index (left) and entropy (right).
ROC analysis was then used to select the best among all the models. The function `SenSpe(in.data, fp, fn, purity)` is defined and used to build the 6 different trees and plot their corresponding ROC curves so that the global performance of these tree algorithms can be seen and compared. The function has four arguments: the input data, the false positive cost, the false negative cost, and the purity measure.
# function returning a sensitivity/specificity matrix and the AUC
# (uses roc() and auc() from the pROC package)
SenSpe = function(in.data, fp, fn, purity){
cutoff = seq(0,1, length = 20) # 20 cut-offs including 0 and 1.
model = tree.builder(in.data, fp, fn, purity)
## decision tree returns both "success" and "failure" probabilities.
## only "success" probability is needed to define sensitivity and specificity
pred = predict(model, newdata = in.data, type = "prob") # two-column matrix.
senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
for (i in 1:length(cutoff)){
# Yes and No are values of the response variable in this data set
# The following line uses only the probability of Yes: pred[, "Yes"]
pred.out = ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")
TP = sum(pred.out =="Yes" & in.data$Churn == "Yes")
TN = sum(pred.out =="No" & in.data$Churn == "No")
FP = sum(pred.out =="Yes" & in.data$Churn == "No")
FN = sum(pred.out =="No" & in.data$Churn == "Yes")
senspe.mtx[1,i] = TP/(TP + FN)
senspe.mtx[2,i] = TN/(TN + FP)
}
prediction = pred[,"Yes"]
category = in.data$Churn == "Yes"
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)
##
list(senspe.mtx= senspe.mtx, AUC = round(AUC,3))
}
giniROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(in.data = train, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(in.data = train, fp=1, fn=10, purity="information")
giniROC101 = SenSpe(in.data = train, fp=10, fn=1, purity="gini")
infoROC101 = SenSpe(in.data = train, fp=10, fn=1, purity="information")
The ROC curves represent the various decision trees and their corresponding AUC values. The model gini.1.10 has the largest AUC of 0.864 and its curve extends farthest toward the upper left corner. Therefore, it is considered the best decision tree among the six.
par(pty="s") # set up square plot through graphic parameter
plot(1-giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1),
xlab="1 - specificity: FPR", ylab="Sensitivity: TPR", col = "blue", lwd = 2,
main="ROC Curves of Decision Trees", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)
lines(1-infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,], col = "firebrick2", lwd = 2, lty=2)
lines(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], col = "olivedrab", lwd = 2)
lines(1-infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,], col = "skyblue", lwd = 2)
lines(1-giniROC101$senspe.mtx[2,], giniROC101$senspe.mtx[1,], col = "red", lwd = 2, lty = 4)
lines(1-infoROC101$senspe.mtx[2,], infoROC101$senspe.mtx[1,], col = "sienna3", lwd = 2)
legend("bottomright", c(paste("gini.1.1, AUC =", giniROC11$AUC),
paste("info.1.1, AUC =",infoROC11$AUC),
paste("gini.1.10, AUC =",giniROC110$AUC),
paste("info.1.10, AUC =",infoROC110$AUC),
paste("gini.10.1, AUC =",giniROC101$AUC),
paste("info.10.1, AUC =",infoROC101$AUC)),
col=c("blue","firebrick2","olivedrab","skyblue","red","sienna3"),
lty=c(1,2,1,1,4,1), lwd=rep(2,6), cex = 0.5, bty = "n") # line types match the curves drawn above
Comparison of ROC curves
The optimal cut-off determination through cross-validation was based on the training data set. The function `Optm.cutoff(in.data, fp, fn, purity)` was first created and then used to calculate the optimal cut-off for the penalized decision trees shown above.
Optm.cutoff = function(in.data, fp, fn, purity){
n0 = dim(in.data)[1]/5 # fold size for 5-fold cross-validation
cutoff = seq(0,1, length = 20) # candidate cut-off probabilities
## accuracy for each candidate cut-off
accuracy.mtx = matrix(0, ncol=20, nrow=5) # 5 folds by 20 candidate cut-offs
##
for (k in 1:5){
valid.id = ((k-1)*n0 + 1):(k*n0)
valid.dat = in.data[valid.id,]
train.dat = in.data[-valid.id,]
## tree model fitted on the remaining four folds
tree.model = tree.builder(train.dat, fp, fn, purity)
## prediction
pred = predict(tree.model, newdata = valid.dat, type = "prob")[,2]
## for-loop
for (i in 1:20){
## predicted probabilities
pc.1 = ifelse(pred > cutoff[i], "Yes", "No")
## accuracy
a1 = mean(pc.1 == valid.dat$Churn)
accuracy.mtx[k,i] = a1
}
}
avg.acc = apply(accuracy.mtx, 2, mean)
## plots
n = length(avg.acc)
idx = which(avg.acc == max(avg.acc))
tick.label = as.character(round(cutoff,2))
##
plot(1:n, avg.acc, xlab="cut-off score", ylab="average accuracy",
ylim=c(min(avg.acc), 1),
axes = FALSE,
main=paste("5-fold CV optimal cut-off \n ",purity,"(fp, fn) = (", fp, ",", fn,")" , collapse = ""),
cex.main = 0.9,
col.main = "navy")
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
points(idx, avg.acc[idx], pch=19, col = "red")
segments(idx , min(avg.acc), idx , avg.acc[idx ], col = "red")
text(idx, avg.acc[idx]+0.03, as.character(round(avg.acc[idx],4)), col = "red", cex = 0.8)
}
par(mfrow=c(1,2))
Optm.cutoff(in.data = train, fp=1, fn=10, purity="gini")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="information")
Plot of optimal cut-off determination
In the above figure, there are multiple optimal cut-offs for each plot. Therefore, the final cut-off for the best model is taken as the average of these tied cut-offs. The average cut-off for the best model, gini.1.10, is 0.4475.
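A small sketch of this averaging step (the accuracies and cut-offs below are hypothetical values, since Optm.cutoff above only plots the accuracy curve rather than returning the tied maxima):
## toy illustration of averaging tied optimal cut-offs (hypothetical values)
avg.acc <- c(0.79, 0.81, 0.81, 0.80) # hypothetical CV accuracies
cand.cut <- c(0.37, 0.42, 0.47, 0.53) # matching candidate cut-offs
mean(cand.cut[avg.acc == max(avg.acc)]) # average of the tied maxima: 0.445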
At the end of the decision tree modelling, the best decision tree was identified with the optimal cut-off score. The following is the diagram of the best decision tree and the cut off score.
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
Optm.cutoff(in.data = train, fp=1, fn=10, purity="gini")
Best decision tree model (left) and optimal cut-off (right).
The optimal cut-off was 0.4475 and the training accuracy was 0.8071.
best.tree = tree.builder(in.data = train, fp = 1, fn = 10, purity = "gini")
t_pred = predict(best.tree, test, type = "class") # class predictions on the testing data
confMat <- table(test$Churn, t_pred) # confusion matrix: actual vs predicted
accuracy <- sum(diag(confMat))/sum(confMat) # proportion of correct predictions
kable(accuracy, caption="Test Accuracy of Best Decision Tree", col.names="Test Accuracy")
| Test Accuracy |
|---|
| 0.7633333 |
The test accuracy of the best decision tree was 0.7633.
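For completeness, the confusion matrix computed above could be displayed in the same way as for the neural network:
kable(confMat, caption="Confusion Matrix of the Best Decision Tree")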
At the end of this analysis, the following models were built to analyse customer churn: Logistic Regression, Neural Network, and Decision Tree. These three models were compared using their ROC curves, AUC values, and training and test accuracies.
Below is the plot of the ROC curves for these 3 models and the respective AUC values.
## read in the same dataset that will contain standardized variables for the
## neural network model
gat <- read.csv("https://github.com/chinwex/datasets/raw/main/churn_model.csv")[,-1]
gat$Term <- (gat$Term-min(gat$Term))/(max(gat$Term)-min(gat$Term))
gat$Monthly_Charges <- (gat$Monthly_Charges-min(gat$Monthly_Charges))/(max(gat$Monthly_Charges)-min(gat$Monthly_Charges))
## convert categorical variables to factor
gat$cluster <- as.factor(gat$cluster)
gat$Internet_service <- as.factor(gat$Internet_service)
gat$Agreement_period <- as.factor(gat$Agreement_period)
gat$Churn <- as.factor(gat$Churn)
gat$tech_support <- as.factor(gat$tech_support)
gatmtx = model.matrix(~ ., data = gat)
colnames(gatmtx)
colnames(gatmtx)[3] <- "DSL"
colnames(gatmtx)[4] <- "fiber"
colnames(gatmtx)[5] <- "nointernet"
colnames(gatmtx)[6] <- "oneyear"
colnames(gatmtx)[7] <- "twoyears"
colnames(gatmtx)[8] <- "monthlycharges"
colnames(gatmtx)[9] <- "Churn"
colnames(gatmtx)[10] <- "techsupport"
colnames(gatmtx)[11] <- "cluster2"
colnames(gatmtx)[12] <- "cluster3"
colnames(gatmtx)[13] <- "cluster4"
colnames(gatmtx)[14] <- "cluster5"
colnames(gatmtx)
columnNames <- colnames(gatmtx)
columnList <- paste(columnNames[-c(1,9)], collapse = "+")
columnList <- paste(c(columnNames[9],"~",columnList), collapse="")
modelFormula <- formula(columnList)
modelFormula
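## Given the renamed columns (assuming column 2 of the model matrix keeps the
## name "Term"), the printed formula should read:
## Churn ~ Term + DSL + fiber + nointernet + oneyear + twoyears + monthlycharges
##   + techsupport + cluster2 + cluster3 + cluster4 + cluster5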
n = dim(gatmtx)[1]
set.seed(200)
trainID = sample(1:n, round(n*0.7), replace = FALSE)
traindat = gatmtx[trainID,]
testdat = gatmtx[-trainID,]
NetworkModel = neuralnet(modelFormula,
data = traindat, # training dataset
hidden = 1, # single hidden layer with one neuron
rep = 1, # number of replicates in training the NN
threshold = 0.01, # threshold for the partial derivatives, used as the stopping criterion
learningrate = 0.1, # user-selected learning rate
algorithm = "rprop+" # resilient backpropagation with weight backtracking
)
kable(NetworkModel$result.matrix)
## ModelA
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges, # ModelA
family = binomial(link = logit),
data = train) # training data set
## gives the data set containing the predicted probabilities of the test model on
## the training dataset
pred.prob.test = predict.glm(test.model, train, type = "response")
### components for defining various measures for modelA
cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:20){
train$train.churn = as.numeric(pred.prob.test > cut.off.seq[i])
### components for defining various measures for modelA
p0.a0 = sum(train$train.churn ==0 & train$Churn=="No")
p0.a1 = sum(train$train.churn ==0 & train$Churn =="Yes")
p1.a0 = sum(train$train.churn ==1 & train$Churn=="No")
p1.a1 = sum(train$train.churn ==1 & train$Churn =="Yes")
###
sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}
one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)
## Neural Network Model
nn.results = predict(NetworkModel, traindat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
a = sum(traindat[,"Churn"] == 1 & (nn.results > cut0[i]))
d = sum(traindat[,"Churn"] == 0 & (nn.results < cut0[i]))
b = sum(traindat[,"Churn"] == 0 & (nn.results > cut0[i]))
c = sum(traindat[,"Churn"] == 1 & (nn.results < cut0[i]))
sen = a/(a + c)
spe = d/(b + d)
SenSpe[,i] = c(sen, spe)
}
## Decision Tree
train <- train[,-8] # drop the train.churn column added in the LR loop above
cutoff = seq(0,1, length = 20)
pred = predict(best.tree, train, type = "prob") # two-column matrix.
senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
for (i in 1:length(cutoff)){
# Yes and No are values of the response variable in this data set
# The following line uses only the probability of Yes: pred[, "Yes"]
pred.out = ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")
TP = sum(pred.out =="Yes" & train$Churn == "Yes")
TN = sum(pred.out =="No" & train$Churn == "No")
FP = sum(pred.out =="Yes" & train$Churn == "No")
FN = sum(pred.out =="No" & train$Churn == "Yes")
senspe.mtx[1,i] = TP/(TP + FN)
senspe.mtx[2,i] = TN/(TN + FP)
}
prediction = pred[,"Yes"]
category = train$Churn == "Yes"
ROCobj <- roc(category, prediction)
## Visualization
par(pty = "s") # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1),ylim = c(0,1),
xlab ="1 - specificity: FPR",
ylab = "sensitivity: TPR",
main = "ROC Curves of the Three Models",
lwd = 2,
col = "purple", )
abline(0,1, lty = 2, col = "orange")
AUC = round(sum(sens.vec*(one.minus.spec[-21]-one.minus.spec[-1])),4)
lines(1-SenSpe[2,], SenSpe[1,], col = "olivedrab", lwd = 2)
lines(1-senspe.mtx[2,], senspe.mtx[1,], col = "red", lwd = 2)
AUC2 = round(sum(SenSpe[1,]*((1-SenSpe[2,])[-20]-((1-SenSpe[2,][-1])))),4)
AUC3= round(sum(senspe.mtx[1,]*((1-senspe.mtx[2,])[-20]-((1-senspe.mtx[2,][-1])))),4)
legend("bottomright", c(paste("LR: AUC =", AUC),
paste("NN: AUC =",AUC2),
paste("DT: AUC =",AUC3)),
col=c("purple","olivedrab","red"),
lty=rep(1,3), cex = 0.5, bty = "n")
ROC curves of all three models
From the above plot, the Decision Tree appears to be the best of the three models based on the area under the curve, which was 0.9153.
Below is a table showing the type of model, training accuracy, test accuracy and AUC value.
Model <- c("Logistic Regression", "Neural Network", "Decision Tree")
Trainingaccuracy <- c(0.7914, 0.8214, 0.8071)
Testaccuracy <- c(0.7933, 0.80, 0.7633)
AUCscore <- c(0.8639, 0.9039,0.9153)
perform <- data.frame(Model, Trainingaccuracy, Testaccuracy, AUCscore)
kable(perform, col.names=c("Model", "Training Accuracy", "Test Accuracy", "AUC"), caption="Comparison Between All The Models")
| Model | Training Accuracy | Test Accuracy | AUC |
|---|---|---|---|
| Logistic Regression | 0.7914 | 0.7933 | 0.8639 |
| Neural Network | 0.8214 | 0.8000 | 0.9039 |
| Decision Tree | 0.8071 | 0.7633 | 0.9153 |
The following performance metrics were computed for all three models (Logistic Regression, Neural Network, and Decision Tree): sensitivity, specificity, precision, recall, and F1 score. The results are as follows:
test.model = glm(Churn ~ Term + tech_support + Internet_service + Agreement_period + cluster + Monthly_Charges,
family = binomial(link = logit),
data = train) # training data set
## gives the data set containing the predicted probabilities of the test.model
pred.prob.test = predict.glm(test.model, test, type = "response")
test$test.churn = as.numeric(pred.prob.test > 0.38) # 0.38 = optimal cut-off for the LR model
### components for defining various measures
p0.a0 = sum(test$test.churn ==0 & test$Churn=="No")
p0.a1 = sum(test$test.churn ==0 & test$Churn =="Yes")
p1.a0 = sum(test$test.churn ==1 & test$Churn=="No")
p1.a1 = sum(test$test.churn ==1 & test$Churn =="Yes")
sensitivity <- p1.a1 / (p1.a1 + p0.a1)
specificity <- p0.a0 / (p0.a0 + p1.a0)
precision <- p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = rbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
## Neural Network Model
nn.results = predict(NetworkModel, testdat)
test.Churn = as.numeric(nn.results > 0.48) # 0.48 = optimal cut-off for the NN model
testdat <- cbind(testdat, test.Churn)
a = sum(testdat[,"test.Churn"] == 1 & testdat[,"Churn"]==0) # false positives
d = sum(testdat[,"test.Churn"] == 0 & testdat[,"Churn"]==1) # false negatives
b = sum(testdat[,"test.Churn"] == 0 & testdat[,"Churn"]==0) # true negatives
c = sum(testdat[,"test.Churn"] == 1 & testdat[,"Churn"]==1) # true positives
sen = c/(c + d) # tp/(tp + fn)
spe = b/(b + a) # tn/(tn + fp)
precision_nn <- c/(c + a) #tp/(tp + fp)
recall_nn <- c/(c + d) # tp/(tp + fn)
F1_nn = 2*precision_nn*recall_nn/(precision_nn + recall_nn)
metric.list.nn = rbind(sensitivity = sen,
specificity = spe,
precision = precision_nn,
recall = recall_nn,
F1 = F1_nn)
## The Best Decision tree
pred = predict(best.tree, newdata=test, type = "prob") # two-column matrix.
pred.out = ifelse(pred[,"Yes"] >= 0.4475, "Yes", "No")
TP = sum(pred.out =="Yes" & test$Churn == "Yes")
TN = sum(pred.out =="No" & test$Churn == "No")
FP = sum(pred.out =="Yes" & test$Churn == "No")
FN = sum(pred.out =="No" & test$Churn == "Yes")
sensi = TP/(TP + FN)
speci = TN/(TN + FP)
precision_dt <- TP/(TP + FP) #tp/(tp + fp)
recall_dt <- TP/(TP + FN) # tp/(tp + fn)
F1_dt = 2*precision_dt*recall_dt/(precision_dt + recall_dt)
metric.list.dt = rbind(sensitivity = sensi,
specificity = speci,
precision = precision_dt,
recall = recall_dt,
F1 = F1_dt)
Total_metric <- data.frame(metric.list, metric.list.nn, metric.list.dt)
kable(Total_metric, col.names=c("Logistic Regression", "Neural Network", "Decision Tree"), caption="Local Performance Metrics for All Models")
| | Logistic Regression | Neural Network | Decision Tree |
|---|---|---|---|
| sensitivity | 0.6931818 | 0.6363636 | 0.7500000 |
| specificity | 0.8349057 | 0.8679245 | 0.8254717 |
| precision | 0.6354167 | 0.6666667 | 0.6407767 |
| recall | 0.6931818 | 0.6363636 | 0.7500000 |
| F1 | 0.6630435 | 0.6511628 | 0.6910995 |
In this dataset, identifying the customers who are leaving the company is the ultimate goal; for this data, the positive class consists of customers who are likely to churn. Precision measures the proportion of predicted churners who actually churn, while recall (sensitivity) measures the proportion of actual churners that the model correctly detects. When the positive class is the smaller one and correctly detecting positive samples is the main focus, as in this study, precision and recall, together with their harmonic mean the F1 score, are the most informative performance metrics. The decision tree achieved the highest recall and the highest F1 score, with precision close to the best, and is therefore the preferred model for this goal.
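As a quick arithmetic check of the table, the decision tree's F1 score follows directly from its precision and recall:
## spot-check: F1 of the decision tree from the table above
precision_dt <- 0.6407767; recall_dt <- 0.7500000
2 * precision_dt * recall_dt / (precision_dt + recall_dt) # about 0.6911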
Churn prediction identifies customers that are likely to leave or remain with a company. The response variable, churn, is a binary variable with values of yes and no, and it indicates whether a customer will churn.
After a model is built, the next step is to use it to achieve the expected goal. Using the diagram of the best decision tree above, it is easy to show how this model can be used to predict which customers are likely to churn. A decision tree is a type of flow chart that simplifies the decision-making process by breaking down the different available paths of action and their potential outcomes.
Key to the coded values:
Agreement_period: 0 = Monthly, 1 = One Year, 2 = Two Years.
Internet_service: 0 = Cable, 1 = DSL, 2 = Fiber optic, 3 = No Internet.
Tech_support: 0 = No, 1 = Yes.
For example, if a customer’s agreement period is neither one year nor two years, the tree follows the right branch and arrives at internet service. If the customer has no internet, the tree leads to a node that splits on Term. If the customer’s term is greater than or equal to 14 months, the tree follows the right branch, which leads to a leaf node predicting that the customer is likely to churn. On the other hand, if the customer’s internet service is cable, DSL, or fiber optic, the tree follows the left branch to another split on Term. If the customer’s term is less than 11 months, the tree follows the left branch to a leaf node predicting that the customer is not likely to churn.
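To make this concrete, a hypothetical new customer could be scored with the fitted tree. The values and factor codes below are illustrative assumptions and must match the levels actually present in rat:
## hypothetical new customer (illustrative values only)
new.cust <- data.frame(Term = 16, # months with the company
Internet_service = factor("3", levels = levels(rat$Internet_service)), # no internet
Agreement_period = factor("0", levels = levels(rat$Agreement_period)), # monthly
Monthly_Charges = 60,
tech_support = factor("0", levels = levels(rat$tech_support)), # no tech support
cluster = factor("1", levels = levels(rat$cluster)))
predict(best.tree, newdata = new.cust, type = "prob") # estimated churn probability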
Using a decision tree to predict the likelihood of a customer leaving the company has some limitations. One is that a decision tree provides limited information about the relationship between the predictors and the response, so it is difficult for this model to quantify the relationships between Churn and individual predictors such as term, monthly charges, or agreement period. Another limitation is that the probabilities produced by the model are only estimates and are prone to error.
From the model, it is clear that customers with no internet service, a term greater than 14 months, and lower monthly charges (about 80 or lower) are more likely to churn. Therefore, it is recommended that both new and existing customers be encouraged to use internet services. Additionally, in order to retain customers who have been with the company for 12 months or longer, the business needs to pay closer attention to them and promptly address any issues they may have.
Customers who pay smaller monthly fees should receive the same treatment from the business as those paying higher amounts. It is likely that high-paying customers tend to stick around because they get better service, are treated with more respect, and earn bonuses for their subscriptions, while customers who pay less each month do not receive the same level of treatment.
It is crucial for telecommunication businesses to employ advanced analytics to predict whether or not customers will leave the company. Having done so, they will be better able to recognize the different customer groups that exist and to know how to communicate with each group.
At the end of this study, various factors have been analysed and the best model put forward. The proper way to use this model has been discussed, as well as its limitations. From this model, the key factors that lead to customer churn have been identified, and various recommendations have been suggested to help prevent customer churn.