LinkedIn profile: https://www.linkedin.com/in/gaetano-caira-4216ba93

1. CONTEXT

This project work aims to demonstrate the skills of the author, Gaetano Caira, on the following topics:

The dataset chosen to demonstrate these skills is shared on Kaggle (link: https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download) and is used to analyze and forecast the churn phenomenon in the Telco sector.

The data set includes information about:

In detail, below you can find the description of the individual features:

Each row represents a customer, so the ML problem is characterized as a classification problem within the family of supervised models.
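As a minimal, hedged sketch of how the dataset can be loaded (the file name below is the one distributed on Kaggle; the path and the variable name, chosen to match the chunks that follow, are assumptions):

# sketch: load the Kaggle CSV (file location is an assumption)
mix_dataset_tra <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv",
                            stringsAsFactors = FALSE)
dim(mix_dataset_tra) # expected: 7043 customers x 21 columns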

A guide for the reader is shown below: it summarizes the increase in the sensitivity (vertical axis) of the predictive model as the chapters progress (horizontal axis):

2. DATA CLEANING & MINING

Below are all the actions (in sequence) aimed at:

  1. perform one-hot encoding for categorical features

  2. remove the missing values

  3. remove features with low or no variability (non-informative)

  4. correct the data type formats

  5. some feature engineering: new interaction features between pairs of features, a non-linear term (only for “tenure”), and a ratio between features (only for “% Charges”)

knitr::opts_chunk$set(error = TRUE)

mix_dataset_tra<-
  as.data.frame(mix_dataset_tra)

mix_dataset_tra$customerID<-
  as.character(mix_dataset_tra$customerID)

TargetName<-("Churn")

Binarize_Features <- 
  function(data_set, features_to_ignore=c(), leave_out_one_level=TRUE, max_level_count=10) {
  require(dplyr)
  text_features <- c(names(data_set[sapply(data_set, is.character)]), names(data_set[sapply(data_set, is.factor)]))
  for (feature_name in setdiff(text_features, features_to_ignore)) {
    feature_vector <- as.character(data_set[,feature_name])
    # check that data has more than one level
    if (length(unique(feature_vector)) == 1)
      next
    # We set any non-data to text
    feature_vector[is.na(feature_vector)] <- 'NA'
    feature_vector[is.infinite(feature_vector)] <- 'INF'
    feature_vector[is.nan(feature_vector)] <- 'NAN'
    # only give us the top x most popular categories
    temp_vect <- data.frame(table(feature_vector)) %>% arrange(desc(Freq)) %>% head(max_level_count)
    feature_vector <- ifelse(feature_vector %in% temp_vect$feature_vector, feature_vector, 'Other')
    # loop through each level of a feature and create a new column
    first_level=TRUE
    for (newcol in unique(feature_vector)) {
      if (leave_out_one_level & first_level) {
        # avoid dummy trap and skip first level
        first_level=FALSE
        next
      }
      data_set[,paste0(feature_name,"_",newcol)] <- ifelse(feature_vector==newcol,1,0)
    }
    # remove original feature
    data_set <- data_set[,setdiff(names(data_set),feature_name)]
  }
  return (data_set)
}
#str(Binarize_Features(data_set = mix_dataset_tra))
mix_dataset_tra<-
  Binarize_Features(data_set = mix_dataset_tra)
for(i in 5:41){
  mix_dataset_tra[,i]<- as.factor(mix_dataset_tra[,i])
}

#set missing entries (empty strings or "?") to real NA so na.omit() can remove them
mix_dataset_tra[mix_dataset_tra==""]<-NA
mix_dataset_tra[mix_dataset_tra=="?"]<-NA
#remove
mix_dataset_tra<-na.omit(mix_dataset_tra)
library(caret)

#delete features with low variability
nzv_metrics<- caret::nearZeroVar(mix_dataset_tra, saveMetrics=TRUE)
VarDaEl<- rownames(nzv_metrics)[nzv_metrics$nzv]
mix_dataset_tra<- mix_dataset_tra[,-which(colnames(mix_dataset_tra) %in% VarDaEl)]

#correct datatype
mix_dataset_tra$SeniorCitizen<- as.factor(mix_dataset_tra$SeniorCitizen)
mix_dataset_tra$PercCharges<- mix_dataset_tra$MonthlyCharges/mix_dataset_tra$TotalCharges

TargetName<-"Churn_Yes"

Fr2<-c("PercCharges","InternetService_Fiber optic","InternetService_Yes", "Contract_Two year")

#One-hot encoding reversed (more understandable)
mix_dataset_tra$OnlineBackup_Yes<-ifelse(mix_dataset_tra$OnlineBackup_No==1,0,1)
mix_dataset_tra$InternetService_Yes<-ifelse(mix_dataset_tra$InternetService_No==1,0,1) 

mix_dataset_tra$OnlineBackup_Yes<-as.factor(mix_dataset_tra$OnlineBackup_Yes)
mix_dataset_tra$InternetService_Yes<-as.factor(mix_dataset_tra$InternetService_Yes)
#remove old feature
mix_dataset_tra$OnlineBackup_No <- NULL
mix_dataset_tra$InternetService_No <- NULL

#correlation feature
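# note: as.integer() on a factor returns its level codes (1 and 2), not the
# original 0/1 values, so these products are shifted but order-preserving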
mix_dataset_tra$InternetService_Fiber_CON_OnlineBackup<-
  as.integer(mix_dataset_tra$`InternetService_Fiber optic`)*
  as.integer(mix_dataset_tra$OnlineBackup_Yes)

mix_dataset_tra$SeniorCitizen_CON_tenure<- 
  as.integer(mix_dataset_tra$SeniorCitizen)*
  as.integer(mix_dataset_tra$tenure)

mix_dataset_tra$tenure_CON_PaymentMethod_Mailed_check<- 
  as.integer(mix_dataset_tra$tenure)*
  as.integer(mix_dataset_tra$`PaymentMethod_Mailed check`)

mix_dataset_tra$tenure_CON_OnlineBackup<- 
  as.integer(mix_dataset_tra$tenure)*as.integer(mix_dataset_tra$OnlineBackup_Yes)

#non linearity
mix_dataset_tra$tenure_QUADRATO<- 
  as.integer(mix_dataset_tra$tenure)* 
  as.integer(mix_dataset_tra$tenure)

3. EDA

A mixed (univariate + multivariate) exploratory data analysis (EDA) is made below, focused only on the most informative graphs.

3.1 Churn

About 3/4 of the dataset has no Churn and 1/4 has Churn: the classification problem is sufficiently balanced.

3.2 Contract 2 years

Let’s now discover the relationship between the Churns and the presence of 2-year contracts.


Contract_Two year  Churn  Freq%
0                  0      66
1                  0      97
0                  1      34
1                  1      3

There are very few churns (3%, well below the 25% overall rate) in the case of a 2-year contract.

Therefore it is very unlikely that the customer will abandon us if they have a 2-year contract in progress.
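A hedged sketch of how this row-percentage table can be reproduced (column names as created by the one-hot encoding in chapter 2):

# row percentages of Churn by presence of a 2-year contract
round(100 * prop.table(table(mix_dataset_tra$`Contract_Two year`,
                             mix_dataset_tra$Churn_Yes), margin = 1))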

3.3 Fiber Optic Internet Service

What is the relationship between the Churns and the presence of the fiber optic service?


Fiber_optic  Churn_Yes  Freq%
0            0          85
1            0          58
0            1          15
1            1          42

Customers with an active fiber service have a higher churn rate in percentage terms (from 15% to 42%). In other words, an active fiber optic service at the customer is correlated with churn.

3.4 Internet Service

If there is a positive correlation between Churn and fiber optic internet service…how will it relate to internet service in general?


The relation is similar to that with the fiber optic service: almost all customers without internet service do not generate churn. These last 4 graphs must be read together: obviously, if a churning customer almost always has the fiber optic service, they also have the internet service.

3.5 % Charges

The % Charges is the ratio between the last month’s charge and the total charged up to that moment. This value is higher:

  • the more recent the customer is

  • the more the customer spends in the last month compared to the past
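A toy illustration with hypothetical numbers of how the two cases drive the ratio up:

# toy data (hypothetical): PercCharges = MonthlyCharges / TotalCharges
toy <- data.frame(MonthlyCharges = c(80, 80, 120),
                  TotalCharges   = c(80, 4000, 3000))
toy$PercCharges <- toy$MonthlyCharges / toy$TotalCharges
toy$PercCharges # 1.00 (brand-new), 0.02 (old and stable), 0.04 (old, spending more)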

Let’s see how this % is related to the churn phenomenon:

Customers who have a higher % Charges tend to be the ones who leave.

Therefore, it is the newest customers and/or those who spend the most in the last month who leave the most; this can be seen in:

  • the last graph, where the average is higher in the case of Churn than in the case of no Churn

  • the penultimate graph, showing the proportion of Churn on the total when the % Charges is equal to 1

For both classes, the distribution is strongly skewed to the right.

3.6 Payment method mailed check

Let's see how Payment Method Mailed Check, tenure and Churn are related to each other.

Customers with the mailed-check payment method have on average a lower tenure. In both cases (with or without the mailed-check payment method), those who generate churn have a shorter tenure. The two phenomena are additive.

In other words, the most recent customers (lower tenure) are those who pay by mailed check. Among them, those who abandon are the most recent (in line with the analysis made for the % Charges).

3.7 Online backup

Let's see how Online Backup, tenure and Churn are related to each other.

The mechanism is similar to the previous one, simply swapping Payment Method Mailed Check for Online Backup. Customers with Online Backup have on average a higher tenure (they are older customers). In both cases (with or without Online Backup), those who generate churn have a shorter tenure. The two phenomena are additive.

Let’s now investigate the two relations indirectly linked to Churn, between the internet service and:

  1. the internet service with optical fiber

  2. the online backup

From the last two graphs there is a correlation:

  • positive between the internet service and the fiber optic internet service; of all those with an active internet service, more than half have the fiber optic service

  • negative between the internet service and the online backup; of all those with an active internet service, less than half are backed up online

At last, let’s try to understand how all 4 variables are related:

  • Churn

  • Online Back up

  • Tenure

  • Payment Method Mailed Check

Starting from the basic quadrant at the top-left, if I move:

  • on the top-right, with the presence of the mailed-check payment method, the tenure is reduced for both groups (both Churn and no Churn). The same happens if I move to the bottom-right quadrant. In other words, when the mailed-check payment method is present, the effect of the online backup on tenure is null.

  • at the bottom-left, with the presence of the backup, the tenure increases for both groups (both Churn and no Churn)

    In other words, the effect of the mailed-check payment method (in reducing tenure) prevails over that of the online backup (in increasing tenure)

Synthesizing:

  1. with the same Online Backup and Payment Method Mailed Check values, customers with a lower tenure are those who abandon the most.

  2. if there is the mailed-check payment method, then the average tenure decreases

  3. if there is no mailed-check payment method, then the average tenure increases if there is Online Backup and decreases if there is not.

Now let’s try to understand how the following variables are related:

  • Churn

  • Online Back up

  • Internet Service Fiber Optic

In descending order, the cases of churn (in absolute and percentage terms) are:

  1. without online backup and with fiber internet - top-right quadrant

  2. with online backup and with fiber internet - bottom-right quadrant

  3. without fiber internet and without online backup - bottom-left quadrant

3.8 Tenure

Although already mentioned above, let’s look at the relationship between tenure and Churn (as well as the square of the tenure).

As seen previously, churning customers have a lower tenure.

4. FEATURE SELECTION AND SYNTHESIS

At this point it is possible to reduce the columns/features of the dataset to those that express a significantly important relationship with the target variable (Churn_Yes), as analyzed in the previous chapter.

Fr<-
  union(Fr2,colnames(mix_dataset_tra[,29:37]))
rm(Fr2)
Fr<-union(Fr,TargetName)
Fr3<-c("tenure","PaymentMethod_Mailed check") #avoid F, which masks FALSE
Fr<-union(Fr,Fr3)
rm(Fr3)
mix_dataset_tra<- mix_dataset_tra[,which(colnames(mix_dataset_tra) %in% Fr)]
rm(Fr)

Now that the number of features has been reduced, it is possible to analyze the mutual correlations more broadly, with a matrix that plots the correlations for pairs of features and for each feature individually.

library(GGally)
mix_dataset_tra %>% GGally::ggpairs(mapping = aes(color = Churn_Yes, alpha = 0.5))

As seen before, those who generate Churn tend to be:

Below is an overview of the most significant features correlated with Churn, perhaps a little risky as it should first be validated by an expert in the Telco sector (every Data Scientist, like the author, knows the difference between correlation and causality):

5. A RULE-OF-THUMB MODEL - CLASSIFICATION TREE

library(rpart)
library(partykit)

dt<-rpart(Churn_Yes~., data=mix_dataset_tra, control=rpart.control(minsplit=60,minbucket=30,maxdepth = 4))

plot(as.party(dt))

In this simple decision tree (classification tree) the prediction rule is:

  1. double the value of the tenure if there is online backup; if this value is at least equal to 17.5, predict no Churn (wrong 16% of the time (*))
  2. for the cases that do not fall within the previous rule AND that do not have the fiber optic internet service, also predict no Churn (wrong 34% of the time (*))

(*) on this dataset

Why is there this rule? All the high values of tenure, and the low ones with online backup (chapter 3.7), are merged as NOT churn. Those with low tenure and without online backup remain to be investigated further.

Let’s calculate the mean:

library(dplyr)
library(gt)

tb<-as.data.frame(mix_dataset_tra%>%dplyr::group_by(OnlineBackup_Yes,Churn_Yes)%>%summarize( mean_tenure= round(mean(tenure)))) 

gt(tb)
OnlineBackup_Yes  Churn_Yes  mean_tenure
0                 0          30
0                 1          14
1                 0          42
1                 1          26

Assuming, on the basis of the preventive actions, a management cost of false positives (I predict Churn when there isn’t => useless retention actions) equal to that of false negatives (I predict no Churn when there is => I lose the customer without trying to keep them), we can consider the accuracy of the classification as a pre-condition of a good model. In other words, we would like the overall accuracy to be at least equal to the value obtained by the null model (as a constraint).

We also assume that the real objective function to be maximized, within the overall accuracy constraint, is the sensitivity, i.e. the ability to intercept the highest number of real churns. The % of predicted churns out of the total of those actually present is the Sensitivity, or TP / (TP + FN).

We will therefore carry forward the two indicators, accuracy and sensitivity, together in the choice and refinement of the classification models.
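As a minimal sketch (the helper below is hypothetical, not part of the project code), both indicators can be computed from 0/1 vectors of actual and predicted classes:

perf_indicators <- function(actual, predicted) {
  TP <- sum(actual == 1 & predicted == 1)
  TN <- sum(actual == 0 & predicted == 0)
  FN <- sum(actual == 1 & predicted == 0)
  c(accuracy    = (TP + TN) / length(actual),
    sensitivity = TP / (TP + FN))
}
perf_indicators(actual = c(0, 1, 1, 0, 1), predicted = c(0, 1, 0, 0, 1))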

Let’s assume now that the null (simplest) model is simply to classify all customers as no churn (since churn is the least represented class).

The null model therefore has an accuracy of 73% over the entire dataset but a sensitivity equal to 0%. Now we check whether the model with the lowest degree of complexity (the classification tree seen above) improves on this performance.
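The null-model accuracy can be verified directly as the share of the majority class (a sketch, assuming Churn_Yes is the target built in chapter 2):

max(prop.table(table(mix_dataset_tra$Churn_Yes))) # ~0.73; sensitivity is 0 by construction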

accuracy_NullModel<-0.73
sensitivity_NullModel<-0

The remaining 40% of the initial dataset (appropriately stratified), on which the model is not trained, is used to calculate accuracy and sensitivity and is split into 2 datasets:

library(rpart)
library(caret)
library(partykit)
set.seed(10)
TrainIndex <- createDataPartition(mix_dataset_tra$Churn_Yes, p=0.60, list= FALSE)
train<-mix_dataset_tra[TrainIndex,]
other<-mix_dataset_tra[-TrainIndex,]
crossvalidation<-other[1:(0.5*NROW(other)),]
test<-other[(1+0.5*NROW(other)):(NROW(other)),]
rm(TrainIndex)

dt<-rpart(Churn_Yes~., data=train, control=rpart.control(minsplit=60,minbucket=30,maxdepth = 4))

plot(as.party(dt))

The classification tree built on 60% of the data, with the same specifications, produces the same rule as the previous one built with the entire dataset.

Let’s calculate its performance

Parameter    Null_Model  DecisionTree_train  DecisionTree_crossvalidation
accuracy     73          79                  80
sensitivity  0           38                  41

The accuracy increases from 73% with the null model to 80% with the decision tree (+7 percentage points).

accuracy_crossvalidation
## [1] 0.802276

The null model fails to predict any real churn (0% sensitivity), despite its good average accuracy. For sensitivity we pass from 0% with the null model to about 40% with the decision tree.

sensitivity_crossvalidation
## [1] 0.4055556

In other words, if the model is used to intercept a sufficient number of churns, then it performs particularly well compared to the null model (from 0% to 40%). If instead I look at the overall performance (accuracy), there is a slight improvement of +7 points.

In order to improve sensitivity with a new kind of model, we first have to understand whether this classification tree suffers more from bias or from variance.

##                     V1                 V2                           V3
## names       Null_Model DecisionTree_train DecisionTree_crossvalidation
## accuracy            73                 79                           80
## sensitivity          0                 38                           41

The comparison of the accuracies between the null model, the training set and the crossvalidation set shows that the tree model has a greater component of bias than of variance: the train/crossvalidation gap is small, while both are only modestly above the null model. This suggests looking for more complex models to increase performance.

## [1] 0.06075829
DeltaAccuracy_crossvalidation_train
## [1] 0.01151767
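A hedged reconstruction of how these two deltas were likely computed (the accuracy_train name is an assumption consistent with the printed values):

# large gap vs the null model (~0.061) -> bias dominates
DeltaAccuracy_train_NullModel <- accuracy_train - accuracy_NullModel
# small train/crossvalidation gap (~0.012) -> low variance
DeltaAccuracy_crossvalidation_train <- accuracy_crossvalidation - accuracy_train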

6. A MODEL WITH A BETTER PERFORMANCE: BIAS & VARIANCE

library(caretEnsemble)
library(lattice)

set.seed(10)
trc<- trainControl(method = "repeatedcv", number=3,repeats=4, classProbs=FALSE) 
AlgoritmListClass<-c("gbm","svmLinear","regLogistic","gaussprRadial","qda","lda","rf")

mixedModel <- suppressWarnings(caretEnsemble::caretList(Churn_Yes~.,data=train,methodList=AlgoritmListClass,trControl=trc, verbose = FALSE))

scales<-list(x=list(relation="free"),y=list(relation="free"))
bwplot(resamples(mixedModel) ,scales=scales)

Comparing multiple alternative models, even if they have similar performance, the best seems to be the Regularized Logistic Regression.

Parameter    Null_Model  DecisionTree_train  DecisionTree_crossvalidation  rlog_train  rlog_crossvalidation
accuracy     73          79                  80                            80          80
sensitivity  0           38                  41                            49          51

With the rlog model it is possible to appreciate:

  1. a 10-percentage-point improvement in sensitivity with the same overall accuracy compared to the tree model (lower Bias)

  2. practically equal performance between the training and crossvalidation datasets (lower Variance)

Below are the coefficients of the model:

library(dplyr)
library(gt)
options(scipen = 999)
names<- colnames(mixedModel$regLogistic$finalModel$W)
tb<-as.data.frame(mixedModel$regLogistic$finalModel$W)
tb[2,]<-names
tb<- t(tb)
colnames(tb)<- c("Value","Feature")

tb<-as.data.frame(tb)
tb$Value<- round(x=as.numeric(tb$Value), digits = 3)
tb<-as.data.frame(tb) %>% arrange(Value)

gt(tb)
Value   Feature
-1.358  PercCharges
-1.279  `InternetService_Fiber optic`1
-1.005  InternetService_Yes1
-0.010  SeniorCitizen_CON_tenure
-0.003  tenure_CON_OnlineBackup
0.000   InternetService_Fiber_CON_OnlineBackup
0.000   tenure_QUADRATO
0.008   tenure_CON_PaymentMethod_Mailed_check
0.033   tenure
0.270   OnlineBackup_Yes1
0.427   `PaymentMethod_Mailed check`1
1.554   `Contract_Two year`1
1.729   Bias
mixedModel$regLogistic$finalModel$tuneValue
##    cost loss epsilon
## 10    1   L1   0.001

7. ERROR ANALYSIS WITH HIERARCHICAL CLUSTERING

Let us now focus on the 569 cases of the training set that were classified as 0 (no Churn) when in reality they were 1 (Churn), equal to about half of the total actual churn cases.

Note that since the performance is very stable as the dataset varies (low variance), it makes little difference whether we perform this analysis on the training set or on the crossvalidation set.
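A hedged reconstruction of how this confusion matrix was likely built (rlog_train holds the class predictions of the regularized logistic model on the training set, as also used in the next chunk):

rlog_train <- predict(mixedModel$regLogistic, newdata = train)
confMat_rlog_train <- table(train$Churn_Yes, rlog_train)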

confMat_rlog_train
##    rlog_train
##        0    1
##   0 2809  289
##   1  569  553

7.1 Hierarchical Clustering

library(dplyr)
library(factoextra)

Error_Matrix<-cbind(train, rlog_train) %>% 
  filter(Churn_Yes==1 & rlog_train==0) %>% 
  select(-Churn_Yes) %>% select(-rlog_train)

#Construct a dendrogram with complete linkage
distxy<- dist(Error_Matrix) #Euclidean distance (default)
hClust<- hclust(distxy, method = "complete") #complete linkage
plot(hClust)

From the dendrogram analysis we choose to work with 4 clusters.
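Since factoextra is loaded above, the chosen cut can also be visualized directly on the dendrogram (a sketch; the plotting options are assumptions):

factoextra::fviz_dend(hClust, k = 4, rect = TRUE, show_labels = FALSE)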

So let’s check how many observations belong to each of the 4 clusters.

Error_Matrix <- cbind(Error_Matrix, cutree(hClust,k=4))

Error_Matrix$`cutree(hClust, k = 4)`<- as.factor(Error_Matrix$`cutree(hClust, k = 4)`)

table(cutree(hClust,k=4))
## 
##   1   2   3   4 
## 392 104  39  34

7.2 Categorical Features

Let’s now analyze the categorical features of the dataset to see how their distribution changes between the entire initial sample and the Error Matrix (the false negatives of the training set).

Confront<- function(Feature){
pos_Mix<- which(colnames(mix_dataset_tra) %in% Feature)
pos_Err <-which(colnames(Error_Matrix) %in% Feature)
Sintesi<-
  rbind(round(table(mix_dataset_tra[,pos_Mix])/NROW(mix_dataset_tra)*100),
               round(table(Error_Matrix[,pos_Err])/NROW(Error_Matrix)*100))

Sintesi<- as.data.frame(Sintesi)

Sintesi[3,]<- c((Sintesi[1,1] - Sintesi[2,1]), (Sintesi[1,2]-Sintesi[2,2]))

rownames(Sintesi)<- c("Starting_dataset","Error_Matrix","Delta")
colnames(Sintesi)<-paste0(Feature,"_",colnames(Sintesi))

Sintesi<- c(colnames(Sintesi)[1], Sintesi[3,1])
return(Sintesi)  
}

library(dplyr)

Pos<- which(colnames(Error_Matrix) %in% colnames(select_if(Error_Matrix,is.factor)))

Pos<- Pos[1:(NROW(Pos)-1)]

Conf_finale<-as.data.frame(matrix(nrow=1,ncol=2))

for (i in Pos){
Conf_finale<-rbind(Conf_finale,Confront(Feature=colnames(Error_Matrix)[i]))
}

Conf_finale<-na.omit(Conf_finale)
colnames(Conf_finale)<-c("Factor_feature","Delta%_Start_db-Error_db")
library(gt)
gt(Conf_finale)
Factor_feature                 Delta%_Start_db-Error_db
InternetService_Fiber optic_0   10
Contract_Two year_0            -20
PaymentMethod_Mailed check_0    -5
OnlineBackup_Yes_0              -6
InternetService_Yes_0           12

Compared to the starting dataset, for the categorical features only, the churn cases that the model classifies poorly have:

  • 20% fewer 2-year contracts (Contract_Two year); this means that, compared to the entire dataset, the churns that my model does not classify well have 20% fewer 2-year contracts. From the EDA (chapter 3.2) we know that overall there are very few cases of Churn with an active 2-year contract

  • 10% more internet services; this means that, compared to the whole dataset, the churns that my model does not classify well have about 10% more internet service / fiber optic internet service. From the EDA (chapter 3.4) we know that when the internet service is active the % of Churn increases from 7% to 32%. Therefore this 10% difference means that these Churn cases are more difficult for the model to predict

The purple cluster (cluster 1) is over-represented, while the red one (cluster 4) is in line.

In this case each cluster tends to increase; cluster 2 grows the most in percentage terms.

7.3 Continuous Features

Let’s now analyze the continuous features of the dataset to see how their values change between the entire initial sample and the Error Matrix (the false negatives of the training set).

Confront<- function(Feature){
pos_Mix<- which(colnames(mix_dataset_tra) %in% Feature)
pos_Err <-which(colnames(Error_Matrix) %in% Feature)
Sintesi<-rbind(round(mean(mix_dataset_tra[,pos_Mix])),round(mean(Error_Matrix[,pos_Err])))
Sintesi<- as.data.frame(Sintesi)
Sintesi[3,]<- round(100*(Sintesi[1,1] - Sintesi[2,1])/(Sintesi[1,1]))
rownames(Sintesi)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%")
colnames(Sintesi)<-c(Feature)
Sintesi[4,]<-colnames(Sintesi)
Sintesi<-t(Sintesi)
colnames(Sintesi)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%","Feature")

rownames(Sintesi)<-NULL
return(as.data.frame(Sintesi))  
}

library(dplyr)

Pos<- which(colnames(Error_Matrix) %in% colnames(select_if(Error_Matrix,is.numeric)))

Conf_finale<-as.data.frame(matrix(nrow=1,ncol=4))
colnames(Conf_finale)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%","Feature")

for (i in Pos){
Conf_finale<-
  rbind(Conf_finale,Confront(Feature=colnames(Error_Matrix)[i]))
}

Conf_finale<-na.omit(Conf_finale)

library(gt)
gt(Conf_finale)
Average_Starting_dataset  Average_Error_Matrix  Delta%  Feature
32                        31                     3      tenure
0                         0                      NaN    PercCharges
2                         2                      0      InternetService_Fiber_CON_OnlineBackup
38                        39                    -3      SeniorCitizen_CON_tenure
37                        33                    11      tenure_CON_PaymentMethod_Mailed_check
54                        48                    11      tenure_CON_OnlineBackup
1654                      1371                  17      tenure_QUADRATO

In comparing the two datasets (original/whole and the one containing the classification errors of the Churn cases) for the continuous variables, it can be observed that the average values vary for some features.

The most significant differences are with respect to:

  • 2 interaction features: tenure x Online Backup and tenure x Payment Method Mailed Check

  • 1 quadratic feature (the square of tenure)

Let’s compare the graphs of these 3 features in the two datasets.

In the original/whole dataset, the difference between the averages of “tenure”, as the categorical feature “Payment Method Mailed Check” varies, was less marked than in the error dataset.

Specifically, in the presence of the “Payment Method Mailed Check”, the average tenure in the whole dataset is >10 while in the error one it is <10.

library(dplyr)
library(gt)

Error_Matrix %>%
  group_by(`cutree(hClust, k = 4)`,`PaymentMethod_Mailed check`)%>%
  summarize(mean_tenure=round(mean(tenure)))%>%
  gt() %>%
  cols_label(
    'PaymentMethod_Mailed check' = md("**Cluster+PaymentMethod_Mailed check**"),
    mean_tenure = md("**mean_tenure**")
  )
Cluster  PaymentMethod_Mailed check  mean_tenure
1        0                           22
1        1                           10
2        0                           50
2        1                           46
3        0                           69
4        0                           61
4        1                           61

The goal is to reduce the difference in the mean value of the tenure as the Payment Method Mailed Check varies; here too, cluster 1 performs badly because it increases this difference.

Let’s see the next feature

In the original/whole dataset, the difference between the tenure averages, as the categorical Online Backup feature varies, was more marked than in the error dataset.

The objective here, instead, is to increase the difference in the average value of the tenure as the Online Backup varies.

library(dplyr)
library(gt)

Error_Matrix %>%
  group_by(`cutree(hClust, k = 4)`,OnlineBackup_Yes) %>%
  summarize(mean_tenure=round(mean(tenure)))%>%
  gt() %>%
  cols_label(
    OnlineBackup_Yes = md("**Cluster+OnlineBackup_Yes**"),
    mean_tenure = md("**mean_tenure**")
  )
Cluster  OnlineBackup_Yes  mean_tenure
1        0                 20
1        1                 18
2        0                 49
2        1                 50
3        0                 68
3        1                 69
4        0                 60
4        1                 61

This time it seems that no cluster can contribute to the goal.

Let's now look at the last feature:

In the original/whole dataset, the squared “tenure” feature has a higher 75th percentile (3,000) than the error dataset (a little over 2,000).

library(dplyr)
library(gt)

Error_Matrix %>%
  group_by(`cutree(hClust, k = 4)`) %>%
  summarize(
    lower_tenure_2 = quantile(tenure_QUADRATO, probs = 0.025),
    mean_tenure_2 = quantile(tenure_QUADRATO, probs = 0.50),
    upper_tenure_2 = quantile(tenure_QUADRATO, probs = 0.975))%>%
  gt()
Cluster  lower_tenure_2  mean_tenure_2  upper_tenure_2
1        1               361            1699.675
2        1849            2401           3184.025
3        4356            4624           5184.000
4        3364            3600           4225.000

Of the 4 clusters, the one that simultaneously lowers the median value (50th percentile) and the upper value (97.5th percentile) is once again cluster 1, followed by cluster 2.

7.4 The synthesis of the Error Analysis and a new implementation

Below we summarize the clusters to be attacked, i.e. explored further.

First, let's recall how many rows of the training set fall in each cluster:

table(Error_Matrix$`cutree(hClust, k = 4)`)
## 
##   1   2   3   4 
## 392 104  39  34

So let’s try to triple the data belonging to cluster 1 (where the data differ the most from the entire dataset) in the training set, and then see how the performance changes.

First, however, we need to synthesize the ex ante characteristics of cluster 1 through the values of its principal features:

  • Churn_Yes =1 (obviously)

  • Contract_Two year=0

  • InternetService_Fiber optic=1

  • max. tenure =40 or max. tenure^2 =1600

Now we will triple the data from cluster 1, not by selecting them directly from the cluster analysis but indirectly, on the basis of the main characteristics that emerged for cluster 1.

This indirect step is necessary to test our hypotheses about the characteristics of cluster 1.

With the direct approach we know that there are 392 rows; let’s now figure out how many there are with the indirect approach.

library(dplyr)

train %>% filter(Churn_Yes==1) %>% 
  filter(tenure<=40) %>% 
  filter(`Contract_Two year`==0) %>% 
  filter(`InternetService_Fiber optic`==1) %>% 
  NROW()
## [1] 641

These 641 rows may also include those currently correctly classified by our model (weak point of the indirect approach to the search for cluster 1).

With 4,200 rows in the training set, cluster 1 represents about 15% of the whole (641/4,200).

Since we will now triple the number of cluster 1 rows (from 641 to 1,923), it will represent about 35%:

(641*3)/(4200+641*2)
## [1] 0.3507844

Now we triple cluster 1 in the training set with the indirect approach:

library(dplyr)

NewRow<-train %>% 
  filter(Churn_Yes==1) %>% 
  filter(tenure<=40) %>% 
  filter(`Contract_Two year`==0) %>%
  filter(`InternetService_Fiber optic`==1) 

train<-rbind(train,NewRow,NewRow)
rm(NewRow)

Let's re-run the model with the new, larger training dataset:

library(lattice)
library(caret)

set.seed(10)
trc<- trainControl(method = "repeatedcv", number=3,repeats=4, classProbs=FALSE) 

logr <- train(Churn_Yes~., data=train, trControl= trc, family="binomial", method = "regLogistic")

…let's see the new:

  • coefficients
Value   Feature
-2.185  `InternetService_Fiber optic`1
-1.146  PercCharges
-0.860  InternetService_Yes1
-0.034  InternetService_Fiber_CON_OnlineBackup
-0.010  SeniorCitizen_CON_tenure
-0.003  tenure_CON_OnlineBackup
0.000   tenure_QUADRATO
0.018   tenure_CON_PaymentMethod_Mailed_check
0.021   tenure
0.374   `PaymentMethod_Mailed check`1
0.400   OnlineBackup_Yes1
1.407   Bias
1.441   `Contract_Two year`1
  • the tuning parameters (the cost penalty coefficient, loss and epsilon)
logr$finalModel$tuneValue
##    cost loss epsilon
## 10    1   L1   0.001

…and now check the new performance

Parameter    Null_Model  DecisionTree_train  DecisionTree_crossvalidation  rlog_train  rlog_crossvalidation
accuracy     73          79                  80                            80          80
sensitivity  0           38                  41                            49          51
library(gt)
gt(final_2[,6:9])
Parameter    rlog_new_train  rlog_new_crossvalidation  rlog_new_test
accuracy     83              78                        76
sensitivity  83              69                        61

Choosing the average performance on the crossvalidation and/or test datasets, we can summarize:

  • Null Model: 0% sensitivity and 73% accuracy

  • Classification tree: 41% sensitivity and 80% accuracy

    confMat_crossvalidation
    ##    p_crossvalidation
    ##       0   1
    ##   0 982  64
    ##   1 214 146
  • Logistic Regression with penalty: 51% Sensitivity and 80% accuracy

    confMat_rlog_crossvalidation
    ##    rlog_crossvalidation
    ##       0   1
    ##   0 939 107
    ##   1 178 182
  • Logistic Regression with penalty and 3x the same data from the ex-ante cluster 1: 65% sensitivity and 77% accuracy. This is chosen as the best model in our context of hypotheses.

confMat_rlog_new_crossvalidation
##    rlog_new_crossvalidation
##       0   1
##   0 849 197
##   1 112 248
confMat_rlog_new_test
##    rlog_new_test
##       0   1
##   0 834 185
##   1 151 236

In the last model, an important variability is recorded for the sensitivity parameter (83% on the training set versus 69% and 61% on the crossvalidation and test sets), due to reusing the same cluster 1 data 3 times in the training set. This technique caused an increase in the variance of the model.

It would therefore be appropriate to collect further new data belonging to cluster 1 in order to improve both the bias and the variance of the model.

8. SETTING A BETTER THRESHOLD

Let’s try to understand how the performance of the best model (the latest one) changes when we modify the probability threshold that determines the classification as Churn_Yes (1) or Churn_No (0), compared to the classic 50%.

library(pROC)
library(caret)

result.predicted.prob <- predict(logr, test, type="prob") # predicted class probabilities
result.roc <- roc(test$Churn_Yes, result.predicted.prob$`1`) # build the ROC curve

plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")

result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
##           threshold  accuracy
## threshold  0.393304 0.7560455

So from the ROC graph, the threshold value that optimizes the trade-off between sensitivity (78%) and specificity (69%) on the test dataset, providing an accuracy of 75%, is 0.39, not the classic 0.5.
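A sketch of how the new threshold can be applied to re-classify the test set (variable names assumed; caret::confusionMatrix is used only for illustration):

thr <- result.coords$threshold # ~0.39
rlog_thr_test <- factor(ifelse(result.predicted.prob$`1` >= thr, 1, 0),
                        levels = levels(test$Churn_Yes))
caret::confusionMatrix(rlog_thr_test, test$Churn_Yes, positive = "1")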

Let's check whether the performances on the crossvalidation and test datasets actually match these values.

Parameter    Null_Model  DecisionTree_train  DecisionTree_crossvalidation  rlog_train  rlog_crossvalidation
accuracy     73          79                  80                            80          80
sensitivity  0           38                  41                            49          51
library(gt)
gt(final_2[,6:9])
Parameter    rlog_new_train  rlog_new_crossvalidation  rlog_new_test
accuracy     82              77                        76
sensitivity  87              73                        69

Ultimately, by choosing the latest model (the logistic regression with penalty and with three times as much data from cluster 1) as the optimal one, and having lowered the threshold value from 50% to just under 40%, we have (always considering the performance achieved on the crossvalidation and test datasets):

9. INTERPRETING THE BEST MODEL

In this chapter we proceed to the visual interpretation of the best performing model through graphs: