LinkedIn profile: https://www.linkedin.com/in/gaetano-caira-4216ba93
This project work aims to show the skills of the writer, Gaetano Caira, on the following topics:
Data wrangling
Data visualization
Supervised algorithms for classification
Bias and variance
Error analysis
Unsupervised algorithms (Hierarchical Clustering)
R programming
The dataset chosen for show these skills has been shared on Kaggle (link: https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download) in order to be able to analyze and forecast the churn phenomenon in the Telco sector.
The data set includes information about:
Customers who left within the last month – the column is called Churn
Services that each customer has signed up for phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
In detail below you can find the description of the individual features:
gender: whether the customer is a male or a female
SeniorCitizen: whether the customer is a senior citizen or not (1, 0)
Partner: whether the customer has a partner or not (Yes, No)
Dependents: whether the customer has dependents or not (Yes, No)
Tenure: number of months the customer has stayed with the company
PhoneService: whether the customer has a phone service or not (Yes, No)
MultipleLines: whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService: customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract:The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBillings: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Whether the customer churned or not (Yes or No)
Each row represents a customer, so the ML problem is characterized as a classification problem within the family of supervised models.
A guide for the reader is summarized below and summarizes the increase in performance of the sensitivity (vertical axis) of the predictive model as the chapters progress (orizontal axis):
Below are all the actions (in sequence) aimed at:
perform one-hot-endcoding for categorical features
remove the missing values
remove features with low and no variability (no informative)
correct the datatype’s format
some features engineer: new feature as correlation between 2 features, non-linearity (only for “tenure”), mix between features (only for “% Charges”)
knitr::opts_chunk$set(error = TRUE)
mix_dataset_tra<-
as.data.frame(mix_dataset_tra)
mix_dataset_tra$customerID<-
as.character(mix_dataset_tra$customerID)
TargetName<-("Churn")
Binarize_Features <-
function(data_set, features_to_ignore=c(), leave_out_one_level=TRUE, max_level_count=10) {
require(dplyr)
text_features <- c(names(data_set[sapply(data_set, is.character)]), names(data_set[sapply(data_set, is.factor)]))
for (feature_name in setdiff(text_features, features_to_ignore)) {
feature_vector <- as.character(data_set[,feature_name])
# check that data has more than one level
if (length(unique(feature_vector)) == 1)
next
# We set any non-data to text
feature_vector[is.na(feature_vector)] <- 'NA'
feature_vector[is.infinite(feature_vector)] <- 'INF'
feature_vector[is.nan(feature_vector)] <- 'NAN'
# only give us the top x most popular categories
temp_vect <- data.frame(table(feature_vector)) %>% arrange(desc(Freq)) %>% head(max_level_count)
feature_vector <- ifelse(feature_vector %in% temp_vect$feature_vector, feature_vector, 'Other')
# loop through each level of a feature and create a new column
first_level=TRUE
for (newcol in unique(feature_vector)) {
if (leave_out_one_level & first_level) {
# avoid dummy trap and skip first level
first_level=FALSE
next
}
data_set[,paste0(feature_name,"_",newcol)] <- ifelse(feature_vector==newcol,1,0)
}
# remove original feature
data_set <- data_set[,setdiff(names(data_set),feature_name)]
}
return (data_set)
}
#str(Binarize_Features(data_set = mix_dataset_tra))
mix_dataset_tra<-
Binarize_Features(data_set = mix_dataset_tra)
for(i in 5:41){
mix_dataset_tra[,i]<- as.factor(mix_dataset_tra[,i])
}
#give NA if there is not written nothing
mix_dataset_tra[mix_dataset_tra==""]<-"NA"
mix_dataset_tra[mix_dataset_tra=="?"]<-"NA"
#remove
mix_dataset_tra<-na.omit(mix_dataset_tra)
library(caret)
#delete features with low variability
VarDaEl<-row.names(caret::nearZeroVar(mix_dataset_tra, saveMetrics=TRUE))[which(nearZeroVar(mix_dataset_tra, saveMetrics=TRUE)$nzv==TRUE)]
mix_dataset_tra<-mix_dataset_tra[,-which(colnames(mix_dataset_tra)%in% VarDaEl)]
#correct datatype
mix_dataset_tra$SeniorCitizen<- as.factor(mix_dataset_tra$SeniorCitizen)
mix_dataset_tra$PercCharges<- mix_dataset_tra$MonthlyCharges/mix_dataset_tra$TotalCharges
TargetName<-"Churn_Yes"
Fr2<-c("PercCharges","InternetService_Fiber optic","InternetService_Yes", "Contract_Two year")
#One hot endconding reverse (more understandable)
mix_dataset_tra$OnlineBackup_Yes<-ifelse(mix_dataset_tra$OnlineBackup_No==1,0,1)
mix_dataset_tra$InternetService_Yes<-ifelse(mix_dataset_tra$InternetService_No==1,0,1)
mix_dataset_tra$OnlineBackup_Yes<-as.factor(mix_dataset_tra$OnlineBackup_Yes)
mix_dataset_tra$InternetService_Yes<-as.factor(mix_dataset_tra$InternetService_Yes)
#remove old feature
mix_dataset_tra$OnlineBackup_No <- NULL
mix_dataset_tra$InternetService_No <- NULL
#correlation feature
mix_dataset_tra$InternetService_Fiber_CON_OnlineBackup<-
as.integer(mix_dataset_tra$`InternetService_Fiber optic`)*
as.integer(mix_dataset_tra$OnlineBackup_Yes)
mix_dataset_tra$SeniorCitizen_CON_tenure<-
as.integer(mix_dataset_tra$SeniorCitizen)*
as.integer(mix_dataset_tra$tenure)
mix_dataset_tra$tenure_CON_PaymentMethod_Mailed_check<-
as.integer(mix_dataset_tra$tenure)*
as.integer(mix_dataset_tra$`PaymentMethod_Mailed check`)
mix_dataset_tra$tenure_CON_OnlineBackup<-
as.integer(mix_dataset_tra$tenure)*as.integer(mix_dataset_tra$OnlineBackup_Yes)
#non linearity
mix_dataset_tra$tenure_QUADRATO<-
as.integer(mix_dataset_tra$tenure)*
as.integer(mix_dataset_tra$tenure)
A mixed (univariate + multivariate) exploratory data analysis (EDA) is made below, focused only on the most informative graphs.
The 3/4 of dataset with NO Churn and 1/4 with YES Churn. The classification problem is sufficiently balanced.
Let’s now discover the relationship between the Churns and the presence of 2-year contracts.
## You can cite this package as:
## Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
## Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
| Contract_Two year | Churn | Freq% |
|---|---|---|
| 0 | 0 | 66 |
| 1 | 0 | 97 |
| 0 | 1 | 34 |
| 1 | 1 | 3 |
There are very few (<< 25%) churns in the case of a 2-year contract.
Therefore it is very unlikely that the customer will abandon us if they have a 2-year contract in progress.
What is the relationship between the Churns and the presence of the fiber optic service?
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
| Fiber_optic | Churn_Yes | Freq% |
|---|---|---|
| 0 | 0 | 85 |
| 1 | 0 | 58 |
| 0 | 1 | 15 |
| 1 | 1 | 42 |
Customers with active Fiber service are those in % who have a higher Churn (dal 15% al 42%). In other words, the optical fiber service active at the customer is correlated to churns.
If there is a positive correlation between Churn and fiber optic internet service…how will it relate to internet service in general?
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
The relation is similar to that with Fiber optic service: almost all customers without internet service do not generate churn. These last 4 graphs must be read together: obviously if the churn generator almost always has the optical fiber service, he also has the internet service.
The % charges is the ratio between the last month’s charge and the total charged up to that moment. This value is higher when:
more recent the customer is
more customer spend in the last month than in the past
Let’s see how this % is related to the churn phenomenon:
Customers who have a higher% charge tend to be the ones who leave.
Therefore, it is the newest customers and/or those who spend the most in the last month who leave the most; let see in:
last graph how the average is higher in the case of Churn than in the case of no Churn
penultimate graph the proportion of Churn on the total when the % Charges is equal to 1
For both types, this distribution is strongly skewed to the right
See how Payment Mehod Mailed Check, Tenure and Churn are related to each other
Customers with a payment method by postal check have on average a lower tenure. In both cases (with or without payment method by postal check), those that generate churn have a shorter term of tenure. The two phenomena are additive
In other words, the most recent customers (minor tenure) are those who have the payment method by postal check. Within them, those who abandon are the most recent (in line with the analysis made for the % Charges).
See how On line back up, Tenure and Churn are related to each other
Similar mechanism seen previously only by exchanging Payment Method Mailed Check with Online backup. Customers with a Online backup have on average a higher tenure (they are older). In both cases (with or without Online backup), those that generate churn have a shorter term of tenure. The two phenomena are additive.
Let’s now investigate the two relations indirectly linked to the Churn between the internet service and :
the internet service with optical fiber
the online backup
From the last two graphs there is a correlation:
positive between the internet service and the internet service with optical fiber; of all those with active internet service, less more than half are with internet service
negative between the internet service and the online back-up; of all those with active internet service, less than half are backed up online
At last, let’s try to understand how all the 4 variables are related:
Churn
Online Back up
Tenure
Payment Method Mailed Check
Starting from the basic quadrant at the top-left, if I move:
on the top-right, with the presence of the Payment method check, both terms of the mandate are reduced (both Churn and not Churn). The same result if I go to the low-right quadrant. In other words, when there is the Payment method check, the importance of the online backup is null on the tenure.
at the bottom-left with the presence of the backup both the duration of the tenure increases (both churn and not)
In other words the effect given by the payment method check (in reducing tenure) prevails over online back up (in increasing tenure)
Synthesizing:
customers with a lower tenure with the same On line Back-up and payment method mailed check are those who abandon most.
if there is a payment method then the average tenure decreases
if there is no payment method mailed check then the average tenure can increase if there is On line back up and decrease if there is not.
Now let’s try to understand how all the 4 variables are related:
Churn
Online Back up
Internet Service Fiber Optic
In descending sequence there are several cases (in absolute and percentage) of churn:
without onlinebackup and with internet - top-right quadrant
with onlinebackup and with internet - down-right quadrant
without internet and onlinebackup - down-left quadrant
Although already mentioned above, let’s look at the relationship between tenure and churn (as well as the square of the tenure)
As seen previously, the churn generator has a lower tenure duration
At this point it is possible to reduce the columns/features of the dataset to those that express a significantly important relationship with the target variable (Churn_Yes), which were analyzed in the previous chapter
Fr<-
union(Fr2,colnames(mix_dataset_tra[,29:37]))
rm(Fr2)
Fr<-union(Fr,TargetName)
F<-c("tenure","PaymentMethod_Mailed check")
Fr<-union(Fr,F)
rm(F)
mix_dataset_tra<- mix_dataset_tra[,which(colnames(mix_dataset_tra) %in% Fr)]
rm(Fr)
Now that the number of features has been reduced, it is possible to analyze the mutual correlations more broadly with a matrix that plots the correlations for pairs of features and individually
library(GGally)
mix_dataset_tra %>% GGally::ggpairs(mapping = aes(color = mix_dataset_tra$Churn_Yes, alpha = 0.5))
As seen before, those who generate Churn tend to be:
with internet
does not have a 2-year contract
has a higher average% charge / lower tenure
Following there is an over view of the most significant features correlated with Churn, perhaps a little risky as it should first be validated by an expert in the Telco sector (every Data Scientist, like the writer, know the difference between correlation and causality):
library(rpart)
library(partykit)
dt<-rpart(Churn_Yes~., data=mix_dataset_tra, control=rpart.control(min.split=60,min.bucket=30,maxdepth = 4))
plot(as.party(dt))
In this simple decision tree (classification tree) we go so far as to predict as Churn cases those that:
(*) on this dataset
Why is there this rule? All the high values of tenure and the low ones but with on-line back up (chapter 3.7) are all merged as NOT churn. Those without on-line backup and low tenure remain to be further investigated.
Let’s calculate the mean:
library(dplyr)
library(gt)
tb<-as.data.frame(mix_dataset_tra%>%dplyr::group_by(OnlineBackup_Yes,Churn_Yes)%>%summarize( mean_tenure= round(mean(tenure))))
gt(tb)
| OnlineBackup_Yes | Churn_Yes | mean_tenure |
|---|---|---|
| 0 | 0 | 30 |
| 0 | 1 | 14 |
| 1 | 0 | 42 |
| 1 | 1 | 26 |
Assuming on the basis of the preventive actions a management cost of false positives (I predict there is Churn when there isn’t=> useless retention actions) equal to that of false negatives (I predict there is not Churn when there is =>I lose the customer without trying to keep it), we can consider that the accuracy of the classification is a pre-condition of a good model. In other words we would like the overall accuracy to be at least equal to to the value obtained in the null model (as a constraint).
We also assume that the real objective function to be maximized within the overall accuracy constraint is the sensitivity, or the ability to intercept the highest number of real churns. The % of the expected churns out of the total of those actually present is Sensitivity or TP / (TP + FN)
We will therefore carry forward the two indicators, accuracy and sensitivity, together in the choice and refinement of the classification models.
Let’s assume now that the null (simpler) model is simply to classify all customers as no churn (since churns are the least represented % class).
The null model therefore has an accuracy of 73% over the entire dataset but with a sensitivity equal to 0%. Now we look for the model with the simplest degree of complexity (classification tree seen above) when its accuracy performance improves.
accuracy_NullModel<-0.73
sensitivity_NullModel<-0
The 40% of the initial dataset (appropriately represented) on which the model is not built is used to calculate the accuracy and sensitivity and is split on 2 datasets:
Cross validation dataset
Test dataset
library(rpart)
library(caret)
library(partykit)
set.seed(10)
TrainIndex <- createDataPartition(mix_dataset_tra$Churn_Yes, p=0.60, list= FALSE)
train<-mix_dataset_tra[TrainIndex,]
other<-mix_dataset_tra[-TrainIndex,]
crossvalidation<-other[1:(0.5*NROW(other)),]
test<-other[(1+0.5*NROW(other)):(NROW(other)),]
rm(TrainIndex)
dt<-rpart(Churn_Yes~., data=train, control=rpart.control(min.split=60,min.bucket=30,maxdepth = 4))
plot(as.party(dt))
The classification tree built on 60% of the data and with the same specifications in its rule is the same as the previous one built with the entire dataset.
Let’s calculate its performance
| Null_Model | DecisionTree_train | DecisionTree_crossvalidation |
|---|---|---|
| 0.73 | 79 | 80 |
| 0 | 38 | 41 |
The increase in accuracy from the null model to the decision tree goes from 73% to 80% (+ 7%)
accuracy_crossvalidation
## [1] 0.802276
The null model fails to predict any % of real churn (sensitivity), despite its average good reliability. In the case of sensitivity we pass from 0 in the Null Model to one equal to 40% in Decision Tree
sensitivity_crossvalidation
## [1] 0.4055556
In other words, if the model is used to intercept a sufficient number of churns then it is particularly performing compared to the null model (from 0 to 40%). On the contrary, if I look at the overall performance (accuracy) I have a slight improvement of + 7%.
In order to improve sensitivity for a new kind of model, there we have to understand if with this classification tree is bias or variance
## V1 V2 V3
## names Null_Model DecisionTree_train DecisionTree_crossvalidation
## accuracy 73 79 80
## sensitivity 0 38 41
The comparison of the accuracy performances between null model and training set and crossvalidation with the tree model shows that the model has a greater component of bias than variance. This means looking for more complex models to increase performance
## [1] 0.06075829
DeltaAccuracy_crossvalidation_train
## [1] 0.01151767
library(caretEnsemble)
library(lattice)
set.seed(10)
trc<- trainControl(method = "repeatedcv", number=3,repeats=4, classProbs=FALSE)
AlgoritmListClass<-c("gbm","svmLinear","regLogistic","gaussprRadial","qda","lda","rf")
mixedModel <- suppressWarnings(caretEnsemble::caretList(Churn_Yes~.,data=train,methodList=AlgoritmListClass,trControl=trc, verbose = FALSE))
scales<-list(x=list(relation="free"),y=list(relation="free"))
bwplot(resamples(mixedModel) ,scales=scales)
When comparing multiple alternative models of models, even if they have similar performance, the best seems to be the Regularized Logistic Regression.
| Null_Model | DecisionTree_train | DecisionTree_crossvalidation | rlog_train | rlog_crossvalidation | Parametr |
|---|---|---|---|---|---|
| 73 | 79 | 80 | 80 | 80 | accuracy |
| 0 | 38 | 41 | 49 | 51 | sensitivity |
With the rlog model it is possible to appreciate:
a 10% improvement in sensitivity with the same overall accuracy compared to the tree model (Lower Bias)
practically equal performance between training and cross validation dataset (Lower Variance)
There we can find:
library(dplyr)
library(gt)
options(scipen = 999)
names<- colnames(mixedModel$regLogistic$finalModel$W)
tb<-as.data.frame(mixedModel$regLogistic$finalModel$W)
tb[2,]<-names
tb<- t(tb)
colnames(tb)<- c("Value","Feature")
tb<-as.data.frame(tb)
tb$Value<- round(x=as.numeric(tb$Value), digits = 3)
tb<-as.data.frame(tb) %>% arrange(Value)
gt(tb)
| Value | Feature |
|---|---|
| -1.358 | PercCharges |
| -1.279 | `InternetService_Fiber optic`1 |
| -1.005 | InternetService_Yes1 |
| -0.010 | SeniorCitizen_CON_tenure |
| -0.003 | tenure_CON_OnlineBackup |
| 0.000 | InternetService_Fiber_CON_OnlineBackup |
| 0.000 | tenure_QUADRATO |
| 0.008 | tenure_CON_PaymentMethod_Mailed_check |
| 0.033 | tenure |
| 0.270 | OnlineBackup_Yes1 |
| 0.427 | `PaymentMethod_Mailed check`1 |
| 1.554 | `Contract_Two year`1 |
| 1.729 | Bias |
mixedModel$regLogistic$finalModel$tuneValue
## cost loss epsilon
## 10 1 L1 0.001
Let us now focus on the 569 cases that in the training set were classified as 0 (no Churn) when in reality they were 1 (yes Churn), equal to half of the total cases 1.
Note that since the performance is very stable as the dataset varies (low variance), it makes no difference whether to perform it on the training set or on the cross validation set.
confMat_rlog_train
## rlog_train
## 0 1
## 0 2809 289
## 1 569 553
library(dplyr)
library(factoextra)
Error_Matrix<-cbind(train, rlog_train) %>%
filter(Churn_Yes==1 & rlog_train==0) %>%
select(-Churn_Yes) %>% select(-rlog_train)
#Construct a Dentogram with euclidian distance
library(e1071)
distxy<- dist(Error_Matrix)
hClust<- hclust(distxy, method = "complete") #Euclidian Distance
plot(hClust)
From the dentogram analysis we assume to analyze 4 clusters.
So let’s try to figure out how many data belong to the various 4 reference clusters.
Error_Matrix <- cbind(Error_Matrix, cutree(hClust,k=4))
Error_Matrix$`cutree(hClust, k = 4)`<- as.factor(Error_Matrix$`cutree(hClust, k = 4)`)
table(cutree(hClust,k=4))
##
## 1 2 3 4
## 392 104 39 34
Let’s now analyze the categorical features of the dataset to see how they change from the entire initial sample and that of the Error Matrix (false negative of the crossvalidation dataset).
Confront<- function(Feature){
pos_Mix<- which(colnames(mix_dataset_tra) %in% Feature)
pos_Err <-which(colnames(Error_Matrix) %in% Feature)
Sintesi<-
rbind(round(table(mix_dataset_tra[,pos_Mix])/NROW(mix_dataset_tra)*100),
round(table(Error_Matrix[,pos_Err])/NROW(Error_Matrix)*100))
Sintesi<- as.data.frame(Sintesi)
Sintesi[3,]<- c((Sintesi[1,1] - Sintesi[2,1]), (Sintesi[1,2]-Sintesi[2,2]))
rownames(Sintesi)<- c("Starting_dataset","Error_Matrix","Delta")
colnames(Sintesi)<-paste0(Feature,"_",colnames(Sintesi))
Sintesi<- c(colnames(Sintesi)[1], Sintesi[3,1])
return(Sintesi)
}
library(dplyr)
Pos<- which(colnames(Error_Matrix) %in% colnames(select_if(Error_Matrix,is.factor)))
Pos<- Pos[1:(NROW(Pos)-1)]
Conf_finale<-as.data.frame(matrix(nrow=1,ncol=2))
for (i in Pos){
Conf_finale<-rbind(Conf_finale,Confront(Feature=colnames(Error_Matrix)[i]))
}
Conf_finale<-na.omit(Conf_finale)
colnames(Conf_finale)<-c("Factor_feature","Delta%_Start_db-Error_db")
library(gt)
gt(Conf_finale)
| Factor_feature | Delta%_Start_db-Error_db |
|---|---|
| InternetService_Fiber optic_0 | 10 |
| Contract_Two year_0 | -20 |
| PaymentMethod_Mailed check_0 | -5 |
| OnlineBackup_Yes_0 | -6 |
| InternetService_Yes_0 | 12 |
Compared to the starting dataset, for categorical features only, where the data performs poorly in classifying churns, we have:
20% less than Contract_Two year; this means that compared to the entire dataset, the churns that my model does not classify well have 20% fewer 2-year contracts. From the EDA (chapter 3.2) we know that in total there are still very few cases of Churn with active 2-year contracts
10% more internet services; this means that compared to the whole dataset the churns that my model does not classify well have 10% less internet service/internet service with fiber optic. From the EDA (chapter 3.4) we know that when the internet services is active the % of Churn increases from 7% to 32%. Therefore having 10% less means that the model makes it more difficult to predict the outcome of Churn
Too much of the purple cluster (cluster 1) and the red one (cluster 4) is fine.
In this case each cluster tends to increase , the number 2 in % is greater.
Let’s now analyze the continue features of the dataset to see how they change from the entire initial sample and that of the Error Matrix (false negative of the crossvalidation dataset).
Confront<- function(Feature){
pos_Mix<- which(colnames(mix_dataset_tra) %in% Feature)
pos_Err <-which(colnames(Error_Matrix) %in% Feature)
Sintesi<-rbind(round(mean(mix_dataset_tra[,pos_Mix])),round(mean(Error_Matrix[,pos_Err])))
Sintesi<- as.data.frame(Sintesi)
Sintesi[3,]<- round(100*(Sintesi[1,1] - Sintesi[2,1])/(Sintesi[1,1]))
rownames(Sintesi)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%")
colnames(Sintesi)<-c(Feature)
Sintesi[4,]<-colnames(Sintesi)
Sintesi<-t(Sintesi)
colnames(Sintesi)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%","Feature")
rownames(Sintesi)<-NULL
return(as.data.frame(Sintesi))
}
library(dplyr)
Pos<- which(colnames(Error_Matrix) %in% colnames(select_if(Error_Matrix,is.numeric)))
Conf_finale<-as.data.frame(matrix(nrow=1,ncol=4))
colnames(Conf_finale)<- c("Average_Starting_dataset","Average_Error_Matrix","Delta%","Feature")
for (i in Pos){
Conf_finale<-
rbind(Conf_finale,Confront(Feature=colnames(Error_Matrix)[i]))
}
Conf_finale<-na.omit(Conf_finale)
library(gt)
gt(Conf_finale)
| Average_Starting_dataset | Average_Error_Matrix | Delta% | Feature |
|---|---|---|---|
| 32 | 31 | 3 | tenure |
| 0 | 0 | NaN | PercCharges |
| 2 | 2 | 0 | InternetService_Fiber_CON_OnlineBackup |
| 38 | 39 | -3 | SeniorCitizen_CON_tenure |
| 37 | 33 | 11 | tenure_CON_PaymentMethod_Mailed_check |
| 54 | 48 | 11 | tenure_CON_OnlineBackup |
| 1654 | 1371 | 17 | tenure_QUADRATO |
In comparing the two datasets (original/whole and the one containing the classification errors of the Churn cases) for the continuous variables, it can be observed that the average values vary for some features.
The most significant differences are with respect to:
2 interaction features: tenure+online backup and tenure+ Payment method check
1 quadratic feature (tenure)
Let’s try to compare the graphs of the 3 features of the two datasets
In the original/whole dataset, the difference between the averages of “tenure”, as the categorical characteristics of the “Payment Method of Mailed Check” varied, was less marked than that of the error dataset.
Specifically, in the case of the presence of “Postal Check Payment Method” the average in the whole db is >10 and in the Error one it is < 10
library(dplyr)
library(gt)
Error_Matrix %>%
group_by(`cutree(hClust, k = 4)`,`PaymentMethod_Mailed check`)%>%
summarize(mean_tenure=round(mean(tenure)))%>%
gt() %>%
cols_label(
'PaymentMethod_Mailed check' = md("**Cluster+PaymentMethod_Mailed check**"),
mean_tenure = md("**mean_tenure**")
)
| Cluster+PaymentMethod_Mailed check | mean_tenure |
|---|---|
| 1 | |
| 0 | 22 |
| 1 | 10 |
| 2 | |
| 0 | 50 |
| 1 | 46 |
| 3 | |
| 0 | 69 |
| 4 | |
| 0 | 61 |
| 1 | 61 |
The goal is to reduce the difference in the mean value of the tenure as the Payment method check varies and, also here, cluster 1 performs badly because it increases this difference.
Let’s see the next feature
In the original/whole dataset, the difference between the tenure averages, as the categorical Online back up features varied, was more marked compared to that of the error.
The objective here, instead, is to increase the difference in the average value of the tenure as the on line back up varies.
library(dplyr)
library(gt)
Error_Matrix %>%
group_by(`cutree(hClust, k = 4)`,OnlineBackup_Yes) %>%
summarize(mean_tenure=round(mean(tenure)))%>%
gt() %>%
cols_label(
OnlineBackup_Yes = md("**Cluster+OnlineBackup_Yes**"),
mean_tenure = md("**mean_tenure**")
)
| Cluster+OnlineBackup_Yes | mean_tenure |
|---|---|
| 1 | |
| 0 | 20 |
| 1 | 18 |
| 2 | |
| 0 | 49 |
| 1 | 50 |
| 3 | |
| 0 | 68 |
| 1 | 69 |
| 4 | |
| 0 | 60 |
| 1 | 61 |
This time it seems that no cluster can contribute to the goal.
Let see now the last feature:
In the original/whole dataset, the “tenure” square feature has the 75th Quantile higher (3.000) than that of the error dataset (a little over 2.000).
library(dplyr)
library(gt)
Error_Matrix %>%
group_by(`cutree(hClust, k = 4)`) %>%
summarize(
lower_tenure_2 = quantile(tenure_QUADRATO, probs = 0.025),
mean_tenure_2 = quantile(tenure_QUADRATO, probs = 0.50),
upper_tenure_2 = quantile(tenure_QUADRATO, probs = 0.975))%>%
gt()
| cutree(hClust, k = 4) | lower_tenure_2 | mean_tenure_2 | upper_tenure_2 |
|---|---|---|---|
| 1 | 1 | 361 | 1699.675 |
| 2 | 1849 | 2401 | 3184.025 |
| 3 | 4356 | 4624 | 5184.000 |
| 4 | 3364 | 3600 | 4225.000 |
Of the 3 clusters, the one that simultaneously lowers the average value (50% percentile) and the high one (95% percentile) is once again cluster 1, followed by cluster 2.
Below it is possible to summarize the clusters to be attacked in terms to be explored
Remember first how many rows in train dataset for each cluster
table(Error_Matrix$`cutree(hClust, k = 4)`)
##
## 1 2 3 4
## 392 104 39 34
So let’s try to triple the data belonging to cluster 1 (where the data is overall different from that of the entire dataset) from the training set and then understand how the relative performances change.
First, however, we need to synthesize the ex ante characteristics of cluster 1 by the value of its principal features:
Churn_Yes =1 (obviously)
Contract_Two year=0
InternetService_Fiber optic=1
max. tenure =40 or max. tenure^2 =1600
Now we will triplicate the data from cluster 1, but not by selecting them directly from the cluster analysis but indirectly on the basis of the main characteristics that emerged from cluster 1.
This indirect step is necessary to test our hypotheses about the characteristics of cluster 1.
In the direct approach we therefore know that there are 392 rows, let’s now try to figure out how many there are with the indirect approach.
library(dplyr)
train %>% filter(Churn_Yes==1) %>%
filter(tenure<=40) %>%
filter(`Contract_Two year`==0) %>%
filter(`InternetService_Fiber optic`==1) %>%
NROW()
## [1] 641
These 641 rows may also include those currently correctly classified by our model (weak point of the indirect approach to the search for cluster 1).
With 4.200 rows in training set we represent cluster 1 for 15% of the whole.
Since we will now triple the number of cluster 1 (from 641 to 1.923), this will be represented at 35%
(641*3)/(4200+641*2)
## [1] 0.3507844
Now we triplicate cluster 1 in training set, with the indirect approach
library(dplyr)
NewRow<-train %>%
filter(Churn_Yes==1) %>%
filter(tenure<=40) %>%
filter(`Contract_Two year`==0) %>%
filter(`InternetService_Fiber optic`==1)
train<-rbind(train,NewRow,NewRow)
rm(NewRow)
Re-run the models with a new larger train dataset
library(lattice)
library(caret)
set.seed(10)
trc<- trainControl(method = "repeatedcv", number=3,repeats=4, classProbs=FALSE)
logr <- train(Churn_Yes~., data=train, trControl= trc, family="binomial", method = "regLogistic")
…let see the new:
| Value | Feature |
|---|---|
| -2.185 | `InternetService_Fiber optic`1 |
| -1.146 | PercCharges |
| -0.860 | InternetService_Yes1 |
| -0.034 | InternetService_Fiber_CON_OnlineBackup |
| -0.010 | SeniorCitizen_CON_tenure |
| -0.003 | tenure_CON_OnlineBackup |
| 0.000 | tenure_QUADRATO |
| 0.018 | tenure_CON_PaymentMethod_Mailed_check |
| 0.021 | tenure |
| 0.374 | `PaymentMethod_Mailed check`1 |
| 0.400 | OnlineBackup_Yes1 |
| 1.407 | Bias |
| 1.441 | `Contract_Two year`1 |
logr$finalModel$tuneValue
## cost loss epsilon
## 10 1 L1 0.001
…and now check the new performance
| Null_Model | DecisionTree_train | DecisionTree_crossvalidation | rlog_train | rlog_crossvalidation | Parametr |
|---|---|---|---|---|---|
| 73 | 79 | 80 | 80 | 80 | accuracy |
| 0 | 38 | 41 | 49 | 51 | sensitivity |
library(gt)
gt(final_2[,6:9])
| Parametr | rlog_new_train | rlog_new_crossvalidation | rlog_new_test |
|---|---|---|---|
| accuracy | 83 | 78 | 76 |
| sensitivity | 83 | 69 | 61 |
By choosing the average performance on the crossvalidation and/or test datasets, it is summarized below:
Null Model: 0% sensitivity and 73% accuracy
Regression tree: 41% sensitivity and 80% accuracy
confMat_crossvalidation
## p_crossvalidation
## 0 1
## 0 982 64
## 1 214 146Logistic Regression with penalty: 51% Sensitivity and 80% accuracy
confMat_rlog_crossvalidation
## rlog_crossvalidation
## 0 1
## 0 939 107
## 1 178 182Logistic Regression with penalty and 3X same data from ex-ante cluster 1: 65% Sensitivity and 77% accuracy. This is choosen like the best model in our context of hypotheses.
confMat_rlog_new_crossvalidation
## rlog_new_crossvalidation
## 0 1
## 0 849 197
## 1 112 248
confMat_rlog_new_test
## rlog_new_test
## 0 1
## 0 834 185
## 1 151 236
In the last model, an important variability is recorded with respect to the sensitivity parameter (83% training set and 69-61% for CrossValidation and Test set), due to the reuse 3 times of the same data in training set that come from cluster 1. This technique caused an increase in the variance of the model.
It would therefore be appropriate to collect further new data belonging to cluster 1 in order to improve both the bias and the variance of the model.
Let’s try to understand how the performance of the best model (the latest one) changes by trying to modify the probability threshold value that determines the classification as Churn_Yes (1) or Churn_No (0) compared to the classic 50%
library(pROC)
library(caret)
result.predicted.prob <- predict(logr, test, type="prob") #
result.roc <- roc(test$Churn_Yes, result.predicted.prob$`1`) # Draw ROC curve.
plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")
result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
## threshold accuracy
## threshold 0.393304 0.7560455
So from the ROC graph, the threshold value that optimizes on test dataset the trade-off between sensitivity (78%) and specificity (69%), providing an accuracy of 75%, is 0.39 not the classic value of 0.5.
So let’s try to understand if the performances are actually these on crossvalidation and test dataset.
| Null_Model | DecisionTree_train | DecisionTree_crossvalidation | rlog_train | rlog_crossvalidation | Parametr |
|---|---|---|---|---|---|
| 73 | 79 | 80 | 80 | 80 | accuracy |
| 0 | 38 | 41 | 49 | 51 | sensitivity |
library(gt)
gt(final_2[,6:9])
| Parametr | rlog_new_train | rlog_new_crossvalidation | rlog_new_test |
|---|---|---|---|
| accuracy | 82 | 77 | 76 |
| sensitivity | 87 | 73 | 69 |
Ultimately by choosing the latest model (the logistic regression with penalty and with three times as much data from cluster 1) as the optimal one and having lowered the threshold value from 50% to just under 40%, we have (always considering the performance achieved on the crossvalidation and test dataset):
a further improvement for Sensitivity from 65% to 70%
same value of Accuracy (77%-76,5%)
The new confusion matrices are shown below:
confMat_rlog_new_crossvalidation
## rlog_new_crossvalidation
## 0 1
## 0 813 233
## 1 96 264
confMat_rlog_new_test
## rlog_new_test
## 0 1
## 0 796 223
## 1 120 267In this chapter we proceed to the visual interpretation of the best performing model through graphs:
on the left, the feature importance plot that is a global representation, that looks all of your observations and tells you which features (columns that help you predict) have in-general the most predictive value for your model. In this case the two most important features are: Internet Service with fiber optic and tenure
on the right, the Breakdown plot is a local representation that explains one specific observation (one casual row) and give the the intercept (starting value) and the positive or negative contribution that each feature has to developing the prediction. In red there are the features that give a negative contribution to the probability of becoming Churn and in green those that give a positive contribution (you can see the consistency with what we saw in the chapter related to EDA). For example the active internet Service with fiber optic increase the probability of churn, while increasing tenure decrease it.