0. INTRODUCCION

Este conjunto de datos contiene detalles de los clientes de un banco y la variable objetivo es una variable binaria que refleja el hecho de si el cliente dejó el banco (cerró su cuenta) o si continúa siendo un cliente.

VARIABLES QUE CONTIENE EL DATASET

CreditScore: el rango de puntaje crediticio es de 350 a 850.

Geography: el banco regional tiene clientes de tres países: Francia, Alemania y España.

Age: el rango de edad del cliente es de 18 a 92.

Tenure:años que el cliente ha permanecido en el banco.

Balance: la cantidad de dinero disponible para retirar.

NumOfProducts: número de productos que los clientes utilizan en el banco.

IsActiveMember: 1 indica que está activo.

EstimatedSalary: el salario anual informado por el cliente.

Exited: si el cliente ha abandonado (cerrado la cuenta bancaria), 1 indica abandono.

FUENTE: https://www.kaggle.com/shrutimechlearn/churn-modelling

1. CARGA,PREPARACION Y LIMPIEZA DE DATOS.

library(tidyverse)
library(kableExtra)
library(caret)
library(corrplot)
library(ggpubr)
library(cowplot)
library(pROC)
library(kableExtra)
library(vip)
library(pdp)
library(tidymodels)
library(doParallel)
library(randomForest)
library(rpart)
library(xgboost)
library(embed)
library(rpart.plot)
library(rattle)
library(ROCR)

options(scipen=999)#Desactiva la notacion cientifica

churn <- read.csv("Churn_Modelling.csv")

Comprobamos si hay valores NA en nuestros datos.

colSums(is.na(churn))

##       RowNumber      CustomerId         Surname     CreditScore       Geography 
##               0               0               0               0               0 
##          Gender             Age          Tenure         Balance   NumOfProducts 
##               0               0               0               0               0 
##       HasCrCard  IsActiveMember EstimatedSalary          Exited 
##               0               0               0               0

y vemos que no tenemos y comprobamos la estructura de nuestros datos.

str(churn)

## 'data.frame':    10000 obs. of  14 variables:
##  $ RowNumber      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ Surname        : chr  "Hargrave" "Hill" "Onio" "Boni" ...
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

summary(churn)

##    RowNumber       CustomerId         Surname           CreditScore   
##  Min.   :    1   Min.   :15565701   Length:10000       Min.   :350.0  
##  1st Qu.: 2501   1st Qu.:15628528   Class :character   1st Qu.:584.0  
##  Median : 5000   Median :15690738   Mode  :character   Median :652.0  
##  Mean   : 5000   Mean   :15690941                      Mean   :650.5  
##  3rd Qu.: 7500   3rd Qu.:15753234                      3rd Qu.:718.0  
##  Max.   :10000   Max.   :15815690                      Max.   :850.0  
##   Geography            Gender               Age            Tenure      
##  Length:10000       Length:10000       Min.   :18.00   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:32.00   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Median :37.00   Median : 5.000  
##                                        Mean   :38.92   Mean   : 5.013  
##                                        3rd Qu.:44.00   3rd Qu.: 7.000  
##                                        Max.   :92.00   Max.   :10.000  
##     Balance       NumOfProducts    HasCrCard      IsActiveMember  
##  Min.   :     0   Min.   :1.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     0   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 97199   Median :1.00   Median :1.0000   Median :1.0000  
##  Mean   : 76486   Mean   :1.53   Mean   :0.7055   Mean   :0.5151  
##  3rd Qu.:127644   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :250898   Max.   :4.00   Max.   :1.0000   Max.   :1.0000  
##  EstimatedSalary         Exited      
##  Min.   :    11.58   Min.   :0.0000  
##  1st Qu.: 51002.11   1st Qu.:0.0000  
##  Median :100193.91   Median :0.0000  
##  Mean   :100090.24   Mean   :0.2037  
##  3rd Qu.:149388.25   3rd Qu.:0.0000  
##  Max.   :199992.48   Max.   :1.0000

Comprobamos los estadisticos de nuestros datos, para ver cualquier indicio a destacar o anormal en nuestros datos.

lapply(churn, summary)

## $RowNumber
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    2501    5000    5000    7500   10000 
## 
## $CustomerId
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 15565701 15628528 15690738 15690941 15753234 15815690 
## 
## $Surname
##    Length     Class      Mode 
##     10000 character character 
## 
## $CreditScore
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   350.0   584.0   652.0   650.5   718.0   850.0 
## 
## $Geography
##    Length     Class      Mode 
##     10000 character character 
## 
## $Gender
##    Length     Class      Mode 
##     10000 character character 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   32.00   37.00   38.92   44.00   92.00 
## 
## $Tenure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.013   7.000  10.000 
## 
## $Balance
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0   97199   76486  127644  250898 
## 
## $NumOfProducts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    1.53    2.00    4.00 
## 
## $HasCrCard
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.7055  1.0000  1.0000 
## 
## $IsActiveMember
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5151  1.0000  1.0000 
## 
## $EstimatedSalary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     11.58  51002.11 100193.91 100090.24 149388.25 199992.48 
## 
## $Exited
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2037  0.0000  1.0000

segun su estructura podemos deducir que podemos poner algunas variables como factores:

churn$Geography <- as.factor(churn$Geography)
churn$Gender <- as.factor(churn$Gender)
churn$HasCrCard <- as.factor(churn$HasCrCard)
churn$IsActiveMember <- as.factor(churn$IsActiveMember)
churn$Exited <- as.factor(churn$Exited)
churn$Tenure <- as.factor(churn$Tenure)
churn$NumOfProducts <- as.factor(churn$NumOfProducts)

Comprobamos la conversion hecha de factores:

str(churn)

## 'data.frame':    10000 obs. of  14 variables:
##  $ RowNumber      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ Surname        : chr  "Hargrave" "Hill" "Onio" "Boni" ...
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
##  $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : Factor w/ 11 levels "0","1","2","3",..: 3 2 9 2 3 9 8 5 5 3 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : Factor w/ 4 levels "1","2","3","4": 1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 1 2 ...
##  $ IsActiveMember : Factor w/ 2 levels "0","1": 2 2 1 1 2 1 2 1 2 2 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 1 ...

Eliminamos variables que no aportan a las operaciones ni al modelaje y guardamos algunas de ellas en una nueva variable para unirlas al conjunto de datos final.

dt1 <- churn%>%select(CustomerId, Surname)

churn$RowNumber=NULL
churn$CustomerId=NULL
churn$Surname=NULL

y hechamos un vistazo preliminar a como nos queda nuestro dataset finalmente:

str(churn)

## 'data.frame':    10000 obs. of  11 variables:
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
##  $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : Factor w/ 11 levels "0","1","2","3",..: 3 2 9 2 3 9 8 5 5 3 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : Factor w/ 4 levels "1","2","3","4": 1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 1 2 ...
##  $ IsActiveMember : Factor w/ 2 levels "0","1": 2 2 1 1 2 1 2 1 2 2 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 1 ...

kable(head(churn,6), booktabs = T) %>%
  kable_styling(font_size=10)

CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
619	France	Female	42	2	0.00	1	1	1	101348.88	1
608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
502	France	Female	42	8	159660.80	3	1	0	113931.57	1
699	France	Female	39	1	0.00	2	0	0	93826.63	0
850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0
645	Spain	Male	44	8	113755.78	2	1	0	149756.71	1

2. ANALISIS DE LAS VARIABLES DE NUESTRO DATASET

2.1 ANALISIS DE LA VARIABLE OBJETIVO.

Vemos la tabla de contingencia y el porcentaje de abandono graficamentente que representan nuestros datos.

table(churn$Exited)

## 
##    0    1 
## 7963 2037

levels(churn$Exited)=c("No_abandono", "abandono")

table(churn$Exited)

## 
## No_abandono    abandono 
##        7963        2037

prop.table(table(churn$Exited))

## 
## No_abandono    abandono 
##      0.7963      0.2037

ggplot(data = churn, aes(x = factor(Exited), 
                        y = prop.table(stat(count)), fill = factor(Exited),
                        label = scales::percent_format(accuracy=0.01)(prop.table(stat(count))))) +
  geom_bar(position = "dodge") + 
  geom_text(stat = 'count',
            position = position_dodge(0.9), 
            vjust = -0.5, 
            size = 3.5) + coord_flip()+
  scale_y_continuous(labels = scales::percent_format(accuracy=0.01))+labs(
    x ='Abandono',y ='Porcentaje de Abandono')+ggtitle('TASA DE ABANDONO')

En primer lugar vamos a ver los datos que nos arroja en proporcion nuestra variable objetivo, por si baja de un nivel minimo aceptable alguno de sus niveles o factores y hubiera que balancear los datos (entendiendo cuestionarselo por debajo del 10% de algun nivel)

y comprobamos que no se da el caso ya que por lo menos tenemos un nivel con 20.37%, por lo tanto lo consideramos suficiente.

2.2 ANALISIS DE LAS VARIABLES CONTINUAS

Vamos ahora a mostrar los histogramas de las variables continuas:

s1 <-ggplot(data = churn, aes(x=CreditScore))+
  geom_histogram(color="white", fill="blue")+
  theme_bw() +
  theme(legend.position = "none")

s2 <- ggplot(data = churn, aes(x=Age))+
  geom_histogram(color="white", fill="purple") +
  theme_bw() +
  theme(legend.position = "none")

s3 <- ggplot(data = churn, aes(x=Balance))+
  geom_histogram(color="white", fill="green") +
  theme_bw() +
  theme(legend.position = "none")

s4 <- ggplot(data = churn, aes(x=EstimatedSalary))+
  geom_histogram(color="white", fill="red") +
  theme_bw() +
  theme(legend.position = "none")

plot_grid(s1, s2, s3,s4,labels = "AUTO")

Por edad se puede ver que el numero de clientes del banco, es sensiblemente superior entre los 25 y los 50 años (y dentro de este intervalo entre 30 y 40 años aprox.), siendo en las demás franjas de edad mucho menores el número de clientes y a partir de 50 se van reduciendo ostensiblemente. (curva con mayor asimetría hacia la derecha).

El rango de puntaje crediticio (CreditScore), se distribuye como una normal, estableciéndose entre 600 y 700 su mayor puntaje de crédito.

Hay una gran cantidad de clientes con un Balance a 0 importante más allá de los 3500 clientes, el resto de clientes tiene un Balance acorde a una distribución normal.

En cuanto al salario estimado esta bastante igualado en todos los niveles salariales de los clientes, exceptuando los de 0 y 200000 que son los 2 grupos de clientes que están a distinto nivel en general en cuanto a cantidad de clientes de esta relación.

BOXPLOT VARIABLES CUANTITATIVAS RESPECTO TASA DE ABANDONO

Vamos a comparar nuestras variables continuas con la variable objetivo especificada.

a1 <- ggplot(churn, aes(x = Exited, y = CreditScore, fill = Exited)) +
  geom_boxplot()+ggtitle("CHURN-CREDITSCORE")

a2 <- ggplot(churn, aes(x = Exited, y = Age , fill = Exited)) +
  geom_boxplot()+ggtitle("CHURN-AGE")

a3 <- ggplot(churn,aes(x = Exited, y = Balance, fill = Exited)) +
  geom_boxplot()+ggtitle("CHURN-BALANCE")

a4 <- ggplot(churn,aes(x = Exited, y = EstimatedSalary, fill = Exited)) +
  geom_boxplot()+ggtitle("CHURN-ESTIMATEDSALARY")

ggarrange(a1, a2,a3,a4, 
                  labels = "AUTO", 
                  common.legend = T, 
                  legend = "bottom", 
                  align = "hv",
                  ncol = 2)

$1 $2 attr(,“class”) [1] “list” “ggarrange”

Se puede decir con respecto al rango de puntaje crediticio (CreditScore) que en cuanto al abandono es muy similar al de los clientes que no lo hacen, inclusive dentro de un rango de valores equivalente entre aprox. 600-700 de puntaje crediticio, siendo su varianza prácticamente muy similar en torno a la media.

En la mayoría de casos los clientes de edad entre 30 y 40 años no abandonan el banco, a partir de los 40 hasta los 50 hay la mayor relación de abandonos, la varianza de los que no abandonan es un poco menor a la de los clientes que abandonan en este aspecto.

En cuanto al balance la tasa de abandono y no abandono esta equilibrada entre los clientes entre aprox. 40000 y un poco superior a 110000, siendo sus medias respectivas entre aprox.90000(No_abandono) y 110000(abandono). La varianza de los que no abandonan es superior a los que abandonan, los que abandonan lo hacen a partir de un balance aprox. de 40000 en un rango entre 40000 y 130000 aprox.

El salario estimado(estimatedsalary), es prácticamente igual en proporción entre los que abandonan y los que no (misma varianza aprox.) y su media tanto de unos como de otros esta sobre los 100000 de salario estimado.

2.2.1 CORRELACION VARIABLES CONTINUAS

Vamos ahora a comprobar si existe correlacion entre las variables continuas con el fin de evitar una posible multicolinealidad.

Para ello hacemos una matriz de correlacion de dichas variables.

dt <- churn%>%select(c("CreditScore", "Age", "Balance","EstimatedSalary"))
dat.cor <- cor(dt, method = "pearson")

corrplot(dat.cor, method = "shade",
         shade.col = NA, tl.col = "black",
         tl.srt = 90,
         addCoef.col = "black", addcolorlabel = "no",
         order = "AOE")

La correlacion es muy baja por tanto no existen problemas de multicolinealidad con nuestras variables.

2.3 ANALISIS VARIABLES CATEGORICAS

Veremos ahora graficamente los datos referentes a nuestras variables categoricas. (Exceptuando variable objetivo ya vista)

Geography <- churn%>%
  group_by(Geography)%>%
  summarise(Count= n())  

p1 <- ggbarplot(Geography, x = "Geography", y = "Count",
                fill = "Geography", color = "Geography", palette = "jco",label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

Gender <- churn%>%
  group_by(Gender)%>%
  summarise(Count= n())  

p2 <- ggbarplot(Gender, x = "Gender", y = "Count",
                fill = "Gender", color = "Gender", palette = "jco",label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

HasCrCard <- churn%>%
  group_by(HasCrCard)%>%
  summarise(Count= n())  

p3 <- ggbarplot(HasCrCard, x = "HasCrCard", y = "Count",
                fill = "HasCrCard", color = "HasCrCard", palette = "jco",label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

IsActiveMember <- churn%>%
  group_by(IsActiveMember)%>%
  summarise(Count= n()) 

p4 <- ggbarplot(IsActiveMember, x = "IsActiveMember", y = "Count",
                fill = "IsActiveMember", color = "IsActiveMember", palette = "jco",label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

Tenure  <-churn%>%
  group_by(Tenure)%>%
  summarise(Count= n()) 

p5 <- ggbarplot(Tenure, x = "Tenure", y = "Count",
                fill = "Tenure", color = "Tenure", palette = c("blue", "purple", "brown2", "red", "cyan1", "green", "yellow", "deeppink1", "forestgreen", "darksalmon", "gold1"),label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

NumOfProducts  <-churn%>%
  group_by(NumOfProducts)%>%
  summarise(Count= n()) 

p6 <- ggbarplot(NumOfProducts, x = "NumOfProducts", y = "Count",
                fill = "NumOfProducts", color = "NumOfProducts", palette = "jco",label=TRUE,lab.size=2.5,lab.pos = "out", lab.col = "black", sort.val="desc",sort.by.groups = FALSE)+theme(legend.position="none")

ggarrange(p1, p2,p3,p4,p5,p6, 
                  labels = "AUTO", 
                  common.legend = F, 
                  legend = NULL, 
                  align = "hv",
                  ncol = 2)

$1 $2 $3 attr(,“class”) [1] “list” “ggarrange”

Los clientes son fundamentalmente de 3 países, Francia, Alemania y España, mayoritariamente son de Francia como se puede ver en la gráfica, seguidos con un número similar de clientes y a distancia del primero de Alemania y España.

Hay un numero mas o menos equilibrado entre hombres y mujeres clientes y de miembros activos y no activos, los que tienen tarjeta superan ampliamente a los que no la tienen.

La mayoría de los clientes son poseedores de uno o dos productos financieros del banco, muy significativamente por encima de los que tienen 3 o 4 como grupos minoritarios.

Los clientes que se quedan en el banco son significativamente superiores a los que abandonan el mismo.

Del numero de años que permanece un cliente en el banco, los clientes que no tienen 1 año y los que tienen 10 en el mismo son significativamente inferiores en número al número que tiene el resto en función de los años de permanencia en el banco.

Comparamos ahora cada variable categorica con respecto a nuestra variable objetivo para sacar conclusiones:

Geo <- churn%>%
  group_by(Geography,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100))  

c1 <- ggbarplot(Geo, x = "Geography", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in",lab.size=2.5, lab.col = "white")+theme(legend.position="none")

Gen <- churn%>%
  group_by(Gender,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100))  

c2 <- ggbarplot(Gen, x = "Gender", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in",lab.size=2.5, lab.col = "white")+theme(legend.position="none")

HCrCard <- churn%>%
  group_by(HasCrCard,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100))  

c3 <- ggbarplot(HCrCard, x = "HasCrCard", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in",lab.size=2.5, lab.col = "white")+theme(legend.position="none")

IsActMember <- churn%>%
  group_by(IsActiveMember,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100))

c4 <- ggbarplot(IsActMember, x = "IsActiveMember", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in",lab.size=2.5, lab.col = "white")+theme(legend.position="none")

Ten  <-churn%>%
  group_by(Tenure,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100))

c5 <- ggbarplot(Ten, x = "Tenure", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in",lab.size=2.5, lab.col = "white")+theme(legend.position="none")

NumOfProd  <-churn%>%
  group_by(NumOfProducts,Exited)%>%
  summarise(Count= n())%>%
  mutate(porcentaje=round(Count/sum(Count)*100)) 

c6 <- ggbarplot(NumOfProd, x = "NumOfProducts", y = "porcentaje",
                fill = "Exited", color = "Exited", palette = "Paired",label=TRUE,lab.pos = "in", lab.size=2.5, lab.col = "white")+theme(legend.position="none")

ggarrange(c1,c2,c3,c4,c5,c6,  
                  labels = "AUTO",
                  common.legend = T, 
                  legend = "bottom", 
                  align = "hv", 
                  ncol = 2)

$1 $2 $3 attr(,“class”) [1] “list” “ggarrange”

De entre los 3 países con clientes en el banco, los clientes alemanes son los que mas abandonan, mientras que los 2 países restantes tienen una tasa de abandono similar en porcentaje en ambos casos.

De los clientes abandonan mas mujeres que hombres con escasa diferencia.

El abandono es similar entre los clientes que tienen tarjeta y los que no la tienen.

Hay una diferencia del doble significativa de abandono de los clientes que no son miembros activos con respecto a los clientes que si lo son.

La tasa de abandono de los clientes teniendo en cuenta el número de años de permanencia en el banco como cliente, es similar en todos los casos independientemente del numero de años que lleven estos en el mismo.

Es muy significativamente superior la tasa de abandono de los clientes que tienen entre 3 y 4 como numero de productos en el banco que los que tienen 1 o 2 de los mismos.(La tasa de abandono de los que tienen 4 es del 100%).

Ahora cambiamos los niveles de la estructura de la target por 0 y 1 en vez de 1 y 2.

levels(churn$Exited)<-c('0','1')
table(churn$Exited)

## 
##    0    1 
## 7963 2037

3. DISCRETIZACION DE VARIABLES NUMERICAS EN EL DATASET

churn2 <- recipe(Exited~.,
                  data = churn)%>%
  step_discretize_xgb(all_numeric(), outcome = "Exited")%>%
  prep()%>%
  juice

4. MODELIZACION DE DATOS.

FUNCIONES QUE VAMOS A APLICAR A LOS MODELOS

Funcion para crear una matriz de confusion

confusion<-function(real,scoring,umbral){ 
  conf<-table(real,scoring>=umbral)
  if(ncol(conf)==2) return(conf) else return(NULL)
}

Funcion para calcular las metricas de los modelos: acierto, precision, cobertura y F1

metricas<-function(matriz_conf){
  acierto <- (matriz_conf[1,1] + matriz_conf[2,2]) / sum(matriz_conf) *100
  precision <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[1,2]) *100
  cobertura <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[2,1]) *100
  F1 <- 2*precision*cobertura/(precision+cobertura)
  salida<-c(acierto,precision,cobertura,F1)
  return(salida)
}

Funcion para probar distintos umbrales y ver el efecto sobre precision y cobertura

umbrales<-function(real,scoring){
  umbrales<-data.frame(umbral=rep(0,times=19),acierto=rep(0,times=19),precision=rep(0,times=19),cobertura=rep(0,times=19),F1=rep(0,times=19))
  cont <- 1
  for (cada in seq(0.05,0.95,by = 0.05)){
    datos<-metricas(confusion(real,scoring,cada))
    registro<-c(cada,datos)
    umbrales[cont,]<-registro
    cont <- cont + 1
  }
  return(umbrales)
}

Funciones que calculan la curva ROC y el AUC

roc<-function(prediction){
  r<-performance(prediction,'tpr','fpr')
  plot(r)
}

auc<-function(prediction){
  a<-performance(prediction,'auc')
  return(a@y.values[[1]])
}

DIVISION DE LOS DATOS ENTRENAMIENTO Y PRUEBA

Creamos las particiones de entrenamiento y prueba, 70% y 30% respectivamente.

Vamos a aplicar un cross-validation a todos los modelos de vfold=10.

4.1 Regresion Logistica

set.seed(45)
split_inicial <- initial_split(
  data   = churn2,
  prop   = 0.7,
  strata = Exited
)

train <- training(split_inicial)
test  <- testing(split_inicial)

# Hacemos preprocesamiento de datos

mod_rcp <- recipe(Exited~.,
                  data = train)

# y construimos el modelo

mod_rl <- logistic_reg(mode = 'classification') %>%
  set_engine(engine = "glm",family=binomial(link='logit'))


set.seed(45)
cv_folds <- vfold_cv(
  data    = train,
  v       = 10,
  strata  = Exited
)


wf_m <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rl)


my_metrics <- metric_set(accuracy,roc_auc,kap)

set.seed(45)
rlog <- wf_m%>%
  fit_resamples(
    resamples    = cv_folds,
    metrics      = my_metrics,
    control      = control_resamples(save_pred = TRUE)
  )

collect_metrics(rlog)

## # A tibble: 3 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.858    10 0.00388 Preprocessor1_Model1
## 2 kap      binary     0.486    10 0.0148  Preprocessor1_Model1
## 3 roc_auc  binary     0.847    10 0.00561 Preprocessor1_Model1

show_best(rlog, metric = "roc_auc")

## # A tibble: 1 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.847    10 0.00561 Preprocessor1_Model1

best_rlog2 <- rlog%>%
  select_best("roc_auc")

wf_final <- 
  wf_m%>%
  finalize_workflow(best_rlog2)


set.seed(45)
final_model_rl <- fit(object = wf_final, data = train)

model_extr1 <- final_model_rl%>%extract_fit_engine()


summary(model_extr1)

## 
## Call:
## stats::glm(formula = ..y ~ ., family = ~binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4098  -0.5503  -0.3238  -0.1512   3.2082  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                           -1.186333   0.259253  -4.576
## CreditScore[521,564)                   0.135007   0.158926   0.849
## CreditScore[564,625)                  -0.164440   0.140742  -1.168
## CreditScore[625,648)                  -0.039350   0.164362  -0.239
## CreditScore[648,727)                  -0.265109   0.133078  -1.992
## CreditScore[727, Inf]                  0.015334   0.137038   0.112
## GeographyGermany                       0.961644   0.091350  10.527
## GeographySpain                         0.049752   0.095092   0.523
## GenderMale                            -0.569275   0.073256  -7.771
## Age[35,37)                             0.375462   0.161838   2.320
## Age[37,40)                             0.598920   0.129947   4.609
## Age[40,42)                             0.990112   0.147255   6.724
## Age[42,46)                             1.503760   0.120664  12.462
## Age[46,53)                             2.392812   0.117030  20.446
## Age[53, Inf]                           2.427798   0.124060  19.570
## Tenure1                               -0.173201   0.201117  -0.861
## Tenure2                               -0.205613   0.202676  -1.014
## Tenure3                               -0.234698   0.203052  -1.156
## Tenure4                               -0.163163   0.203253  -0.803
## Tenure5                               -0.313372   0.203507  -1.540
## Tenure6                               -0.218890   0.203744  -1.074
## Tenure7                               -0.277860   0.203017  -1.369
## Tenure8                               -0.372963   0.203247  -1.835
## Tenure9                               -0.329110   0.204302  -1.611
## Tenure10                              -0.340140   0.234921  -1.448
## Balance[8.108e+04,1.043e+05)          -0.430797   0.125569  -3.431
## Balance[1.043e+05,1.2e+05)             0.189442   0.118055   1.605
## Balance[1.2e+05,1.346e+05)            -0.078067   0.121546  -0.642
## Balance[1.346e+05,1.443e+05)          -0.254790   0.155394  -1.640
## Balance[1.443e+05,1.571e+05)          -0.185218   0.160250  -1.156
## Balance[1.571e+05, Inf]                0.049836   0.150163   0.332
## NumOfProducts2                        -1.529965   0.086785 -17.629
## NumOfProducts3                         2.697017   0.230612  11.695
## NumOfProducts4                        16.523091 197.073682   0.084
## HasCrCard1                            -0.073725   0.079836  -0.923
## IsActiveMember1                       -1.119328   0.077241 -14.491
## EstimatedSalary[2.051e+04,4.055e+04)   0.090590   0.167554   0.541
## EstimatedSalary[4.055e+04,8.082e+04)   0.106726   0.144300   0.740
## EstimatedSalary[8.082e+04,1.191e+05)   0.009553   0.146075   0.065
## EstimatedSalary[1.191e+05,1.596e+05)   0.046448   0.145106   0.320
## EstimatedSalary[1.596e+05, Inf]        0.101894   0.144955   0.703
##                                                  Pr(>|z|)    
## (Intercept)                           0.00000474040207982 ***
## CreditScore[521,564)                             0.395605    
## CreditScore[564,625)                             0.242655    
## CreditScore[625,648)                             0.810789    
## CreditScore[648,727)                             0.046356 *  
## CreditScore[727, Inf]                            0.910906    
## GeographyGermany                     < 0.0000000000000002 ***
## GeographySpain                                   0.600835    
## GenderMale                            0.00000000000000779 ***
## Age[35,37)                                       0.020342 *  
## Age[37,40)                            0.00000404717982816 ***
## Age[40,42)                            0.00000000001770755 ***
## Age[42,46)                           < 0.0000000000000002 ***
## Age[46,53)                           < 0.0000000000000002 ***
## Age[53, Inf]                         < 0.0000000000000002 ***
## Tenure1                                          0.389131    
## Tenure2                                          0.310347    
## Tenure3                                          0.247741    
## Tenure4                                          0.422115    
## Tenure5                                          0.123596    
## Tenure6                                          0.282670    
## Tenure7                                          0.171107    
## Tenure8                                          0.066501 .  
## Tenure9                                          0.107202    
## Tenure10                                         0.147648    
## Balance[8.108e+04,1.043e+05)                     0.000602 ***
## Balance[1.043e+05,1.2e+05)                       0.108559    
## Balance[1.2e+05,1.346e+05)                       0.520686    
## Balance[1.346e+05,1.443e+05)                     0.101081    
## Balance[1.443e+05,1.571e+05)                     0.247761    
## Balance[1.571e+05, Inf]                          0.739981    
## NumOfProducts2                       < 0.0000000000000002 ***
## NumOfProducts3                       < 0.0000000000000002 ***
## NumOfProducts4                                   0.933182    
## HasCrCard1                                       0.355769    
## IsActiveMember1                      < 0.0000000000000002 ***
## EstimatedSalary[2.051e+04,4.055e+04)             0.588740    
## EstimatedSalary[4.055e+04,8.082e+04)             0.459534    
## EstimatedSalary[8.082e+04,1.191e+05)             0.947855    
## EstimatedSalary[1.191e+05,1.596e+05)             0.748895    
## EstimatedSalary[1.596e+05, Inf]                  0.482095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7073.9  on 6998  degrees of freedom
## Residual deviance: 4888.4  on 6958  degrees of freedom
## AIC: 4970.4
## 
## Number of Fisher Scoring iterations: 14

Vamos a seleccionar las variables altamente signifcativas(***), para volver a pasar el modelo solo con estas, asi pues:

df <- churn2%>%select(-CreditScore,-Tenure,-HasCrCard,-EstimatedSalary)

set.seed(45)
split_inicial <- initial_split(
  data   = df,
  prop   = 0.7,
  strata = Exited
)

train <- training(split_inicial)
test  <- testing(split_inicial)


# Hacemos preprocesamiento de datos

mod_rcp <- recipe(Exited~.,
                  data = train)


# y construimos el modelo


mod_rl <- logistic_reg(mode = 'classification') %>%
  set_engine(engine = "glm",family=binomial(link='logit'))


set.seed(45)
cv_folds <- vfold_cv(
  data    = train,
  v       = 10,
  strata  = Exited)


wf_m <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rl)


my_metrics <- metric_set(accuracy,roc_auc,kap)

set.seed(45)
rlog <- wf_m%>%
  fit_resamples(
    resamples    = cv_folds,
    metrics      = my_metrics,
    control      = control_resamples(save_pred = TRUE)
  )

collect_metrics(rlog)

## # A tibble: 3 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.856    10 0.00427 Preprocessor1_Model1
## 2 kap      binary     0.479    10 0.0158  Preprocessor1_Model1
## 3 roc_auc  binary     0.848    10 0.00582 Preprocessor1_Model1

show_best(rlog, metric = "roc_auc")

## # A tibble: 1 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.848    10 0.00582 Preprocessor1_Model1

best_rlog3 <- rlog%>%
  select_best("roc_auc")

wf_final <- 
  wf_m%>%
  finalize_workflow(best_rlog3)


set.seed(45)
final_model_rl <- fit(object = wf_final, data = train)

model_extr2 <- final_model_rl%>%extract_fit_engine()


summary(model_extr2)

## 
## Call:
## stats::glm(formula = ..y ~ ., family = ~binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3370  -0.5566  -0.3119  -0.1606   3.2073  
## 
## Coefficients:
##                               Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                   -1.50792    0.11298 -13.346 < 0.0000000000000002
## GeographyGermany               0.95511    0.09096  10.500 < 0.0000000000000002
## GeographySpain                 0.03624    0.09467   0.383             0.701863
## GenderMale                    -0.56434    0.07296  -7.735   0.0000000000000103
## Age[35,37)                     0.36688    0.16146   2.272             0.023074
## Age[37,40)                     0.59914    0.12953   4.626   0.0000037354500132
## Age[40,42)                     0.97065    0.14675   6.614   0.0000000000373779
## Age[42,46)                     1.49764    0.12026  12.454 < 0.0000000000000002
## Age[46,53)                     2.37758    0.11631  20.441 < 0.0000000000000002
## Age[53, Inf]                   2.41459    0.12331  19.581 < 0.0000000000000002
## Balance[8.108e+04,1.043e+05)  -0.42791    0.12492  -3.425             0.000614
## Balance[1.043e+05,1.2e+05)     0.18154    0.11764   1.543             0.122794
## Balance[1.2e+05,1.346e+05)    -0.07128    0.12088  -0.590             0.555402
## Balance[1.346e+05,1.443e+05)  -0.24685    0.15501  -1.593             0.111269
## Balance[1.443e+05,1.571e+05)  -0.20058    0.15988  -1.255             0.209643
## Balance[1.571e+05, Inf]        0.04917    0.14957   0.329             0.742333
## NumOfProducts2                -1.52930    0.08646 -17.688 < 0.0000000000000002
## NumOfProducts3                 2.62837    0.22844  11.506 < 0.0000000000000002
## NumOfProducts4                16.47790  197.25339   0.084             0.933425
## IsActiveMember1               -1.10792    0.07680 -14.427 < 0.0000000000000002
##                                 
## (Intercept)                  ***
## GeographyGermany             ***
## GeographySpain                  
## GenderMale                   ***
## Age[35,37)                   *  
## Age[37,40)                   ***
## Age[40,42)                   ***
## Age[42,46)                   ***
## Age[46,53)                   ***
## Age[53, Inf]                 ***
## Balance[8.108e+04,1.043e+05) ***
## Balance[1.043e+05,1.2e+05)      
## Balance[1.2e+05,1.346e+05)      
## Balance[1.346e+05,1.443e+05)    
## Balance[1.443e+05,1.571e+05)    
## Balance[1.571e+05, Inf]         
## NumOfProducts2               ***
## NumOfProducts3               ***
## NumOfProducts4                  
## IsActiveMember1              ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7073.9  on 6998  degrees of freedom
## Residual deviance: 4910.1  on 6979  degrees of freedom
## AIC: 4950.1
## 
## Number of Fisher Scoring iterations: 14

y vemos ahora que todas las variables cuentan al menos en alguno de sus niveles con 3 estrellas(***).

#La Prediccion del scoring del modelo (conjunto test) sera:

rl_predict <- predict(model_extr2,test,type = "response")
head(rl_predict,20)

##           1           2           3           4           5           6 
## 0.935066782 0.239896785 0.131684892 0.030510660 0.017472975 0.368563217 
##           7           8           9          10          11          12 
## 0.047381495 0.294724868 0.008928977 0.037643880 0.183282657 0.256283706 
##          13          14          15          16          17          18 
## 0.689627315 0.409606253 0.113664400 0.077260927 0.045772279 0.445767073 
##          19          20 
## 0.488146069 0.719686411

El resultado es el scoring de la probabilidad de un cliente de que abandone la empresa.

Veamos graficamente la pinta que tiene:

plot(rl_predict~test$Exited)

Calculamos ahora el pseudo R cuadrado, visualizando con anterioridad los coeficientes y significancia de acuerdo con lo que hemos decidido escoger para nuestro modelo:

psdo_R <- 1-(model_extr1$deviance/model_extr1$null.deviance)

psdo_R

## [1] 0.3089571

El valor que nos da es muy bueno, este valor nos indica que con las variables que tenemos somos capaces de explicar aprox. el 69% de los motivos por los que un cliente abandona, hay solo aprox. un 30% que no somos capaces de explicar.

Ahora tenemos que transformar la probabilidad en una decision de si el cliente va a abandonar o se mantiene.

Con la funcion umbrales probamos diferentes cortes y para este caso vamos a determinar el punto de corte que maximiza a la F1(Punto intermedio entre precision y cobertura).

umb_rl<-umbrales(test$Exited,rl_predict)
umb_rl

##    umbral  acierto precision cobertura        F1
## 1    0.05 51.18294  28.91745 95.588235 44.402277
## 2    0.10 64.64512  35.29797 88.071895 50.397382
## 3    0.15 71.70943  40.21470 79.575163 53.428415
## 4    0.20 77.24092  46.30593 72.712418 56.579784
## 5    0.25 80.13995  50.98765 67.483660 58.087201
## 6    0.30 82.10596  55.30410 63.888889 59.287339
## 7    0.35 83.87204  60.95890 58.169935 59.531773
## 8    0.40 84.70510  65.45455 52.941176 58.536585
## 9    0.45 84.77174  68.23529 47.385621 55.930569
## 10   0.50 85.13829  72.67760 43.464052 54.396728
## 11   0.55 84.77174  73.99381 39.052288 51.122995
## 12   0.60 84.37188  77.60618 32.843137 46.153846
## 13   0.65 84.23859  81.44796 29.411765 43.217287
## 14   0.70 84.13862  84.34343 27.287582 41.234568
## 15   0.75 82.80573  88.09524 18.137255 30.081301
## 16   0.80 82.37254  89.52381 15.359477 26.220363
## 17   0.85 81.67278  91.89189 11.111111 19.825073
## 18   0.90 81.37288 100.00000  8.660131 15.939850
## 19   0.95 80.60646 100.00000  4.901961  9.345794

Seleccionamos la formula para hallarlo, aunque tambien podemos deducirlo de la tabla que seria el numero mas alto de la tabla F1 del cual el siguiente baja, el que de el umbral corresponderia a esa fila.

umbral_final_rl<-umb_rl[which.max(umb_rl$F1),1]
umbral_final_rl

## [1] 0.35

Hallamos la matriz de confusion del umbral optimizado.

confusion(test$Exited,rl_predict,0.35)

##     
## real FALSE TRUE
##    0  2161  228
##    1   256  356

rl_metricas<-filter(umb_rl,umbral==umbral_final_rl)
rl_metricas

##   umbral  acierto precision cobertura       F1
## 1   0.35 83.87204   60.9589  58.16993 59.53177

Evaluamos la ROC

#creamos el objeto prediction
rl_prediction<-prediction(rl_predict,test$Exited)
#visualizamos la ROC
roc(rl_prediction)

#Sacamos las metricas definitivas incluyendo el AUC

rl_metricas<-cbind(rl_metricas,AUC=round(auc(rl_prediction),2)*100)
print(t(rl_metricas))

##               [,1]
## umbral     0.35000
## acierto   83.87204
## precision 60.95890
## cobertura 58.16993
## F1        59.53177
## AUC       84.00000

4.2 Arbol de decision

set.seed(45)
split_inicial <- initial_split(
  data   = churn2,
  prop   = 0.7,
  strata = Exited
)
train <- training(split_inicial)
test  <- testing(split_inicial)


mod_rp <- decision_tree(
  mode = 'classification',
  cost_complexity=0.00001,
  tree_depth = tune(),
  min_n = tune()
)%>%
  set_engine(engine = "rpart",parms = list(split="information"))


mod_rcp <- recipe(Exited~.,
                  data = train)

set.seed(45)
cv_folds <- vfold_cv(
  data = train,
  v = 10,
  strata = Exited
)

wf_rp <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rp)

registerDoParallel(cores = parallel::detectCores() - 1)

set.seed(45)
rp_fit <- wf_rp%>%tune_bayes(
  resamples = cv_folds,
  initial = 15,
  iter = 20,
  metrics = metric_set(roc_auc),
  control = control_bayes(save_pred=TRUE,no_improve = 10, verbose = FALSE))

stopImplicitCluster()

show_best(rp_fit, metric = "roc_auc")

## # A tibble: 5 x 9
##   tree_depth min_n .metric .estimator  mean     n std_err .config .iter
##        <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   <int>
## 1         13    29 roc_auc binary     0.843    10 0.00816 Iter13     13
## 2         13    32 roc_auc binary     0.842    10 0.00762 Iter4       4
## 3          9    28 roc_auc binary     0.842    10 0.00750 Iter14     14
## 4         10    38 roc_auc binary     0.842    10 0.00790 Iter20     20
## 5         10    40 roc_auc binary     0.842    10 0.00809 Iter8       8

mejor_hiper1 <- select_best(rp_fit, metric="roc_auc")
mejor_hiper1

## # A tibble: 1 x 3
##   tree_depth min_n .config
##        <int> <int> <chr>  
## 1         13    29 Iter13

wf_finalrp <- 
  wf_rp%>%
  finalize_workflow(mejor_hiper1)


set.seed(45)
final_model_rp <- fit(object = wf_finalrp, data = train)

model_ext <- final_model_rp%>%
  extract_fit_engine()

# Visualizamos el arbol del modelo.

fancyRpartPlot(model_ext)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Ahora revisamos la tabla donde podemos ver en primer termino el parametro de complejidad, error relativo,error absoluto, y desviacion estandar.

printcp(model_ext)

## 
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, parms = ~list(split = "information"), 
##     cp = ~0.00001, maxdepth = ~13, minsplit = min_rows(29, data))
## 
## Variables actually used in tree construction:
##  [1] Age             Balance         CreditScore     EstimatedSalary
##  [5] Gender          Geography       HasCrCard       IsActiveMember 
##  [9] NumOfProducts   Tenure         
## 
## Root node error: 1425/6999 = 0.2036
## 
## n= 6999 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.07017544      0   1.00000 1.00000 0.023641
## 2  0.02807018      3   0.78947 0.78947 0.021563
## 3  0.01684211      4   0.76140 0.75579 0.021184
## 4  0.00947368      6   0.72772 0.72772 0.020857
## 5  0.00701754      8   0.70877 0.72561 0.020832
## 6  0.00678363      9   0.70175 0.72702 0.020849
## 7  0.00421053     12   0.68140 0.71228 0.020673
## 8  0.00350877     14   0.67298 0.71509 0.020706
## 9  0.00327485     19   0.65404 0.71088 0.020656
## 10 0.00280702     22   0.64421 0.70807 0.020622
## 11 0.00245614     24   0.63860 0.70316 0.020562
## 12 0.00210526     26   0.63368 0.70877 0.020630
## 13 0.00175439     28   0.62947 0.70456 0.020579
## 14 0.00163743     34   0.61895 0.71298 0.020681
## 15 0.00154386     37   0.61404 0.71298 0.020681
## 16 0.00112281     43   0.60351 0.71368 0.020690
## 17 0.00105263     55   0.58175 0.70807 0.020622
## 18 0.00070175     59   0.57754 0.71789 0.020740
## 19 0.00060150     66   0.57263 0.74105 0.021014
## 20 0.00056140     74   0.56702 0.74105 0.021014
## 21 0.00035088     79   0.56421 0.74175 0.021022
## 22 0.00023392     81   0.56351 0.75018 0.021119
## 23 0.00001000     84   0.56281 0.76140 0.021248

En ella nos fijamos en el parametro de complejidad desde el que tomamos de partida 0.00001 abajo del todo hasta arriba de esa relacion, tenemos que mirar en la columna xerror(error absoluto), que empieza a bajar desde arriba.

En el momento que encontramos un numero donde empieza a cambiar la tendencia (a aumentar) ese seria, el parametro de complejidad que tenemos que tomar correspondiente a ese valor, y por ahi vamos a podar el arbol y ejecutar otra vez el modelo.

En este caso el parametro de complejidad que corresponderia a lo dicho anteriormente parece que minimiza aproximadamente en 0.001.

Tambien lo podemos interpretar viendo la representacion grafica (ejes: y- xerror x-cp) como vemos a continuacion.

plotcp(model_ext)

Cambiamos el parametro de complejidad de partida a 0.001 e incluimos en el arbol que no tenga mas de 5 niveles.

set.seed(45)
split_inicial <- initial_split(
  data   = churn2,
  prop   = 0.7,
  strata = Exited
)
train <- training(split_inicial)
test  <- testing(split_inicial)
  

mod_rp <- decision_tree(
  mode = 'classification',
  cost_complexity=0.001,
  tree_depth =5,
  min_n = tune()
)%>%
  set_engine(engine = "rpart",parms = list(split="information"))

mod_rcp <- recipe(Exited~.,
                  data = train)

set.seed(45)
cv_folds <- vfold_cv(
  data = train,
  v = 10,
  strata = Exited)

wf_rp <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rp)

registerDoParallel(cores = parallel::detectCores() - 1)

set.seed(45)
rp_fit <- wf_rp%>%tune_bayes(
  resamples = cv_folds,
  initial = 15,
  iter = 30,
  metrics = metric_set(roc_auc),
  control = control_bayes(save_pred=TRUE,no_improve = 10, verbose = FALSE))

stopImplicitCluster()

show_best(rp_fit, metric = "roc_auc")

## # A tibble: 5 x 8
##   min_n .metric .estimator  mean     n std_err .config               .iter
##   <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 <int>
## 1    37 roc_auc binary     0.799    10 0.00920 Preprocessor1_Model06     0
## 2    38 roc_auc binary     0.799    10 0.00920 Iter6                     6
## 3    39 roc_auc binary     0.799    10 0.00924 Preprocessor1_Model13     0
## 4    40 roc_auc binary     0.799    10 0.00925 Iter9                     9
## 5    34 roc_auc binary     0.799    10 0.00913 Preprocessor1_Model01     0

mejor_hiper2 <- select_best(rp_fit, metric="roc_auc")
mejor_hiper2

## # A tibble: 1 x 2
##   min_n .config              
##   <int> <chr>                
## 1    37 Preprocessor1_Model06

wf_finalrp <- 
  wf_rp%>%
  finalize_workflow(mejor_hiper2)


set.seed(45)
final_model_rp <- fit(object = wf_finalrp, data = train)

model_extr2 <- final_model_rp%>%
  extract_fit_engine()

#Visualizamos tabla complejidad-errores.

printcp(model_extr2)

## 
## Classification tree:
## rpart::rpart(formula = ..y ~ ., data = data, parms = ~list(split = "information"), 
##     cp = ~0.001, maxdepth = ~5, minsplit = min_rows(37, data))
## 
## Variables actually used in tree construction:
## [1] Age            Balance        Geography      IsActiveMember NumOfProducts 
## [6] Tenure        
## 
## Root node error: 1425/6999 = 0.2036
## 
## n= 6999 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.0701754      0   1.00000 1.00000 0.023641
## 2 0.0280702      3   0.78947 0.78947 0.021563
## 3 0.0168421      4   0.76140 0.75579 0.021184
## 4 0.0063158      6   0.72772 0.72772 0.020857
## 5 0.0035088      7   0.72140 0.73965 0.020997
## 6 0.0028070     10   0.71088 0.73895 0.020989
## 7 0.0024561     12   0.70526 0.73404 0.020932
## 8 0.0017544     14   0.70035 0.72842 0.020865
## 9 0.0010000     16   0.69684 0.72702 0.020849

Podemos ver una clara tendencia descendente de xerror(error absoluto) hasta llegar al parametro de complejidad 0.001 de partida y tanto su error relativo como el absoluto son bastane proximos en dicho parametro y en los demas.

Por tanto consideramos como buena dicha aproximacion.

#Visualizamos grafica error validacion cruzada y complejidad.

plotcp(model_extr2)

# Visualizamos el arbol del modelo.

fancyRpartPlot(model_extr2)

Reglas que podrian ser utiizadas para hacer implantacion del arbol.(y es Churn–la variable objetivo(dependiente) en las reglas del arbol)

rpart.rules(model_extr2, style = 'tall', cover = T,roundint=FALSE)

## ..y is 0.04 with cover 5% when
##     Age is [42,46) or [46,53) or [53, Inf]
##     NumOfProducts is 2
##     IsActiveMember is 1
##     Balance is [-Inf,8.108e+04) or [1.443e+05,1.571e+05)
## 
## ..y is 0.04 with cover 33% when
##     Age is [-Inf,35) or [35,37) or [37,40) or [40,42)
##     NumOfProducts is 2
## 
## ..y is 0.11 with cover 3% when
##     Age is [42,46) or [46,53)
##     NumOfProducts is 2
##     IsActiveMember is 0
##     Balance is [-Inf,8.108e+04) or [1.443e+05,1.571e+05)
## 
## ..y is 0.14 with cover 2% when
##     Age is [42,46)
##     NumOfProducts is 2
##     Balance is [8.108e+04,1.043e+05) or [1.043e+05,1.2e+05) or [1.2e+05,1.346e+05) or [1.346e+05,1.443e+05) or [1.571e+05, Inf]
## 
## ..y is 0.15 with cover 33% when
##     Age is [-Inf,35) or [35,37) or [37,40) or [40,42)
##     NumOfProducts is 1
## 
## ..y is 0.25 with cover 7% when
##     Age is [42,46) or [46,53) or [53, Inf]
##     NumOfProducts is 1
##     IsActiveMember is 1
##     Geography is France or Spain
## 
## ..y is 0.27 with cover 2% when
##     Age is [46,53) or [53, Inf]
##     NumOfProducts is 2
##     IsActiveMember is 1
##     Balance is [8.108e+04,1.043e+05) or [1.043e+05,1.2e+05) or [1.2e+05,1.346e+05) or [1.346e+05,1.443e+05) or [1.571e+05, Inf]
## 
## ..y is 0.29 with cover 0% when
##     Age is [-Inf,35) or [35,37) or [37,40) or [40,42)
##     NumOfProducts is 3 or 4
##     Balance is [-Inf,8.108e+04)
##     Tenure is 4 or 7 or 8 or 9
## 
## ..y is 0.48 with cover 3% when
##     Age is [42,46)
##     NumOfProducts is 1
##     IsActiveMember is 0
## 
## ..y is 0.53 with cover 3% when
##     Age is [42,46) or [46,53) or [53, Inf]
##     NumOfProducts is 1
##     IsActiveMember is 1
##     Geography is Germany
## 
## ..y is 0.61 with cover 1% when
##     Age is [46,53) or [53, Inf]
##     NumOfProducts is 2
##     IsActiveMember is 0
##     Balance is [8.108e+04,1.043e+05) or [1.043e+05,1.2e+05) or [1.2e+05,1.346e+05) or [1.346e+05,1.443e+05) or [1.571e+05, Inf]
## 
## ..y is 0.61 with cover 0% when
##     Age is [53, Inf]
##     NumOfProducts is 2
##     IsActiveMember is 0
##     Balance is [-Inf,8.108e+04) or [1.443e+05,1.571e+05)
## 
## ..y is 0.76 with cover 0% when
##     Age is [-Inf,35) or [35,37) or [37,40) or [40,42)
##     NumOfProducts is 3 or 4
##     Balance is [-Inf,8.108e+04)
##     Tenure is 1 or 2 or 3 or 5 or 6 or 10
## 
## ..y is 0.81 with cover 7% when
##     Age is [46,53) or [53, Inf]
##     NumOfProducts is 1 or 3 or 4
##     IsActiveMember is 0
## 
## ..y is 0.88 with cover 1% when
##     Age is [42,46) or [46,53) or [53, Inf]
##     NumOfProducts is 3 or 4
##     IsActiveMember is 1
## 
## ..y is 0.89 with cover 1% when
##     Age is [-Inf,35) or [35,37) or [37,40) or [40,42)
##     NumOfProducts is 3 or 4
##     Balance is [8.108e+04,1.043e+05) or [1.043e+05,1.2e+05) or [1.2e+05,1.346e+05) or [1.346e+05,1.443e+05) or [1.443e+05,1.571e+05) or [1.571e+05, Inf]
## 
## ..y is 1.00 with cover 0% when
##     Age is [42,46)
##     NumOfProducts is 3 or 4
##     IsActiveMember is 0

Vemos el numero de nodos final por cliente.

rp_numnod <- rpart.predict(model_extr2,test,nn=T)

head(rp_numnod)

##           0          1 nn
## 1 0.0000000 1.00000000 61
## 2 0.7505423 0.24945770 56
## 3 0.8560606 0.14393939 26
## 4 0.8560606 0.14393939 26
## 5 0.9564655 0.04353448  4
## 6 0.5181818 0.48181818 60

Calculo de prediccion del modelo (scorings).

rp_predict <- predict(model_extr2, test, type = 'prob')[,2]

head(rp_predict,20)

##          1          2          3          4          5          6          7 
## 1.00000000 0.24945770 0.14393939 0.14393939 0.04353448 0.48181818 0.04353448 
##          8          9         10         11         12         13         14 
## 0.15197568 0.04353448 0.15197568 0.04353448 0.15197568 0.80901288 0.24945770 
##         15         16         17         18         19         20 
## 0.15197568 0.04353448 0.04353448 0.15197568 0.48181818 0.80901288

Veamos graficamente como se ve:

plot(rp_predict~test$Exited)

Probamos cortes con la funcion umbrales, visualizamos la tabla respectiva.

umb_rp<-umbrales(test$Exited,rp_predict)
umb_rp

##    umbral  acierto precision cobertura        F1
## 1    0.05 56.51450  30.99287 92.320261 46.406571
## 2    0.10 56.51450  30.99287 92.320261 46.406571
## 3    0.15 59.04698  31.92736 89.052288 47.003019
## 4    0.20 81.13962  53.16804 63.071895 57.698057
## 5    0.25 84.07198  62.78626 53.758170 57.922535
## 6    0.30 84.87171  66.73729 51.470588 58.118081
## 7    0.35 84.87171  66.73729 51.470588 58.118081
## 8    0.40 84.87171  66.73729 51.470588 58.118081
## 9    0.45 84.87171  66.73729 51.470588 58.118081
## 10   0.50 85.10497  71.65354 44.607843 54.984894
## 11   0.55 85.07164  76.45161 38.725490 51.409978
## 12   0.60 85.07164  76.45161 38.725490 51.409978
## 13   0.65 85.00500  80.45113 34.967320 48.747153
## 14   0.70 85.00500  80.45113 34.967320 48.747153
## 15   0.75 85.00500  80.45113 34.967320 48.747153
## 16   0.80 85.03832  82.47012 33.823529 47.972190
## 17   0.85 81.50616  92.53731 10.130719 18.262150
## 18   0.90 80.00666  92.85714  2.124183  4.153355
## 19   0.95 80.00666  92.85714  2.124183  4.153355

Lo hallamos directamente con la siguiente sintaxis.

umbral_final_rp<-umb_rp[which.max(umb_rp$F1),1]
umbral_final_rp

## [1] 0.3

Hallamos la matriz de confusion del umbral optimizado.

confusion(test$Exited,rp_predict,0.3)

##     
## real FALSE TRUE
##    0  2232  157
##    1   297  315

rp_metricas<-filter(umb_rp,umbral==umbral_final_rp)
rp_metricas

##   umbral  acierto precision cobertura       F1
## 1    0.3 84.87171  66.73729  51.47059 58.11808

Evaluamos la ROC

#creamos el objeto prediction
rp_prediction<-prediction(rp_predict,test$Exited)
#visualizamos la ROC
roc(rp_prediction)

#Sacamos las metricas definitivas incluyendo el AUC

rp_metricas<-cbind(rp_metricas,AUC=round(auc(rp_prediction),2)*100)
print(t(rp_metricas))

##               [,1]
## umbral     0.30000
## acierto   84.87171
## precision 66.73729
## cobertura 51.47059
## F1        58.11808
## AUC       82.00000

4.3 Random Forest

set.seed(45)
split_inicial <- initial_split(
  data   = churn2,
  prop   = 0.7,
  strata = Exited
)
train <- training(split_inicial)
test  <- testing(split_inicial)


mod_rf <- rand_forest(
  mode  = "classification",
  mtry  = tune(),
  trees = 500,
  min_n = tune()
  )%>%
  set_engine(engine = "randomForest",importance=T)

set.seed(45)
cv_folds <- vfold_cv(
  data = train,
  v = 10,
  strata = Exited)

mod_rcp <- recipe(Exited~.,
                  data = train)

wf_rf <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rf)

hiperpar_grid=grid_max_entropy(
  # Rango de busqueda para cada hiperparametro
  mtry(range = c(1L, 10L), trans = NULL),
  min_n(range = c(2L, 100L), trans = NULL),
  # Numero de combinaciones totales
  size = 100
)

registerDoParallel(cores = parallel::detectCores() - 1)

set.seed(45)
rf_fit <- wf_rf%>%tune_grid(
  resamples = cv_folds,
  metrics = metric_set(roc_auc),
  control = control_resamples(save_pred = TRUE),
  grid = hiperpar_grid)

stopImplicitCluster()

show_best(rf_fit, metric = "roc_auc")

## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     5    36 roc_auc binary     0.862    10 0.00684 Preprocessor1_Model34
## 2     5    29 roc_auc binary     0.862    10 0.00620 Preprocessor1_Model20
## 3     5    25 roc_auc binary     0.862    10 0.00623 Preprocessor1_Model07
## 4     5    33 roc_auc binary     0.862    10 0.00647 Preprocessor1_Model69
## 5     5    20 roc_auc binary     0.862    10 0.00649 Preprocessor1_Model85

mejor_hiper3 <- select_best(rf_fit, metric="roc_auc")
mejor_hiper3

## # A tibble: 1 x 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     5    36 Preprocessor1_Model34

wf_finalrf <- 
  wf_rf%>%
  finalize_workflow(mejor_hiper3)


set.seed(45)
final_model_rf <- fit(object = wf_finalrf, data = train)

model_xtr <- final_model_rf%>%
  extract_fit_engine()

Como hay 2 criterios, creamos una unica variable agregada para tener una mejor idea de la importancia de cada variable.

importancia <- randomForest::importance(model_xtr)[,3:4]
#normalizamos para poner las dos variables en la misma escala. los valores negativos son las que menos predicen y los positivos las que mas
importancia_norm <- as.data.frame(scale(importancia))
#creamos una unica variable como suma de las otras
importancia_norm <- importancia_norm %>% mutate(
  Variable = rownames(importancia_norm),
  Imp_tot = MeanDecreaseAccuracy + MeanDecreaseGini) %>%
  mutate(Imp_tot = Imp_tot + abs(min(Imp_tot))) %>% 
  arrange(desc(Imp_tot)) %>% 
  select(Variable,Imp_tot,MeanDecreaseAccuracy,MeanDecreaseGini)
#hacemos un grafico para ver la curva de caida de importancia
ggplot(importancia_norm, aes(reorder(Variable,-Imp_tot),Imp_tot)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90,size = 7))

importancia_norm

##                        Variable   Imp_tot MeanDecreaseAccuracy MeanDecreaseGini
## Age                         Age 5.2719011           1.58237871       1.94997401
## NumOfProducts     NumOfProducts 5.1802924           1.80058147       1.64016262
## IsActiveMember   IsActiveMember 2.2400630           0.57662585      -0.07611118
## Balance                 Balance 1.5377288          -0.11100592      -0.09081359
## Geography             Geography 1.2275198          -0.06557148      -0.44645704
## Tenure                   Tenure 1.0650326          -0.66429186      -0.01022388
## CreditScore         CreditScore 0.3753788          -0.78527792      -0.57889162
## Gender                   Gender 0.2533321          -0.64085754      -0.84535867
## EstimatedSalary EstimatedSalary 0.2442347          -0.88991090      -0.60540274
## HasCrCard             HasCrCard 0.0000000          -0.80267041      -0.93687792

Vemos el grafico y nuestros datos en la tabla de forma decreciente segun nivel de importancia, y vemos donde puede estar un corte claro, diferencia de una con respecto a la anterior, y vemos de la variable ‘Tenure’ a ‘CreditScore’ un corte interesante.

Vamos a deseleccionar de nuestros datos esas 4 variables sobrantes por debajo del corte de ‘Tenure’ y nos quedaremos con esta y las de la parte superior.

churn3 <- churn2%>%select(-CreditScore,-Gender,-EstimatedSalary,-HasCrCard)

y ahora vamos a realizar de nuevo el modelo con las nuevas variables.

set.seed(45)
split_inicial <- initial_split(
  data   = churn3,
  prop   = 0.7,
  strata = Exited
)
train <- training(split_inicial)
test  <- testing(split_inicial)

mod_rf <- rand_forest(
  mode  = "classification",
  mtry  = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_engine(engine = "randomForest",importance=T)

mod_rcp <- recipe(Exited~.,
                  data = train)

set.seed(45)
cv_folds <- vfold_cv(
  data = train,
  v = 10,
  strata = Exited)

wf_rf <- workflow()%>%
  add_recipe(mod_rcp)%>%
  add_model(mod_rf)

hiperpar_grid=grid_max_entropy(
  # Rango de busqueda para cada hiperparametro
  mtry(range = c(1L, 10L), trans = NULL),
  min_n(range = c(2L, 100L), trans = NULL),
  # Numero de combinaciones totales
  size = 100
)

registerDoParallel(cores = parallel::detectCores() - 1)


set.seed(45)
rf_fit <- wf_rf%>%tune_grid(
  resamples = cv_folds,
  metrics = metric_set(roc_auc),
  control = control_resamples(save_pred = TRUE),
  grid = hiperpar_grid)

stopImplicitCluster()

show_best(rf_fit, metric = "roc_auc")

## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     5    25 roc_auc binary     0.855    10 0.00693 Preprocessor1_Model07
## 2     3    27 roc_auc binary     0.854    10 0.00791 Preprocessor1_Model21
## 3     3    37 roc_auc binary     0.854    10 0.00756 Preprocessor1_Model52
## 4     4    44 roc_auc binary     0.854    10 0.00731 Preprocessor1_Model45
## 5     2    37 roc_auc binary     0.853    10 0.00786 Preprocessor1_Model04

mejor_hiper4 <- select_best(rf_fit, metric="roc_auc")
mejor_hiper4

## # A tibble: 1 x 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     5    25 Preprocessor1_Model07

wf_finalrf <- 
  wf_rf%>%
  finalize_workflow(mejor_hiper4)


set.seed(45)
final_model_rf <- fit(object = wf_finalrf, data = train)

model_extr3 <- final_model_rf%>%
  extract_fit_engine()

Calculo de prediccion del modelo (scorings).

rf_predict <- predict(model_extr3, test, type = 'prob')[,2]

head(rf_predict,20)

##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.974 0.120 0.070 0.074 0.010 0.514 0.000 0.096 0.000 0.000 0.106 0.262 0.300 
##    14    15    16    17    18    19    20 
## 0.118 0.178 0.002 0.000 0.408 0.520 0.992

Veamos graficamente como se ve:

plot(rf_predict~test$Exited)

Probamos cortes con la funcion umbrales, visualizamos la tabla respectiva.

umb_rf<-umbrales(test$Exited,rf_predict)
umb_rf

##    umbral  acierto precision cobertura       F1
## 1    0.05 74.94169  43.47015  76.14379 55.34442
## 2    0.10 79.14029  49.18415  68.95425 57.41497
## 3    0.15 81.87271  54.65753  65.19608 59.46349
## 4    0.20 83.53882  59.24765  61.76471 60.48000
## 5    0.25 84.30523  61.80905  60.29412 61.04218
## 6    0.30 84.73842  64.31227  56.53595 60.17391
## 7    0.35 85.07164  66.53226  53.92157 59.56679
## 8    0.40 85.20493  68.18182  51.47059 58.65922
## 9    0.45 85.17161  69.64706  48.36601 57.08775
## 10   0.50 85.23825  71.50127  45.91503 55.92040
## 11   0.55 85.30490  73.29700  43.95425 54.95403
## 12   0.60 85.17161  76.01246  39.86928 52.30439
## 13   0.65 85.27158  77.59740  39.05229 51.95652
## 14   0.70 85.23825  79.23875  37.41830 50.83241
## 15   0.75 85.07164  80.37037  35.45752 49.20635
## 16   0.80 84.77174  82.15768  32.35294 46.42438
## 17   0.85 84.70510  85.25346  30.22876 44.63209
## 18   0.90 84.03865  85.18519  26.30719 40.19975
## 19   0.95 82.77241  83.68794  19.28105 31.34130

Lo hallamos directamente con la siguiente sintaxis.

umbral_final_rf<-umb_rf[which.max(umb_rf$F1),1]
umbral_final_rf

## [1] 0.25

Hallamos la matriz de confusion del umbral optimizado.

confusion(test$Exited,rf_predict,0.25)

##     
## real FALSE TRUE
##    0  2161  228
##    1   243  369

rf_metricas<-filter(umb_rf,umbral==umbral_final_rf)
rf_metricas

##   umbral  acierto precision cobertura       F1
## 1   0.25 84.30523  61.80905  60.29412 61.04218

#Evaluamos la ROC

#creamos el objeto prediction
rf_prediction<-prediction(rf_predict,test$Exited)
#visualizamos la ROC
roc(rf_prediction)

#Sacamos las metricas definitivas incluyendo el AUC

rf_metricas<-cbind(rf_metricas,AUC=round(auc(rf_prediction),2)*100)
print(t(rf_metricas))

##               [,1]
## umbral     0.25000
## acierto   84.30523
## precision 61.80905
## cobertura 60.29412
## F1        61.04218
## AUC       83.00000

4.4 XGBOOST (extra gradient boosting)

set.seed(45)
split_inicial <- initial_split(
  data   = churn2,
  prop   = 0.7,
  strata = Exited
)
train <- training(split_inicial)
test  <- testing(split_inicial)



mod_xg <- boost_tree(mtry = tune(),
                     min_n = tune(),
                     tree_depth = tune(),
                     trees = 500,
                     learn_rate = 0.1,
                     sample_size = 0.8) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")


set.seed(3001)
cv_folds <- vfold_cv(
  data = train,
  v = 10,
  strata = Exited)


wf_xg<- workflow()%>%
  add_formula(Exited~.)%>%
  add_model(mod_xg)


# Parametros para tuning
params_xgb <- parameters(
  finalize(mtry(), x = train[, -1]),
  min_n(range = c(2L, 50L)),
  tree_depth(range = c(3L, 8L))
)

# Grid
set.seed(45)
grid_xgb <- params_xgb %>% 
  grid_max_entropy(size = 10)

registerDoParallel(cores = parallel::detectCores() - 1)

set.seed(2631)
xg_fit <- wf_xg%>%tune_grid(
  resamples = cv_folds,
  metrics = metric_set(roc_auc),
  grid = grid_xgb,
  control = control_grid(save_pred = TRUE))


stopImplicitCluster()

show_best(xg_fit, metric = "roc_auc")

## # A tibble: 5 x 9
##    mtry min_n tree_depth .metric .estimator  mean     n std_err .config         
##   <int> <int>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1    10    50          5 roc_auc binary     0.860    10 0.00374 Preprocessor1_M~
## 2     8    38          7 roc_auc binary     0.860    10 0.00408 Preprocessor1_M~
## 3     8    31          5 roc_auc binary     0.860    10 0.00398 Preprocessor1_M~
## 4     4    31          5 roc_auc binary     0.860    10 0.00402 Preprocessor1_M~
## 5     8    11          3 roc_auc binary     0.859    10 0.00440 Preprocessor1_M~

mejor_hiper5 <- select_best(xg_fit, metric="roc_auc")
mejor_hiper5

## # A tibble: 1 x 4
##    mtry min_n tree_depth .config              
##   <int> <int>      <int> <chr>                
## 1    10    50          5 Preprocessor1_Model05

wf_finalxg <- 
  wf_rf%>%
  finalize_workflow(mejor_hiper5)


set.seed(45)
final_model_xg <- fit(object = wf_finalxg, data = train)

## Warning: 10 columns were requested but there were 6 predictors in the data. 6
## will be used.

model_extr4 <- final_model_xg%>%
  extract_fit_engine()

Calculo de prediccion del modelo (scorings).

xg_predict <- predict(model_extr4, test, type = 'prob')[,2]

head(xg_predict,20)

##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.942 0.102 0.046 0.016 0.000 0.524 0.000 0.044 0.000 0.000 0.038 0.222 0.346 
##    14    15    16    17    18    19    20 
## 0.048 0.096 0.000 0.000 0.392 0.546 1.000

Veamos graficamente como se ve:

plot(xg_predict~test$Exited)

Probamos cortes con la funcion umbrales, visualizamos la tabla respectiva.

umb_xg<-umbrales(test$Exited,xg_predict)
umb_xg

##    umbral  acierto precision cobertura       F1
## 1    0.05 79.30690  49.48216  70.26144 58.06887
## 2    0.10 82.40586  55.88235  65.19608 60.18100
## 3    0.15 83.50550  59.27100  61.11111 60.17699
## 4    0.20 84.17194  61.79002  58.66013 60.18441
## 5    0.25 84.53849  63.60294  56.53595 59.86159
## 6    0.30 84.83839  65.24272  54.90196 59.62733
## 7    0.35 85.13829  67.29167  52.77778 59.15751
## 8    0.40 85.40487  69.41964  50.81699 58.67925
## 9    0.45 85.60480  71.32701  49.18301 58.22050
## 10   0.50 85.27158  71.03960  46.89542 56.49606
## 11   0.55 85.10497  71.88329  44.28105 54.80283
## 12   0.60 85.30490  74.92711  41.99346 53.82199
## 13   0.65 85.47151  77.32919  40.68627 53.31906
## 14   0.70 85.37154  78.73754  38.72549 51.91676
## 15   0.75 85.03832  78.59649  36.60131 49.94426
## 16   0.80 84.97168  79.92565  35.13072 48.80817
## 17   0.85 84.73842  82.08333  32.18954 46.24413
## 18   0.90 84.13862  83.00971  27.94118 41.80929
## 19   0.95 83.40553  85.18519  22.54902 35.65891

Lo hallamos directamente con la siguiente sintaxis.

umbral_final_xg<-umb_xg[which.max(umb_xg$F1),1]
umbral_final_xg

## [1] 0.2

Hallamos la matriz de confusion del umbral optimizado.

confusion(test$Exited,xg_predict,0.2)

##     
## real FALSE TRUE
##    0  2167  222
##    1   253  359

xg_metricas<-filter(umb_xg,umbral==umbral_final_xg)
xg_metricas

##   umbral  acierto precision cobertura       F1
## 1    0.2 84.17194  61.79002  58.66013 60.18441

Evaluamos la ROC

#creamos el objeto prediction
xg_prediction<-prediction(xg_predict,test$Exited)
#visualizamos la ROC
roc(xg_prediction)

#Sacamos las metricas definitivas incluyendo el AUC

xg_metricas<-cbind(xg_metricas,AUC=round(auc(xg_prediction),2)*100)
print(t(xg_metricas))

##               [,1]
## umbral     0.20000
## acierto   84.17194
## precision 61.79002
## cobertura 58.66013
## F1        60.18441
## AUC       83.00000

4.5 Comparativa de metricas y curvas ROC.

4.5.1 Comparativa de metricas

comparativa <- rbind(rl_metricas,rp_metricas,rf_metricas,xg_metricas)
rownames(comparativa) <- c('Regresion Logistica', 'Arbol Decision', 'Random Forest', 'Extra Gradiant Boosting')
t(comparativa)

##           Regresion Logistica Arbol Decision Random Forest
## umbral                0.35000        0.30000       0.25000
## acierto              83.87204       84.87171      84.30523
## precision            60.95890       66.73729      61.80905
## cobertura            58.16993       51.47059      60.29412
## F1                   59.53177       58.11808      61.04218
## AUC                  84.00000       82.00000      83.00000
##           Extra Gradiant Boosting
## umbral                    0.20000
## acierto                  84.17194
## precision                61.79002
## cobertura                58.66013
## F1                       60.18441
## AUC                      83.00000

4.5.2 Comparativas de Curvas ROC de los modelos.

rl_auc <- rlog%>%collect_predictions(summarize=TRUE,parameters=best_rlog2)%>%
  roc_curve(Exited,.pred_0)%>%
  mutate(model = "Regresion Logistica")

rp_auc <- rp_fit%>%collect_predictions(summarize=TRUE,parameters=mejor_hiper2)%>%
  roc_curve(Exited,.pred_0)%>%
  mutate(model = "Arbol de Decision") 

rf_auc <- rf_fit%>%collect_predictions(summarize=TRUE,parameters=mejor_hiper4)%>%
  roc_curve(Exited,.pred_0)%>%
  mutate(model = "Random Forest") 

xg_auc <- xg_fit%>%collect_predictions(summarize=TRUE,parameters=mejor_hiper5)%>%
  roc_curve(Exited,.pred_0)%>%
  mutate(model = "Xgboost")

bind_rows(rl_auc,rp_auc,rf_auc,xg_auc)%>% # Dibuja las 4 curvas AUC juntas
  ggplot(aes(x=1-specificity, y = sensitivity, col = model))+ # Especifica el eje x e y dibuja la columna para usar el nombre de la metrica
  geom_path(lwd = 1.5, alpha = 0.8) + # Conecta a las 4 AUC, lwd = anchura de linea, alpha = Color transparencia del valor.
  geom_abline(lty = 3) + # abline del dibujo, lty= linea tipo
  coord_equal() + # Asegura que los rangos de los ejes sean iguales
  scale_color_viridis_d(option = "plasma", end = .6)

En base a la comparativa el modelo de Regresion Logistica ha obtenido levemente mejores resultados, asi que seleccionamos este modelo ganador en la comparativa.

4.6 Guardado del modelo seleccionado.

Añadimos la variable scoring a nuestro dataframe en base a la prediccion del modelo seleccionado y guardamos el modelo final.

churn2$SCORING_CHURN <- predict(model_extr1,churn2,type='response')
saveRDS(model_extr1,'modelo_final.rds')

# Añadimos las variables apartadas que no intervinieron en la ejecucion del modelo
# y visualizamos la tabla con el scoring añadido.

dt2<-bind_cols(dt1, churn2)

kable(head(dt2,6), booktabs = T) %>%
  kable_styling(font_size=8)

CustomerId	Surname	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	SCORING_CHURN
15634602	Hargrave	[564,625)	France	Female	[42,46)	2	[-Inf,8.108e+04)	1	1	1	[8.082e+04,1.191e+05)	1	0.2251112
15647311	Hill	[564,625)	Spain	Female	[40,42)	1	[8.108e+04,1.043e+05)	1	0	1	[8.082e+04,1.191e+05)	0	0.1166358
15619304	Onio	[-Inf,521)	France	Female	[42,46)	8	[1.571e+05, Inf]	3	1	0	[8.082e+04,1.191e+05)	1	0.9325883
15701354	Boni	[648,727)	France	Female	[37,40)	1	[-Inf,8.108e+04)	2	0	0	[8.082e+04,1.191e+05)	0	0.0726866
15737888	Mitchell	[727, Inf]	Spain	Female	[42,46)	2	[1.2e+05,1.346e+05)	1	1	1	[4.055e+04,8.082e+04)	0	0.2714073
15574012	Chu	[625,648)	Spain	Male	[42,46)	8	[1.043e+05,1.2e+05)	2	1	0	[1.191e+05,1.596e+05)	1	0.1210836

CHURN MODELLING

Javier Calviño Tilves

8/11/2021

0. INTRODUCCION

1. CARGA,PREPARACION Y LIMPIEZA DE DATOS.

2. ANALISIS DE LAS VARIABLES DE NUESTRO DATASET

2.1 ANALISIS DE LA VARIABLE OBJETIVO.

2.2 ANALISIS DE LAS VARIABLES CONTINUAS

2.2.1 CORRELACION VARIABLES CONTINUAS

2.3 ANALISIS VARIABLES CATEGORICAS

3. DISCRETIZACION DE VARIABLES NUMERICAS EN EL DATASET

4. MODELIZACION DE DATOS.

4.1 Regresion Logistica

4.2 Arbol de decision

4.3 Random Forest

4.4 XGBOOST (extra gradient boosting)

4.5 Comparativa de metricas y curvas ROC.

4.5.1 Comparativa de metricas

4.5.2 Comparativas de Curvas ROC de los modelos.

4.6 Guardado del modelo seleccionado.