Desarrollo del Proyecto

1.Preparación datos y librerias

1.1 - Cargamos las librerias que vamos a utilizar

#lista de paquetes que vamos a usar
paquetes <- c('data.table',#para leer y escribir datos de forma rapida
              'dplyr',#para manipulacion de datos
              'tidyr',#para manipulacion de datos
              'ggplot2',#para graficos
              'randomForest',#para crear los modelos
              'ROCR',#para evaluar modelos
              'purrr',#para usar la funcion map que aplica la misma funciona a varios componentes de un dataframe
              'smbinning',#para calcular la para importancia de las variables
              'rpart',#para crear arboles de decision
              'rpart.plot'#para el grafico del arbol
)
#Crea un vector logico con si estan instalados o no
instalados <- paquetes %in% installed.packages()
#Si hay al menos uno no instalado los instala
if(sum(instalados == FALSE) > 0) {
  install.packages(paquetes[!instalados])
}
lapply(paquetes,require,character.only = TRUE)

1.2. - Cargamos los datos

Usamos fread de data.table para una lectura mucho mas rapida

datos_1 <- fread('Telco-Customer-Churn.csv')

2. Analisis de datos

2.1 - Analisis exploratorio general y tipo de datos

as.data.frame(sort(names(datos_1)))

##    sort(names(datos_1))
## 1                 Churn
## 2              Contract
## 3            customerID
## 4            Dependents
## 5      DeviceProtection
## 6                gender
## 7       InternetService
## 8        MonthlyCharges
## 9         MultipleLines
## 10         OnlineBackup
## 11       OnlineSecurity
## 12     PaperlessBilling
## 13              Partner
## 14        PaymentMethod
## 15         PhoneService
## 16        SeniorCitizen
## 17      StreamingMovies
## 18          StreamingTV
## 19          TechSupport
## 20               tenure
## 21         TotalCharges

str(datos_1)

## Classes 'data.table' and 'data.frame':   7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
##  - attr(*, ".internal.selfref")=<externalptr>

glimpse(datos_1)

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CF...
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Femal...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "...
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No...
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service"...
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber ...
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes"...
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No",...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No",...
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "...
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", ...
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "...
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "O...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No...
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check"...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No",...

CONCLUSIONES:

Hay 17 variables como caracter que deberian ser factores: gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling y PaymentMethod. De momento las guardamos en la variable ‘Var_factores’

Reservamos las variables a pasar a factores:

Var_factores <- c('gender', 'SeniorCitizen', 'Partner',
                  'Dependents', 'PhoneService', 'MultipleLines',
                  'InternetService', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'TechSupport', 'StreamingTV',
                  'StreamingMovies', 'Contract',
                  'PaperlessBilling', 'PaymentMethod')

2.2 - Calidad de datos: Estadísticos básicos

Hacemos un summary, con lapply que sale en formato de lista y se lee mejor

lapply(datos_1,summary)

## $customerID
##    Length     Class      Mode 
##      7043 character character 
## 
## $gender
##    Length     Class      Mode 
##      7043 character character 
## 
## $SeniorCitizen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1621  0.0000  1.0000 
## 
## $Partner
##    Length     Class      Mode 
##      7043 character character 
## 
## $Dependents
##    Length     Class      Mode 
##      7043 character character 
## 
## $tenure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   32.37   55.00   72.00 
## 
## $PhoneService
##    Length     Class      Mode 
##      7043 character character 
## 
## $MultipleLines
##    Length     Class      Mode 
##      7043 character character 
## 
## $InternetService
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineSecurity
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineBackup
##    Length     Class      Mode 
##      7043 character character 
## 
## $DeviceProtection
##    Length     Class      Mode 
##      7043 character character 
## 
## $TechSupport
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingTV
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingMovies
##    Length     Class      Mode 
##      7043 character character 
## 
## $Contract
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaperlessBilling
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaymentMethod
##    Length     Class      Mode 
##      7043 character character 
## 
## $MonthlyCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.50   70.35   64.76   89.85  118.75 
## 
## $TotalCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    18.8   401.4  1397.5  2283.3  3794.7  8684.8      11 
## 
## $Churn
##    Length     Class      Mode 
##      7043 character character

Conclusiones:

Es extraño que el minimo de la variable Tenure sea 0
Las 3 variables númericas tienen valores razonables salvo por los 11 nulos de TotalCharges

2.3 - Calidad de datos: Análisis de nulos y Ceros

Comprobamos los valores Nulo de Data Frame

data.frame(colSums(is.na(datos_1)))

##                  colSums.is.na.datos_1..
## customerID                             0
## gender                                 0
## SeniorCitizen                          0
## Partner                                0
## Dependents                             0
## tenure                                 0
## PhoneService                           0
## MultipleLines                          0
## InternetService                        0
## OnlineSecurity                         0
## OnlineBackup                           0
## DeviceProtection                       0
## TechSupport                            0
## StreamingTV                            0
## StreamingMovies                        0
## Contract                               0
## PaperlessBilling                       0
## PaymentMethod                          0
## MonthlyCharges                         0
## TotalCharges                          11
## Churn                                  0

Conclusion:

Tal y como apuntabamos hay 11 nulos en la variable Tatalcharges

Analisis de Ceros:

contar_ceros <- function(variable) {
    temp <- transmute(datos_1,if_else(variable==0,1,0))
    sum(temp)
}

num_ceros <- sapply(datos_1,contar_ceros)
num_ceros <- data.frame(VARIABLE=names(num_ceros),CEROS = as.numeric(num_ceros),stringsAsFactors = F) #el as.numeric es para sacar solo el valor de num_ceros, sin el nombre
num_ceros <- num_ceros %>%
  arrange(desc(CEROS)) %>%
  mutate(PORCENTAJE = CEROS / nrow(datos_1) * 100)
num_ceros

##            VARIABLE CEROS PORCENTAJE
## 1     SeniorCitizen  5901 83.7853188
## 2            tenure    11  0.1561834
## 3        customerID     0  0.0000000
## 4            gender     0  0.0000000
## 5           Partner     0  0.0000000
## 6        Dependents     0  0.0000000
## 7      PhoneService     0  0.0000000
## 8     MultipleLines     0  0.0000000
## 9   InternetService     0  0.0000000
## 10   OnlineSecurity     0  0.0000000
## 11     OnlineBackup     0  0.0000000
## 12 DeviceProtection     0  0.0000000
## 13      TechSupport     0  0.0000000
## 14      StreamingTV     0  0.0000000
## 15  StreamingMovies     0  0.0000000
## 16         Contract     0  0.0000000
## 17 PaperlessBilling     0  0.0000000
## 18    PaymentMethod     0  0.0000000
## 19   MonthlyCharges     0  0.0000000
## 20            Churn     0  0.0000000
## 21     TotalCharges    NA         NA

Conclusion:

Hay 11 ceros en la variable Tenure que probablemente coincidad con los 11 Nulos en totalCharges

Comprobamos si son los mismos casos:

Se puede hacer filtrando por tenure:

datos_1 %>% 
  filter(tenure == 0) %>%
  select(customerID,tenure,TotalCharges)

##    customerID tenure TotalCharges
## 1  4472-LVYGI      0           NA
## 2  3115-CZMZD      0           NA
## 3  5709-LVOEQ      0           NA
## 4  4367-NUYAO      0           NA
## 5  1371-DWPAZ      0           NA
## 6  7644-OMVMY      0           NA
## 7  3213-VVOLG      0           NA
## 8  2520-SGTTA      0           NA
## 9  2923-ARZLG      0           NA
## 10 4075-WKNIU      0           NA
## 11 2775-SEFEE      0           NA

O bien filtrando por TotalCharges:

datos_1 %>% 
  filter( is.na(TotalCharges)) %>%
  select(customerID,tenure, TotalCharges)

##    customerID tenure TotalCharges
## 1  4472-LVYGI      0           NA
## 2  3115-CZMZD      0           NA
## 3  5709-LVOEQ      0           NA
## 4  4367-NUYAO      0           NA
## 5  1371-DWPAZ      0           NA
## 6  7644-OMVMY      0           NA
## 7  3213-VVOLG      0           NA
## 8  2520-SGTTA      0           NA
## 9  2923-ARZLG      0           NA
## 10 4075-WKNIU      0           NA
## 11 2775-SEFEE      0           NA

Conclusion:

Son los mismos 11 registros. Como no podemos preguntar por ellos asumimos que son un error de datos con lo que al ser un % muy pequeño la decisión es eliminarlos totalmente.

2.4 - Calidad de datos: Análisis de atípicos

2.4.1 - Analizamos las que son de tipo numerico

out <- function(variable){
  t(t(head(sort(variable,decreasing = T),20))) #la doble traspuesta es un truco para que se visualice la salida, si no lo que crearia es una coleccion de dataframes que no se ven bien
}
lapply(datos_1,function(x){
  if(is.integer(x)| is.double(x)) out(x)
})

## $customerID
## NULL
## 
## $gender
## NULL
## 
## $SeniorCitizen
##       [,1]
##  [1,]    1
##  [2,]    1
##  [3,]    1
##  [4,]    1
##  [5,]    1
##  [6,]    1
##  [7,]    1
##  [8,]    1
##  [9,]    1
## [10,]    1
## [11,]    1
## [12,]    1
## [13,]    1
## [14,]    1
## [15,]    1
## [16,]    1
## [17,]    1
## [18,]    1
## [19,]    1
## [20,]    1
## 
## $Partner
## NULL
## 
## $Dependents
## NULL
## 
## $tenure
##       [,1]
##  [1,]   72
##  [2,]   72
##  [3,]   72
##  [4,]   72
##  [5,]   72
##  [6,]   72
##  [7,]   72
##  [8,]   72
##  [9,]   72
## [10,]   72
## [11,]   72
## [12,]   72
## [13,]   72
## [14,]   72
## [15,]   72
## [16,]   72
## [17,]   72
## [18,]   72
## [19,]   72
## [20,]   72
## 
## $PhoneService
## NULL
## 
## $MultipleLines
## NULL
## 
## $InternetService
## NULL
## 
## $OnlineSecurity
## NULL
## 
## $OnlineBackup
## NULL
## 
## $DeviceProtection
## NULL
## 
## $TechSupport
## NULL
## 
## $StreamingTV
## NULL
## 
## $StreamingMovies
## NULL
## 
## $Contract
## NULL
## 
## $PaperlessBilling
## NULL
## 
## $PaymentMethod
## NULL
## 
## $MonthlyCharges
##         [,1]
##  [1,] 118.75
##  [2,] 118.65
##  [3,] 118.60
##  [4,] 118.60
##  [5,] 118.35
##  [6,] 118.20
##  [7,] 117.80
##  [8,] 117.60
##  [9,] 117.50
## [10,] 117.45
## [11,] 117.35
## [12,] 117.20
## [13,] 117.15
## [14,] 116.95
## [15,] 116.85
## [16,] 116.80
## [17,] 116.75
## [18,] 116.60
## [19,] 116.60
## [20,] 116.55
## 
## $TotalCharges
##          [,1]
##  [1,] 8684.80
##  [2,] 8672.45
##  [3,] 8670.10
##  [4,] 8594.40
##  [5,] 8564.75
##  [6,] 8547.15
##  [7,] 8543.25
##  [8,] 8529.50
##  [9,] 8496.70
## [10,] 8477.70
## [11,] 8477.60
## [12,] 8476.50
## [13,] 8468.20
## [14,] 8456.75
## [15,] 8443.70
## [16,] 8436.25
## [17,] 8425.30
## [18,] 8425.15
## [19,] 8424.90
## [20,] 8405.00
## 
## $Churn
## NULL

2.5 - Analisis longitudinal

longi <- datos_1 %>% 
  summarise_all(mean) %>% #calcular la media de cada variable
  t() %>% #transponerlo para tenerlo en una sola columna y leerlo mejor
  as.data.frame() #reconvertirlo a dataframe porque t() lo pasa a matriz
data.frame(variable = rownames(longi), media = longi$V1) %>% #crear un nuevo dataframe para poder ordenar por el nombre
  arrange(desc(variable)) #ordenar por el nombre para tener la vision longitudinal

##            variable      media
## 1      TotalCharges         NA
## 2            tenure 32.3711487
## 3       TechSupport         NA
## 4       StreamingTV         NA
## 5   StreamingMovies         NA
## 6     SeniorCitizen  0.1621468
## 7      PhoneService         NA
## 8     PaymentMethod         NA
## 9           Partner         NA
## 10 PaperlessBilling         NA
## 11   OnlineSecurity         NA
## 12     OnlineBackup         NA
## 13    MultipleLines         NA
## 14   MonthlyCharges 64.7616925
## 15  InternetService         NA
## 16           gender         NA
## 17 DeviceProtection         NA
## 18       Dependents         NA
## 19       customerID         NA
## 20         Contract         NA
## 21            Churn         NA

2.6 Revisamos los valores de las variables categoricas

Este codigo da error en el Markdown, pero en el proyecto original funciona sin problemas. el texto del error es: Quitting from lines 175-181 (MD_ProyectoFinCurso2.Rmd) Error in df$Contract : objeto de tipo ‘closure’ no es subconjunto Calls: … withCallingHandlers -> withVisible -> eval -> eval -> table

table(df$Contract) table(df$PaymentMethod) table(df$TechSupport) table(df$InternetService) table(df$OnlineSecurity)

Conclusiones:

Todos los datos parecen normales solo hay que eliminar los Nulos

2.7 - Acciones resultado del analisis de calidad de datos y exploratorio:

transformar a factor las variables de ‘Var_factores’
eliminar los registros con tenure 0

datos_1 <- datos_1 %>%
  filter(tenure > 0) %>%
  mutate_at(Var_factores,funs(factor))

3 - Trasformación de datos

3.1 - Creacion de la variable target

Creacion de la variable cancelación de clientes “Crurn” (para el entrenamiento)

Este código no lo exporta correctamente

texto que falta en código linea 168

datos2 <- datos_1 %>% 
  mutate(TARGET_Churn = as.factor(as.numeric(ifelse((Churn == "Yes" ),1,0)
         ))) %>% #el as.numeric es para que los niveles del factor se queden como 0 y 1, y no como 1 y 2
  select(-Churn) #eliminamos la originale para que no confunda

3.2 - Preparacion de las variables independientes

3.2.1 - Preseleccion de variables independientes

Creamos una lista larga con todas las variables independientes.

ind_larga<-names(datos2) #lista con todas las variables
no_usar <- c('customerID','TARGET_Churn') #identificamos las que no queremos usar como VI

ind_larga<-setdiff(ind_larga,no_usar) #quitamos las que no usaremos

3.2.1.1 - Preseleccion con RandomForest

pre_rf <- randomForest(formula = reformulate(ind_larga,'TARGET_Churn'), 
                       data= datos2,mtry=2,ntree=50, importance = T)
imp_rf <- importance(pre_rf)[,4] #como importance devuelve una matriz con varias metricsa tenemos que extraer asi el decrecimiento en Gini que es el que mas nos interesa
imp_rf <- data.frame(VARIABLE = names(imp_rf), IMP_RF = imp_rf) #lo transformamos a dataframe
imp_rf <- imp_rf %>% arrange(desc(IMP_RF)) %>% mutate(RANKING_RF = 1:nrow(imp_rf)) #creamos el ranking
View(imp_rf)

3.2.1.2 - Preseleccion con Information Value

No funciona la libreria smbinning por lo que la selección de variables la llevaré a cabo solo con la información obtenida del Random Forest

3.2.2 - Seleccionar la lista de variables finales del proyecto

Una vez que ya hemos identificado las variables importantes

Decision: Vamos a descartar aquellas variables que no hayan salido entre las 8 mas importantes

ind_corta <- imp_rf %>%
  filter(RANKING_RF <= 8) %>% 
  select(VARIABLE) #nos quedamos solo con el nombre
ind_corta <- as.character(ind_corta$VARIABLE) #lo pasamos a un vector en vez de un dataframe

Estas son las variables predictoras con las que vamos a trabajar finalmente

ind_corta

## [1] "TotalCharges"    "tenure"          "MonthlyCharges"  "Contract"       
## [5] "PaymentMethod"   "OnlineSecurity"  "TechSupport"     "InternetService"

Seleccionamos las variables finales que iran en el nuevo Dataframe

Variables <- union(ind_corta, no_usar)#creamos la lista de variables finales, incluyendo  la target

3.3 - Fichero final y limpieza del entorno

3.3.1 - Fichero final

Creamos el Dataframe final al que llamamos df

df <- datos2 %>% 
  select(one_of(Variables))

3.3.2 - Limpieza del entorno

Vemos lo que hay en el entorno de trabajo y borramos todo except el Data Frame final y las variables target e indep (con las variables independientes)

ls() #Vemos todo lo que tenemos cargado en el entorno

##  [1] "contar_ceros" "datos_1"      "datos2"       "df"           "imp_rf"      
##  [6] "ind_corta"    "ind_larga"    "instalados"   "longi"        "no_usar"     
## [11] "num_ceros"    "out"          "paquetes"     "pre_rf"       "Var_factores"
## [16] "Variables"

rm(list=setdiff(ls(),'df')) #borramos todo excepto el nuevo df
# y vamos a dejar preparadas unas variables que nos van a facilitar cosas en el futuro:
target <- 'TARGET_Churn'
indep <- setdiff(names(df),c(target,'customerID'))

Guardamos un cache temporal de datos

saveRDS(df,'cache1.rds')

4 - Creacion de variables sinteticas

Dado que no hay histórico no vamos a crear variables sinteticas

Cargamos el cache

df <- readRDS(file = 'cache1.rds')

4.1 - Discretizacion

NOTA: No podemos discretizar por problemas con smbinning

Tenemos entonces que discretizar a mano

TotalCharges

Lo primero es ver que distribucion tiene la variable

ggplot(df,aes(TotalCharges)) + geom_density()

#Pedimos un histograma para aproximar mejor la forma que queremos conseguir
ggplot(df,aes(TotalCharges)) + geom_histogram(bins = 10)

#Ya sabemos que queremos una forma decreciente ahora veamos como se comporta la variable target para ver si podremos generar algo monotonico

ggplot(df,aes(TotalCharges,fill=TARGET_Churn)) + geom_histogram(bins = 20,position='fill') + scale_x_continuous(limits = c(0, 7500))

#Sabiendo ambas cosas vamos a apoyarnos en los deciles para intuir donde podemos hacer buenos cortes 
as.data.frame(quantile(df$TotalCharges,prob = seq(0, 1, length = 11)))

##      quantile(df$TotalCharges, prob = seq(0, 1, length = 11))
## 0%                                                     18.800
## 10%                                                    84.600
## 20%                                                   267.070
## 30%                                                   551.995
## 40%                                                   944.170
## 50%                                                  1397.475
## 60%                                                  2048.950
## 70%                                                  3141.130
## 80%                                                  4475.410
## 90%                                                  5976.640
## 100%                                                 8684.800

Procedemos a discretizarla manualmente

df <- df %>% mutate(TotalCharges_Disc = as.factor(case_when(
      TotalCharges <= 600 ~ '01_MENOR_600',
      TotalCharges > 600 & TotalCharges <= 1500 ~ '02_DE_600_A_1500',
      TotalCharges > 1500 & TotalCharges <= 3000 ~ '03_DE_1500_A_3000',
      TotalCharges > 3000 & TotalCharges <= 5000 ~ '04_DE_3000_A_5000',
      TotalCharges > 5000  ~ '05_Mayor_4000',
      TRUE ~ '00_ERROR'))
)

comprobamos como ha quedado la distribución hasta que nos parezca y ajustamos los cortes hasta que parezca correcta

table(df$TotalCharges_Disc)

## 
##      01_MENOR_600  02_DE_600_A_1500 03_DE_1500_A_3000 04_DE_3000_A_5000 
##              2202              1459              1167              1069 
##     05_Mayor_4000 
##              1135

Veamos si la distribucion ha quedado similar a la original

ggplot(df,aes(TotalCharges_Disc)) + geom_bar()

Y ahora vamos a comprobar si la penetracion de la target es monotonica

ggplot(df,aes(TotalCharges_Disc,fill=TARGET_Churn)) + geom_bar(position='fill')

MonthlyCharges

Lo primero es ver que distribucion tiene la variable

ggplot(df,aes(MonthlyCharges)) + geom_density()

#Pedimos un histograma para aproximar mejor la forma que queremos conseguir
ggplot(df,aes(MonthlyCharges)) + geom_histogram(bins = 10)

#Ya sabemos que queremos una forma decreciente ahora veamos como se comporta la variable target para ver si podremos generar algo monotonico

ggplot(df,aes(MonthlyCharges,fill=TARGET_Churn)) + geom_histogram(bins = 20,position='fill')

#Sabiendo ambas cosas vamos a apoyarnos en los deciles para intuir donde podemos hacer buenos cortes 
as.data.frame(quantile(df$MonthlyCharges,prob = seq(0, 1, length = 11)))

##      quantile(df$MonthlyCharges, prob = seq(0, 1, length = 11))
## 0%                                                       18.250
## 10%                                                      20.050
## 20%                                                      25.050
## 30%                                                      45.900
## 40%                                                      58.920
## 50%                                                      70.350
## 60%                                                      79.150
## 70%                                                      85.535
## 80%                                                      94.300
## 90%                                                     102.645
## 100%                                                    118.750

Procedemos a discretizarla manualmente

df <- df %>% mutate(MonthlyCharges_Disc = as.factor(case_when(
      MonthlyCharges <= 30 ~ '01_MENOR_30',
      MonthlyCharges > 30 & MonthlyCharges <= 50 ~ '02_DE_30_A_50',
      MonthlyCharges > 50 & MonthlyCharges <= 75 ~ '03_DE_50_A_75',
      MonthlyCharges > 75 & MonthlyCharges <= 90 ~ '04_DE_75_A_90',
      MonthlyCharges > 90  ~ '05_Mayor_90',
      TRUE ~ '00_ERROR'))
)

comprobamos como ha quedado la distribución hasta que nos parezca y ajustamos los cortes hasta que parezca correcta

table(df$MonthlyCharges_Disc)

## 
##   01_MENOR_30 02_DE_30_A_50 03_DE_50_A_75 04_DE_75_A_90   05_Mayor_90 
##          1647           646          1620          1380          1739

Veamos si la distribucion ha quedado similar a la original

ggplot(df,aes(MonthlyCharges_Disc)) + geom_bar()

Y ahora vamos a comprobar si la penetracion de la target es monotonica

ggplot(df,aes(MonthlyCharges_Disc,fill=TARGET_Churn)) + geom_bar(position='fill')

tenure

Lo primero es ver que distribucion tiene la variable

ggplot(df,aes(tenure)) + geom_density()

#Pedimos un histograma para aproximar mejor la forma que queremos conseguir
ggplot(df,aes(tenure)) + geom_histogram(bins = 10)

#Ya sabemos que queremos una forma decreciente ahora veamos como se comporta la variable target para ver si podremos generar algo monotonico

ggplot(df,aes(tenure,fill=TARGET_Churn)) + geom_histogram(bins = 20,position='fill')

#Sabiendo ambas cosas vamos a apoyarnos en los deciles para intuir donde podemos hacer buenos cortes 
as.data.frame(quantile(df$tenure,prob = seq(0, 1, length = 11)))

##      quantile(df$tenure, prob = seq(0, 1, length = 11))
## 0%                                                  1.0
## 10%                                                 2.0
## 20%                                                 6.0
## 30%                                                12.0
## 40%                                                20.0
## 50%                                                29.0
## 60%                                                40.0
## 70%                                                50.0
## 80%                                                60.8
## 90%                                                69.0
## 100%                                               72.0

Procedemos a discretizarla manualmente

df <- df %>% mutate(tenure_Disc = as.factor(case_when(
      tenure <= 15 ~ '01_MENOR_15',
      tenure > 15 & tenure <= 30 ~ '02_DE_15_A_30',
      tenure > 30 & tenure <= 45 ~ '03_DE_30_A_45',
      tenure > 45 & tenure <= 60 ~ '04_DE_45_A_60',
      tenure > 60  ~ '05_Mayor_60',
      TRUE ~ '00_ERROR'))
)

comprobamos como ha quedado la distribución hasta que nos parezca y ajustamos los cortes hasta que parezca correcta

table(df$tenure_Disc)

## 
##   01_MENOR_15 02_DE_15_A_30 03_DE_30_A_45 04_DE_45_A_60   05_Mayor_60 
##          2459          1171           957          1038          1407

Veamos si la distribucion ha quedado similar a la original

ggplot(df,aes(tenure_Disc)) + geom_bar()

Y ahora vamos a comprobar si la penetracion de la target es monotonica

ggplot(df,aes(tenure_Disc,fill=TARGET_Churn)) + geom_bar(position='fill')

Vamos a hacer una inspeccion visual de todas las variables a ver si han salido bien

df %>% 
  select_if(is.factor) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~ key, scales = "free") +
    theme(axis.text=element_text(size=4))#esto es para cambiar el tamaño del texto del eje y que se lea bien

Ahora vamos a analizar la penetración de la target en cada categoría para ver si las variables han salido monotónicas

a <- function(var1,var2) {
  df_temp <- data.frame(var1 = df[[var1]],var2 = df[[var2]])
  df_temp %>% 
    group_by(var1) %>% 
    summarise(Conteo = n(), Porc = mean(as.numeric(as.character(var2)))) %>% 
  ggplot(aes(var1,Porc)) + geom_bar(stat='identity') + xlab(var1)
}
df2_nombres <- df %>% select_if(is.factor) %>% names()
lapply(df2_nombres,function(x){a(x,'TARGET_Churn')})

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

4.3 - Limpieza

Eliminamos las variables continuas que hemos discretizado del Dataframe

eliminar <- c("tenure", "MonthlyCharges", "TotalCharges" )

df <- select(df,-eliminar)

Vamos a reordernar las variables

#creamos un vector con las variables centrales
centrales <- setdiff(names(df),c('customerID','TARGET_Churn'))
df <- df %>% select(
  customerID,
  one_of(centrales),
  TARGET_Churn)

Echamos un vistazo a ver como queda el data frame definitivo

glimpse(df)

## Rows: 7,032
## Columns: 10
## $ customerID          <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795...
## $ Contract            <fct> Month-to-month, One year, Month-to-month, One y...
## $ PaymentMethod       <fct> Electronic check, Mailed check, Mailed check, B...
## $ OnlineSecurity      <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Ye...
## $ TechSupport         <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, N...
## $ InternetService     <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, F...
## $ TotalCharges_Disc   <fct> 01_MENOR_600, 03_DE_1500_A_3000, 01_MENOR_600, ...
## $ MonthlyCharges_Disc <fct> 01_MENOR_30, 03_DE_50_A_75, 03_DE_50_A_75, 02_D...
## $ tenure_Disc         <fct> 01_MENOR_15, 03_DE_30_A_45, 01_MENOR_15, 03_DE_...
## $ TARGET_Churn        <fct> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,...

Limpiamos el entorno de cualquier cosa que no sea el dataframe

a_borrar <- setdiff(ls(),'df')
rm(list=c(a_borrar,'a_borrar'))

Guardamos otro cache temporal

saveRDS(df,'cache2.rds')

Cargamos el cache temporal

df <- readRDS('cache2.rds')

5. Modelizacion

5.1 - Preparar las funciones que vamos a necesitar

Funcion para crear una matriz de confusion

confusion<-function(real,scoring,umbral){ 
  conf<-table(real,scoring>=umbral)
  if(ncol(conf)==2) return(conf) else return(NULL)
}

Funcion para calcular las metricas de los modelos: acierto, precision, cobertura y F1

metricas<-function(matriz_conf){
  acierto <- (matriz_conf[1,1] + matriz_conf[2,2]) / sum(matriz_conf) *100
  precision <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[1,2]) *100
  cobertura <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[2,1]) *100
  F1 <- 2*precision*cobertura/(precision+cobertura)
  salida<-c(acierto,precision,cobertura,F1)
  return(salida)
}

Funcion para probar distintos umbrales y ver el efecto sobre precision y cobertura

umbrales<-function(real,scoring){
  umbrales<-data.frame(umbral=rep(0,times=19),acierto=rep(0,times=19),precision=rep(0,times=19),cobertura=rep(0,times=19),F1=rep(0,times=19))
  cont <- 1
  for (cada in seq(0.05,0.95,by = 0.05)){
    datos<-metricas(confusion(real,scoring,cada))
    registro<-c(cada,datos)
    umbrales[cont,]<-registro
    cont <- cont + 1
  }
  return(umbrales)
}

Funciones que calculan la curva ROC y el AUC

roc<-function(prediction){
  r<-performance(prediction,'tpr','fpr')
  plot(r)
}

auc<-function(prediction){
  a<-performance(prediction,'auc')
  return(a@y.values[[1]])
}

5.2 - Creamos las particiones de training (70%) y test (30%)

Generamos una variable aleatoria con una distribucion 70-30

df$random<-sample(0:1,size = nrow(df),replace = T,prob = c(0.3,0.7))

Creamos los dos dataframes

train<-filter(df,random==1)
test<-filter(df,random==0)
#Eliminamos ya la random para que no moleste
df$random <- NULL

5.3 - Creación del modelo de propensión

5.3.1 - Identificamos las variables

#Las independientes seran todas menos el codigo cliente y la target
independientes <- setdiff(names(df),c('customerID','TARGET_Churn'))
target <- 'TARGET_Churn'

5.3.2 - Creamos la formula para usar en el modelo

formula <- reformulate(independientes,target)

5.3.3 Guardamos los DataFrame

saveRDS(train,'train.rds')
saveRDS(test,'test.rds')

###5.3.4 - Modelizamos con regresion logistica

Primero vamos a hacer un modelo con todas las variables

formula_rl <- formula
rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)

## 
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8696  -0.6506  -0.2975   0.7333   3.1340  
## 
## Coefficients: (2 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -0.20960    0.24772  -0.846 0.397473    
## ContractOne year                     -0.89911    0.12718  -7.069 1.56e-12 ***
## ContractTwo year                     -1.75692    0.21726  -8.087 6.13e-16 ***
## PaymentMethodCredit card (automatic) -0.12722    0.13404  -0.949 0.342550    
## PaymentMethodElectronic check         0.38698    0.11026   3.510 0.000448 ***
## PaymentMethodMailed check            -0.19509    0.13465  -1.449 0.147356    
## OnlineSecurityNo internet service    -1.15430    0.26135  -4.417 1.00e-05 ***
## OnlineSecurityYes                    -0.40559    0.10127  -4.005 6.20e-05 ***
## TechSupportNo internet service             NA         NA      NA       NA    
## TechSupportYes                       -0.35727    0.10364  -3.447 0.000566 ***
## InternetServiceFiber optic            0.78283    0.15131   5.174 2.30e-07 ***
## InternetServiceNo                          NA         NA      NA       NA    
## TotalCharges_Disc02_DE_600_A_1500    -0.50456    0.13589  -3.713 0.000205 ***
## TotalCharges_Disc03_DE_1500_A_3000   -0.61298    0.22686  -2.702 0.006893 ** 
## TotalCharges_Disc04_DE_3000_A_5000   -0.81053    0.30692  -2.641 0.008270 ** 
## TotalCharges_Disc05_Mayor_4000       -0.54382    0.38305  -1.420 0.155693    
## MonthlyCharges_Disc02_DE_30_A_50      0.18228    0.25422   0.717 0.473358    
## MonthlyCharges_Disc03_DE_50_A_75     -0.03526    0.25588  -0.138 0.890395    
## MonthlyCharges_Disc04_DE_75_A_90      0.21597    0.29357   0.736 0.461933    
## MonthlyCharges_Disc05_Mayor_90        0.59609    0.31450   1.895 0.058045 .  
## tenure_Disc02_DE_15_A_30             -0.58406    0.16417  -3.558 0.000374 ***
## tenure_Disc03_DE_30_A_45             -0.49543    0.23191  -2.136 0.032656 *  
## tenure_Disc04_DE_45_A_60             -0.81781    0.28780  -2.842 0.004489 ** 
## tenure_Disc05_Mayor_60               -1.20127    0.35017  -3.431 0.000602 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5747.6  on 4976  degrees of freedom
## Residual deviance: 4162.6  on 4955  degrees of freedom
## AIC: 4206.6
## 
## Number of Fisher Scoring iterations: 6

Revisamos la significatividad y mantenemos todas las variables que tengan tres estrellas en alguna categoria,

a_mantener <- setdiff(names(df),c("MonthlyCharges_Disc", "customerID", "TARGET_Churn"))

Volvemos a modelizar

formula_rl <- reformulate(a_mantener,target)
rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)

## 
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7004  -0.6509  -0.3029   0.7330   3.0947  
## 
## Coefficients: (2 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -0.18792    0.12803  -1.468 0.142146    
## ContractOne year                     -0.88378    0.12648  -6.987 2.80e-12 ***
## ContractTwo year                     -1.73241    0.21660  -7.998 1.26e-15 ***
## PaymentMethodCredit card (automatic) -0.13483    0.13370  -1.008 0.313244    
## PaymentMethodElectronic check         0.40307    0.10983   3.670 0.000243 ***
## PaymentMethodMailed check            -0.19723    0.13423  -1.469 0.141730    
## OnlineSecurityNo internet service    -1.16121    0.15250  -7.615 2.64e-14 ***
## OnlineSecurityYes                    -0.39669    0.09992  -3.970 7.19e-05 ***
## TechSupportNo internet service             NA         NA      NA       NA    
## TechSupportYes                       -0.31903    0.10090  -3.162 0.001569 ** 
## InternetServiceFiber optic            0.96184    0.10303   9.335  < 2e-16 ***
## InternetServiceNo                          NA         NA      NA       NA    
## TotalCharges_Disc02_DE_600_A_1500    -0.43472    0.12996  -3.345 0.000823 ***
## TotalCharges_Disc03_DE_1500_A_3000   -0.42296    0.21191  -1.996 0.045938 *  
## TotalCharges_Disc04_DE_3000_A_5000   -0.43015    0.27992  -1.537 0.124374    
## TotalCharges_Disc05_Mayor_4000        0.05468    0.34515   0.158 0.874117    
## tenure_Disc02_DE_15_A_30             -0.62886    0.16143  -3.896 9.80e-05 ***
## tenure_Disc03_DE_30_A_45             -0.63596    0.22558  -2.819 0.004815 ** 
## tenure_Disc04_DE_45_A_60             -1.06661    0.27899  -3.823 0.000132 ***
## tenure_Disc05_Mayor_60               -1.50546    0.34011  -4.426 9.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5747.6  on 4976  degrees of freedom
## Residual deviance: 4181.5  on 4959  degrees of freedom
## AIC: 4217.5
## 
## Number of Fisher Scoring iterations: 6

Vemos que ahora ya todas las variables tienen al menos una categoria con 3 estrellas de significacion

Vamos a mirar el signo de los coeficientes, que debera seguir la logica de negocio: todas las variables tienen logica, asi que vamos a comprobar este modelo sobre el conjunto de test

Y calculamos el pseudo R cuadrado:

pr2_rl <- 1 -(rl$deviance / rl$null.deviance)
pr2_rl

## [1] 0.2724858

Aplicamos el modelo al conjunto de test, generando un vector con las probabilidades

rl_predict<-predict(rl,test,type = 'response')

Vemos que pinta tiene

plot(rl_predict~test$TARGET_Churn)

Ahora tenemos que transformar la probabilidad en una decision de si el cliente se va a dar de baja o no

Con la funcion umbrales probamos diferentes cortes

umb_rl<-umbrales(test$TARGET_Churn,rl_predict)
umb_rl

##    umbral  acierto precision cobertura       F1
## 1    0.05 51.38686  35.44801  97.83394 52.04033
## 2    0.10 59.12409  39.26426  94.40433 55.46129
## 3    0.15 63.94161  42.20183  91.33574 57.72961
## 4    0.20 69.34307  46.20000  83.39350 59.45946
## 5    0.25 72.89538  49.83203  80.32491 61.50657
## 6    0.30 74.69586  52.03837  78.33935 62.53602
## 7    0.35 77.76156  56.68966  74.18773 64.26896
## 8    0.40 77.37226  56.85670  66.60650 61.34663
## 9    0.45 78.34550  60.66536  55.95668 58.21596
## 10   0.50 79.17275  63.93805  52.16606 57.45527
## 11   0.55 79.75669  70.05814  43.50181 53.67483
## 12   0.60 78.54015  74.04255  31.40794 44.10646
## 13   0.65 77.90754  74.27184  27.61733 40.26316
## 14   0.70 76.49635  77.51938  18.05054 29.28258
## 15   0.75 76.49635  78.86179  17.50903 28.65583
## 16   0.80  0.80000   0.80000   0.80000  0.80000
## 17   0.85  0.85000   0.85000   0.85000  0.85000
## 18   0.90  0.90000   0.90000   0.90000  0.90000
## 19   0.95  0.95000   0.95000   0.95000  0.95000

Seleccionamos el umbral que maximiza la F1

umbral_final_rl<-umb_rl[which.max(umb_rl$F1),1]
umbral_final_rl

## [1] 0.35

Evaluamos la matriz de confusion y las metricas con el umbral optimizado

confusion(test$TARGET_Churn,rl_predict,umbral_final_rl)

##     
## real FALSE TRUE
##    0  1187  314
##    1   143  411

rl_metricas<-filter(umb_rl,umbral==umbral_final_rl)
rl_metricas

##   umbral  acierto precision cobertura       F1
## 1   0.35 77.76156  56.68966  74.18773 64.26896

Evaluamos la ROC

#creamos el objeto prediction
rl_prediction<-prediction(rl_predict,test$TARGET_Churn)
#visualizamos la ROC
roc(rl_prediction)

Sacamos las metricas definitivas incluyendo el AUC

rl_metricas<-cbind(rl_metricas,AUC=round(auc(rl_prediction),2)*100)
print(t(rl_metricas))

##               [,1]
## umbral     0.35000
## acierto   77.76156
## precision 56.68966
## cobertura 74.18773
## F1        64.26896
## AUC       83.00000

5.3.5 - Modelizamos con Arboles de decision

Creamos el primer modelo

formula_ar <- formula
ar<-rpart(formula_ar, train, method = 'class', parms = list(
  split = "information"), 
  control = rpart.control(cp = 0.00001))

Revisamos donde el error de validacion cruzada empieza a crecer

printcp(ar)

## 
## Classification tree:
## rpart(formula = formula_ar, data = train, method = "class", parms = list(split = "information"), 
##     control = rpart.control(cp = 1e-05))
## 
## Variables actually used in tree construction:
## [1] Contract            InternetService     MonthlyCharges_Disc
## [4] OnlineSecurity      PaymentMethod       TechSupport        
## [7] tenure_Disc         TotalCharges_Disc  
## 
## Root node error: 1315/4977 = 0.26422
## 
## n= 4977 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.05209125      0   1.00000 1.00000 0.023654
## 2  0.01254753      3   0.79772 0.79772 0.021881
## 3  0.00304183      5   0.77262 0.79772 0.021881
## 4  0.00273764     10   0.75741 0.79316 0.021835
## 5  0.00190114     16   0.73916 0.78099 0.021711
## 6  0.00152091     18   0.73536 0.77338 0.021632
## 7  0.00114068     23   0.72776 0.76730 0.021569
## 8  0.00101394     29   0.72091 0.76882 0.021585
## 9  0.00076046     33   0.71635 0.76882 0.021585
## 10 0.00043455     35   0.71483 0.79163 0.021819
## 11 0.00038023     42   0.71179 0.79011 0.021804
## 12 0.00019011     50   0.70875 0.79087 0.021812
## 13 0.00001000     58   0.70722 0.79316 0.021835

plotcp(ar)

Parece que minimiza aprox en 0.0013 de complejidad Generamos un nuevo arbol con ese parametro Ademas vamos a incluir un nuevo paramtero para que el arbol no tenga mas de 8 niveles

ar<-rpart(formula, train, method = 'class', parms = list(
  split = "information"), 
  control = rpart.control(cp = 0.0013,maxdepth = 7))

Revisamos de nuevo la complejidad

printcp(ar)

## 
## Classification tree:
## rpart(formula = formula, data = train, method = "class", parms = list(split = "information"), 
##     control = rpart.control(cp = 0.0013, maxdepth = 7))
## 
## Variables actually used in tree construction:
## [1] Contract            InternetService     MonthlyCharges_Disc
## [4] OnlineSecurity      PaymentMethod       TechSupport        
## [7] tenure_Disc         TotalCharges_Disc  
## 
## Root node error: 1315/4977 = 0.26422
## 
## n= 4977 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.0520913      0   1.00000 1.00000 0.023654
## 2 0.0125475      3   0.79772 0.79772 0.021881
## 3 0.0030418      5   0.77262 0.79087 0.021812
## 4 0.0027376     10   0.75741 0.78783 0.021781
## 5 0.0019011     15   0.74373 0.78099 0.021711
## 6 0.0015209     17   0.73992 0.78403 0.021742
## 7 0.0013000     20   0.73536 0.77262 0.021624

plotcp(ar)

Ahora parece bastante estable Vamos a crear el grafico del arbol para analizarlo

rpart.plot(ar,type=2,extra = 7, under = TRUE,under.cex = 0.7,fallen.leaves=F,gap = 0,cex=0.2,yesno = 2,box.palette = "GnYlRd",branch.lty = 3)

Otra visualizacion alternativa

prp(ar, type=2, extra=104, nn=TRUE, fallen.leaves=F,
faclen=0, varlen=0, shadow.col="grey", branch.lty=3)

Vamos a sacar las reglas que podrian ser utilizadas por ejemplo para hacer una implantacion del arbol

rpart.rules(ar,style = 'tall',cover = T)

## TARGET_Churn is 0.07 with cover 45% when
##     Contract is One year or Two year
## 
## TARGET_Churn is 0.16 with cover 8% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is DSL or No
## 
## TARGET_Churn is 0.22 with cover 7% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No internet service or Yes
##     MonthlyCharges_Disc is 01_MENOR_30 or 03_DE_50_A_75 or 04_DE_75_A_90
## 
## TARGET_Churn is 0.23 with cover 1% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No
##     TotalCharges_Disc is 02_DE_600_A_1500
## 
## TARGET_Churn is 0.25 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No internet service or Yes
##     PaymentMethod is Bank transfer (automatic) or Credit card (automatic)
##     MonthlyCharges_Disc is 02_DE_30_A_50
## 
## TARGET_Churn is 0.28 with cover 1% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 04_DE_3000_A_5000 or 05_Mayor_4000
## 
## TARGET_Churn is 0.30 with cover 7% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     PaymentMethod is Bank transfer (automatic) or Credit card (automatic) or Mailed check
## 
## TARGET_Churn is 0.30 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     TotalCharges_Disc is 01_MENOR_600 or 03_DE_1500_A_3000
##     MonthlyCharges_Disc is 05_Mayor_90
## 
## TARGET_Churn is 0.34 with cover 1% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No
##     PaymentMethod is Bank transfer (automatic) or Electronic check or Mailed check
##     TotalCharges_Disc is 01_MENOR_600
##     TechSupport is Yes
## 
## TARGET_Churn is 0.38 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 02_DE_600_A_1500 or 03_DE_1500_A_3000
##     MonthlyCharges_Disc is 03_DE_50_A_75 or 04_DE_75_A_90
## 
## TARGET_Churn is 0.42 with cover 1% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     MonthlyCharges_Disc is 03_DE_50_A_75 or 04_DE_75_A_90
## 
## TARGET_Churn is 0.43 with cover 2% when
##     Contract is Month-to-month
##     tenure_Disc is 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 04_DE_3000_A_5000 or 05_Mayor_4000
## 
## TARGET_Churn is 0.46 with cover 3% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No
##     PaymentMethod is Bank transfer (automatic) or Credit card (automatic) or Mailed check
##     TotalCharges_Disc is 01_MENOR_600
##     TechSupport is No
## 
## TARGET_Churn is 0.55 with cover 1% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 04_DE_3000_A_5000 or 05_Mayor_4000
## 
## TARGET_Churn is 0.57 with cover 3% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 01_MENOR_600
##     TechSupport is No
## 
## TARGET_Churn is 0.58 with cover 2% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 02_DE_600_A_1500 or 03_DE_1500_A_3000
##     MonthlyCharges_Disc is 03_DE_50_A_75 or 04_DE_75_A_90
## 
## TARGET_Churn is 0.61 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No internet service or Yes
##     PaymentMethod is Electronic check or Mailed check
##     MonthlyCharges_Disc is 02_DE_30_A_50
## 
## TARGET_Churn is 0.61 with cover 2% when
##     Contract is Month-to-month
##     tenure_Disc is 02_DE_15_A_30 or 03_DE_30_A_45 or 04_DE_45_A_60 or 05_Mayor_60
##     InternetService is Fiber optic
##     PaymentMethod is Electronic check
##     TotalCharges_Disc is 02_DE_600_A_1500 or 03_DE_1500_A_3000
##     MonthlyCharges_Disc is 05_Mayor_90
## 
## TARGET_Churn is 0.62 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is DSL or No
##     OnlineSecurity is No
##     PaymentMethod is Credit card (automatic)
##     TotalCharges_Disc is 01_MENOR_600
##     TechSupport is Yes
## 
## TARGET_Churn is 0.70 with cover 13% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is Fiber optic
##     OnlineSecurity is No
## 
## TARGET_Churn is 0.86 with cover 0% when
##     Contract is Month-to-month
##     tenure_Disc is 01_MENOR_15
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     TotalCharges_Disc is 02_DE_600_A_1500
##     MonthlyCharges_Disc is 05_Mayor_90

#style sirve para que la salida sea mas legible y cover añade el % de casos e los que aplica la regla

Podemos llevarnos el nodo final de cada cliente a un data.frame para poder hacer una explotacion posterior

#Para ello usarmos el predict especficio de rpart y con el parametro nn
ar_numnodos<-rpart.predict(ar,test,nn = T)
head(ar_numnodos)

##           0          1  nn
## 1 0.4296875 0.57031250 223
## 2 0.7765668 0.22343324  52
## 3 0.7765668 0.22343324  52
## 4 0.3913043 0.60869565 107
## 5 0.7023121 0.29768786  28
## 6 0.9348115 0.06518847   2

Vamos a calcular los scorings y evaluar el modelo

ar_predict<-predict(ar,test,type = 'prob')[,2]

Vemos que pinta tiene

plot(ar_predict~test$TARGET_Churn)

Con la funcion umbrales probamos diferentes cortes

umb_ar<-umbrales(test$TARGET_Churn,ar_predict)
umb_ar

##    umbral  acierto precision cobertura        F1
## 1    0.05  0.05000   0.05000  0.050000  0.050000
## 2    0.10 64.33090  42.23764 87.906137 57.059168
## 3    0.15 64.33090  42.23764 87.906137 57.059168
## 4    0.20 69.92701  46.77419 83.754513 60.025873
## 5    0.25 74.79319  52.24439 75.631769 61.799410
## 6    0.30 77.22628  56.86901 64.259928 60.338983
## 7    0.35 77.56691  57.81513 62.093863 59.878155
## 8    0.40 77.90754  58.50340 62.093863 60.245184
## 9    0.45 78.15085  59.84991 57.581227 58.693652
## 10   0.50 78.73479  62.26415 53.610108 57.613967
## 11   0.55 78.73479  62.26415 53.610108 57.613967
## 12   0.60 79.27007  69.27711 41.516245 51.918736
## 13   0.65 79.41606  73.47670 37.003610 49.219688
## 14   0.70 79.41606  73.47670 37.003610 49.219688
## 15   0.75 73.28467  85.71429  1.083032  2.139037
## 16   0.80 73.28467  85.71429  1.083032  2.139037
## 17   0.85 73.28467  85.71429  1.083032  2.139037
## 18   0.90  0.90000   0.90000  0.900000  0.900000
## 19   0.95  0.95000   0.95000  0.950000  0.950000

Seleccionamos automaticamente el mejor umbral

umbral_final_ar<-umb_ar[which.max(umb_ar$F1),1]
umbral_final_ar

## [1] 0.25

Evaluamos la matriz de confusion y las metricas con el umbral optimizado

confusion(test$TARGET_Churn,ar_predict,umbral_final_ar)

##     
## real FALSE TRUE
##    0  1118  383
##    1   135  419

ar_metricas<-filter(umb_ar,umbral==umbral_final_ar)
ar_metricas

##   umbral  acierto precision cobertura       F1
## 1   0.25 74.79319  52.24439  75.63177 61.79941

Evaluamos la ROC

#creamos el objeto prediction
ar_prediction<-prediction(ar_predict,test$TARGET_Churn)
#visualizamos la ROC
roc(ar_prediction)

Sacamos las metricas definitivas incluyendo el AUC

ar_metricas<-cbind(ar_metricas,AUC=round(auc(ar_prediction),2)*100)
print(t(ar_metricas))

##               [,1]
## umbral     0.25000
## acierto   74.79319
## precision 52.24439
## cobertura 75.63177
## F1        61.79941
## AUC       81.00000

5.3.6 - Modelizamos con Random Forest

Creamos el modelo

formula_rf <- formula
rf<-randomForest(formula_rf,train,importance=T)
rf

## 
## Call:
##  randomForest(formula = formula_rf, data = train, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.43%
## Confusion matrix:
##      0   1 class.error
## 0 3281 381   0.1040415
## 1  636 679   0.4836502

Visualizamos las variables mas importantes

varImpPlot(rf)

Como hay dos criterios vamos a crear una unica variable agregada y visualizarla para tener una mejor idea de la importancia de cada variable

importancia <- importance(rf)[,3:4]
#normalizamos para poner las dos variables en la misma scala. los valores negativos son las que menos predicen y los positivos las que mas
importancia_norm <- as.data.frame(scale(importancia))
#creamos una unica variable como suma de las otras
importancia_norm <- importancia_norm %>% mutate(
  Variable = rownames(importancia_norm),
  Imp_tot = MeanDecreaseAccuracy + MeanDecreaseGini) %>%
  mutate(Imp_tot = Imp_tot + abs(min(Imp_tot))) %>% 
  arrange(desc(Imp_tot)) %>% 
  select(Variable,Imp_tot,MeanDecreaseAccuracy,MeanDecreaseGini)
#hacemos un grafico para ver la curva de caida de importancia
ggplot(importancia_norm, aes(reorder(Variable,-Imp_tot),Imp_tot)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90,size = 7))

importancia_norm

##              Variable   Imp_tot MeanDecreaseAccuracy MeanDecreaseGini
## 1            Contract 5.0403996            0.5409954      1.998957328
## 2         tenure_Disc 4.1759059            0.8663800      0.809079058
## 3     InternetService 3.4191477            1.3633772     -0.444676411
## 4      OnlineSecurity 3.2740522            0.7283126      0.045292764
## 5       PaymentMethod 1.9239157           -0.5791162      0.002585071
## 6   TotalCharges_Disc 1.3084414           -0.4811461     -0.710859362
## 7         TechSupport 0.8617125           -1.1038236     -0.534910788
## 8 MonthlyCharges_Disc 0.0000000           -1.3349792     -1.165467660

La variable que claramente desentona de Montntly

a_mantener <- importancia_norm %>% 
  filter(Imp_tot > 1) %>% 
  select(Variable)

#Extraemos los nombres como un vector
a_mantener <- as.character((a_mantener$Variable))

Creamos de nuevo el modelo con las nuevas variables

formula_rf <- reformulate(a_mantener,target)
rf<-randomForest(formula_rf,train,importance=T)
rf

## 
## Call:
##  randomForest(formula = formula_rf, data = train, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.27%
## Confusion matrix:
##      0   1 class.error
## 0 3254 408   0.1114145
## 1  601 714   0.4570342

Aplicamos el modelo al conjunto de test, generando un vector con las probabilidades Notar que por el metodo predict de randomforest hay que poner el type=prob para tener el scoring, lo cual nos dara una matriz que nos tenemos que quedar con la segunda columna

rf_predict<-predict(rf,test,type = 'prob')[,2]

Vemos que pinta tiene

plot(rf_predict~test$TARGET_Churn)

Con la funcion umbrales probamos diferentes cortes

umb_rf<-umbrales(test$TARGET_Churn,rf_predict)
umb_rf

##    umbral  acierto precision cobertura       F1
## 1    0.05 71.24088  47.96480  78.70036 59.60355
## 2    0.10 74.59854  52.17391  69.31408 59.53488
## 3    0.15 75.57178  53.73563  67.50903 59.84000
## 4    0.20 76.54501  55.55556  64.98195 59.90017
## 5    0.25 77.46959  57.83133  60.64982 59.20705
## 6    0.30 77.76156  58.49387  60.28881 59.37778
## 7    0.35 78.24818  60.22945  56.85921 58.49582
## 8    0.40 78.68613  61.64659  55.41516 58.36502
## 9    0.45 78.92944  62.37219  55.05415 58.48514
## 10   0.50 78.92944  62.37219  55.05415 58.48514
## 11   0.55 78.83212  62.63270  53.24910 57.56098
## 12   0.60 79.27007  65.16588  49.63899 56.35246
## 13   0.65 78.97810  67.83626  41.87726 51.78571
## 14   0.70 79.07543  68.23529  41.87726 51.90157
## 15   0.75 78.97810  69.30380  39.53069 50.34483
## 16   0.80 79.22141  70.55016  39.35018 50.52144
## 17   0.85 79.07543  75.20325  33.39350 46.25000
## 18   0.90 78.63747  75.10917  31.04693 43.93359
## 19   0.95 77.66423  75.40107  25.45126 38.05668

Seleccionamos automaticamente el mejor umbral

umbral_final_rf<-umb_rf[which.max(umb_rf$F1),1]
umbral_final_rf

## [1] 0.2

Evaluamos la matriz de confusion y las metricas con el umbral optimizado

confusion(test$TARGET_Churn,rf_predict,umbral_final_rf)

##     
## real FALSE TRUE
##    0  1213  288
##    1   194  360

rf_metricas<-filter(umb_rf,umbral==umbral_final_rf)
rf_metricas

##   umbral  acierto precision cobertura       F1
## 1    0.2 76.54501  55.55556  64.98195 59.90017

Evaluamos la ROC

#creamos el objeto prediction
rf_prediction<-prediction(rf_predict,test$TARGET_Churn)
#visualizamos la ROC
roc(rf_prediction)

Sacamos las metricas definitivas incluyendo el AUC

rf_metricas<-cbind(rf_metricas,AUC=round(auc(rf_prediction),2)*100)
print(t(rf_metricas))

##               [,1]
## umbral     0.20000
## acierto   76.54501
## precision 55.55556
## cobertura 64.98195
## F1        59.90017
## AUC       81.00000

5.3.7 - Comparamos los 3 metodos

comparativa <- rbind(rl_metricas,ar_metricas,rf_metricas)
rownames(comparativa) <- c('Regresion Logistica','Arbol Decision','Random Forest')
t(comparativa) #t simplemente transpone para leerlo mejor

##           Regresion Logistica Arbol Decision Random Forest
## umbral                0.35000        0.25000       0.20000
## acierto              77.76156       74.79319      76.54501
## precision            56.68966       52.24439      55.55556
## cobertura            74.18773       75.63177      64.98195
## F1                   64.26896       61.79941      59.90017
## AUC                  83.00000       81.00000      81.00000

Conclusion:

La regresion Logistica parece que da mejores valores en cuanto a umbral y AUC que los otros 2 modelos

5.3.8 - Escribimos el scoring final en el dataset y guardamos el modelo

df$SCORING <- predict(rl,df,type = 'response')
saveRDS(rl,'03_modelo_final.rds')

saveRDS(df,'cache3.rds')

6. Evaluacion y analisis de negocio

Vamos a visualizar la contratacion real por tramos de scoring. Este grafico es muy potente para ver que el modelo es consistente, ya que debe presentar una linea descente en la tasa de contratacion conforme se desciende en el scoring

#Creamos una funcion para visualizar la contratacion real por percentiles de scoring
vis <- function(scoring,real) {
    #Preparar el dataframe de visualizaciÛn
    vis_df <- data.frame(Scoring = scoring, Perc_Scoring = cut_number(scoring, 15), Real = real)
    levels(vis_df$Perc_Scoring) <- seq(from = 100,to = 5,by = -5)
    vis_gr <- vis_df %>% group_by(Perc_Scoring) %>% summarise(Tasa_Contr = mean(as.numeric(as.character(Real)))) %>% arrange(Perc_Scoring)
    #ordenar el factor para el grafico
    vis_gr$Perc_Scoring <- factor(vis_gr$Perc_Scoring, levels = vis_gr$Perc_Scoring[order(vis_gr$Perc_Scoring, decreasing = T)])
    #hacemos el grafico
    ggplot(vis_gr,aes(Perc_Scoring, Tasa_Contr)) + 
      geom_col(fill='grey') + 
      geom_hline(aes(yintercept =      mean(as.numeric(as.character(vis_df$Real)))), col = 'black') +
      labs(title = 'Bajas reales por tramo de scoring', x = 'Tramo de Scoring', y = 'Bajas reales')
}
vis(df$SCORING,df$TARGET_Churn)

Creamos el DF con los clientes que no se han dado de baja. A estos les lanzaremos una campaña de fidelización en función del scoring del modelo creado.

df_real <- df %>%
  filter(TARGET_Churn==0) %>% 
  arrange(desc(SCORING)) %>%
  select(customerID,SCORING)

Suponemos una campaña de 1000 acciones

tamaño_campaña <- 1000
penetracion_target <- mean(as.numeric(as.character(df_real$TARGET_Churn)))
df_real %>% 
  arrange(desc(SCORING)) %>% 
  ggplot(aes(y = SCORING, x = seq_along(SCORING))) +
  geom_line() + 
  geom_vline(xintercept = tamaño_campaña, col = 'orange') +
  geom_hline(yintercept = penetracion_target,col='blue') +
  labs(x = 'CLIENTES ORDENADOS POR SCORING', y = 'SCORING')

creamos el DataFrame definitivo con los clientes que van a la campaña:

df_Campaña <- df_real %>%
  filter(row_number() <= 1000)

Limpiamos el entorno de cualquier cosa que no sea los 3 dataframe finales

a_borrar <- setdiff(ls(),c('df', 'df_real', 'df_Campaña'))
rm(list=c(a_borrar,'a_borrar'))

Guardamos los data frame finales

saveRDS(df_real,'df_real.rds')
saveRDS(df_Campaña,'df_Campaña.rds')

Proyecto Fin Curso

Germán Pérez

11/6/2020

Contexto

Desarrollo del Proyecto

1.Preparación datos y librerias

1.1 - Cargamos las librerias que vamos a utilizar

1.2. - Cargamos los datos

2. Analisis de datos

2.1 - Analisis exploratorio general y tipo de datos

2.2 - Calidad de datos: Estadísticos básicos

2.3 - Calidad de datos: Análisis de nulos y Ceros

2.4 - Calidad de datos: Análisis de atípicos

2.4.1 - Analizamos las que son de tipo numerico

2.5 - Analisis longitudinal

2.6 Revisamos los valores de las variables categoricas

2.7 - Acciones resultado del analisis de calidad de datos y exploratorio:

3 - Trasformación de datos

3.1 - Creacion de la variable target

3.2 - Preparacion de las variables independientes

3.2.1 - Preseleccion de variables independientes

3.2.1.1 - Preseleccion con RandomForest

3.2.1.2 - Preseleccion con Information Value

3.2.2 - Seleccionar la lista de variables finales del proyecto

3.3 - Fichero final y limpieza del entorno

3.3.1 - Fichero final

3.3.2 - Limpieza del entorno

4 - Creacion de variables sinteticas

4.1 - Discretizacion

TotalCharges

MonthlyCharges

tenure

4.3 - Limpieza

5. Modelizacion

5.1 - Preparar las funciones que vamos a necesitar

5.2 - Creamos las particiones de training (70%) y test (30%)

5.3 - Creación del modelo de propensión

5.3.1 - Identificamos las variables

5.3.2 - Creamos la formula para usar en el modelo

5.3.3 Guardamos los DataFrame

5.3.5 - Modelizamos con Arboles de decision

5.3.6 - Modelizamos con Random Forest

5.3.7 - Comparamos los 3 metodos

5.3.8 - Escribimos el scoring final en el dataset y guardamos el modelo

6. Evaluacion y analisis de negocio