Detalle del trabajo realizado

Importación: Cargamos los datos Usamos fread de data.table para una lectura mucho mas rapida

df <- fread('Telco.csv')

Muestreo: Analisis exploratorio

glimpse(df)

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CF...
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Femal...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "...
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49...
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No...
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service"...
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber ...
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes"...
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No",...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No",...
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "...
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", ...
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "...
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "O...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No...
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check"...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 2...
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1...
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No",...

2.1. Calidad de datos: Estadísticos básicos Hacemos un summary, con lapply que sale en formato de lista y se lee mejor

lapply(df,summary)

## $customerID
##    Length     Class      Mode 
##      7043 character character 
## 
## $gender
##    Length     Class      Mode 
##      7043 character character 
## 
## $SeniorCitizen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1621  0.0000  1.0000 
## 
## $Partner
##    Length     Class      Mode 
##      7043 character character 
## 
## $Dependents
##    Length     Class      Mode 
##      7043 character character 
## 
## $tenure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   32.37   55.00   72.00 
## 
## $PhoneService
##    Length     Class      Mode 
##      7043 character character 
## 
## $MultipleLines
##    Length     Class      Mode 
##      7043 character character 
## 
## $InternetService
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineSecurity
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineBackup
##    Length     Class      Mode 
##      7043 character character 
## 
## $DeviceProtection
##    Length     Class      Mode 
##      7043 character character 
## 
## $TechSupport
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingTV
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingMovies
##    Length     Class      Mode 
##      7043 character character 
## 
## $Contract
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaperlessBilling
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaymentMethod
##    Length     Class      Mode 
##      7043 character character 
## 
## $MonthlyCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.50   70.35   64.76   89.85  118.75 
## 
## $TotalCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    18.8   401.4  1397.5  2283.3  3794.7  8684.8      11 
## 
## $Churn
##    Length     Class      Mode 
##      7043 character character

Primeras observaciones:

*Podemos observar un archivo de una empresa de telecomunicaciones, con 7.043 registros y 21 variables, entre las que se encuentran productos contratados, métodos de pago, permanencias, facturacion mensual, género…

Nuestra target será Churn, detecteremos la tasa de abandono.

Debemos cambiar de formato algnunas variables actualmente en formato “character” a “factor”

Reservamos las variables a pasar a factor

a_factor <- c('gender','Partner','Dependents', 'PhoneService','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod', 'PaperlessBilling', 'Churn')

2.2 - Calidad de datos: Estadísticos básicos Hacemos un summary, con lapply que sale en formato de lista y se lee mejor

lapply(df,summary)

## $customerID
##    Length     Class      Mode 
##      7043 character character 
## 
## $gender
##    Length     Class      Mode 
##      7043 character character 
## 
## $SeniorCitizen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1621  0.0000  1.0000 
## 
## $Partner
##    Length     Class      Mode 
##      7043 character character 
## 
## $Dependents
##    Length     Class      Mode 
##      7043 character character 
## 
## $tenure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   32.37   55.00   72.00 
## 
## $PhoneService
##    Length     Class      Mode 
##      7043 character character 
## 
## $MultipleLines
##    Length     Class      Mode 
##      7043 character character 
## 
## $InternetService
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineSecurity
##    Length     Class      Mode 
##      7043 character character 
## 
## $OnlineBackup
##    Length     Class      Mode 
##      7043 character character 
## 
## $DeviceProtection
##    Length     Class      Mode 
##      7043 character character 
## 
## $TechSupport
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingTV
##    Length     Class      Mode 
##      7043 character character 
## 
## $StreamingMovies
##    Length     Class      Mode 
##      7043 character character 
## 
## $Contract
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaperlessBilling
##    Length     Class      Mode 
##      7043 character character 
## 
## $PaymentMethod
##    Length     Class      Mode 
##      7043 character character 
## 
## $MonthlyCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.50   70.35   64.76   89.85  118.75 
## 
## $TotalCharges
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    18.8   401.4  1397.5  2283.3  3794.7  8684.8      11 
## 
## $Churn
##    Length     Class      Mode 
##      7043 character character

En un principio los estadisticos básicos se muestran normales, bien es cierto que aplican a pocas variables, porque la gran mayoría son actualmente tipo character.

Observamos también algunos aspectos como tenure (permanencia), que tiene un plazo máximo de 72 meses. También podemos ver los cargos mensuales, de media rondan los 65€.

2.3 - Calidad de datos: Análisis de nulos

data.frame(colSums(is.na(df)))

##                  colSums.is.na.df..
## customerID                        0
## gender                            0
## SeniorCitizen                     0
## Partner                           0
## Dependents                        0
## tenure                            0
## PhoneService                      0
## MultipleLines                     0
## InternetService                   0
## OnlineSecurity                    0
## OnlineBackup                      0
## DeviceProtection                  0
## TechSupport                       0
## StreamingTV                       0
## StreamingMovies                   0
## Contract                          0
## PaperlessBilling                  0
## PaymentMethod                     0
## MonthlyCharges                    0
## TotalCharges                     11
## Churn                             0

Pocos datos nulos, exáctamente 11 registros en la variable TotalCharges. Sospechamos que pueden estar reconvertidos a ceros

2.4 - Calidad de datos: Análisis de ceros

No es algo que se haga siempre, pero en el analisis general superior habiamos detectado muchos ceros. Vamos a constuir una funcion concreta para analizar esto

contar_ceros <- function(variable) {
    temp <- transmute(df,if_else(variable==0,1,0))
    sum(temp)
}

num_ceros <- sapply(df,contar_ceros)
num_ceros <- data.frame(VARIABLE=names(num_ceros),CEROS = as.numeric(num_ceros),stringsAsFactors = F) #el as.numeric es para sacar solo el valor de num_ceros, sin el nombre
num_ceros <- num_ceros %>%
  arrange(desc(CEROS)) %>%
  mutate(PORCENTAJE = CEROS / nrow(df) * 100)
num_ceros

##            VARIABLE CEROS PORCENTAJE
## 1     SeniorCitizen  5901 83.7853188
## 2            tenure    11  0.1561834
## 3        customerID     0  0.0000000
## 4            gender     0  0.0000000
## 5           Partner     0  0.0000000
## 6        Dependents     0  0.0000000
## 7      PhoneService     0  0.0000000
## 8     MultipleLines     0  0.0000000
## 9   InternetService     0  0.0000000
## 10   OnlineSecurity     0  0.0000000
## 11     OnlineBackup     0  0.0000000
## 12 DeviceProtection     0  0.0000000
## 13      TechSupport     0  0.0000000
## 14      StreamingTV     0  0.0000000
## 15  StreamingMovies     0  0.0000000
## 16         Contract     0  0.0000000
## 17 PaperlessBilling     0  0.0000000
## 18    PaymentMethod     0  0.0000000
## 19   MonthlyCharges     0  0.0000000
## 20            Churn     0  0.0000000
## 21     TotalCharges    NA         NA

Encontramos que el mayor número de ceros es el de SeniorCitizen, lo que puede parecer normal, siendo una operadora con múltiples servicios enfocada a perfiles más jóvenes. En Tenure también has algunos registros, 11 en total, pero pueden ser contrataciones nuevas y es un número poco significativo.

2.5 - Calidad de datos: Análisis de atípicos

2.5.1 - Analizamos las que son de tipo numerico

out <- function(variable){
  t(t(head(sort(variable,decreasing = T),20))) #la doble traspuesta es un truco para que se visualice la salida, si no lo que crearia es una coleccion de dataframes que no se ven bien
}
lapply(df,function(x){
  if(is.double(x)) out(x)
})

## $customerID
## NULL
## 
## $gender
## NULL
## 
## $SeniorCitizen
## NULL
## 
## $Partner
## NULL
## 
## $Dependents
## NULL
## 
## $tenure
## NULL
## 
## $PhoneService
## NULL
## 
## $MultipleLines
## NULL
## 
## $InternetService
## NULL
## 
## $OnlineSecurity
## NULL
## 
## $OnlineBackup
## NULL
## 
## $DeviceProtection
## NULL
## 
## $TechSupport
## NULL
## 
## $StreamingTV
## NULL
## 
## $StreamingMovies
## NULL
## 
## $Contract
## NULL
## 
## $PaperlessBilling
## NULL
## 
## $PaymentMethod
## NULL
## 
## $MonthlyCharges
##         [,1]
##  [1,] 118.75
##  [2,] 118.65
##  [3,] 118.60
##  [4,] 118.60
##  [5,] 118.35
##  [6,] 118.20
##  [7,] 117.80
##  [8,] 117.60
##  [9,] 117.50
## [10,] 117.45
## [11,] 117.35
## [12,] 117.20
## [13,] 117.15
## [14,] 116.95
## [15,] 116.85
## [16,] 116.80
## [17,] 116.75
## [18,] 116.60
## [19,] 116.60
## [20,] 116.55
## 
## $TotalCharges
##          [,1]
##  [1,] 8684.80
##  [2,] 8672.45
##  [3,] 8670.10
##  [4,] 8594.40
##  [5,] 8564.75
##  [6,] 8547.15
##  [7,] 8543.25
##  [8,] 8529.50
##  [9,] 8496.70
## [10,] 8477.70
## [11,] 8477.60
## [12,] 8476.50
## [13,] 8468.20
## [14,] 8456.75
## [15,] 8443.70
## [16,] 8436.25
## [17,] 8425.30
## [18,] 8425.15
## [19,] 8424.90
## [20,] 8405.00
## 
## $Churn
## NULL

2.5.2 - Analizamos las que son de tipo integer

out <- function(variable){
  t(t(table(variable))) #la doble traspuesta es un truco para que se visualice la salida, si no lo que crearia es una coleccion de dataframes que no se ven bien
}
lapply(df,function(x){
  if(is.integer(x)) out(x)
})

## $customerID
## NULL
## 
## $gender
## NULL
## 
## $SeniorCitizen
##         
## variable [,1]
##        0 5901
##        1 1142
## 
## $Partner
## NULL
## 
## $Dependents
## NULL
## 
## $tenure
##         
## variable [,1]
##       0    11
##       1   613
##       2   238
##       3   200
##       4   176
##       5   133
##       6   110
##       7   131
##       8   123
##       9   119
##       10  116
##       11   99
##       12  117
##       13  109
##       14   76
##       15   99
##       16   80
##       17   87
##       18   97
##       19   73
##       20   71
##       21   63
##       22   90
##       23   85
##       24   94
##       25   79
##       26   79
##       27   72
##       28   57
##       29   72
##       30   72
##       31   65
##       32   69
##       33   64
##       34   65
##       35   88
##       36   50
##       37   65
##       38   59
##       39   56
##       40   64
##       41   70
##       42   65
##       43   65
##       44   51
##       45   61
##       46   74
##       47   68
##       48   64
##       49   66
##       50   68
##       51   68
##       52   80
##       53   70
##       54   68
##       55   64
##       56   80
##       57   65
##       58   67
##       59   60
##       60   76
##       61   76
##       62   70
##       63   72
##       64   80
##       65   76
##       66   89
##       67   98
##       68  100
##       69   95
##       70  119
##       71  170
##       72  362
## 
## $PhoneService
## NULL
## 
## $MultipleLines
## NULL
## 
## $InternetService
## NULL
## 
## $OnlineSecurity
## NULL
## 
## $OnlineBackup
## NULL
## 
## $DeviceProtection
## NULL
## 
## $TechSupport
## NULL
## 
## $StreamingTV
## NULL
## 
## $StreamingMovies
## NULL
## 
## $Contract
## NULL
## 
## $PaperlessBilling
## NULL
## 
## $PaymentMethod
## NULL
## 
## $MonthlyCharges
## NULL
## 
## $TotalCharges
## NULL
## 
## $Churn
## NULL

Aqui observamos un número significativo de clientes cuya tenure es de 1 mes, esto puede ser motivo de posible abandono.

2.6 - Analisis longitudinal

longi <- df %>% 
  summarise_all(mean) %>% #calcular la media de cada variable
  t() %>% #transponerlo para tenerlo en una sola columna y leerlo mejor
  as.data.frame() #reconvertirlo a dataframe porque t() lo pasa a matriz

## Warning in mean.default(customerID): argument is not numeric or logical:
## returning NA

## Warning in mean.default(gender): argument is not numeric or logical: returning
## NA

## Warning in mean.default(Partner): argument is not numeric or logical: returning
## NA

## Warning in mean.default(Dependents): argument is not numeric or logical:
## returning NA

## Warning in mean.default(PhoneService): argument is not numeric or logical:
## returning NA

## Warning in mean.default(MultipleLines): argument is not numeric or logical:
## returning NA

## Warning in mean.default(InternetService): argument is not numeric or logical:
## returning NA

## Warning in mean.default(OnlineSecurity): argument is not numeric or logical:
## returning NA

## Warning in mean.default(OnlineBackup): argument is not numeric or logical:
## returning NA

## Warning in mean.default(DeviceProtection): argument is not numeric or logical:
## returning NA

## Warning in mean.default(TechSupport): argument is not numeric or logical:
## returning NA

## Warning in mean.default(StreamingTV): argument is not numeric or logical:
## returning NA

## Warning in mean.default(StreamingMovies): argument is not numeric or logical:
## returning NA

## Warning in mean.default(Contract): argument is not numeric or logical: returning
## NA

## Warning in mean.default(PaperlessBilling): argument is not numeric or logical:
## returning NA

## Warning in mean.default(PaymentMethod): argument is not numeric or logical:
## returning NA

## Warning in mean.default(Churn): argument is not numeric or logical: returning NA

data.frame(variable = rownames(longi), media = longi$V1) %>% #crear un nuevo dataframe para poder ordenar por el nombre
  arrange(desc(variable)) #ordenar por el nombre para tener la vision longitudinal

##            variable      media
## 1      TotalCharges         NA
## 2            tenure 32.3711487
## 3       TechSupport         NA
## 4       StreamingTV         NA
## 5   StreamingMovies         NA
## 6     SeniorCitizen  0.1621468
## 7      PhoneService         NA
## 8     PaymentMethod         NA
## 9           Partner         NA
## 10 PaperlessBilling         NA
## 11   OnlineSecurity         NA
## 12     OnlineBackup         NA
## 13    MultipleLines         NA
## 14   MonthlyCharges 64.7616925
## 15  InternetService         NA
## 16           gender         NA
## 17 DeviceProtection         NA
## 18       Dependents         NA
## 19       customerID         NA
## 20         Contract         NA
## 21            Churn         NA

Conclusiones: Todos los datos son aparentemente normales.

2.7 - Acciones resultado del analisis de calidad de datos y exploratorio

Vamos a hacer lo siguiente: - Convertir a factor las variables almacenadas en a_factor - Eliminamos los valores nulos - Eliminamos la variable ID, no es relevante

df <-df %>%
  mutate_at(a_factor,as.factor)%>%
  na.omit(df)%>%
  select(-customerID)

2.8 - Graficos

ggplot(data = df, aes(x = tenure)) + geom_bar() + 
  xlab("Permanencia") + 
  ylab("Número de Usuarios") + 
  ggtitle("Histograma de permanencia ")

ggplot(data = df, aes(x = MonthlyCharges)) + geom_histogram(binwidth = 5) + 
  xlab("Factura mensual") + 
  ylab("Número de Usuarios") + 
  ggtitle("Histograma de Cargos mensuales ")

ggplot(data = df, aes(x = TotalCharges)) + geom_histogram(binwidth = 75) + 
  xlab("Cargos totales") + 
  ylab("Número de Usuarios") + 
  ggtitle("Histograma de Cargos totales ")

3 - Transformación de datos

3.1 - Creación de la target

df <- df %>% 
  mutate(TARGET = as.numeric(ifelse((Churn == "Yes"),1,0))) %>% #el as.numeric es para que los niveles del factor se queden como 0 y 1, y no como 1 y 2
  select(-Churn) #eliminamos la original para que no confunda

3.2 - Preparacion de las variables independientes

3.2.1 - Preseleccion de variables independientes Creamos una lista con todas las variables independientes. Pero para ver si predicen solo necesitamos un mes de las que son historicas (en nuestro caso tenemos datos de sólo 1 mes, así que entran todas menos la target)

independientes <- setdiff(names(df), "TARGET")

Creamos una muestra m menor para que los calculos sean mas rapidos

set.seed(12345)
m <- sample_n(df,500)

3.2.1.1 - Preseleccion con RandomForest

pre_rf <- randomForest(formula = reformulate(independientes,'TARGET'), data= m,mtry=2,ntree=50, importance = T)

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

imp_rf <- importance(pre_rf)[,2] #como importance devuelve una matriz con varias metricas, tenemos que extraer asi el decrecimiento en Gini que es el que mas nos interesa
imp_rf <- data.frame(VARIABLE = names(imp_rf), IMP_RF = imp_rf) #lo transformamos a dataframe
imp_rf <- imp_rf %>% arrange(desc(IMP_RF)) %>% mutate(RANKING_RF = 1:nrow(imp_rf)) #creamos el ranking

visualizamos

imp_rf

##            VARIABLE    IMP_RF RANKING_RF
## 1      TotalCharges 8.3917411          1
## 2    MonthlyCharges 7.7358742          2
## 3            tenure 7.6233634          3
## 4          Contract 4.3387693          4
## 5     PaymentMethod 4.2402255          5
## 6    OnlineSecurity 4.2116586          6
## 7   InternetService 4.0347408          7
## 8  DeviceProtection 2.4120151          8
## 9  PaperlessBilling 2.2838378          9
## 10      TechSupport 2.1595157         10
## 11  StreamingMovies 1.8830690         11
## 12     OnlineBackup 1.7484280         12
## 13    SeniorCitizen 1.7453463         13
## 14    MultipleLines 1.7215946         14
## 15          Partner 1.7037941         15
## 16      StreamingTV 1.5760623         16
## 17       Dependents 1.4833962         17
## 18           gender 1.2660705         18
## 19     PhoneService 0.3553374         19

3.2.1.2 - Preseleccion con Information Value

m2 <- mutate(m,TARGET = as.numeric(as.character(TARGET))) #transformo la target a numerico temporalmente porque este algoritmo necesita que este en numerico, y el as.character es para que lo convierta a 0 y 1, y no a 1 y 2
imp_iv <- smbinning.sumiv(m2[c(independientes,'TARGET')],y="TARGET")

##  
## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |--                                                |   5%
  |                                                        
  |-----                                             |  10%
  |                                                        
  |--------                                          |  15%
  |                                                        
  |----------                                        |  20%
  |                                                        
  |------------                                      |  25%
  |                                                        
  |---------------                                   |  30%
  |                                                        
  |------------------                                |  35%
  |                                                        
  |--------------------                              |  40%
  |                                                        
  |----------------------                            |  45%
  |                                                        
  |-------------------------                         |  50%
  |                                                        
  |----------------------------                      |  55%
  |                                                        
  |------------------------------                    |  60%
  |                                                        
  |--------------------------------                  |  65%
  |                                                        
  |-----------------------------------               |  70%
  |                                                        
  |--------------------------------------            |  75%
  |                                                        
  |----------------------------------------          |  80%
  |                                                        
  |------------------------------------------        |  85%
  |                                                        
  |---------------------------------------------     |  90%
  |                                                        
  |------------------------------------------------  |  95%
  |                                                        
  |--------------------------------------------------| 100%
##

imp_iv <- imp_iv %>% mutate(Ranking = 1:nrow(imp_iv)) %>% select(-Process)
names(imp_iv) <- c('VARIABLE','IMP_IV','RANKING_IV')

visualizamos

imp_iv

##            VARIABLE IMP_IV RANKING_IV
## 1          Contract 1.3470          1
## 2    OnlineSecurity 1.0306          2
## 3   InternetService 0.9788          3
## 4    MonthlyCharges 0.8235          4
## 5       TechSupport 0.7586          5
## 6            tenure 0.7311          6
## 7  DeviceProtection 0.6835          7
## 8      OnlineBackup 0.6561          8
## 9       StreamingTV 0.6105          9
## 10  StreamingMovies 0.5624         10
## 11    PaymentMethod 0.5033         11
## 12 PaperlessBilling 0.3614         12
## 13     TotalCharges 0.3097         13
## 14       Dependents 0.1217         14
## 15    MultipleLines 0.1133         15
## 16     PhoneService 0.0766         16
## 17          Partner 0.0617         17
## 18           gender 0.0141         18
## 19    SeniorCitizen     NA         19

3.2.1.3 - Preseleccion final

imp_final <- inner_join(imp_rf,imp_iv,by='VARIABLE') %>% 
  select(VARIABLE,IMP_RF,IMP_IV,RANKING_RF,RANKING_IV) %>% #ponerlos en orden mas legible
  mutate(RANKING_TOT = RANKING_RF + RANKING_IV) %>% 
  arrange(RANKING_TOT)
imp_final

##            VARIABLE    IMP_RF IMP_IV RANKING_RF RANKING_IV RANKING_TOT
## 1          Contract 4.3387693 1.3470          4          1           5
## 2    MonthlyCharges 7.7358742 0.8235          2          4           6
## 3    OnlineSecurity 4.2116586 1.0306          6          2           8
## 4            tenure 7.6233634 0.7311          3          6           9
## 5   InternetService 4.0347408 0.9788          7          3          10
## 6      TotalCharges 8.3917411 0.3097          1         13          14
## 7  DeviceProtection 2.4120151 0.6835          8          7          15
## 8       TechSupport 2.1595157 0.7586         10          5          15
## 9     PaymentMethod 4.2402255 0.5033          5         11          16
## 10     OnlineBackup 1.7484280 0.6561         12          8          20
## 11 PaperlessBilling 2.2838378 0.3614          9         12          21
## 12  StreamingMovies 1.8830690 0.5624         11         10          21
## 13      StreamingTV 1.5760623 0.6105         16          9          25
## 14    MultipleLines 1.7215946 0.1133         14         15          29
## 15       Dependents 1.4833962 0.1217         17         14          31
## 16    SeniorCitizen 1.7453463     NA         13         19          32
## 17          Partner 1.7037941 0.0617         15         17          32
## 18     PhoneService 0.3553374 0.0766         19         16          35
## 19           gender 1.2660705 0.0141         18         18          36

¿Son los metodos fiables? Vamos a hacer una correlacion entre ellos a ver si dan cosas similares

cor(imp_final$IMP_RF,imp_final$IMP_IV,use = 'complete.obs')

## [1] 0.4221867

los resultados de la correlacion son… 0,4. no me convencen

Decision: vamos a descartar aquellas variables que no hayan salido entre las 10 mas importantes en ninguno de los dos sistemas de seleccion de variables

ind_corta <- imp_final %>%
  filter(RANKING_RF <= 10 | RANKING_IV <= 10) %>% 
  select(VARIABLE) #nos quedamos solo con el nombre
ind_corta <- as.character(ind_corta$VARIABLE) #lo pasamos a un vector en vez de un dataframe

Estas son las variables predictoras con las que vamos a trabajar finalmente

ind_corta

##  [1] "Contract"         "MonthlyCharges"   "OnlineSecurity"   "tenure"          
##  [5] "InternetService"  "TotalCharges"     "DeviceProtection" "TechSupport"     
##  [9] "PaymentMethod"    "OnlineBackup"     "PaperlessBilling" "StreamingMovies" 
## [13] "StreamingTV"

- Variables Sinteticas Creación de variables a través de otras para aumentar la capacidad de predicción del modelo.

4.1 - Creacion de variables sinteticas

4.1.1 - Productos contratados Crearemos una variable que cuantifique los servicios contratados

Primero extraemos un valor lógico si el servicio está contratado o no:

telefono <- as.numeric(if_else(df$PhoneService == "Yes", 1,0))
multiples <- as.numeric(if_else(df$MultipleLines == "Yes", 1,0))
internet <- as.numeric(if_else(df$InternetService == "No", 0,1))
seguridad <- as.numeric(if_else(df$OnlineSecurity == "Yes", 1,0))
backup <- as.numeric(if_else(df$OnlineBackup == "Yes", 1,0))
proteccion <- as.numeric(if_else(df$DeviceProtection == "Yes", 1,0))
soporte <- as.numeric(if_else(df$TechSupport == "Yes", 1,0))
tv <- as.numeric(if_else(df$StreamingTV == "Yes", 1,0))
pelis <- as.numeric(if_else(df$StreamingMovies == "Yes", 1,0))

Hacemos sumatoria y almacenamos en servicios contratados

df <-df %>%
  mutate (serv_contratados = (
    telefono + multiples + internet + 
    seguridad + backup + proteccion + 
      soporte + tv + pelis))

Eliminamos variables creadas

rm('telefono', 'multiples','internet', 'seguridad', 'backup' , 'proteccion' , 'soporte' , 'tv' , 'pelis')

4.1.2 - Finalización permanencia

Nos interesa saber si la vinculación con la empresa está a punto de expirar, por lo que estableceremos una variable lógica que nos indique si su tenure es de 1 mes.

df<- df%>%
  mutate(permanencia = ifelse(tenure == 1, 1, 0))

4.1.3 - Consumos altos

Igualmente crearemos una variable lógica con indicador de consumos altos mensuales, para ello establecemos como alto consumo a los cargos superiores al 3er cuartil.

cuartil <- quantile (df$MonthlyCharges, prob = c(0.75))

df <- df%>%
  mutate(consumo = ifelse(df$MonthlyCharges >= cuartil, 1, 0))

Guardamos cache temporal

saveRDS(df,'cacheT1.rds')

Cargamos el cache temporal

df <- readRDS('cacheT1.rds')

4.2 - Discretizacion Primero vamos a crear la funcion que va a discretizar de forma automatica maximizando la capacidad predictiva de la nueva variable Ademas, como vamos a usar en la modelizacion un algoritmo lineal, que es la regresion logistica, vamos a intentar que la discretizacion sea monotonica

discretizar <- function(vi,target){
  temp_df <- data.frame(vi = vi, target = target)
  #smbinning necesita que la target sea numerica
  temp_df$target <- as.numeric(as.character(temp_df$target))
  disc <- smbinning(temp_df, y = 'target', x = 'vi')
  return(disc)
}

4.2.1 - Discretizamos tenure

disc_temp_tenure <- discretizar(df$tenure,df$TARGET)
df_temp <- select(df,tenure,TARGET) #creamos este temporal porque smbinning.gen necesita que el df tenga el mismo numero de columnas que la salida de la funcion discretizar
df_temp <- smbinning.gen(df_temp,disc_temp_tenure,chrname = 'tenure_DISC')
#Metemos en df la nueva variable discretizada y eliminamos la original
df <- cbind(df,df_temp[3]) %>% select(-tenure)

4.2.2 - Discretizamos MonthlyCharges

disc_temp_MonthlyCharges <- discretizar(df$MonthlyCharges,df$TARGET)
df_temp <- select(df,MonthlyCharges,TARGET) #creamos este temporal porque smbinning.gen necesita que el df tenga el mismo numero de columnas que la salida de la funcion discretizar
df_temp <- smbinning.gen(df_temp,disc_temp_MonthlyCharges,chrname = 'MonthlyCharges_DISC')
#Metemos en df la nueva variable discretizada y eliminamos la original
df <- cbind(df,df_temp[3]) %>% select(-MonthlyCharges)

4.2.3 - Discretizamos TotalCharges

disc_temp_TotalCharges <- discretizar(df$TotalCharges,df$TARGET)
df_temp <- select(df,TotalCharges,TARGET) #creamos este temporal porque smbinning.gen necesita que el df tenga el mismo numero de columnas que la salida de la funcion discretizar
df_temp <- smbinning.gen(df_temp,disc_temp_TotalCharges,chrname = 'TotalCharges_DISC')
#Metemos en df la nueva variable discretizada y eliminamos la original
df <- cbind(df,df_temp[3]) %>% select(-TotalCharges)

Vamos a hacer una inspeccion visual de todas las variables a ver si han salido bien

df %>% 
  select_if(is.factor) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~ key, scales = "free") +
    theme(axis.text=element_text(size=4))#esto es para cambiar el tamaño del texto del eje y que se lea bien

## Warning: attributes are not identical across measure variables;
## they will be dropped

Ahora vamos a analizar la penetración de la target en cada categoría para ver si las variables han salido monotónicas

a <- function(var1,var2) {
  df_temp <- data.frame(var1 = df[[var1]],var2 = df[[var2]])
  df_temp %>% 
    group_by(var1) %>% 
    summarise(Conteo = n(), Porc = mean(as.numeric(as.character(var2)))) %>% 
  ggplot(aes(var1,Porc)) + geom_bar(stat='identity') + xlab(var1)
}
df2_nombres <- df %>% select_if(is.factor) %>% names()
lapply(df2_nombres,function(x){a(x,'TARGET')})

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

## 
## [[18]]

Antes de continuar vamos a guardar en un objeto de R las discretizaciones, porque las necesitaremos despues para poner el modelo en produccion

#Vamos a crear un objeto de tipo lista que es lo ideal para guardar objetos complejos como las discretizaciones
discretizaciones <- list(
 disc_temp_tenure = disc_temp_tenure,
 disc_temp_MonthlyCharges = disc_temp_MonthlyCharges, 
 disc_temp_TotalCharges = disc_temp_TotalCharges
 )
saveRDS(discretizaciones,'01_CortesDiscretizaciones.rds')

4.3 - Ordenación y limpieza del dataframe

4.3.1 - Dejaremos el df con las variables elegidas como preferentes para la modelización, obteniéndolas de los procesos de selección de independientes (ind_corta), sinteticas y discretizadas.

Para hacerlo más facil identificamos las variables actuales en el df y eliminamos las que no coinciden con ind_corta, manteniendo las sinteticas y discretizadas.

names(df)

##  [1] "gender"              "SeniorCitizen"       "Partner"            
##  [4] "Dependents"          "PhoneService"        "MultipleLines"      
##  [7] "InternetService"     "OnlineSecurity"      "OnlineBackup"       
## [10] "DeviceProtection"    "TechSupport"         "StreamingTV"        
## [13] "StreamingMovies"     "Contract"            "PaperlessBilling"   
## [16] "PaymentMethod"       "TARGET"              "serv_contratados"   
## [19] "permanencia"         "consumo"             "tenure_DISC"        
## [22] "MonthlyCharges_DISC" "TotalCharges_DISC"

ind_corta

##  [1] "Contract"         "MonthlyCharges"   "OnlineSecurity"   "tenure"          
##  [5] "InternetService"  "TotalCharges"     "DeviceProtection" "TechSupport"     
##  [9] "PaymentMethod"    "OnlineBackup"     "PaperlessBilling" "StreamingMovies" 
## [13] "StreamingTV"

Aplicamos un método menos “técnico” al hacerlo de forma manual, pero es la forma más rápida y sencilla dadas las características del df.

df <- df%>%
  select(-gender, -SeniorCitizen, -Partner, -Dependents, -PhoneService, -MultipleLines)

Ahora incluiremos un id al dataframe y ordenaremos los nombres de las variables

df <- df %>% mutate(id = row_number())

centrales <- setdiff(names(df),c('id','TARGET'))
df <- df %>% select(
  id,
  one_of(centrales),
  TARGET)

4.3.2. - Limpieza Limpiamos el entorno de cualquier cosa que no sea el dataframe

a_borrar <- setdiff(ls(),'df')
rm(list=c(a_borrar,'a_borrar'))

Guardamos otro cache temporal

saveRDS(df,'cacheT2.rds')

Cargamos el cache temporal

df <- readRDS('cacheT2.rds')

Modelizacion

5.1 - Preparar las funciones que vamos a necesitar

Funcion para crear una matriz de confusion

confusion<-function(real,scoring,umbral){ 
  conf<-table(real,scoring>=umbral)
  if(ncol(conf)==2) return(conf) else return(NULL)
}

Funcion para calcular las metricas de los modelos: acierto, precision, cobertura y F1

metricas<-function(matriz_conf){
  acierto <- (matriz_conf[1,1] + matriz_conf[2,2]) / sum(matriz_conf) *100
  precision <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[1,2]) *100
  cobertura <- matriz_conf[2,2] / (matriz_conf[2,2] + matriz_conf[2,1]) *100
  F1 <- 2*precision*cobertura/(precision+cobertura)
  salida<-c(acierto,precision,cobertura,F1)
  return(salida)
}

Funcion para probar distintos umbrales y ver el efecto sobre precision y cobertura

umbrales<-function(real,scoring){
  umbrales<-data.frame(umbral=rep(0,times=19),acierto=rep(0,times=19),precision=rep(0,times=19),cobertura=rep(0,times=19),F1=rep(0,times=19))
  cont <- 1
  for (cada in seq(0.05,0.95,by = 0.05)){
    datos<-metricas(confusion(real,scoring,cada))
    registro<-c(cada,datos)
    umbrales[cont,]<-registro
    cont <- cont + 1
  }
  return(umbrales)
}

Funciones que calculan la curva ROC y el AUC

roc<-function(prediction){
  r<-performance(prediction,'tpr','fpr')
  plot(r)
}

auc<-function(prediction){
  a<-performance(prediction,'auc')
  return(a@y.values[[1]])
}

5.2 - Creamos las particiones de training (70%) y test (30%)

Establecemos una semilla para que nos salgan los mismos resultados

set.seed(12345)

Generamos una variable aleatoria con una distribucion 70-30

df$random<-sample(0:1,size = nrow(df),replace = T,prob = c(0.3,0.7))

Creamos los dos dataframes

train<-filter(df,random==1)
test<-filter(df,random==0)
#Eliminamos ya la random para que no moleste
df$random <- NULL

5.3 - Creación del modelo de propensión Nota: Vamos a probar dos algoritmos diferentes para ver cual funciona mejor y aprender como se comparan

5.3.1 - Identificamos las variables

#Las independientes seran todas menos el codigo cliente y la target
independientes <- setdiff(names(df),c('id','TARGET'))
target <- 'TARGET'

5.3.2 - Creamos la formula para usar en el modelo

formula <- reformulate(independientes,target)

5.3.3 - Modelizamos con regresion logistica

Primero vamos a hacer un modelo con todas las variables

formula_rl <- formula
rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)

## 
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3733  -0.6801  -0.2698   0.6016   3.1304  
## 
## Coefficients: (7 not defined because of singularities)
##                                      Estimate Std. Error z value
## (Intercept)                          -3.55949    0.58664  -6.068
## InternetServiceFiber optic            1.55288    0.25683   6.046
## InternetServiceNo                    -1.04259    0.31970  -3.261
## OnlineSecurityNo internet service          NA         NA      NA
## OnlineSecurityYes                    -0.54571    0.12585  -4.336
## OnlineBackupNo internet service            NA         NA      NA
## OnlineBackupYes                      -0.32066    0.11884  -2.698
## DeviceProtectionNo internet service        NA         NA      NA
## DeviceProtectionYes                  -0.11285    0.12195  -0.925
## TechSupportNo internet service             NA         NA      NA
## TechSupportYes                       -0.45312    0.12603  -3.595
## StreamingTVNo internet service             NA         NA      NA
## StreamingTVYes                        0.30876    0.12432   2.484
## StreamingMoviesNo internet service         NA         NA      NA
## StreamingMoviesYes                    0.26089    0.12366   2.110
## ContractOne year                     -0.62701    0.12820  -4.891
## ContractTwo year                     -1.41830    0.22294  -6.362
## PaperlessBillingYes                   0.36239    0.09012   4.021
## PaymentMethodCredit card (automatic) -0.11936    0.13526  -0.882
## PaymentMethodElectronic check         0.22929    0.11256   2.037
## PaymentMethodMailed check            -0.29548    0.13779  -2.144
## serv_contratados                      0.28363    0.08891   3.190
## permanencia                           3.48898    0.51539   6.770
## consumo                              -0.40139    0.17358  -2.312
## tenure_DISC02 <= 5                    2.75955    0.48974   5.635
## tenure_DISC03 <= 16                   2.32378    0.47296   4.913
## tenure_DISC04 <= 22                   2.02087    0.48490   4.168
## tenure_DISC05 <= 49                   1.66819    0.45844   3.639
## tenure_DISC06 <= 59                   1.47877    0.43992   3.361
## tenure_DISC07 <= 70                   1.23137    0.40302   3.055
## tenure_DISC08 > 70                         NA         NA      NA
## MonthlyCharges_DISC02 <= 55.95       -0.02933    0.32474  -0.090
## MonthlyCharges_DISC03 <= 68.8        -0.94236    0.39944  -2.359
## MonthlyCharges_DISC04 <= 106.75      -1.02365    0.45160  -2.267
## MonthlyCharges_DISC05 > 106.75       -1.21762    0.55304  -2.202
## TotalCharges_DISC02 <= 3233.85       -0.36112    0.19201  -1.881
## TotalCharges_DISC03 <= 5643.4        -0.67431    0.26404  -2.554
## TotalCharges_DISC04 > 5643.4         -0.43254    0.36788  -1.176
##                                             Pr(>|z|)    
## (Intercept)                          0.0000000012981 ***
## InternetServiceFiber optic           0.0000000014817 ***
## InternetServiceNo                           0.001110 ** 
## OnlineSecurityNo internet service                 NA    
## OnlineSecurityYes                    0.0000145042383 ***
## OnlineBackupNo internet service                   NA    
## OnlineBackupYes                             0.006971 ** 
## DeviceProtectionNo internet service               NA    
## DeviceProtectionYes                         0.354763    
## TechSupportNo internet service                    NA    
## TechSupportYes                              0.000324 ***
## StreamingTVNo internet service                    NA    
## StreamingTVYes                              0.013007 *  
## StreamingMoviesNo internet service                NA    
## StreamingMoviesYes                          0.034880 *  
## ContractOne year                     0.0000010037711 ***
## ContractTwo year                     0.0000000001995 ***
## PaperlessBillingYes                  0.0000579023513 ***
## PaymentMethodCredit card (automatic)        0.377522    
## PaymentMethodElectronic check               0.041648 *  
## PaymentMethodMailed check                   0.031999 *  
## serv_contratados                            0.001422 ** 
## permanencia                          0.0000000000129 ***
## consumo                                     0.020758 *  
## tenure_DISC02 <= 5                   0.0000000175387 ***
## tenure_DISC03 <= 16                  0.0000008956442 ***
## tenure_DISC04 <= 22                  0.0000307798802 ***
## tenure_DISC05 <= 49                         0.000274 ***
## tenure_DISC06 <= 59                         0.000775 ***
## tenure_DISC07 <= 70                         0.002248 ** 
## tenure_DISC08 > 70                                NA    
## MonthlyCharges_DISC02 <= 55.95              0.928022    
## MonthlyCharges_DISC03 <= 68.8               0.018314 *  
## MonthlyCharges_DISC04 <= 106.75             0.023407 *  
## MonthlyCharges_DISC05 > 106.75              0.027688 *  
## TotalCharges_DISC02 <= 3233.85              0.060015 .  
## TotalCharges_DISC03 <= 5643.4               0.010656 *  
## TotalCharges_DISC04 > 5643.4                0.239688    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5796.6  on 4933  degrees of freedom
## Residual deviance: 4052.5  on 4903  degrees of freedom
## AIC: 4114.5
## 
## Number of Fisher Scoring iterations: 6

Revisamos la significatividad y mantenemos todas las variables que tengan tres estrellas en alguna categoria, menos SALDO1ER_PASIVO_TEND porque ya estan entrando otras dos de la misma variable origen

a_mantener <- c(
  'InternetService',
  'OnlineSecurity',
  'TechSupport',
  'Contract',
  'PaperlessBilling',
  'tenure_DISC',
  'permanencia'
)

Volvemos a modelizar

formula_rl <- reformulate(a_mantener,target)
rl<- glm(formula_rl,train,family=binomial(link='logit'))
summary(rl)

## 
## Call:
## glm(formula = formula_rl, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0081  -0.7091  -0.2855   0.5904   3.0461  
## 
## Coefficients: (3 not defined because of singularities)
##                                   Estimate Std. Error z value
## (Intercept)                        0.38315    0.13703   2.796
## InternetServiceFiber optic         1.04109    0.09129  11.405
## InternetServiceNo                 -1.42445    0.15302  -9.309
## OnlineSecurityNo internet service       NA         NA      NA
## OnlineSecurityYes                 -0.46415    0.09810  -4.731
## TechSupportNo internet service          NA         NA      NA
## TechSupportYes                    -0.28219    0.09720  -2.903
## ContractOne year                  -0.63000    0.12404  -5.079
## ContractTwo year                  -1.44055    0.21699  -6.639
## PaperlessBillingYes                0.44901    0.08800   5.102
## tenure_DISC02 <= 5                -0.78426    0.15136  -5.181
## tenure_DISC03 <= 16               -1.30232    0.14263  -9.131
## tenure_DISC04 <= 22               -1.62027    0.17797  -9.104
## tenure_DISC05 <= 49               -1.95739    0.14716 -13.301
## tenure_DISC06 <= 59               -2.14782    0.19717 -10.893
## tenure_DISC07 <= 70               -2.23149    0.20877 -10.689
## tenure_DISC08 > 70                -3.33092    0.42015  -7.928
## permanencia                             NA         NA      NA
##                                               Pr(>|z|)    
## (Intercept)                                    0.00517 ** 
## InternetServiceFiber optic        < 0.0000000000000002 ***
## InternetServiceNo                 < 0.0000000000000002 ***
## OnlineSecurityNo internet service                   NA    
## OnlineSecurityYes                  0.00000223111739051 ***
## TechSupportNo internet service                      NA    
## TechSupportYes                                 0.00369 ** 
## ContractOne year                   0.00000037905555435 ***
## ContractTwo year                   0.00000000003164912 ***
## PaperlessBillingYes                0.00000033546801185 ***
## tenure_DISC02 <= 5                 0.00000022027276229 ***
## tenure_DISC03 <= 16               < 0.0000000000000002 ***
## tenure_DISC04 <= 22               < 0.0000000000000002 ***
## tenure_DISC05 <= 49               < 0.0000000000000002 ***
## tenure_DISC06 <= 59               < 0.0000000000000002 ***
## tenure_DISC07 <= 70               < 0.0000000000000002 ***
## tenure_DISC08 > 70                 0.00000000000000223 ***
## permanencia                                         NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5796.6  on 4933  degrees of freedom
## Residual deviance: 4165.8  on 4919  degrees of freedom
## AIC: 4195.8
## 
## Number of Fisher Scoring iterations: 6

Vemos que ahora ya todas las variables tienen al menos una categoria con 3 estrellas de significacion. Comprobaremos este modelo sobre el conjunto de test:

Y calculamos el pseudo R cuadrado:

pr2_rl <- 1 -(rl$deviance / rl$null.deviance)
pr2_rl

## [1] 0.2813361

Aplicamos el modelo al conjunto de test, generando un vector con las probabilidades

rl_predict<-predict(rl,test,type = 'response')

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

Vemos que pinta tiene

plot(rl_predict~test$TARGET)

Ahora tenemos que transformar la probabilidad en una decision de si el cliente va a comprar o no

Con la funcion umbrales probamos diferentes cortes

umb_rl<-umbrales(test$TARGET,rl_predict)
umb_rl

##    umbral  acierto precision cobertura       F1
## 1    0.05 49.95234  32.70725 97.868217 49.02913
## 2    0.10 59.81888  37.56654 95.736434 53.95958
## 3    0.15 65.72927  41.22731 92.441860 57.02331
## 4    0.20 69.73308  44.31710 89.922481 59.37300
## 5    0.25 73.49857  47.83550 85.658915 61.38889
## 6    0.30 76.21544  51.05068 80.038760 62.33962
## 7    0.35 77.74071  53.38866 74.806202 62.30831
## 8    0.40 79.59962  57.07395 68.798450 62.39016
## 9    0.45 80.45758  59.96241 61.821705 60.87786
## 10   0.50 80.69590  63.24582 51.356589 56.68449
## 11   0.55 81.17255  66.57534 47.093023 55.16459
## 12   0.60 81.31554  71.08844 40.503876 51.60494
## 13   0.65 79.55195  72.77487 26.937984 39.32107
## 14   0.70 78.93232  78.46154 19.767442 31.57895
## 15   0.75 77.16873  81.35593  9.302326 16.69565
## 16   0.80 77.21640  82.75862  9.302326 16.72474
## 17   0.85 76.64442  82.50000  6.395349 11.87050
## 18   0.90  0.90000   0.90000  0.900000  0.90000
## 19   0.95  0.95000   0.95000  0.950000  0.95000

Seleccionamos el umbral que maximiza la F1

umbral_final_rl<-umb_rl[which.max(umb_rl$F1),1]
umbral_final_rl

## [1] 0.4

Evaluamos la matriz de confusion y las metricas con el umbral optimizado

confusion(test$TARGET,rl_predict,umbral_final_rl)

##     
## real FALSE TRUE
##    0  1315  267
##    1   161  355

rl_metricas<-filter(umb_rl,umbral==umbral_final_rl)
rl_metricas

##   umbral  acierto precision cobertura       F1
## 1    0.4 79.59962  57.07395  68.79845 62.39016

Evaluamos la ROC

#creamos el objeto prediction
rl_prediction<-prediction(rl_predict,test$TARGET)
#visualizamos la ROC
roc(rl_prediction)

Sacamos las metricas definitivas incluyendo el AUC

rl_metricas<-cbind(rl_metricas,AUC=round(auc(rl_prediction),2)*100)
print(t(rl_metricas))

##               [,1]
## umbral     0.40000
## acierto   79.59962
## precision 57.07395
## cobertura 68.79845
## F1        62.39016
## AUC       85.00000

5.3.4 - Modelizamos con Arboles de decision

Creamos el primer modelo

formula_ar <- formula
ar<-rpart(formula_ar, train, method = 'class', parms = list(
  split = "information"), 
  control = rpart.control(cp = 0.00001))

Revisamos donde el error de validacion cruzada empieza a crecer

printcp(ar)

## 
## Classification tree:
## rpart(formula = formula_ar, data = train, method = "class", parms = list(split = "information"), 
##     control = rpart.control(cp = 0.00001))
## 
## Variables actually used in tree construction:
##  [1] consumo             Contract            DeviceProtection   
##  [4] InternetService     MonthlyCharges_DISC OnlineBackup       
##  [7] OnlineSecurity      PaperlessBilling    PaymentMethod      
## [10] permanencia         serv_contratados    StreamingMovies    
## [13] StreamingTV         TechSupport         tenure_DISC        
## [16] TotalCharges_DISC  
## 
## Root node error: 1353/4934 = 0.27422
## 
## n= 4934 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.05691057      0   1.00000 1.00000 0.023161
## 2  0.01219512      3   0.79084 0.81449 0.021623
## 3  0.01108647      5   0.76644 0.81005 0.021580
## 4  0.00665188      7   0.74427 0.77088 0.021197
## 5  0.00443459      8   0.73762 0.76497 0.021137
## 6  0.00406504      9   0.73319 0.76497 0.021137
## 7  0.00369549     15   0.70584 0.76201 0.021107
## 8  0.00295639     16   0.70214 0.75831 0.021069
## 9  0.00221729     18   0.69623 0.75388 0.021024
## 10 0.00172456     23   0.68514 0.74945 0.020978
## 11 0.00147820     26   0.67997 0.75166 0.021001
## 12 0.00110865     38   0.66223 0.75388 0.021024
## 13 0.00098546     41   0.65854 0.75462 0.021032
## 14 0.00073910     50   0.64597 0.75314 0.021016
## 15 0.00049273     61   0.63784 0.76718 0.021160
## 16 0.00036955     70   0.63341 0.78197 0.021308
## 17 0.00027716     74   0.63193 0.79823 0.021467
## 18 0.00024637     83   0.62897 0.80414 0.021524
## 19 0.00018477     98   0.62528 0.80414 0.021524
## 20 0.00012318    102   0.62454 0.81153 0.021594
## 21 0.00001000    108   0.62380 0.81818 0.021657

plotcp(ar)

Parece que minimiza aprox en cp = 0.0016 de complejidad Generamos un nuevo arbol con ese parametro Ademas vamos a incluir un nuevo paramtero para que el arbol no tenga mas de 7 niveles

ar<-rpart(formula, train, method = 'class', parms = list(
  split = "information"), 
  control = rpart.control(cp = 0.0016,maxdepth = 7))

Revisamos de nuevo la complejidad

printcp(ar)

## 
## Classification tree:
## rpart(formula = formula, data = train, method = "class", parms = list(split = "information"), 
##     control = rpart.control(cp = 0.0016, maxdepth = 7))
## 
## Variables actually used in tree construction:
##  [1] consumo             Contract            InternetService    
##  [4] MonthlyCharges_DISC OnlineSecurity      PaperlessBilling   
##  [7] PaymentMethod       permanencia         serv_contratados   
## [10] StreamingMovies     StreamingTV         TechSupport        
## [13] tenure_DISC         TotalCharges_DISC  
## 
## Root node error: 1353/4934 = 0.27422
## 
## n= 4934 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0569106      0   1.00000 1.00000 0.023161
## 2  0.0121951      3   0.79084 0.79970 0.021481
## 3  0.0110865      5   0.76644 0.79157 0.021402
## 4  0.0066519      7   0.74427 0.77310 0.021219
## 5  0.0044346      8   0.73762 0.77014 0.021190
## 6  0.0040650      9   0.73319 0.76423 0.021130
## 7  0.0036955     15   0.70584 0.76349 0.021122
## 8  0.0029564     16   0.70214 0.75758 0.021062
## 9  0.0022173     18   0.69623 0.75610 0.021047
## 10 0.0016000     20   0.69180 0.75240 0.021009

plotcp(ar)

Conseguimos con estos parámetros que el error cruzado no llegue a subir, así que seleccionamos este árbol como definitivo.

Vamos a crear el grafico del arbol para analizarlo

rpart.plot(ar,type=2,extra = 7, under = TRUE,under.cex = 0.7,fallen.leaves=F,gap = 0,cex=0.2,yesno = 2,box.palette = "GnYlRd",branch.lty = 3)

Vamos a sacar las reglas que podrian ser utilizadas por ejemplo para hacer una implantacion del arbol

rpart.rules(ar,style = 'tall',cover = T)

## TARGET is 0.07 with cover 44% when
##     Contract is One year or Two year
## 
## TARGET is 0.14 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Credit card (automatic) or Mailed check
##     permanencia is 0
##     serv_contratados < 3
## 
## TARGET is 0.19 with cover 15% when
##     Contract is Month-to-month
##     tenure_DISC is 03 <= 16 or 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is DSL or No
## 
## TARGET is 0.20 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     PaperlessBilling is Yes
##     TechSupport is No
##     MonthlyCharges_DISC is 03 <= 68.8
## 
## TARGET is 0.28 with cover 1% when
##     Contract is Month-to-month
##     tenure_DISC is 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     PaymentMethod is Electronic check
## 
## TARGET is 0.28 with cover 1% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     permanencia is 0
##     StreamingTV is No
## 
## TARGET is 0.29 with cover 5% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     TechSupport is No internet service or Yes
## 
## TARGET is 0.30 with cover 7% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is Fiber optic
##     PaymentMethod is Bank transfer (automatic) or Credit card (automatic) or Mailed check
## 
## TARGET is 0.33 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     PaymentMethod is Electronic check
##     consumo is 0
## 
## TARGET is 0.38 with cover 1% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     PaperlessBilling is Yes
##     TotalCharges_DISC is 04 > 5643.4
## 
## TARGET is 0.41 with cover 1% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     PaperlessBilling is No
## 
## TARGET is 0.43 with cover 2% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     PaymentMethod is Bank transfer (automatic) or Electronic check or Mailed check
##     PaperlessBilling is No
##     TechSupport is No
##     StreamingMovies is No
## 
## TARGET is 0.54 with cover 1% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Bank transfer (automatic) or Electronic check
##     permanencia is 0
##     serv_contratados < 3
## 
## TARGET is 0.60 with cover 5% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22 or 05 <= 49 or 06 <= 59 or 07 <= 70 or 08 > 70
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     PaymentMethod is Electronic check
##     PaperlessBilling is Yes
##     TotalCharges_DISC is 02 <= 3233.85 or 03 <= 5643.4
## 
## TARGET is 0.64 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     PaperlessBilling is No
##     TechSupport is No
##     StreamingMovies is Yes
## 
## TARGET is 0.65 with cover 3% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     PaperlessBilling is Yes
##     TechSupport is No
##     MonthlyCharges_DISC is 01 <= 26.9 or 02 <= 55.95 or 04 <= 106.75
## 
## TARGET is 0.69 with cover 9% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     OnlineSecurity is No
##     permanencia is 0
##     serv_contratados >= 3
## 
## TARGET is 0.70 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     permanencia is 0
##     StreamingTV is Yes
## 
## TARGET is 0.71 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5
##     InternetService is DSL or No
##     PaymentMethod is Credit card (automatic)
##     PaperlessBilling is No
##     TechSupport is No
##     StreamingMovies is No
## 
## TARGET is 0.83 with cover 0% when
##     Contract is Month-to-month
##     tenure_DISC is 04 <= 22
##     InternetService is Fiber optic
##     OnlineSecurity is Yes
##     PaymentMethod is Electronic check
##     consumo is 1
## 
## TARGET is 0.88 with cover 4% when
##     Contract is Month-to-month
##     tenure_DISC is 01 <= 1 or 02 <= 5 or 03 <= 16
##     InternetService is Fiber optic
##     permanencia is 1

#style sirve para que la salida sea mas legible y cover añade el % de casos e los que aplica la regla

Podemos llevarnos el nodo final de cada cliente a un data.frame para poder hacer una explotacion posterior

#Para ello usarmos el predict especficio de rpart y con el parametro nn
ar_numnodos<-rpart.predict(ar,test,nn = T)
head(ar_numnodos)

##           0          1  nn
## 1 0.3525180 0.64748201 111
## 2 0.9268966 0.07310345   2
## 3 0.3525180 0.64748201 111
## 4 0.9268966 0.07310345   2
## 5 0.4047619 0.59523810 239
## 6 0.9268966 0.07310345   2

Vamos a calcular los scorings y evaluar el modelo

ar_predict<-predict(ar,test,type = 'prob')[,2]

Vemos que pinta tiene

plot(ar_predict~test$TARGET)

Con la funcion umbrales probamos diferentes cortes

umb_ar<-umbrales(test$TARGET,ar_predict)
umb_ar

##    umbral  acierto precision cobertura       F1
## 1    0.05  0.05000   0.05000  0.050000  0.05000
## 2    0.10 66.15825  41.30824 89.341085 56.49510
## 3    0.15 66.01525  41.08597 87.984496 56.01481
## 4    0.20 75.30982  49.87805 79.263566 61.22754
## 5    0.25 75.26215  49.81595 78.682171 61.00676
## 6    0.30 79.31363  57.64925 59.883721 58.74525
## 7    0.35 79.36130  57.78612 59.689922 58.72259
## 8    0.40 79.55195  58.34933 58.914729 58.63067
## 9    0.45 80.12393  60.97561 53.294574 56.87694
## 10   0.50 80.12393  60.97561 53.294574 56.87694
## 11   0.55 80.36225  62.20657 51.356589 56.26327
## 12   0.60 81.02955  67.98780 43.217054 52.84360
## 13   0.65 80.93422  71.01449 37.984496 49.49495
## 14   0.70 77.26406  79.10448 10.271318 18.18182
## 15   0.75 77.26406  81.96721  9.689922 17.33102
## 16   0.80 77.26406  81.96721  9.689922 17.33102
## 17   0.85 77.16873  81.35593  9.302326 16.69565
## 18   0.90  0.90000   0.90000  0.900000  0.90000
## 19   0.95  0.95000   0.95000  0.950000  0.95000

Seleccionamos automaticamente el mejor umbral

umbral_final_ar<-umb_ar[which.max(umb_ar$F1),1]
umbral_final_ar

## [1] 0.2

Evaluamos la matriz de confusion y las metricas con el umbral optimizado

confusion(test$TARGET,ar_predict,umbral_final_ar)

##     
## real FALSE TRUE
##    0  1171  411
##    1   107  409

ar_metricas<-filter(umb_ar,umbral==umbral_final_ar)
ar_metricas

##   umbral  acierto precision cobertura       F1
## 1    0.2 75.30982  49.87805  79.26357 61.22754

Evaluamos la ROC

#creamos el objeto prediction
ar_prediction<-prediction(ar_predict,test$TARGET)
#visualizamos la ROC
roc(ar_prediction)

Sacamos las metricas definitivas incluyendo el AUC

ar_metricas<-cbind(ar_metricas,AUC=round(auc(ar_prediction),2)*100)
print(t(ar_metricas))

##               [,1]
## umbral     0.20000
## acierto   75.30982
## precision 49.87805
## cobertura 79.26357
## F1        61.22754
## AUC       82.00000

5.4 - Comparamos los 2 metodos

comparativa <- rbind(rl_metricas,ar_metricas)
rownames(comparativa) <- c('Regresion Logistica','Arbol Decision')
t(comparativa) #t simplemente transpone para leerlo mejor

##           Regresion Logistica Arbol Decision
## umbral                0.40000        0.20000
## acierto              79.59962       75.30982
## precision            57.07395       49.87805
## cobertura            68.79845       79.26357
## F1                   62.39016       61.22754
## AUC                  85.00000       82.00000

Conclusion: todos serían igualmente predictivos, entonces por un criterio de parsimonia vamos a quedarnos con la regresion logistica

5.4.1 - Escribimos el scoring final en el dataset y guardamos el modelo

df$SCORING_ABANDONO <- predict(rl,df,type = 'response')

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

saveRDS(rl,'modelo_final_telco.rds')

Cargamos el cache temporal del modelo

rl <- readRDS('modelo_final_telco.rds')

Evaluacion y analisis de negocio

Vamos a visualizar la tasa de abandono por tramos de scoring. Este grafico es muy potente para ver que el modelo es consistente, ya que debe presentar una linea descente en la tasa de contratacion conforme se desciende en el scoring

#Creamos una funcion para visualizar el abandono real por percentiles de scoring
vis <- function(scoring,real) {
    #Preparar el dataframe de visualización
    vis_df <- data.frame(Scoring = scoring, Perc_Scoring = cut_number(scoring, 20), Real = real)
    levels(vis_df$Perc_Scoring) <- seq(from = 100,to = 5,by = -5)
    vis_gr <- vis_df %>% group_by(Perc_Scoring) %>% summarise(Tasa_Churn = mean(as.numeric(as.character(Real)))) %>% arrange(Perc_Scoring)
    #ordenar el factor para el gráfico
    vis_gr$Perc_Scoring <- factor(vis_gr$Perc_Scoring, levels = vis_gr$Perc_Scoring[order(vis_gr$Perc_Scoring, decreasing = T)])
    #hacemos el gráfico
    ggplot(vis_gr,aes(Perc_Scoring, Tasa_Churn)) + 
      geom_col(fill='grey') + 
      geom_hline(aes(yintercept =      mean(as.numeric(as.character(vis_df$Real)))), col = 'black') +
      labs(title = 'Abandono real por tramo de scoring', x = 'Tramo de scoring', y = 'Abandono real')
}
vis(df$SCORING_ABANDONO,df$TARGET)

6.1 - Simulación de acción comercial para evitar la tasa de abandono.

El tamaño de campaña viene definido por un criterio de negocio como por ejemplo el presupuesto total asignado a la campaña

#Supongamos que tenemos un presupuesto de 20.000€ para retener al cliente
#Y que la campaña se realizara por call center, con un coste unitario de 20€ por cliente contactado
#Entonces el numero de clientes a contactar sera de 20.000 / 20 = 1.000
#Para extraerlos simplemente cogemos los 1.000 primeros ordenados por scoring
tamaño_campaña <- 2000
bote_campaña <- df %>% 
  arrange(desc(SCORING_ABANDONO)) %>% 
  slice(1:tamaño_campaña) %>%
  select(id,SCORING_ABANDONO)
#Previsualizamos la salida
head(bote_campaña,50)

##      id SCORING_ABANDONO
## 1    81        0.8668347
## 2   140        0.8668347
## 3   346        0.8668347
## 4   384        0.8668347
## 5   420        0.8668347
## 6   460        0.8668347
## 7   475        0.8668347
## 8   489        0.8668347
## 9   534        0.8668347
## 10  585        0.8668347
## 11  642        0.8668347
## 12  651        0.8668347
## 13  672        0.8668347
## 14  808        0.8668347
## 15  843        0.8668347
## 16  914        0.8668347
## 17  974        0.8668347
## 18  981        0.8668347
## 19 1092        0.8668347
## 20 1202        0.8668347
## 21 1205        0.8668347
## 22 1322        0.8668347
## 23 1367        0.8668347
## 24 1501        0.8668347
## 25 1557        0.8668347
## 26 1596        0.8668347
## 27 1647        0.8668347
## 28 1700        0.8668347
## 29 1727        0.8668347
## 30 1735        0.8668347
## 31 1779        0.8668347
## 32 1875        0.8668347
## 33 1925        0.8668347
## 34 1949        0.8668347
## 35 1950        0.8668347
## 36 1972        0.8668347
## 37 1991        0.8668347
## 38 2033        0.8668347
## 39 2091        0.8668347
## 40 2125        0.8668347
## 41 2129        0.8668347
## 42 2187        0.8668347
## 43 2190        0.8668347
## 44 2204        0.8668347
## 45 2271        0.8668347
## 46 2279        0.8668347
## 47 2362        0.8668347
## 48 2363        0.8668347
## 49 2393        0.8668347
## 50 2418        0.8668347

#Vamos a ver gráficamente si de esta forma estamos aprovechando el potencial de nuestro modelo
penetracion_target <- mean(as.numeric(as.character(df$TARGET)))
df %>% 
  arrange(desc(SCORING_ABANDONO)) %>% 
  ggplot(aes(y = SCORING_ABANDONO, x = seq_along(SCORING_ABANDONO))) +
  geom_line() + 
  geom_vline(xintercept = tamaño_campaña, col = 'orange') +
  geom_hline(yintercept = penetracion_target,col='blue') +
  labs(x = 'CLIENTES ORDENADOS POR SCORING', y = 'SCORING')

Según la visualización en la gráfica, podemos comprobar que con una inversion de 20.000€ más la fidelización acordada por la empresa, podríamos retener a la gran mayoría de abandonos potenciales.

Informe TELCO

Jorge Arias

17/8/2020

Proyecto TECLO DS4B

Contexto

Metodología aplicada

Conclusiones tras el análisis:

Detalle del trabajo realizado