Ejercicio de arbol de decision

Modelo de lealtad

¿Cuándo los clientes renuncian a su lealtad con la compañía X y se mudan a la compañía Y? Este es un problema conocido como la rotación de clientes, donde los clientes pasan de una empresa a un competidor. Churn de clientes y ciencia de datos Como líder empresarial necesitamos minimizar la deserción. Para hacerlo, debemos mirar los datos para interpretar por qué los clientes se revuelven y crean estrategias basadas en estos conocimientos. Aquí es donde entra su trabajo como científico de datos. En el siguiente escenario, su objetivo es explorar un conjunto de datos para ayudar a una empresa de telecomunicaciones a reducir la rotación de clientes. Escenario: Churn cliente en una empresa de telecomunicaciones A continuación se muestra un conjunto de datos de muestra sobre las características demográficas de los clientes, su comportamiento y si terminan batiéndose (abandonándose). Este conjunto de datos proporciona información para ayudarlo a predecir el comportamiento para retener clientes. Puede analizar todos los datos relevantes de los clientes y desarrollar programas de retención de clientes enfocados. Una empresa de telecomunicaciones está preocupada por la cantidad de clientes que dejan su negocio de línea fija para los competidores de cable. Necesitan entender quién se va. Imagina que eres un analista en esta empresa y tienes que averiguar quién se va y por qué. El conjunto de datos en la rotación de clientes incluye información sobre: Clientes que se fueron en el último mes: la columna se llama Churn Servicios a los que cada cliente se ha suscrito. Teléfono, varias líneas, Internet, seguridad en línea, respaldo en línea, protección del dispositivo, soporte técnico y transmisión de TV y películas Información de la cuenta del cliente cuánto tiempo han sido clientes (tenencia), contrato, método de pago, facturación electrónica, cargos mensuales y cargos totales Información demográfica sobre los clientes. Género, rango de edad, y si tienen pareja y dependientes.

Obteniendo los datos La siguiente celda de código descarga los datos directamente en Data Scientist Workbench usando R

rm(list=ls())
# download.file("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv",destfile = "custchurn.csv")
# print("File saved under /custchurn.csv")
library(Information)

## Warning: package 'Information' was built under R version 3.5.2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.2

## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.2

## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

DATA <- read.csv("custchurn.csv", header = TRUE)
DATA=DATA[,-1]
head(DATA)

##   gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 Female             0     Yes         No      1           No
## 2   Male             0      No         No     34          Yes
## 3   Male             0      No         No      2          Yes
## 4   Male             0      No         No     45           No
## 5 Female             0      No         No      2          Yes
## 6 Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service             DSL             No          Yes
## 2               No             DSL            Yes           No
## 3               No             DSL            Yes          Yes
## 4 No phone service             DSL            Yes           No
## 5               No     Fiber optic             No           No
## 6              Yes     Fiber optic             No           No
##   DeviceProtection TechSupport StreamingTV StreamingMovies       Contract
## 1               No          No          No              No Month-to-month
## 2              Yes          No          No              No       One year
## 3               No          No          No              No Month-to-month
## 4              Yes         Yes          No              No       One year
## 5               No          No          No              No Month-to-month
## 6              Yes          No         Yes             Yes Month-to-month
##   PaperlessBilling             PaymentMethod MonthlyCharges TotalCharges
## 1              Yes          Electronic check          29.85        29.85
## 2               No              Mailed check          56.95      1889.50
## 3              Yes              Mailed check          53.85       108.15
## 4               No Bank transfer (automatic)          42.30      1840.75
## 5              Yes          Electronic check          70.70       151.65
## 6              Yes          Electronic check          99.65       820.50
##   Churn
## 1    No
## 2    No
## 3   Yes
## 4    No
## 5   Yes
## 6   Yes

DATA$SeniorCitizen=as.factor(DATA$SeniorCitizen)
glimpse(DATA)

## Observations: 7,043
## Variables: 20
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Mal...
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes...
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes...
## $ MultipleLines    <fct> No phone service, No, No, No phone service, N...
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic,...
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, ...
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, N...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, N...
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No,...
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No...
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No,...
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No,...
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check,...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, N...

str(DATA)

## 'data.frame':    7043 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Ahora podemos llegar a construir nuestro modelo propiamente dicho. Para los árboles de decisión, vamos a utilizar dos bibliotecas diferentes pero relacionadas: rpart para crear el árbol de decisión y rpart.plot para visualizar nuestro árbol de decisión. Para importar bibliotecas, usamos la función de biblioteca, así:

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.5.3

# Create a classification decision tree using "Class" as the variable we want to predict and everything else as its predictors.
myDecisionTree <- rpart(Churn ~ ., data = DATA, method = "class")
# Print out a summary of our created model.
print(myDecisionTree)

## n= 7043 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 7043 1869 No (0.73463013 0.26536987)  
##    2) Contract=One year,Two year 3168  214 No (0.93244949 0.06755051) *
##    3) Contract=Month-to-month 3875 1655 No (0.57290323 0.42709677)  
##      6) InternetService=DSL,No 1747  493 No (0.71780195 0.28219805) *
##      7) InternetService=Fiber optic 2128  966 Yes (0.45394737 0.54605263)  
##       14) tenure>=15.5 1092  445 No (0.59249084 0.40750916) *
##       15) tenure< 15.5 1036  319 Yes (0.30791506 0.69208494) *

rpart.plot(myDecisionTree, type = 3, extra = 2, under = TRUE, faclen=5, cex = .75)

newCase  <- DATA[10,-20]
newCase

##    gender SeniorCitizen Partner Dependents tenure PhoneService
## 10   Male             0      No        Yes     62          Yes
##    MultipleLines InternetService OnlineSecurity OnlineBackup
## 10            No             DSL            Yes          Yes
##    DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 10               No          No          No              No One year
##    PaperlessBilling             PaymentMethod MonthlyCharges TotalCharges
## 10               No Bank transfer (automatic)          56.15      3487.95

predict(myDecisionTree, newCase, type = "class")

## 10 
## No 
## Levels: No Yes

¿Qué tan preciso es nuestro modelo?

n <- nrow(DATA)
smp_size <- floor(0.75 * n)

## set the seed to make your partition reproductible
set.seed(123)
train_ind <- base::sample(c(1:n), size = smp_size)

mushrooms_train <- DATA[train_ind, ]
mushrooms_test <- DATA[-train_ind, ]

newDT <- rpart(Churn ~ ., data = mushrooms_train, method = "class")

result <- predict(newDT, mushrooms_test[,-20], type = "class")
head(result)

##  2  9 10 11 12 16 
## No No No No No No 
## Levels: No Yes

head(mushrooms_test$Churn)

## [1] No  Yes No  No  No  No 
## Levels: No Yes

library(caret)

## Warning: package 'caret' was built under R version 3.5.2

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

base::table(mushrooms_test$Churn, result)

##      result
##         No  Yes
##   No  1171  137
##   Yes  221  232

 confusionMatrix(mushrooms_test$Churn,result)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1171  137
##        Yes  221  232
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7771, 0.8153)
##     No Information Rate : 0.7905          
##     P-Value [Acc > NIR] : 0.2705          
##                                           
##                   Kappa : 0.4337          
##  Mcnemar's Test P-Value : 1.151e-05       
##                                           
##             Sensitivity : 0.8412          
##             Specificity : 0.6287          
##          Pos Pred Value : 0.8953          
##          Neg Pred Value : 0.5121          
##              Prevalence : 0.7905          
##          Detection Rate : 0.6650          
##    Detection Prevalence : 0.7428          
##       Balanced Accuracy : 0.7350          
##                                           
##        'Positive' Class : No              
##

NUEVA FORMA DE HACER EL EJERCICIO

set.seed(8656)
data <- DATA[base::sample(nrow(DATA)), ] # BAJARAR LOS PRIMEROS DATOS
bound <- floor(0.7 * nrow(DATA))
DATA_train <- data[1:bound, ] 
DATA_test <- data[(bound + 1):nrow(DATA), ]
cat("NUMERO DE ENTRANEMIENTO Y DE PRUEBA SON ", nrow(DATA_train),"Y", nrow(DATA_test))

## NUMERO DE ENTRANEMIENTO Y DE PRUEBA SON  4930 Y 2113

NROW(DATA)

## [1] 7043

#arbol_1 <- rpart(formula = Ins ~ DDABal + SavBal , data = seguro_train)
arbol_1 <- rpart(Churn ~ ., data = DATA_train, method = "class")

PREDICCION CON TRAIN

predict(myDecisionTree, newCase, type = "class")

## 10 
## No 
## Levels: No Yes

DATA_train1=DATA_train[,-20]
pred_arbol_train=predict(arbol_1, newdata = DATA_train1, type = "class")

PREDICCION CON TEST

DATA_test1=DATA_test[,-20]
pred_arbol_test=as.factor(predict(arbol_1, newdata = DATA_test1, type="class"))

MAGTRIZ DE CONFUSION PARA EL TRAIN Y TEST

library(caret)
observaciones_train=as.factor(DATA_train$Churn)
observaciones_test=as.factor(DATA_test$Churn)

Matriz_conf_train_arbol= confusionMatrix(observaciones_train, pred_arbol_train)
print(Matriz_conf_train_arbol)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3250  364
##        Yes  636  680
##                                           
##                Accuracy : 0.7972          
##                  95% CI : (0.7857, 0.8083)
##     No Information Rate : 0.7882          
##     P-Value [Acc > NIR] : 0.06415         
##                                           
##                   Kappa : 0.4453          
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.8363          
##             Specificity : 0.6513          
##          Pos Pred Value : 0.8993          
##          Neg Pred Value : 0.5167          
##              Prevalence : 0.7882          
##          Detection Rate : 0.6592          
##    Detection Prevalence : 0.7331          
##       Balanced Accuracy : 0.7438          
##                                           
##        'Positive' Class : No              
##

matriz de confusion para el test

Matriz_conf_test_arbol= confusionMatrix(observaciones_test, pred_arbol_test)
print(Matriz_conf_test_arbol)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1359  201
##        Yes  259  294
##                                           
##                Accuracy : 0.7823          
##                  95% CI : (0.7641, 0.7997)
##     No Information Rate : 0.7657          
##     P-Value [Acc > NIR] : 0.037358        
##                                           
##                   Kappa : 0.4169          
##  Mcnemar's Test P-Value : 0.007869        
##                                           
##             Sensitivity : 0.8399          
##             Specificity : 0.5939          
##          Pos Pred Value : 0.8712          
##          Neg Pred Value : 0.5316          
##              Prevalence : 0.7657          
##          Detection Rate : 0.6432          
##    Detection Prevalence : 0.7383          
##       Balanced Accuracy : 0.7169          
##                                           
##        'Positive' Class : No              
##