Pregunta 17 - Árboles de Decisión

Ajusta dos modelos de árboles de decisión a la base de datos ‘Default’ que viene en los datasets del paquete ‘ISLR’. El objetivo de estos árboles será identificar a aquellas personas que pueden dejar de pagar su tarjeta de crédito (default=‘yes’).

Para cargar los datos puedes utilizar los siguientes comandos

# Cargar datos
library(ISLR)
library(rpart)
library(rpart.plot)

df <- Default
MyData <- as.data.frame(Default)
head(MyData)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
str(MyData)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

Separa la base de datos ‘data’ en train y test, de forma que se tengan las siguientes proporciones 70/30.

set.seed(123)
n <- nrow(df)
train_index <- sample(1:n, size = round(0.7 * n))
data_train <- df[train_index, ]
data_test <- df[-train_index, ]

Utilizando únicamente los datos de entrenamiento (data_train) ajusta un árbol de decisión que utilce todas las variables independientes para explicar la variable binaria ‘default’

tree_model <- rpart(default ~ student + balance + income, 
                    data = data_train, 
                    method = "class")

x11()

rpart.plot(tree_model, type = 4, extra = 104)

summary(tree_model)
## Call:
## rpart(formula = default ~ student + balance + income, data = data_train, 
##     method = "class")
##   n= 7000 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.09871245      0 1.0000000 1.0000000 0.06441264
## 2 0.01716738      2 0.8025751 0.8798283 0.06054339
## 3 0.01000000      4 0.7682403 0.8540773 0.05967714
## 
## Variable importance
## balance  income student 
##      95       3       2 
## 
## Node number 1: 7000 observations,    complexity param=0.09871245
##   predicted class=No   expected loss=0.03328571  P(node) =1
##     class counts:  6767   233
##    probabilities: 0.967 0.033 
##   left son=2 (6788 obs) right son=3 (212 obs)
##   Primary splits:
##       balance < 1788.349 to the left,  improve=113.3555000, (0 missing)
##       student splits as  LR,           improve=  0.5084245, (0 missing)
##       income  < 22039.06 to the right, improve=  0.5045954, (0 missing)
## 
## Node number 2: 6788 observations
##   predicted class=No   expected loss=0.01738362  P(node) =0.9697143
##     class counts:  6670   118
##    probabilities: 0.983 0.017 
## 
## Node number 3: 212 observations,    complexity param=0.09871245
##   predicted class=Yes  expected loss=0.4575472  P(node) =0.03028571
##     class counts:    97   115
##    probabilities: 0.458 0.542 
##   left son=6 (130 obs) right son=7 (82 obs)
##   Primary splits:
##       balance < 1972.743 to the left,  improve=15.153670, (0 missing)
##       income  < 33973.76 to the left,  improve= 7.445816, (0 missing)
##       student splits as  RL,           improve= 4.096988, (0 missing)
##   Surrogate splits:
##       income < 49819.21 to the left,  agree=0.627, adj=0.037, (0 split)
## 
## Node number 6: 130 observations,    complexity param=0.01716738
##   predicted class=No   expected loss=0.3923077  P(node) =0.01857143
##     class counts:    79    51
##    probabilities: 0.608 0.392 
##   left son=12 (78 obs) right son=13 (52 obs)
##   Primary splits:
##       income  < 27874.34 to the left,  improve=3.702564, (0 missing)
##       balance < 1889.562 to the left,  improve=2.262628, (0 missing)
##       student splits as  RL,           improve=1.846520, (0 missing)
##   Surrogate splits:
##       student splits as  RL,           agree=0.923, adj=0.808, (0 split)
##       balance < 1962.872 to the left,  agree=0.615, adj=0.038, (0 split)
## 
## Node number 7: 82 observations
##   predicted class=Yes  expected loss=0.2195122  P(node) =0.01171429
##     class counts:    18    64
##    probabilities: 0.220 0.780 
## 
## Node number 12: 78 observations
##   predicted class=No   expected loss=0.2948718  P(node) =0.01114286
##     class counts:    55    23
##    probabilities: 0.705 0.295 
## 
## Node number 13: 52 observations,    complexity param=0.01716738
##   predicted class=Yes  expected loss=0.4615385  P(node) =0.007428571
##     class counts:    24    28
##    probabilities: 0.462 0.538 
##   left son=26 (32 obs) right son=27 (20 obs)
##   Primary splits:
##       balance < 1889.562 to the left,  improve=1.6961540, (0 missing)
##       income  < 42894.58 to the right, improve=0.2989275, (0 missing)
##   Surrogate splits:
##       student splits as  LR,           agree=0.635, adj=0.05, (0 split)
##       income  < 29718.2  to the right, agree=0.635, adj=0.05, (0 split)
## 
## Node number 26: 32 observations
##   predicted class=No   expected loss=0.4375  P(node) =0.004571429
##     class counts:    18    14
##    probabilities: 0.562 0.438 
## 
## Node number 27: 20 observations
##   predicted class=Yes  expected loss=0.3  P(node) =0.002857143
##     class counts:     6    14
##    probabilities: 0.300 0.700

Genera las predicciones correspondiente a tu modelo con los datos de pruebs (data_test). Muestra la matriz de confusiones correspondiente a estas predicciones con las métricas relevantes e interpreta estos resultados.

pred_test <- predict(tree_model, data_test, type = "class")

conf_test <- table(Predicho = pred_test, Real = data_test$default)
conf_test
##         Real
## Predicho   No  Yes
##      No  2890   69
##      Yes   10   31
accuracy_test <- sum(diag(conf_test)) / sum(conf_test)
sensitivity_test <- conf_test[2,2] / sum(conf_test[,2])  
specificity_test <- conf_test[1,1] / sum(conf_test[,1])  
precision_test <- conf_test[2,2] / sum(conf_test[2,])    

cat("Metricas en DATA TEST\n")
## Metricas en DATA TEST
cat("Accuracy:    ", round(accuracy_test, 4), "\n")
## Accuracy:     0.9737
cat("Sensibilidad:", round(sensitivity_test, 4), "\n")
## Sensibilidad: 0.31
cat("Especificidad:", round(specificity_test, 4), "\n")
## Especificidad: 0.9966
cat("Precision:   ", round(precision_test, 4), "\n")
## Precision:    0.7561

Ahora ajusta un árbol de decisiones utilizando los datos de entrenamiento (data_train) que excluya la variable ‘income’.

tree_model_no_income <- rpart(default ~ student + balance, 
                    data = data_train, 
                    method = "class")

x11()

rpart.plot(tree_model_no_income, type = 4, extra = 104)

summary(tree_model_no_income)
## Call:
## rpart(formula = default ~ student + balance, data = data_train, 
##     method = "class")
##   n= 7000 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.09871245      0 1.0000000 1.0000000 0.06441264
## 2 0.02360515      2 0.8025751 0.8497854 0.05953138
## 3 0.01000000      4 0.7553648 0.8583691 0.05982249
## 
## Variable importance
## balance 
##     100 
## 
## Node number 1: 7000 observations,    complexity param=0.09871245
##   predicted class=No   expected loss=0.03328571  P(node) =1
##     class counts:  6767   233
##    probabilities: 0.967 0.033 
##   left son=2 (6788 obs) right son=3 (212 obs)
##   Primary splits:
##       balance < 1788.349 to the left,  improve=113.3555000, (0 missing)
##       student splits as  LR,           improve=  0.5084245, (0 missing)
## 
## Node number 2: 6788 observations
##   predicted class=No   expected loss=0.01738362  P(node) =0.9697143
##     class counts:  6670   118
##    probabilities: 0.983 0.017 
## 
## Node number 3: 212 observations,    complexity param=0.09871245
##   predicted class=Yes  expected loss=0.4575472  P(node) =0.03028571
##     class counts:    97   115
##    probabilities: 0.458 0.542 
##   left son=6 (130 obs) right son=7 (82 obs)
##   Primary splits:
##       balance < 1972.743 to the left,  improve=15.153670, (0 missing)
##       student splits as  RL,           improve= 4.096988, (0 missing)
## 
## Node number 6: 130 observations,    complexity param=0.02360515
##   predicted class=No   expected loss=0.3923077  P(node) =0.01857143
##     class counts:    79    51
##    probabilities: 0.608 0.392 
##   left son=12 (86 obs) right son=13 (44 obs)
##   Primary splits:
##       balance < 1889.562 to the left,  improve=2.262628, (0 missing)
##       student splits as  RL,           improve=1.846520, (0 missing)
## 
## Node number 7: 82 observations
##   predicted class=Yes  expected loss=0.2195122  P(node) =0.01171429
##     class counts:    18    64
##    probabilities: 0.220 0.780 
## 
## Node number 12: 86 observations
##   predicted class=No   expected loss=0.3255814  P(node) =0.01228571
##     class counts:    58    28
##    probabilities: 0.674 0.326 
## 
## Node number 13: 44 observations,    complexity param=0.02360515
##   predicted class=Yes  expected loss=0.4772727  P(node) =0.006285714
##     class counts:    21    23
##    probabilities: 0.477 0.523 
##   left son=26 (29 obs) right son=27 (15 obs)
##   Primary splits:
##       balance < 1909.523 to the right, improve=5.3844310, (0 missing)
##       student splits as  RL,           improve=0.7454357, (0 missing)
## 
## Node number 26: 29 observations
##   predicted class=No   expected loss=0.3448276  P(node) =0.004142857
##     class counts:    19    10
##    probabilities: 0.655 0.345 
## 
## Node number 27: 15 observations
##   predicted class=Yes  expected loss=0.1333333  P(node) =0.002142857
##     class counts:     2    13
##    probabilities: 0.133 0.867

Genera las predicciones correspondiente a este nuevo modelo con los datos de prueba (data_test). Muestra la matriz de confusiones correspondiente a estas predicciones con las métricas relevantes e interpreta estos resultados.

pred_test_2 <- predict(tree_model_no_income, data_test, type = "class")

conf_test_2 <- table(Predicho = pred_test_2, Real = data_test$default)
conf_test_2
##         Real
## Predicho   No  Yes
##      No  2887   73
##      Yes   13   27
accuracy_test <- sum(diag(conf_test_2)) / sum(conf_test_2)
sensitivity_test <- conf_test_2[2,2] / sum(conf_test_2[,2])  
specificity_test <- conf_test_2[1,1] / sum(conf_test_2[,1])  
precision_test <- conf_test_2[2,2] / sum(conf_test_2[2,])    

cat("Metricas en DATA TEST\n")
## Metricas en DATA TEST
cat("Accuracy:    ", round(accuracy_test, 4), "\n")
## Accuracy:     0.9713
cat("Sensibilidad:", round(sensitivity_test, 4), "\n")
## Sensibilidad: 0.27
cat("Especificidad:", round(specificity_test, 4), "\n")
## Especificidad: 0.9955
cat("Precision:   ", round(precision_test, 4), "\n")
## Precision:    0.675

¿Qué puedes concluir al comparar estos dos modelos?

Aunque balance domina ambos árboles (95-100% de importancia), income aporta información adicional en el nodo intermedio (income < 28,000) que ayuda a discriminar mejor entre quienes caen en default en el rango de balance entre 1,788 y 1,973.

Ambos modelos comparten una limitación importante: la sensibilidad es baja (~30%), lo que significa que solo detectan alrededor de 1 de cada 3 “true negatives”. Esto se debe al fuerte desbalance de clases (solo 3.3% son “Yes”).

¿Cuál es el mejor modelo?

El Modelo 1 (con todas las variables: student, balance, income) es el mejor modelo. Supera al Modelo 2 en todas las métricas.

Sensibilidad (0.31 vs 0.27): detecta un 4% más de los “false positive”, que es precisamente el objetivo del ejercicio.

Precisión (0.756 vs 0.675): cuando predice “Yes”, acierta con mayor frecuencia.

Por último genera una visualización del mejor modelo

x11()

rpart.plot(tree_model, type = 4, extra = 104)