Modelos de regresión de variables binarias, Machine Learning

Aprendizaje automático para modelos de probabilidad

Continuando con los modelos de probabilidad, incluiremos un nuevo conjunto de datos

Ejemplo: German Credit (source: uci.edu)

Liga de la documentación del conjunto de datos Liga al centro de datos

En esta sección, exploraremos un conjunto de datos de crédito en Alemania. Este conjunto de datos clasifica a las personas/entidades descritas por un conjunto de atributos como riesgos crediticios buenos o malos (predeterminado).

No	Tipo de variable	Nombre	Descripción
1	(Cualitativa)	chk_acct	Estatus de la cuenta
2	(Numérico)	duration	Duración en meses
3	(Cualitativa)	credit_his	Historia crediticia
4	(Cualitativa)	purpose	Propósito del crédito
5	(Numérico)	amount	Monto
6	(Cualitativa)	saving_acct	Cuenta de ahorros
7	(Cualitativa)	present_emp	Tiempo como empleado
8	(Numérico)	installment_rate	Tasa de pago
9	(Cualitativa)	sex	Estado civil/sexo
10	(Cualitativa)	other_debtor	Otros deudores
11	(Numérico)	present_resid	Tiempo de residencia actual
12	(Cualitativa)	property	Propiedad
13	(Numérico)	age	Edad
14	(Cualitativa)	other_install	Otros planes de pago
15	(Cualitativa)	housing	Estatus de vivienda
16	(Numérico)	n_credits	Número de créditos vigentes en el banco
17	(Cualitativa)	job	Trabajo
18	(Numérico)	n_people	Número de dependientes
19	(Cualitativa)	telephone	Teléfono
20	(Cualitativa)	foreign	Trabajador extranjero
21	(Cualitativa)	default	(2: Mal crédito, 1: Buen crédito)

Cargamos librerías

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stats) 
library(readxl) 
library(gmodels)
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-2

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

#library(MASS) 
#library(ISLR) 
#library(fRegression) 
#library(psych)

Descargamos los datos originales. En este conjunto de datos podemos observar las variables cualitativas y cuantitativas

german_creditdef = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
head(german_creditdef)

##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17 V18
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173   1
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173   1
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172   2
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173   2
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173   2
## 6 A14 36 A32 A46 9055 A65 A73  2 A93 A101   4 A124  35 A143 A153   1 A172   2
##    V19  V20 V21
## 1 A192 A201   1
## 2 A191 A201   2
## 3 A191 A201   1
## 4 A191 A201   1
## 5 A191 A201   2
## 6 A192 A201   1

Otros datos que nos ofrece la fuente de información es un conjunto con solamente datos numéricos que incluye más columnas, e incluso simplifica variables numéricas. Para fines de demostración se incluyen los datos a continuación, sin embargo nuestro análisis utilizará los datos del chunck anterior.

german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric")

head(german_credit)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1  1  6  4 12  5  5  3  4  1  67   3   2   1   2   1   0   0   1   0   0   1
## 2  2 48  2 60  1  3  2  2  1  22   3   1   1   1   1   0   0   1   0   0   1
## 3  4 12  4 21  1  4  3  3  1  49   3   1   2   1   1   0   0   1   0   0   1
## 4  1 42  2 79  1  4  3  4  2  45   3   1   2   1   1   0   0   0   0   0   0
## 5  1 24  3 49  1  3  3  4  4  53   3   2   2   1   1   1   0   1   0   0   0
## 6  4 36  2 91  5  3  3  4  4  35   3   1   2   2   1   0   0   1   0   0   0
##   V22 V23 V24 V25
## 1   0   0   1   1
## 2   0   0   1   2
## 3   0   1   0   1
## 4   0   0   1   1
## 5   0   0   1   2
## 6   0   1   0   1

Utilizando los datos guardados en german_creditdef, observamos que muchas variables son categóricas, por tanto, a continuación los tranformaremos la variables categóricas en datos numéricos para su análisis y los almacenaremos en a variable german.

german = german_creditdef
for(i in 1:length(german)){
  german[,i] <- gsub(paste('A',i,sep = ''), '', german[,i]) 
}
head(german)

##   V1 V2 V3 V4   V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1  1  6  4  3 1169  5  5  4  3   1   4   1  67   3   2   2   3   1   2   1   1
## 2  2 48  2  3 5951  1  3  2  2   1   2   1  22   3   2   1   3   1   1   1   2
## 3  4 12  4  6 2096  1  4  2  3   1   3   1  49   3   2   1   2   2   1   1   1
## 4  1 42  2  2 7882  1  4  2  3   3   4   2  45   3   3   1   3   2   1   1   1
## 5  1 24  3  0 4870  1  3  3  3   1   4   4  53   3   3   2   3   2   1   1   2
## 6  4 36  2  6 9055  5  3  2  3   1   4   4  35   3   3   1   2   2   2   1   1

Para mejor lectura de los datos, reemplazamos los nombres de las columnas

colnames(german) = c("chk_acct", "duration", "credit_his", "purpose",
"amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor",
"present_resid", "property", "age", "other_install", "housing", "n_credits", "job", "n_people", "telephone", "foreign", "default")

head(german)

##   chk_acct duration credit_his purpose amount saving_acct present_emp
## 1        1        6          4       3   1169           5           5
## 2        2       48          2       3   5951           1           3
## 3        4       12          4       6   2096           1           4
## 4        1       42          2       2   7882           1           4
## 5        1       24          3       0   4870           1           3
## 6        4       36          2       6   9055           5           3
##   installment_rate sex other_debtor present_resid property age other_install
## 1                4   3            1             4        1  67             3
## 2                2   2            1             2        1  22             3
## 3                2   3            1             3        1  49             3
## 4                2   3            3             4        2  45             3
## 5                3   3            1             4        4  53             3
## 6                2   3            1             4        4  35             3
##   housing n_credits job n_people telephone foreign default
## 1       2         2   3        1         2       1       1
## 2       2         1   3        1         1       1       2
## 3       2         1   2        2         1       1       1
## 4       3         1   3        2         1       1       1
## 5       3         2   3        2         1       1       2
## 6       3         1   2        2         2       1       1

Buscamos NA's y analizamos el resumen

summary(german)

##    chk_acct           duration          credit_his          purpose         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     amount          saving_acct        present_emp        installment_rate  
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      sex            other_debtor       present_resid        property        
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      age            other_install        housing           n_credits        
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      job              n_people          telephone           foreign         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    default         
##  Length:1000       
##  Class :character  
##  Mode  :character

Observamos que no hay presencia de NA's. También observamos que todas las columnas son de tipo 'character'

str(german)

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : chr  "1" "2" "4" "1" ...
##  $ duration        : chr  "6" "48" "12" "42" ...
##  $ credit_his      : chr  "4" "2" "4" "2" ...
##  $ purpose         : chr  "3" "3" "6" "2" ...
##  $ amount          : chr  "1169" "5951" "2096" "7882" ...
##  $ saving_acct     : chr  "5" "1" "1" "1" ...
##  $ present_emp     : chr  "5" "3" "4" "4" ...
##  $ installment_rate: chr  "4" "2" "2" "2" ...
##  $ sex             : chr  "3" "2" "3" "3" ...
##  $ other_debtor    : chr  "1" "1" "1" "3" ...
##  $ present_resid   : chr  "4" "2" "3" "4" ...
##  $ property        : chr  "1" "1" "1" "2" ...
##  $ age             : chr  "67" "22" "49" "45" ...
##  $ other_install   : chr  "3" "3" "3" "3" ...
##  $ housing         : chr  "2" "2" "2" "3" ...
##  $ n_credits       : chr  "2" "1" "1" "1" ...
##  $ job             : chr  "3" "3" "2" "3" ...
##  $ n_people        : chr  "1" "1" "2" "2" ...
##  $ telephone       : chr  "2" "1" "1" "1" ...
##  $ foreign         : chr  "1" "1" "1" "1" ...
##  $ default         : chr  "1" "2" "1" "1" ...

Para nuestro fines requerimos transformar las columnas a datos numéricos.

for(i in 1:length(german)){
  german[,i] <- as.numeric(german[,i]) 
}
str(german)

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : num  1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : num  4 2 4 2 3 2 2 2 2 4 ...
##  $ purpose         : num  3 3 6 2 0 6 2 1 3 0 ...
##  $ amount          : num  1169 5951 2096 7882 4870 ...
##  $ saving_acct     : num  5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : num  5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: num  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : num  3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : num  1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : num  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : num  1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : num  2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : num  3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : num  2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ default         : num  1 2 1 1 2 1 1 1 1 2 ...

Nuevamente realizamos un resumen de nuestros datos

summary(german)

##     chk_acct        duration      credit_his       purpose      
##  Min.   :1.000   Min.   : 4.0   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:12.0   1st Qu.:2.000   1st Qu.: 1.000  
##  Median :2.000   Median :18.0   Median :2.000   Median : 2.000  
##  Mean   :2.577   Mean   :20.9   Mean   :2.545   Mean   : 2.828  
##  3rd Qu.:4.000   3rd Qu.:24.0   3rd Qu.:4.000   3rd Qu.: 3.000  
##  Max.   :4.000   Max.   :72.0   Max.   :4.000   Max.   :10.000  
##      amount       saving_acct     present_emp    installment_rate
##  Min.   :  250   Min.   :1.000   Min.   :1.000   Min.   :1.000   
##  1st Qu.: 1366   1st Qu.:1.000   1st Qu.:3.000   1st Qu.:2.000   
##  Median : 2320   Median :1.000   Median :3.000   Median :3.000   
##  Mean   : 3271   Mean   :2.105   Mean   :3.384   Mean   :2.973   
##  3rd Qu.: 3972   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:4.000   
##  Max.   :18424   Max.   :5.000   Max.   :5.000   Max.   :4.000   
##       sex         other_debtor   present_resid      property    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :3.000   Median :1.000   Median :3.000   Median :2.000  
##  Mean   :2.682   Mean   :1.145   Mean   :2.845   Mean   :2.358  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :3.000   Max.   :4.000   Max.   :4.000  
##       age        other_install      housing        n_credits    
##  Min.   :19.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:27.00   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :33.00   Median :3.000   Median :2.000   Median :1.000  
##  Mean   :35.55   Mean   :2.675   Mean   :1.929   Mean   :1.407  
##  3rd Qu.:42.00   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :75.00   Max.   :3.000   Max.   :3.000   Max.   :4.000  
##       job           n_people       telephone        foreign         default   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.0  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.0  
##  Median :3.000   Median :1.000   Median :1.000   Median :1.000   Median :1.0  
##  Mean   :2.904   Mean   :1.155   Mean   :1.404   Mean   :1.037   Mean   :1.3  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.0  
##  Max.   :4.000   Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.0

Estandarizamos la variable amount por su logaritmo

german <- german %>% mutate(amountLog = log(amount))

Observamos ahora el summary de amountLog

summary(german$amountLog)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.521   7.219   7.749   7.789   8.287   9.821

Por cuestiones de practicidad creamos un conjunto nuevo que no tenga la variable amount

# Creamos un respaldo
german_bk <- german

german$amount <- NULL

#write.csv(german, 'german_credit.csv')

str(german)

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : num  1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : num  4 2 4 2 3 2 2 2 2 4 ...
##  $ purpose         : num  3 3 6 2 0 6 2 1 3 0 ...
##  $ saving_acct     : num  5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : num  5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: num  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : num  3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : num  1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : num  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : num  1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : num  2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : num  3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : num  2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ default         : num  1 2 1 1 2 1 1 1 1 2 ...
##  $ amountLog       : num  7.06 8.69 7.65 8.97 8.49 ...

La columna default tiene valores entre 1 y 2. Para nuestros de probabilidad necesitamos que se encuentren entre 0 y 1.

#german[,21] <- german[,21] - 1
german$default <- german$default - 1
summary(german$default)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.3     1.0     1.0

La sesión anterios utilizamos un modelo de probabilida que estaba conformado por dos variables independientes y una dependiente. Con un número reducido de variables independientes, es posible construir un modelo de forma manual.

Para este nuevo caso tenemos 20 variables independiente candidatas a formar parte del modelo, y una variable dependiente que es default.

El número de combinaciones posibles de 20 variables independientes, es de:

\[2^{20} - 1 = 1048575\] Obtener el mejor modelo de forma manual a patir de un universo de combinanciones tan grande, sobra decir que es totalmente ineficiente. Por tanto, se justifica el uso de poder de cómputo nuestro propósito.

Creamos nuestro conjunto de entrenamiento y de prueba

set.seed(1234) 

# p = 0.8 toma el 80% de la muestra aleatoriamente, y deamos el resto para la prueba 
train <- createDataPartition(y = german$default, p = 0.8, list = FALSE, times = 1) 
datos_train <- german[train, ]
datos_test <- german[-train, ]

Iniciamos nuestro modelo completo con una regresión probit y los datos de entrenamiento

# Creamos el modelo completo

german_model_full <- glm(default ~ ., family = binomial(link = "probit"), data = datos_train) 
summary(german_model_full)

## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"), 
##     data = datos_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8217  -0.7586  -0.4068   0.8049   2.6260  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.372840   0.937850   1.464 0.143244    
## chk_acct         -0.377954   0.045801  -8.252  < 2e-16 ***
## duration          0.019495   0.005994   3.252 0.001145 ** 
## credit_his       -0.215898   0.055529  -3.888 0.000101 ***
## purpose          -0.004827   0.019447  -0.248 0.803959    
## saving_acct      -0.135299   0.037447  -3.613 0.000303 ***
## present_emp      -0.112522   0.047409  -2.373 0.017624 *  
## installment_rate  0.144196   0.055667   2.590 0.009589 ** 
## sex              -0.101129   0.076818  -1.316 0.188015    
## other_debtor     -0.184607   0.117013  -1.578 0.114642    
## present_resid     0.008439   0.050991   0.166 0.868550    
## property          0.133776   0.059936   2.232 0.025616 *  
## age              -0.005475   0.005319  -1.029 0.303322    
## other_install    -0.190200   0.072975  -2.606 0.009150 ** 
## housing          -0.130081   0.109424  -1.189 0.234526    
## n_credits         0.163879   0.101961   1.607 0.107996    
## job               0.088043   0.090759   0.970 0.332010    
## n_people          0.085947   0.156380   0.550 0.582592    
## telephone        -0.185548   0.121791  -1.523 0.127635    
## foreign          -0.451524   0.345871  -1.305 0.191734    
## amountLog         0.070444   0.105398   0.668 0.503903    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 979.07  on 799  degrees of freedom
## Residual deviance: 753.68  on 779  degrees of freedom
## AIC: 795.68
## 
## Number of Fisher Scoring iterations: 5

Backward selection

set.seed(1234)
credit_probit_bwd <- step(german_model_full, direction = "backward", scope = ~ 1, trace=0) 
summary(credit_probit_bwd)

## 
## Call:
## glm(formula = default ~ chk_acct + duration + credit_his + saving_acct + 
##     present_emp + installment_rate + other_debtor + property + 
##     other_install + housing + n_credits, family = binomial(link = "probit"), 
##     data = datos_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8259  -0.7628  -0.4207   0.8121   2.7784  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.191151   0.408612   2.915 0.003556 ** 
## chk_acct         -0.375378   0.044956  -8.350  < 2e-16 ***
## duration          0.023065   0.004389   5.256 1.48e-07 ***
## credit_his       -0.221499   0.054299  -4.079 4.52e-05 ***
## saving_acct      -0.140487   0.036931  -3.804 0.000142 ***
## present_emp      -0.120050   0.044892  -2.674 0.007491 ** 
## installment_rate  0.123773   0.048357   2.560 0.010480 *  
## other_debtor     -0.202656   0.115630  -1.753 0.079667 .  
## property          0.148169   0.056496   2.623 0.008725 ** 
## other_install    -0.171959   0.072122  -2.384 0.017113 *  
## housing          -0.176260   0.100910  -1.747 0.080689 .  
## n_credits         0.142115   0.098860   1.438 0.150564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 979.07  on 799  degrees of freedom
## Residual deviance: 761.75  on 788  degrees of freedom
## AIC: 785.75
## 
## Number of Fisher Scoring iterations: 5

Forward seletion

Creamos el modelo simple

german_model_reduced = glm(default ~ 1, family = binomial(link = "probit"), data=datos_train)
summary(german_model_reduced)

## 
## Call:
## glm(formula = default ~ 1, family = binomial(link = "probit"), 
##     data = datos_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8467  -0.8467  -0.8467   1.5491   1.5491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.52081    0.04657  -11.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 979.07  on 799  degrees of freedom
## Residual deviance: 979.07  on 799  degrees of freedom
## AIC: 981.07
## 
## Number of Fisher Scoring iterations: 4

set.seed(1234)
#Forward selection
credit_probit_fwd <- step(german_model_reduced, direction = "forward", scope = ~ chk_acct + duration + credit_his + purpose + amountLog + saving_acct + present_emp + installment_rate + sex + other_debtor + present_resid + property + age + other_install + housing + n_credits + job + n_people + telephone + foreign, trace=0)

summary(credit_probit_fwd)

## 
## Call:
## glm(formula = default ~ chk_acct + duration + credit_his + saving_acct + 
##     property + present_emp + other_install + installment_rate + 
##     housing + other_debtor + n_credits, family = binomial(link = "probit"), 
##     data = datos_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8259  -0.7628  -0.4207   0.8121   2.7784  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.191151   0.408612   2.915 0.003556 ** 
## chk_acct         -0.375378   0.044956  -8.350  < 2e-16 ***
## duration          0.023065   0.004389   5.256 1.48e-07 ***
## credit_his       -0.221499   0.054299  -4.079 4.52e-05 ***
## saving_acct      -0.140487   0.036931  -3.804 0.000142 ***
## property          0.148169   0.056496   2.623 0.008725 ** 
## present_emp      -0.120050   0.044892  -2.674 0.007491 ** 
## other_install    -0.171959   0.072122  -2.384 0.017113 *  
## installment_rate  0.123773   0.048357   2.560 0.010480 *  
## housing          -0.176260   0.100910  -1.747 0.080689 .  
## other_debtor     -0.202656   0.115630  -1.753 0.079667 .  
## n_credits         0.142115   0.098860   1.438 0.150564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 979.07  on 799  degrees of freedom
## Residual deviance: 761.75  on 788  degrees of freedom
## AIC: 785.75
## 
## Number of Fisher Scoring iterations: 5

Observamos que con ambos métodos obtenemos el mismo modelo.

Evaluamos la precisión y exactitud con datos de entrenamiento y prueba.

Con datos de entrenamiento datos_train

predicciones <- ifelse(test = credit_probit_fwd$fitted.values > 0.3, yes =1, no = 0)
matriz_conf_prob <- table(credit_probit_fwd$model$default, predicciones, dnn = c("observaciones", "predicciones"))
matriz_conf_prob

##              predicciones
## observaciones   0   1
##             0 398 161
##             1  63 178

exactitud <- (matriz_conf_prob[1] + matriz_conf_prob[4] ) / sum(matriz_conf_prob[1:4])
exactitud

## [1] 0.72

Con datos de prueba datos_test

predicciones_F <- predict(credit_probit_fwd,datos_test) 
ProbProbit <- pnorm(predicciones_F)
predicciones <- ifelse(test = ProbProbit > 0.3, yes =1, no = 0)
matriz_conf_prob_test <- table(datos_test$default, predicciones, dnn = c("observaciones", "predicciones"))
matriz_conf_prob_test

##              predicciones
## observaciones   0   1
##             0 101  40
##             1  17  42

exactitud <- (matriz_conf_prob_test[1] + matriz_conf_prob_test[4] ) / sum(matriz_conf_prob_test[1:4])
exactitud

## [1] 0.715

Modelos de regresión de variables binarias, Machine Learning

Armando Tapia

6/10/2021

Aprendizaje automático para modelos de probabilidad

Ejemplo: German Credit (source: uci.edu)