Continuando con los modelos de probabilidad, incluiremos un nuevo conjunto de datos
Liga de la documentación del conjunto de datos Liga al centro de datos
En esta sección, exploraremos un conjunto de datos de crédito en Alemania. Este conjunto de datos clasifica a las personas/entidades descritas por un conjunto de atributos como riesgos crediticios buenos o malos (predeterminado).
| No | Tipo de variable | Nombre | Descripción |
|---|---|---|---|
| 1 | (Cualitativa) | chk_acct | Estatus de la cuenta |
| 2 | (Numérico) | duration | Duración en meses |
| 3 | (Cualitativa) | credit_his | Historia crediticia |
| 4 | (Cualitativa) | purpose | Propósito del crédito |
| 5 | (Numérico) | amount | Monto |
| 6 | (Cualitativa) | saving_acct | Cuenta de ahorros |
| 7 | (Cualitativa) | present_emp | Tiempo como empleado |
| 8 | (Numérico) | installment_rate | Tasa de pago |
| 9 | (Cualitativa) | sex | Estado civil/sexo |
| 10 | (Cualitativa) | other_debtor | Otros deudores |
| 11 | (Numérico) | present_resid | Tiempo de residencia actual |
| 12 | (Cualitativa) | property | Propiedad |
| 13 | (Numérico) | age | Edad |
| 14 | (Cualitativa) | other_install | Otros planes de pago |
| 15 | (Cualitativa) | housing | Estatus de vivienda |
| 16 | (Numérico) | n_credits | Número de créditos vigentes en el banco |
| 17 | (Cualitativa) | job | Trabajo |
| 18 | (Numérico) | n_people | Número de dependientes |
| 19 | (Cualitativa) | telephone | Teléfono |
| 20 | (Cualitativa) | foreign | Trabajador extranjero |
| 21 | (Cualitativa) | default | (2: Mal crédito, 1: Buen crédito) |
Cargamos librerías
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats)
library(readxl)
library(gmodels)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library(MASS)
#library(ISLR)
#library(fRegression)
#library(psych)
Descargamos los datos originales. En este conjunto de datos podemos observar las variables cualitativas y cuantitativas
german_creditdef = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
head(german_creditdef)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
## 1 A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1
## 2 A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1
## 3 A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2
## 4 A11 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173 2
## 5 A11 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173 2
## 6 A14 36 A32 A46 9055 A65 A73 2 A93 A101 4 A124 35 A143 A153 1 A172 2
## V19 V20 V21
## 1 A192 A201 1
## 2 A191 A201 2
## 3 A191 A201 1
## 4 A191 A201 1
## 5 A191 A201 2
## 6 A192 A201 1
Otros datos que nos ofrece la fuente de información es un conjunto con solamente datos numéricos que incluye más columnas, e incluso simplifica variables numéricas. Para fines de demostración se incluyen los datos a continuación, sin embargo nuestro análisis utilizará los datos del chunck anterior.
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric")
head(german_credit)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 1 6 4 12 5 5 3 4 1 67 3 2 1 2 1 0 0 1 0 0 1
## 2 2 48 2 60 1 3 2 2 1 22 3 1 1 1 1 0 0 1 0 0 1
## 3 4 12 4 21 1 4 3 3 1 49 3 1 2 1 1 0 0 1 0 0 1
## 4 1 42 2 79 1 4 3 4 2 45 3 1 2 1 1 0 0 0 0 0 0
## 5 1 24 3 49 1 3 3 4 4 53 3 2 2 1 1 1 0 1 0 0 0
## 6 4 36 2 91 5 3 3 4 4 35 3 1 2 2 1 0 0 1 0 0 0
## V22 V23 V24 V25
## 1 0 0 1 1
## 2 0 0 1 2
## 3 0 1 0 1
## 4 0 0 1 1
## 5 0 0 1 2
## 6 0 1 0 1
Utilizando los datos guardados en german_creditdef, observamos que muchas variables son categóricas, por tanto, a continuación los tranformaremos la variables categóricas en datos numéricos para su análisis y los almacenaremos en a variable german.
german = german_creditdef
for(i in 1:length(german)){
german[,i] <- gsub(paste('A',i,sep = ''), '', german[,i])
}
head(german)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 1 6 4 3 1169 5 5 4 3 1 4 1 67 3 2 2 3 1 2 1 1
## 2 2 48 2 3 5951 1 3 2 2 1 2 1 22 3 2 1 3 1 1 1 2
## 3 4 12 4 6 2096 1 4 2 3 1 3 1 49 3 2 1 2 2 1 1 1
## 4 1 42 2 2 7882 1 4 2 3 3 4 2 45 3 3 1 3 2 1 1 1
## 5 1 24 3 0 4870 1 3 3 3 1 4 4 53 3 3 2 3 2 1 1 2
## 6 4 36 2 6 9055 5 3 2 3 1 4 4 35 3 3 1 2 2 2 1 1
Para mejor lectura de los datos, reemplazamos los nombres de las columnas
colnames(german) = c("chk_acct", "duration", "credit_his", "purpose",
"amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor",
"present_resid", "property", "age", "other_install", "housing", "n_credits", "job", "n_people", "telephone", "foreign", "default")
head(german)
## chk_acct duration credit_his purpose amount saving_acct present_emp
## 1 1 6 4 3 1169 5 5
## 2 2 48 2 3 5951 1 3
## 3 4 12 4 6 2096 1 4
## 4 1 42 2 2 7882 1 4
## 5 1 24 3 0 4870 1 3
## 6 4 36 2 6 9055 5 3
## installment_rate sex other_debtor present_resid property age other_install
## 1 4 3 1 4 1 67 3
## 2 2 2 1 2 1 22 3
## 3 2 3 1 3 1 49 3
## 4 2 3 3 4 2 45 3
## 5 3 3 1 4 4 53 3
## 6 2 3 1 4 4 35 3
## housing n_credits job n_people telephone foreign default
## 1 2 2 3 1 2 1 1
## 2 2 1 3 1 1 1 2
## 3 2 1 2 2 1 1 1
## 4 3 1 3 2 1 1 1
## 5 3 2 3 2 1 1 2
## 6 3 1 2 2 2 1 1
Buscamos NA's y analizamos el resumen
summary(german)
## chk_acct duration credit_his purpose
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## amount saving_acct present_emp installment_rate
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## sex other_debtor present_resid property
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## age other_install housing n_credits
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## job n_people telephone foreign
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## default
## Length:1000
## Class :character
## Mode :character
Observamos que no hay presencia de NA's. También observamos que todas las columnas son de tipo 'character'
str(german)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : chr "1" "2" "4" "1" ...
## $ duration : chr "6" "48" "12" "42" ...
## $ credit_his : chr "4" "2" "4" "2" ...
## $ purpose : chr "3" "3" "6" "2" ...
## $ amount : chr "1169" "5951" "2096" "7882" ...
## $ saving_acct : chr "5" "1" "1" "1" ...
## $ present_emp : chr "5" "3" "4" "4" ...
## $ installment_rate: chr "4" "2" "2" "2" ...
## $ sex : chr "3" "2" "3" "3" ...
## $ other_debtor : chr "1" "1" "1" "3" ...
## $ present_resid : chr "4" "2" "3" "4" ...
## $ property : chr "1" "1" "1" "2" ...
## $ age : chr "67" "22" "49" "45" ...
## $ other_install : chr "3" "3" "3" "3" ...
## $ housing : chr "2" "2" "2" "3" ...
## $ n_credits : chr "2" "1" "1" "1" ...
## $ job : chr "3" "3" "2" "3" ...
## $ n_people : chr "1" "1" "2" "2" ...
## $ telephone : chr "2" "1" "1" "1" ...
## $ foreign : chr "1" "1" "1" "1" ...
## $ default : chr "1" "2" "1" "1" ...
Para nuestro fines requerimos transformar las columnas a datos numéricos.
for(i in 1:length(german)){
german[,i] <- as.numeric(german[,i])
}
str(german)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : num 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : num 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : num 4 2 4 2 3 2 2 2 2 4 ...
## $ purpose : num 3 3 6 2 0 6 2 1 3 0 ...
## $ amount : num 1169 5951 2096 7882 4870 ...
## $ saving_acct : num 5 1 1 1 1 5 3 1 4 1 ...
## $ present_emp : num 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate: num 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : num 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : num 1 1 1 3 1 1 1 1 1 1 ...
## $ present_resid : num 4 2 3 4 4 4 4 2 4 2 ...
## $ property : num 1 1 1 2 4 4 2 3 1 3 ...
## $ age : num 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : num 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : num 2 2 2 3 3 3 2 1 2 2 ...
## $ n_credits : num 2 1 1 1 2 1 1 1 1 2 ...
## $ job : num 3 3 2 3 3 2 3 4 2 4 ...
## $ n_people : num 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : num 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ default : num 1 2 1 1 2 1 1 1 1 2 ...
Nuevamente realizamos un resumen de nuestros datos
summary(german)
## chk_acct duration credit_his purpose
## Min. :1.000 Min. : 4.0 Min. :0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:12.0 1st Qu.:2.000 1st Qu.: 1.000
## Median :2.000 Median :18.0 Median :2.000 Median : 2.000
## Mean :2.577 Mean :20.9 Mean :2.545 Mean : 2.828
## 3rd Qu.:4.000 3rd Qu.:24.0 3rd Qu.:4.000 3rd Qu.: 3.000
## Max. :4.000 Max. :72.0 Max. :4.000 Max. :10.000
## amount saving_acct present_emp installment_rate
## Min. : 250 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 1366 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median : 2320 Median :1.000 Median :3.000 Median :3.000
## Mean : 3271 Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.: 3972 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :18424 Max. :5.000 Max. :5.000 Max. :4.000
## sex other_debtor present_resid property
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :3.000 Median :1.000 Median :3.000 Median :2.000
## Mean :2.682 Mean :1.145 Mean :2.845 Mean :2.358
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :3.000 Max. :4.000 Max. :4.000
## age other_install housing n_credits
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:27.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :33.00 Median :3.000 Median :2.000 Median :1.000
## Mean :35.55 Mean :2.675 Mean :1.929 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :75.00 Max. :3.000 Max. :3.000 Max. :4.000
## job n_people telephone foreign default
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.0
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.0
## Median :3.000 Median :1.000 Median :1.000 Median :1.000 Median :1.0
## Mean :2.904 Mean :1.155 Mean :1.404 Mean :1.037 Mean :1.3
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.0
## Max. :4.000 Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.0
Estandarizamos la variable amount por su logaritmo
german <- german %>% mutate(amountLog = log(amount))
Observamos ahora el summary de amountLog
summary(german$amountLog)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.521 7.219 7.749 7.789 8.287 9.821
Por cuestiones de practicidad creamos un conjunto nuevo que no tenga la variable amount
# Creamos un respaldo
german_bk <- german
german$amount <- NULL
#write.csv(german, 'german_credit.csv')
str(german)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : num 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : num 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : num 4 2 4 2 3 2 2 2 2 4 ...
## $ purpose : num 3 3 6 2 0 6 2 1 3 0 ...
## $ saving_acct : num 5 1 1 1 1 5 3 1 4 1 ...
## $ present_emp : num 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate: num 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : num 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : num 1 1 1 3 1 1 1 1 1 1 ...
## $ present_resid : num 4 2 3 4 4 4 4 2 4 2 ...
## $ property : num 1 1 1 2 4 4 2 3 1 3 ...
## $ age : num 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : num 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : num 2 2 2 3 3 3 2 1 2 2 ...
## $ n_credits : num 2 1 1 1 2 1 1 1 1 2 ...
## $ job : num 3 3 2 3 3 2 3 4 2 4 ...
## $ n_people : num 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : num 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ default : num 1 2 1 1 2 1 1 1 1 2 ...
## $ amountLog : num 7.06 8.69 7.65 8.97 8.49 ...
La columna default tiene valores entre 1 y 2. Para nuestros de probabilidad necesitamos que se encuentren entre 0 y 1.
#german[,21] <- german[,21] - 1
german$default <- german$default - 1
summary(german$default)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.3 1.0 1.0
La sesión anterios utilizamos un modelo de probabilida que estaba conformado por dos variables independientes y una dependiente. Con un número reducido de variables independientes, es posible construir un modelo de forma manual.
Para este nuevo caso tenemos 20 variables independiente candidatas a formar parte del modelo, y una variable dependiente que es default.
El número de combinaciones posibles de 20 variables independientes, es de:
\[2^{20} - 1 = 1048575\] Obtener el mejor modelo de forma manual a patir de un universo de combinanciones tan grande, sobra decir que es totalmente ineficiente. Por tanto, se justifica el uso de poder de cómputo nuestro propósito.
Creamos nuestro conjunto de entrenamiento y de prueba
set.seed(1234)
# p = 0.8 toma el 80% de la muestra aleatoriamente, y deamos el resto para la prueba
train <- createDataPartition(y = german$default, p = 0.8, list = FALSE, times = 1)
datos_train <- german[train, ]
datos_test <- german[-train, ]
Iniciamos nuestro modelo completo con una regresión probit y los datos de entrenamiento
# Creamos el modelo completo
german_model_full <- glm(default ~ ., family = binomial(link = "probit"), data = datos_train)
summary(german_model_full)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"),
## data = datos_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8217 -0.7586 -0.4068 0.8049 2.6260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.372840 0.937850 1.464 0.143244
## chk_acct -0.377954 0.045801 -8.252 < 2e-16 ***
## duration 0.019495 0.005994 3.252 0.001145 **
## credit_his -0.215898 0.055529 -3.888 0.000101 ***
## purpose -0.004827 0.019447 -0.248 0.803959
## saving_acct -0.135299 0.037447 -3.613 0.000303 ***
## present_emp -0.112522 0.047409 -2.373 0.017624 *
## installment_rate 0.144196 0.055667 2.590 0.009589 **
## sex -0.101129 0.076818 -1.316 0.188015
## other_debtor -0.184607 0.117013 -1.578 0.114642
## present_resid 0.008439 0.050991 0.166 0.868550
## property 0.133776 0.059936 2.232 0.025616 *
## age -0.005475 0.005319 -1.029 0.303322
## other_install -0.190200 0.072975 -2.606 0.009150 **
## housing -0.130081 0.109424 -1.189 0.234526
## n_credits 0.163879 0.101961 1.607 0.107996
## job 0.088043 0.090759 0.970 0.332010
## n_people 0.085947 0.156380 0.550 0.582592
## telephone -0.185548 0.121791 -1.523 0.127635
## foreign -0.451524 0.345871 -1.305 0.191734
## amountLog 0.070444 0.105398 0.668 0.503903
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 979.07 on 799 degrees of freedom
## Residual deviance: 753.68 on 779 degrees of freedom
## AIC: 795.68
##
## Number of Fisher Scoring iterations: 5
Backward selection
set.seed(1234)
credit_probit_bwd <- step(german_model_full, direction = "backward", scope = ~ 1, trace=0)
summary(credit_probit_bwd)
##
## Call:
## glm(formula = default ~ chk_acct + duration + credit_his + saving_acct +
## present_emp + installment_rate + other_debtor + property +
## other_install + housing + n_credits, family = binomial(link = "probit"),
## data = datos_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8259 -0.7628 -0.4207 0.8121 2.7784
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.191151 0.408612 2.915 0.003556 **
## chk_acct -0.375378 0.044956 -8.350 < 2e-16 ***
## duration 0.023065 0.004389 5.256 1.48e-07 ***
## credit_his -0.221499 0.054299 -4.079 4.52e-05 ***
## saving_acct -0.140487 0.036931 -3.804 0.000142 ***
## present_emp -0.120050 0.044892 -2.674 0.007491 **
## installment_rate 0.123773 0.048357 2.560 0.010480 *
## other_debtor -0.202656 0.115630 -1.753 0.079667 .
## property 0.148169 0.056496 2.623 0.008725 **
## other_install -0.171959 0.072122 -2.384 0.017113 *
## housing -0.176260 0.100910 -1.747 0.080689 .
## n_credits 0.142115 0.098860 1.438 0.150564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 979.07 on 799 degrees of freedom
## Residual deviance: 761.75 on 788 degrees of freedom
## AIC: 785.75
##
## Number of Fisher Scoring iterations: 5
Forward seletion
Creamos el modelo simple
german_model_reduced = glm(default ~ 1, family = binomial(link = "probit"), data=datos_train)
summary(german_model_reduced)
##
## Call:
## glm(formula = default ~ 1, family = binomial(link = "probit"),
## data = datos_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8467 -0.8467 -0.8467 1.5491 1.5491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.52081 0.04657 -11.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 979.07 on 799 degrees of freedom
## Residual deviance: 979.07 on 799 degrees of freedom
## AIC: 981.07
##
## Number of Fisher Scoring iterations: 4
set.seed(1234)
#Forward selection
credit_probit_fwd <- step(german_model_reduced, direction = "forward", scope = ~ chk_acct + duration + credit_his + purpose + amountLog + saving_acct + present_emp + installment_rate + sex + other_debtor + present_resid + property + age + other_install + housing + n_credits + job + n_people + telephone + foreign, trace=0)
summary(credit_probit_fwd)
##
## Call:
## glm(formula = default ~ chk_acct + duration + credit_his + saving_acct +
## property + present_emp + other_install + installment_rate +
## housing + other_debtor + n_credits, family = binomial(link = "probit"),
## data = datos_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8259 -0.7628 -0.4207 0.8121 2.7784
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.191151 0.408612 2.915 0.003556 **
## chk_acct -0.375378 0.044956 -8.350 < 2e-16 ***
## duration 0.023065 0.004389 5.256 1.48e-07 ***
## credit_his -0.221499 0.054299 -4.079 4.52e-05 ***
## saving_acct -0.140487 0.036931 -3.804 0.000142 ***
## property 0.148169 0.056496 2.623 0.008725 **
## present_emp -0.120050 0.044892 -2.674 0.007491 **
## other_install -0.171959 0.072122 -2.384 0.017113 *
## installment_rate 0.123773 0.048357 2.560 0.010480 *
## housing -0.176260 0.100910 -1.747 0.080689 .
## other_debtor -0.202656 0.115630 -1.753 0.079667 .
## n_credits 0.142115 0.098860 1.438 0.150564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 979.07 on 799 degrees of freedom
## Residual deviance: 761.75 on 788 degrees of freedom
## AIC: 785.75
##
## Number of Fisher Scoring iterations: 5
Observamos que con ambos métodos obtenemos el mismo modelo.
Evaluamos la precisión y exactitud con datos de entrenamiento y prueba.
Con datos de entrenamiento datos_train
predicciones <- ifelse(test = credit_probit_fwd$fitted.values > 0.3, yes =1, no = 0)
matriz_conf_prob <- table(credit_probit_fwd$model$default, predicciones, dnn = c("observaciones", "predicciones"))
matriz_conf_prob
## predicciones
## observaciones 0 1
## 0 398 161
## 1 63 178
exactitud <- (matriz_conf_prob[1] + matriz_conf_prob[4] ) / sum(matriz_conf_prob[1:4])
exactitud
## [1] 0.72
Con datos de prueba datos_test
predicciones_F <- predict(credit_probit_fwd,datos_test)
ProbProbit <- pnorm(predicciones_F)
predicciones <- ifelse(test = ProbProbit > 0.3, yes =1, no = 0)
matriz_conf_prob_test <- table(datos_test$default, predicciones, dnn = c("observaciones", "predicciones"))
matriz_conf_prob_test
## predicciones
## observaciones 0 1
## 0 101 40
## 1 17 42
exactitud <- (matriz_conf_prob_test[1] + matriz_conf_prob_test[4] ) / sum(matriz_conf_prob_test[1:4])
exactitud
## [1] 0.715