Para proceder con el análisis, empezamos descargando los paquetes de software que nos permitirá la ejecución de los comandos a utilizar.
options(scipen=999)
pkges<-c("mfx","pROC","tidyverse","forecast","data.table")
#install.packages(pkges)
lapply(pkges,"library",character.only=T)
## Warning: package 'mfx' was built under R version 4.0.2
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.0.2
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: MASS
## Loading required package: betareg
## Warning: package 'betareg' was built under R version 4.0.2
## Warning: package 'pROC' was built under R version 4.0.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: package 'tidyverse' was built under R version 4.0.2
## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
## Warning: package 'forecast' was built under R version 4.0.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
## [[1]]
## [1] "mfx" "betareg" "MASS" "lmtest" "zoo" "sandwich"
## [7] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [13] "base"
##
## [[2]]
## [1] "pROC" "mfx" "betareg" "MASS" "lmtest" "zoo"
## [7] "sandwich" "stats" "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
##
## [[3]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "pROC" "mfx" "betareg"
## [13] "MASS" "lmtest" "zoo" "sandwich" "stats" "graphics"
## [19] "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "forecast" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "pROC" "mfx"
## [13] "betareg" "MASS" "lmtest" "zoo" "sandwich" "stats"
## [19] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "data.table" "forecast" "forcats" "stringr" "dplyr"
## [6] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [11] "tidyverse" "pROC" "mfx" "betareg" "MASS"
## [16] "lmtest" "zoo" "sandwich" "stats" "graphics"
## [21] "grDevices" "utils" "datasets" "methods" "base"
library(mfx)
También, vamos a direccionar esta hoja de trabajo a la carpeta de nuestra pc, donde esta la data almacenada y luego a cargarlos.
maindir <- getwd()
subdir <- c("Users/DELL/Desktop/uni 2020/RIESGOS/semana 9")
datos <- read.csv(paste0(maindir,"/UCI_Credit_Card.csv"),header = T)
datos <- datos[complete.cases(datos),]
Se esta trabajando con un Conjunto de datos de clientes morosos que poseen tarjetas de crédito; donde se presentan las siguientes variables: ID, línea de crédito, sexo, estado civil, edad y default payment next month (entre otros que tienen las características de los créditos otorgados), siendo esta última mencionada, nuestra variable dependiente.
De lo presentado se tiene que la variable ID no es relevante para nuestro estudio, es por ello que luego lo quitamos de nuestra data; también precisar que se puede hacer una revisión gráfica de las variables como sexo, educación, default payment next month, con el fin de hacer conclusiones generales.
plot1 <- ggplot(data = datos, aes(x=factor(SEX), fill =factor(default.payment.next.month))) +
geom_bar() +
ylab("Observations count") +
scale_x_discrete(labels = c('Male','Female')) +
xlab("")
plot1
Por ejemplo de esta gráfica se puede interpretar como que las mujeres (female) tienen más probabilidad de caer en morosidad (default=1) que en comparación que los hombres (male), esto se podría entender dada la dimensión de créditos otorgados a las mujeres.
plot2 <- ggplot(data = datos, aes(x=factor(EDUCATION), fill =factor(SEX))) +
geom_bar() +
ylab("Observations count") +
xlab("(1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)")
plot2
De esta gráfica, podemos deducir que las mujeres son las que tienen un nivel educativo mas bajo que en comparación a los hombres. Es por ello, que hoy en día esta metodología es muy usada en temas de inclusión financiera, para cerrar brechas de género de acceso a créditos y su correlación con el nivel educativo de los usuarios.
“El análisis de regresión logística se enmarca en el conjunto de Modelos Lineales Generalizados que usa como función de enlace, la función logit. La regresión logística analiza datos distribuidos binomialmente de la forma donde los números de ensayos Bernoulli son conocidos y las probabilidades de éxito son desconocidas. El modelo es entonces obtenido a base de lo que cada ensayo (valor de i) y el conjunto de variables explicativas/independientes puedan informar acerca de la probabilidad final.”
— Wikipedia
Este tipo de función nos interesa, dado que analizaremos cuando un crédito caerá o no en default, en morosidad. Es decir, modelar la probabilidad (p) con la que el crédito cae en morosidad (1-p) o con la que no; en conclusión, toma dos valores (1 ó 0). Nuestra variable dependiente (default.payment.next.month) va a tomar dos valores. Cuando corramos el modelo logit, nos dará: log(p/1-p); de lo cual, luego extraeremos el ratio de oods.
Sin embargo, no habría que olvidarnos del “beta”, pues es la velocidad a la que puede cambiar la probabilidad de éxito ante un cambio de “x”.
cred <- datos[,-1]
cred$SEX <- as.factor(cred$SEX)
cred$MARRIAGE <- as.factor(cred$MARRIAGE)
cred$EDUCATION <- as.factor(cred$EDUCATION)
table(cred$default.payment.next.month)
##
## 0 1
## 23364 6636
mean(cred$default.payment.next.month)
## [1] 0.2212
modelo1 <- glm(default.payment.next.month ~ .,data = cred,
family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(modelo1)
##
## Call:
## glm(formula = default.payment.next.month ~ ., family = binomial(link = "logit"),
## data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1360 -0.7020 -0.5446 -0.2823 3.8693
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.1098677275 82.4386671934 -0.159 0.87365
## LIMIT_BAL -0.0000006998 0.0000001578 -4.435 0.00000918995 ***
## SEX2 -0.1125473057 0.0307294362 -3.663 0.00025 ***
## EDUCATION1 10.8032174781 82.4370190102 0.131 0.89574
## EDUCATION2 10.7186578628 82.4370193533 0.130 0.89655
## EDUCATION3 10.6967653493 82.4370244363 0.130 0.89676
## EDUCATION4 9.6558963456 82.4379629323 0.117 0.90676
## EDUCATION5 9.4400204583 82.4374020911 0.115 0.90883
## EDUCATION6 10.5007097809 82.4379975564 0.127 0.89864
## MARRIAGE1 1.3193692317 0.5159761385 2.557 0.01056 *
## MARRIAGE2 1.1304392031 0.5161274883 2.190 0.02851 *
## MARRIAGE3 1.2406146704 0.5329097416 2.328 0.01991 *
## AGE 0.0053722868 0.0018619598 2.885 0.00391 **
## PAY_0 0.5775622576 0.0177111945 32.610 < 0.0000000000000002 ***
## PAY_2 0.0815524098 0.0202014836 4.037 0.00005415025 ***
## PAY_3 0.0713238589 0.0226156524 3.154 0.00161 **
## PAY_4 0.0225008520 0.0250122334 0.900 0.36834
## PAY_5 0.0343817876 0.0268899180 1.279 0.20103
## PAY_6 0.0068923468 0.0221533280 0.311 0.75571
## BILL_AMT1 -0.0000055130 0.0000011378 -4.845 0.00000126459 ***
## BILL_AMT2 0.0000024128 0.0000015050 1.603 0.10889
## BILL_AMT3 0.0000013451 0.0000013233 1.016 0.30940
## BILL_AMT4 -0.0000001405 0.0000013513 -0.104 0.91718
## BILL_AMT5 0.0000007589 0.0000015207 0.499 0.61774
## BILL_AMT6 0.0000001952 0.0000011943 0.163 0.87014
## PAY_AMT1 -0.0000136444 0.0000023062 -5.916 0.00000000329 ***
## PAY_AMT2 -0.0000094930 0.0000020876 -4.547 0.00000543334 ***
## PAY_AMT3 -0.0000026410 0.0000017179 -1.537 0.12421
## PAY_AMT4 -0.0000040687 0.0000017861 -2.278 0.02273 *
## PAY_AMT5 -0.0000032130 0.0000017752 -1.810 0.07031 .
## PAY_AMT6 -0.0000020923 0.0000012973 -1.613 0.10679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31705 on 29999 degrees of freedom
## Residual deviance: 27827 on 29969 degrees of freedom
## AIC: 27889
##
## Number of Fisher Scoring iterations: 11
model.AIC <- stepAIC(modelo1)
## Start: AIC=27889.26
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
## BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
## PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - BILL_AMT4 1 27827 27887
## - BILL_AMT6 1 27827 27887
## - PAY_6 1 27827 27887
## - BILL_AMT5 1 27828 27888
## - PAY_4 1 27828 27888
## - BILL_AMT3 1 27828 27888
## - PAY_5 1 27829 27889
## <none> 27827 27889
## - PAY_AMT3 1 27830 27890
## - BILL_AMT2 1 27830 27890
## - PAY_AMT6 1 27830 27890
## - PAY_AMT5 1 27831 27891
## - PAY_AMT4 1 27833 27893
## - AGE 1 27836 27896
## - PAY_3 1 27837 27897
## - SEX 1 27841 27901
## - PAY_2 1 27844 27904
## - LIMIT_BAL 1 27847 27907
## - PAY_AMT2 1 27853 27913
## - BILL_AMT1 1 27854 27914
## - MARRIAGE 3 27865 27921
## - PAY_AMT1 1 27874 27934
## - EDUCATION 6 27887 27937
## - PAY_0 1 28886 28946
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=27887.27
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
## BILL_AMT2 + BILL_AMT3 + BILL_AMT5 + BILL_AMT6 + PAY_AMT1 +
## PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - BILL_AMT6 1 27827 27885
## - PAY_6 1 27827 27885
## - BILL_AMT5 1 27828 27886
## - PAY_4 1 27828 27886
## - BILL_AMT3 1 27828 27886
## - PAY_5 1 27829 27887
## <none> 27827 27887
## - BILL_AMT2 1 27830 27888
## - PAY_AMT6 1 27830 27888
## - PAY_AMT3 1 27831 27889
## - PAY_AMT5 1 27831 27889
## - PAY_AMT4 1 27834 27892
## - AGE 1 27836 27894
## - PAY_3 1 27837 27895
## - SEX 1 27841 27899
## - PAY_2 1 27844 27902
## - LIMIT_BAL 1 27847 27905
## - PAY_AMT2 1 27853 27911
## - BILL_AMT1 1 27854 27912
## - MARRIAGE 3 27865 27919
## - PAY_AMT1 1 27874 27932
## - EDUCATION 6 27887 27935
## - PAY_0 1 28886 28944
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=27885.29
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
## BILL_AMT2 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 +
## PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - PAY_6 1 27827 27883
## - PAY_4 1 27828 27884
## - BILL_AMT3 1 27829 27885
## - BILL_AMT5 1 27829 27885
## - PAY_5 1 27829 27885
## <none> 27827 27885
## - BILL_AMT2 1 27830 27886
## - PAY_AMT6 1 27830 27886
## - PAY_AMT3 1 27831 27887
## - PAY_AMT5 1 27832 27888
## - PAY_AMT4 1 27834 27890
## - AGE 1 27836 27892
## - PAY_3 1 27837 27893
## - SEX 1 27841 27897
## - PAY_2 1 27844 27900
## - LIMIT_BAL 1 27847 27903
## - PAY_AMT2 1 27853 27909
## - BILL_AMT1 1 27854 27910
## - MARRIAGE 3 27865 27917
## - PAY_AMT1 1 27874 27930
## - EDUCATION 6 27887 27933
## - PAY_0 1 28886 28942
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=27883.4
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + BILL_AMT1 +
## BILL_AMT2 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 +
## PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - PAY_4 1 27828 27882
## - BILL_AMT3 1 27829 27883
## - BILL_AMT5 1 27829 27883
## <none> 27827 27883
## - BILL_AMT2 1 27830 27884
## - PAY_AMT6 1 27830 27884
## - PAY_5 1 27830 27884
## - PAY_AMT3 1 27831 27885
## - PAY_AMT5 1 27832 27886
## - PAY_AMT4 1 27834 27888
## - AGE 1 27836 27890
## - PAY_3 1 27837 27891
## - SEX 1 27841 27895
## - PAY_2 1 27844 27898
## - LIMIT_BAL 1 27848 27902
## - PAY_AMT2 1 27853 27907
## - BILL_AMT1 1 27855 27909
## - MARRIAGE 3 27865 27915
## - PAY_AMT1 1 27875 27929
## - EDUCATION 6 27887 27931
## - PAY_0 1 28888 28942
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=27882.24
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_5 + BILL_AMT1 + BILL_AMT2 +
## BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
## PAY_AMT4 + PAY_AMT5 + PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - BILL_AMT3 1 27830 27882
## - BILL_AMT5 1 27830 27882
## <none> 27828 27882
## - BILL_AMT2 1 27831 27883
## - PAY_AMT6 1 27831 27883
## - PAY_AMT3 1 27832 27884
## - PAY_AMT5 1 27833 27885
## - PAY_AMT4 1 27835 27887
## - PAY_5 1 27837 27889
## - AGE 1 27837 27889
## - SEX 1 27842 27894
## - PAY_3 1 27844 27896
## - PAY_2 1 27845 27897
## - LIMIT_BAL 1 27849 27901
## - PAY_AMT2 1 27854 27906
## - BILL_AMT1 1 27856 27908
## - MARRIAGE 3 27866 27914
## - PAY_AMT1 1 27876 27928
## - EDUCATION 6 27888 27930
## - PAY_0 1 28897 28949
##
## Step: AIC=27881.48
## default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## AGE + PAY_0 + PAY_2 + PAY_3 + PAY_5 + BILL_AMT1 + BILL_AMT2 +
## BILL_AMT5 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 +
## PAY_AMT6
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 27830 27882
## - PAY_AMT6 1 27832 27882
## - BILL_AMT5 1 27834 27884
## - PAY_AMT5 1 27834 27884
## - PAY_AMT3 1 27835 27885
## - BILL_AMT2 1 27836 27886
## - PAY_AMT4 1 27838 27888
## - PAY_5 1 27838 27888
## - AGE 1 27838 27888
## - SEX 1 27843 27893
## - PAY_3 1 27845 27895
## - PAY_2 1 27846 27896
## - LIMIT_BAL 1 27850 27900
## - PAY_AMT2 1 27857 27907
## - BILL_AMT1 1 27857 27907
## - MARRIAGE 3 27867 27913
## - PAY_AMT1 1 27878 27928
## - EDUCATION 6 27889 27929
## - PAY_0 1 28898 28948
yhat1<-model.AIC$fitted.values
hist(yhat1)
Esta estimación nos da que la mejor combinación de variables con menor AIC (27881.48) fue: default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_2 + PAY_3 + PAY_5 + BILL_AMT1 + BILL_AMT2 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
Lo cual asumiremos como cierto y procederemos a evaluarlo con el gráfico de sensibilidad y especificidad (donde el punto máximo es el punto de corte de las líneas):
c<-seq(0.01,0.3,by=0.001)
sens<-c()
spec<-c()
for (i in 1:length(c)){
y.pred<-ifelse(model.AIC$fitted.values > c[i], yes = 1, no = 0)
spec[i]<-prop.table(table(cred$default.payment.next.month,y.pred),1)[1]
sens[i]<-prop.table(table(cred$default.payment.next.month,y.pred),1)[4]
}
o.cut<-mean(c[which(round(spec,1)==round(sens,1))],na.rm = T)
plot(c,sens,type="l",col=2,main=c("Especificidad vs Sensibilidad"),ylab=c("Especificidad/Sensibilidad"))
lines(c,spec,col=3)
abline(v=o.cut)
print(o.cut)
## [1] 0.217
Se puede observar que el punto de corte ha sido: 0.217, que es el punto donde la sensibilidad y especificidad se maximizan, con el modelo logit general (usando todas las variables).
Procedemos a evaluar lo anterior con la matriz de confusion. Cabe mencionar que la sensibilidad es la probabilidad de éxito dado que la observación fue exitosa, es decir, si cayó en morosidad y la especificidad es la probabilida del fracaso dado que en realidad es fracaso; pues lo que nos interesa modelar son los fracasos y los éxitos:
print(o.cut)
## [1] 0.217
y.pred<-ifelse(model.AIC$fitted.values > o.cut, yes = 1, no = 0)
matriz_confusion <- table(cred$default.payment.next.month, y.pred,
dnn = c("observaciones", "predicciones"))
prop.table(matriz_confusion,1)
## predicciones
## observaciones 0 1
## 0 0.6725732 0.3274268
## 1 0.3384569 0.6615431
De esta matriz se puede inferir que según la sensibilidad, hay una probabilidad de 66.15% de caer en morosidad (probabilidad de éxito) y en base a la especificidad, hay una probabilidad de 67.25% de no caer en default (probabilidad de fracaso).
Ahora calcularemos el área bajo la Curva ROC; este índice nos manifiesta que tan bueno ha sido nuestro Modelo Logit-General para predecir. Donde si el valor del área bajo la curva es mayor igual a 0.7, se considerará aceptable y si es mayor a 0.9 se considerará muy bueno:
roc(cred$default.payment.next.month,yhat1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cred$default.payment.next.month, predictor = yhat1)
##
## Data: yhat1 in 23364 controls (cred$default.payment.next.month 0) < 6636 cases (cred$default.payment.next.month 1).
## Area under the curve: 0.726
plot(roc(cred$default.payment.next.month,yhat1),main=c("Curva ROC"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Cabe mencionar que si la curva de color negro, estuviese más pegada a la esquina, a 1, nuestro modelo tendría una muy buena predicción, dado que su área sería 0.9 La curva de ROC, junto a la sensibilidad y especificidad, forman parte de los Tests de Predicción. De la gráfica anterior se observa, que nuestro modelo es aceptable, pues el área bajo la curva de ROC nos dió: 0.726
Sin embargo, más adelante solo se trabajará con un modelo logit de 5 covariables.
summary(modelo1)
##
## Call:
## glm(formula = default.payment.next.month ~ ., family = binomial(link = "logit"),
## data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1360 -0.7020 -0.5446 -0.2823 3.8693
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.1098677275 82.4386671934 -0.159 0.87365
## LIMIT_BAL -0.0000006998 0.0000001578 -4.435 0.00000918995 ***
## SEX2 -0.1125473057 0.0307294362 -3.663 0.00025 ***
## EDUCATION1 10.8032174781 82.4370190102 0.131 0.89574
## EDUCATION2 10.7186578628 82.4370193533 0.130 0.89655
## EDUCATION3 10.6967653493 82.4370244363 0.130 0.89676
## EDUCATION4 9.6558963456 82.4379629323 0.117 0.90676
## EDUCATION5 9.4400204583 82.4374020911 0.115 0.90883
## EDUCATION6 10.5007097809 82.4379975564 0.127 0.89864
## MARRIAGE1 1.3193692317 0.5159761385 2.557 0.01056 *
## MARRIAGE2 1.1304392031 0.5161274883 2.190 0.02851 *
## MARRIAGE3 1.2406146704 0.5329097416 2.328 0.01991 *
## AGE 0.0053722868 0.0018619598 2.885 0.00391 **
## PAY_0 0.5775622576 0.0177111945 32.610 < 0.0000000000000002 ***
## PAY_2 0.0815524098 0.0202014836 4.037 0.00005415025 ***
## PAY_3 0.0713238589 0.0226156524 3.154 0.00161 **
## PAY_4 0.0225008520 0.0250122334 0.900 0.36834
## PAY_5 0.0343817876 0.0268899180 1.279 0.20103
## PAY_6 0.0068923468 0.0221533280 0.311 0.75571
## BILL_AMT1 -0.0000055130 0.0000011378 -4.845 0.00000126459 ***
## BILL_AMT2 0.0000024128 0.0000015050 1.603 0.10889
## BILL_AMT3 0.0000013451 0.0000013233 1.016 0.30940
## BILL_AMT4 -0.0000001405 0.0000013513 -0.104 0.91718
## BILL_AMT5 0.0000007589 0.0000015207 0.499 0.61774
## BILL_AMT6 0.0000001952 0.0000011943 0.163 0.87014
## PAY_AMT1 -0.0000136444 0.0000023062 -5.916 0.00000000329 ***
## PAY_AMT2 -0.0000094930 0.0000020876 -4.547 0.00000543334 ***
## PAY_AMT3 -0.0000026410 0.0000017179 -1.537 0.12421
## PAY_AMT4 -0.0000040687 0.0000017861 -2.278 0.02273 *
## PAY_AMT5 -0.0000032130 0.0000017752 -1.810 0.07031 .
## PAY_AMT6 -0.0000020923 0.0000012973 -1.613 0.10679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31705 on 29999 degrees of freedom
## Residual deviance: 27827 on 29969 degrees of freedom
## AIC: 27889
##
## Number of Fisher Scoring iterations: 11
En este caso, vamos a probar con las covariables: LIMIT_BAL+factor(SEX)+factor(MARRIAGE)+AGE+factor(PAY_0); para saber cuanto de AIC nos arroja el software.
Cabe mencionar que ponemos la etiqueta de factor a las variables: sex, marriage y pay_0, para que el modelo reconozca que son variables categóricas, discretas o dicotómicas, porque sino lo tomaría como variables continuas.
XB2 <- as.formula("default.payment.next.month ~ LIMIT_BAL+factor(SEX)+factor(MARRIAGE)+AGE+factor(PAY_0)")
modelo2 <- glm(XB2,data = cred)
summary(modelo2)
##
## Call:
## glm(formula = XB2, data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.81101 -0.17087 -0.13211 -0.04936 1.05535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07142234101 0.05263345381 1.357 0.17480
## LIMIT_BAL -0.00000030768 0.00000001761 -17.469 < 0.0000000000000002
## factor(SEX)2 -0.02341640199 0.00446083560 -5.249 0.000000154
## factor(MARRIAGE)1 0.14638570542 0.05102432529 2.869 0.00412
## factor(MARRIAGE)2 0.12419755455 0.05103685057 2.433 0.01496
## factor(MARRIAGE)3 0.14384226456 0.05503082326 2.614 0.00896
## AGE 0.00050961884 0.00026909286 1.894 0.05826
## factor(PAY_0)-1 0.02141026224 0.00871456636 2.457 0.01402
## factor(PAY_0)0 -0.03361975185 0.00795197153 -4.228 0.000023663
## factor(PAY_0)1 0.17517247494 0.00958791701 18.270 < 0.0000000000000002
## factor(PAY_0)2 0.51493844499 0.01045115496 49.271 < 0.0000000000000002
## factor(PAY_0)3 0.57101458795 0.02223260855 25.684 < 0.0000000000000002
## factor(PAY_0)4 0.49441611681 0.04360265345 11.339 < 0.0000000000000002
## factor(PAY_0)5 0.31363371866 0.07379122454 4.250 0.000021414
## factor(PAY_0)6 0.36441659325 0.11301381404 3.225 0.00126
## factor(PAY_0)7 0.60256928879 0.12487737969 4.825 0.000001405
## factor(PAY_0)8 0.39746460368 0.08612991067 4.615 0.000003953
##
## (Intercept)
## LIMIT_BAL ***
## factor(SEX)2 ***
## factor(MARRIAGE)1 **
## factor(MARRIAGE)2 *
## factor(MARRIAGE)3 **
## AGE .
## factor(PAY_0)-1 *
## factor(PAY_0)0 ***
## factor(PAY_0)1 ***
## factor(PAY_0)2 ***
## factor(PAY_0)3 ***
## factor(PAY_0)4 ***
## factor(PAY_0)5 ***
## factor(PAY_0)6 **
## factor(PAY_0)7 ***
## factor(PAY_0)8 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1398442)
##
## Null deviance: 5168.1 on 29999 degrees of freedom
## Residual deviance: 4192.9 on 29983 degrees of freedom
## AIC: 26139
##
## Number of Fisher Scoring iterations: 2
El modelo2 nos da un AIC de 26139, a compración del AIC del modelo1 que nos dió un AIC de 27881.48. Por lo tanto, seguimos probando con otras variables, hasta lograr tener un AIC mucho menor.
En este caso, vamos a probar con las covariables: LIMIT_BAL+factor(SEX)+factor(MARRIAGE)+factor(PAY_0)+factor(PAY_4); para saber cuanto de AIC nos arroja el software.
Cabe mencionar que ponemos la etiqueta de factor a las variables: sex, marriage, pay_0 y pay_4, para que el modelo reconozca que son variables categóricas, discretas o dicotómicas, porque sino lo tomaría como variables continuas.
XB3 <- as.formula("default.payment.next.month ~ LIMIT_BAL+factor(SEX)+factor(MARRIAGE)+factor(PAY_0)+factor(PAY_4)")
modelo3 <- glm(XB3,data = cred,
family = binomial(link = "logit"))
summary(modelo3)
##
## Call:
## glm(formula = XB3, family = binomial(link = "logit"), data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9568 -0.5904 -0.5145 -0.3848 2.6448
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5490242995 0.5142772800 -4.957 0.0000007176793384 ***
## LIMIT_BAL -0.0000021075 0.0000001407 -14.978 < 0.0000000000000002 ***
## factor(SEX)2 -0.1601634232 0.0316889494 -5.054 0.0000004321174375 ***
## factor(MARRIAGE)1 1.4038392946 0.5105288024 2.750 0.005964 **
## factor(MARRIAGE)2 1.2136540962 0.5104691279 2.378 0.017429 *
## factor(MARRIAGE)3 1.4032104897 0.5295130999 2.650 0.008049 **
## factor(PAY_0)-1 0.3619455468 0.0797418632 4.539 0.0000056530947992 ***
## factor(PAY_0)0 -0.1901296456 0.0787921654 -2.413 0.015820 *
## factor(PAY_0)1 0.9411585282 0.0752220926 12.512 < 0.0000000000000002 ***
## factor(PAY_0)2 2.2399976274 0.0859448749 26.063 < 0.0000000000000002 ***
## factor(PAY_0)3 2.3638115162 0.1552969620 15.221 < 0.0000000000000002 ***
## factor(PAY_0)4 1.9887206182 0.2660137596 7.476 0.0000000000000766 ***
## factor(PAY_0)5 0.9255252773 0.4056154431 2.282 0.022502 *
## factor(PAY_0)6 1.1200579834 0.6337112552 1.767 0.077152 .
## factor(PAY_0)7 2.2987705591 0.8627685465 2.664 0.007712 **
## factor(PAY_0)8 2.2570461601 0.7106944443 3.176 0.001494 **
## factor(PAY_4)-1 -0.3480307649 0.0645024463 -5.396 0.0000000682868881 ***
## factor(PAY_4)0 -0.1432251648 0.0591007235 -2.423 0.015376 *
## factor(PAY_4)1 0.9943219633 1.4280080623 0.696 0.486241
## factor(PAY_4)2 0.5987338771 0.0669508000 8.943 < 0.0000000000000002 ***
## factor(PAY_4)3 0.6209371488 0.1847260183 3.361 0.000775 ***
## factor(PAY_4)4 0.5552437792 0.3045056453 1.823 0.068239 .
## factor(PAY_4)5 -0.3958041474 0.5303499880 -0.746 0.455482
## factor(PAY_4)6 -0.2528507914 1.0035677002 -0.252 0.801078
## factor(PAY_4)7 0.8019120599 0.3637023361 2.205 0.027464 *
## factor(PAY_4)8 -0.2660022884 1.5203152647 -0.175 0.861107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31705 on 29999 degrees of freedom
## Residual deviance: 26454 on 29974 degrees of freedom
## AIC: 26506
##
## Number of Fisher Scoring iterations: 4
El modelo3 nos da un AIC de 26506, a compración del AIC del modelo2 que nos dió un AIC de 26139. Por lo tanto, seguimos probando con otras variables, hasta lograr tener un AIC mucho menor.
En este caso, vamos a probar con las covariables: LIMIT_BAL+factor(SEX)+factor(EDUCATION)+AGE+factor(PAY_0); para saber cuanto de AIC nos arroja el software.
Cabe mencionar que ponemos la etiqueta de factor a las variables: sex, education y pay_0, para que el modelo reconozca que son variables categóricas, discretas o dicotómicas, porque sino lo tomaría como variables continuas.
XB4 <- as.formula("default.payment.next.month ~ LIMIT_BAL+factor(SEX)+factor(EDUCATION)+AGE+factor(PAY_0)")
modelo4 <- glm(XB4,data = cred,
family = binomial(link = "logit"))
summary(modelo4)
##
## Call:
## glm(formula = XB4, family = binomial(link = "logit"), data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8770 -0.6042 -0.5257 -0.3701 2.9003
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.528269203 84.719735502 -0.148 0.882438
## LIMIT_BAL -0.000002365 0.000000144 -16.418 < 0.0000000000000002 ***
## factor(SEX)2 -0.152349960 0.031593469 -4.822 0.00000142 ***
## factor(EDUCATION)1 11.016218163 84.719690872 0.130 0.896542
## factor(EDUCATION)2 11.080105552 84.719690618 0.131 0.895945
## factor(EDUCATION)3 11.004610074 84.719696170 0.130 0.896650
## factor(EDUCATION)4 9.811834569 84.720640370 0.116 0.907800
## factor(EDUCATION)5 9.626342754 84.720089895 0.114 0.909535
## factor(EDUCATION)6 10.623637599 84.720758789 0.125 0.900210
## AGE 0.007979282 0.001728073 4.617 0.00000388 ***
## factor(PAY_0)-1 0.161532900 0.067173918 2.405 0.016186 *
## factor(PAY_0)0 -0.283437080 0.063408405 -4.470 0.00000782 ***
## factor(PAY_0)1 0.971186975 0.067781669 14.328 < 0.0000000000000002 ***
## factor(PAY_0)2 2.381596269 0.072301707 32.940 < 0.0000000000000002 ***
## factor(PAY_0)3 2.629360484 0.144038118 18.255 < 0.0000000000000002 ***
## factor(PAY_0)4 2.229541862 0.256314208 8.698 < 0.0000000000000002 ***
## factor(PAY_0)5 1.451520320 0.402164717 3.609 0.000307 ***
## factor(PAY_0)6 1.663122757 0.612320364 2.716 0.006606 **
## factor(PAY_0)7 2.843534150 0.809218267 3.514 0.000442 ***
## factor(PAY_0)8 1.825472061 0.474855960 3.844 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31705 on 29999 degrees of freedom
## Residual deviance: 26721 on 29980 degrees of freedom
## AIC: 26761
##
## Number of Fisher Scoring iterations: 11
Entonces hasta el momentos tenemos lo siguiente:
Por lo tanto, nos quedamos con el modelo 2, pues presenta el menor AIC a comparación de los otros.
Realizamos nuevamente un resumen estadístico del modelo2:
summary(modelo2)
##
## Call:
## glm(formula = XB2, data = cred)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.81101 -0.17087 -0.13211 -0.04936 1.05535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07142234101 0.05263345381 1.357 0.17480
## LIMIT_BAL -0.00000030768 0.00000001761 -17.469 < 0.0000000000000002
## factor(SEX)2 -0.02341640199 0.00446083560 -5.249 0.000000154
## factor(MARRIAGE)1 0.14638570542 0.05102432529 2.869 0.00412
## factor(MARRIAGE)2 0.12419755455 0.05103685057 2.433 0.01496
## factor(MARRIAGE)3 0.14384226456 0.05503082326 2.614 0.00896
## AGE 0.00050961884 0.00026909286 1.894 0.05826
## factor(PAY_0)-1 0.02141026224 0.00871456636 2.457 0.01402
## factor(PAY_0)0 -0.03361975185 0.00795197153 -4.228 0.000023663
## factor(PAY_0)1 0.17517247494 0.00958791701 18.270 < 0.0000000000000002
## factor(PAY_0)2 0.51493844499 0.01045115496 49.271 < 0.0000000000000002
## factor(PAY_0)3 0.57101458795 0.02223260855 25.684 < 0.0000000000000002
## factor(PAY_0)4 0.49441611681 0.04360265345 11.339 < 0.0000000000000002
## factor(PAY_0)5 0.31363371866 0.07379122454 4.250 0.000021414
## factor(PAY_0)6 0.36441659325 0.11301381404 3.225 0.00126
## factor(PAY_0)7 0.60256928879 0.12487737969 4.825 0.000001405
## factor(PAY_0)8 0.39746460368 0.08612991067 4.615 0.000003953
##
## (Intercept)
## LIMIT_BAL ***
## factor(SEX)2 ***
## factor(MARRIAGE)1 **
## factor(MARRIAGE)2 *
## factor(MARRIAGE)3 **
## AGE .
## factor(PAY_0)-1 *
## factor(PAY_0)0 ***
## factor(PAY_0)1 ***
## factor(PAY_0)2 ***
## factor(PAY_0)3 ***
## factor(PAY_0)4 ***
## factor(PAY_0)5 ***
## factor(PAY_0)6 **
## factor(PAY_0)7 ***
## factor(PAY_0)8 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1398442)
##
## Null deviance: 5168.1 on 29999 degrees of freedom
## Residual deviance: 4192.9 on 29983 degrees of freedom
## AIC: 26139
##
## Number of Fisher Scoring iterations: 2
Es la proporción de la probabilidad de éxito sobre la probabilidad de fracaso. Se entiende como las chances de éxito, es decir, si es dos veces más probables o tres más probables que el fracaso.
# limit_bal
exp(-0.00000030768)-1
## [1] -0.00000030768
# sex
exp(-0.02341640199)-1
## [1] -0.02314437
# marriage (casado)
exp(0.14638570542)-1
## [1] 0.1576426
# marriage (soltero)
exp(0.12419755455)-1
## [1] 0.1322395
# AGE
exp(0.00050961884)-1
## [1] 0.0005097487
# pay_0
exp(0.02141026224)-1
## [1] 0.02164111
De lo cual, podemos concluir lo siguiente:
# limit_bal
exp(-0.00000030768)/(1+exp(-0.00000030768))
## [1] 0.4999999
# sex
exp(-0.02341640199)/(1+exp(-0.02341640199))
## [1] 0.4941462
# marriage (casado)
exp(0.14638570542)/(1+exp(0.14638570542))
## [1] 0.5365312
# marriage (soltero)
exp(0.12419755455)/(1+exp(0.12419755455))
## [1] 0.5310095
# age
exp(0.00050961884)/(1+exp(0.00050961884))
## [1] 0.5001274
# pay_0
exp(0.02141026224)/(1+exp(0.02141026224))
## [1] 0.5053524
Ante este cálculo, se puede entender lo siguiente:
model.AIC1 <- stepAIC(modelo2)
## Start: AIC=26138.52
## default.payment.next.month ~ LIMIT_BAL + factor(SEX) + factor(MARRIAGE) +
## AGE + factor(PAY_0)
##
## Df Deviance AIC
## <none> 4192.9 26139
## - AGE 1 4193.5 26140
## - factor(MARRIAGE) 3 4196.8 26160
## - factor(SEX) 1 4196.8 26164
## - LIMIT_BAL 1 4235.6 26440
## - factor(PAY_0) 10 5026.2 31557
yhat2<-model.AIC1$fitted.values
hist(yhat2)
c1<-seq(0.01,0.4,by=0.001)
sens1<-c()
spec1<-c()
for (i in 1:length(c1)){
y.pred1<-ifelse(model.AIC1$fitted.values > c1[i], yes = 1, no = 0)
spec1[i]<-prop.table(table(cred$default.payment.next.month,y.pred1),1)[1]
sens1[i]<-prop.table(table(cred$default.payment.next.month,y.pred1),1)[4]
}
o.cut1<-mean(c1[which(round(spec1,1)==round(sens1,1))],na.rm = T)
plot(c1,sens1,type="l",col=2,main=c("Especificidad vs Sensibilidad"),ylab=c("Especificidad/Sensibilidad"))
lines(c1,spec1,col=3)
abline(v=o.cut1)
print(o.cut1)
## [1] 0.172
Como se puede apreciar, el punto de corte es 0.172, el cual maximiza la sensibilidad y especificidad.
y.pred1<-ifelse(model.AIC1$fitted.values > o.cut1, yes = 1, no = 0)
matriz_confusion1 <- table(cred$default.payment.next.month, y.pred1,
dnn = c("observaciones", "predicciones"))
prop.table(matriz_confusion1,1)
## predicciones
## observaciones 0 1
## 0 0.6870827 0.3129173
## 1 0.3274563 0.6725437
Se puede entender lo siguiente:
La especificidad, nos indica que con el modelo elegido hay una probabilidad de 68.70% de no default (de fracaso)
La sensibilidad, nos indica que con el modelo elegido hay una probabilidad de 67.25% de default (de éxito)
roc(cred$default.payment.next.month,yhat2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = cred$default.payment.next.month, predictor = yhat2)
##
## Data: yhat2 in 23364 controls (cred$default.payment.next.month 0) < 6636 cases (cred$default.payment.next.month 1).
## Area under the curve: 0.7463
plot(roc(cred$default.payment.next.month,yhat2),main=c("Curva ROC"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
El área bajo la curva es de 0.7463, por lo cual, se puede decir que nuestro modelo es aceptable; sin embargo, lo óptimo, hubiese sido que sea casi 0.9, para que tenga mayor aceptibilidad en cuanto a la predicción.