#1.- INTRODUCCIÓN
En el presente trabajo se realizará un análisis de los datos que provienen de la plataforma de concursos de análisis de datos kaggle, dicha base de datos contiene información de incumplimientos de pagos, factores demográficos, datos crediticios, historial de pagos y estados de cuenta de clientes de tarjeta de crédito desde abril de 2005 hasta septiembre de 2005. La base de datos de utilizará para correr un modelo logistico en la que variable dependiente es default_payment.next.month y se escogerá las variables independientes según el análisis.
Por otro lado, aparte del análisis a realizar también se busca obtener un modelo que prediga de la mejor manera la prob de que una persona con ciertas caracteristica entre en default, esto quiere decir que incumpla con el pago el proximo mes, por ello se hará la aplicación del Modelo Logit. Luego de encontrar un modelo que prediga mejor la variable dependiente en este caso:default_payment.next.month se:
Estimara e interpretara el efecto de los parámetros sobre el ratio de odds.
Estimara la probabilidad de caer en default asumiendo valores específicos para sus covariables.
Evaluará la capacidad predictiva del modelo:
Punto de corte para los valores ajustados.
Especificidad / Sensibilidad
Análisis de la curva ROC
options(scipen=999)
pkges <- c("mfx", "pROC", "tidyverse", "forecast", "data.table")
#installed.packages(pkges)
lapply(pkges, "library", character.only=T)
## Warning: package 'mfx' was built under R version 4.2.3
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.2.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 4.2.3
## Loading required package: betareg
## Warning: package 'betareg' was built under R version 4.2.3
## Warning: package 'pROC' was built under R version 4.2.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## Warning: package 'forecast' was built under R version 4.2.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Warning: package 'data.table' was built under R version 4.2.3
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
## [[1]]
## [1] "mfx" "betareg" "MASS" "lmtest" "zoo" "sandwich"
## [7] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [13] "base"
##
## [[2]]
## [1] "pROC" "mfx" "betareg" "MASS" "lmtest" "zoo"
## [7] "sandwich" "stats" "graphics" "grDevices" "utils" "datasets"
## [13] "methods" "base"
##
## [[3]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "pROC" "mfx"
## [13] "betareg" "MASS" "lmtest" "zoo" "sandwich" "stats"
## [19] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "forecast" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "pROC"
## [13] "mfx" "betareg" "MASS" "lmtest" "zoo" "sandwich"
## [19] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [25] "base"
##
## [[5]]
## [1] "data.table" "forecast" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "pROC" "mfx" "betareg"
## [16] "MASS" "lmtest" "zoo" "sandwich" "stats"
## [21] "graphics" "grDevices" "utils" "datasets" "methods"
## [26] "base"
#2.- EJERCICIO
x1<-c(45000,40000,60000,50000,55000,50000,35000,65000,53000,48000,37000,31000,
40000,75000,43000,49000,37500,71000,34000,27000)
x2<-c(2,4,3,2,2,5,7,2,2,1,5,7,4,2,9,2,4,1,5,6)
y2<-c(0,0,1,1,0,1,1,1,0,0,1,1,1,0,1,0,1,0,0,0)
summary(x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27000 37375 46500 47275 53500 75000
summary(x2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.50 3.75 5.00 9.00
summary(y2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
Datos de 20 familias.
x1: ingreso.
x2: edad (años) del coche más viejo.
y: si seis meses después habian comprado otro coche (1= sí, 0= no)
##2.1. - MODELO DE REGRESIÓN
modelo2 = glm(y2~x1+x2, family=binomial(link="logit"))
summary(modelo2)
##
## Call:
## glm(formula = y2 ~ x1 + x2, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5635 -0.8045 -0.1397 0.9535 1.7915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.04706135 4.67423180 -1.508 0.132
## x1 0.00007382 0.00006371 1.159 0.247
## x2 0.98788607 0.52736796 1.873 0.061 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.726 on 19 degrees of freedom
## Residual deviance: 21.082 on 17 degrees of freedom
## AIC: 27.082
##
## Number of Fisher Scoring iterations: 5
X1 (+0.00007) Se estima que un aumento en X1 (ingreso), con el resto de los factores constantes, provocaría (+) un aumento de la P(Y=1) (probabilidad de comprar otro coche en seis meses)
X2 (+0.9878) Se estima que un aumento de X2 (Edad del antiguo coche), con el resto de los factores constantes, provocaría (+) un aumento de la P(Y=1) (probabilidad de comprar otro coche en seis meses).
##2.2.- Predicciones
pred2 <- predict(modelo2)
pred2
## 1 2 3 4 5 6 7
## -1.7495337 -0.1428455 0.3456042 -1.3804498 -1.0113658 1.5832084 2.4517287
## 8 9 10 11 12 13 14
## -0.2731979 -1.1589994 -2.5159694 0.6235902 2.1564616 -0.1428455 0.4649700
## 15 16 17 18 19 20
## 5.0180352 -1.4542665 -0.3273875 -0.8181833 0.4021398 0.8733083
##2.3.- ODDS
exp(pred2)
## 1 2 3 4 5 6
## 0.17385499 0.86688797 1.41284330 0.25146543 0.36372186 4.87055764
## 7 8 9 10 11 12
## 11.60839715 0.76094217 0.31380001 0.08078456 1.86561392 8.64050968
## 13 14 15 16 17 18
## 0.86688797 1.59196638 151.11409898 0.23357161 0.72080438 0.44123253
## 19 20
## 1.49502033 2.39482067
caso 1: odd = 0.17
P(comprar el nuevo coche)/P(no comprarlo) = 0.17
P(comprar) = 0.17 P (no comprarlo) es decir, es más probable no adquirir el nuevo vehículo.
Caso 3: odd = 1.41
P(comprar el nuevo coche)/P(no comprarlo) = 1.41
P(comprar) = 1.41 P(no comprarlo), es decir, es más probable adquirir el nuevo vehículo.
##2.4. - Coeficientes
coma <- coefficients(modelo2); coma
## (Intercept) x1 x2
## -7.04706135342 0.00007381679 0.98788606559
d1 <- coma[1]
d2 <- coma[2]
d3 <- coma[3]
print(d1)
## (Intercept)
## -7.047061
print(d2)
## x1
## 0.00007381679
print(d3)
## x2
## 0.9878861
##** 2.5.- Predicción Puntual**
X1 = 50.000 dólares X2 = 2 años
y22 <- d1 + d2*50000+d3*2
y22
## (Intercept)
## -1.38045
prob22 <- 1/(1+exp(-y2))
print(prob22)
## [1] 0.5000000 0.5000000 0.7310586 0.7310586 0.5000000 0.7310586 0.7310586
## [8] 0.7310586 0.5000000 0.5000000 0.7310586 0.7310586 0.7310586 0.5000000
## [15] 0.7310586 0.5000000 0.7310586 0.5000000 0.5000000 0.5000000
La probabilidad de que compre un nuevo coche dentro de seis meses es del 20%
X1 = 45.000 dólares X2 = 2 años
y3 <- d1 + d2*45000+d3*2
print(y3)
## (Intercept)
## -1.749534
prob3 <- 1/(1+exp(-y3))
print(prob3)
## (Intercept)
## 0.148106
La probabilidad de adquirir el nuevo vehículo a los seis meses es del 14%.
¿Cuánto varía la probabilidad cuando se incrementa la renta de 45000 a 50000 dólares si el antiguo vehículo tiene dos años?
P(45000, 2) = 0.148106
P(50000; 2) = 0.2009368
diferencialP = prob22-prob3
diferencialP
## [1] 0.3518940 0.3518940 0.5829526 0.5829526 0.3518940 0.5829526 0.5829526
## [8] 0.5829526 0.3518940 0.3518940 0.5829526 0.5829526 0.5829526 0.3518940
## [15] 0.5829526 0.3518940 0.5829526 0.3518940 0.3518940 0.3518940
Aumentaría la probabilidad en un 5.28%.
##2.6.- Probabilidades de cada individuo
pre2 <- predict(modelo2, type = "response"); pre2
## 1 2 3 4 5 6 7
## 0.14810602 0.46434922 0.58555120 0.20093678 0.26671264 0.82965843 0.92068778
## 8 9 10 11 12 13 14
## 0.43212218 0.23884915 0.07474622 0.65103464 0.89627104 0.46434922 0.61419253
## 15 16 17 18 19 20
## 0.99342599 0.18934581 0.41887642 0.30614944 0.59920166 0.70543363
1/(1+exp(-(d1+d2*x1*d3*x2)))
## [1] 0.38130110 0.99024866 0.99771477 0.56099382 0.72600005 0.99998610
## [7] 0.99997998 0.91929998 0.66434815 0.02801017 0.99841188 0.99984580
## [13] 0.99024866 0.97999030 1.00000000 0.52481740 0.97999030 0.13359283
## [19] 0.99527324 0.99156083
##2.7.- Curva de ROC
roc(y2,pre2,plot = TRUE, legacy.axes = TRUE,
percent = TRUE, xlab = "% Falsos positivos",
ylab = "% verdaderos postivios", col = "red", lwd = 2,
print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = y2, predictor = pre2, percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "% Falsos positivos", ylab = "% verdaderos postivios", col = "red", lwd = 2, print.auc = TRUE)
##
## Data: pre2 in 10 controls (y2 0) < 10 cases (y2 1).
## Area under the curve: 77.5%
El área bajo la curva de ROC asciende a 77.5%. El valor es bastante elevado y parece que el modelo predice correctamente.
##** 2.8.- Matriz de confusión**
tabla2<-table(true=y2,pred=round(fitted(modelo2)))
sum(diag(tabla2))/sum(tabla2)
## [1] 0.65
La fiabilidad sería del 65%.
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
matriz2<-confusionMatrix(tabla2);matriz2
## Confusion Matrix and Statistics
##
## pred
## true 0 1
## 0 7 3
## 1 4 6
##
## Accuracy : 0.65
## 95% CI : (0.4078, 0.8461)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.252
##
## Kappa : 0.3
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.6364
## Specificity : 0.6667
## Pos Pred Value : 0.7000
## Neg Pred Value : 0.6000
## Prevalence : 0.5500
## Detection Rate : 0.3500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.6515
##
## 'Positive' Class : 0
##
##2.9.- Test de bondad del ajuste Hosmer Lemeshow.
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.2.3
## ResourceSelection 0.3-6 2023-06-27
h2<-hoslem.test(modelo2$y,fitted(modelo2),g=10)
h2
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: modelo2$y, fitted(modelo2)
## X-squared = 7.3153, df = 8, p-value = 0.503
El p valor es mayor al nivel de significación del 5%. Tenemos evidencia empírica suficiente para aceptar la hipótesis nula de que la bondad del ajuste no es suficientemente correcta.
##2.10.- Conclusión
El modelo no consigue predecir correctamente la probabilidad de comprar un nuevo vehículo dentro de seis meses.