#1.- INTRODUCCIÓN

En el presente trabajo se realizará un análisis de los datos que provienen de la plataforma de concursos de análisis de datos kaggle, dicha base de datos contiene información de incumplimientos de pagos, factores demográficos, datos crediticios, historial de pagos y estados de cuenta de clientes de tarjeta de crédito desde abril de 2005 hasta septiembre de 2005. La base de datos de utilizará para correr un modelo logistico en la que variable dependiente es default_payment.next.month y se escogerá las variables independientes según el análisis.

Por otro lado, aparte del análisis a realizar también se busca obtener un modelo que prediga de la mejor manera la prob de que una persona con ciertas caracteristica entre en default, esto quiere decir que incumpla con el pago el proximo mes, por ello se hará la aplicación del Modelo Logit. Luego de encontrar un modelo que prediga mejor la variable dependiente en este caso:default_payment.next.month se:

options(scipen=999)
pkges <- c("mfx", "pROC", "tidyverse", "forecast", "data.table")
#installed.packages(pkges)
lapply(pkges, "library", character.only=T)
## Warning: package 'mfx' was built under R version 4.2.3
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.2.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 4.2.3
## Loading required package: betareg
## Warning: package 'betareg' was built under R version 4.2.3
## Warning: package 'pROC' was built under R version 4.2.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## Warning: package 'forecast' was built under R version 4.2.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Warning: package 'data.table' was built under R version 4.2.3
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## [[1]]
##  [1] "mfx"       "betareg"   "MASS"      "lmtest"    "zoo"       "sandwich" 
##  [7] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [13] "base"     
## 
## [[2]]
##  [1] "pROC"      "mfx"       "betareg"   "MASS"      "lmtest"    "zoo"      
##  [7] "sandwich"  "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [13] "methods"   "base"     
## 
## [[3]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "pROC"      "mfx"      
## [13] "betareg"   "MASS"      "lmtest"    "zoo"       "sandwich"  "stats"    
## [19] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "forecast"  "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "pROC"     
## [13] "mfx"       "betareg"   "MASS"      "lmtest"    "zoo"       "sandwich" 
## [19] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [25] "base"     
## 
## [[5]]
##  [1] "data.table" "forecast"   "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "pROC"       "mfx"        "betareg"   
## [16] "MASS"       "lmtest"     "zoo"        "sandwich"   "stats"     
## [21] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [26] "base"

#2.- EJERCICIO

x1<-c(45000,40000,60000,50000,55000,50000,35000,65000,53000,48000,37000,31000,
40000,75000,43000,49000,37500,71000,34000,27000)
x2<-c(2,4,3,2,2,5,7,2,2,1,5,7,4,2,9,2,4,1,5,6)
y2<-c(0,0,1,1,0,1,1,1,0,0,1,1,1,0,1,0,1,0,0,0)
summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27000   37375   46500   47275   53500   75000
summary(x2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.50    3.75    5.00    9.00
summary(y2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.5     0.5     1.0     1.0

Datos de 20 familias.

x1: ingreso.

x2: edad (años) del coche más viejo.

y: si seis meses después habian comprado otro coche (1= sí, 0= no)

##2.1. - MODELO DE REGRESIÓN

modelo2 = glm(y2~x1+x2, family=binomial(link="logit"))
summary(modelo2)
## 
## Call:
## glm(formula = y2 ~ x1 + x2, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5635  -0.8045  -0.1397   0.9535   1.7915  
## 
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)  
## (Intercept) -7.04706135  4.67423180  -1.508    0.132  
## x1           0.00007382  0.00006371   1.159    0.247  
## x2           0.98788607  0.52736796   1.873    0.061 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 21.082  on 17  degrees of freedom
## AIC: 27.082
## 
## Number of Fisher Scoring iterations: 5

X1 (+0.00007) Se estima que un aumento en X1 (ingreso), con el resto de los factores constantes, provocaría (+) un aumento de la P(Y=1) (probabilidad de comprar otro coche en seis meses)

X2 (+0.9878) Se estima que un aumento de X2 (Edad del antiguo coche), con el resto de los factores constantes, provocaría (+) un aumento de la P(Y=1) (probabilidad de comprar otro coche en seis meses).

##2.2.- Predicciones

pred2 <- predict(modelo2)
pred2
##          1          2          3          4          5          6          7 
## -1.7495337 -0.1428455  0.3456042 -1.3804498 -1.0113658  1.5832084  2.4517287 
##          8          9         10         11         12         13         14 
## -0.2731979 -1.1589994 -2.5159694  0.6235902  2.1564616 -0.1428455  0.4649700 
##         15         16         17         18         19         20 
##  5.0180352 -1.4542665 -0.3273875 -0.8181833  0.4021398  0.8733083

##2.3.- ODDS

exp(pred2)
##            1            2            3            4            5            6 
##   0.17385499   0.86688797   1.41284330   0.25146543   0.36372186   4.87055764 
##            7            8            9           10           11           12 
##  11.60839715   0.76094217   0.31380001   0.08078456   1.86561392   8.64050968 
##           13           14           15           16           17           18 
##   0.86688797   1.59196638 151.11409898   0.23357161   0.72080438   0.44123253 
##           19           20 
##   1.49502033   2.39482067

caso 1: odd = 0.17

P(comprar el nuevo coche)/P(no comprarlo) = 0.17

P(comprar) = 0.17 P (no comprarlo) es decir, es más probable no adquirir el nuevo vehículo.

Caso 3: odd = 1.41

P(comprar el nuevo coche)/P(no comprarlo) = 1.41

P(comprar) = 1.41 P(no comprarlo), es decir, es más probable adquirir el nuevo vehículo.

##2.4. - Coeficientes

coma <- coefficients(modelo2); coma
##    (Intercept)             x1             x2 
## -7.04706135342  0.00007381679  0.98788606559
d1 <- coma[1]
d2 <- coma[2]
d3 <- coma[3]
print(d1)
## (Intercept) 
##   -7.047061
print(d2)
##            x1 
## 0.00007381679
print(d3)
##        x2 
## 0.9878861

##** 2.5.- Predicción Puntual**

X1 = 50.000 dólares X2 = 2 años

y22 <- d1 + d2*50000+d3*2
y22
## (Intercept) 
##    -1.38045
prob22 <- 1/(1+exp(-y2))
print(prob22)
##  [1] 0.5000000 0.5000000 0.7310586 0.7310586 0.5000000 0.7310586 0.7310586
##  [8] 0.7310586 0.5000000 0.5000000 0.7310586 0.7310586 0.7310586 0.5000000
## [15] 0.7310586 0.5000000 0.7310586 0.5000000 0.5000000 0.5000000

La probabilidad de que compre un nuevo coche dentro de seis meses es del 20%

X1 = 45.000 dólares X2 = 2 años

y3 <- d1 + d2*45000+d3*2
print(y3)
## (Intercept) 
##   -1.749534
prob3 <- 1/(1+exp(-y3))
print(prob3)
## (Intercept) 
##    0.148106

La probabilidad de adquirir el nuevo vehículo a los seis meses es del 14%.

¿Cuánto varía la probabilidad cuando se incrementa la renta de 45000 a 50000 dólares si el antiguo vehículo tiene dos años?

P(45000, 2) = 0.148106

P(50000; 2) = 0.2009368

diferencialP = prob22-prob3
diferencialP
##  [1] 0.3518940 0.3518940 0.5829526 0.5829526 0.3518940 0.5829526 0.5829526
##  [8] 0.5829526 0.3518940 0.3518940 0.5829526 0.5829526 0.5829526 0.3518940
## [15] 0.5829526 0.3518940 0.5829526 0.3518940 0.3518940 0.3518940

Aumentaría la probabilidad en un 5.28%.

##2.6.- Probabilidades de cada individuo

pre2 <- predict(modelo2, type = "response"); pre2
##          1          2          3          4          5          6          7 
## 0.14810602 0.46434922 0.58555120 0.20093678 0.26671264 0.82965843 0.92068778 
##          8          9         10         11         12         13         14 
## 0.43212218 0.23884915 0.07474622 0.65103464 0.89627104 0.46434922 0.61419253 
##         15         16         17         18         19         20 
## 0.99342599 0.18934581 0.41887642 0.30614944 0.59920166 0.70543363
1/(1+exp(-(d1+d2*x1*d3*x2)))
##  [1] 0.38130110 0.99024866 0.99771477 0.56099382 0.72600005 0.99998610
##  [7] 0.99997998 0.91929998 0.66434815 0.02801017 0.99841188 0.99984580
## [13] 0.99024866 0.97999030 1.00000000 0.52481740 0.97999030 0.13359283
## [19] 0.99527324 0.99156083

##2.7.- Curva de ROC

roc(y2,pre2,plot = TRUE, legacy.axes = TRUE,
    percent = TRUE, xlab = "% Falsos positivos",
    ylab = "% verdaderos postivios", col = "red", lwd = 2,
    print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = y2, predictor = pre2, percent = TRUE,     plot = TRUE, legacy.axes = TRUE, xlab = "% Falsos positivos",     ylab = "% verdaderos postivios", col = "red", lwd = 2, print.auc = TRUE)
## 
## Data: pre2 in 10 controls (y2 0) < 10 cases (y2 1).
## Area under the curve: 77.5%

El área bajo la curva de ROC asciende a 77.5%. El valor es bastante elevado y parece que el modelo predice correctamente.

##** 2.8.- Matriz de confusión**

tabla2<-table(true=y2,pred=round(fitted(modelo2)))
sum(diag(tabla2))/sum(tabla2)
## [1] 0.65

La fiabilidad sería del 65%.

library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
matriz2<-confusionMatrix(tabla2);matriz2
## Confusion Matrix and Statistics
## 
##     pred
## true 0 1
##    0 7 3
##    1 4 6
##                                           
##                Accuracy : 0.65            
##                  95% CI : (0.4078, 0.8461)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.252           
##                                           
##                   Kappa : 0.3             
##                                           
##  Mcnemar's Test P-Value : 1.000           
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.7000          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.5500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6515          
##                                           
##        'Positive' Class : 0               
## 

##2.9.- Test de bondad del ajuste Hosmer Lemeshow.

library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.2.3
## ResourceSelection 0.3-6   2023-06-27
h2<-hoslem.test(modelo2$y,fitted(modelo2),g=10)
h2
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  modelo2$y, fitted(modelo2)
## X-squared = 7.3153, df = 8, p-value = 0.503

El p valor es mayor al nivel de significación del 5%. Tenemos evidencia empírica suficiente para aceptar la hipótesis nula de que la bondad del ajuste no es suficientemente correcta.

##2.10.- Conclusión

El modelo no consigue predecir correctamente la probabilidad de comprar un nuevo vehículo dentro de seis meses.