Case study 1: Mortgage quality model.

https://towardsdatascience.com/exploratory-data-analysis-in-r-for-beginners-fe031add7072

#install.packages('skimr') 
library(data.table)
library(readr)
library(skimr)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1

library(mltools)
library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ purrr   0.3.5      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()    masks data.table::between()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::first()      masks data.table::first()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ dplyr::last()       masks data.table::last()
## ✖ purrr::lift()       masks caret::lift()
## ✖ tidyr::replace_na() masks mltools::replace_na()
## ✖ purrr::transpose()  masks data.table::transpose()

library(dplyr)
#install.packages('gbm')
library(gbm)

## Loaded gbm 2.1.8.1

#install.packages('pROC')
library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(rpart)
#install.packages("rpart.plot") 
library(rpart.plot)

Loading the data.

We start by loading only the training data, in order to explore and clean it. Once the cleaning process is defined, we wrap it in a function that can also be fed the test data. This way we can guarantee that the model was built without using any information from the test data, which is used only to measure the quality of the model.

df_train <- read_csv("~/documents/r_projects/prestamos_c1/loan_train.csv")

## Rows: 431 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Pro...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploratory analysis.

For the initial exploration of the data we first apply a manual process and then an automated one (Ellis, 2018).

dim(df_train)

## [1] 431  13

head(df_train)

## # A tibble: 6 × 13
##   Loan_ID Gender Married Depen…¹ Educa…² Self_…³ Appli…⁴ Coapp…⁵ LoanA…⁶ Loan_…⁷
##   <chr>   <chr>  <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
## 1 LP0010… Male   No      0       Gradua… No         5849       0     128     360
## 2 LP0010… Male   Yes     1       Gradua… No         4583    1508     128     360
## 3 LP0010… Male   Yes     0       Gradua… Yes        3000       0      66     360
## 4 LP0010… Male   No      0       Gradua… No         6000       0     141     360
## 5 LP0010… Male   Yes     2       Gradua… Yes        5417    4196     267     360
## 6 LP0010… Male   Yes     0       Not Gr… No         2333    1516      95     360
## # … with 3 more variables: Credit_History <dbl>, Property_Area <chr>,
## #   Loan_Status <chr>, and abbreviated variable names ¹Dependents, ²Education,
## #   ³Self_Employed, ⁴ApplicantIncome, ⁵CoapplicantIncome, ⁶LoanAmount,
## #   ⁷Loan_Amount_Term

glimpse(df_train)

## Rows: 431
## Columns: 13
## $ Loan_ID           <chr> "LP001002", "LP001003", "LP001005", "LP001008", "LP0…
## $ Gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married           <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"…
## $ Dependents        <chr> "0", "1", "0", "0", "2", "0", "3+", "2", "1", "2", "…
## $ Education         <chr> "Graduate", "Graduate", "Graduate", "Graduate", "Gra…
## $ Self_Employed     <chr> "No", "No", "Yes", "No", "Yes", "No", "No", "No", "N…
## $ ApplicantIncome   <dbl> 5849, 4583, 3000, 6000, 5417, 2333, 3036, 4006, 1284…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 0, 4196, 1516, 2504, 1526, 10968, 700, 1…
## $ LoanAmount        <dbl> 128, 128, 66, 141, 267, 95, 158, 168, 349, 70, 109, …
## $ Loan_Amount_Term  <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History    <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ Property_Area     <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status       <chr> "Y", "N", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y", "Y…

skim(df_train)

Data summary
Name	df_train
Number of rows	431
Number of columns	13
_______________________
Column type frequency:
character	8
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Loan_ID	1	8	8	431
Gender	1	4	6	2
Married	1	2	3	2
Dependents	1	1	2	4
Education	1	8	12	2
Self_Employed	1	2	3	2
Property_Area	1	5	9	3
Loan_Status	1	1	1	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ApplicantIncome	1	5525.90	6820.69	150	2874	3833	5705.5	81000	▇▁▁▁▁
CoapplicantIncome	1	1653.66	3278.82	0	0	1030	2304.0	41667	▇▁▁▁▁
LoanAmount	1	145.50	83.52	9	101	128	166.5	700	▇▃▁▁▁
Loan_Amount_Term	1	340.98	61.53	36	360	360	360.0	480	▁▁▁▇▁
Credit_History	1	0.86	0.35	0	1	1	1.0	1	▂▁▁▁▇

Data preparation for modeling.

In this section we encapsulate the whole data transformation in a single function, which we use to prepare first the training data and then the test data.

Special variables: Loan_ID is the index and Loan_Status is the target. Categorical variables are one-hot encoded (Datatricks, 2019) and numerical variables are scaled (Finnstats, 2021).
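As a minimal illustration of the two transformations, the sketch below one-hot encodes a factor and scales a numeric column using base R only. The toy values are made up, and `model.matrix()` is used here instead of the `one_hot()` helper from mltools that the function below relies on, so the generated column names differ.

```r
# Toy data: column names mirror the loan data, values are invented.
toy <- data.frame(
  Property_Area = factor(c("Urban", "Rural", "Semiurban", "Urban")),
  LoanAmount    = c(128, 66, 141, 267)
)

# One indicator column per factor level; "- 1" removes the intercept,
# so no level is dropped as a reference category.
one_hot_area <- model.matrix(~ Property_Area - 1, data = toy)

# scale() subtracts the column mean and divides by the standard deviation.
loan_scaled <- scale(toy$LoanAmount)[, 1]

toy_encoded <- data.frame(one_hot_area, LoanAmount = loan_scaled)
print(toy_encoded)
```

Each row has exactly one indicator set to 1, and the scaled column has mean 0 and standard deviation 1.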

munge_data <- function(df) {

    # Binary columns, each encoded below as a single 0/1 indicator
    # (listed for reference; this vector is not used further down)
    bin_columns = c(
        'Gender',        # -> Male
        'Married',       # -> Married (Yes = 1)
        'Education',     # -> Graduate
        'Self_Employed'  # -> Self_Employed (Yes = 1)
    )
    # Multi-level columns that are one-hot encoded
    categorical_columns = c(
        'Dependents',
        'Property_Area')
    # Columns that are centered and scaled
    numerical_columns = c(
        #'ApplicantIncome',
        #'CoapplicantIncome',
        'LoanAmount',
        'Credit_History',
        'Loan_Amount_Term'
        )

    y_col = 'Loan_Status'
    idx_col = 'Loan_ID'

    # Target: Loan_Status as 1 (Y) / 0 (N), with Loan_ID as row names
    y = df[y_col]
    y[idx_col] = df[idx_col]
    y <- y %>% column_to_rownames(var=idx_col) %>%  mutate(Loan_Status = ifelse(Loan_Status=='Y', 1, 0))

    # One-hot encode the multi-level categorical columns
    df_1h <- one_hot(as.data.table(lapply(df[categorical_columns], factor)))

    # Encode the binary columns as 0/1 indicators
    df_1h$Male <- ifelse(df$Gender=='Male', 1, 0)
    df_1h$Married <- ifelse(df$Married=='Yes', 1, 0)
    df_1h$Graduate <- ifelse(df$Education=='Graduate', 1, 0)
    df_1h$Self_Employed <- ifelse(df$Self_Employed=='Yes', 1, 0)

    # Center and scale the numerical columns, using the mean and sd of df itself
    df_scaled <- apply(df[numerical_columns],2,scale)

    # Ratios of the loan amount to income
    loan_income_ratio <- data.frame(
        Loan_total_income_ratio= df$LoanAmount/(df$ApplicantIncome+df$CoapplicantIncome+df$LoanAmount),
        #Loan_primary_income_ratio= df$LoanAmount/(df$ApplicantIncome+df$LoanAmount),
        Loan_coincome_ratio= df$LoanAmount/(df$CoapplicantIncome+df$LoanAmount)
        )

    # Combine target, indicators, scaled columns and ratios
    df_tidy <- cbind(y, df_1h, df_scaled, loan_income_ratio)
    return(df_tidy)
}

df_tidy <- munge_data(df_train)

head(df_tidy)

##          Loan_Status Dependents_0 Dependents_1 Dependents_2 Dependents_3+
## LP001002           1            1            0            0             0
## LP001003           0            0            1            0             0
## LP001005           1            1            0            0             0
## LP001008           1            1            0            0             0
## LP001011           1            0            0            1             0
## LP001013           1            1            0            0             0
##          Property_Area_Rural Property_Area_Semiurban Property_Area_Urban Male
## LP001002                   0                       0                   1    1
## LP001003                   1                       0                   0    1
## LP001005                   0                       0                   1    1
## LP001008                   0                       0                   1    1
## LP001011                   0                       0                   1    1
## LP001013                   0                       0                   1    1
##          Married Graduate Self_Employed  LoanAmount Credit_History
## LP001002       0        1             0 -0.20957924      0.4094287
## LP001003       1        1             0 -0.20957924      0.4094287
## LP001005       1        1             1 -0.95194093      0.4094287
## LP001008       0        1             0 -0.05392276      0.4094287
## LP001011       1        1             1  1.45474776      0.4094287
## LP001013       1        0             0 -0.60470724      0.4094287
##          Loan_Amount_Term Loan_total_income_ratio Loan_coincome_ratio
## LP001002        0.3090644              0.02141543          1.00000000
## LP001003        0.3090644              0.02058209          0.07823961
## LP001005        0.3090644              0.02152642          1.00000000
## LP001008        0.3090644              0.02296043          1.00000000
## LP001011        0.3090644              0.02702429          0.05982523
## LP001013        0.3090644              0.02408722          0.05896958
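Because `scale()` is called inside `munge_data`, each data set is standardized with its own mean and standard deviation. A common alternative is to learn the constants on the training data and reuse them on the test data, so that the same raw value maps to the same scaled value in both sets. The sketch below shows the idea on made-up vectors; the variable names are hypothetical and it is not the approach used in this notebook.

```r
# Toy stand-ins for a training and a test column; values are invented.
train_amount <- c(128, 66, 141, 267, 95)
test_amount  <- c(110, 187, 120)

# Learn the centering and scaling constants from the training data only.
train_mean <- mean(train_amount)
train_sd   <- sd(train_amount)

# Apply the same constants to both sets.
train_scaled <- (train_amount - train_mean) / train_sd
test_scaled  <- (test_amount  - train_mean) / train_sd

# For comparison: the result of scaling the test set on its own.
test_scaled_own <- as.numeric(scale(test_amount))

print(rbind(shared_constants = test_scaled, own_constants = test_scaled_own))
```

The two rows differ, which is the same effect that makes a given raw value scale differently in the training and test outputs of `munge_data`.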

Modeling

In this section we build a Gradient Boosted Machine (GBM) model using the clean data generated in the previous section (Pocs, 2021).

model_gbm = gbm(Loan_Status ~.,
              data = df_tidy,
              cv.folds = 10,
              distribution='bernoulli',
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 500)       # build 500 trees

summary(model_gbm)

##                                             var     rel.inf
## Credit_History                   Credit_History 63.84587194
## Loan_total_income_ratio Loan_total_income_ratio 15.91623522
## Property_Area_Semiurban Property_Area_Semiurban  6.10643746
## LoanAmount                           LoanAmount  5.29652806
## Loan_coincome_ratio         Loan_coincome_ratio  3.65766767
## Property_Area_Rural         Property_Area_Rural  1.14799418
## Dependents_1                       Dependents_1  1.11423244
## Married                                 Married  1.03415071
## Graduate                               Graduate  0.95651350
## Property_Area_Urban         Property_Area_Urban  0.21334090
## Dependents_2                       Dependents_2  0.21257828
## Self_Employed                     Self_Employed  0.16992974
## Loan_Amount_Term               Loan_Amount_Term  0.15126865
## Dependents_0                       Dependents_0  0.09453678
## `Dependents_3+`                 `Dependents_3+`  0.08271448
## Male                                       Male  0.00000000
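To make the roles of `shrinkage` and `n.trees` more concrete, here is a simplified sketch of gradient boosting for a binary target. It fits small regression trees (via rpart) to the residuals `y - p` and adds a shrunken copy of each tree to the running log-odds. This is an assumption-laden illustration on made-up data, not the exact algorithm inside gbm, which among other things computes leaf values differently.

```r
library(rpart)

# Toy binary classification data; values are simulated.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- as.integer(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5) > 0)

shrinkage <- 0.1   # learning rate: fraction of each tree that is added
n_trees   <- 50    # number of boosting iterations

# Start from the log-odds of the base rate.
f <- rep(qlogis(mean(toy$y)), n)
log_loss <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  p <- plogis(f)
  toy$resid <- toy$y - p   # negative gradient of the Bernoulli loss
  # Depth-1 regression tree (a stump) fitted to the residuals.
  tree <- rpart(resid ~ x1 + x2, data = toy, method = "anova",
                control = rpart.control(maxdepth = 1, cp = 0))
  f <- f + shrinkage * predict(tree, toy)
  p <- plogis(f)
  log_loss[m] <- -mean(toy$y * log(p) + (1 - toy$y) * log(1 - p))
}

print(round(log_loss[c(1, n_trees)], 4))
```

A smaller `shrinkage` makes each tree contribute less, so more trees are needed to reach the same training loss.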

# Single classification tree on the same data
modelo <- rpart(
    Loan_Status ~ .,
    method="class",
    data=df_tidy)

rpart.plot(modelo)

df_test <- munge_data(read_csv("~/documents/r_projects/prestamos_c1/loan_test.csv"))

## Rows: 183 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Pro...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df_test)

##          Loan_Status Dependents_0 Dependents_1 Dependents_2 Dependents_3+
## LP001006           1            1            0            0             0
## LP001028           1            0            0            1             0
## LP001032           1            1            0            0             0
## LP001036           0            1            0            0             0
## LP001073           1            0            0            1             0
## LP001087           1            0            0            1             0
##          Property_Area_Rural Property_Area_Semiurban Property_Area_Urban Male
## LP001006                   0                       0                   1    1
## LP001028                   0                       0                   1    1
## LP001032                   0                       0                   1    1
## LP001036                   0                       0                   1    0
## LP001073                   0                       0                   1    1
## LP001087                   0                       1                   0    0
##          Married Graduate Self_Employed LoanAmount Credit_History
## LP001006       1        0             0 -0.3073043      0.4148869
## LP001028       1        1             0  0.6260848      0.4148869
## LP001032       0        1             0 -0.2489675      0.4148869
## LP001036       0        1             0 -0.8206684     -2.3971244
## LP001073       1        0             0 -0.4239780      0.4148869
## LP001087       0        1             0 -0.3073043      0.4148869
##          Loan_Amount_Term Loan_total_income_ratio Loan_coincome_ratio
## LP001006        0.2008192              0.02371073          0.04842615
## LP001028        0.2008192              0.01757624          0.02407898
## LP001032        0.2008192              0.02463054          1.00000000
## LP001036        0.2008192              0.02119353          1.00000000
## LP001073        0.2008192              0.02046131          0.09565217
## LP001087        0.2008192              0.02015790          0.05447118

Measuring model quality.

We compute the confusion matrix, the ROC curve and the AUC (RDocumentation, 1.18).

y_prob = predict(model_gbm, df_test[,2:17], type='response')

## Using 494 trees...

y_pred = as.factor(ifelse(y_prob>0.5,1,0))

confusionMatrix(
    data=y_pred, reference=as.factor(df_test$Loan_Status))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  29   3
##          1  28 123
##                                           
##                Accuracy : 0.8306          
##                  95% CI : (0.7683, 0.8819)
##     No Information Rate : 0.6885          
##     P-Value [Acc > NIR] : 9.089e-06       
##                                           
##                   Kappa : 0.5512          
##                                           
##  Mcnemar's Test P-Value : 1.629e-05       
##                                           
##             Sensitivity : 0.5088          
##             Specificity : 0.9762          
##          Pos Pred Value : 0.9062          
##          Neg Pred Value : 0.8146          
##              Prevalence : 0.3115          
##          Detection Rate : 0.1585          
##    Detection Prevalence : 0.1749          
##       Balanced Accuracy : 0.7425          
##                                           
##        'Positive' Class : 0               
##
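Note that the report above treats class 0 (loan not approved) as the positive class, so the sensitivity of 0.5088 refers to rejected loans. The sketch below recomputes the main metrics from the same four counts with class 1 as the positive class, using base R only. In caret the same switch can be made with the `positive` argument of `confusionMatrix()`.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(29, 28, 3, 123), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["1", "1"]   # approved loans predicted as approved
fn <- cm["0", "1"]   # approved loans predicted as rejected
tn <- cm["0", "0"]   # rejected loans predicted as rejected
fp <- cm["1", "0"]   # rejected loans predicted as approved

accuracy      <- (tp + tn) / sum(cm)
sensitivity_1 <- tp / (tp + fn)   # share of approved loans that are detected
specificity_1 <- tn / (tn + fp)   # share of rejected loans that are detected
precision_1   <- tp / (tp + fp)

print(round(c(accuracy = accuracy, sensitivity = sensitivity_1,
              specificity = specificity_1, precision = precision_1), 4))
```

With class 1 as positive, sensitivity and specificity simply swap relative to the report, while accuracy is unchanged.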

roc(df_test$Loan_Status ~ predict(model_gbm, df_test[,2:17], type='response'),  plot = TRUE, print.auc = TRUE)

## Using 494 trees...

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.formula(formula = df_test$Loan_Status ~ predict(model_gbm,     df_test[, 2:17], type = "response"), plot = TRUE, print.auc = TRUE)
## 
## Data: predict(model_gbm, df_test[, 2:17], type = "response") in 57 controls (df_test$Loan_Status 0) < 126 cases (df_test$Loan_Status 1).
## Area under the curve: 0.8106
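The AUC can be read as the probability that a randomly chosen case (class 1) receives a higher score than a randomly chosen control (class 0). A minimal sketch of that interpretation, using the rank-sum formula on made-up labels and scores (the helper `auc_rank` is hypothetical and not part of pROC):

```r
# AUC from ranks: equivalent to the normalized Mann-Whitney U statistic.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)          # average ranks are used for ties
  n1 <- sum(labels == 1)      # number of cases
  n0 <- sum(labels == 0)      # number of controls
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 case/control pairs are ordered correctly.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_toy <- auc_rank(labels, scores)
print(auc_toy)
```

Here one pair (control scored 0.4, case scored 0.35) is ordered incorrectly, so the result is 3/4.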

References

Finnstats (2021) How to use the scale() function in R. R-bloggers. Retrieved from https://www.r-bloggers.com/2021/12/how-to-use-the-scale-function-in-r/

Datatricks (2019) One Hot encoding in R, three simple methods. Retrieved from https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods

Pocs, M. (2021) Understanding How a Gradient Boosted Tree Does Binary Classification. Towards Data Science. Retrieved from https://towardsdatascience.com/understanding-how-a-gradient-boosted-tree-does-binary-classification-c215967600fe

Ellis, L. (2018) Explore your dataset in R. R-bloggers. Retrieved from https://www.r-bloggers.com/2018/11/explore-your-dataset-in-r/

RDocumentation (1.18) Build a ROC Curve. DataCamp. Retrieved from https://www.rdocumentation.org/packages/pROC/versions/1.18.0/topics/roc