https://towardsdatascience.com/exploratory-data-analysis-in-r-for-beginners-fe031add7072
#install.packages('skimr')
library(data.table)
library(readr)
library(skimr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
library(mltools)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ purrr 0.3.5 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ tidyr::replace_na() masks mltools::replace_na()
## ✖ purrr::transpose() masks data.table::transpose()
library(dplyr)
#install.packages('gbm')
library(gbm)
## Loaded gbm 2.1.8.1
#install.packages('pROC')
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)
We start by loading only the training data, so that we can explore and clean it. Once the cleaning process is defined, we wrap it in a function that can also be fed the test data. This keeps the test data out of the modeling step: the model is fit on the training data alone, and the test data is used only to measure the quality of the model.
df_train <- read_csv("~/documents/r_projects/prestamos_c1/loan_train.csv")
## Rows: 431 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Pro...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For the initial exploration of the data we first apply a manual process and then an automated one (Ellis, 2018).
dim(df_train)
## [1] 431 13
head(df_train)
## # A tibble: 6 × 13
## Loan_ID Gender Married Depen…¹ Educa…² Self_…³ Appli…⁴ Coapp…⁵ LoanA…⁶ Loan_…⁷
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 LP0010… Male No 0 Gradua… No 5849 0 128 360
## 2 LP0010… Male Yes 1 Gradua… No 4583 1508 128 360
## 3 LP0010… Male Yes 0 Gradua… Yes 3000 0 66 360
## 4 LP0010… Male No 0 Gradua… No 6000 0 141 360
## 5 LP0010… Male Yes 2 Gradua… Yes 5417 4196 267 360
## 6 LP0010… Male Yes 0 Not Gr… No 2333 1516 95 360
## # … with 3 more variables: Credit_History <dbl>, Property_Area <chr>,
## # Loan_Status <chr>, and abbreviated variable names ¹Dependents, ²Education,
## # ³Self_Employed, ⁴ApplicantIncome, ⁵CoapplicantIncome, ⁶LoanAmount,
## # ⁷Loan_Amount_Term
glimpse(df_train)
## Rows: 431
## Columns: 13
## $ Loan_ID <chr> "LP001002", "LP001003", "LP001005", "LP001008", "LP0…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"…
## $ Dependents <chr> "0", "1", "0", "0", "2", "0", "3+", "2", "1", "2", "…
## $ Education <chr> "Graduate", "Graduate", "Graduate", "Graduate", "Gra…
## $ Self_Employed <chr> "No", "No", "Yes", "No", "Yes", "No", "No", "No", "N…
## $ ApplicantIncome <dbl> 5849, 4583, 3000, 6000, 5417, 2333, 3036, 4006, 1284…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 0, 4196, 1516, 2504, 1526, 10968, 700, 1…
## $ LoanAmount <dbl> 128, 128, 66, 141, 267, 95, 158, 168, 349, 70, 109, …
## $ Loan_Amount_Term <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ Property_Area <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status <chr> "Y", "N", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y", "Y…
skim(df_train)
| | |
|---|---|
| Name | df_train |
| Number of rows | 431 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Loan_ID | 0 | 1 | 8 | 8 | 0 | 431 | 0 |
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Dependents | 0 | 1 | 1 | 2 | 0 | 4 | 0 |
| Education | 0 | 1 | 8 | 12 | 0 | 2 | 0 |
| Self_Employed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Property_Area | 0 | 1 | 5 | 9 | 0 | 3 | 0 |
| Loan_Status | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1 | 5525.90 | 6820.69 | 150 | 2874 | 3833 | 5705.5 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1 | 1653.66 | 3278.82 | 0 | 0 | 1030 | 2304.0 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 0 | 1 | 145.50 | 83.52 | 9 | 101 | 128 | 166.5 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 0 | 1 | 340.98 | 61.53 | 36 | 360 | 360 | 360.0 | 480 | ▁▁▁▇▁ |
| Credit_History | 0 | 1 | 0.86 | 0.35 | 0 | 1 | 1 | 1.0 | 1 | ▂▁▁▁▇ |
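The `n_missing` column reported by `skim()` can also be checked by hand as part of the manual pass. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not the loan data). It counts the missing values per column and the share of each class in the target.

```r
# Hypothetical toy data, NOT the loan data set
toy <- data.frame(
  Loan_Status = c("Y", "N", "Y", "Y"),
  LoanAmount  = c(128, NA, 66, 141)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Proportion of each class in the target column
class_balance <- prop.table(table(toy$Loan_Status))

print(missing_per_column)
print(class_balance)
```

On the real data the same two calls would be applied to `df_train`. A strongly unbalanced target is worth knowing about before reading accuracy figures later on.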
In this section we wrap the whole data transformation in a single function, which we use to prepare first the training data and then the test data.
Special variables: Loan_ID is the index and Loan_Status is the target. Categorical variables are one-hot encoded (Datatricks, 2019). Numerical variables are scaled (Finnstats, 2021).
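To illustrate what one-hot encoding produces, here is a small sketch that uses base R's `model.matrix()` instead of the `one_hot()` function from mltools used in this document. The `toy` data frame is hypothetical. Each level of the factor becomes its own 0/1 column, and every row has exactly one column set to 1.

```r
# Hypothetical toy data, NOT the loan data set
toy <- data.frame(
  Property_Area = factor(c("Urban", "Rural", "Semiurban", "Urban"))
)

# "- 1" drops the intercept so that every level gets its own indicator column
one_hot_toy <- model.matrix(~ Property_Area - 1, data = toy)

print(colnames(one_hot_toy))
print(one_hot_toy)
```

Note that `model.matrix()` names the columns by pasting the variable and level together (for example `Property_AreaUrban`), whereas `one_hot()` inserts an underscore between them.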
munge_data <- function(df) {
  # Binary columns; not referenced below, they are encoded by hand as 0/1 indicators
  bin_columns = c(
    'Gender',        # 1 = Male
    'Married',       # 1 = Yes
    'Education',     # 1 = Graduate
    'Self_Employed'  # 1 = Yes
  )
  # Multi-level columns; one-hot encoded
  categorical_columns = c(
    'Dependents',
    'Property_Area')
  # Numerical columns to standardize; the raw incomes are left out
  # and enter the result only through the ratios computed below
  numerical_columns = c(
    #'ApplicantIncome',
    #'CoapplicantIncome',
    'LoanAmount',
    'Credit_History',
    'Loan_Amount_Term'
  )
  y_col = 'Loan_Status'
  idx_col = 'Loan_ID'

  # target: Loan_ID becomes the row name, Loan_Status becomes 1 (Y) or 0 (N)
  y = df[y_col]
  y[idx_col] = df[idx_col]
  y <- y %>%
    column_to_rownames(var=idx_col) %>%
    mutate(Loan_Status = ifelse(Loan_Status=='Y', 1, 0))

  # one-hot encode all categorical columns
  df_1h <- one_hot(as.data.table(lapply(df[categorical_columns], factor)))

  # binary columns as 0/1 indicators
  df_1h$Male <- ifelse(df$Gender=='Male', 1, 0)
  df_1h$Married <- ifelse(df$Married=='Yes', 1, 0)
  df_1h$Graduate <- ifelse(df$Education=='Graduate', 1, 0)
  df_1h$Self_Employed <- ifelse(df$Self_Employed=='Yes', 1, 0)

  # numerical columns: standardized with the mean and sd of the data passed in
  df_scaled <- apply(df[numerical_columns],2,scale)

  # ratios of the loan amount to income plus loan amount
  loan_income_ratio <- data.frame(
    Loan_total_income_ratio= df$LoanAmount/(df$ApplicantIncome+df$CoapplicantIncome+df$LoanAmount),
    #Loan_primary_income_ratio= df$LoanAmount/(df$ApplicantIncome+df$LoanAmount),
    Loan_coincome_ratio= df$LoanAmount/(df$CoapplicantIncome+df$LoanAmount)
  )

  df_tidy <- cbind(y, df_1h, df_scaled, loan_income_ratio)
  return(df_tidy)
}
df_tidy <- munge_data(df_train)
head(df_tidy)
## Loan_Status Dependents_0 Dependents_1 Dependents_2 Dependents_3+
## LP001002 1 1 0 0 0
## LP001003 0 0 1 0 0
## LP001005 1 1 0 0 0
## LP001008 1 1 0 0 0
## LP001011 1 0 0 1 0
## LP001013 1 1 0 0 0
## Property_Area_Rural Property_Area_Semiurban Property_Area_Urban Male
## LP001002 0 0 1 1
## LP001003 1 0 0 1
## LP001005 0 0 1 1
## LP001008 0 0 1 1
## LP001011 0 0 1 1
## LP001013 0 0 1 1
## Married Graduate Self_Employed LoanAmount Credit_History
## LP001002 0 1 0 -0.20957924 0.4094287
## LP001003 1 1 0 -0.20957924 0.4094287
## LP001005 1 1 1 -0.95194093 0.4094287
## LP001008 0 1 0 -0.05392276 0.4094287
## LP001011 1 1 1 1.45474776 0.4094287
## LP001013 1 0 0 -0.60470724 0.4094287
## Loan_Amount_Term Loan_total_income_ratio Loan_coincome_ratio
## LP001002 0.3090644 0.02141543 1.00000000
## LP001003 0.3090644 0.02058209 0.07823961
## LP001005 0.3090644 0.02152642 1.00000000
## LP001008 0.3090644 0.02296043 1.00000000
## LP001011 0.3090644 0.02702429 0.05982523
## LP001013 0.3090644 0.02408722 0.05896958
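The two ratio columns can be verified by hand from the first rows printed by `head(df_train)`. A short sketch, with the input values copied from that output (the variable names are used only for illustration):

```r
# First three rows of df_train, copied from the head() output
loan_amount        <- c(128, 128, 66)
applicant_income   <- c(5849, 4583, 3000)
coapplicant_income <- c(0, 1508, 0)

# Loan amount as a share of total income plus the loan itself
loan_total_income_ratio <-
  loan_amount / (applicant_income + coapplicant_income + loan_amount)

# Loan amount as a share of coapplicant income plus the loan;
# equals 1 whenever the coapplicant income is 0
loan_coincome_ratio <- loan_amount / (coapplicant_income + loan_amount)

print(round(loan_total_income_ratio, 8))
print(round(loan_coincome_ratio, 8))
```

The results agree with the first three rows of `df_tidy`. A value of 1 in `Loan_coincome_ratio` simply flags an applicant with no coapplicant income.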
In this section we build a Gradient Boosted Machine (GBM) model, using the clean data generated in the previous section (Pocs, 2021).
model_gbm = gbm(Loan_Status ~ .,
                data = df_tidy,
                cv.folds = 10,
                distribution = 'bernoulli',
                shrinkage = .01,
                n.minobsinnode = 10,
                n.trees = 500) # 500 trees to be built
summary(model_gbm)
## var rel.inf
## Credit_History Credit_History 63.84587194
## Loan_total_income_ratio Loan_total_income_ratio 15.91623522
## Property_Area_Semiurban Property_Area_Semiurban 6.10643746
## LoanAmount LoanAmount 5.29652806
## Loan_coincome_ratio Loan_coincome_ratio 3.65766767
## Property_Area_Rural Property_Area_Rural 1.14799418
## Dependents_1 Dependents_1 1.11423244
## Married Married 1.03415071
## Graduate Graduate 0.95651350
## Property_Area_Urban Property_Area_Urban 0.21334090
## Dependents_2 Dependents_2 0.21257828
## Self_Employed Self_Employed 0.16992974
## Loan_Amount_Term Loan_Amount_Term 0.15126865
## Dependents_0 Dependents_0 0.09453678
## `Dependents_3+` `Dependents_3+` 0.08271448
## Male Male 0.00000000
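With `distribution = 'bernoulli'`, gbm builds its score on the log-odds scale, and `type = 'response'` in `predict()` turns that score into a probability. The sketch below mimics that conversion in base R for a single applicant. The labels, the tree outputs, and the variable names are all hypothetical (real tree outputs come from the fitted model); only the shrinkage value is taken from the `gbm()` call.

```r
# Hypothetical labels: 1 = approved, 0 = not approved (NOT the loan data)
y <- c(1, 1, 1, 0, 1, 0, 1, 1)

# Starting score: log-odds of the base approval rate
f0 <- qlogis(mean(y))

# Hypothetical outputs of three fitted trees for one applicant
tree_outputs <- c(1.2, 0.9, -0.4)
shrinkage <- 0.01

# Each tree moves the score by only a small, shrunken step
score <- f0 + shrinkage * sum(tree_outputs)

# Logistic function maps the log-odds score to a probability
prob <- plogis(score)
pred <- ifelse(prob > 0.5, 1, 0)

print(c(score = score, prob = prob, pred = pred))
```

Because every step is multiplied by 0.01, a single tree barely changes the prediction, which is why a small shrinkage is usually paired with several hundred trees.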
modelo <- rpart(
  Loan_Status ~ .,
  method = "class",
  data = df_tidy)
rpart.plot(modelo)
df_test <- munge_data(read_csv("~/documents/r_projects/prestamos_c1/loan_test.csv"))
## Rows: 183 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Pro...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df_test)
## Loan_Status Dependents_0 Dependents_1 Dependents_2 Dependents_3+
## LP001006 1 1 0 0 0
## LP001028 1 0 0 1 0
## LP001032 1 1 0 0 0
## LP001036 0 1 0 0 0
## LP001073 1 0 0 1 0
## LP001087 1 0 0 1 0
## Property_Area_Rural Property_Area_Semiurban Property_Area_Urban Male
## LP001006 0 0 1 1
## LP001028 0 0 1 1
## LP001032 0 0 1 1
## LP001036 0 0 1 0
## LP001073 0 0 1 1
## LP001087 0 1 0 0
## Married Graduate Self_Employed LoanAmount Credit_History
## LP001006 1 0 0 -0.3073043 0.4148869
## LP001028 1 1 0 0.6260848 0.4148869
## LP001032 0 1 0 -0.2489675 0.4148869
## LP001036 0 1 0 -0.8206684 -2.3971244
## LP001073 1 0 0 -0.4239780 0.4148869
## LP001087 0 1 0 -0.3073043 0.4148869
## Loan_Amount_Term Loan_total_income_ratio Loan_coincome_ratio
## LP001006 0.2008192 0.02371073 0.04842615
## LP001028 0.2008192 0.01757624 0.02407898
## LP001032 0.2008192 0.02463054 1.00000000
## LP001036 0.2008192 0.02119353 1.00000000
## LP001073 0.2008192 0.02046131 0.09565217
## LP001087 0.2008192 0.02015790 0.05447118
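Because `munge_data()` calls `scale()` on whatever data frame it receives, the test set is standardized with its own mean and standard deviation. This is why a `Credit_History` of 1 maps to 0.4094287 in the training output and to 0.4148869 in the test output. One possible alternative, assuming we want the test data transformed with training statistics only, is to reuse the centering and scaling attributes that `scale()` stores on its result. The data frames below are hypothetical toy values, not the loan data.

```r
# Hypothetical toy values, NOT the loan data set
train <- data.frame(LoanAmount = c(128, 66, 141, 267, 95))
test  <- data.frame(LoanAmount = c(110, 200))

# Standardize the training data; scale() keeps the mean and sd as attributes
scaled_train <- scale(train)
train_center <- attr(scaled_train, "scaled:center")
train_sd     <- attr(scaled_train, "scaled:scale")

# Apply the TRAINING mean and sd to the test data
scaled_test <- scale(test, center = train_center, scale = train_sd)

print(scaled_test)
```

With this approach the function would need to return or accept the stored parameters, so it is a change of design rather than a drop-in replacement for the version used here.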
We now compute the confusion matrix, the ROC curve, and the AUC (RDocumentation, 1.18).
y_prob = predict(model_gbm, df_test[,2:17], type='response')
## Using 494 trees...
y_pred = as.factor(ifelse(y_prob>0.5,1,0))
confusionMatrix(
  data = y_pred,
  reference = as.factor(df_test$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 29 3
## 1 28 123
##
## Accuracy : 0.8306
## 95% CI : (0.7683, 0.8819)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 9.089e-06
##
## Kappa : 0.5512
##
## Mcnemar's Test P-Value : 1.629e-05
##
## Sensitivity : 0.5088
## Specificity : 0.9762
## Pos Pred Value : 0.9062
## Neg Pred Value : 0.8146
## Prevalence : 0.3115
## Detection Rate : 0.1585
## Detection Prevalence : 0.1749
## Balanced Accuracy : 0.7425
##
## 'Positive' Class : 0
##
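The statistics reported by `confusionMatrix()` follow directly from the four counts in the table. As a check, this sketch recomputes a few of them by hand, with the counts copied from the output. Note that class 0 is reported as the positive class, so sensitivity here refers to loans with `Loan_Status` equal to N.

```r
# Counts copied from the confusion matrix output (rows = prediction, cols = reference)
cm <- matrix(c(29, 28, 3, 123), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# With class 0 as the positive class:
tp <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
fn <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["1", "1"]  # predicted 1, actually 1

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy), 4))
```

Read this way, the model identifies almost all approved loans (123 of 126) but only about half of the non-approved ones (29 of 57).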
roc(df_test$Loan_Status ~ predict(model_gbm, df_test[,2:17], type='response'), plot = TRUE, print.auc = TRUE)
## Using 494 trees...
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = df_test$Loan_Status ~ predict(model_gbm, df_test[, 2:17], type = "response"), plot = TRUE, print.auc = TRUE)
##
## Data: predict(model_gbm, df_test[, 2:17], type = "response") in 57 controls (df_test$Loan_Status 0) < 126 cases (df_test$Loan_Status 1).
## Area under the curve: 0.8106
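The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. Below is a minimal sketch of that interpretation on hypothetical scores (not the loan predictions). It uses the rank-sum formula and a direct pairwise count in base R rather than pROC.

```r
# Hypothetical true classes and predicted scores, NOT the loan predictions
truth  <- c(1, 1, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.7)

n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)

# Rank-sum (Mann-Whitney) form of the AUC
ranks <- rank(scores)
auc_rank <- (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct form: share of (positive, negative) pairs ranked correctly,
# counting ties as one half
pair_wins <- outer(scores[truth == 1], scores[truth == 0],
                   FUN = function(p, n) (p > n) + 0.5 * (p == n))
auc_pairs <- mean(pair_wins)

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Both forms give the same value on this toy example, since 7 of the 9 positive-negative pairs are ordered correctly.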
Finnstats (2021). How to use the scale() function in R. R-bloggers. Retrieved from https://www.r-bloggers.com/2021/12/how-to-use-the-scale-function-in-r/
Datatricks (2019). One Hot encoding in R, three simple methods. Retrieved from https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods
Pocs, M. (2021). Understanding How a Gradient Boosted Tree Does Binary Classification. Towards Data Science. Retrieved from https://towardsdatascience.com/understanding-how-a-gradient-boosted-tree-does-binary-classification-c215967600fe
Ellis, L. (2018). Explore your dataset in R. R-bloggers. Retrieved from https://www.r-bloggers.com/2018/11/explore-your-dataset-in-r/
RDocumentation (1.18). Build a ROC Curve. DataCamp. Retrieved from https://www.rdocumentation.org/packages/pROC/versions/1.18.0/topics/roc