Tidymodels is a collection of R packages designed to organize, standardize, and streamline the process of building, training, evaluating, and tuning statistical and machine learning models.
Main Tidymodels packages:
rsample: provides tools for splitting data into training and test sets, performing cross-validation, and other sampling techniques.
recipes: enables structured, reproducible data preprocessing (feature engineering), such as normalization, variable transformations, and the creation of dummy variables.
workflows: combines preprocessing and models into a single workflow, making it easy to organize and manage every step in one object.
parsnip: provides a uniform interface for specifying regression, classification, and other model types, independent of the modeling engine (glm, randomForest, xgboost, etc.).
tune: helps tune model hyperparameters via techniques such as grid search, random search, and Bayesian optimization, among others.
yardstick: offers metrics for evaluating model performance, such as accuracy, root mean squared error, and ROC AUC, among others.
Why use it for regression models?
Tidymodels provides a structured, reproducible, and consistent workflow that simplifies and improves the implementation and evaluation of regression models in R.
# loading the libraries
library(tidymodels)
library(tidyverse)
king_county <- readr::read_csv("C:/Users/ronald.hernandez/Documents/practicas modelamiento estadistico/King county house sales/archive(6)/kc_house_data.csv")
king_county <- king_county %>% filter(zipcode == "98103")
head(king_county %>% select(price, sqft_living, bedrooms, bathrooms, floors, view, condition, grade, yr_built))
## # A tibble: 6 x 9
## price sqft_living bedrooms bathrooms floors view condition grade yr_built
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 485000 1600 4 1 1.5 0 4 7 1916
## 2 570000 1260 3 1.75 1 0 5 6 1905
## 3 518500 1590 3 3.5 3 0 3 8 2010
## 4 822500 2320 5 3.5 2 0 5 7 1926
## 5 511000 1430 3 1 1 0 3 7 1947
## 6 532170 1360 3 2 2 0 3 8 1990
The first thing we will do is split the data using the rsample package.
# note: call set.seed() beforehand if the split needs to be reproducible
king_county_split <- initial_split(king_county, prop = 0.80, strata = price)
king_county_training <- king_county_split %>% training()
king_county_test <- king_county_split %>% testing()
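As a quick sanity check (not part of the original analysis), we can confirm that roughly 80% of the observations landed in the training set:
# compare the number of rows in each partition
nrow(king_county_training)
nrow(king_county_test)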
We will specify a linear regression model through the interface provided by parsnip, using linear_reg() and setting "lm" as the engine in set_engine().
lm_model <- linear_reg() %>% set_engine('lm') %>% set_mode('regression')
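Because parsnip decouples the model specification from the engine, the same spec could later be fitted with a different backend simply by changing set_engine(). As a minimal sketch, translate() shows the template of the underlying lm call that will be generated:
# inspect how parsnip will translate the spec into the engine's syntax
lm_model %>% translate()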
We will use the recipes package to define a preprocessing pipeline that removes highly correlated variables and creates dummy variables.
kc_recipe <- recipe(price ~ sqft_living + bedrooms + bathrooms + floors + view + condition + grade + yr_built,
                    data = king_county_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%   # drop variables with pairwise correlation above 0.8
  step_dummy(all_nominal(), -all_outcomes())      # dummy encode any nominal predictors
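Although the workflow will estimate the recipe for us, we can preview what the preprocessing produces; a minimal sketch using prep() and bake():
# estimate the recipe on the training set and return the processed rows
kc_recipe %>%
  prep(training = king_county_training) %>%
  bake(new_data = NULL) %>%
  head()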
We create a workflow object.
kc_wkfl <- workflow() %>%
add_model(lm_model) %>%
add_recipe(kc_recipe)
kc_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_corr()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
We will create the resampling folds with vfold_cv() from rsample, specifying 10 folds in this case.
set.seed(0)
kc_folds <- vfold_cv(king_county_training, v = 10, strata = price)
The fit_resamples() function takes a model or workflow object. Its purpose is to explore and compare the performance of different types of models. Note that this function is not used to fit a final model; it is intended only for evaluation.
kc_rs_fit <- kc_wkfl %>% fit_resamples(resamples = kc_folds)
kc_rs_fit %>% collect_metrics()
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 127089. 10 5308. Preprocessor1_Model1
## 2 rsq standard 0.635 10 0.0241 Preprocessor1_Model1
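By default fit_resamples() reports rmse and rsq. If we wanted other yardstick metrics, or to keep the out-of-fold predictions, we could pass the metrics and control arguments; a sketch under those assumptions (the object name kc_rs_custom is ours):
# request mae in addition to rmse/rsq and save the resample predictions
kc_rs_custom <- kc_wkfl %>%
  fit_resamples(
    resamples = kc_folds,
    metrics = metric_set(rmse, rsq, mae),
    control = control_resamples(save_pred = TRUE)
  )
kc_rs_custom %>% collect_predictions() %>% head()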
kc_metrics <- kc_rs_fit %>% collect_metrics(summarize = FALSE)
kc_metrics
## # A tibble: 20 x 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold01 rmse standard 126207. Preprocessor1_Model1
## 2 Fold01 rsq standard 0.476 Preprocessor1_Model1
## 3 Fold02 rmse standard 122743. Preprocessor1_Model1
## 4 Fold02 rsq standard 0.654 Preprocessor1_Model1
## 5 Fold03 rmse standard 134287. Preprocessor1_Model1
## 6 Fold03 rsq standard 0.613 Preprocessor1_Model1
## 7 Fold04 rmse standard 131133. Preprocessor1_Model1
## 8 Fold04 rsq standard 0.631 Preprocessor1_Model1
## 9 Fold05 rmse standard 111381. Preprocessor1_Model1
## 10 Fold05 rsq standard 0.620 Preprocessor1_Model1
## 11 Fold06 rmse standard 113185. Preprocessor1_Model1
## 12 Fold06 rsq standard 0.724 Preprocessor1_Model1
## 13 Fold07 rmse standard 152151. Preprocessor1_Model1
## 14 Fold07 rsq standard 0.614 Preprocessor1_Model1
## 15 Fold08 rmse standard 96422. Preprocessor1_Model1
## 16 Fold08 rsq standard 0.682 Preprocessor1_Model1
## 17 Fold09 rmse standard 144222. Preprocessor1_Model1
## 18 Fold09 rsq standard 0.586 Preprocessor1_Model1
## 19 Fold10 rmse standard 139156. Preprocessor1_Model1
## 20 Fold10 rsq standard 0.749 Preprocessor1_Model1
kc_metrics %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate), median = median(.estimate), max = max(.estimate),
            mean = mean(.estimate), sd = sd(.estimate))
## # A tibble: 2 x 6
## .metric min median max mean sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 rmse 96422. 128670. 152151. 127089. 16785.
## 2 rsq 0.476 0.625 0.749 0.635 0.0762
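The per-fold estimates can also be visualized to see their spread at a glance; a minimal ggplot2 sketch (optional, not part of the original analysis):
# RMSE by fold
kc_metrics %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(x = id, y = .estimate)) +
  geom_col() +
  labs(x = "Fold", y = "RMSE")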
Another way to evaluate the performance of a linear regression model is to generate predictions on a test set and compute the metrics from those results.
For this we will use the last_fit() function from tune, which takes a model specification or workflow together with a data split object, fits the model on the training data, and evaluates it on the test data in a single step.
kc_wkfl_fit <- kc_wkfl %>%
last_fit(split = king_county_split)
kc_wkfl_fit %>%
collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 108879. Preprocessor1_Model1
## 2 rsq standard 0.755 Preprocessor1_Model1
kc_preds <- kc_wkfl_fit %>%
collect_predictions()
head(kc_preds)
## # A tibble: 6 x 5
## id .pred .row price .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 586167. 10 548000 Preprocessor1_Model1
## 2 train/test split 505785. 11 640000 Preprocessor1_Model1
## 3 train/test split 469458. 14 450000 Preprocessor1_Model1
## 4 train/test split 875757. 16 1000000 Preprocessor1_Model1
## 5 train/test split 500606. 21 565000 Preprocessor1_Model1
## 6 train/test split 316025. 25 466000 Preprocessor1_Model1
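With the predictions collected, a scatter plot of observed versus predicted prices gives a quick visual check of fit; a minimal sketch (coord_obs_pred() from tune puts both axes on the same scale):
# points close to the dashed diagonal indicate accurate predictions
ggplot(kc_preds, aes(x = price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(linetype = 2, color = "red") +
  coord_obs_pred()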
kc_wkfl_fit %>% extract_workflow()
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_corr()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) sqft_living bedrooms bathrooms floors view
## 2008193.3 178.1 454.9 -2410.2 -47974.0 20291.0
## condition grade yr_built
## -1226.6 122282.7 -1297.2
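If we want the coefficients as a tidy table with standard errors and p-values, we can extract the parsnip fit and call tidy(); a minimal sketch:
# coefficient table for the fitted linear model
kc_wkfl_fit %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  tidy()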
Next we will work through a classification example with SBA loan data, modeling whether a loan defaults.
sba <- read_csv("C:/Users/ronald.hernandez/Documents/practicas modelamiento estadistico/Modelos con tidymodels/SBAcase.11.13.17.csv")
## Rows: 2102 Columns: 35
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (8): Name, City, State, Bank, BankState, RevLineCr, LowDoc, MIS_Status
## dbl (27): Selected, LoanNr_ChkDgt, Zip, NAICS, ApprovalDate, ApprovalFY, Ter...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sba_split <- initial_split(sba, prop = 0.80, strata = Default)
sba_training <- sba_split %>% training()
sba_test <- sba_split %>% testing()
head(sba %>% select(Default, SBA_Appv, Term, NoEmp, New))
## # A tibble: 6 x 5
## Default SBA_Appv Term NoEmp New
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 15000 36 1 0
## 2 0 15000 56 1 0
## 3 0 15000 36 10 0
## 4 0 25000 36 6 0
## 5 0 343000 240 65 0
## 6 0 25000 84 1 0
We will specify a logistic regression model through the interface provided by parsnip, using logistic_reg() and setting "glm" as the engine in set_engine().
logistic_model <- logistic_reg() %>% set_engine('glm') %>% set_mode('classification')
We will set up a pipeline that removes highly correlated variables, creates dummy variables, and converts the response variable "Default" to a factor. This last step is required for binary classification models in tidymodels.
sba_recipe <- recipe(Default ~ SBA_Appv + Term + NoEmp + New, data = sba_training) %>%
  step_mutate(Default = as.factor(Default)) %>%   # classification in tidymodels requires a factor outcome
  step_corr(all_numeric(), threshold = 0.8) %>%   # drop variables with pairwise correlation above 0.8
  step_dummy(all_nominal(), -all_outcomes())      # dummy encode any nominal predictors
We create a workflow object.
sba_wkfl <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(sba_recipe)
sba_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_mutate()
## * step_corr()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
set.seed(0)
sba_folds <- vfold_cv(sba_training, v = 10, strata = Default)
sba_rs_fit <- sba_wkfl %>% fit_resamples(resamples = sba_folds)
sba_rs_fit %>% collect_metrics()
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.826 10 0.00915 Preprocessor1_Model1
## 2 roc_auc binary 0.875 10 0.0129 Preprocessor1_Model1
sba_metrics <- sba_rs_fit %>% collect_metrics(summarize =FALSE)
sba_metrics
## # A tibble: 20 x 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold01 accuracy binary 0.775 Preprocessor1_Model1
## 2 Fold01 roc_auc binary 0.839 Preprocessor1_Model1
## 3 Fold02 accuracy binary 0.852 Preprocessor1_Model1
## 4 Fold02 roc_auc binary 0.918 Preprocessor1_Model1
## 5 Fold03 accuracy binary 0.839 Preprocessor1_Model1
## 6 Fold03 roc_auc binary 0.880 Preprocessor1_Model1
## 7 Fold04 accuracy binary 0.833 Preprocessor1_Model1
## 8 Fold04 roc_auc binary 0.889 Preprocessor1_Model1
## 9 Fold05 accuracy binary 0.810 Preprocessor1_Model1
## 10 Fold05 roc_auc binary 0.867 Preprocessor1_Model1
## 11 Fold06 accuracy binary 0.863 Preprocessor1_Model1
## 12 Fold06 roc_auc binary 0.925 Preprocessor1_Model1
## 13 Fold07 accuracy binary 0.798 Preprocessor1_Model1
## 14 Fold07 roc_auc binary 0.788 Preprocessor1_Model1
## 15 Fold08 accuracy binary 0.845 Preprocessor1_Model1
## 16 Fold08 roc_auc binary 0.902 Preprocessor1_Model1
## 17 Fold09 accuracy binary 0.796 Preprocessor1_Model1
## 18 Fold09 roc_auc binary 0.855 Preprocessor1_Model1
## 19 Fold10 accuracy binary 0.844 Preprocessor1_Model1
## 20 Fold10 roc_auc binary 0.889 Preprocessor1_Model1
sba_metrics %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate), median = median(.estimate), max = max(.estimate),
            mean = mean(.estimate), sd = sd(.estimate))
## # A tibble: 2 x 6
## .metric min median max mean sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 accuracy 0.775 0.836 0.863 0.826 0.0289
## 2 roc_auc 0.788 0.885 0.925 0.875 0.0407
We again evaluate on the test set with last_fit(); this time we will also compute custom metrics from the resulting predictions.
sba_wkfl_fit <- sba_wkfl %>%
last_fit(split = sba_split)
sba_wkfl_fit %>%
collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.815 Preprocessor1_Model1
## 2 roc_auc binary 0.886 Preprocessor1_Model1
sba_preds <- sba_wkfl_fit %>%
collect_predictions()
head(sba_preds)
## # A tibble: 6 x 7
## id .pred_0 .pred_1 .row .pred_class Default .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.652 0.348 8 0 0 Preprocessor1_Mode~
## 2 train/test split 0.501 0.499 15 0 0 Preprocessor1_Mode~
## 3 train/test split 0.996 0.00435 18 0 0 Preprocessor1_Mode~
## 4 train/test split 0.990 0.0102 21 0 0 Preprocessor1_Mode~
## 5 train/test split 0.632 0.368 25 0 0 Preprocessor1_Mode~
## 6 train/test split 0.318 0.682 27 1 0 Preprocessor1_Mode~
sba_wkfl_fit %>% extract_workflow()
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_mutate()
## * step_corr()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) SBA_Appv Term NoEmp New
## 1.626e+00 1.040e-06 -2.536e-02 -1.557e-02 -1.223e-01
##
## Degrees of Freedom: 1679 Total (i.e. Null); 1675 Residual
## Null Deviance: 2122
## Residual Deviance: 1565 AIC: 1575
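For a logistic model it is often more interpretable to examine odds ratios; a minimal sketch that exponentiates the coefficients via tidy():
# exponentiated coefficients (odds ratios) with confidence intervals
sba_wkfl_fit %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE, conf.int = TRUE)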
The metric_set() function builds a metric function from user-chosen yardstick metrics.
custom_metrics <- yardstick::metric_set(accuracy, sensitivity, specificity)
custom_metrics(sba_preds,
truth = Default,
estimate = .pred_class)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.815
## 2 sensitivity binary 0.901
## 3 specificity binary 0.638
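Beyond these summary metrics, yardstick can also produce a confusion matrix and an ROC curve from the same predictions; a minimal sketch (by default the first factor level, "0", is treated as the event, hence the .pred_0 column):
# confusion matrix of predicted vs. true classes
conf_mat(sba_preds, truth = Default, estimate = .pred_class)

# ROC curve from the predicted probabilities
sba_preds %>%
  roc_curve(truth = Default, .pred_0) %>%
  autoplot()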