1.- LIBRARIES
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
##
## Attaching package: 'h2o'
##
## The following objects are masked from 'package:lubridate':
##
## day, hour, month, week, year
##
## The following objects are masked from 'package:stats':
##
## cor, sd, var
##
## The following objects are masked from 'package:base':
##
## %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library(DALEX)
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
##
##
## Attaching package: 'DALEX'
##
## The following object is masked from 'package:dplyr':
##
## explain
library(DALEXtra)
library(iml)
library(skimr)
library(DataExplorer)
library(ggpubr)
library(univariateML)
library(recipes)
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
2.- DATA
The rent dataset, available in the gamlss.data package, contains information on the rental price of 1969 dwellings located in Munich in 1993. In addition to the price, it includes 8 more variables:
R: rental price.
Fl: floor area of the dwelling in square metres.
A: year of construction.
B: whether it has a bathroom (1) or not (0).
H: whether it has central heating (1) or not (0).
L: whether the kitchen is equipped (1) or not (0).
Sp: whether the quality of the neighbourhood where the dwelling is located is above average (1) or not (0).
Sm: whether the quality of the neighbourhood where the dwelling is located is below average (1) or not (0).
loc: combination of Sp and Sm indicating whether the quality of the neighbourhood where the dwelling is located is below (1), equal to (2) or above (3) the average.
data("rent", package = "gamlss.data")
datos <- rent
str(datos)
## 'data.frame': 1969 obs. of 9 variables:
## $ R : num 693 422 737 732 1295 ...
## $ Fl : num 50 54 70 50 55 59 46 94 93 65 ...
## $ A : num 1972 1972 1972 1972 1893 ...
## $ Sp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ H : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ L : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
# Sp and Sm are dropped, since their information is already captured by loc.
datos <- datos[, !(names(datos) %in% c("Sp", "Sm"))]
str(datos)
## 'data.frame': 1969 obs. of 7 variables:
## $ R : num 693 422 737 732 1295 ...
## $ Fl : num 50 54 70 50 55 59 46 94 93 65 ...
## $ A : num 1972 1972 1972 1972 1893 ...
## $ B : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ H : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ L : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
# Columns are renamed to be more descriptive.
colnames(datos) <- c("precio", "metros", "anyo", "banyo", "calefaccion", "cocina", "situacion")
str(datos)
## 'data.frame': 1969 obs. of 7 variables:
## $ precio : num 693 422 737 732 1295 ...
## $ metros : num 50 54 70 50 55 59 46 94 93 65 ...
## $ anyo : num 1972 1972 1972 1972 1893 ...
## $ banyo : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ calefaccion: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ cocina : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ situacion : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
3.- EDA
skim(datos)
| Name | datos |
| Number of rows | 1969 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| banyo | 0 | 1 | FALSE | 2 | 0: 1925, 1: 44 |
| calefaccion | 0 | 1 | FALSE | 2 | 0: 1580, 1: 389 |
| cocina | 0 | 1 | FALSE | 2 | 0: 1808, 1: 161 |
| situacion | 0 | 1 | FALSE | 3 | 2: 1247, 3: 550, 1: 172 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| precio | 0 | 1 | 811.88 | 379.00 | 101.7 | 544.2 | 737.8 | 1022 | 3000 | ▇▇▂▁▁ |
| metros | 0 | 1 | 67.73 | 20.86 | 30.0 | 52.0 | 67.0 | 82 | 120 | ▆▇▇▅▂ |
| anyo | 0 | 1 | 1948.48 | 29.02 | 1890.0 | 1934.0 | 1957.0 | 1972 | 1988 | ▃▁▃▇▇ |
head(datos)
## precio metros anyo banyo calefaccion cocina situacion
## 1 693.3 50 1972 0 0 0 2
## 2 422.0 54 1972 0 0 0 2
## 3 736.6 70 1972 0 0 0 2
## 4 732.2 50 1972 0 0 0 2
## 5 1295.1 55 1893 0 0 0 2
## 6 1195.9 59 1893 0 0 0 2
4.- Missing values
# Number of missing values per variable
datos %>% map_dbl(.f = function(x){sum(is.na(x))})
## precio metros anyo banyo calefaccion cocina
## 0 0 0 0 0 0
## situacion
## 0
plot_missing(
data = datos,
title = "Porcentaje de valores ausentes",
ggtheme = theme_bw(),
theme_config = list(legend.position = "bottom")
)
5.- Response variable
p1 <- ggplot(data = datos, aes(x = precio)) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Original distribution", x = "precio") +
  theme_bw()
p2 <- ggplot(data = datos, aes(x = sqrt(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Square-root transformation", x = "sqrt(precio)") +
  theme_bw()
p3 <- ggplot(data = datos, aes(x = log10(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Log10 transformation", x = "log10(precio)") +
  theme_bw()
p4 <- ggplot(data = datos, aes(x = log(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Natural log transformation", x = "log(precio)") +
  theme_bw()
ggarrange(p1, p2, p3, p4, ncol = 1, align = "v")
# Table of the main summary statistics
summary(datos$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.7 544.2 737.8 811.9 1022.0 3000.0
Some machine learning and statistical learning models require the response variable to follow a particular distribution. For example, linear models (LM) assume a normal distribution, while generalized linear models (GLM) require a distribution from the exponential family. Several R packages help identify which distribution best fits the data; one of them is univariateML. To learn more about identifying distributions, see Ajuste de distribuciones con R.
comparacion_aic <- AIC(
mlbetapr(datos$precio),
mlexp(datos$precio),
mlinvgamma(datos$precio),
mlgamma(datos$precio),
mllnorm(datos$precio),
mlrayleigh(datos$precio),
mlinvgauss(datos$precio),
mlweibull(datos$precio),
mlinvweibull(datos$precio),
mllgamma(datos$precio)
)
comparacion_aic %>% rownames_to_column(var = "distribucion") %>% arrange(AIC)
## distribucion df AIC
## 1 mlgamma(datos$precio) 2 28615.58
## 2 mllnorm(datos$precio) 2 28657.23
## 3 mlinvgauss(datos$precio) 2 28690.03
## 4 mlweibull(datos$precio) 2 28746.25
## 5 mlrayleigh(datos$precio) 1 28797.71
## 6 mlbetapr(datos$precio) 2 28867.94
## 7 mlinvgamma(datos$precio) 2 28869.14
## 8 mlinvweibull(datos$precio) 2 29245.12
## 9 mlexp(datos$precio) 1 30322.05
## 10 mllgamma(datos$precio) 2 Inf
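As a quick visual check, univariateML objects can be drawn on top of a histogram with the package's lines() method. A minimal sketch (not part of the original analysis) for the best candidate, the gamma distribution:
# Sketch: overlay the fitted gamma density on the empirical distribution.
hist(datos$precio, freq = FALSE, breaks = 30,
     main = "Gamma fit to precio", xlab = "precio")
lines(mlgamma(datos$precio), lwd = 2, col = "firebrick")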
6.- Continuous and qualitative variables
plot_density(
data = datos %>% select(-precio),
ncol= 3,
title = "Distribución variables continuas",
ggtheme = theme_bw(),
theme_config = list(
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(colour = "black", size = 12, face = 2)
)
)
custom_corr_plot <- function(variable1, variable2, df, alpha = 0.3){
  p <- df %>%
    mutate(
      # Trick to display the title with a facet-like style
      title = paste(toupper(variable2), "vs", toupper(variable1))
    ) %>%
    ggplot(aes(x = !!sym(variable1), y = !!sym(variable2))) +
    geom_point(alpha = alpha) +
    # Non-linear trend
    geom_smooth(se = FALSE, method = "gam", formula = y ~ splines::bs(x, 3)) +
    # Linear trend
    geom_smooth(se = FALSE, method = "lm", color = "firebrick") +
    facet_grid(. ~ title) +
    theme_bw() +
    theme(strip.text = element_text(colour = "black", size = 10, face = 2),
          axis.title = element_blank())
  return(p)
}
variables_continuas <- c("anyo", "metros")
plots <- map(
.x = variables_continuas,
.f = custom_corr_plot,
variable2 = "precio",
df = datos
)
ggarrange(plotlist = plots, ncol = 2, nrow = 1) %>%
annotate_figure(
top = text_grob("Correlación con el precio de alquiler", face = "bold", size = 16,
x = 0.4)
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
plot_correlation(
data = datos,
cor_args = list(use = "pairwise.complete.obs"),
type = "continuous",
title = "Matriz de correlación variables continuas",
theme_config = list(legend.position = "none",
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_blank(),
axis.text.x = element_text(angle = -45, hjust = +0.1)
)
)
GGally::ggscatmat(
data = datos %>% select_if(is.numeric),
alpha = 0.1) +
theme_bw() +
labs(title = "Correlación por pares") +
theme(
plot.title = element_text(size = 16, face = "bold"),
#axis.text = element_blank(),
strip.text = element_text(colour = "black", size = 10, face = 2)
)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
plot_bar(
datos,
ncol = 2,
title = "Número de observaciones por grupo",
ggtheme = theme_bw(),
theme_config = list(
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(colour = "black", size = 12, face = 2),
legend.position = "none"
)
)
custom_box_plot <- function(variable1, variable2, df, alpha = 0.3){
  p <- df %>%
    mutate(
      # Trick to display the title with a facet-like style
      title = paste(toupper(variable2), "vs", toupper(variable1))
    ) %>%
    ggplot(aes(x = !!sym(variable1), y = !!sym(variable2))) +
    geom_violin(alpha = alpha) +
    geom_boxplot(width = 0.1, outlier.shape = NA) +
    facet_grid(. ~ title) +
    theme_bw() +
    theme(strip.text = element_text(colour = "black", size = 10, face = 2),
          axis.title = element_blank())
  return(p)
}
variables_cualitativas <- c("banyo", "calefaccion", "cocina", "situacion")
plots <- map(
.x = variables_cualitativas,
.f = custom_box_plot,
variable2 = "precio",
df = datos
)
ggarrange(plotlist = plots, ncol = 2, nrow = 2) %>%
annotate_figure(
top = text_grob("Correlación con precio", face = "bold", size = 16,
x = 0.28)
)
If one of the levels of a qualitative variable has very few observations compared with the other levels, it can happen that, during cross-validation or bootstrapping, some partitions contain no observations of that class (zero variance), which can lead to errors. In this case, the variable banyo requires caution; a way to detect such predictors automatically is sketched after the table below.
table(datos$banyo)
##
## 0 1
## 1925 44
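As a safeguard, near-zero-variance predictors can be flagged automatically. A minimal sketch using the recipes package (already loaded above); step_nzv() marks predictors, such as banyo, whose minority level is very rare:
# Sketch: flag near-zero-variance predictors before resampling.
receta_nzv <- recipe(precio ~ ., data = datos) %>%
  step_nzv(all_predictors()) %>%
  prep()
# Predictors removed by the step (if any)
tidy(receta_nzv, number = 1)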
7.- Models
# Create a local cluster with all the available cores.
h2o.init(ip = "localhost",
         # -1 means that all available cores are used.
         nthreads = -1,
         # Maximum memory available to the cluster.
         max_mem_size = "6g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 41 minutes 21 seconds
## H2O cluster timezone: Europe/Paris
## H2O data parsing timezone: UTC
## H2O cluster version: 3.42.0.2
## H2O cluster version age: 5 months and 2 days
## H2O cluster name: H2O_started_from_R_Usuario_kmr404
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.23 GB
## H2O cluster total cores: 20
## H2O cluster allowed cores: 20
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.3.2 (2023-10-31 ucrt)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is (5 months and 2 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
# Any data already in the cluster is removed, in case it had been started before.
h2o.removeAll()
h2o.no_progress()
datos_h2o <- as.h2o(x = datos, destination_frame = "datos_h2o")
7.1.- TRAIN/TEST SPLIT
set.seed(123)
particiones <- h2o.splitFrame(data = datos_h2o, ratios = c(0.8), seed = 123)
datos_train_h2o <- h2o.assign(data = particiones[[1]], key = "datos_train_h2o")
datos_test_h2o <- h2o.assign(data = particiones[[2]], key = "datos_test_h2o")
datos_train <- as.data.frame(datos_train_h2o)
datos_test <- as.data.frame(datos_test_h2o)
summary(datos_train$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.7 544.7 741.6 813.9 1026.5 3000.0
summary(datos_test$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 127.1 537.8 718.6 804.0 1000.0 2764.5
summary(datos_train$banyo)/nrow(datos_train)*100
## 0 1
## 97.963081 2.036919
summary(datos_test$banyo)/nrow(datos_test)*100
## 0 1
## 96.984925 3.015075
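h2o.splitFrame() does not stratify, so a rare level such as banyo = 1 can end up unevenly distributed between train and test. A sketch of a stratified alternative, assuming the rsample package is installed (it is not loaded above):
# Sketch: stratified split that preserves the proportion of banyo levels.
library(rsample)
set.seed(123)
split_estrat <- initial_split(datos, prop = 0.8, strata = banyo)
prop.table(table(training(split_estrat)$banyo))
prop.table(table(testing(split_estrat)$banyo))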
7.2.- GLM MODEL
Training
# Define the response variable and the predictors.
var_respuesta <- "precio"
# All the available predictors are used for this model.
predictores <- setdiff(h2o.colnames(datos_train_h2o), var_respuesta)
Hyperparameter optimization
# Values of alpha to be compared.
hiperparametros <- list(alpha = seq(0, 1, 0.1))
grid_glm <- h2o.grid(
  # Algorithm and parameters
  algorithm = "glm",
  family = "gamma",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  standardize = TRUE,
  missing_values_handling = "Skip",
  ignore_const_cols = TRUE,
  # Hyperparameters
  hyper_params = hiperparametros,
  # Type of search
  search_criteria = list(strategy = "Cartesian"),
  lambda_search = TRUE,
  # Automatic selection of the appropriate solver
  solver = "AUTO",
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_glm"
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 12!
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 71!
## (the same warning is repeated for each of the remaining alpha values)
## Warning in h2o.getGrid(grid_id = grid_id): Adding alpha array to hyperparameter
## runs slower with gridsearch. This is due to the fact that the algo has to run
## initialization for every alpha value. Setting the alpha array as a model
## parameter will skip the initialization and run faster overall.
# Models sorted from lowest to highest RMSE
resultados_grid_glm <- h2o.getGrid(
grid_id = "grid_glm",
sort_by = "rmse",
decreasing = FALSE
)
## Warning in h2o.getGrid(grid_id = "grid_glm", sort_by = "rmse", decreasing =
## FALSE): Adding alpha array to hyperparameter runs slower with gridsearch. This
## is due to the fact that the algo has to run initialization for every alpha
## value. Setting the alpha array as a model parameter will skip the
## initialization and run faster overall.
as.data.frame(resultados_grid_glm@summary_table)
## alpha model_ids rmse
## 1 0.1 grid_glm_model_2 310.2497
## 2 0.2 grid_glm_model_3 310.2498
## 3 0.30000000000000004 grid_glm_model_4 310.2498
## 4 0.4 grid_glm_model_5 310.2499
## 5 0.5 grid_glm_model_6 310.2499
## 6 0.6000000000000001 grid_glm_model_7 310.2499
## 7 0.7000000000000001 grid_glm_model_8 310.2499
## 8 0.8 grid_glm_model_9 310.2499
## 9 0.9 grid_glm_model_10 310.2499
## 10 1.0 grid_glm_model_11 310.2499
## 11 0.0 grid_glm_model_1 310.3075
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_glm@model_ids[[1]])@parameters
modelo_glm <- h2o.glm(
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  standardize = TRUE,
  missing_values_handling = "Skip",
  ignore_const_cols = TRUE,
  # Hyperparameters
  alpha = mejores_hiperparam$alpha,
  lambda_search = TRUE,
  # Automatic selection of the appropriate solver
  solver = "AUTO",
  # Validation strategy (to compare with the other models)
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_glm"
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 3!
Influence of the predictors
After regularization, only metros keeps a non-zero importance; consistently, the explainer output further below shows almost constant predictions (around 814), which anticipates the poor performance of this model.
h2o.varimp(modelo_glm)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 0.010826 1.000000 1.000000
## 2 situacion.1 0.000000 0.000000 0.000000
## 3 situacion.2 0.000000 0.000000 0.000000
## 4 situacion.3 0.000000 0.000000 0.000000
## 5 calefaccion.0 0.000000 0.000000 0.000000
## 6 calefaccion.1 0.000000 0.000000 0.000000
## 7 cocina.0 0.000000 0.000000 0.000000
## 8 cocina.1 0.000000 0.000000 0.000000
## 9 banyo.0 0.000000 0.000000 0.000000
## 10 banyo.1 0.000000 0.000000 0.000000
## 11 anyo 0.000000 0.000000 0.000000
h2o.varimp_plot(modelo_glm)
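The coefficients behind this ranking can be inspected directly; a short sketch using h2o's coefficient accessors:
# Sketch: coefficients of the regularized GLM, on the original and the
# standardized scale.
h2o.coef(modelo_glm)
h2o.coef_norm(modelo_glm)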
Diagnosis of the residuals
explainer_glm <- DALEXtra::explain_h2o(
model = modelo_glm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_glm"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_glm
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 813.8608 , mean = 813.8803 , max = 813.9076
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -712.1723 , mean = 2.96012e-13 , max = 2186.092
## A new explainer has been created!
plot(model_performance(explainer_glm))
predicciones_train <- h2o.predict(
modelo_glm,
newdata = datos_train_h2o
)
predicciones_train <- h2o.cbind(
datos_train_h2o["precio"],
predicciones_train
)
predicciones_train <- as.data.frame(predicciones_train)
predicciones_train <- predicciones_train %>%
mutate(
residuo = predict - precio
)
p1 <- ggplot(predicciones_train, aes(x = precio, y = predict)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", color = "red", linewidth = 1, se = TRUE) +
  labs(title = "Predictions vs real value") +
  theme_bw()
p2 <- ggplot(predicciones_train, aes(x = seq_len(nrow(predicciones_train)), y = residuo)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  labs(title = "Residuals of the model") +
  theme_bw()
p3 <- ggplot(predicciones_train, aes(x = residuo)) +
  geom_density() +
  geom_rug() +
  labs(title = "Distribution of the residuals") +
  theme_bw()
p4 <- ggplot(predicciones_train, aes(sample = residuo)) +
  stat_qq() +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "QQ-plot of the residuals") +
  theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2) %>%
  ggpubr::annotate_figure(
    top = ggpubr::text_grob("Diagnostic plots of the training residuals",
                            color = "black", face = "bold", size = 14)
  )
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
predicciones_test <- h2o.predict(
object = modelo_glm,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
Test error
rmse_test_glm <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the GLM model:", rmse_test_glm)
## [1] "Test error (rmse) of the GLM model: 378.70675549663"
7.3.- GBM
# Hyperparameters to compare (random search)
hiperparametros <- list(
ntrees = c(500, 1000, 2000),
learn_rate = seq(0.01, 0.1, 0.01),
max_depth = seq(2, 15, 1),
sample_rate = seq(0.5, 1.0, 0.2),
col_sample_rate = seq(0.1, 1.0, 0.3)
)
# Stopping criteria for the search
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 50, # Maximum number of combinations
  max_runtime_secs = 60*10, # Maximum search time
  stopping_tolerance = 0.001, # Minimum improvement
  stopping_rounds = 5,
  seed = 123
)
grid_gbm <- h2o.grid(
  # Algorithm and parameters
  algorithm = "gbm",
  distribution = "gaussian",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Early stopping
  score_tree_interval = 100,
  stopping_rounds = 3,
  stopping_metric = "rmse",
  stopping_tolerance = 0.01,
  # Hyperparameters to optimize
  hyper_params = hiperparametros,
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  # Type of search
  search_criteria = search_criteria,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_gbm"
)
# Models sorted from lowest to highest RMSE
resultados_grid_gbm <- h2o.getGrid(
grid_id = "grid_gbm",
sort_by = "rmse",
decreasing = FALSE
)
as.data.frame(resultados_grid_gbm@summary_table)
## col_sample_rate learn_rate max_depth ntrees sample_rate model_ids
## 1 0.1 0.04 3 466 0.9 grid_gbm_model_33
## 2 0.4 0.01 4 533 0.7 grid_gbm_model_21
## 3 0.4 0.03 2 500 0.7 grid_gbm_model_12
## 4 0.1 0.01 10 500 0.5 grid_gbm_model_22
## 5 0.1 0.03 8 500 0.5 grid_gbm_model_7
## 6 0.4 0.01 6 500 0.9 grid_gbm_model_23
## 7 0.1 0.04 9 500 0.9 grid_gbm_model_13
## 8 0.4 0.02 5 500 0.5 grid_gbm_model_35
## 9 0.1 0.05 6 466 0.5 grid_gbm_model_11
## 10 0.7 0.01 8 433 0.5 grid_gbm_model_1
## 11 0.7 0.01 9 500 0.5 grid_gbm_model_4
## 12 0.1 0.06 7 500 0.5 grid_gbm_model_17
## 13 0.7 0.02 6 400 0.5 grid_gbm_model_45
## 14 0.1 0.06 14 433 0.5 grid_gbm_model_28
## 15 0.7 0.05 3 400 0.7 grid_gbm_model_2
## 16 0.1 0.09 12 500 0.9 grid_gbm_model_46
## 17 0.4 0.05 5 400 0.9 grid_gbm_model_36
## 18 0.7 0.02 15 400 0.5 grid_gbm_model_41
## 19 0.4 0.03 15 400 0.9 grid_gbm_model_9
## 20 1.0 0.07 3 500 0.9 grid_gbm_model_34
## 21 1.0 0.01 15 400 0.9 grid_gbm_model_50
## 22 0.4 0.08 5 500 0.5 grid_gbm_model_38
## 23 1.0 0.01 13 500 0.9 grid_gbm_model_32
## 24 0.7 0.05 5 400 0.5 grid_gbm_model_19
## 25 0.4 0.05 10 400 0.9 grid_gbm_model_44
## 26 1.0 0.03 6 500 0.7 grid_gbm_model_18
## 27 0.7 0.02 10 400 0.9 grid_gbm_model_25
## 28 0.4 0.10 5 466 0.5 grid_gbm_model_48
## 29 0.4 0.06 11 400 0.5 grid_gbm_model_24
## 30 0.4 0.08 14 433 0.5 grid_gbm_model_26
## 31 0.4 0.07 8 500 0.9 grid_gbm_model_27
## 32 1.0 0.05 6 500 0.5 grid_gbm_model_42
## 33 0.7 0.03 11 400 0.7 grid_gbm_model_37
## 34 0.7 0.04 8 400 0.7 grid_gbm_model_8
## 35 0.4 0.09 8 400 0.7 grid_gbm_model_6
## 36 0.4 0.10 8 400 0.9 grid_gbm_model_29
## 37 1.0 0.05 13 400 0.5 grid_gbm_model_39
## 38 0.4 0.08 15 500 0.7 grid_gbm_model_20
## 39 1.0 0.04 12 400 0.7 grid_gbm_model_10
## 40 0.7 0.07 10 400 0.5 grid_gbm_model_30
## 41 1.0 0.05 14 500 0.5 grid_gbm_model_31
## 42 0.7 0.08 10 400 0.5 grid_gbm_model_47
## 43 1.0 0.06 10 400 0.9 grid_gbm_model_16
## 44 1.0 0.09 7 400 0.9 grid_gbm_model_14
## 45 1.0 0.07 14 400 0.7 grid_gbm_model_3
## 46 0.7 0.10 12 400 0.5 grid_gbm_model_15
## 47 1.0 0.06 15 400 0.9 grid_gbm_model_43
## 48 1.0 0.09 9 400 0.7 grid_gbm_model_49
## 49 1.0 0.08 12 400 0.9 grid_gbm_model_5
## 50 1.0 0.10 15 400 0.7 grid_gbm_model_40
## rmse
## 1 304.6250
## 2 304.7685
## 3 304.8417
## 4 305.6167
## 5 305.8037
## 6 306.4812
## 7 306.6267
## 8 307.1674
## 9 307.6876
## 10 308.4450
## 11 309.2205
## 12 309.6394
## 13 309.6695
## 14 309.7567
## 15 310.3772
## 16 310.3804
## 17 311.6078
## 18 313.1005
## 19 314.5768
## 20 314.7106
## 21 315.1281
## 22 316.1080
## 23 316.9581
## 24 316.9640
## 25 317.0140
## 26 317.5602
## 27 317.6219
## 28 318.0550
## 29 318.8795
## 30 320.4766
## 31 320.6331
## 32 320.9059
## 33 321.1137
## 34 322.2029
## 35 322.3950
## 36 322.8808
## 37 324.6121
## 38 325.0035
## 39 327.1624
## 40 327.4183
## 41 328.3449
## 42 331.8125
## 43 335.3064
## 44 336.3255
## 45 337.0449
## 46 337.2439
## 47 339.3997
## 48 340.7610
## 49 345.3697
## 50 347.3378
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_gbm@model_ids[[1]])@parameters
modelo_gbm <- h2o.gbm(
  y = var_respuesta,
  x = predictores,
  training_frame = datos_train_h2o,
  ignore_const_cols = TRUE,
  learn_rate = mejores_hiperparam$learn_rate,
  max_depth = mejores_hiperparam$max_depth,
  ntrees = mejores_hiperparam$ntrees,
  sample_rate = mejores_hiperparam$sample_rate,
  # col_sample_rate was also tuned in the grid, so it is carried over too
  col_sample_rate = mejores_hiperparam$col_sample_rate,
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_gbm"
)
modelo_gbm
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model ID: modelo_gbm
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 466 466 67818 3
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 3 3.00000 4 8 6.85622
##
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 75523.37
## RMSE: 274.8152
## MAE: 209.4056
## RMSLE: 0.3639322
## Mean Residual Deviance : 75523.37
##
##
##
## H2ORegressionMetrics: gbm
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 97116.78
## RMSE: 311.6356
## MAE: 235.3906
## RMSLE: 0.4004822
## Mean Residual Deviance : 97116.78
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 235.570140 12.069691 230.102920 238.883470
## mean_residual_deviance 97077.510000 9869.390000 85383.720000 97712.195000
## mse 97077.510000 9869.390000 85383.720000 97712.195000
## r2 0.318327 0.078429 0.418854 0.393584
## residual_deviance 97077.510000 9869.390000 85383.720000 97712.195000
## rmse 311.206540 15.916158 292.204930 312.589500
## rmsle 0.400072 0.028570 0.411142 0.415139
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 229.785260 250.231190 249.778350 230.258200
## mean_residual_deviance 103064.650000 103995.010000 104626.050000 82253.020000
## mse 103064.650000 103995.010000 104626.050000 82253.020000
## r2 0.313951 0.245723 0.237618 0.379358
## residual_deviance 103064.650000 103995.010000 104626.050000 82253.020000
## rmse 321.036830 322.482570 323.459500 286.797880
## rmsle 0.386631 0.451339 0.395178 0.387468
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 229.207870 246.339260 240.301450 210.813540
## mean_residual_deviance 96906.490000 97043.195000 113370.180000 86420.550000
## mse 96906.490000 97043.195000 113370.180000 86420.550000
## r2 0.363963 0.243924 0.206042 0.380255
## residual_deviance 96906.490000 97043.195000 113370.180000 86420.550000
## rmse 311.298070 311.517580 336.704900 293.973720
## rmsle 0.366193 0.438896 0.378132 0.370600
Importance of the predictors
h2o.varimp(modelo_gbm)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 817602816.000000 1.000000 0.589683
## 2 anyo 243142080.000000 0.297384 0.175362
## 3 calefaccion 148122320.000000 0.181167 0.106831
## 4 situacion 103968176.000000 0.127162 0.074985
## 5 cocina 59727716.000000 0.073052 0.043078
## 6 banyo 13949906.000000 0.017062 0.010061
h2o.varimp_plot(modelo_gbm)
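As a complement to these impurity-based importances, SHAP contributions can be obtained for tree-based H2O models. A minimal sketch (not part of the original analysis):
# Sketch: SHAP summary of the GBM on the training data.
h2o.shap_summary_plot(modelo_gbm, newdata = datos_train_h2o)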
par(mfrow = c(1, 1))
pdp_plots <- h2o.partialPlot(
object = modelo_gbm,
data = datos_train_h2o,
cols = predictores,
nbins = 10,
plot = TRUE,
plot_stddev = TRUE
)
## Warning in plot.window(...): "medcol" is not a graphical parameter
## Warning in plot.window(...): "medlty" is not a graphical parameter
## Warning in plot.window(...): "staplelty" is not a graphical parameter
## Warning in plot.window(...): "boxlty" is not a graphical parameter
## (the same four warnings are repeated for plot.xy, axis, box and title)
custom_ice <- function(modelo_h2o, data, y, predictores = NA) {
  # Wrapper around the iml package: builds a Predictor object and returns a
  # list with one PDP + ICE plot per predictor.
  predictor <- Predictor$new(
    model = modelo_h2o,
    data = data,
    y = y,
    class = "regression"
  )
  # all() avoids a condition of length > 1 when a vector of predictors is given
  if (all(is.na(predictores))) {
    predictores <- colnames(data)
  }
  graficos <- list()
  for (i in seq_along(predictores)) {
    ice <- FeatureEffect$new(
      predictor = predictor,
      feature = predictores[i],
      method = "pdp+ice",
      grid.size = 20
    )
    graficos[[i]] <- plot(ice) + ggtitle(predictores[i])
  }
  return(graficos)
}
graficos_ice <- custom_ice(
modelo_h2o = modelo_gbm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]]
)
ggarrange(plotlist = graficos_ice, ncol = 2, nrow = 3)
explainer_gbm <- DALEXtra::explain_h2o(
model = modelo_gbm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_gbm"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_gbm
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 266.8135 , mean = 814.2347 , max = 1931.219
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -848.7658 , mean = -0.3544634 , max = 1585.59
## A new explainer has been created!
plot(model_performance(explainer_gbm))
Predictions
predicciones_train <- h2o.predict(
modelo_gbm,
newdata = datos_train_h2o
)
predicciones_train <- h2o.cbind(
datos_train_h2o["precio"],
predicciones_train
)
predicciones_train <- as.data.frame(predicciones_train)
predicciones_train <- predicciones_train %>%
mutate(
residuo = predict - precio
)
p1 <- ggplot(predicciones_train, aes(x = precio, y = predict)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", color = "red", linewidth = 1, se = FALSE) +
  labs(title = "Predictions vs real value") +
  theme_bw()
p2 <- ggplot(predicciones_train, aes(x = seq_len(nrow(predicciones_train)), y = residuo)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  labs(title = "Residuals of the model") +
  theme_bw()
p3 <- ggplot(predicciones_train, aes(x = residuo)) +
  geom_density() +
  geom_rug() +
  labs(title = "Distribution of the residuals") +
  theme_bw()
p4 <- ggplot(predicciones_train, aes(sample = residuo)) +
  stat_qq() +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "QQ-plot of the residuals") +
  theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2) %>%
  ggpubr::annotate_figure(
    top = ggpubr::text_grob("Diagnostic plots of the training residuals",
                            color = "black", face = "bold", size = 14)
  )
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
predicciones_test <- h2o.predict(
object = modelo_gbm,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
Test error
rmse_test_gbm <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the GBM model:", rmse_test_gbm)
## [1] "Test error (rmse) of the GBM model: 277.349013437966"
7.4.- RANDOM FOREST
# Hyperparameters to compare (random search)
hiperparametros <- list(
ntrees = c(500, 1000, 2000),
max_depth = seq(2, 15, 1),
sample_rate = seq(0.5, 1.0, 0.2)
)
# Stopping criteria for the search
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 50, # Maximum number of combinations
  max_runtime_secs = 60*10, # Maximum search time
  stopping_tolerance = 0.001, # Minimum improvement
  stopping_rounds = 5,
  seed = 123
)
)
grid_rf <- h2o.grid(
  # Algorithm and parameters
  algorithm = "drf",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Early stopping
  score_tree_interval = 100,
  stopping_rounds = 5,
  stopping_metric = "rmse",
  stopping_tolerance = 0.01,
  # Hyperparameters to optimize
  hyper_params = hiperparametros,
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  # Type of search
  search_criteria = search_criteria,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_rf"
)
)
# Models sorted from lowest to highest RMSE
resultados_grid_rf <- h2o.getGrid(
grid_id = "grid_rf",
sort_by = "rmse",
decreasing = FALSE
)
as.data.frame(resultados_grid_rf@summary_table)
## max_depth ntrees sample_rate model_ids rmse
## 1 12 1000 0.5 grid_rf_model_16 306.2439
## 2 9 500 0.5 grid_rf_model_12 306.7754
## 3 10 773 0.7 grid_rf_model_23 307.0377
## 4 8 1000 0.7 grid_rf_model_10 307.0578
## 5 9 500 0.9 grid_rf_model_2 307.2698
## 6 11 500 0.7 grid_rf_model_19 307.2860
## 7 10 1000 0.9 grid_rf_model_18 307.4147
## 8 9 176 0.7 grid_rf_model_24 307.6119
## 9 11 733 0.9 grid_rf_model_6 307.7980
## 10 11 500 0.9 grid_rf_model_8 307.9190
## 11 14 933 0.7 grid_rf_model_5 308.2271
## 12 9 38 0.5 grid_rf_model_25 308.2501
## 13 12 1000 0.9 grid_rf_model_3 308.5213
## 14 13 500 0.9 grid_rf_model_13 309.5815
## 15 5 633 0.5 grid_rf_model_11 309.7947
## 16 14 1000 0.9 grid_rf_model_14 310.5175
## 17 15 1000 0.9 grid_rf_model_15 311.3317
## 18 4 633 0.9 grid_rf_model_21 313.9828
## 19 4 500 0.9 grid_rf_model_7 314.3164
## 20 13 6 0.5 grid_rf_model_26 318.9769
## 21 3 666 0.5 grid_rf_model_4 319.9363
## 22 2 1000 0.5 grid_rf_model_17 331.0745
## 23 2 700 0.7 grid_rf_model_9 331.1429
## 24 2 700 0.9 grid_rf_model_1 331.2879
## 25 2 1000 0.9 grid_rf_model_22 331.5014
## 26 2 500 0.7 grid_rf_model_20 331.9818
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_rf@model_ids[[1]])@parameters
modelo_rf <- h2o.randomForest(
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Hyperparameters
  max_depth = mejores_hiperparam$max_depth,
  ntrees = mejores_hiperparam$ntrees,
  sample_rate = mejores_hiperparam$sample_rate,
  # Validation strategy (to compare with the other models)
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_rf"
)
)
modelo_rf
## Model Details:
## ==============
##
## H2ORegressionModel: drf
## Model ID: modelo_rf
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 1000 1000 2648588 11
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 12 11.99900 27 374 206.18200
##
##
## H2ORegressionMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 94566.81
## RMSE: 307.5172
## MAE: 233.733
## RMSLE: 0.3967076
## Mean Residual Deviance : 94566.81
##
##
##
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 95290.25
## RMSE: 308.6912
## MAE: 234.4518
## RMSLE: 0.397774
## Mean Residual Deviance : 95290.25
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 234.642360 8.429452 233.995880 240.281880
## mean_residual_deviance 95289.000000 8240.292000 84367.280000 97375.580000
## mse 95289.000000 8240.292000 84367.280000 97375.580000
## r2 0.331576 0.059268 0.425773 0.395673
## residual_deviance 95289.000000 8240.292000 84367.280000 97375.580000
## rmse 308.428100 13.379117 290.460450 312.050600
## rmsle 0.397601 0.024669 0.416650 0.414135
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 225.879640 241.203200 245.314100 228.435380
## mean_residual_deviance 97183.080000 97423.550000 97519.125000 82435.730000
## mse 97183.080000 97423.550000 97519.125000 82435.730000
## r2 0.353102 0.293386 0.289404 0.377979
## residual_deviance 97183.080000 97423.550000 97519.125000 82435.730000
## rmse 311.742000 312.127440 312.280520 287.116200
## rmsle 0.385071 0.439713 0.387037 0.384122
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 231.670460 241.904680 239.037920 218.700550
## mean_residual_deviance 104328.190000 92474.660000 109303.320000 90479.484000
## mse 104328.190000 92474.660000 109303.320000 90479.484000
## r2 0.315252 0.279518 0.234523 0.351147
## residual_deviance 104328.190000 92474.660000 109303.320000 90479.484000
## rmse 322.998750 304.096470 330.610530 300.798070
## rmsle 0.369934 0.427657 0.375001 0.376689
Importance of the predictors
h2o.varimp(modelo_rf)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 41929367552.000000 1.000000 0.553430
## 2 anyo 14318920704.000000 0.341501 0.188997
## 3 calefaccion 8895087616.000000 0.212145 0.117407
## 4 situacion 4839642624.000000 0.115424 0.063879
## 5 cocina 4496422400.000000 0.107238 0.059349
## 6 banyo 1283339904.000000 0.030607 0.016939
h2o.varimp_plot(modelo_rf)
pdp_plots <- h2o.partialPlot(
object = modelo_rf,
data = datos_train_h2o,
cols = predictores,
nbins = 20,
plot = TRUE,
plot_stddev = TRUE
)
## (the block of "medcol"/"medlty"/"staplelty"/"boxlty" warnings already shown
## for the GBM partial plots is emitted again here)
# custom_ice() was already defined in section 7.3 and is reused here.
graficos_ice <- custom_ice(
  # The model explained here is the random forest, not the GBM
  modelo_h2o = modelo_rf,
  data = datos_train[, predictores],
  y = datos_train[[var_respuesta]]
)
ggarrange(plotlist = graficos_ice, ncol = 2, nrow = 3)
explainer_rf <- DALEXtra::explain_h2o(
model = modelo_rf,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_rf"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_rf
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 303.2364 , mean = 813.4317 , max = 1718.506
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -777.2178 , mean = 0.4485718 , max = 1338.003
## A new explainer has been created!
plot(model_performance(explainer_rf))
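The explainers created for the three models can also be overlaid in a single residual plot; a sketch using only the objects created above:
# Sketch: compare the distribution of the residuals of the three models.
plot(model_performance(explainer_glm),
     model_performance(explainer_gbm),
     model_performance(explainer_rf))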
Predictions
predicciones_test <- h2o.predict(
object = modelo_rf,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
rmse_test_rf <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the Random Forest model:", rmse_test_rf)
## [1] "Test error (rmse) of the Random Forest model: 286.52609074536"
7.5.- ENSEMBLE
A stacked ensemble combines the predictions of the base models trained above (GLM, random forest and GBM) through a metalearner, a GLM by default, fitted on their cross-validation predictions.
modelo_ensemble <- h2o.stackedEnsemble(
y = var_respuesta,
x = predictores,
training_frame = datos_train_h2o,
base_models = list(modelo_glm,
modelo_rf,
modelo_gbm),
model_id = "modelo_ensemble"
)
modelo_ensemble
## Model Details:
## ==============
##
## H2ORegressionModel: stackedensemble
## Model ID: modelo_ensemble
## Model Summary for Stacked Ensemble:
## key value
## 1 Stacking strategy cross_validation
## 2 Number of base models (used / total) 2/3
## 3 # GBM base models (used / total) 1/1
## 4 # GLM base models (used / total) 0/1
## 5 # DRF base models (used / total) 1/1
## 6 Metalearner algorithm GLM
## 7 Metalearner fold assignment scheme AUTO
## 8 Metalearner nfolds 0
## 9 Metalearner fold_column NA
## 10 Custom metalearner hyperparameters None
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 65962.22
## RMSE: 256.8311
## MAE: 196.7224
## RMSLE: 0.3429918
## Mean Residual Deviance : 65962.22
predicciones_test <- h2o.predict(
object = modelo_ensemble,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
rmse_test_ensemble <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the ensemble model:", rmse_test_ensemble)
## [1] "Test error (rmse) of the ensemble model: 281.723604652084"
8.- CONCLUSIONS
paste("Test error (rmse) of the GLM model:", rmse_test_glm)
## [1] "Test error (rmse) of the GLM model: 378.70675549663"
paste("Test error (rmse) of the GBM model:", rmse_test_gbm)
## [1] "Test error (rmse) of the GBM model: 277.349013437966"
paste("Test error (rmse) of the Random Forest model:", rmse_test_rf)
## [1] "Test error (rmse) of the Random Forest model: 286.52609074536"
paste("Test error (rmse) of the ensemble model:", rmse_test_ensemble)
## [1] "Test error (rmse) of the ensemble model: 281.723604652084"
The GBM model gives the best predictions: it achieves the lowest test RMSE (277.3), followed by the stacked ensemble (281.7), the random forest (286.5) and, far behind, the GLM (378.7).