1.- LIBRARIES
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
##
## Attaching package: 'h2o'
##
## The following objects are masked from 'package:lubridate':
##
## day, hour, month, week, year
##
## The following objects are masked from 'package:stats':
##
## cor, sd, var
##
## The following objects are masked from 'package:base':
##
## %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library(DALEX)
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
##
##
## Attaching package: 'DALEX'
##
## The following object is masked from 'package:dplyr':
##
## explain
library(DALEXtra)
library(iml)
library(skimr)
library(DataExplorer)
library(ggpubr)
library(univariateML)
library(recipes)
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
2.- DATA
The rent dataset, available in the gamlss.data package, contains information on the rental price of 1969 dwellings located in Munich in 1993. In addition to the price, it includes 8 more variables:
R: rental price.
Fl: floor area of the dwelling in square metres.
A: year of construction.
B: whether it has a bathroom (1) or not (0).
H: whether it has central heating (1) or not (0).
L: whether the kitchen is equipped (1) or not (0).
Sp: whether the quality of the neighbourhood where the dwelling is located is above average (1) or not (0).
Sm: whether the quality of the neighbourhood where the dwelling is located is below average (1) or not (0).
loc: combination of Sp and Sm indicating whether the quality of the neighbourhood where the dwelling is located is below (1), equal to (2) or above (3) the average.
data("rent", package = "gamlss.data")
datos <- rent
str(datos)
## 'data.frame': 1969 obs. of 9 variables:
## $ R : num 693 422 737 732 1295 ...
## $ Fl : num 50 54 70 50 55 59 46 94 93 65 ...
## $ A : num 1972 1972 1972 1972 1893 ...
## $ Sp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ H : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ L : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
# Sp and Sm are dropped, since their information is already captured by loc.
datos <- datos[, !(names(datos) %in% c("Sp", "Sm"))]
str(datos)
## 'data.frame': 1969 obs. of 7 variables:
## $ R : num 693 422 737 732 1295 ...
## $ Fl : num 50 54 70 50 55 59 46 94 93 65 ...
## $ A : num 1972 1972 1972 1972 1893 ...
## $ B : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ H : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ L : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
# Columns are renamed to be more descriptive.
colnames(datos) <- c("precio", "metros", "anyo", "banyo", "calefaccion", "cocina", "situacion")
str(datos)
## 'data.frame': 1969 obs. of 7 variables:
## $ precio : num 693 422 737 732 1295 ...
## $ metros : num 50 54 70 50 55 59 46 94 93 65 ...
## $ anyo : num 1972 1972 1972 1972 1893 ...
## $ banyo : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ calefaccion: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ cocina : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ situacion : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
3.- EDA
skim(datos)
| Name | datos |
| Number of rows | 1969 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| banyo | 0 | 1 | FALSE | 2 | 0: 1925, 1: 44 |
| calefaccion | 0 | 1 | FALSE | 2 | 0: 1580, 1: 389 |
| cocina | 0 | 1 | FALSE | 2 | 0: 1808, 1: 161 |
| situacion | 0 | 1 | FALSE | 3 | 2: 1247, 3: 550, 1: 172 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| precio | 0 | 1 | 811.88 | 379.00 | 101.7 | 544.2 | 737.8 | 1022 | 3000 | ▇▇▂▁▁ |
| metros | 0 | 1 | 67.73 | 20.86 | 30.0 | 52.0 | 67.0 | 82 | 120 | ▆▇▇▅▂ |
| anyo | 0 | 1 | 1948.48 | 29.02 | 1890.0 | 1934.0 | 1957.0 | 1972 | 1988 | ▃▁▃▇▇ |
head(datos)
## precio metros anyo banyo calefaccion cocina situacion
## 1 693.3 50 1972 0 0 0 2
## 2 422.0 54 1972 0 0 0 2
## 3 736.6 70 1972 0 0 0 2
## 4 732.2 50 1972 0 0 0 2
## 5 1295.1 55 1893 0 0 0 2
## 6 1195.9 59 1893 0 0 0 2
4.- Missing values
# Number of missing values per variable
datos %>% map_dbl(.f = function(x){sum(is.na(x))})
## precio metros anyo banyo calefaccion cocina
## 0 0 0 0 0 0
## situacion
## 0
plot_missing(
data = datos,
title = "Porcentaje de valores ausentes",
ggtheme = theme_bw(),
theme_config = list(legend.position = "bottom")
)
5.- Response variable
p1 <- ggplot(data = datos, aes(x = precio)) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Original distribution", x = "precio") +
  theme_bw()
p2 <- ggplot(data = datos, aes(x = sqrt(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Square-root transformation", x = "sqrt(precio)") +
  theme_bw()
p3 <- ggplot(data = datos, aes(x = log10(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Log10 transformation", x = "log10(precio)") +
  theme_bw()
p4 <- ggplot(data = datos, aes(x = log(precio))) +
  geom_density(fill = "steelblue", alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Natural log transformation", x = "log(precio)") +
  theme_bw()
ggarrange(p1, p2, p3, p4, ncol = 1, align = "v")
# Table of the main summary statistics
summary(datos$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.7 544.2 737.8 811.9 1022.0 3000.0
Some machine learning and statistical learning models require the response variable to follow a particular distribution. For example, linear models (LM) assume a normal distribution, while generalized linear models (GLM) require a distribution from the exponential family. Several R packages help identify which distribution best fits the data; one of them is univariateML. To learn more about identifying distributions, see Ajuste de distribuciones con R.
comparacion_aic <- AIC(
mlbetapr(datos$precio),
mlexp(datos$precio),
mlinvgamma(datos$precio),
mlgamma(datos$precio),
mllnorm(datos$precio),
mlrayleigh(datos$precio),
mlinvgauss(datos$precio),
mlweibull(datos$precio),
mlinvweibull(datos$precio),
mllgamma(datos$precio)
)
comparacion_aic %>% rownames_to_column(var = "distribucion") %>% arrange(AIC)
## distribucion df AIC
## 1 mlgamma(datos$precio) 2 28615.58
## 2 mllnorm(datos$precio) 2 28657.23
## 3 mlinvgauss(datos$precio) 2 28690.03
## 4 mlweibull(datos$precio) 2 28746.25
## 5 mlrayleigh(datos$precio) 1 28797.71
## 6 mlbetapr(datos$precio) 2 28867.94
## 7 mlinvgamma(datos$precio) 2 28869.14
## 8 mlinvweibull(datos$precio) 2 29245.12
## 9 mlexp(datos$precio) 1 30322.05
## 10 mllgamma(datos$precio) 2 Inf
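As a quick visual check, univariateML objects can be drawn on top of a histogram with the package's lines() method. A minimal sketch (not part of the original analysis) for the best candidate, the gamma distribution:
# Sketch: overlay the fitted gamma density on the empirical distribution.
hist(datos$precio, freq = FALSE, breaks = 30,
     main = "Gamma fit to precio", xlab = "precio")
lines(mlgamma(datos$precio), lwd = 2, col = "firebrick")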
6.- Continuous and qualitative variables
plot_density(
data = datos %>% select(-precio),
ncol= 3,
title = "Distribución variables continuas",
ggtheme = theme_bw(),
theme_config = list(
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(colour = "black", size = 12, face = 2)
)
)
custom_corr_plot <- function(variable1, variable2, df, alpha = 0.3){
  p <- df %>%
    mutate(
      # Trick to display the title with a facet-like style
      title = paste(toupper(variable2), "vs", toupper(variable1))
    ) %>%
    ggplot(aes(x = !!sym(variable1), y = !!sym(variable2))) +
    geom_point(alpha = alpha) +
    # Non-linear trend
    geom_smooth(se = FALSE, method = "gam", formula = y ~ splines::bs(x, 3)) +
    # Linear trend
    geom_smooth(se = FALSE, method = "lm", color = "firebrick") +
    facet_grid(. ~ title) +
    theme_bw() +
    theme(strip.text = element_text(colour = "black", size = 10, face = 2),
          axis.title = element_blank())
  return(p)
}
variables_continuas <- c("anyo", "metros")
plots <- map(
.x = variables_continuas,
.f = custom_corr_plot,
variable2 = "precio",
df = datos
)
ggarrange(plotlist = plots, ncol = 2, nrow = 1) %>%
annotate_figure(
top = text_grob("Correlación con el precio de alquiler", face = "bold", size = 16,
x = 0.4)
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
plot_correlation(
data = datos,
cor_args = list(use = "pairwise.complete.obs"),
type = "continuous",
title = "Matriz de correlación variables continuas",
theme_config = list(legend.position = "none",
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_blank(),
axis.text.x = element_text(angle = -45, hjust = +0.1)
)
)
GGally::ggscatmat(
data = datos %>% select_if(is.numeric),
alpha = 0.1) +
theme_bw() +
labs(title = "Correlación por pares") +
theme(
plot.title = element_text(size = 16, face = "bold"),
#axis.text = element_blank(),
strip.text = element_text(colour = "black", size = 10, face = 2)
)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
plot_bar(
datos,
ncol = 2,
title = "Número de observaciones por grupo",
ggtheme = theme_bw(),
theme_config = list(
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(colour = "black", size = 12, face = 2),
legend.position = "none"
)
)
custom_box_plot <- function(variable1, variable2, df, alpha = 0.3){
  p <- df %>%
    mutate(
      # Trick to display the title with a facet-like style
      title = paste(toupper(variable2), "vs", toupper(variable1))
    ) %>%
    ggplot(aes(x = !!sym(variable1), y = !!sym(variable2))) +
    geom_violin(alpha = alpha) +
    geom_boxplot(width = 0.1, outlier.shape = NA) +
    facet_grid(. ~ title) +
    theme_bw() +
    theme(strip.text = element_text(colour = "black", size = 10, face = 2),
          axis.title = element_blank())
  return(p)
}
variables_cualitativas <- c("banyo", "calefaccion", "cocina", "situacion")
plots <- map(
.x = variables_cualitativas,
.f = custom_box_plot,
variable2 = "precio",
df = datos
)
ggarrange(plotlist = plots, ncol = 2, nrow = 2) %>%
annotate_figure(
top = text_grob("Correlación con precio", face = "bold", size = 16,
x = 0.28)
)
If one of the levels of a qualitative variable has very few observations compared with the other levels, it can happen that, during cross-validation or bootstrapping, some partitions contain no observations of that class (zero variance), which can lead to errors. In this case, the variable banyo requires caution; a way to detect such predictors automatically is sketched after the table below.
table(datos$banyo)
##
## 0 1
## 1925 44
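As a safeguard, near-zero-variance predictors can be flagged automatically. A minimal sketch using the recipes package (already loaded above); step_nzv() marks predictors, such as banyo, whose minority level is very rare:
# Sketch: flag near-zero-variance predictors before resampling.
receta_nzv <- recipe(precio ~ ., data = datos) %>%
  step_nzv(all_predictors()) %>%
  prep()
# Predictors removed by the step (if any)
tidy(receta_nzv, number = 1)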
7.- Models
# Create a local cluster with all the available cores.
h2o.init(ip = "localhost",
         # -1 means that all available cores are used.
         nthreads = -1,
         # Maximum memory available to the cluster.
         max_mem_size = "6g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 41 minutes 21 seconds
## H2O cluster timezone: Europe/Paris
## H2O data parsing timezone: UTC
## H2O cluster version: 3.42.0.2
## H2O cluster version age: 5 months and 2 days
## H2O cluster name: H2O_started_from_R_Usuario_kmr404
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.23 GB
## H2O cluster total cores: 20
## H2O cluster allowed cores: 20
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.3.2 (2023-10-31 ucrt)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is (5 months and 2 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
# Any data already in the cluster is removed, in case it had been started before.
h2o.removeAll()
h2o.no_progress()
datos_h2o <- as.h2o(x = datos, destination_frame = "datos_h2o")
7.1.- TRAIN/TEST SPLIT
set.seed(123)
particiones <- h2o.splitFrame(data = datos_h2o, ratios = c(0.8), seed = 123)
datos_train_h2o <- h2o.assign(data = particiones[[1]], key = "datos_train_h2o")
datos_test_h2o <- h2o.assign(data = particiones[[2]], key = "datos_test_h2o")
datos_train <- as.data.frame(datos_train_h2o)
datos_test <- as.data.frame(datos_test_h2o)
summary(datos_train$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.7 544.7 741.6 813.9 1026.5 3000.0
summary(datos_test$precio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 127.1 537.8 718.6 804.0 1000.0 2764.5
summary(datos_train$banyo)/nrow(datos_train)*100
## 0 1
## 97.963081 2.036919
summary(datos_test$banyo)/nrow(datos_test)*100
## 0 1
## 96.984925 3.015075
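h2o.splitFrame() does not stratify, so a rare level such as banyo = 1 can end up unevenly distributed between train and test. A sketch of a stratified alternative, assuming the rsample package is installed (it is not loaded above):
# Sketch: stratified split that preserves the proportion of banyo levels.
library(rsample)
set.seed(123)
split_estrat <- initial_split(datos, prop = 0.8, strata = banyo)
prop.table(table(training(split_estrat)$banyo))
prop.table(table(testing(split_estrat)$banyo))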
7.2.- GLM MODEL
Training
# Define the response variable and the predictors.
var_respuesta <- "precio"
# All the available predictors are used for this model.
predictores <- setdiff(h2o.colnames(datos_train_h2o), var_respuesta)
Hyperparameter optimization
# Values of alpha to be compared.
hiperparametros <- list(alpha = seq(0, 1, 0.1))
grid_glm <- h2o.grid(
  # Algorithm and parameters
  algorithm = "glm",
  family = "gamma",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  standardize = TRUE,
  missing_values_handling = "Skip",
  ignore_const_cols = TRUE,
  # Hyperparameters
  hyper_params = hiperparametros,
  # Type of search
  search_criteria = list(strategy = "Cartesian"),
  lambda_search = TRUE,
  # Automatic selection of the appropriate solver
  solver = "AUTO",
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_glm"
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 12!
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 71!
## (the same warning is repeated for each of the remaining alpha values)
## Warning in h2o.getGrid(grid_id = grid_id): Adding alpha array to hyperparameter
## runs slower with gridsearch. This is due to the fact that the algo has to run
## initialization for every alpha value. Setting the alpha array as a model
## parameter will skip the initialization and run faster overall.
# Models sorted from lowest to highest RMSE
resultados_grid_glm <- h2o.getGrid(
grid_id = "grid_glm",
sort_by = "rmse",
decreasing = FALSE
)
## Warning in h2o.getGrid(grid_id = "grid_glm", sort_by = "rmse", decreasing =
## FALSE): Adding alpha array to hyperparameter runs slower with gridsearch. This
## is due to the fact that the algo has to run initialization for every alpha
## value. Setting the alpha array as a model parameter will skip the
## initialization and run faster overall.
as.data.frame(resultados_grid_glm@summary_table)
## alpha model_ids rmse
## 1 0.1 grid_glm_model_2 310.2497
## 2 0.2 grid_glm_model_3 310.2498
## 3 0.30000000000000004 grid_glm_model_4 310.2498
## 4 0.4 grid_glm_model_5 310.2499
## 5 0.5 grid_glm_model_6 310.2499
## 6 0.6000000000000001 grid_glm_model_7 310.2499
## 7 0.7000000000000001 grid_glm_model_8 310.2499
## 8 0.8 grid_glm_model_9 310.2499
## 9 0.9 grid_glm_model_10 310.2499
## 10 1.0 grid_glm_model_11 310.2499
## 11 0.0 grid_glm_model_1 310.3075
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_glm@model_ids[[1]])@parameters
modelo_glm <- h2o.glm(
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  standardize = TRUE,
  missing_values_handling = "Skip",
  ignore_const_cols = TRUE,
  # Hyperparameters
  alpha = mejores_hiperparam$alpha,
  lambda_search = TRUE,
  # Automatic selection of the appropriate solver
  solver = "AUTO",
  # Validation strategy (to compare with the other models)
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_glm"
)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Reached maximum
## number of iterations 3!
Influence of the predictors
After regularization, only metros keeps a non-zero importance; consistently, the explainer output further below shows almost constant predictions (around 814), which anticipates the poor performance of this model.
h2o.varimp(modelo_glm)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 0.010826 1.000000 1.000000
## 2 situacion.1 0.000000 0.000000 0.000000
## 3 situacion.2 0.000000 0.000000 0.000000
## 4 situacion.3 0.000000 0.000000 0.000000
## 5 calefaccion.0 0.000000 0.000000 0.000000
## 6 calefaccion.1 0.000000 0.000000 0.000000
## 7 cocina.0 0.000000 0.000000 0.000000
## 8 cocina.1 0.000000 0.000000 0.000000
## 9 banyo.0 0.000000 0.000000 0.000000
## 10 banyo.1 0.000000 0.000000 0.000000
## 11 anyo 0.000000 0.000000 0.000000
h2o.varimp_plot(modelo_glm)
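The coefficients behind this ranking can be inspected directly; a short sketch using h2o's coefficient accessors:
# Sketch: coefficients of the regularized GLM, on the original and the
# standardized scale.
h2o.coef(modelo_glm)
h2o.coef_norm(modelo_glm)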
Diagnosis of the residuals
explainer_glm <- DALEXtra::explain_h2o(
model = modelo_glm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_glm"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_glm
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 813.8608 , mean = 813.8803 , max = 813.9076
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -712.1723 , mean = 2.96012e-13 , max = 2186.092
## A new explainer has been created!
plot(model_performance(explainer_glm))
predicciones_train <- h2o.predict(
modelo_glm,
newdata = datos_train_h2o
)
predicciones_train <- h2o.cbind(
datos_train_h2o["precio"],
predicciones_train
)
predicciones_train <- as.data.frame(predicciones_train)
predicciones_train <- predicciones_train %>%
mutate(
residuo = predict - precio
)
p1 <- ggplot(predicciones_train, aes(x = precio, y = predict)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", color = "red", linewidth = 1, se = TRUE) +
  labs(title = "Predictions vs real value") +
  theme_bw()
p2 <- ggplot(predicciones_train, aes(x = seq_len(nrow(predicciones_train)), y = residuo)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  labs(title = "Residuals of the model") +
  theme_bw()
p3 <- ggplot(predicciones_train, aes(x = residuo)) +
  geom_density() +
  geom_rug() +
  labs(title = "Distribution of the residuals") +
  theme_bw()
p4 <- ggplot(predicciones_train, aes(sample = residuo)) +
  stat_qq() +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "QQ-plot of the residuals") +
  theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2) %>%
  ggpubr::annotate_figure(
    top = ggpubr::text_grob("Diagnostic plots of the training residuals",
                            color = "black", face = "bold", size = 14)
  )
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
predicciones_test <- h2o.predict(
object = modelo_glm,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
Test error
rmse_test_glm <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the GLM model:", rmse_test_glm)
## [1] "Test error (rmse) of the GLM model: 378.70675549663"
7.3.- GBM
# Hyperparameters to compare (random search)
hiperparametros <- list(
ntrees = c(500, 1000, 2000),
learn_rate = seq(0.01, 0.1, 0.01),
max_depth = seq(2, 15, 1),
sample_rate = seq(0.5, 1.0, 0.2),
col_sample_rate = seq(0.1, 1.0, 0.3)
)
# Stopping criteria for the search
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 50, # Maximum number of combinations
  max_runtime_secs = 60*10, # Maximum search time
  stopping_tolerance = 0.001, # Minimum improvement
  stopping_rounds = 5,
  seed = 123
)
grid_gbm <- h2o.grid(
  # Algorithm and parameters
  algorithm = "gbm",
  distribution = "gaussian",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Early stopping
  score_tree_interval = 100,
  stopping_rounds = 3,
  stopping_metric = "rmse",
  stopping_tolerance = 0.01,
  # Hyperparameters to optimize
  hyper_params = hiperparametros,
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  # Type of search
  search_criteria = search_criteria,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_gbm"
)
# Models sorted from lowest to highest RMSE
resultados_grid_gbm <- h2o.getGrid(
grid_id = "grid_gbm",
sort_by = "rmse",
decreasing = FALSE
)
as.data.frame(resultados_grid_gbm@summary_table)
## col_sample_rate learn_rate max_depth ntrees sample_rate model_ids
## 1 0.1 0.04 3 466 0.9 grid_gbm_model_33
## 2 0.4 0.01 4 533 0.7 grid_gbm_model_21
## 3 0.4 0.03 2 500 0.7 grid_gbm_model_12
## 4 0.1 0.01 10 500 0.5 grid_gbm_model_22
## 5 0.1 0.03 8 500 0.5 grid_gbm_model_7
## 6 0.4 0.01 6 500 0.9 grid_gbm_model_23
## 7 0.1 0.04 9 500 0.9 grid_gbm_model_13
## 8 0.4 0.02 5 500 0.5 grid_gbm_model_35
## 9 0.1 0.05 6 466 0.5 grid_gbm_model_11
## 10 0.7 0.01 8 433 0.5 grid_gbm_model_1
## 11 0.7 0.01 9 500 0.5 grid_gbm_model_4
## 12 0.1 0.06 7 500 0.5 grid_gbm_model_17
## 13 0.7 0.02 6 400 0.5 grid_gbm_model_45
## 14 0.1 0.06 14 433 0.5 grid_gbm_model_28
## 15 0.7 0.05 3 400 0.7 grid_gbm_model_2
## 16 0.1 0.09 12 500 0.9 grid_gbm_model_46
## 17 0.4 0.05 5 400 0.9 grid_gbm_model_36
## 18 0.7 0.02 15 400 0.5 grid_gbm_model_41
## 19 0.4 0.03 15 400 0.9 grid_gbm_model_9
## 20 1.0 0.07 3 500 0.9 grid_gbm_model_34
## 21 1.0 0.01 15 400 0.9 grid_gbm_model_50
## 22 0.4 0.08 5 500 0.5 grid_gbm_model_38
## 23 1.0 0.01 13 500 0.9 grid_gbm_model_32
## 24 0.7 0.05 5 400 0.5 grid_gbm_model_19
## 25 0.4 0.05 10 400 0.9 grid_gbm_model_44
## 26 1.0 0.03 6 500 0.7 grid_gbm_model_18
## 27 0.7 0.02 10 400 0.9 grid_gbm_model_25
## 28 0.4 0.10 5 466 0.5 grid_gbm_model_48
## 29 0.4 0.06 11 400 0.5 grid_gbm_model_24
## 30 0.4 0.08 14 433 0.5 grid_gbm_model_26
## 31 0.4 0.07 8 500 0.9 grid_gbm_model_27
## 32 1.0 0.05 6 500 0.5 grid_gbm_model_42
## 33 0.7 0.03 11 400 0.7 grid_gbm_model_37
## 34 0.7 0.04 8 400 0.7 grid_gbm_model_8
## 35 0.4 0.09 8 400 0.7 grid_gbm_model_6
## 36 0.4 0.10 8 400 0.9 grid_gbm_model_29
## 37 1.0 0.05 13 400 0.5 grid_gbm_model_39
## 38 0.4 0.08 15 500 0.7 grid_gbm_model_20
## 39 1.0 0.04 12 400 0.7 grid_gbm_model_10
## 40 0.7 0.07 10 400 0.5 grid_gbm_model_30
## 41 1.0 0.05 14 500 0.5 grid_gbm_model_31
## 42 0.7 0.08 10 400 0.5 grid_gbm_model_47
## 43 1.0 0.06 10 400 0.9 grid_gbm_model_16
## 44 1.0 0.09 7 400 0.9 grid_gbm_model_14
## 45 1.0 0.07 14 400 0.7 grid_gbm_model_3
## 46 0.7 0.10 12 400 0.5 grid_gbm_model_15
## 47 1.0 0.06 15 400 0.9 grid_gbm_model_43
## 48 1.0 0.09 9 400 0.7 grid_gbm_model_49
## 49 1.0 0.08 12 400 0.9 grid_gbm_model_5
## 50 1.0 0.10 15 400 0.7 grid_gbm_model_40
## rmse
## 1 304.6250
## 2 304.7685
## 3 304.8417
## 4 305.6167
## 5 305.8037
## 6 306.4812
## 7 306.6267
## 8 307.1674
## 9 307.6876
## 10 308.4450
## 11 309.2205
## 12 309.6394
## 13 309.6695
## 14 309.7567
## 15 310.3772
## 16 310.3804
## 17 311.6078
## 18 313.1005
## 19 314.5768
## 20 314.7106
## 21 315.1281
## 22 316.1080
## 23 316.9581
## 24 316.9640
## 25 317.0140
## 26 317.5602
## 27 317.6219
## 28 318.0550
## 29 318.8795
## 30 320.4766
## 31 320.6331
## 32 320.9059
## 33 321.1137
## 34 322.2029
## 35 322.3950
## 36 322.8808
## 37 324.6121
## 38 325.0035
## 39 327.1624
## 40 327.4183
## 41 328.3449
## 42 331.8125
## 43 335.3064
## 44 336.3255
## 45 337.0449
## 46 337.2439
## 47 339.3997
## 48 340.7610
## 49 345.3697
## 50 347.3378
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_gbm@model_ids[[1]])@parameters
modelo_gbm <- h2o.gbm(
  y = var_respuesta,
  x = predictores,
  training_frame = datos_train_h2o,
  ignore_const_cols = TRUE,
  learn_rate = mejores_hiperparam$learn_rate,
  max_depth = mejores_hiperparam$max_depth,
  ntrees = mejores_hiperparam$ntrees,
  sample_rate = mejores_hiperparam$sample_rate,
  # col_sample_rate was also tuned in the grid, so it is carried over too
  col_sample_rate = mejores_hiperparam$col_sample_rate,
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_gbm"
)
modelo_gbm
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model ID: modelo_gbm
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 466 466 67818 3
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 3 3.00000 4 8 6.85622
##
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 75523.37
## RMSE: 274.8152
## MAE: 209.4056
## RMSLE: 0.3639322
## Mean Residual Deviance : 75523.37
##
##
##
## H2ORegressionMetrics: gbm
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 97116.78
## RMSE: 311.6356
## MAE: 235.3906
## RMSLE: 0.4004822
## Mean Residual Deviance : 97116.78
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 235.570140 12.069691 230.102920 238.883470
## mean_residual_deviance 97077.510000 9869.390000 85383.720000 97712.195000
## mse 97077.510000 9869.390000 85383.720000 97712.195000
## r2 0.318327 0.078429 0.418854 0.393584
## residual_deviance 97077.510000 9869.390000 85383.720000 97712.195000
## rmse 311.206540 15.916158 292.204930 312.589500
## rmsle 0.400072 0.028570 0.411142 0.415139
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 229.785260 250.231190 249.778350 230.258200
## mean_residual_deviance 103064.650000 103995.010000 104626.050000 82253.020000
## mse 103064.650000 103995.010000 104626.050000 82253.020000
## r2 0.313951 0.245723 0.237618 0.379358
## residual_deviance 103064.650000 103995.010000 104626.050000 82253.020000
## rmse 321.036830 322.482570 323.459500 286.797880
## rmsle 0.386631 0.451339 0.395178 0.387468
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 229.207870 246.339260 240.301450 210.813540
## mean_residual_deviance 96906.490000 97043.195000 113370.180000 86420.550000
## mse 96906.490000 97043.195000 113370.180000 86420.550000
## r2 0.363963 0.243924 0.206042 0.380255
## residual_deviance 96906.490000 97043.195000 113370.180000 86420.550000
## rmse 311.298070 311.517580 336.704900 293.973720
## rmsle 0.366193 0.438896 0.378132 0.370600
Importance of the predictors
h2o.varimp(modelo_gbm)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 817602816.000000 1.000000 0.589683
## 2 anyo 243142080.000000 0.297384 0.175362
## 3 calefaccion 148122320.000000 0.181167 0.106831
## 4 situacion 103968176.000000 0.127162 0.074985
## 5 cocina 59727716.000000 0.073052 0.043078
## 6 banyo 13949906.000000 0.017062 0.010061
h2o.varimp_plot(modelo_gbm)
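As a complement to these impurity-based importances, SHAP contributions can be obtained for tree-based H2O models. A minimal sketch (not part of the original analysis):
# Sketch: SHAP summary of the GBM on the training data.
h2o.shap_summary_plot(modelo_gbm, newdata = datos_train_h2o)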
par(mfrow = c(1, 1))
pdp_plots <- h2o.partialPlot(
object = modelo_gbm,
data = datos_train_h2o,
cols = predictores,
nbins = 10,
plot = TRUE,
plot_stddev = TRUE
)
## Warning in plot.window(...): "medcol" is not a graphical parameter
## Warning in plot.window(...): "medlty" is not a graphical parameter
## Warning in plot.window(...): "staplelty" is not a graphical parameter
## Warning in plot.window(...): "boxlty" is not a graphical parameter
## (the same four warnings are repeated for plot.xy, axis, box and title)
custom_ice <- function(modelo_h2o, data, y, predictores = NA) {
  # Wrapper around the iml package: builds a Predictor object and returns a
  # list with one PDP + ICE plot per predictor.
  predictor <- Predictor$new(
    model = modelo_h2o,
    data = data,
    y = y,
    class = "regression"
  )
  # all() avoids a condition of length > 1 when a vector of predictors is given
  if (all(is.na(predictores))) {
    predictores <- colnames(data)
  }
  graficos <- list()
  for (i in seq_along(predictores)) {
    ice <- FeatureEffect$new(
      predictor = predictor,
      feature = predictores[i],
      method = "pdp+ice",
      grid.size = 20
    )
    graficos[[i]] <- plot(ice) + ggtitle(predictores[i])
  }
  return(graficos)
}
graficos_ice <- custom_ice(
modelo_h2o = modelo_gbm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]]
)
ggarrange(plotlist = graficos_ice, ncol = 2, nrow = 3)
explainer_gbm <- DALEXtra::explain_h2o(
model = modelo_gbm,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_gbm"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_gbm
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 266.8135 , mean = 814.2347 , max = 1931.219
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -848.7658 , mean = -0.3544634 , max = 1585.59
## A new explainer has been created!
plot(model_performance(explainer_gbm))
Predictions
predicciones_train <- h2o.predict(
modelo_gbm,
newdata = datos_train_h2o
)
predicciones_train <- h2o.cbind(
datos_train_h2o["precio"],
predicciones_train
)
predicciones_train <- as.data.frame(predicciones_train)
predicciones_train <- predicciones_train %>%
mutate(
residuo = predict - precio
)
p1 <- ggplot(predicciones_train, aes(x = precio, y = predict)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", color = "red", linewidth = 1, se = FALSE) +
  labs(title = "Predictions vs real value") +
  theme_bw()
p2 <- ggplot(predicciones_train, aes(x = seq_len(nrow(predicciones_train)), y = residuo)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  labs(title = "Residuals of the model") +
  theme_bw()
p3 <- ggplot(predicciones_train, aes(x = residuo)) +
  geom_density() +
  geom_rug() +
  labs(title = "Distribution of the residuals") +
  theme_bw()
p4 <- ggplot(predicciones_train, aes(sample = residuo)) +
  stat_qq() +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(title = "QQ-plot of the residuals") +
  theme_bw()
ggpubr::ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2) %>%
  ggpubr::annotate_figure(
    top = ggpubr::text_grob("Diagnostic plots of the training residuals",
                            color = "black", face = "bold", size = 14)
  )
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
predicciones_test <- h2o.predict(
object = modelo_gbm,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
Test error
rmse_test_gbm <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the GBM model:", rmse_test_gbm)
## [1] "Test error (rmse) of the GBM model: 277.349013437966"
7.4.- RANDOM FOREST
# Hyperparameters to compare (random search)
hiperparametros <- list(
ntrees = c(500, 1000, 2000),
max_depth = seq(2, 15, 1),
sample_rate = seq(0.5, 1.0, 0.2)
)
# Stopping criteria for the search
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 50, # Maximum number of combinations
  max_runtime_secs = 60*10, # Maximum search time
  stopping_tolerance = 0.001, # Minimum improvement
  stopping_rounds = 5,
  seed = 123
)
)
grid_rf <- h2o.grid(
  # Algorithm and parameters
  algorithm = "drf",
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Early stopping
  score_tree_interval = 100,
  stopping_rounds = 5,
  stopping_metric = "rmse",
  stopping_tolerance = 0.01,
  # Hyperparameters to optimize
  hyper_params = hiperparametros,
  # Validation strategy to select the best model
  seed = 123,
  nfolds = 3,
  # Type of search
  search_criteria = search_criteria,
  keep_cross_validation_predictions = TRUE,
  grid_id = "grid_rf"
)
)
# Models sorted from lowest to highest RMSE
resultados_grid_rf <- h2o.getGrid(
grid_id = "grid_rf",
sort_by = "rmse",
decreasing = FALSE
)
as.data.frame(resultados_grid_rf@summary_table)
## max_depth ntrees sample_rate model_ids rmse
## 1 12 1000 0.5 grid_rf_model_16 306.2439
## 2 9 500 0.5 grid_rf_model_12 306.7754
## 3 10 773 0.7 grid_rf_model_23 307.0377
## 4 8 1000 0.7 grid_rf_model_10 307.0578
## 5 9 500 0.9 grid_rf_model_2 307.2698
## 6 11 500 0.7 grid_rf_model_19 307.2860
## 7 10 1000 0.9 grid_rf_model_18 307.4147
## 8 9 176 0.7 grid_rf_model_24 307.6119
## 9 11 733 0.9 grid_rf_model_6 307.7980
## 10 11 500 0.9 grid_rf_model_8 307.9190
## 11 14 933 0.7 grid_rf_model_5 308.2271
## 12 9 38 0.5 grid_rf_model_25 308.2501
## 13 12 1000 0.9 grid_rf_model_3 308.5213
## 14 13 500 0.9 grid_rf_model_13 309.5815
## 15 5 633 0.5 grid_rf_model_11 309.7947
## 16 14 1000 0.9 grid_rf_model_14 310.5175
## 17 15 1000 0.9 grid_rf_model_15 311.3317
## 18 4 633 0.9 grid_rf_model_21 313.9828
## 19 4 500 0.9 grid_rf_model_7 314.3164
## 20 13 6 0.5 grid_rf_model_26 318.9769
## 21 3 666 0.5 grid_rf_model_4 319.9363
## 22 2 1000 0.5 grid_rf_model_17 331.0745
## 23 2 700 0.7 grid_rf_model_9 331.1429
## 24 2 700 0.9 grid_rf_model_1 331.2879
## 25 2 1000 0.9 grid_rf_model_22 331.5014
## 26 2 500 0.7 grid_rf_model_20 331.9818
# The model is retrained with the best hyperparameters
mejores_hiperparam <- h2o.getModel(resultados_grid_rf@model_ids[[1]])@parameters
modelo_rf <- h2o.randomForest(
  # Response variable and predictors
  y = var_respuesta,
  x = predictores,
  # Training data
  training_frame = datos_train_h2o,
  # Preprocessing
  ignore_const_cols = TRUE,
  # Hyperparameters
  max_depth = mejores_hiperparam$max_depth,
  ntrees = mejores_hiperparam$ntrees,
  sample_rate = mejores_hiperparam$sample_rate,
  # Validation strategy (to compare with the other models)
  seed = 123,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  model_id = "modelo_rf"
)
)
modelo_rf
## Model Details:
## ==============
##
## H2ORegressionModel: drf
## Model ID: modelo_rf
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 1000 1000 2648588 11
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 12 11.99900 27 374 206.18200
##
##
## H2ORegressionMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 94566.81
## RMSE: 307.5172
## MAE: 233.733
## RMSLE: 0.3967076
## Mean Residual Deviance : 94566.81
##
##
##
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 95290.25
## RMSE: 308.6912
## MAE: 234.4518
## RMSLE: 0.397774
## Mean Residual Deviance : 95290.25
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 234.642360 8.429452 233.995880 240.281880
## mean_residual_deviance 95289.000000 8240.292000 84367.280000 97375.580000
## mse 95289.000000 8240.292000 84367.280000 97375.580000
## r2 0.331576 0.059268 0.425773 0.395673
## residual_deviance 95289.000000 8240.292000 84367.280000 97375.580000
## rmse 308.428100 13.379117 290.460450 312.050600
## rmsle 0.397601 0.024669 0.416650 0.414135
## cv_3_valid cv_4_valid cv_5_valid cv_6_valid
## mae 225.879640 241.203200 245.314100 228.435380
## mean_residual_deviance 97183.080000 97423.550000 97519.125000 82435.730000
## mse 97183.080000 97423.550000 97519.125000 82435.730000
## r2 0.353102 0.293386 0.289404 0.377979
## residual_deviance 97183.080000 97423.550000 97519.125000 82435.730000
## rmse 311.742000 312.127440 312.280520 287.116200
## rmsle 0.385071 0.439713 0.387037 0.384122
## cv_7_valid cv_8_valid cv_9_valid cv_10_valid
## mae 231.670460 241.904680 239.037920 218.700550
## mean_residual_deviance 104328.190000 92474.660000 109303.320000 90479.484000
## mse 104328.190000 92474.660000 109303.320000 90479.484000
## r2 0.315252 0.279518 0.234523 0.351147
## residual_deviance 104328.190000 92474.660000 109303.320000 90479.484000
## rmse 322.998750 304.096470 330.610530 300.798070
## rmsle 0.369934 0.427657 0.375001 0.376689
Importance of the predictors
h2o.varimp(modelo_rf)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 metros 41929367552.000000 1.000000 0.553430
## 2 anyo 14318920704.000000 0.341501 0.188997
## 3 calefaccion 8895087616.000000 0.212145 0.117407
## 4 situacion 4839642624.000000 0.115424 0.063879
## 5 cocina 4496422400.000000 0.107238 0.059349
## 6 banyo 1283339904.000000 0.030607 0.016939
h2o.varimp_plot(modelo_rf)
pdp_plots <- h2o.partialPlot(
object = modelo_rf,
data = datos_train_h2o,
cols = predictores,
nbins = 20,
plot = TRUE,
plot_stddev = TRUE
)
## (the block of "medcol"/"medlty"/"staplelty"/"boxlty" warnings already shown
## for the GBM partial plots is emitted again here)
# custom_ice() was already defined in section 7.3 and is reused here.
graficos_ice <- custom_ice(
  # The model explained here is the random forest, not the GBM
  modelo_h2o = modelo_rf,
  data = datos_train[, predictores],
  y = datos_train[[var_respuesta]]
)
ggarrange(plotlist = graficos_ice, ncol = 2, nrow = 3)
explainer_rf <- DALEXtra::explain_h2o(
model = modelo_rf,
data = datos_train[, predictores],
y = datos_train[[var_respuesta]],
label = "modelo_rf"
)
## Preparation of a new explainer is initiated
## -> model label : modelo_rf
## -> data : 1571 rows 6 cols
## -> target variable : 1571 values
## -> predict function : yhat.H2ORegressionModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.42.0.2 , task regression ( default )
## -> predicted values : numerical, min = 303.2364 , mean = 813.4317 , max = 1718.506
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -777.2178 , mean = 0.4485718 , max = 1338.003
## A new explainer has been created!
plot(model_performance(explainer_rf))
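The explainers created for the three models can also be overlaid in a single residual plot; a sketch using only the objects created above:
# Sketch: compare the distribution of the residuals of the three models.
plot(model_performance(explainer_glm),
     model_performance(explainer_gbm),
     model_performance(explainer_rf))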
Predictions
predicciones_test <- h2o.predict(
object = modelo_rf,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
rmse_test_rf <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the Random Forest model:", rmse_test_rf)
## [1] "Test error (rmse) of the Random Forest model: 286.52609074536"
7.5.- ENSEMBLE
A stacked ensemble combines the predictions of the base models trained above (GLM, random forest and GBM) through a metalearner, a GLM by default, fitted on their cross-validation predictions.
modelo_ensemble <- h2o.stackedEnsemble(
y = var_respuesta,
x = predictores,
training_frame = datos_train_h2o,
base_models = list(modelo_glm,
modelo_rf,
modelo_gbm),
model_id = "modelo_ensemble"
)
modelo_ensemble
## Model Details:
## ==============
##
## H2ORegressionModel: stackedensemble
## Model ID: modelo_ensemble
## Model Summary for Stacked Ensemble:
## key value
## 1 Stacking strategy cross_validation
## 2 Number of base models (used / total) 2/3
## 3 # GBM base models (used / total) 1/1
## 4 # GLM base models (used / total) 0/1
## 5 # DRF base models (used / total) 1/1
## 6 Metalearner algorithm GLM
## 7 Metalearner fold assignment scheme AUTO
## 8 Metalearner nfolds 0
## 9 Metalearner fold_column NA
## 10 Custom metalearner hyperparameters None
##
##
## H2ORegressionMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 65962.22
## RMSE: 256.8311
## MAE: 196.7224
## RMSLE: 0.3429918
## Mean Residual Deviance : 65962.22
predicciones_test <- h2o.predict(
object = modelo_ensemble,
newdata = datos_test_h2o
)
predicciones_test <- as.vector(predicciones_test)
datos_test$prediccion <- predicciones_test
rmse_test_ensemble <- MLmetrics::RMSE(
  y_pred = datos_test$prediccion,
  y_true = datos_test$precio
)
paste("Test error (rmse) of the ensemble model:", rmse_test_ensemble)
## [1] "Test error (rmse) of the ensemble model: 281.723604652084"
8.- CONCLUSIONS
paste("Test error (rmse) of the GLM model:", rmse_test_glm)
## [1] "Test error (rmse) of the GLM model: 378.70675549663"
paste("Test error (rmse) of the GBM model:", rmse_test_gbm)
## [1] "Test error (rmse) of the GBM model: 277.349013437966"
paste("Test error (rmse) of the Random Forest model:", rmse_test_rf)
## [1] "Test error (rmse) of the Random Forest model: 286.52609074536"
paste("Test error (rmse) of the ensemble model:", rmse_test_ensemble)
## [1] "Test error (rmse) of the ensemble model: 281.723604652084"
The GBM model gives the best predictions: it achieves the lowest test RMSE (277.3), followed by the stacked ensemble (281.7), the random forest (286.5) and, far behind, the GLM (378.7).