La radiación solar es una variable clave para el diseño de sistemas fotovoltaicos, la planificación energética y la caracterización climática de un sitio. El objetivo de este trabajo es analizar los datos de una estación meteorológica de Cuenca y construir diversos modelos de regresión que permitan predecir la radiación global promedio a partir de variables meteorológicas medidas localmente.
Se comparan tres enfoques:
Regresión lineal múltiple.
Árbol de decisión.
KNN (k-vecinos más cercanos).
library(readxl)
datos_raw <- read_excel("Datos_Cuenca_completo.xlsx")
Limpiar nombres de variables
datos <- datos_raw %>% clean_names()
Revisar nombres resultantes
names(datos)
## [1] "date" "time"
## [3] "temp_in_terna_c" "rel_humidity_ave_percent"
## [5] "air_temp_ave_c" "wind_speed_ave_m_s"
## [7] "wind_speed_max_m_s" "wind_dir"
## [9] "rain_mm" "global_rad_min_w_m2"
## [11] "global_rad_ave_w_m2" "global_rad_max_w_m2"
Eliminar filas donde TODAS las variables numéricas son NA
datos <- datos %>%
filter(if_any(where(is.numeric), ~ !is.na(.)))
Conversión de fecha y hora
datos <- datos %>%
mutate(
# 1) Parsear fecha y hora
date = ymd(date),
time = hms(time),
# 2) Hora del día (0–23)
hour = hour(time),
# 3) Día del año y mes (estacionalidad)
doy = yday(date), # day of year: 1–365
month = month(date), # 1–12
# 4) Día de la semana (por si hubiera patrones)
dow = wday(date, week_start = 1), # 1 = lunes, ..., 7 = domingo
# 5) Codificación cíclica de la hora
hour_sin = sin(2 * pi * hour / 24),
hour_cos = cos(2 * pi * hour / 24),
# 6) Codificación cíclica del día del año
doy_sin = sin(2 * pi * doy / 365),
doy_cos = cos(2 * pi * doy / 365)
)
Resumen básico
skim(datos)
| Name | datos |
| Number of rows | 4066 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 18 |
| Timespan | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 144 | 0.96 | 2021-11-30 | 2021-12-30 | 2021-12-17 | 29 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| temp_in_terna_c | 0 | 1.00 | 18.13 | 5.65 | 8.97 | 13.45 | 16.04 | 23.22 | 33.02 | ▇▇▃▅▁ |
| rel_humidity_ave_percent | 0 | 1.00 | 77.63 | 16.75 | 27.11 | 65.44 | 82.74 | 91.05 | 99.29 | ▁▂▃▅▇ |
| air_temp_ave_c | 0 | 1.00 | 15.40 | 3.28 | 9.38 | 12.92 | 14.37 | 17.63 | 25.17 | ▃▇▃▂▁ |
| wind_speed_ave_m_s | 0 | 1.00 | 1.55 | 1.14 | 0.00 | 0.69 | 1.24 | 2.18 | 7.77 | ▇▃▁▁▁ |
| wind_speed_max_m_s | 0 | 1.00 | 2.65 | 1.73 | 0.00 | 1.22 | 2.17 | 3.73 | 11.29 | ▇▅▂▁▁ |
| wind_dir | 0 | 1.00 | 152.97 | 93.02 | 0.00 | 74.71 | 144.10 | 219.21 | 360.00 | ▇▇▇▆▃ |
| rain_mm | 0 | 1.00 | 0.02 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 5.40 | ▇▁▁▁▁ |
| global_rad_min_w_m2 | 0 | 1.00 | 160.96 | 253.43 | 0.00 | 0.00 | 3.69 | 257.04 | 1254.35 | ▇▂▁▁▁ |
| global_rad_ave_w_m2 | 0 | 1.00 | 215.19 | 323.10 | 0.00 | 0.00 | 6.46 | 347.27 | 1303.80 | ▇▂▁▁▁ |
| global_rad_max_w_m2 | 0 | 1.00 | 281.88 | 421.75 | 0.00 | 0.00 | 8.97 | 461.28 | 1690.08 | ▇▁▁▁▁ |
| hour | 0 | 1.00 | 11.58 | 6.95 | 0.00 | 6.00 | 12.00 | 18.00 | 23.00 | ▇▇▆▇▇ |
| doy | 144 | 0.96 | 350.87 | 7.87 | 334.00 | 344.00 | 351.00 | 358.00 | 364.00 | ▅▇▇▇▇ |
| month | 144 | 0.96 | 12.00 | 0.02 | 11.00 | 12.00 | 12.00 | 12.00 | 12.00 | ▁▁▁▁▇ |
| dow | 144 | 0.96 | 3.97 | 2.02 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
| hour_sin | 0 | 1.00 | -0.01 | 0.71 | -1.00 | -0.71 | 0.00 | 0.71 | 1.00 | ▇▅▂▅▇ |
| hour_cos | 0 | 1.00 | 0.01 | 0.71 | -1.00 | -0.71 | 0.00 | 0.71 | 1.00 | ▇▅▂▅▇ |
| doy_sin | 144 | 0.96 | -0.24 | 0.13 | -0.51 | -0.35 | -0.24 | -0.12 | -0.02 | ▅▇▇▇▇ |
| doy_cos | 144 | 0.96 | 0.96 | 0.03 | 0.86 | 0.94 | 0.97 | 0.99 | 1.00 | ▁▂▂▃▇ |
Variable type: Timespan
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time | 0 | 1 | 0 | 0 | 0 | 1 |
# Histogramas univariados solo con variables numéricas
# y eliminando explícitamente las columnas de fecha y hora
# 1) Construir data.frame numérico SIN date y SIN time
datos_num <- datos %>%
select(-any_of(c("date", "time"))) %>% # forzar exclusión de variables temporales crudas
select(where(is.numeric)) # quedarnos solo con numéricas
# (Opcional) Excluir variables cíclicas si ya las creaste y no quieres verlas aquí
datos_num <- datos_num %>%
select(-any_of(c("hour_sin", "hour_cos", "doy_sin", "doy_cos")))
# 2) Revisar en la consola qué columnas quedaron (útil para depurar)
str(datos_num)
## tibble [4,066 × 14] (S3: tbl_df/tbl/data.frame)
## $ temp_in_terna_c : num [1:4066] 18.4 18.4 27.4 27.3 27.3 ...
## $ rel_humidity_ave_percent: num [1:4066] 58.9 61.1 47.2 46.5 41.8 ...
## $ air_temp_ave_c : num [1:4066] 17.9 17.8 20.4 20.7 21.8 ...
## $ wind_speed_ave_m_s : num [1:4066] 0 0 1.94 2.38 1.92 2.26 2.58 1.89 2.09 2.87 ...
## $ wind_speed_max_m_s : num [1:4066] 0 0 3.59 3.87 3.8 4 5.03 4.14 3.8 4.55 ...
## $ wind_dir : num [1:4066] 360 360 269 283 161 ...
## $ rain_mm : num [1:4066] 0 0 0 0 0 0 0 0 0 0 ...
## $ global_rad_min_w_m2 : num [1:4066] 0.13 0.13 566.82 527.62 321.93 ...
## $ global_rad_ave_w_m2 : num [1:4066] 0.39 0.5 1104.97 1140.79 919.61 ...
## $ global_rad_max_w_m2 : num [1:4066] 0.56 0.78 1186.5 1281.92 1422.35 ...
## $ hour : num [1:4066] 21 21 18 18 18 18 19 19 19 19 ...
## $ doy : num [1:4066] 334 334 337 337 337 337 337 337 337 337 ...
## $ month : num [1:4066] 11 11 12 12 12 12 12 12 12 12 ...
## $ dow : num [1:4066] 2 2 5 5 5 5 5 5 5 5 ...
# 3) Generar histogramas
datos_long <- datos_num %>%
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "valor"
)
ggplot(datos_long, aes(x = valor)) +
geom_histogram(bins = 30) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal()
# 3.2 Correlaciones Correlaciones
# Seleccionar solo variables numéricas, excluyendo date/time crudas
datos_corr <- datos %>%
select(-any_of(c("date", "time"))) %>%
select(where(is.numeric))
# Matriz de correlación
mat_cor <- cor(datos_corr, use = "pairwise.complete.obs")
corrplot(
mat_cor,
method = "color",
type = "upper",
tl.cex = 0.7,
number.cex = 0.5
)
# Ver todas las columnas disponibles en 'datos'
names(datos)
## [1] "date" "time"
## [3] "temp_in_terna_c" "rel_humidity_ave_percent"
## [5] "air_temp_ave_c" "wind_speed_ave_m_s"
## [7] "wind_speed_max_m_s" "wind_dir"
## [9] "rain_mm" "global_rad_min_w_m2"
## [11] "global_rad_ave_w_m2" "global_rad_max_w_m2"
## [13] "hour" "doy"
## [15] "month" "dow"
## [17] "hour_sin" "hour_cos"
## [19] "doy_sin" "doy_cos"
La variable objetivo será la radiación global promedio.
variable_objetivo <- "global_rad_ave_w_m2" # <- AJUSTA AQUÍ
# 2) Asegurarnos de que esa columna existe en 'datos'
variable_objetivo %in% names(datos)
## [1] TRUE
# 3) Construir dataset de modelado:
# - Mantener SIEMPRE la columna objetivo,
# - Quitar date y time crudas,
# - Incluir el resto de numéricas,
# - Convertir la objetivo a numérica si hiciera falta,
# - Quitar filas con NA en la objetivo.
datos_modelo <- datos %>%
# Eliminar explícitamente date y time
select(-any_of(c("date", "time"))) %>%
# Mantener SIEMPRE la variable objetivo + todas las numéricas
select(all_of(variable_objetivo), where(is.numeric), everything()) %>%
# Por si la radiación está como character, la pasamos a numérica
mutate(
across(
.cols = all_of(variable_objetivo),
.fns = ~ suppressWarnings(as.numeric(.))
)
) %>%
# Eliminar filas SIN dato en la variable objetivo
drop_na(all_of(variable_objetivo))
# 4) Verificar cómo queda el dataset de modelado
skim(datos_modelo)
| Name | datos_modelo |
| Number of rows | 4066 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| global_rad_ave_w_m2 | 0 | 1.00 | 215.19 | 323.10 | 0.00 | 0.00 | 6.46 | 347.27 | 1303.80 | ▇▂▁▁▁ |
| temp_in_terna_c | 0 | 1.00 | 18.13 | 5.65 | 8.97 | 13.45 | 16.04 | 23.22 | 33.02 | ▇▇▃▅▁ |
| rel_humidity_ave_percent | 0 | 1.00 | 77.63 | 16.75 | 27.11 | 65.44 | 82.74 | 91.05 | 99.29 | ▁▂▃▅▇ |
| air_temp_ave_c | 0 | 1.00 | 15.40 | 3.28 | 9.38 | 12.92 | 14.37 | 17.63 | 25.17 | ▃▇▃▂▁ |
| wind_speed_ave_m_s | 0 | 1.00 | 1.55 | 1.14 | 0.00 | 0.69 | 1.24 | 2.18 | 7.77 | ▇▃▁▁▁ |
| wind_speed_max_m_s | 0 | 1.00 | 2.65 | 1.73 | 0.00 | 1.22 | 2.17 | 3.73 | 11.29 | ▇▅▂▁▁ |
| wind_dir | 0 | 1.00 | 152.97 | 93.02 | 0.00 | 74.71 | 144.10 | 219.21 | 360.00 | ▇▇▇▆▃ |
| rain_mm | 0 | 1.00 | 0.02 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 5.40 | ▇▁▁▁▁ |
| global_rad_min_w_m2 | 0 | 1.00 | 160.96 | 253.43 | 0.00 | 0.00 | 3.69 | 257.04 | 1254.35 | ▇▂▁▁▁ |
| global_rad_max_w_m2 | 0 | 1.00 | 281.88 | 421.75 | 0.00 | 0.00 | 8.97 | 461.28 | 1690.08 | ▇▁▁▁▁ |
| hour | 0 | 1.00 | 11.58 | 6.95 | 0.00 | 6.00 | 12.00 | 18.00 | 23.00 | ▇▇▆▇▇ |
| doy | 144 | 0.96 | 350.87 | 7.87 | 334.00 | 344.00 | 351.00 | 358.00 | 364.00 | ▅▇▇▇▇ |
| month | 144 | 0.96 | 12.00 | 0.02 | 11.00 | 12.00 | 12.00 | 12.00 | 12.00 | ▁▁▁▁▇ |
| dow | 144 | 0.96 | 3.97 | 2.02 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
| hour_sin | 0 | 1.00 | -0.01 | 0.71 | -1.00 | -0.71 | 0.00 | 0.71 | 1.00 | ▇▅▂▅▇ |
| hour_cos | 0 | 1.00 | 0.01 | 0.71 | -1.00 | -0.71 | 0.00 | 0.71 | 1.00 | ▇▅▂▅▇ |
| doy_sin | 144 | 0.96 | -0.24 | 0.13 | -0.51 | -0.35 | -0.24 | -0.12 | -0.02 | ▅▇▇▇▇ |
| doy_cos | 144 | 0.96 | 0.96 | 0.03 | 0.86 | 0.94 | 0.97 | 0.99 | 1.00 | ▁▂▂▃▇ |
set.seed(123)
indice_train <- createDataPartition(
y = datos_modelo[[variable_objetivo]],
p = 0.7,
list = FALSE
)
train_data <- datos_modelo[indice_train, ]
test_data <- datos_modelo[-indice_train, ]
# Eliminar cualquier fila que tenga NA en alguna variable
train_data <- train_data %>% drop_na()
test_data <- test_data %>% drop_na()
# Comprobación rápida
nrow(train_data); nrow(test_data)
## [1] 2756
## [1] 1166
sum(is.na(train_data))
## [1] 0
sum(is.na(test_data))
## [1] 0
set.seed(123)
control <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 3
)
# Fórmula general: y ~ todas las demás variables
form <- as.formula(
paste(variable_objetivo, "~ .")
)
set.seed(123)
modelo_lm <- train(
form,
data = train_data,
method = "lm",
trControl = control,
preProcess = c("center", "scale")
)
summary(modelo_lm$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.68 -7.57 -1.21 7.40 380.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 215.2574 1.0061 213.953 < 2e-16 ***
## temp_in_terna_c -22.1570 4.8352 -4.582 4.8e-06 ***
## rel_humidity_ave_percent 0.5786 3.9664 0.146 0.884035
## air_temp_ave_c 36.3333 5.3614 6.777 1.5e-11 ***
## wind_speed_ave_m_s 1.7333 4.4089 0.393 0.694258
## wind_speed_max_m_s 2.4014 4.9013 0.490 0.624207
## wind_dir 0.5612 1.0807 0.519 0.603594
## rain_mm 0.8300 1.0443 0.795 0.426802
## global_rad_min_w_m2 148.1939 2.1277 69.650 < 2e-16 ***
## global_rad_max_w_m2 171.5045 2.4141 71.042 < 2e-16 ***
## hour 0.1085 1.7801 0.061 0.951390
## doy 414.6081 437.9918 0.947 0.343920
## month -0.3016 1.0525 -0.287 0.774482
## dow -0.7116 1.0416 -0.683 0.494573
## hour_sin 4.3295 2.2415 1.932 0.053524 .
## hour_cos -6.5822 1.7595 -3.741 0.000187 ***
## doy_sin -401.6081 412.5242 -0.974 0.330372
## doy_cos -10.5905 26.5641 -0.399 0.690163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.82 on 2738 degrees of freedom
## Multiple R-squared: 0.9732, Adjusted R-squared: 0.973
## F-statistic: 5850 on 17 and 2738 DF, p-value: < 2.2e-16
set.seed(123)
modelo_rpart <- train(
form,
data = train_data,
method = "rpart",
trControl = control,
tuneLength = 10
)
modelo_rpart
## CART
##
## 2756 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 2204, 2204, 2204, 2207, 2205, 2205, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.001678805 64.88945 0.9591879 33.00031
## 0.002502828 67.39361 0.9560597 33.97167
## 0.003200306 69.67160 0.9530413 35.81016
## 0.004274338 71.74145 0.9502354 39.20858
## 0.005613434 74.89157 0.9457620 44.02187
## 0.015845407 82.73508 0.9333533 49.13311
## 0.018649924 89.00055 0.9229463 52.20134
## 0.056467981 108.80010 0.8848315 62.55111
## 0.074050444 141.85570 0.8031881 98.13700
## 0.783668126 245.96061 0.7655851 192.78275
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001678805.
rpart.plot(modelo_rpart$finalModel)
set.seed(123)
modelo_knn <- train(
form,
data = train_data,
method = "knn",
trControl = control,
tuneLength = 10,
preProcess = c("center", "scale")
)
modelo_knn
## k-Nearest Neighbors
##
## 2756 samples
## 17 predictor
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 2204, 2204, 2204, 2207, 2205, 2205, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 66.55280 0.9574936 32.82651
## 7 65.67810 0.9588634 33.22320
## 9 66.86681 0.9576759 34.06966
## 11 68.05071 0.9562002 35.06621
## 13 69.20284 0.9547208 35.89345
## 15 70.18049 0.9537005 36.45028
## 17 71.00793 0.9527559 37.09943
## 19 71.31645 0.9524491 37.35763
## 21 71.86884 0.9518242 37.68473
## 23 72.42956 0.9511340 38.01769
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
plot(modelo_knn)
# 8. Evaluación y comparación de modelos # 8.1 Función de métricas
calcular_metricas <- function(modelo, nombre, datos_test, y_true){
pred <- predict(modelo, newdata = datos_test)
data.frame(
modelo = nombre,
RMSE = rmse(y_true, pred),
MAE = mae(y_true, pred),
R2 = cor(y_true, pred)^2
)
}
y_test <- test_data[[variable_objetivo]]
| modelo | RMSE | MAE | R2 |
|---|---|---|---|
| Regresión lineal | 59.944 | 26.161 | 0.966 |
| Árbol de decisión | 71.020 | 35.317 | 0.953 |
| KNN | 74.477 | 36.115 | 0.950 |
metricas %>%
ggplot(aes(x = reorder(modelo, RMSE), y = RMSE)) +
geom_col() +
coord_flip() +
labs(
x = "Modelo",
y = "RMSE (menor es mejor)",
title = "Comparación de modelos según RMSE"
) +
theme_minimal()
# 9. Análisis de residuos del mejor modelo
modelo_mejor <- modelo_lm # cambiar si corresponde
pred_mejor <- predict(modelo_mejor, newdata = test_data)
residuos <- y_test - pred_mejor
par(mfrow = c(1, 2))
hist(residuos,
main = "Histograma de residuos",
xlab = "Residuo")
plot(pred_mejor, residuos,
main = "Residuos vs predicción",
xlab = "Predicción", ylab = "Residuo")
abline(h = 0, col = "red")
par(mfrow = c(1, 1))