Data imputation using the kNN (k-nearest neighbors) method

Loading the data:

Code
library(VIM)
library(tidyverse)
library(zoo)
library(caret)
source("Scripts/ECA.R")

df <- ECA(meteo = "input/CR1000_JULIO_2024.csv",
          gases = "input/eca_2024_julio.csv",
          pm = "input/pm_junio.csv",
          fecha_inicio = "2024-06-21",
          fecha_fin = "2024-06-30",
          estacion = "Unanue",
          tipo = "lista")

source("Scripts/data_cruda.R")
df1 <- cruda(meteo = "input/CR1000_JULIO_2024.csv",
             gases = "input/eca_2024_julio.csv",
             pm = "input/pm_junio.csv",
             fecha_inicio = "2024-06-21",
             fecha_fin = "2024-06-30",
             estacion = "Unanue",
             tipo = "lista")

Viewing the raw data:

Code
df1 %>% select(date, pm25, pm10) %>% head()
                 date      pm25      pm10
1 2024-06-21 00:00:00 -30.79940 -12.43890
2 2024-06-21 01:00:00 -15.28620  -5.34849
3 2024-06-21 02:00:00  16.74790  22.26210
4 2024-06-21 03:00:00  11.69750  23.92250
5 2024-06-21 04:00:00  -3.68306  10.41090
6 2024-06-21 05:00:00   5.31512  14.82070

Viewing the processed data:

Code
df$df %>% select(date, pm25, pm10) %>% head()
                 date     pm25    pm10
1 2024-06-21 00:00:00       NA      NA
2 2024-06-21 01:00:00       NA      NA
3 2024-06-21 02:00:00 16.74790 22.2621
4 2024-06-21 03:00:00 11.69750 23.9225
5 2024-06-21 04:00:00       NA 10.4109
6 2024-06-21 05:00:00  5.31512 14.8207
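
Comparing the two tables, the negative concentrations in the raw data appear as NA in the processed data. The internals of ECA() are not shown here, so the following is only a minimal sketch of that kind of filter. The helper neg_to_na() is hypothetical (it is not part of ECA.R), and the values are copied from the raw output above.

```r
# Hypothetical helper (not part of ECA.R): set negative values to NA,
# since a negative particulate concentration is not physically meaningful.
neg_to_na <- function(x) {
  x[!is.na(x) & x < 0] <- NA
  x
}

# First six raw rows, copied from the output of df1 above.
raw <- data.frame(
  pm25 = c(-30.79940, -15.28620, 16.74790, 11.69750, -3.68306, 5.31512),
  pm10 = c(-12.43890, -5.34849, 22.26210, 23.92250, 10.41090, 14.82070)
)

# Apply the filter column by column and count the resulting gaps.
clean <- as.data.frame(lapply(raw, neg_to_na))
print(clean)
print(colSums(is.na(clean)))
```

The real processing step may apply further rules (for example, validity ranges or flags), so this sketch only reproduces the pattern visible in the first six rows.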

First, we determine the optimal k, which is the main parameter used to impute the missing data:

Code
datos_sin_na <- na.omit(df$df)

# Control settings for 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)

# Search for the optimal k
set.seed(123)
modelo <- train(pm25~., data = datos_sin_na, method = "knn",
                trControl = control,
                tuneLength = 20)


# Results
print(modelo)
k-Nearest Neighbors 

219 samples
 16 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 197, 197, 198, 197, 197, 198, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared    MAE     
   5  8.358520  0.21826283  6.377855
   7  8.858393  0.14489171  6.855358
   9  9.063432  0.12434717  7.012211
  11  9.035783  0.12666836  6.954638
  13  9.020266  0.10261145  6.966740
  15  8.934417  0.10409365  6.898997
  17  9.003863  0.09059925  7.007185
  19  9.043460  0.08682066  7.034710
  21  9.066077  0.07656560  7.029111
  23  9.084355  0.07070887  6.993344
  25  9.064549  0.07238900  6.984655
  27  9.101419  0.06036228  7.048669
  29  9.123851  0.05822914  7.054266
  31  9.178495  0.05024674  7.119413
  33  9.212477  0.04410112  7.160160
  35  9.230600  0.04782410  7.203478
  37  9.237239  0.04794015  7.219297
  39  9.208216  0.05837104  7.214041
  41  9.163661  0.07119001  7.183212
  43  9.157573  0.06870827  7.183684

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
Code
plot(modelo)

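According to the cross-validation results, the optimal value is k = 5, which gives the lowest RMSE (8.36). Note that k = 5 is also the smallest value in the search grid.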
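
To make the selection procedure concrete, here is a minimal base-R sketch of what a cross-validated search for k does: split the data into folds, predict each held-out fold with kNN regression, and keep the k with the lowest mean RMSE. It is not caret's implementation. The data are synthetic, the names knn_predict() and cv_rmse() are hypothetical, and 5 folds are used instead of 10 to keep the example short.

```r
set.seed(123)

# Synthetic data (an assumption for illustration): two predictors, one response.
n <- 100
x <- matrix(runif(n * 2), ncol = 2)
y <- 10 * x[, 1] + rnorm(n, sd = 0.5)

# kNN regression: predict each test row as the mean response of its
# k nearest training rows (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Randomly assign each observation to one of 5 folds.
n_folds <- 5
folds <- sample(rep(seq_len(n_folds), length.out = n))

# Mean RMSE over the held-out folds for a given k.
cv_rmse <- function(k) {
  errs <- sapply(seq_len(n_folds), function(f) {
    test <- folds == f
    pred <- knn_predict(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE], k)
    sqrt(mean((pred - y[test])^2))
  })
  mean(errs)
}

# Candidate values of k (odd numbers, as in the grid shown above).
ks <- seq(5, 15, by = 2)
rmse <- sapply(ks, cv_rmse)
best_k <- ks[which.min(rmse)]

print(data.frame(k = ks, RMSE = rmse))
print(best_k)
```

Because kNN relies on distances, predictors measured on very different scales can dominate the neighbor search. The model output above reports "No pre-processing", so it may be worth checking whether centering and scaling the predictors changes the selected k.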

Applying kNN to impute the \(PM_{10}\) and \(PM_{2.5}\) data:

Code
dfx <- kNN(data = df$df,
           variable = c("pm25", "pm10"), k = 5)

head(dfx %>% select(date, pm25, pm10, pm25_imp, pm10_imp))
                 date     pm25    pm10 pm25_imp pm10_imp
1 2024-06-21 00:00:00 18.19940 23.9225     TRUE     TRUE
2 2024-06-21 01:00:00 16.74790 22.3966     TRUE     TRUE
3 2024-06-21 02:00:00 16.74790 22.2621    FALSE    FALSE
4 2024-06-21 03:00:00 11.69750 23.9225    FALSE    FALSE
5 2024-06-21 04:00:00 11.69750 10.4109     TRUE    FALSE
6 2024-06-21 05:00:00  5.31512 14.8207    FALSE    FALSE
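
The idea behind this step can be sketched in a few lines of base R: for each row with a missing value, find the k most similar complete rows and fill the gap with the median of their values, flagging the row in an `_imp` column. This is a simplified illustration and not VIM's implementation. It uses Euclidean distance on scaled predictors, whereas VIM::kNN is documented as using a variation of the Gower distance. The function impute_knn() and the columns temp and hum are hypothetical.

```r
# Toy data (an assumption for illustration): 'temp' and 'hum' are
# hypothetical predictors; pm25 has two gaps.
toy <- data.frame(
  temp = c(15, 16, 17, 18, 19, 20, 21, 22),
  hum  = c(90, 88, 85, 80, 78, 75, 70, 68),
  pm25 = c(20, NA, 18, 16, NA, 12, 11, 10)
)

# Hypothetical helper: impute 'target' from the k nearest complete rows.
impute_knn <- function(df, target, k) {
  # Scale the predictors so that no variable dominates the distance.
  preds <- scale(as.matrix(df[, setdiff(names(df), target), drop = FALSE]))
  miss <- is.na(df[[target]])
  donors <- which(!miss)

  out <- df
  out[[paste0(target, "_imp")]] <- miss  # TRUE where a value was imputed

  for (i in which(miss)) {
    # Euclidean distance from row i to every donor row.
    d <- sqrt(colSums((t(preds[donors, , drop = FALSE]) - preds[i, ])^2))
    nearest <- donors[order(d)[seq_len(k)]]
    # Fill the gap with the median of the k nearest donors.
    out[[target]][i] <- median(df[[target]][nearest])
  }
  out
}

filled <- impute_knn(toy, "pm25", k = 3)
print(filled)
```

The `pm25_imp` flag plays the same role as the `pm25_imp` and `pm10_imp` columns in the output above: it lets imputed values be told apart from measured ones in later analysis.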

Comparison plot of the \(PM_{2.5}\) values:

Code
comparativo <- data.frame(fecha = df$df$date, pm25_1 = df1$pm25, pm25_2 = dfx$pm25, pm25_3 = df$df$pm25)

a <- ggplot() +
  geom_line(data = comparativo, aes(x = fecha, y = pm25_1, color = "Raw")) +
  geom_line(data = comparativo, aes(x = fecha, y = pm25_2, color = "Imputed")) +
  geom_line(data = comparativo, aes(x = fecha, y = pm25_3, color = "With gaps")) +
  labs(color = "Type", title = "Comparison of EMCA-01 data") +
  theme_bw() + theme(legend.position = "top")

plotly::ggplotly(a)