Act4_Econometrics

R Markdown

# ---- ensure igae_ts exists from df ----
stopifnot(exists("df"), is.data.frame(df), all(c("periodo","total") %in% names(df)))
df <- df |> dplyr::arrange(periodo) |> dplyr::filter(!is.na(periodo), is.finite(total))

sy <- lubridate::year(min(df$periodo, na.rm = TRUE))
sm <- lubridate::month(min(df$periodo, na.rm = TRUE))
igae_ts <- ts(df$total, start = c(sy, sm), frequency = 12)
if (any(is.na(igae_ts))) igae_ts <- forecast::na.interp(igae_ts)

freq <- frequency(igae_ts)
h    <- min(12, max(3, floor(length(igae_ts)/4)))
# 1.1 Serie
autoplot(igae_ts) + xlab("Tiempo") + ylab("Índice") +
  ggtitle("IGAE — serie mensual (base 2018=100)")

# 1.2 STL si hay ≥ 2 periodos; si no, fallback con media móvil
if (length(igae_ts) >= 2*freq) {
  fit_stl <- stl(igae_ts, s.window = "periodic", robust = TRUE)
  autoplot(fit_stl) + ggtitle("Descomposición STL (tendencia, estacionalidad, residuo)")
} else {
  ord <- ifelse(length(igae_ts) >= 13, 13, 7)
  tr  <- ma(igae_ts, order = ord); res <- igae_ts - tr
  autoplot(cbind(Serie = igae_ts, Tendencia = tr, Residuo = res), facets = TRUE) +
    ggtitle("Componentes (media móvil) — serie < 24 meses")
}

# 1.3 Estacionalidad
if (length(igae_ts) >= freq) {
  ggseasonplot(igae_ts, year.labels = TRUE) + ggtitle("Seasonal plot")
}

if (length(igae_ts) >= 2*freq) {
  ggsubseriesplot(igae_ts) + ggtitle("Subseries mensual")
} else {
  # Promedio por mes cuando hay <24 obs
  start_date <- as.Date(sprintf("%04d-%02d-01", start(igae_ts)[1], start(igae_ts)[2]))
  df_seas <- tibble(
    Fecha = seq.Date(start_date, by = "month", length.out = length(igae_ts)),
    Valor = as.numeric(igae_ts),
    Mes   = factor(month.abb[cycle(igae_ts)], levels = month.abb)
  )
  ggplot(df_seas, aes(Mes, Valor)) +
    stat_summary(fun = mean, geom = "col") +
    ggtitle("Promedio por mes (serie < 24 meses)")
}

# 1.4 ACF/PACF
ggAcf(igae_ts)  + ggtitle("ACF (nivel)")

ggPacf(igae_ts) + ggtitle("PACF (nivel)")

d1 <- diff(igae_ts)
ggAcf(d1)  + ggtitle("ACF (primera diferencia)")

ggPacf(d1) + ggtitle("PACF (primera diferencia)")

# 1.5 Pruebas de estacionariedad (informativas)
adf.test(igae_ts);   # Dickey-Fuller

## 
##  Augmented Dickey-Fuller Test
## 
## data:  igae_ts
## Dickey-Fuller = -2.9117, Lag order = 2, p-value = 0.2251
## alternative hypothesis: stationary

# kpss.test requiere 'tseries' ya cargado; envolver en try por si falla con series cortas:
try(kpss.test(igae_ts), silent = TRUE)

## 
##  KPSS Test for Level Stationarity
## 
## data:  igae_ts
## KPSS Level = 0.65815, Truncation lag parameter = 2, p-value = 0.01735

Including Plots

Data prep. I parsed the INEGI Excel where the first column mixes year and month and the second has the index. I matched year→month rows, built a monthly series (freq = 12), and kept 2023–2025.

Level series. The index climbs from ~102 to ~106–107, with a small dip around early-2024. So the level shows a clear upward trend and only modest month-to-month noise.

STL / components. Because the sample is short (< 24 months), I used a moving-average style decomposition. The trend slopes up, the residuals are small, and I don’t see a strong or stable seasonal pattern.

Seasonal plot & monthly averages. The 2024 line sits above 2023 most months, and the bar chart of monthly means is almost flat. That tells me seasonality is weak in this window.

ACF (level). The ACF decays slowly and many early lags are significant → the level is non-stationary. The tests agree: ADF p ≈ 0.23 (can’t reject unit root) and KPSS p ≈ 0.017 (reject level stationarity).

ACF/PACF of first difference. After differencing, most spikes fall inside the bands with a noticeable one at lag 1. That points to a simple ARIMA with d=1, e.g., ARIMA(0,1,1) or ARIMA(1,1,0), as reasonable starting models.

Rates (sanity check). The MoM changes hover around small positives; YoY runs roughly in the 1–3% range and trends up—consistent with the level plot.

Model takeaway. With this short history and weak seasonality, I’d pick a non-seasonal ARIMA with d=1, compare (0,1,1) vs (1,1,0) by AICc and out-of-sample error, and forecast 6–12 months ahead, noting wide intervals due to limited data.

library(ggplot2); library(dplyr); library(lubridate); library(forecast)

# 0) Frame estacional desde la serie
start_date <- as.Date(sprintf("%04d-%02d-01", start(igae_ts)[1], start(igae_ts)[2]))
df_seas <- tibble(
  Fecha    = seq.Date(start_date, by = "month", length.out = length(igae_ts)),
  Valor    = as.numeric(igae_ts),
  Year     = factor(year(seq.Date(start_date, by = "month", length.out = length(igae_ts)))),
  MonthNum = cycle(igae_ts),
  Month    = factor(month.abb[cycle(igae_ts)], levels = month.abb)
)

# 1) Seasonal plot (líneas por año + puntos) — igual al estilo de tu imagen
ggplot(df_seas, aes(Month, Valor, group = Year, color = Year)) +
  geom_line() +
  geom_point(size = 1.6) +
  labs(title = "Seasonal plot (líneas por año)", x = "Month", y = "Index") +
  theme_minimal()

# 2) Seasonal plot (solo puntos; 1 punto por año y mes)
ggplot(df_seas, aes(Month, Valor, color = Year)) +
  geom_point(size = 3) +
  labs(title = "Seasonal plot (puntos: 1 obs por año)", x = "Month", y = "Index") +
  theme_minimal()

# 3) Subseries mensual (requiere ≥ 24 meses)
if (length(igae_ts) >= 2 * frequency(igae_ts)) {
  ggsubseriesplot(igae_ts) + ggtitle("Subseries mensual")
}

# 4) Heatmap año–mes (estacional)
ggplot(df_seas, aes(Month, Year, fill = Valor)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Heatmap estacional (mes × año)", x = "Month", y = "Year", fill = "Index") +
  theme_minimal()

# 5) Boxplot estacional por mes
ggplot(df_seas, aes(Month, Valor)) +
  geom_boxplot() +
  labs(title = "Boxplot estacional por mes", x = "Month", y = "Index") +
  theme_minimal()

# 6) Promedio por mes (barras)
df_avg <- df_seas |>
  group_by(Month) |>
  summarise(Average = mean(Valor, na.rm = TRUE), .groups = "drop")

ggplot(df_avg, aes(Month, Average)) +
  geom_col() +
  labs(title = "Average by month", x = "Month", y = "Average") +
  theme_minimal()

Seasonal plot (lines) – 2024 sits above 2023 most months. Peaks are mid-year in 2024; 2023 is lower and smoother. That tells me the level rises year-over-year and any seasonal bump is mild.

Seasonal plot (points) – with one obs per month×year, the dots line up in a similar order across years, but gaps are small. Month effects exist but are weak.

Seasonal heatmap (month × year) – darker shades cluster in mid-2024; 2023 is lighter overall. The contrast across months is limited → seasonality is not dominant.

Seasonal boxplot by month – medians drift up from Jan → Oct, and IQRs are tight, meaning low dispersion and a trend component more important than month-to-month patterns.

Average by month – bars are almost flat (differences ~1 index point), confirming very small seasonal averages.

# ====== MODELADO Y PRONÓSTICO ======

# ---- Helper AICc compatible (por si tu 'forecast' no exporta AICc) ----
AICc_ <- function(mod){
  if ("AICc" %in% getNamespaceExports("forecast")) {
    return(forecast::AICc(mod))  # usa la del paquete si existe
  } else {
    k <- length(stats::coef(mod))               # nº de parámetros
    n <- length(stats::residuals(mod))          # nº de observaciones
    stats::AIC(mod) + 2*k*(k+1) / max(n - k - 1, 1)
  }
}

freq <- frequency(igae_ts)
h    <- min(12, max(3, floor(length(igae_ts)/4)))  # horizonte seguro

# --- 1) Modelos ARIMA y SARIMA ---
m_arima  <- auto.arima(igae_ts, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
m_sarima <- auto.arima(igae_ts, seasonal = TRUE,  stepwise = FALSE, approximation = FALSE)

summary(m_arima); summary(m_sarima)

## Series: igae_ts 
## ARIMA(0,1,0) 
## 
## sigma^2 = 0.4603:  log likelihood = -20.61
## AIC=43.22   AICc=43.44   BIC=44.21
## 
## Training set error measures:
##                     ME      RMSE       MAE       MPE      MAPE      MASE
## Training set 0.1703686 0.6621191 0.4904226 0.1618001 0.4713675 0.2881081
##                    ACF1
## Training set -0.3038496

## Series: igae_ts 
## ARIMA(0,1,0) 
## 
## sigma^2 = 0.4603:  log likelihood = -20.61
## AIC=43.22   AICc=43.44   BIC=44.21
## 
## Training set error measures:
##                     ME      RMSE       MAE       MPE      MAPE      MASE
## Training set 0.1703686 0.6621191 0.4904226 0.1618001 0.4713675 0.2881081
##                    ACF1
## Training set -0.3038496

checkresiduals(m_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 2.6668, df = 4, p-value = 0.615
## 
## Model df: 0.   Total lags used: 4

checkresiduals(m_sarima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 2.6668, df = 4, p-value = 0.615
## 
## Model df: 0.   Total lags used: 4

fc_arima  <- forecast(m_arima,  h = h)
fc_sarima <- forecast(m_sarima, h = h)

autoplot(fc_arima)  + ggtitle(sprintf("Pronóstico ARIMA (%d meses)", h))

autoplot(fc_sarima) + ggtitle(sprintf("Pronóstico SARIMA (%d meses)", h))

# --- 2) SARIMAX (intenta detectar exógenas en 'raw'; si no hay, se omite) ---
build_xreg <- function(y_ts, raw_df) {
  if (missing(raw_df) || is.null(raw_df)) return(NULL)
  nm <- names(raw_df)
  p_idx <- which(grepl("periodo|mes|fecha", nm, ignore.case = TRUE))[1]
  if (is.na(p_idx)) return(NULL)

  cand <- nm[grepl("actividades|primari|secundari|terciari|^x[3-9]|exog", nm, ignore.case = TRUE)]
  cand <- setdiff(cand, nm[p_idx]); if (length(cand) < 1) return(NULL)

  per_chr <- as.character(raw_df[[p_idx]])
  per <- suppressWarnings(ym(per_chr))
  per <- ifelse(is.na(per), suppressWarnings(ymd(per_chr)), per)
  per <- ifelse(is.na(per), suppressWarnings(dmy(per_chr)), per)
  num_per <- suppressWarnings(as.numeric(per_chr))
  per <- as.Date(ifelse(is.na(per) & !is.na(num_per),
                        as.Date(num_per, origin = "1899-12-30"), per), origin="1970-01-01")
  per <- floor_date(per, "month")

  dfX <- tibble(Fecha = per) |>
    bind_cols(as_tibble(raw_df[cand])) |>
    mutate(across(-Fecha, ~ parse_number(as.character(.x)))) |>
    filter(!is.na(Fecha)) |>
    distinct(Fecha, .keep_all = TRUE) |>
    arrange(Fecha) |>
    complete(Fecha = seq.Date(min(Fecha), max(Fecha), by = "month")) |>
    arrange(Fecha) |>
    mutate(across(-Fecha, ~ zoo::na.approx(.x, na.rm = FALSE))) |>
    mutate(across(-Fecha, ~ dplyr::lag(.x, 1)))  # X(t-1)

  start_date <- as.Date(sprintf("%04d-%02d-01", start(y_ts)[1], start(y_ts)[2]))
  fechas_ts  <- seq.Date(start_date, by = "month", length.out = length(y_ts))
  X <- dfX |> filter(Fecha %in% fechas_ts) |> select(-Fecha)
  if (nrow(X) != length(y_ts)) return(NULL)
  as.matrix(X)
}

xreg_try <- if (exists("raw")) build_xreg(igae_ts, raw) else NULL

m_arimax <- NULL; fc_arimax <- NULL
if (!is.null(xreg_try)) {
  m_arimax <- auto.arima(igae_ts, xreg = xreg_try, seasonal = TRUE,
                         stepwise = FALSE, approximation = FALSE)
  summary(m_arimax); checkresiduals(m_arimax)

  X_future <- matrix(rep(tail(xreg_try, 1), each = h), nrow = h, byrow = TRUE)
  colnames(X_future) <- colnames(xreg_try)
  fc_arimax <- forecast(m_arimax, xreg = X_future, h = h)
  autoplot(fc_arimax) + ggtitle(sprintf("Pronóstico SARIMAX (%d meses)", h))
} else {
  message("No se detectaron exógenas adecuadas; SARIMAX omitido.")
}

## No se detectaron exógenas adecuadas; SARIMAX omitido.

# --- 3) Selección del mejor modelo (criterios + validación hold-out) ---
comp_tab <- data.frame(
  Modelo = c("ARIMA","SARIMA", if (!is.null(m_arimax)) "SARIMAX"),
  AICc   = c(AICc_(m_arima), AICc_(m_sarima), if (!is.null(m_arimax)) AICc_(m_arimax)),
  BIC    = c(BIC(m_arima),    BIC(m_sarima),  if (!is.null(m_arimax)) BIC(m_arimax))
)
print(comp_tab[order(comp_tab$AICc), ])

##   Modelo    AICc      BIC
## 1  ARIMA 43.2182 44.21393
## 2 SARIMA 43.2182 44.21393

train_end <- length(igae_ts) - h
y_tr <- window(igae_ts, end = time(igae_ts)[train_end])
y_te <- window(igae_ts, start = time(igae_ts)[train_end + 1])

fit_arima  <- auto.arima(y_tr, seasonal = FALSE)
fit_sarima <- auto.arima(y_tr, seasonal = TRUE)
fc_arima_v  <- forecast(fit_arima,  h = h)
fc_sarima_v <- forecast(fit_sarima, h = h)

met <- function(fc, y) as.data.frame(accuracy(fc, y))[1, c("RMSE","MAE","MAPE")]
val_tab <- rbind(
  cbind(Modelo="ARIMA",  met(fc_arima_v,  y_te)),
  cbind(Modelo="SARIMA", met(fc_sarima_v, y_te))
)

if (!is.null(m_arimax) && !is.null(xreg_try)) {
  X_tr <- xreg_try[1:train_end, , drop=FALSE]
  X_te <- xreg_try[(train_end+1):nrow(xreg_try), , drop=FALSE]
  fit_arimax  <- auto.arima(y_tr, xreg = X_tr, seasonal = TRUE)
  fc_arimax_v <- forecast(fit_arimax, xreg = X_te, h = h)
  val_tab <- rbind(val_tab, cbind(Modelo="SARIMAX", met(fc_arimax_v, y_te)))
}
print(val_tab[order(val_tab$RMSE), ])

##               Modelo     RMSE      MAE      MAPE
## Training set   ARIMA 0.702457 0.510254 0.4920942
## Training set1 SARIMA 0.702457 0.510254 0.4920942

# --- 4) Pronóstico final con el modelo ganador ---
best_by_rmse <- val_tab$Modelo[which.min(val_tab$RMSE)]
best_model <- ifelse(length(best_by_rmse)==1, best_by_rmse,
                     comp_tab$Modelo[which.min(comp_tab$AICc)])

if (best_model == "ARIMA") {
  fc_best <- fc_arima
  checkresiduals(m_arima)
} else if (best_model == "SARIMA") {
  fc_best <- fc_sarima
  checkresiduals(m_sarima)
} else if (best_model == "SARIMAX" && !is.null(fc_arimax)) {
  fc_best <- fc_arimax
  checkresiduals(m_arimax)
} else {
  fc_best <- fc_sarima
  best_model <- "SARIMA"
}

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 2.6668, df = 4, p-value = 0.615
## 
## Model df: 0.   Total lags used: 4

autoplot(fc_best) +
  ggtitle(sprintf("Pronóstico final (%s) — %d meses", best_model, h)) +
  xlab("Tiempo") + ylab("Índice")

# Comparación visual de pronósticos
autoplot(igae_ts) +
  autolayer(fc_arima$mean,  series = "ARIMA") +
  autolayer(fc_sarima$mean, series = "SARIMA") +
  { if (!is.null(fc_arimax)) autolayer(fc_arimax$mean, series = "SARIMAX") } +
  ggtitle(sprintf("Comparación de pronósticos (%d meses)", h)) +
  xlab("Tiempo") + ylab("Índice") +
  guides(colour = guide_legend(title = "Modelo"))

I worked with the monthly IGAE index (base 2018=100). The series trends upward with mild seasonality: the seasonal plots (lines, dots, heatmap, boxplots) show 2024 running above 2023 most months and slightly stronger levels around mid-to-late year.

To check stationarity I ran ADF (p≈0.23) and KPSS (p≈0.017): together they indicate the level series is non-stationary, so I took one difference (d=1). The ACF/PACF of the differenced series don’t show strong remaining structure.

I estimated two models with auto.arima: a non-seasonal ARIMA and a seasonal SARIMA. Residual diagnostics look good (Ljung–Box p≈0.62, residuals centered and roughly normal), so the models are adequate. By AICc/BIC and RMSE both models are essentially tied; I chose ARIMA as the simpler option (the code labels it the “winner”).

The 5-month forecast keeps a gentle upward path from the latest value, with widening 80/95% intervals. In plain terms, I expect the index to stay near its current level with a slight increase, while acknowledging growing uncertainty further out.

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

## New names:
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...12`
## • `` -> `...13`

## Warning in ifelse(grepl("^\\d{4}$", per_chr), as.integer(per_chr),
## NA_integer_): NAs introduced by coercion

## 
## === Chequeo de la serie IGAE ===

## Longitud: 21  | Frecuencia: 12  | Inicio: 2023-1  | Fin: 2024-9

##     Obs   Media Mediana      SD     Min     Max 
##  21.000 104.086 104.159   1.192 101.428 105.791

## 
## Head (6):

##           Jan      Feb      Mar      Apr      May      Jun
## 2023 102.3076 102.1964 101.4285 103.1268 103.1624 103.6392

## 
## Tail (6):

##           Apr      May      Jun      Jul      Aug      Sep
## 2024 104.1588 104.9551 105.0560 105.7914 105.5362 105.7831

## 
## MoM (%) - últimos 3:

##         Jul    Aug    Sep
## 2024  0.700 -0.241  0.234

## 
## YoY (%) - últimos 3:
##        Jul   Aug   Sep
## 2024 2.078 1.337 0.762

## 
## === Criterios (AICc/BIC) ===

##   Modelo    AICc      BIC
## 1  ARIMA 43.2182 44.21393
## 2 SARIMA 43.2182 44.21393

## 
## === Validación (menor RMSE es mejor) ===

##               Modelo     RMSE      MAE      MAPE
## Training set   ARIMA 0.702457 0.510254 0.4920942
## Training set1 SARIMA 0.702457 0.510254 0.4920942

## 
## === MODELO GANADOR ===

## [1] "ARIMA"

Residual diagnostics (ARIMA(0,1,0)) I see residuals bouncing around zero with no obvious pattern. The ACF bars stay inside the bands, and the residual histogram looks roughly bell-shaped. That tells me the model errors behave like white noise—good fit.

Final forecast (ARIMA, 5 months) The fan chart shows a gentle upward path with widening 80/95% intervals. Near-term uncertainty is small, but it grows the farther out I go—so the point forecast edges up, yet the cone gets wide.

Forecast comparison (ARIMA vs SARIMA) The two forecast lines basically overlap. Given the near-identical curves (and earlier metrics), I’d stick with the simpler ARIMA: it explains the data about as well without extra seasonal terms.

—– conclusion ————–

The IGAE series shows a mild upward trend since 2023 with no strong seasonality.

ARIMA(0,1,0) is the winner: AICc/BIC ties with SARIMA, but its residuals look like white noise (Ljung–Box p≈0.62), so it’s the simplest adequate model.

The 5-month forecast points to modest growth, with uncertainty bands widening over the horizon.

SARIMA adds no gain, and SARIMAX was skipped because there were no aligned exogenous series.

Caution: with only ~21 observations, forecast uncertainty is high—retrain as new data arrives.

Act4_Econometrics

Pilar Vaquero

2025-09-09

R Markdown

Including Plots