SARIMA Models

Isai Guizar

Disclaimer: This document is intended for educational purposes only. It is part of my Econometrics courses at Tec de Monterrey.

This document applies the Box-Jenkins methodology to model and forecast Mexico’s GDP growth using SARIMA models, which extend ARIMA by explicitly capturing seasonal dynamics. We begin by identifying suitable model orders through visual inspection of the series, autocorrelation, and partial autocorrelation plots, with special attention to seasonal patterns. A SARIMA model is then estimated, and its residuals are subjected to diagnostic tests to assess model adequacy. Finally, we generate a four-quarter-ahead forecast with confidence intervals.

import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from datetime import datetime
import yfinance as yf
from pandas_datareader import data as pdr
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

1 Identification

GDP = pdr.DataReader('NGDPRNSAXDCMXQ', 'fred', start='1990Q1', end = '2024Q4')
GDP = GDP.rename(columns={'NGDPRNSAXDCMXQ': 'Values'})
GDP = GDP.reset_index()
GDP = GDP.rename(columns={'DATE': 'Date'})

GDP['Date'] = pd.to_datetime(GDP['Date'])
GDP.set_index(pd.PeriodIndex(GDP['Date'], freq='Q'), inplace=True)

GDP['Values'] = GDP['Values']/1000 # Mil millones
GDP['y'] = np.log(GDP['Values']).diff() * 100
GDP = GDP.dropna()

plt.figure(figsize=(7, 5))
plt.plot(GDP['Date'], GDP['y'])
plt.title('Quarterly Growth of GDP')
plt.xlabel('Date')
plt.ylabel('Percent')
plt.tight_layout()
plt.show()

decomp = sm.tsa.seasonal_decompose(GDP['y'], model='additive', period=4)

fig, axes = plt.subplots(4, 1, figsize=(7, 7), sharex=True)

GDP['y'].plot(ax=axes[0], title='Mexico Quarterly GDP Growth', ylabel='Level')
decomp.trend.plot(ax=axes[1], title='Trend/Cycle')
decomp.seasonal.plot(ax=axes[2], title='Seasonality')
decomp.resid.plot(ax=axes[3], title='Remainder')

for ax in axes:
    ax.grid(False)
    
plt.tight_layout()
plt.show()

The decomposition shows a clear seasonal component with a periodicity of 4 (quarterly), justifying the use of a SARIMA model.

1.1 Tests for stationarity

adf_p  = adfuller(GDP['y'])[1]
kpss_p = kpss(GDP['y'], regression='c', nlags='auto')[1]

print(f"ADF p-value: {adf_p:.4f}")
print(f"KPSS p-value: {kpss_p:.4f}")

ADF p-value: 0.0000
KPSS p-value: 0.1000

/var/folders/wl/12fdw3c55777609gp0_kvdrh0000gn/T/ipykernel_28734/3253650088.py:2: InterpolationWarning:

The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is greater than the p-value returned.

Tests suggest the time series is stationary.

1.2 ACF and PACF

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(GDP['y'], lags=24,  zero = False, ax=axes[0])
plot_pacf(GDP['y'], lags=24, zero = False, ax=axes[1])
axes[0].set_title('ACF of Log GDP Growth')
axes[1].set_title('PACF of Log GDP Growth')
plt.tight_layout()
plt.show()

For the nonseasonal component we observed that a (1,0,3) model worked well. The significant spikes at lag 4, 8, 12 suggest a seasonal MA(3) component. Then, an initial candidate will be a SARIMA \((1, 0, 3) (0, 0, 3)_4\) model. Proceed to estimate the model and compare the AIC, BIC, and residuals with other reasonable guesses.

2 Estimation

def fit_sarima(ts, p, d, q, P, D, Q, s):
    model = SARIMAX(ts, order=(p, d, q), seasonal_order=(P, D, Q, s))
    fitted_model = model.fit(disp=False)
    print(f"\nSARIMA({p},{d},{q})({P},{D},{Q})[{s}]")
    print(f"AIC: {fitted_model.aic:.2f}")
    print(f"BIC: {fitted_model.bic:.2f}")
    return fitted_model

m1 = fit_sarima(GDP['y'],1,0,3,0,0,3,4)

m2 = fit_sarima(GDP['y'],1,0,3,0,0,1,4)

m3 = fit_sarima(GDP['y'],1,0,3,0,0,2,4)

m4 = fit_sarima(GDP['y'],1,0,1,1,0,1,4)

m5 = fit_sarima(GDP['y'],1,0,1,0,0,1,4)

/opt/anaconda3/lib/python3.12/site-packages/statsmodels/base/model.py:607: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals


SARIMA(1,0,3)(0,0,3)[4]
AIC: 655.40
BIC: 678.16

SARIMA(1,0,3)(0,0,1)[4]
AIC: 653.61
BIC: 670.67

SARIMA(1,0,3)(0,0,2)[4]
AIC: 654.59
BIC: 674.49

SARIMA(1,0,1)(1,0,1)[4]
AIC: 645.94
BIC: 660.17

SARIMA(1,0,1)(0,0,1)[4]
AIC: 661.00
BIC: 672.38

/opt/anaconda3/lib/python3.12/site-packages/statsmodels/base/model.py:607: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

3 Diagnostic Checking

    residuals = m4.resid

    fig = plt.figure(figsize=(7, 7))
    gs = fig.add_gridspec(2, 2)

    ax1 = fig.add_subplot(gs[0, :]) 
    sns.lineplot(x=np.arange(len(residuals)), y=residuals, ax=ax1)
    ax1.set_title("Residuals over Time")

    ax2 = fig.add_subplot(gs[1, 0])
    plot_acf(residuals, lags=30, zero=False, ax=ax2)
    ax2.set_title("ACF of Residuals")

    ax3 = fig.add_subplot(gs[1, 1])
    sns.histplot(residuals, kde=True, ax=ax3)
    ax3.set_title("Histogram of Residuals")

    plt.tight_layout()
    plt.show()

Confirm the residuals are white noise with a Ljung-Box test:

print("\nLjung-Box Test:")
print(acorr_ljungbox(residuals, lags=[10], model_df=4, return_df=True))


Ljung-Box Test:
     lb_stat  lb_pvalue
10  1.543794   0.956533

4 Forecasting

periods = 4
cl      = 0.9   # confidence level

forecast = m4.get_forecast(steps=periods, alpha = 1-cl) 
mean_forecast = forecast.predicted_mean
conf_int = forecast.conf_int()

for date, value in mean_forecast.items():
  print(f"{date}: {value:.2f}")

2025Q1: -2.22
2025Q2: 1.33
2025Q3: 0.51
2025Q4: 2.01

mean_forecast = forecast.predicted_mean

GDP['Date'] = pd.to_datetime(GDP['Date'])
last_date   = GDP['Date'].iloc[-1]
forecast_dates = pd.date_range(start=last_date + pd.offsets.QuarterEnd(), periods=4, freq='QE')

# Plot
plt.figure(figsize=(7, 5))
plt.plot(GDP['Date'], GDP['y'], label='Observed')
plt.plot(forecast_dates, mean_forecast, label='Forecast', color='orange')
plt.fill_between(forecast_dates, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='orange', alpha=0.3, label=' Confidence Interval')
plt.title(f'SARIMA Forecast for GDP Growth  with  a {100*cl}% Confidence Interval')
plt.xlabel('Date')
plt.ylabel('y')
plt.legend()
plt.tight_layout()
plt.show()

Estimate the RMSFE:

fitted_values = m4.fittedvalues
rmsfe = np.sqrt(mean_squared_error(m4.data.endog, fitted_values))
print(f" RMSFE: {rmsfe:.2f}")

 RMSFE: 2.95