Time Series Demo

Author

Jimmy

Environment test

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import altair as alt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller, acf, pacf
from pmdarima import auto_arima
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")
np.random.uniform(0, 1, 10)  # quick sanity check that the environment works
array([0.78874915, 0.6862814 , 0.23030092, 0.05259818, 0.50130469,
       0.18490017, 0.02031174, 0.52975919, 0.79594482, 0.52905825])

I am using the AirPassengers data set to demonstrate time series forecasting.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url)
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)
df.head()
Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121

I am using \(80\%\) of the data as the training set and the rest as the test set.

split_point = int(len(df) * 0.8)
print("We are using",split_point, "months as training data")
We are using 115 months as training data
train_set = df.iloc[:split_point]
test_set = df.iloc[split_point:]
train_set.tail()
test_set.head()  # only the last expression's output is displayed below
Passengers
Month
1958-08-01 505
1958-09-01 404
1958-10-01 359
1958-11-01 310
1958-12-01 337
adf_test = adfuller(df['Passengers'])
print(adf_test)
(0.8153688792060442, 0.9918802434376409, 13, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 996.692930839019)
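
The raw tuple from adfuller is hard to read. As a small sketch, the return values can be unpacked and labeled (following the order documented for adfuller: statistic, p-value, lags used, number of observations, critical values, and the information criterion):

adf_stat, p_value, used_lags, n_obs, crit_values, _ = adf_test
print("ADF statistic:", adf_stat)   # 0.815
print("p-value:", p_value)          # 0.992
for level, crit in crit_values.items():
    print(f"Critical value ({level}):", crit)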

The series is not stationary: the ADF p-value is \(0.99\), so we cannot reject the unit-root null. The pronounced yearly pattern also gives us reason to suspect seasonality.

Let's try differencing the series with \(d=1\) (one regular difference) and \(D=1\) (one seasonal difference at lag 12).

ACF:

plot_acf(df['Passengers'], lags=24)
plt.show()

df1 = df.diff()
df1 = df1.dropna()
plot_acf(df1['Passengers'], lags=24)
plt.show()

df11 = df1['Passengers'].diff(12)
df11 = df11.dropna()
plot_acf(df11, lags=24)
plt.show()

PACF:

plot_pacf(df['Passengers'], lags=24)
plt.show()

plot_pacf(df1['Passengers'], lags=24)
plt.show()

plot_pacf(df11, lags=24)
plt.show()

adf_test1 = adfuller(df11)
print(adf_test1)
(-15.59561808374634, 1.856511600123444e-28, 0, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 919.8428088960275)

The series is stationary after differencing with \(d=1\) and \(D=1\): the ADF p-value is about \(1.9 \times 10^{-28}\), so we reject the unit-root null.
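
As a complementary check (not part of the original analysis), the KPSS test reverses the hypotheses: its null is that the series is stationary, so failing to reject here corroborates the ADF result. A minimal sketch:

from statsmodels.tsa.stattools import kpss
# KPSS null hypothesis: the series is (level-)stationary
kpss_stat, kpss_p, _, _ = kpss(df11, regression='c', nlags='auto')
print("KPSS statistic:", kpss_stat, "p-value:", kpss_p)
# A small ADF p-value (reject the unit root) together with a large KPSS
# p-value (fail to reject stationarity) points to a stationary series.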

Hypotheses for the Ljung-Box test applied to the training series:

\[H_{0}: \text{the observations are independently distributed} \quad \text{versus} \quad H_{1}: \text{the observations are not independently distributed, i.e., there exists autocorrelation}\]

ljung_box_result = acorr_ljungbox(train_set['Passengers'], return_df=True) 
print(ljung_box_result)
       lb_stat      lb_pvalue
1    98.782569   2.818041e-23
2   179.006635   1.346490e-39
3   247.948735   1.817517e-53
4   308.582223   1.525346e-65
5   364.311826   1.449747e-76
6   418.693975   2.670613e-87
7   470.681092   1.603923e-97
8   522.179836  1.222667e-107
9   579.035539  6.599530e-119
10  642.002302  1.746688e-131

The p-values are vanishingly small, so we reject the null hypothesis: there is significant autocorrelation in the series. Together with the ACF plots, this supports our suspicion of seasonality.
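
One quick way to eyeball the suspected seasonality before modeling is a classical decomposition. This is a sketch, not part of the modeling pipeline; a multiplicative model is assumed here because the seasonal swings grow with the level of the series:

from statsmodels.tsa.seasonal import seasonal_decompose
# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(train_set['Passengers'], model='multiplicative', period=12)
decomposition.plot()
plt.show()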

model = auto_arima(train_set['Passengers'], 
                   start_p=0, start_q=0,
                   max_p=5, max_q=5, 
                   m=12,
                   seasonal=True,  
                   trace=True,
                   stepwise=True)
print(model.summary())
Performing stepwise search to minimize aic
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=758.922, Time=0.17 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=757.826, Time=0.02 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=755.750, Time=0.07 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=756.380, Time=0.10 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=755.499, Time=0.03 sec
 ARIMA(1,1,0)(0,1,1)[12]             : AIC=755.982, Time=0.10 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=757.649, Time=0.17 sec
 ARIMA(2,1,0)(0,1,0)[12]             : AIC=756.771, Time=0.04 sec
 ARIMA(1,1,1)(0,1,0)[12]             : AIC=756.022, Time=0.05 sec
 ARIMA(0,1,1)(0,1,0)[12]             : AIC=756.011, Time=0.03 sec
 ARIMA(2,1,1)(0,1,0)[12]             : AIC=757.976, Time=0.05 sec
 ARIMA(1,1,0)(0,1,0)[12] intercept   : AIC=757.438, Time=0.05 sec

Best model:  ARIMA(1,1,0)(0,1,0)[12]          
Total fit time: 0.889 seconds
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                                  y   No. Observations:                  115
Model:             SARIMAX(1, 1, 0)x(0, 1, 0, 12)   Log Likelihood                -375.750
Date:                            Wed, 19 Jun 2024   AIC                            755.499
Time:                                    22:05:33   BIC                            760.749
Sample:                                01-01-1949   HQIC                           757.625
                                     - 07-01-1958                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.2048      0.090     -2.271      0.023      -0.382      -0.028
sigma2        92.6974     13.326      6.956      0.000      66.579     118.816
===================================================================================
Ljung-Box (L1) (Q):                   0.03   Jarque-Bera (JB):                 2.44
Prob(Q):                              0.87   Prob(JB):                         0.30
Heteroskedasticity (H):               0.96   Skew:                             0.37
Prob(H) (two-sided):                  0.92   Kurtosis:                         2.86
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Using the SARIMA model with order \((1,1,0)(0,1,0)_{12}\), we fit on the training set and forecast over the test window:

mod = sm.tsa.SARIMAX(train_set['Passengers'],
                     order=(1, 1, 0),
                     seasonal_order=(0, 1, 0, 12))
results = mod.fit(disp=False)  # disp=False suppresses the L-BFGS-B optimizer log
forecast = results.get_forecast(steps=len(test_set))
forecast_mean = forecast.predicted_mean
# Align the forecast index with the test period, which starts at 1958-08
forecast_dates = pd.date_range(start='1958-08-01', periods=len(forecast_mean), freq='MS')
forecast_mean.index = forecast_dates

We then combine them:

combined = pd.concat([train_set['Passengers'], test_set['Passengers'], forecast_mean], axis=1)
combined.columns = ['Train', 'Test', 'Forecast']
combined_reset = combined.reset_index().rename(columns={'index': 'Date'})
combined_melted = combined_reset.melt('Date', var_name='Type', value_name='Passengers')

We then visualize the historical data together with the forecast:

chart0 = alt.Chart(combined_melted).mark_line().encode(
    alt.X('Date:T', title='Date'),
    alt.Y('Passengers:Q', title='# of Passengers'),
    color='Type:N',
    tooltip=['Date:T', 'Passengers:Q', 'Type:N']
).interactive().properties(
    width=500,
    height=300
)
chart0
residuals = results.resid
test = sm.stats.acorr_ljungbox(residuals, return_df=True)
print(test)
     lb_stat  lb_pvalue
1   0.002170   0.962842
2   0.933702   0.626974
3   1.332964   0.721321
4   1.445974   0.836166
5   1.864579   0.867548
6   1.889649   0.929563
7   2.790449   0.903687
8   3.659371   0.886475
9   3.980878   0.912667
10  5.797085   0.832013

With p-values this large, we fail to reject the null hypothesis of independently distributed residuals: there is no evidence of remaining autocorrelation, so the model appears to capture the dependence structure in the data.
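
For a visual complement to the Ljung-Box table, statsmodels bundles the standard residual diagnostics (standardized residuals, histogram, Q-Q plot, and correlogram) into a single call; a quick sketch:

results.plot_diagnostics(figsize=(10, 8))
plt.show()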

Now check how accurate this model is on the test set:

Mean error:

errors = forecast_mean - test_set['Passengers']
mean_error = np.mean(errors)
print("Mean Error:", mean_error)
Mean Error: -22.959210980795042

RMSE:

rmse = np.sqrt(mean_squared_error(test_set['Passengers'], forecast_mean))
print("Root Mean Squared Error (RMSE):", rmse)
Root Mean Squared Error (RMSE): 35.08281311854609

MAPE:

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = mean_absolute_percentage_error(test_set['Passengers'].dropna(), forecast_mean.dropna())
print("Mean Absolute Percentage Error (MAPE):",mape)
Mean Absolute Percentage Error (MAPE): 6.124549631659345
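
For context, these errors can be benchmarked against a seasonal naive forecast that simply repeats the most recent observed season. This is a sketch of my own, not part of the original analysis; it cycles the last 12 training months across the 29-month test window:

# Seasonal naive baseline: repeat the last observed season of the training set
last_season = train_set['Passengers'].iloc[-12:].values
naive_forecast = np.array([last_season[i % 12] for i in range(len(test_set))])
naive_rmse = np.sqrt(mean_squared_error(test_set['Passengers'], naive_forecast))
print("Seasonal naive RMSE:", naive_rmse)

The SARIMA model should beat this baseline for its extra complexity to be worthwhile.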

Finally, we train the model on the full data set and forecast the next 24 months:

model_1 = sm.tsa.SARIMAX(df['Passengers'],
                         order=(1, 1, 0),
                         seasonal_order=(0, 1, 0, 12))
results_1 = model_1.fit(disp=False)  # suppress the optimizer log
forecast1 = results_1.get_forecast(steps=24)
forecast1_mean = forecast1.predicted_mean
forecast1_mean.head()
1961-01-01    444.307609
1961-02-01    418.212986
1961-03-01    446.242093
1961-04-01    488.233139
1961-05-01    499.235893
Freq: MS, Name: predicted_mean, dtype: float64
df_vis = df.reset_index()
forecast_df = pd.DataFrame({'Date': forecast1_mean.index, 'Passengers': forecast1_mean.values})
df_vis['Type'] = 'Historical'
forecast_df['Type'] = 'Forecasted'
combined_df = pd.concat([df_vis, forecast_df])
combined_df['Month_Year'] = combined_df['Month'].fillna(combined_df['Date'])
chart1 = alt.Chart(combined_df).mark_line().encode(
    alt.X('Month_Year:T', title='Month'),
    alt.Y('Passengers:Q', title='# of Passengers'),
    color='Type:N',
    tooltip=['Month_Year:T', 'Passengers:Q', 'Type:N']
).interactive().properties(
    width=500,
    height=300,
    title='Forecasted Numbers of Air Passenger for the Next 24 Months'
)
chart1
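
The point forecast alone understates uncertainty. get_forecast also exposes confidence intervals, which could be layered under the chart as a band; a sketch (the band_df name and the layering are my own, and the interval columns are accessed by position to avoid assuming their labels):

# 95% confidence interval for the 24-month forecast
conf_int = forecast1.conf_int()
band_df = pd.DataFrame({
    'Date': forecast1_mean.index,
    'lower': conf_int.iloc[:, 0].values,
    'upper': conf_int.iloc[:, 1].values,
})
band = alt.Chart(band_df).mark_area(opacity=0.3).encode(
    alt.X('Date:T', title='Month'),
    alt.Y('lower:Q', title='# of Passengers'),
    alt.Y2('upper:Q')
)
band + chart1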