Time Series Demo

Author

Jimmy

Environment test

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import altair as alt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller, acf, pacf
from pmdarima import auto_arima
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")
np.random.uniform(0, 1, 10)  # quick sanity check that the environment works
array([0.78874915, 0.6862814 , 0.23030092, 0.05259818, 0.50130469,
       0.18490017, 0.02031174, 0.52975919, 0.79594482, 0.52905825])

I am using the AirPassengers data set to demonstrate time series forecasting.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url)
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)
df.head()
Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121

I am using \(80\%\) of the data as the training set and the rest as the test set.

split_point = int(len(df) * 0.8)
print("We are using",split_point, "months as training data")
We are using 115 months as training data
train_set = df.iloc[:split_point]
test_set = df.iloc[split_point:]
train_set.tail()
test_set.head()  # only the last expression's output is displayed below
Passengers
Month
1958-08-01 505
1958-09-01 404
1958-10-01 359
1958-11-01 310
1958-12-01 337
adf_test = adfuller(df['Passengers'])
print(adf_test)
(0.8153688792060442, 0.9918802434376409, 13, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 996.692930839019)
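
The raw tuple from adfuller is hard to read. As a small sketch, the return values can be unpacked and labeled (following the order documented for adfuller: statistic, p-value, lags used, number of observations, critical values, and the information criterion):

adf_stat, p_value, used_lags, n_obs, crit_values, _ = adf_test
print("ADF statistic:", adf_stat)   # 0.815
print("p-value:", p_value)          # 0.992
for level, crit in crit_values.items():
    print(f"Critical value ({level}):", crit)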

The series is not stationary: the ADF p-value is \(0.99\), so we cannot reject the unit-root null. The pronounced yearly pattern also gives us reason to suspect seasonality.

Let's try differencing the series with \(d=1\) (one regular difference) and \(D=1\) (one seasonal difference at lag 12).

ACF:

plot_acf(df['Passengers'], lags=24)
plt.show()

df1 = df.diff()
df1 = df1.dropna()
plot_acf(df1['Passengers'], lags=24)
plt.show()

df11 = df1['Passengers'].diff(12)
df11 = df11.dropna()
plot_acf(df11, lags=24)
plt.show()

PACF:

plot_pacf(df['Passengers'], lags=24)
plt.show()

plot_pacf(df1['Passengers'], lags=24)
plt.show()

plot_pacf(df11, lags=24)
plt.show()

adf_test1 = adfuller(df11)
print(adf_test1)
(-15.59561808374634, 1.856511600123444e-28, 0, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 919.8428088960275)

The series is stationary after differencing with \(d=1\) and \(D=1\): the ADF p-value is about \(1.9 \times 10^{-28}\), so we reject the unit-root null.
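
As a complementary check (not part of the original analysis), the KPSS test reverses the hypotheses: its null is that the series is stationary, so failing to reject here corroborates the ADF result. A minimal sketch:

from statsmodels.tsa.stattools import kpss
# KPSS null hypothesis: the series is (level-)stationary
kpss_stat, kpss_p, _, _ = kpss(df11, regression='c', nlags='auto')
print("KPSS statistic:", kpss_stat, "p-value:", kpss_p)
# A small ADF p-value (reject the unit root) together with a large KPSS
# p-value (fail to reject stationarity) points to a stationary series.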

Hypotheses for the Ljung-Box test applied to the training series:

\[H_{0}: \text{the observations are independently distributed} \quad \text{versus} \quad H_{1}: \text{the observations are not independently distributed, i.e., there exists autocorrelation}\]

ljung_box_result = acorr_ljungbox(train_set['Passengers'], return_df=True) 
print(ljung_box_result)
       lb_stat      lb_pvalue
1    98.782569   2.818041e-23
2   179.006635   1.346490e-39
3   247.948735   1.817517e-53
4   308.582223   1.525346e-65
5   364.311826   1.449747e-76
6   418.693975   2.670613e-87
7   470.681092   1.603923e-97
8   522.179836  1.222667e-107
9   579.035539  6.599530e-119
10  642.002302  1.746688e-131

The p-values are vanishingly small, so we reject the null hypothesis: there is significant autocorrelation in the series. Together with the ACF plots, this supports our suspicion of seasonality.
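
One quick way to eyeball the suspected seasonality before modeling is a classical decomposition. This is a sketch, not part of the modeling pipeline; a multiplicative model is assumed here because the seasonal swings grow with the level of the series:

from statsmodels.tsa.seasonal import seasonal_decompose
# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(train_set['Passengers'], model='multiplicative', period=12)
decomposition.plot()
plt.show()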

model = auto_arima(train_set['Passengers'], 
                   start_p=0, start_q=0,
                   max_p=5, max_q=5, 
                   m=12,
                   seasonal=True,  
                   trace=True,
                   stepwise=True)
print(model.summary())
Performing stepwise search to minimize aic
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=758.922, Time=0.17 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=757.826, Time=0.02 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=755.750, Time=0.07 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=756.380, Time=0.10 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=755.499, Time=0.03 sec
 ARIMA(1,1,0)(0,1,1)[12]             : AIC=755.982, Time=0.10 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=757.649, Time=0.17 sec
 ARIMA(2,1,0)(0,1,0)[12]             : AIC=756.771, Time=0.04 sec
 ARIMA(1,1,1)(0,1,0)[12]             : AIC=756.022, Time=0.05 sec
 ARIMA(0,1,1)(0,1,0)[12]             : AIC=756.011, Time=0.03 sec
 ARIMA(2,1,1)(0,1,0)[12]             : AIC=757.976, Time=0.05 sec
 ARIMA(1,1,0)(0,1,0)[12] intercept   : AIC=757.438, Time=0.05 sec

Best model:  ARIMA(1,1,0)(0,1,0)[12]          
Total fit time: 0.889 seconds
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                                  y   No. Observations:                  115
Model:             SARIMAX(1, 1, 0)x(0, 1, 0, 12)   Log Likelihood                -375.750
Date:                            Wed, 19 Jun 2024   AIC                            755.499
Time:                                    22:05:33   BIC                            760.749
Sample:                                01-01-1949   HQIC                           757.625
                                     - 07-01-1958                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.2048      0.090     -2.271      0.023      -0.382      -0.028
sigma2        92.6974     13.326      6.956      0.000      66.579     118.816
===================================================================================
Ljung-Box (L1) (Q):                   0.03   Jarque-Bera (JB):                 2.44
Prob(Q):                              0.87   Prob(JB):                         0.30
Heteroskedasticity (H):               0.96   Skew:                             0.37
Prob(H) (two-sided):                  0.92   Kurtosis:                         2.86
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Using the SARIMA model with order \((1,1,0)(0,1,0)_{12}\), we fit on the training set and forecast over the test window:

mod = sm.tsa.SARIMAX(train_set['Passengers'],
                     order=(1, 1, 0),
                     seasonal_order=(0, 1, 0, 12))
results = mod.fit(disp=False)  # disp=False suppresses the L-BFGS-B optimizer log
forecast = results.get_forecast(steps=len(test_set))
forecast_mean = forecast.predicted_mean
# Align the forecast index with the test period, which starts at 1958-08
forecast_dates = pd.date_range(start='1958-08-01', periods=len(forecast_mean), freq='MS')
forecast_mean.index = forecast_dates

We then combine them:

combined = pd.concat([train_set['Passengers'], test_set['Passengers'], forecast_mean], axis=1)
combined.columns = ['Train', 'Test', 'Forecast']
combined_reset = combined.reset_index().rename(columns={'index': 'Date'})
combined_melted = combined_reset.melt('Date', var_name='Type', value_name='Passengers')

We then visualize the historical data together with the forecast:

chart0 = alt.Chart(combined_melted).mark_line().encode(
    alt.X('Date:T', title='Date'),
    alt.Y('Passengers:Q', title='# of Passengers'),
    color='Type:N',
    tooltip=['Date:T', 'Passengers:Q', 'Type:N']
).interactive().properties(
    width=500,
    height=300
)
chart0
residuals = results.resid
test = sm.stats.acorr_ljungbox(residuals, return_df=True)
print(test)
     lb_stat  lb_pvalue
1   0.002170   0.962842
2   0.933702   0.626974
3   1.332964   0.721321
4   1.445974   0.836166
5   1.864579   0.867548
6   1.889649   0.929563
7   2.790449   0.903687
8   3.659371   0.886475
9   3.980878   0.912667
10  5.797085   0.832013

With p-values this large, we fail to reject the null hypothesis of independently distributed residuals: there is no evidence of remaining autocorrelation, so the model appears to capture the dependence structure in the data.
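
For a visual complement to the Ljung-Box table, statsmodels bundles the standard residual diagnostics (standardized residuals, histogram, Q-Q plot, and correlogram) into a single call; a quick sketch:

results.plot_diagnostics(figsize=(10, 8))
plt.show()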

Now check how accurate this model is on the test set:

Mean error:

errors = forecast_mean - test_set['Passengers']
mean_error = np.mean(errors)
print("Mean Error:", mean_error)
Mean Error: -22.959210980795042

RMSE:

rmse = np.sqrt(mean_squared_error(test_set['Passengers'], forecast_mean))
print("Root Mean Squared Error (RMSE):", rmse)
Root Mean Squared Error (RMSE): 35.08281311854609

MAPE:

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = mean_absolute_percentage_error(test_set['Passengers'].dropna(), forecast_mean.dropna())
print("Mean Absolute Percentage Error (MAPE):",mape)
Mean Absolute Percentage Error (MAPE): 6.124549631659345
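
For context, these errors can be benchmarked against a seasonal naive forecast that simply repeats the most recent observed season. This is a sketch of my own, not part of the original analysis; it cycles the last 12 training months across the 29-month test window:

# Seasonal naive baseline: repeat the last observed season of the training set
last_season = train_set['Passengers'].iloc[-12:].values
naive_forecast = np.array([last_season[i % 12] for i in range(len(test_set))])
naive_rmse = np.sqrt(mean_squared_error(test_set['Passengers'], naive_forecast))
print("Seasonal naive RMSE:", naive_rmse)

The SARIMA model should beat this baseline for its extra complexity to be worthwhile.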

Finally, we train the model on the full data set and forecast the next 24 months:

model_1 = sm.tsa.SARIMAX(df['Passengers'],
                         order=(1, 1, 0),
                         seasonal_order=(0, 1, 0, 12))
results_1 = model_1.fit(disp=False)  # suppress the optimizer log
forecast1 = results_1.get_forecast(steps=24)
forecast1_mean = forecast1.predicted_mean
forecast1_mean.head()
1961-01-01    444.307609
1961-02-01    418.212986
1961-03-01    446.242093
1961-04-01    488.233139
1961-05-01    499.235893
Freq: MS, Name: predicted_mean, dtype: float64
df_vis = df.reset_index()
forecast_df = pd.DataFrame({'Date': forecast1_mean.index, 'Passengers': forecast1_mean.values})
df_vis['Type'] = 'Historical'
forecast_df['Type'] = 'Forecasted'
combined_df = pd.concat([df_vis, forecast_df])
combined_df['Month_Year'] = combined_df['Month'].fillna(combined_df['Date'])
chart1 = alt.Chart(combined_df).mark_line().encode(
    alt.X('Month_Year:T', title='Month'),
    alt.Y('Passengers:Q', title='# of Passengers'),
    color='Type:N',
    tooltip=['Month_Year:T', 'Passengers:Q', 'Type:N']
).interactive().properties(
    width=500,
    height=300,
    title='Forecasted Numbers of Air Passenger for the Next 24 Months'
)
chart1
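
The point forecast alone understates uncertainty. get_forecast also exposes confidence intervals, which could be layered under the chart as a band; a sketch (the band_df name and the layering are my own, and the interval columns are accessed by position to avoid assuming their labels):

# 95% confidence interval for the 24-month forecast
conf_int = forecast1.conf_int()
band_df = pd.DataFrame({
    'Date': forecast1_mean.index,
    'lower': conf_int.iloc[:, 0].values,
    'upper': conf_int.iloc[:, 1].values,
})
band = alt.Chart(band_df).mark_area(opacity=0.3).encode(
    alt.X('Date:T', title='Month'),
    alt.Y('lower:Q', title='# of Passengers'),
    alt.Y2('upper:Q')
)
band + chart1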