Test
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.diagnostic import acorr_ljungbox
import matplotlib.pyplot as plt
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error
from statsmodels.graphics.tsaplots import plot_pacf
import altair as alt
import warnings
warnings.filterwarnings("ignore" )
np.random.uniform(0 ,1 ,10 )
array([0.78874915, 0.6862814 , 0.23030092, 0.05259818, 0.50130469,
0.18490017, 0.02031174, 0.52975919, 0.79594482, 0.52905825])
I am using AirPassenger data to demonstrate time series forecasting
What is time series forecasting?
A sequence of numbers that are indexed in time format
Using historical data to predict future
Common examples using time series forecasting: are sales, stock prices, inventory, weather, energy consumption, unemployment rate, GDP …….. and so on
It can be in an hourly, daily, weekly, monthly, or yearly format, the index must be time in your dataset
What am I trying to do?
I have a data set that contains the number of Air passengers at an airport, the data was recorded in a monthly format, and the data was ranged from January 1949 to December 1960
I am trying to predict the number of passengers for the next 24 months
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url)
df['Month' ] = pd.to_datetime(df['Month' ])
df.set_index('Month' , inplace= True )
df.head()
Month
1949-01-01
112
1949-02-01
118
1949-03-01
132
1949-04-01
129
1949-05-01
121
I am using \(80\%\) of the data as training set, the rest of them as test set
split_point = int (len (df) * 0.8 )
print ("We are using" ,split_point, "months as training data" )
We are using 115 months as training data
train_set = df.iloc[:split_point]
test_set = df.iloc[split_point:]
train_set.tail()
test_set.head()
Month
1958-08-01
505
1958-09-01
404
1958-10-01
359
1958-11-01
310
1958-12-01
337
adf_test = adfuller(df['Passengers' ])
print (adf_test)
(0.8153688792060442, 0.9918802434376409, 13, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 996.692930839019)
The series is not stationary, we have evidence to suspect that there’s a seasonality
Let’s try to difference the series with \(D=1 \wedge d=1\)
ACF:
plot_acf(df['Passengers' ], lags= 24 )
plt.show()
df1 = df.diff()
df1 = df1.dropna()
plot_acf(df1['Passengers' ], lags= 24 )
plt.show()
df11 = df1['Passengers' ].diff(12 )
df11 = df11.dropna()
plot_acf(df11, lags= 24 )
plt.show()
PACF:
plot_pacf(df['Passengers' ], lags= 24 )
plt.show()
plot_pacf(df1, lags= 24 )
plt.show()
plot_pacf(df11, lags= 24 )
plt.show()
adf_test1 = adfuller(df11)
print (adf_test1)
(-15.59561808374634, 1.856511600123444e-28, 0, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 919.8428088960275)
The series is stationary after \(d=1 \wedge D=1\)
Hypotheses:
\[H_{0}: \text{The residuals are independently distributed} \text{ Versus } H_{1}: \text{The residuals are not independently distributed} \implies \text{ There exists an auto-correlation} \]
ljung_box_result = acorr_ljungbox(train_set['Passengers' ], return_df= True )
print (ljung_box_result)
lb_stat lb_pvalue
1 98.782569 2.818041e-23
2 179.006635 1.346490e-39
3 247.948735 1.817517e-53
4 308.582223 1.525346e-65
5 364.311826 1.449747e-76
6 418.693975 2.670613e-87
7 470.681092 1.603923e-97
8 522.179836 1.222667e-107
9 579.035539 6.599530e-119
10 642.002302 1.746688e-131
We failed to reject the null hypothesis, there’s an auto-correlation. We are suspecting a seasonality here
model = auto_arima(train_set['Passengers' ],
start_p= 0 , start_q= 0 ,
max_p= 5 , max_q= 5 ,
m= 12 ,
seasonal= True ,
trace= True ,
stepwise= True )
print (model.summary())
Performing stepwise search to minimize aic
ARIMA(0,1,0)(1,1,1)[12] : AIC=758.922, Time=0.17 sec
ARIMA(0,1,0)(0,1,0)[12] : AIC=757.826, Time=0.02 sec
ARIMA(1,1,0)(1,1,0)[12] : AIC=755.750, Time=0.07 sec
ARIMA(0,1,1)(0,1,1)[12] : AIC=756.380, Time=0.10 sec
ARIMA(1,1,0)(0,1,0)[12] : AIC=755.499, Time=0.03 sec
ARIMA(1,1,0)(0,1,1)[12] : AIC=755.982, Time=0.10 sec
ARIMA(1,1,0)(1,1,1)[12] : AIC=757.649, Time=0.17 sec
ARIMA(2,1,0)(0,1,0)[12] : AIC=756.771, Time=0.04 sec
ARIMA(1,1,1)(0,1,0)[12] : AIC=756.022, Time=0.05 sec
ARIMA(0,1,1)(0,1,0)[12] : AIC=756.011, Time=0.03 sec
ARIMA(2,1,1)(0,1,0)[12] : AIC=757.976, Time=0.05 sec
ARIMA(1,1,0)(0,1,0)[12] intercept : AIC=757.438, Time=0.05 sec
Best model: ARIMA(1,1,0)(0,1,0)[12]
Total fit time: 0.889 seconds
SARIMAX Results
==========================================================================================
Dep. Variable: y No. Observations: 115
Model: SARIMAX(1, 1, 0)x(0, 1, 0, 12) Log Likelihood -375.750
Date: Wed, 19 Jun 2024 AIC 755.499
Time: 22:05:33 BIC 760.749
Sample: 01-01-1949 HQIC 757.625
- 07-01-1958
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.2048 0.090 -2.271 0.023 -0.382 -0.028
sigma2 92.6974 13.326 6.956 0.000 66.579 118.816
===================================================================================
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 2.44
Prob(Q): 0.87 Prob(JB): 0.30
Heteroskedasticity (H): 0.96 Skew: 0.37
Prob(H) (two-sided): 0.92 Kurtosis: 2.86
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Using SARIMA model with order (1,1,0)(0,1,0) [12] we have:
mod = sm.tsa.SARIMAX(train_set['Passengers' ],
order= (1 , 1 , 0 ),
seasonal_order= (0 , 1 , 0 , 12 ))
results = mod.fit()
forecast = results.get_forecast(steps= len (test_set))
forecast_mean = forecast.predicted_mean
forecast_dates = pd.date_range(start= '1958-08-01' , periods= len (forecast_mean), freq= 'MS' )
forecast_mean.index = forecast_dates
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 2 M = 10
At X0 0 variables are exactly at the bounds
At iterate 0 f= 3.26740D+00 |proj g|= 1.30394D-03
At iterate 5 f= 3.26739D+00 |proj g|= 1.18637D-03
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
2 8 10 1 0 0 5.870D-07 3.267D+00
F = 3.2673885458939407
CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
This problem is unconstrained.
We then combine them:
combined = pd.concat([train_set['Passengers' ], test_set['Passengers' ], forecast_mean], axis= 1 )
combined.columns = ['Train' , 'Test' , 'Forecast' ]
combined_reset = combined.reset_index().rename(columns= {'index' : 'Date' })
combined_melted = combined_reset.melt('Date' , var_name= 'Type' , value_name= 'Passengers' )
We then visualize the historical data and the forecasting points
chart0 = alt.Chart(combined_melted ).mark_line().encode(
alt.X('Date:T' , title= 'Date' ),
alt.Y('Passengers:Q' , title= '# of Passengers' ),
color= 'Type:N' ,
tooltip= ['Date:T' , 'Passengers:Q' , 'Type:N' ]
).interactive().properties(
width= 500 ,
height= 300
)
chart0
residuals = results.resid
test = sm.stats.acorr_ljungbox(residuals, return_df= True )
print (test)
lb_stat lb_pvalue
1 0.002170 0.962842
2 0.933702 0.626974
3 1.332964 0.721321
4 1.445974 0.836166
5 1.864579 0.867548
6 1.889649 0.929563
7 2.790449 0.903687
8 3.659371 0.886475
9 3.980878 0.912667
10 5.797085 0.832013
we reject the null hypothesis that there’s an auto-correlation in the residuals
Check how accurate this model is:
Mean error:
errors = forecast_mean - test_set['Passengers' ]
mean_error = np.mean(errors)
print ("Mean Error:" , mean_error)
Mean Error: -22.959210980795042
RMSE:
rmse = np.sqrt(mean_squared_error(test_set['Passengers' ], forecast_mean))
print ("Root Mean Squared Error (RMSE):" , rmse)
Root Mean Squared Error (RMSE): 35.08281311854609
MAPE:
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs ((y_true - y_pred) / y_true)) * 100
mape = mean_absolute_percentage_error(test_set['Passengers' ].dropna(), forecast_mean.dropna())
print ("Mean Absolute Percentage Error (MAPE):" ,mape)
Mean Absolute Percentage Error (MAPE): 6.124549631659345
Full trained model:
model_1 = sm.tsa.SARIMAX(df['Passengers' ],
order= (1 ,1 ,0 ),
seasonal_order= (0 ,1 ,0 ,12 ))
results_1 = model_1.fit()
forecast1 = results_1.get_forecast(steps= 24 )
forecast1_mean = forecast1.predicted_mean
forecast1_mean.head()
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 2 M = 10
At X0 0 variables are exactly at the bounds
At iterate 0 f= 3.52916D+00 |proj g|= 1.70840D-03
At iterate 5 f= 3.52915D+00 |proj g|= 1.49881D-03
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
2 8 10 1 0 0 8.580D-06 3.529D+00
F = 3.5291438863545266
CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
This problem is unconstrained.
1961-01-01 444.307609
1961-02-01 418.212986
1961-03-01 446.242093
1961-04-01 488.233139
1961-05-01 499.235893
Freq: MS, Name: predicted_mean, dtype: float64
df_vis = df.reset_index()
forecast_df = pd.DataFrame({'Date' : forecast1_mean.index, 'Passengers' : forecast1_mean.values})
df_vis['Type' ] = 'Historical'
forecast_df['Type' ] = 'Forecasted'
combined_df = pd.concat([df_vis, forecast_df])
combined_df['Month_Year' ] = combined_df['Month' ].fillna(combined_df['Date' ])
chart1 = alt.Chart(combined_df).mark_line().encode(
alt.X('Month_Year:T' , title= 'Month' ),
alt.Y('Passengers:Q' , title= '# of Passengers' ),
color= 'Type:N' ,
tooltip= ['Month_Year:T' , 'Passengers:Q' , 'Type:N' ]
).interactive(
).properties(
width= 500 ,
height= 300 ,
title= 'Forecasted Numbers of Air Passenger for the Next 24 Months'
)
chart1