A seasonal time series is read and fitted with deterministic and stochastic trend models. A residual approach is followed for the stochastic models: a set of candidate models is identified, each candidate is checked for significant coefficients, and only the models with significant coefficients proceed to diagnostic checking. Residual analysis is then conducted on each of these models, and the model whose residuals are stationary is selected for forecasting.
The Air Passengers dataset is downloaded from Kaggle. It gives the monthly total number of passengers of a US airline from 1949 to 1960. This report analyses the data using a suitable time series model and forecasts the number of passengers for the next 24 months.
The significance level is considered as 5%.
#install.packages("TSA")
library(TSA)
#install.packages("tseries")
library(tseries)
#install.packages("lmtest")
library(lmtest)
#install.packages("FitAR")
library(FitAR)
#install.packages("forecast")
library(forecast)
From Figure 1, it is clear that the data set contains 2 columns, namely Month and #Passengers.
# Data is read using read.csv()
air_passenger_df <- read.csv("AirPassengers.csv")
# To display the first 6 observations
head(air_passenger_df)
## Month X.Passengers
## 1 1949-01 112
## 2 1949-02 118
## 3 1949-03 132
## 4 1949-04 129
## 5 1949-05 121
## 6 1949-06 135
For better readability, the column #Passengers is renamed to Number_of_Passengers.
# Rename the columns
colnames(air_passenger_df) <- (c("Month","Number_of_Passengers"))
# To get the details of last 2 observations
tail(air_passenger_df,2)
## Month Number_of_Passengers
## 143 1960-11 390
## 144 1960-12 432
summary(air_passenger_df)
## Month Number_of_Passengers
## Length:144 Min. :104.0
## Class :character 1st Qu.:180.0
## Mode :character Median :265.5
## Mean :280.3
## 3rd Qu.:360.5
## Max. :622.0
From the above descriptive statistics output, it is inferred that in the period from Jan 1949 to Dec 1960, the average number of passengers is 280, the maximum number of passengers in a month is 622 and the minimum number of passengers in a month is 104.
# To convert into a Time series object, since it is a monthly data frequency is set to 12.
air_passenger_df_TS <- ts(as.vector(air_passenger_df$Number_of_Passengers), start=c(1949,1), end=c(1960,1),frequency = 12)
# Checking the class of the object
class(air_passenger_df_TS)
## [1] "ts"
To avoid issues when converting a data frame column directly, the column is first converted to a vector and then to a time series object.
air_passenger_df_TS
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417
From Figure 1 and the above output, it is clear that the time series object has been created correctly.
## To plot the time series object
plot(air_passenger_df_TS,type='o',ylab='Number of Passengers',main='Time series plot of Airline Passengers')
It is hard to comment on the seasonality from the above plot, hence each point in the graph is replaced with the first letter of its month.
# Remove type='o'
plot(air_passenger_df_TS,ylab='Number of Passengers',main='Time series plot of Airline Passengers')
points(y=air_passenger_df_TS,x=time(air_passenger_df_TS),pch=as.vector(season(air_passenger_df_TS)))
A clearer decision on the behaviour of the series can be made by checking whether consecutive months are correlated. To do this, the number of passengers in each month is plotted against the number in the previous month.
The time series object is copied into series_without_lag and the first 6 observations are displayed.
# Copying the time series data to series_without_lag
series_without_lag = air_passenger_df_TS
head(series_without_lag)
## Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
The first lag of the time series is created using the zlag() method and assigned to first_lag_ts_series.
first_lag_ts_series = zlag(air_passenger_df_TS)
# To display the first 6 observations of the lag-1 time series
head(first_lag_ts_series)
## [1] NA 112 118 132 129 121
The first observation is NA because the lag is created by shifting the series by one position. To omit this missing value, only the values from the 2nd index onward are considered.
index = 2:length(first_lag_ts_series)
plot(y=air_passenger_df_TS,x=first_lag_ts_series,ylab='Original Airline Passengers', xlab='First lag of Airline Passengers', main="Relation between the original time series and its first lag")
From the plot, it can be inferred that there exists a strong positive correlation between the original Airline Passengers series and its first lag.
The correlation between series_without_lag and first_lag_ts_series is computed using the cor() method.
cor(series_without_lag[index],first_lag_ts_series[index])
## [1] 0.9570215
From the above output, the correlation between consecutive months is 0.957 (equivalently, about 91.6% of the variation in a month's value is explained linearly by the previous month's value). This strong positive autocorrelation suggests that air_passenger_df_TS has an autoregressive (AR) character, since correlation with past values is the defining characteristic of an AR series.
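As a supplementary check (an addition to the original analysis), the same cor()/zlag() idea can be extended to a few higher lags, including the seasonal lag 12; the loop below is an illustrative sketch.
# Illustrative sketch: correlation of the series with its k-th lag for a few lags
for (k in c(1, 2, 12)) {
  lagged_series <- zlag(air_passenger_df_TS, k)   # k-th lag (leading values are NA)
  idx <- (k + 1):length(lagged_series)            # drop the leading NA values
  cat("Lag", k, "correlation:", round(cor(series_without_lag[idx], lagged_series[idx]), 4), "\n")
}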
Even though the series appears to have a stochastic trend, it is of interest to see how deterministic trend models fit this series.
The linear trend model is
\(Y_{t} = µ_{t} + Ɛ_{t}\), with \(µ_{t} = β_{0} + β_{1}t\),
where,
\(µ_{t}\): Trend component,
\(β_{0}\): Intercept of the linear regression model,
\(β_{1}\): Slope of the linear regression,
\(Ɛ_{t}\): Residuals after capturing the trend.
In R, the first step is to extract the time component from the air_passenger_df_TS time series object. The extracted time is assigned to time_of_air_passenger_df_TS.
time_of_air_passenger_df_TS <- time(air_passenger_df_TS)
The model is fitted using the lm() method; summary() gives the summary of the fit and abline() draws the line of best fit.
linear_regression <-lm(air_passenger_df_TS ~ time_of_air_passenger_df_TS)
# To get the output of the model
summary(linear_regression)
##
## Call:
## lm(formula = air_passenger_df_TS ~ time_of_air_passenger_df_TS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.688 -25.704 -4.379 20.729 139.287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59739.522 2219.247 -26.92 <2e-16 ***
## time_of_air_passenger_df_TS 30.700 1.135 27.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.9 on 131 degrees of freedom
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8469
## F-statistic: 731 on 1 and 131 DF, p-value: < 2.2e-16
From the above output, the number of passengers (y) is estimated by the equation:
\(\hat{y} = -59739.522 + 30.700\,t\)
where \(\beta_{0}\) is -59739.522 and \(\beta_{1}\) is 30.700.
Both coefficients are statistically significant at the 5% significance level.
The overall model is significant as the p-value is < 2.2e-16, i.e., less than 0.05. The adjusted R-squared is 0.8469, which implies that 84.69% of the variation in the number of passengers is explained by this model.
# Plot to see the fit of regression on time series
plot(air_passenger_df_TS,type='o',ylab='AirLine Passengers',col="blue", main = "Time series plot for Airline Passengers ")
abline(linear_regression,col="red")
From the above plot, the blue line represents the actual time series and the red line represents the linear model fit. It is clear that this model captures the upward trend but fails to capture the seasonality in the series. Hence this model cannot be considered for prediction.
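As a quick illustrative check (an addition to the original report), plotting the studentised residuals of the linear fit as a monthly series makes the unexplained seasonal pattern visible.
# Illustrative sketch: residuals of the linear trend fit, viewed as a monthly series
linear_model_residuals <- ts(rstudent(linear_regression), start = c(1949, 1), frequency = 12)
plot(linear_model_residuals, type = 'o', ylab = 'Studentised residuals',
     main = 'Residuals of the linear trend model')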
The quadratic model is suited to series with a quadratic trend. As the name suggests, this model requires the square of the extracted time as an additional regressor. This tests whether the quadratic term of time (\(t^2\)) helps in predicting the number of passengers.
The quadratic trend model is
\(Y_{t} = µ_{t} + Ɛ_{t}\), with \(µ_{t} = β_{0} + β_{1}t + β_{2}t^{2}\),
where,
\(µ_{t}\): Trend component we want to model,
\(β_{0}\): Intercept of the quadratic regression model,
\(β_{1}\): Coefficient estimate of \(t\),
\(β_{2}\): Coefficient estimate of \(t^{2}\),
\(Ɛ_{t}\): Residuals after capturing the trend.
The lm() method is used to fit the model with the square of the extracted time.
# To find the square of the time
square_of_time_of_air_passenger_df_TSS <- time_of_air_passenger_df_TS ^ 2
#Fitting the quadratic model
quadratic_model <-lm(air_passenger_df_TS ~ time_of_air_passenger_df_TS + square_of_time_of_air_passenger_df_TSS)
# To get the summary of the model
summary(quadratic_model)
##
## Call:
## lm(formula = air_passenger_df_TS ~ time_of_air_passenger_df_TS +
## square_of_time_of_air_passenger_df_TSS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.247 -22.283 -8.455 20.384 125.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.321e+06 1.493e+06 2.225 0.0278 *
## time_of_air_passenger_df_TS -3.429e+03 1.527e+03 -2.245 0.0265 *
## square_of_time_of_air_passenger_df_TSS 8.850e-01 3.907e-01 2.265 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.25 on 130 degrees of freedom
## Multiple R-squared: 0.8538, Adjusted R-squared: 0.8516
## F-statistic: 379.6 on 2 and 130 DF, p-value: < 2.2e-16
From the above output, the number of passengers (y) is estimated by the equation:
\(\hat{y} = 3.321\times10^{6} - 3.429\times10^{3}\,t + 0.885\,t^{2}\)
where \(\beta_{0}\) is 3.321e+06, \(\beta_{1}\) (the coefficient of time_of_air_passenger_df_TS) is -3.429e+03, and \(\beta_{2}\) (the coefficient of square_of_time_of_air_passenger_df_TSS) is 8.850e-01.
All the coefficients are statistically significant at the 5% significance level.
The overall model is significant as the p-value is < 2.2e-16, i.e., less than 0.05. The adjusted R-squared is 0.8516, i.e., 85.16% of the variation in the number of passengers is explained by this model.
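As a quick, illustrative comparison (not in the original report), the AIC of the two deterministic fits can be compared directly; a lower value indicates a better trade-off between fit and complexity.
# Illustrative sketch: compare the linear and quadratic trend models by AIC
AIC(linear_regression, quadratic_model)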
# Visualising the plot
plot(ts(fitted(quadratic_model)), ylim = c(min(c(fitted(quadratic_model), as.vector(air_passenger_df_TS))), max(c(fitted(quadratic_model),as.vector(air_passenger_df_TS)))),
ylab='AirLine Passengers' , main = "Fitted quadratic curve to Airline Passengers", type="l",lty=2,col="red")
lines(as.vector(air_passenger_df_TS),type="o",col="navyblue")
Similar to the linear model, the quadratic model captures only the trend and fails to capture the seasonality. Thus this model cannot be considered for forecasting.
In seasonal models, indicator variables for the repeating pattern over time are created. As the time series is monthly data, the frequency is 12, so there is a parameter for each month.
month.=season(air_passenger_df_TS)
seasonal_model =lm(air_passenger_df_TS ~ month. -1) # -1 removes the intercept term
summary(seasonal_model)
##
## Call:
## lm(formula = air_passenger_df_TS ~ month. - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.91 -81.64 -20.64 89.64 231.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## month.January 241.75 30.53 7.919 1.30e-12 ***
## month.February 220.82 31.88 6.926 2.25e-10 ***
## month.March 256.64 31.88 8.049 6.54e-13 ***
## month.April 249.45 31.88 7.824 2.16e-12 ***
## month.May 253.64 31.88 7.955 1.08e-12 ***
## month.June 291.36 31.88 9.138 1.82e-15 ***
## month.July 326.73 31.88 10.247 < 2e-16 ***
## month.August 327.91 31.88 10.284 < 2e-16 ***
## month.September 283.73 31.88 8.899 6.72e-15 ***
## month.October 248.91 31.88 7.807 2.36e-12 ***
## month.November 218.55 31.88 6.854 3.22e-10 ***
## month.December 246.36 31.88 7.727 3.59e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105.7 on 121 degrees of freedom
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8618
## F-statistic: 70.09 on 12 and 121 DF, p-value: < 2.2e-16
All 12 monthly coefficients are significant, with the values shown in the above output. The adjusted R-squared is 0.8618, which implies that 86.18% of the variation in the number of passengers is explained by these coefficients. The overall model is significant as the p-value is less than the significance level.
Let’s visualise the fit on the time series data.
# To visualise the fit
plot(ts(fitted(seasonal_model)), ylab='AirLine Passengers',
main = "Fitted seasonal model to Airline Passengers series" ,
ylim = c(min(c(fitted(seasonal_model), as.vector(air_passenger_df_TS))) ,
max(c(fitted(seasonal_model), as.vector(air_passenger_df_TS))) ),col="red")
lines(as.vector(air_passenger_df_TS),type="o")
From figure 7, it is inferred that the seasonal model captured the seasonality but completely failed to capture the trend. Hence this model can be discarded.
Similar to the seasonal model, the cosine (harmonic) model is used to capture the repeating pattern in the data; however, instead of a separate parameter for each month, the cyclic cosine and sine functions are used to create the repeating pattern.
The harmonic model is
\(Y_{t} = µ_{t} + Ɛ_{t}\), with \(µ_{t} = β_{0} + β_{1}\cos(2\pi t) + β_{2}\sin(2\pi t)\),
where,
\(µ_{t}\): Trend component we want to model,
\(β_{0}\): Intercept of the cosine regression model,
\(β_{1}\): Coefficient estimate of the cosine term \(\cos(2\pi t)\),
\(β_{2}\): Coefficient estimate of the sine term \(\sin(2\pi t)\),
\(Ɛ_{t}\): Residuals after capturing the trend.
In R, the harmonic() method is used to create the cosine and sine terms.
har. <- harmonic(air_passenger_df_TS, 1)
data <- data.frame(air_passenger_df_TS,har.)
cosine_model <- lm(air_passenger_df_TS ~ cos.2.pi.t. + sin.2.pi.t. , data = data)
summary(cosine_model)
##
## Call:
## lm(formula = air_passenger_df_TS ~ cos.2.pi.t. + sin.2.pi.t.,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.52 -79.52 -18.95 81.93 257.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 263.962 8.997 29.340 < 2e-16 ***
## cos.2.pi.t. -41.012 12.676 -3.235 0.00154 **
## sin.2.pi.t. -3.929 12.771 -0.308 0.75886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.8 on 130 degrees of freedom
## Multiple R-squared: 0.07515, Adjusted R-squared: 0.06092
## F-statistic: 5.281 on 2 and 130 DF, p-value: 0.006233
From the above output, \(\beta_{1}\) (the coefficient of cos.2.pi.t.) is -41.012 with p-value 0.00154 and \(\beta_{2}\) (the coefficient of sin.2.pi.t.) is -3.929 with p-value 0.75886.
\(\beta_{0}\) and \(\beta_{1}\) are significant coefficients, whereas \(\beta_{2}\) is not significant. The adjusted R-squared is 0.06092, i.e., only about 6% of the variation in the series is explained by this model.
plot(ts(fitted(cosine_model)), ylab='AirLine Passengers', main = "Fitted seasonal model for Airline Passeneger using cosine " , ylim = c(min(c(fitted(cosine_model), as.vector(air_passenger_df_TS))) ,max(c(fitted(cosine_model), as.vector(air_passenger_df_TS))) ),col="red")
lines(as.vector(air_passenger_df_TS),type="o")
From the plot, it is clear that the cosine model fails to capture the trend. Due to its low R-squared value and its failure to capture the trend, this model is discarded.
As the series has a stochastic trend and seasonality, let's try fitting stochastic models.
Stationarity and the presence of autocorrelation are the primary properties that need to be examined before building an autoregressive model. Hence, let's check whether these hold for the series.
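In addition to the visual checks that follow, a formal unit-root test can be applied. The sketch below (an illustrative addition, not part of the original report) uses the Augmented Dickey-Fuller test from the tseries package loaded earlier; its null hypothesis is that the series is non-stationary.
# Illustrative sketch: ADF test; a large p-value means the null of non-stationarity is not rejected
adf.test(air_passenger_df_TS)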
## To plot the time series object
plot(air_passenger_df_TS,type='o',ylab='Number of Passengers',main='Time series plot of Airline Passengers')
points(y=air_passenger_df_TS,x=time(air_passenger_df_TS),pch=as.vector(season(air_passenger_df_TS)))
From the plot above, the air_passenger_df_TS time series is not stationary, as it shows a clear trend and seasonality. Further, the autocorrelation in the series is examined with the ACF plot and the order of the AR component with the PACF plot. A function plot_acf_and_pacf is defined to produce these plots, as shown below.
# The function accepts the series, number of lags and the series names to be given on the plots
plot_acf_and_pacf <- function(df_TS, lags_given, df_name) {
  par(mfrow=c(1,2))
  acf(df_TS, lag.max=lags_given, main=paste("ACF Plot of", df_name))
  pacf(df_TS, lag.max=lags_given, main=paste("PACF Plot of", df_name))
}
plot_acf_and_pacf(air_passenger_df_TS, 64,"Air Passengers")
From Figure 10, there is a wave-like, slowly decaying pattern in the ACF plot, which confirms visually that the series is not stationary. The decaying pattern at the seasonal lags indicates a seasonal trend in the series. There is no clear pattern in the PACF plot.
As the seasonality is confirmed from the time series and ACF plots, let's fit a SARIMA model. This model has the following parameters:
SARIMA (p,d,q) \(\times\) (P,D,Q)\(_{s}\)
p: Autoregressive (AR) order,
d: Number of ordinary differences,
q: Moving Average (MA) order,
P: Seasonal AR order,
D: Number of seasonal differences,
Q: Seasonal MA order,
s: Period or frequency of the series; here the value is 12 as it is monthly data.
A residual approach is followed to find the specification of the SARIMA model: start with one seasonal difference, examine the residuals, and if any pattern is left, rebuild the model with the parameters suggested by the ACF and PACF plots, repeating the process until no pattern remains.
Initially, we fit the model with a seasonal difference of order D = 1.
# p,d,q P,D,Q
model1_airline = arima(air_passenger_df_TS,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model1 = residuals(model1_airline);
par(mfrow=c(1,1))
plot(res_of_model1,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")
From Figure 11, it is inferred that there is still a trend left in the residuals, i.e., the model has not captured the trend, although the residuals are centred near 0. The ACF and PACF plots are used to check whether the remaining autocorrelation is significant.
plot_acf_and_pacf(res_of_model1, 48,"Residual of M1")
From the ACF and PACF plots, there are no significant autocorrelations at the seasonal lags, which means that the seasonal pattern is captured by the seasonal differencing of order D = 1. Hence the seasonal orders are: P = 0 (from the PACF plot), D = 1 (one seasonal difference), and Q = 0 (from the ACF plot).
Before finding the ARMA orders, the series needs to be stabilised, as there is a change in variance over time; a log transformation is applied for this.
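As a hedged cross-check on this choice (an addition to the original report), the Box-Cox transformation parameter estimated by the forecast package can be inspected; a value near 0 supports the log transformation used in the model below.
# Illustrative sketch: estimate the Box-Cox transformation parameter
BoxCox.lambda(air_passenger_df_TS)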
# p,d,q P,D,Q
model2_airline = arima(log(air_passenger_df_TS),order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model2 = residuals(model2_airline);
par(mfrow=c(1,1))
plot(res_of_model2,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")
plot_acf_and_pacf(res_of_model2, 48,"Residual of M2")
From Figure 13, the variance is reduced, but some change in variance remains in the series. From Figure 14, there are significant autocorrelations before the first seasonal lag in the ACF, so an ordinary difference is applied before finding the AR and MA orders.
# p,d,q P,D,Q
model3_airline = arima(log(air_passenger_df_TS),order=c(0,1,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model3 = residuals(model3_airline);
par(mfrow=c(1,1))
plot(res_of_model3,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")
From figure 15, the residuals are spread around 0 and the series is stationary.
plot_acf_and_pacf(res_of_model3, 48,"Residual of M3")
From the ACF plot, there are 2 significant lags before the first seasonal lag, and from the PACF plot there are 3 significant lags before the first seasonal lag. These correlations are not captured by the current model and remain in the residuals. Hence, we fit the SARIMA model with an ARMA specification of AR(3) and MA(2).
# p,d,q P,D,Q
model4_airline = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12), method="ML")
res_of_model4 = residuals(model4_airline);
par(mfrow=c(1,1))
plot(res_of_model4,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")
plot_acf_and_pacf(res_of_model4, 36,"Residual of M4")
From figure 18, it is clear that there is no significant autocorrelation left in both ACF and PACF. This implies that the model captures the trend and seasonality of the airline passengers series.
To find the set of possible models, the extended ACF (EACF) can be used. The eacf() function identifies the orders of the autoregressive and moving average components of an ARMA model. It is applied to res_of_model3, as these residuals still contain significant autocorrelations. The candidate ARMA specifications are read from the top-left vertex of the triangular pattern of 'o' symbols across the AR rows and MA columns.
eacf(res_of_model3)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o x o o o o o o o o x o o
## 1 o o x o o o o o x o o x o o
## 2 o x x o o o o o x o o x o x
## 3 x x o o o o o o o o o x o o
## 4 x o o o o o o o o o o x x o
## 5 x o x o o o o o o o o x o o
## 6 x x o o o o o o o o o x o o
## 7 x x o o o o o o o o o x o o
From the above output, a vertex of the triangle of 'o' symbols can be placed at AR order 0 and MA order 3, with no nearby 'x' values, hence these orders are taken as candidates. The set of possible non-seasonal orders (with d = 1 from the ordinary differencing) is:
* ARIMA (0,1,3)
* ARIMA (0,1,4)
* ARIMA (1,1,3)
BIC can also be used to find possible AR and MA orders. BIC penalises model complexity through a penalty term, so the models with the lowest BIC values are preferred and appear at the top of the plot with dark shading.
res = armasubsets(y=res_of_model3,nar=6,nma=6,y.name='test',ar.method='ols')
plot(res)
From the BIC table, the possible AR orders are 1 and 3, while the suggested MA order is 0 (the EACF had suggested MA orders of 3 and 4). Hence the additional possible non-seasonal ARIMA specifications are:
* ARIMA (1,1,0)
* ARIMA (3,1,0)
The set of possible models is {
SARIMA (0,1,3) \(\times\) (0,1,0)\(_{12}\), SARIMA (0,1,4) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,3) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,4) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,4) \(\times\) (0,1,0)\(_{12}\) }.
Each of the possible models is fitted using both the CSS (conditional sum-of-squares) and ML (maximum likelihood) methods. Each coefficient is then checked for significance at the 0.05 level by conducting a z-test with the coeftest() method.
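The candidate models are fitted and tested one by one below. As an aside (an illustrative sketch, not part of the original workflow), the same fit-and-test procedure could be written as a loop over the candidate non-seasonal orders:
# Illustrative sketch: fit each candidate specification with CSS and ML and print the z-tests
candidate_orders <- list(c(0,1,3), c(0,1,4), c(1,1,0), c(1,1,3), c(1,1,4), c(3,1,0), c(3,1,2), c(3,1,4))
for (ord in candidate_orders) {
  for (m in c("CSS", "ML")) {
    fit <- arima(log(air_passenger_df_TS), order = ord,
                 seasonal = list(order = c(0,1,0), period = 12), method = m)
    cat("\nSARIMA(", paste(ord, collapse = ","), ") x (0,1,0)_12, method =", m, "\n")
    print(coeftest(fit))
  }
}
Below, each candidate is fitted and interpreted individually.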
model4_013_airline_CSS = arima(log(air_passenger_df_TS),order=c(0,1,3),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_013_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.317317 0.088317 -3.5929 0.000327 ***
## ma2 0.070500 0.093572 0.7534 0.451187
## ma3 -0.212449 0.094458 -2.2491 0.024504 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_013_airline_ML = arima(log(air_passenger_df_TS),order=c(0,1,3),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_013_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.318722 0.088811 -3.5888 0.0003322 ***
## ma2 0.067324 0.093639 0.7190 0.4721559
## ma3 -0.209883 0.094659 -2.2173 0.0266055 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above CSS and ML outputs, not all the coefficients of SARIMA (0,1,3) \(\times\) (0,1,0)\(_{12}\) are significant (ma2 is not), hence this model can be discarded.
model4_014_airline_CSS = arima(log(air_passenger_df_TS),order=c(0,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_014_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.339925 0.094542 -3.5955 0.0003238 ***
## ma2 0.074199 0.099325 0.7470 0.4550412
## ma3 -0.222803 0.096202 -2.3160 0.0205590 *
## ma4 0.064725 0.097367 0.6648 0.5062099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_014_airline_ML = arima(log(air_passenger_df_TS),order=c(0,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_014_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.338329 0.093937 -3.6016 0.0003162 ***
## ma2 0.070868 0.098713 0.7179 0.4728071
## ma3 -0.218498 0.095560 -2.2865 0.0222247 *
## ma4 0.060088 0.096330 0.6238 0.5327728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As not all the coefficients of SARIMA (0,1,4) \(\times\) (0,1,0)\(_{12}\) are significant (ma2 and ma4 are not), this model can be discarded.
model4_110_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_110_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.320713 0.086428 -3.7107 0.0002067 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_110_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_110_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.320115 0.086565 -3.698 0.0002173 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) is significant in both CSS and ML methods, hence this model is considered for diagnostics checking.
model4_113_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,3),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_113_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.058042 0.323569 -0.1794 0.85764
## ma1 -0.272506 0.312650 -0.8716 0.38343
## ma2 0.056305 0.139887 0.4025 0.68731
## ma3 -0.221780 0.095843 -2.3140 0.02067 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_113_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,3),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_113_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.160787 0.356104 -0.4515 0.65162
## ma1 -0.168909 0.344022 -0.4910 0.62344
## ma2 0.017293 0.151861 0.1139 0.90934
## ma3 -0.204896 0.093462 -2.1923 0.02836 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not all the coefficients of the SARIMA (1,1,3) \(\times\) (0,1,0)\(_{12}\) model are significant (only ma3 is). Hence, this model can be discarded.
model4_114_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_114_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.47382 0.39121 1.2112 0.22584
## ma1 -0.82314 0.39105 -2.1050 0.03529 *
## ma2 0.24051 0.17975 1.3381 0.18088
## ma3 -0.27381 0.12237 -2.2375 0.02525 *
## ma4 0.18273 0.14206 1.2862 0.19836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_114_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_114_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.55811 0.23372 2.3880 0.0169414 *
## ma1 -0.91086 0.23945 -3.8040 0.0001424 ***
## ma2 0.26817 0.15238 1.7599 0.0784255 .
## ma3 -0.25880 0.12536 -2.0644 0.0389809 *
## ma4 0.23131 0.12011 1.9258 0.0541285 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the two outputs, it is clear that not all the coefficients of the SARIMA (1,1,4) \(\times\) (0,1,0)\(_{12}\) model are significant in both methods, thus this model is discarded.
model4_310_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,0),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_310_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.319187 0.089556 -3.5641 0.0003651 ***
## ar2 -0.056980 0.094205 -0.6049 0.5452756
## ar3 -0.198322 0.089704 -2.2108 0.0270465 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_310_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_310_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.316976 0.089587 -3.5382 0.0004029 ***
## ar2 -0.060070 0.093891 -0.6398 0.5223098
## ar3 -0.195217 0.089335 -2.1852 0.0288718 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not all the coefficients of SARIMA (3,1,0) \(\times\) (0,1,0)\(_{12}\) are significant (ar2 is not), thus this model is discarded.
#model4_312_airline
model4_312_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_312_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.1048537 0.0042831 24.481 < 2.2e-16 ***
## ar2 -0.8068319 0.0034306 -235.186 < 2.2e-16 ***
## ar3 -0.3958784 0.0039296 -100.743 < 2.2e-16 ***
## ma1 -0.5223968 0.0075540 -69.155 < 2.2e-16 ***
## ma2 1.0917037 0.0051349 212.606 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_312_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_312_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.215010 0.086311 -2.4911 0.01273 *
## ar2 -0.765596 0.052014 -14.7191 < 2.2e-16 ***
## ar3 -0.411513 0.084004 -4.8987 9.647e-07 ***
## ma1 -0.081764 0.034134 -2.3954 0.01660 *
## ma2 0.999997 0.037506 26.6624 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All the coefficients of SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) are significant for both the CSS and ML methods, hence this model is considered for further diagnostics.
model4_314_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_314_airline_CSS)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.12894 0.44316 0.2910 0.7710872
## ar2 -0.81538 0.14134 -5.7691 7.971e-09 ***
## ar3 -0.14720 0.37972 -0.3877 0.6982662
## ma1 -0.47136 0.44643 -1.0559 0.2910353
## ma2 0.98690 0.27640 3.5706 0.0003562 ***
## ma3 -0.36360 0.44256 -0.8216 0.4113238
## ma4 0.01335 0.22221 0.0601 0.9520919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_314_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
## Warning in log(s2): NaNs produced
coeftest(model4_314_airline_ML)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.37044 0.42378 -0.8741 0.382044
## ar2 -0.68279 0.12602 -5.4180 6.028e-08 ***
## ar3 -0.52411 0.33921 -1.5451 0.122322
## ma1 0.02575 0.43740 0.0589 0.953055
## ma2 0.77436 0.27746 2.7909 0.005257 **
## ma3 0.10757 0.44894 0.2396 0.810643
## ma4 -0.21937 0.25027 -0.8765 0.380752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the CSS and ML coefficient tests, not all the coefficients of SARIMA (3,1,4) \(\times\) (0,1,0)\(_{12}\) are significant, hence this model is discarded.
Diagnostic checking is carried out for the models with significant coefficients under both the CSS and ML methods. This involves a residual time series plot to check whether the residuals are stationary (i.e., no trend or seasonality is left by the model), checking the normality of the residuals through the Shapiro-Wilk test, histogram and QQ plot, and the ACF plot and Ljung-Box test to check that no autocorrelation is left in the residuals. To carry out these checks, a function residual_analysis is defined below, which standardises the residuals and produces the standardised residuals plot, histogram, QQ plot, ACF plot and Ljung-Box test plot.
residual_analysis <- function(model, std = TRUE,start = 2, class = c("ARIMA","GARCH","ARMA-GARCH", "fGARCH")[1]){
if (class == "ARIMA"){
if (std == TRUE){
res.model = rstandard(model)
}else{
res.model = residuals(model)
}
}else if (class == "GARCH"){
res.model = model$residuals[start:model$n.used]
}else if (class == "ARMA-GARCH"){
res.model = model@fit$residuals
}else if (class == "fGARCH"){
res.model = model@residuals
}else {
stop("The argument 'class' must be either 'ARIMA' or 'GARCH' ")
}
par(mfrow=c(3,2))
plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
abline(h=0)
hist(res.model,main="Histogram of standardised residuals")
qqnorm(res.model,main="QQ plot of standardised residuals")
qqline(res.model, col = 2)
acf(res.model,main="ACF of standardised residuals")
print(shapiro.test(res.model))
k=0
LBQPlot(res.model, lag.max = 30, StartLag = k + 1, k = 0, SquaredQ = FALSE)
par(mfrow=c(1,1))
}
Shapiro-Wilk test: Null hypothesis \(H_{0}\): Errors are normally distributed
Alternative hypothesis \(H_{a}\): Errors are not normally distributed
Reject \(H_{0}\) if p-value < 0.05
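The Ljung-Box plot in residual_analysis relies on LBQPlot() from the FitAR package, which has since been archived on CRAN. If it is unavailable, base R's Box.test() gives the Ljung-Box statistic at a single lag; the snippet below is an illustrative alternative, with the lag and fitdf choices being assumptions on my part.
# Illustrative alternative: Ljung-Box test at lag 12 for the SARIMA(1,1,0)x(0,1,0)_12 residuals;
# fitdf = 1 accounts for the single estimated AR coefficient
Box.test(residuals(model4_110_airline_ML), lag = 12, type = "Ljung-Box", fitdf = 1)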
Among the possible models, the following models are selected for diagnostics check:
{ SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) }
sort.score <- function(x, score = c("bic", "aic")){
if (score == "aic"){
x[with(x, order(AIC)),]
} else if (score == "bic") {
x[with(x, order(BIC)),]
} else {
warning('score = "x" only accepts valid arguments ("aic","bic")')
}
}
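An illustrative usage of the sort.score() helper defined above, applied to the two candidate models fitted earlier with ML (an addition; the AIC/BIC values themselves are compared again below):
# Illustrative sketch: rank the two candidate models by AIC and by BIC
sort.score(AIC(model4_110_airline_ML, model4_312_airline_ML), score = "aic")
sort.score(BIC(model4_110_airline_ML, model4_312_airline_ML), score = "bic")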
m4_110_airline = arima(air_passenger_df_TS,order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
residual_analysis(model = m4_110_airline, class = "ARIMA")
m4_312_airline = arima(air_passenger_df_TS,order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "ML")
residual_analysis(model = m4_312_airline, class = "ARIMA")
As the residuals of the SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) model are not normally distributed, this model is not a good option for the forecast.
Among the two selected models, let's compare the AIC and BIC values.
BIC(model4_110_airline_ML,model4_312_airline_ML)
## df BIC
## model4_110_airline_ML 2 -408.1026
## model4_312_airline_ML 6 -406.4563
AIC(model4_110_airline_ML,model4_312_airline_ML)
## df AIC
## model4_110_airline_ML 2 -413.6776
## model4_312_airline_ML 6 -423.1813
Among the two models, SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) has the lower BIC value (while SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) has the lower AIC), and its residuals are closer to normal, hence this model is selected for the forecast.
The forecast for the next 24 months is produced with the forecast() method using the SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) model, and the predictions are plotted.
# Forecasting
m5_110_airline = Arima(air_passenger_df_TS,order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),
method = "CSS")
preds1 = forecast(m5_110_airline, h = 24)
plot(preds1,
main = " Forecast for the next 24 months of the time series",
xlab = "Year",
ylab = "Number of Passengers")
From Figure 22, it is inferred that the number of passengers is likely to increase, as the upward trend continues and the seasonality remains the same. The prediction intervals also suggest that the series will continue to trend upward.
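The numeric forecasts behind the plot can also be inspected directly from the forecast object; the sketch below (an addition to the original report) shows the point forecasts with their 95% limits for the first 12 forecast months.
# Illustrative sketch: point forecasts and 95% prediction limits from the forecast object
head(data.frame(point = preds1$mean, lower_95 = preds1$lower[, 2], upper_95 = preds1$upper[, 2]), 12)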
The time series has a stochastic upward trend with seasonality. The data is first fitted with deterministic trend models, namely the linear, quadratic, seasonal, and cosine models; as expected, these models do not perform well. A stochastic trend model, the SARIMA model, is then fitted because the series is seasonal. The residual approach is followed to specify the SARIMA model, and the possible SARIMA models are found using the ACF, PACF, EACF, and BIC outputs. For each possible model, the coefficients are checked for significance and the models with significant coefficients are retained. These selected models are then subjected to residual analysis, namely checks of normality, the ACF, and the Ljung-Box test. The model whose residuals form a stationary series is selected to forecast the number of passengers for the next 24 months. From the forecast, it is inferred that the number of airline passengers is expected to increase.