Anilkumar Lingaraj Biradar

Summary

A seasonal time series is read and fitted with deterministic and stochastic trend models. A residual approach is followed to specify the stochastic models, yielding a set of candidate models. Each candidate is checked for significant coefficients, and the models with significant coefficients are carried forward to diagnostic checking. Residual analysis is then conducted on each of these models, and the model with stationary (white-noise) residuals is selected for forecasting.

Introduction

The Air Passengers dataset is downloaded from Kaggle. It gives the monthly passenger totals of a US airline from 1949 to 1960. This report analyses the data using a suitable time series model and forecasts the number of passengers for the next 24 months.

Assumption:

The significance level is considered as 5%.

Installing the Time Series Package

#install.packages("TSA")
library(TSA)
#install.packages("tseries")
library(tseries)
#install.packages("lmtest")
library(lmtest)
#install.packages("FitAR")
library(FitAR)
#install.packages("forecast")
library(forecast)

Opening the Data in Excel

To get information about the data and its column names, the data set is initially opened in Microsoft Excel.
Figure 1: Screenshot of Excel showing the first 5 observations


From Figure 1, it is clear that the data set contains two columns, namely Month and #Passengers.

Reading the Data

# Data is read using read.csv()
air_passenger_df <- read.csv("AirPassengers.csv")
# To display the first 6 observations
head(air_passenger_df)
##     Month X.Passengers
## 1 1949-01          112
## 2 1949-02          118
## 3 1949-03          132
## 4 1949-04          129
## 5 1949-05          121
## 6 1949-06          135

For better readability the column #Passengers is renamed as Number_of_Passengers.

# Rename the columns
colnames(air_passenger_df) <- (c("Month","Number_of_Passengers"))

# To get the details of last 2 observations
tail(air_passenger_df,2)
##       Month Number_of_Passengers
## 143 1960-11                  390
## 144 1960-12                  432

Descriptive Analysis

summary(air_passenger_df)
##     Month           Number_of_Passengers
##  Length:144         Min.   :104.0       
##  Class :character   1st Qu.:180.0       
##  Mode  :character   Median :265.5       
##                     Mean   :280.3       
##                     3rd Qu.:360.5       
##                     Max.   :622.0

From the above descriptive statistics output, it is inferred that in the period from Jan 1949 to Dec 1960, the average number of passengers is 280, the maximum number of passengers in a month is 622 and the minimum number of passengers in a month is 104.

# Convert to a time series object; since the data is monthly, the frequency is set to 12.
air_passenger_df_TS <- ts(as.vector(air_passenger_df$Number_of_Passengers), start=c(1949,1), end=c(1960,1),frequency = 12)
# Checking the class of the object
class(air_passenger_df_TS)
## [1] "ts"

To avoid issues when converting a data frame to a time series object, the column is first converted to a vector and then to a time series object.

air_passenger_df_TS
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417

From Figure 1 and the above output, it is clear that the time series object has been created appropriately.

Plot of Time Series object

## To plot the time series object
plot(air_passenger_df_TS,type='o',ylab='Number of Passengers',main='Time series plot of Airline Passengers')

Figure 2 Time series plot of Airline Passengers from 1949 to 1960



It is hard to comment on the seasonality from the above plot, hence the points in the graph are replaced with the starting letter of the respective month.

# Remove type='o'
plot(air_passenger_df_TS,ylab='Number of Passengers',main='Time series plot of Airline Passengers')
points(y=air_passenger_df_TS,x=time(air_passenger_df_TS),pch=as.vector(season(air_passenger_df_TS)))

Figure 3 Time series plot of Airline Passengers from 1949 to 1960 with month letter



From Figure 3, the following observations can be made on the time series plot:
  1. Trend: a clear upward trend is observed, and the trend is approximately linear.
  2. Seasonality: there is a clear seasonal pattern, with peaks in July/August and troughs in November each year.
  3. Changing variance: the fluctuations are small at the start of the series and grow over the years, so the variance is not constant.
  4. Behaviour: despite some fluctuations, consecutive observations appear to be related; because of the seasonality it is not yet clear whether the behaviour is moving average or autoregressive.
  5. Change point: there is no clear intervention point in the plot.

Impact of previous month passengers to the next month passengers

A clearer decision about the behaviour of the series can be made by checking whether two consecutive months are correlated. To do this, the number of passengers in each month is plotted against the number of passengers in the previous month (the first lag of the series).

The time series object is copied into series_without_lag and its first six observations are displayed.

# Copying the time series data to series_without_lag
series_without_lag = air_passenger_df_TS
head(series_without_lag)
##      Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135

The first lag of the time series is created using the zlag() method and assigned to first_lag_ts_series.

first_lag_ts_series = zlag(air_passenger_df_TS)
# To display the first 6 observations of the lag-1 time series
head(first_lag_ts_series)
## [1]  NA 112 118 132 129 121

The first observation is NA because zlag() shifts the series by one position, so there is no previous value for the first observation. To omit this missing value, only the values from the 2nd index onwards are considered.

index = 2:length(first_lag_ts_series)


plot(y=air_passenger_df_TS,x=first_lag_ts_series,ylab='Original Airline Passengers', xlab='First lag of Airline Passengers', main= "Relation between the original time series and its first lag")

Figure 4 Scatter plot of Original Airline Passengers vs first lag of Airline Passengers


From the plot it can be inferred that there exists a strong positive correlation between the original Airline Passengers series and its first lag.


The correlation between series_without_lag and first_lag_ts_series is found using the cor() method.

cor(series_without_lag[index],first_lag_ts_series[index])
## [1] 0.9570215

From the above output, the correlation between consecutive months is 0.957, a strong positive relationship (equivalently, about 91.6% of the variance in a month's value is explained by the previous month's value, since \(r^2 \approx 0.916\)). This strong dependence between consecutive observations suggests that air_passenger_df_TS behaves like an autoregressive (AR) series.
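As a quick cross-check (an addition, not part of the original report), the sample autocorrelation at lag 1 can be read directly from acf(); it uses a slightly different denominator than cor() applied to the lagged pairs, so the value is close to but not exactly the same as the one above.

# Lag-1 sample autocorrelation of the series; element [1] is lag 0, element [2] is lag 1
acf(air_passenger_df_TS, lag.max = 1, plot = FALSE)$acf[2]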

Stochastic or Deterministic

The value of the series at time 't' depends on the value of the series at time 't-1' plus a random shock, hence the trend of the series is stochastic. The simplest stochastic trend model is the random walk:
\(Y_{t} = Y_{t-1} + e_{t}\)
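As an illustrative sketch only (not part of the original analysis), a random walk with drift can be simulated to show how a stochastic trend arises purely from accumulated random shocks; the drift value 0.5 and the seed are arbitrary choices.

# Illustration: simulate a random walk with drift, Y_t = Y_{t-1} + 0.5 + e_t
set.seed(123)                                   # arbitrary seed for reproducibility
e <- rnorm(144)                                 # white-noise shocks, same length as the series
rw <- ts(cumsum(0.5 + e), start = c(1949, 1), frequency = 12)
plot(rw, ylab = 'Simulated series', main = 'Simulated random walk with drift (stochastic trend)')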

Fitting the deterministic models

Even though the series has a stochastic trend, it is of interest to see how deterministic trend models fit this series.

Fitting the Linear Regression Model in time

As the trend in the series is linear, fitting a linear trend model is a reasonable first attempt. The linear trend component is expressed as:
\(\mu_{t} = \beta_{0} + \beta_{1} t\)
and, in terms of the time series data,
\(Y_{t} = \mu_{t} + \epsilon_{t}\)

where,
\(\mu_{t}\): trend component,
\(\beta_{0}\): intercept of the linear regression model,
\(\beta_{1}\): slope of the linear regression,
\(\epsilon_{t}\): residuals after capturing the trend.

In R, the first step is to extract time from the air_passenger_df_TS time series object. The time of air_passenger_df_TS is extracted and assigned to time_of_air_passenger_df_TS.

time_of_air_passenger_df_TS <- time(air_passenger_df_TS)

The model is fitted using the lm() method; the summary() and abline() methods are used to obtain the summary of the fit and the line of best fit respectively.

linear_regression <-lm(air_passenger_df_TS ~ time_of_air_passenger_df_TS)
# To get the output of the model
summary(linear_regression)
## 
## Call:
## lm(formula = air_passenger_df_TS ~ time_of_air_passenger_df_TS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.688 -25.704  -4.379  20.729 139.287 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -59739.522   2219.247  -26.92   <2e-16 ***
## time_of_air_passenger_df_TS     30.700      1.135   27.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.9 on 131 degrees of freedom
## Multiple R-squared:  0.848,  Adjusted R-squared:  0.8469 
## F-statistic:   731 on 1 and 131 DF,  p-value: < 2.2e-16
From the above output, the number of passengers (y) is given by the estimated equation:

y = -59739.522 + 30.700 * time_of_air_passenger_df_TS + \(\epsilon\)

where
  • the coefficient of the intercept \(\beta_{0}\) is -59739.522,
  • the coefficient of time_of_air_passenger_df_TS \(\beta_{1}\) is 30.700, and
  • \(\epsilon\) is the error term.

Both coefficients are statistically significant at 5% significance level.

The overall model is significant as the p-value is < 2.2e-16, i.e., less than 0.05. The adjusted R-squared is 0.8469, which implies that 84.69% of the variation in the number of passengers is explained by this model.

# Plot to see the fit of regression on time series
plot(air_passenger_df_TS,type='o',ylab='AirLine Passengers',col="blue", main = "Time series plot for Airline Passengers ")
abline(linear_regression,col="red")

Figure 5 Linear Model of time series plot for Airline Passengers series


In the above plot, the blue line represents the actual time series and the red line represents the linear model fit. The model captures the trend but fails to explain the seasonality in the series, hence it cannot be considered for prediction.

Fitting the Quadratic trend Model in time

The quadratic model is suited to a series with a quadratic trend. As the name suggests, the model includes the square of the extracted time, which tests whether the quadratic term of time (\(t^2\)) has influence in predicting the number of passengers.

The quadratic trend model is expressed as follows:
\(\mu_{t} = \beta_{0} + \beta_{1} t + \beta_{2} t^2\)
and, in terms of the time series data,
\(Y_{t} = \mu_{t} + \epsilon_{t}\)

where,
\(\mu_{t}\): trend component to be modelled,
\(\beta_{0}\): intercept of the quadratic regression model,
\(\beta_{1}\): coefficient estimate of \(t\),
\(\beta_{2}\): coefficient estimate of \(t^2\),
\(\epsilon_{t}\): residuals after capturing the trend.

The lm() method is used to fit the model, including the square of the extracted time.

# To find the square of the time
square_of_time_of_air_passenger_df_TSS <- time_of_air_passenger_df_TS ^ 2
#Fitting the quadratic model
quadratic_model <-lm(air_passenger_df_TS ~ time_of_air_passenger_df_TS + square_of_time_of_air_passenger_df_TSS)
# To get the summary of the model
summary(quadratic_model)
## 
## Call:
## lm(formula = air_passenger_df_TS ~ time_of_air_passenger_df_TS + 
##     square_of_time_of_air_passenger_df_TSS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.247 -22.283  -8.455  20.384 125.477 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                             3.321e+06  1.493e+06   2.225   0.0278 *
## time_of_air_passenger_df_TS            -3.429e+03  1.527e+03  -2.245   0.0265 *
## square_of_time_of_air_passenger_df_TSS  8.850e-01  3.907e-01   2.265   0.0252 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.25 on 130 degrees of freedom
## Multiple R-squared:  0.8538, Adjusted R-squared:  0.8516 
## F-statistic: 379.6 on 2 and 130 DF,  p-value: < 2.2e-16
The number of passengers (y) is given by the estimated equation:

y = 3.321e+06 - 3.429e+03 * time_of_air_passenger_df_TS + 8.850e-01 * square_of_time_of_air_passenger_df_TSS + \(\epsilon\)

where
  • the coefficient of the intercept \(\beta_{0}\) is 3.321e+06,
  • the coefficient of time_of_air_passenger_df_TS \(\beta_{1}\) is -3.429e+03,
  • the coefficient of square_of_time_of_air_passenger_df_TSS \(\beta_{2}\) is 8.850e-01, and
  • \(\epsilon\) is the error term.

All the coefficients are statistically significant at 5% significance level.

The overall model is significant as the p-value is < 2.2e-16, i.e., less than 0.05. The adjusted R-squared is 0.8516, i.e., 85.16% of the variation in the number of passengers is explained by this model.

# Visualising the plot
plot(ts(fitted(quadratic_model)), ylim = c(min(c(fitted(quadratic_model), as.vector(air_passenger_df_TS))), max(c(fitted(quadratic_model),as.vector(air_passenger_df_TS)))),
     ylab='AirLine Passengers' , main = "Fitted quadratic curve to Airline Passengers", type="l",lty=2,col="red")
lines(as.vector(air_passenger_df_TS),type="o",col="navyblue")

Figure 6 Quadratic Model of time series plot for Airline Passengers series


Similar to the linear model, the quadratic model captures only the trend and fails to capture the seasonality. Thus this model cannot be considered for forecasting.

Seasonal Deterministic models

In the seasonal means model, a separate parameter is created for each repeating period of time 't'. As the time series is monthly data, the frequency is 12, so there is one parameter for each month.

month.=season(air_passenger_df_TS)
seasonal_model =lm(air_passenger_df_TS ~ month. -1) # -1 removes the intercept term 
summary(seasonal_model)
## 
## Call:
## lm(formula = air_passenger_df_TS ~ month. - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.91  -81.64  -20.64   89.64  231.09 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## month.January     241.75      30.53   7.919 1.30e-12 ***
## month.February    220.82      31.88   6.926 2.25e-10 ***
## month.March       256.64      31.88   8.049 6.54e-13 ***
## month.April       249.45      31.88   7.824 2.16e-12 ***
## month.May         253.64      31.88   7.955 1.08e-12 ***
## month.June        291.36      31.88   9.138 1.82e-15 ***
## month.July        326.73      31.88  10.247  < 2e-16 ***
## month.August      327.91      31.88  10.284  < 2e-16 ***
## month.September   283.73      31.88   8.899 6.72e-15 ***
## month.October     248.91      31.88   7.807 2.36e-12 ***
## month.November    218.55      31.88   6.854 3.22e-10 ***
## month.December    246.36      31.88   7.727 3.59e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.7 on 121 degrees of freedom
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8618 
## F-statistic: 70.09 on 12 and 121 DF,  p-value: < 2.2e-16

All 12 monthly coefficients are significant, with the values shown in the above output. The adjusted R-squared is 0.8618, which implies that 86.18% of the variation in the number of airline passengers is explained by these coefficients. The overall model is significant as the p-value is less than the significance level.

Let’s visualise the fit on the time series data.

# To visualise the fit
plot(ts(fitted(seasonal_model)), ylab='AirLine Passengers', 
     main = "Fitted seasonal model to Airline Passengers series" , 
     ylim = c(min(c(fitted(seasonal_model), as.vector(air_passenger_df_TS))) ,
              max(c(fitted(seasonal_model), as.vector(air_passenger_df_TS))) ),col="red")
lines(as.vector(air_passenger_df_TS),type="o")

Figure 7 Seasonal Model of time series plot for Airline Passengers series


From Figure 7, it is inferred that the seasonal model captures the seasonality but completely fails to capture the trend. Hence this model can be discarded.

Fitting Cosine trend deterministic models

Similar to the seasonal means model, the cosine model is used to capture the repeating pattern in the data; however, instead of a separate parameter for each period, the cyclic mathematical functions cosine and sine are used to create the repeating pattern.

\(\mu_{t} = \beta_{0} + \beta_{1} \cos(2 \pi f t) + \beta_{2} \sin(2 \pi f t)\)
and, in terms of the time series data,
\(Y_{t} = \mu_{t} + \epsilon_{t}\)

where,
\(\mu_{t}\): trend component to be modelled,
\(\beta_{0}\): intercept of the cosine regression model,
\(\beta_{1}\): coefficient estimate of the cosine amplitude,
\(\beta_{2}\): coefficient estimate of the sine amplitude,
\(\epsilon_{t}\): residuals after capturing the trend.

In R, the harmonic() method is used to construct the cosine and sine terms.

har. <- harmonic(air_passenger_df_TS, 1)
data <- data.frame(air_passenger_df_TS,har.)
cosine_model <- lm(air_passenger_df_TS ~ cos.2.pi.t. + sin.2.pi.t. , data = data) 
summary(cosine_model)
## 
## Call:
## lm(formula = air_passenger_df_TS ~ cos.2.pi.t. + sin.2.pi.t., 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.52  -79.52  -18.95   81.93  257.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  263.962      8.997  29.340  < 2e-16 ***
## cos.2.pi.t.  -41.012     12.676  -3.235  0.00154 ** 
## sin.2.pi.t.   -3.929     12.771  -0.308  0.75886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.8 on 130 degrees of freedom
## Multiple R-squared:  0.07515,    Adjusted R-squared:  0.06092 
## F-statistic: 5.281 on 2 and 130 DF,  p-value: 0.006233
  • Coefficient of intercept \(\beta_{0}\) is 263.962 with p-value < 2e-16
  • Coefficient of cos.2.pi.t. \(\beta_{1}\) is -41.012 with p-value 0.00154
  • Coefficient of sin.2.pi.t. \(\beta_{2}\) is -3.929 with p-value 0.75886

The \(\beta_{0}\) and \(\beta_{1}\) coefficients are significant, whereas \(\beta_{2}\) is not. The adjusted R-squared is 0.06092, i.e., only about 6% of the variation in the series is explained by this model.

plot(ts(fitted(cosine_model)), ylab='AirLine Passengers', main = "Fitted seasonal model for Airline Passeneger using cosine " , ylim = c(min(c(fitted(cosine_model), as.vector(air_passenger_df_TS))) ,max(c(fitted(cosine_model), as.vector(air_passenger_df_TS))) ),col="red")
lines(as.vector(air_passenger_df_TS),type="o")

Figure 8 Cosine Model of time series plot for Airline Passengers series


From the plot it is clear that the cosine model fails to capture the trend. Given its low R-squared value and its failure to capture the trend, this model can also be discarded.

Stochastic time-series model

As the series has a stochastic trend and seasonality, let's try fitting stochastic models.

Stationarity and autocorrelation are the primary assumptions that need to be satisfied to build an autoregressive model, so let's check whether these assumptions hold.
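Although not one of the original checks, the Augmented Dickey-Fuller (ADF) test from the tseries package (already loaded) gives a formal test of stationarity; a minimal sketch is shown below.

# ADF test: H0 = the series has a unit root (non-stationary); reject H0 if p-value < 0.05
adf.test(air_passenger_df_TS)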

Visualise the time series plot

## To plot the time series object
plot(air_passenger_df_TS,type='o',ylab='Number of Passengers',main='Time series plot of Airline Passengers')
points(y=air_passenger_df_TS,x=time(air_passenger_df_TS),pch=as.vector(season(air_passenger_df_TS)))

Figure 9: Time series plot with monthly label of Airline Passengers


  • The trend is upward, so the mean is not constant.
  • The fluctuations are small at the beginning and larger towards the end, so the variance is not constant.
  • Visually, it is therefore clear that the air_passenger_df_TS time series is not stationary.
  • From Figure 9, it is also inferred that there is seasonality in the series.

ACF and PACF plot

Further, the autocorrelation in the series is examined using the ACF plot, and the order of an AR model is suggested by the PACF plot. A function plot_acf_and_pacf is defined to draw the ACF and PACF plots, as shown below.

# The function accepts the series, number of lags and the series names to be given on the plots
plot_acf_and_pacf <- function(df_TS, lags_given,df_name) {
  assign("x",df_TS)
  par(mfrow=c(1,2))
  acf(df_TS, lag.max=lags_given,main=paste("ACF Plot of", df_name))
  pacf(df_TS, lag.max=lags_given, main=paste("PACF Plot of", df_name))
}
plot_acf_and_pacf(air_passenger_df_TS, 64,"Air Passengers")

Figure 10: ACF and PACF plots of time series


From Figure 10, it is further visually confirmed that there is a wave-like, slowly decaying pattern in the ACF plot, hence the series is not stationary. The repeating pattern at the seasonal lags indicates a seasonal trend in the series. There is no clear pattern in the PACF plot.

Seasonal Autoregressive Integrated Moving Average (SARIMA) Model

As the seasonality is confirmed from the time series and ACF plots, let's fit a SARIMA model. This model has the following parameters,
SARIMA (p,d,q) \(\times\) (P,D,Q)\(_{s}\)
p : Auto Regressive (AR) order,
d : number of ordinary differences,
q : Moving Average (MA) order,
P : seasonal AR order,
D : number of seasonal differences,
Q : seasonal MA order,
s : period or frequency of the series; here the value is 12 as the data is monthly. The mapping of these orders onto R's arima() call is sketched below.
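For reference, a generic sketch of this mapping is given below; the orders used here are placeholders only, and the specific models are fitted in the following sections.

# Template (illustrative): SARIMA(p,d,q)x(P,D,Q)_12 via arima(); orders are placeholders
p <- 0; d <- 0; q <- 0     # non-seasonal AR order, ordinary differences, MA order
P <- 0; D <- 1; Q <- 0     # seasonal AR order, seasonal differences, seasonal MA order
sarima_fit <- arima(air_passenger_df_TS,
                    order    = c(p, d, q),
                    seasonal = list(order = c(P, D, Q), period = 12))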

Model Specification

A residual approach is followed to find the specification of the SARIMA model: start the model with one seasonal difference, examine the residuals, and if any pattern is left in the residuals, rebuild the model with the parameters suggested by the ACF and PACF plots and repeat the process.

SARIMA (0,0,0) \(\times\) (0,1,0)\(_{12}\)

Initially we fit the model with a seasonal difference of order D = 1.

#                                                  p,d,q                        P,D,Q
model1_airline = arima(air_passenger_df_TS,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model1 = residuals(model1_airline);  
par(mfrow=c(1,1))
plot(res_of_model1,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")

Figure 11: Time series plot of the residuals after first seasonal difference with order D = 1


From Figure 11, it is inferred that there is still a trend left in the residuals, i.e., the model has not captured the trend, although the residuals are centred near 0. Let's look at the ACF and PACF plots to see whether this remaining pattern is significant.

The ACF and PACF plots are used to examine the remaining autocorrelation.

plot_acf_and_pacf(res_of_model1, 48,"Residual of M1")

Figure 12: ACF and PACF plot of residuals after first seasonal difference with order D = 1


From the ACF and PACF plots, there are no significant autocorrelations at the seasonal lags, which means the seasonal pattern is captured by the seasonal differencing with order D = 1. Hence, from the PACF plot the seasonal AR order is P = 0,
the number of seasonal differences is D = 1, and
from the ACF plot the seasonal MA order is Q = 0.

Before finding the ARMA orders, the changing variance in the series needs to be addressed; a log transformation is applied to stabilise it.
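The log transformation is appropriate when the size of the seasonal fluctuation grows roughly in proportion to the level of the series. As an optional cross-check (not in the original report), forecast::BoxCox.lambda() estimates the Box-Cox transformation parameter; a value close to 0 supports the log transformation.

# Estimate the Box-Cox lambda; lambda near 0 corresponds to a log transformation
BoxCox.lambda(air_passenger_df_TS)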

#                                                       p,d,q                        P,D,Q
model2_airline = arima(log(air_passenger_df_TS),order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model2 = residuals(model2_airline);  
par(mfrow=c(1,1))
plot(res_of_model2,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")

Figure 13: Time series plot of the residuals after log transformed & first seasonal difference with order D = 1


plot_acf_and_pacf(res_of_model2, 48,"Residual of M2")

Figure 14: ACF and PACF plot of residuals after log transformed and first seasonal difference with order D = 1


From Figure 13, the variance is somewhat reduced, but there is still a change in variance in the series. From Figure 14, there are significant autocorrelations before the first seasonal lag in the ACF, so the series is ordinarily differenced before specifying the AR and MA orders.

SARIMA (0,1,0) \(\times\) (0,1,0)\(_{12}\)

#                                                       p,d,q                        P,D,Q
model3_airline = arima(log(air_passenger_df_TS),order=c(0,1,0),seasonal=list(order=c(0,1,0), period=12))
res_of_model3 = residuals(model3_airline);  
par(mfrow=c(1,1))
plot(res_of_model3,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")

Figure 15: Time series plot of the residuals after ordinary difference d = 1 & first seasonal difference D = 1


From figure 15, the residuals are spread around 0 and the series is stationary.

plot_acf_and_pacf(res_of_model3, 48,"Residual of M3")

Figure 16: ACF and PACF plot of residuals series of ordinary difference d = 1 & first seasonal difference D = 1


From the ACF plot there are 2 significant lags before the first seasonal lag, and from the PACF plot there are 3 significant lags before the first seasonal lag. These significant lags are not captured by the current model and remain in the residuals, so a SARIMA model with an ARMA specification of AR(3) and MA(2) is fitted next.

SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\)

#                                                       p,d,q                        P,D,Q
model4_airline = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12), method="ML")
res_of_model4 = residuals(model4_airline);  
par(mfrow=c(1,1))
plot(res_of_model4,xlab='Time',ylab='Residuals',main="Time series plot of the residuals")

Figure 17: Time series plot of the residuals SARIMA(3,1,2) \(\times\) (0,1,0)


plot_acf_and_pacf(res_of_model4, 36,"Residual of M4")

Figure 18: ACF and PACF plot of residuals SARIMA(3,1,2) \(\times\) (0,1,0)


From figure 18, it is clear that there is no significant autocorrelation left in both ACF and PACF. This implies that the model captures the trend and seasonality of the airline passengers series.

Extended Autocorrelation Function (EACF)

For further candidate models, the EACF can be used. It identifies the orders of the autoregressive and moving average components of an ARMA model via the eacf function. res_of_model3 is used here, as it is the residual series that still contains significant autocorrelations. The candidate ARMA specifications are read from the vertex of the top-left triangular pattern of zeroes in the AR rows and MA columns.

eacf(res_of_model3)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o x o o o o o o o o  x  o  o 
## 1 o o x o o o o o x o o  x  o  o 
## 2 o x x o o o o o x o o  x  o  x 
## 3 x x o o o o o o o o o  x  o  o 
## 4 x o o o o o o o o o o  x  x  o 
## 5 x o x o o o o o o o o  x  o  o 
## 6 x x o o o o o o o o o  x  o  o 
## 7 x x o o o o o o o o o  x  o  o

From the above output, the vertex of the triangle of 'o' values is at AR row 0 and MA column 3, with no nearby 'x' values. Including the ordinary difference already applied (d = 1), the set of possible non-seasonal orders is:
  • ARIMA (0,1,3)
  • ARIMA (0,1,4)
  • ARIMA (1,1,3)

Bayesian Information Criterion (BIC)

The BIC table can also be used to find possible AR and MA orders. BIC penalises model complexity by including a penalty term, so the model with the lowest BIC value is preferred and appears in the top row of the table with the darkest shading.

res = armasubsets(y=res_of_model3,nar=6,nma=6,y.name='test',ar.method='ols')
plot(res)

Figure 19: Bayesian Information Criterion Table


From the BIC table, the possible AR orders are 1 and 3 and the suggested MA order is 0, giving ARIMA (1,1,0) and ARIMA (3,1,0); combining these AR orders with the MA order 4 from the EACF adds ARIMA (1,1,4) and ARIMA (3,1,4). Hence the new possible non-seasonal ARMA specifications are:

  • ARIMA (1,1,0)
  • ARIMA (3,1,0)
  • ARIMA (1,1,4)
  • ARIMA (3,1,4)

The set of possible models is:

{ SARIMA (0,1,3) \(\times\) (0,1,0)\(_{12}\), SARIMA (0,1,4) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,3) \(\times\) (0,1,0)\(_{12}\), SARIMA (1,1,4) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,4) \(\times\) (0,1,0)\(_{12}\) }

Model Fitting

Each of the possible models is fitted using both the CSS (conditional sum-of-squares) and ML (maximum likelihood) methods. The coefficient of each model parameter is checked for significance at the 0.05 level by conducting a z-test with the coeftest method.
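To avoid repeating the same pair of calls for every candidate, a small helper of the following form could be used. This is a hypothetical convenience function, not part of the original report; the fits below are still written out explicitly.

# Hypothetical helper: fit a SARIMA model with both CSS and ML and run the z-tests
fit_and_test <- function(series, order, seasonal_order, period = 12) {
  for (m in c("CSS", "ML")) {
    fit <- arima(series, order = order,
                 seasonal = list(order = seasonal_order, period = period),
                 method = m)
    cat("\nMethod:", m, "\n")
    print(coeftest(fit))   # z-test of each coefficient (lmtest package)
  }
}
# Example usage: fit_and_test(log(air_passenger_df_TS), c(0,1,3), c(0,1,0))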

SARIMA (0,1,3) \(\times\) (0,1,0)\(_{12}\)

model4_013_airline_CSS = arima(log(air_passenger_df_TS),order=c(0,1,3),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_013_airline_CSS)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)    
## ma1 -0.317317   0.088317 -3.5929 0.000327 ***
## ma2  0.070500   0.093572  0.7534 0.451187    
## ma3 -0.212449   0.094458 -2.2491 0.024504 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_013_airline_ML = arima(log(air_passenger_df_TS),order=c(0,1,3),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_013_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.318722   0.088811 -3.5888 0.0003322 ***
## ma2  0.067324   0.093639  0.7190 0.4721559    
## ma3 -0.209883   0.094659 -2.2173 0.0266055 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above CSS and ML outputs, not all coefficients of the model are significant (ma2 is not), hence this model can be discarded.

SARIMA (0,1,4) \(\times\) (0,1,0)\(_{12}\)

model4_014_airline_CSS = arima(log(air_passenger_df_TS),order=c(0,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_014_airline_CSS)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.339925   0.094542 -3.5955 0.0003238 ***
## ma2  0.074199   0.099325  0.7470 0.4550412    
## ma3 -0.222803   0.096202 -2.3160 0.0205590 *  
## ma4  0.064725   0.097367  0.6648 0.5062099    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_014_airline_ML = arima(log(air_passenger_df_TS),order=c(0,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_014_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.338329   0.093937 -3.6016 0.0003162 ***
## ma2  0.070868   0.098713  0.7179 0.4728071    
## ma3 -0.218498   0.095560 -2.2865 0.0222247 *  
## ma4  0.060088   0.096330  0.6238 0.5327728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As not all coefficients of SARIMA (0,1,4) \(\times\) (0,1,0)\(_{12}\) are significant (ma2 and ma4 are not), this model can be discarded.

SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\)

model4_110_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_110_airline_CSS)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.320713   0.086428 -3.7107 0.0002067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_110_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_110_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.320115   0.086565  -3.698 0.0002173 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient of SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) is significant in both CSS and ML methods, hence this model is considered for diagnostics checking.

SARIMA (1,1,3) \(\times\) (0,1,0)\(_{12}\)

model4_113_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,3),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_113_airline_CSS)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)  
## ar1 -0.058042   0.323569 -0.1794  0.85764  
## ma1 -0.272506   0.312650 -0.8716  0.38343  
## ma2  0.056305   0.139887  0.4025  0.68731  
## ma3 -0.221780   0.095843 -2.3140  0.02067 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_113_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,3),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_113_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)  
## ar1 -0.160787   0.356104 -0.4515  0.65162  
## ma1 -0.168909   0.344022 -0.4910  0.62344  
## ma2  0.017293   0.151861  0.1139  0.90934  
## ma3 -0.204896   0.093462 -2.1923  0.02836 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not all coefficients of the SARIMA (1,1,3) \(\times\) (0,1,0)\(_{12}\) model are significant. Hence, this model can be discarded.

SARIMA (1,1,4) \(\times\) (0,1,0)\(_{12}\)

model4_114_airline_CSS = arima(log(air_passenger_df_TS),order=c(1,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_114_airline_CSS)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)  
## ar1  0.47382    0.39121  1.2112  0.22584  
## ma1 -0.82314    0.39105 -2.1050  0.03529 *
## ma2  0.24051    0.17975  1.3381  0.18088  
## ma3 -0.27381    0.12237 -2.2375  0.02525 *
## ma4  0.18273    0.14206  1.2862  0.19836  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_114_airline_ML = arima(log(air_passenger_df_TS),order=c(1,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_114_airline_ML)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.55811    0.23372  2.3880 0.0169414 *  
## ma1 -0.91086    0.23945 -3.8040 0.0001424 ***
## ma2  0.26817    0.15238  1.7599 0.0784255 .  
## ma3 -0.25880    0.12536 -2.0644 0.0389809 *  
## ma4  0.23131    0.12011  1.9258 0.0541285 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the two outputs it is clear that not all coefficients of the SARIMA (1,1,4) \(\times\) (0,1,0)\(_{12}\) model are significant, thus this model is discarded.

SARIMA (3,1,0) \(\times\) (0,1,0)\(_{12}\)

model4_310_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,0),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_310_airline_CSS)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.319187   0.089556 -3.5641 0.0003651 ***
## ar2 -0.056980   0.094205 -0.6049 0.5452756    
## ar3 -0.198322   0.089704 -2.2108 0.0270465 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_310_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_310_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.316976   0.089587 -3.5382 0.0004029 ***
## ar2 -0.060070   0.093891 -0.6398 0.5223098    
## ar3 -0.195217   0.089335 -2.1852 0.0288718 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not all coefficients of SARIMA (3,1,0) \(\times\) (0,1,0)\(_{12}\) are significant (ar2 is not), thus this model is discarded.

SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\)

model4_312_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_312_airline_CSS)
## 
## z test of coefficients:
## 
##       Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.1048537  0.0042831   24.481 < 2.2e-16 ***
## ar2 -0.8068319  0.0034306 -235.186 < 2.2e-16 ***
## ar3 -0.3958784  0.0039296 -100.743 < 2.2e-16 ***
## ma1 -0.5223968  0.0075540  -69.155 < 2.2e-16 ***
## ma2  1.0917037  0.0051349  212.606 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_312_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "ML")
coeftest(model4_312_airline_ML)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1 -0.215010   0.086311  -2.4911   0.01273 *  
## ar2 -0.765596   0.052014 -14.7191 < 2.2e-16 ***
## ar3 -0.411513   0.084004  -4.8987 9.647e-07 ***
## ma1 -0.081764   0.034134  -2.3954   0.01660 *  
## ma2  0.999997   0.037506  26.6624 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the coefficients of SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) are significant for both the CSS and ML methods, hence this model is considered for further diagnostics.

SARIMA (3,1,4) \(\times\) (0,1,0)\(_{12}\)

model4_314_airline_CSS = arima(log(air_passenger_df_TS),order=c(3,1,4),seasonal=list(order=c(0,1,0), period=12),method = "CSS")
coeftest(model4_314_airline_CSS)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.12894    0.44316  0.2910 0.7710872    
## ar2 -0.81538    0.14134 -5.7691 7.971e-09 ***
## ar3 -0.14720    0.37972 -0.3877 0.6982662    
## ma1 -0.47136    0.44643 -1.0559 0.2910353    
## ma2  0.98690    0.27640  3.5706 0.0003562 ***
## ma3 -0.36360    0.44256 -0.8216 0.4113238    
## ma4  0.01335    0.22221  0.0601 0.9520919    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4_314_airline_ML = arima(log(air_passenger_df_TS),order=c(3,1,4),seasonal=list(order=c(0,1,0), period=12),method = "ML")
## Warning in log(s2): NaNs produced
coeftest(model4_314_airline_ML)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.37044    0.42378 -0.8741  0.382044    
## ar2 -0.68279    0.12602 -5.4180 6.028e-08 ***
## ar3 -0.52411    0.33921 -1.5451  0.122322    
## ma1  0.02575    0.43740  0.0589  0.953055    
## ma2  0.77436    0.27746  2.7909  0.005257 ** 
## ma3  0.10757    0.44894  0.2396  0.810643    
## ma4 -0.21937    0.25027 -0.8765  0.380752    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the CSS and ML coefficient tests, not all coefficients of SARIMA (3,1,4) \(\times\) (0,1,0)\(_{12}\) are significant, hence this model is discarded.

Diagnostics check

Diagnostic checking is carried out for the models with significant coefficients under both the CSS and ML methods. This involves: a plot of the residuals to check that they are stationary, i.e., that no trend or seasonality is left by the model; checks of the normality of the residuals through the Shapiro-Wilk test, a histogram and a QQ plot; the ACF plot; and the Ljung-Box test to check that no autocorrelation is left in the residuals. To carry out these checks, a function residual_analysis is defined as shown below; it converts the residuals to standardised residuals and produces the standardised residual plot, histogram, QQ plot, ACF plot and Ljung-Box test plot.

residual_analysis <- function(model, std = TRUE,start = 2, class = c("ARIMA","GARCH","ARMA-GARCH", "fGARCH")[1]){
  if (class == "ARIMA"){
    if (std == TRUE){
      res.model = rstandard(model)
    }else{
      res.model = residuals(model)
    }
  }else if (class == "GARCH"){
    res.model = model$residuals[start:model$n.used]
  }else if (class == "ARMA-GARCH"){
    res.model = model@fit$residuals
  }else if (class == "fGARCH"){
    res.model = model@residuals
  }else {
    stop("The argument 'class' must be either 'ARIMA' or 'GARCH' ")
  }
  par(mfrow=c(3,2))
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model,main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  k=0
  LBQPlot(res.model, lag.max = 30, StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow=c(1,1))
}

Shapiro-Wilk test:
Null hypothesis \(H_{0}\): the errors are normally distributed.
Alternative hypothesis \(H_{A}\): the errors are not normally distributed.
Reject \(H_{0}\) if the p-value < 0.05.

Among the possible models, the following models are selected for diagnostics check:
{ SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\), SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) }

sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}
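sort.score() is not invoked later in the report; a minimal usage sketch (assuming the ML fits from the model-fitting section) would rank the candidate models by an information criterion.

# Rank the two candidate models by AIC (lower is better); use score = "bic" for BIC
sort.score(AIC(model4_110_airline_ML, model4_312_airline_ML), score = "aic")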

Residual Analysis: SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\)

m4_110_airline = arima(air_passenger_df_TS,order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
residual_analysis(model = m4_110_airline, class = "ARIMA")

Figure 20: Residual Analysis for SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\)


  • The residuals are spread around 0 and there is no trend in the standardised residuals.
  • From the Shapiro-Wilk normality test the p-value is 0.1223, hence we fail to reject the null hypothesis; the residuals can be treated as normal.
  • This is further supported visually by the histogram and the QQ plot, as the data points lie close to the reference line.
  • From the ACF plot there are no lags above the significance bounds.
  • From the Ljung-Box test the points are above the reference line, which implies that no autocorrelation is left in the residuals and thus no information is left in the series (a numeric version of this check is sketched after this list).

As the residuals are white noise, the SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) model is considered for forecasting. To confirm, its AIC and BIC values are compared with those of the other candidate model.
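As a numeric counterpart to the Ljung-Box plot (an addition, not part of the original report), Box.test() from base R can be applied to the standardised residuals; fitdf is set to 1 for the single AR coefficient of this model.

# Ljung-Box test at lag 24; H0 = no autocorrelation left in the residuals
Box.test(rstandard(m4_110_airline), lag = 24, type = "Ljung-Box", fitdf = 1)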

Residual Analysis: SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\)

m4_312_airline = arima(air_passenger_df_TS,order=c(3,1,2),seasonal=list(order=c(0,1,0), period=12),method = "ML")
residual_analysis(model = m4_312_airline, class = "ARIMA")

Figure 21: Residual Analysis for SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\)


  • The standardised residuals are spread around 0.
  • From the Shapiro-Wilk normality test the p-value is 0.01436, hence we reject the null hypothesis; the residuals are not normal.
  • The histogram also suggests that the residuals are not normal.
  • In the QQ plot, the data points at the two ends are away from the reference line.
  • From the ACF plot there are no lags above the significance bounds, and from the Ljung-Box test the points are above the reference line, which implies that no autocorrelation is left in the residuals and no information is left in the series.

As the residual series is not normal, this model is not a good option for the forecast.


For the two selected models, let's compare their AIC and BIC values.

BIC(model4_110_airline_ML,model4_312_airline_ML)
##                       df       BIC
## model4_110_airline_ML  2 -408.1026
## model4_312_airline_ML  6 -406.4563
AIC(model4_110_airline_ML,model4_312_airline_ML)
##                       df       AIC
## model4_110_airline_ML  2 -413.6776
## model4_312_airline_ML  6 -423.1813

Between the two models, SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) has the lower BIC, while SARIMA (3,1,2) \(\times\) (0,1,0)\(_{12}\) has the lower AIC. As the SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) model is the simpler of the two and is the only one with normally distributed residuals, it is selected for the forecast.

Forecasting

The forecast for the next 24 months is produced with the forecast method using the SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\) model, and the predictions are plotted.

# Forecasting
m4_110_airline_forecast = Arima(air_passenger_df_TS,order=c(1,1,0),seasonal=list(order=c(0,1,0), period=12),
                                method = "CSS")
preds1 = forecast(m4_110_airline_forecast, h = 24)
plot(preds1,
     main = "Forecast for the next 24 months of the time series",
     xlab = "Year",
     ylab = "Number of Passengers")

Figure 22: Forecast for the next 24 months using SARIMA (1,1,0) \(\times\) (0,1,0)\(_{12}\)


From Figure 22, it is inferred that the number of passengers is likely to increase, as the upward trend continues and the seasonal pattern remains the same. The prediction intervals are also consistent with a continued upward trend.
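To inspect the numbers behind the plot (an addition to the original report), the point forecasts and prediction intervals can be printed directly.

# Point forecasts with 80% and 95% prediction intervals for the next 24 months
print(preds1)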

Conclusion

The time series has a stochastic upward trend. The data is initially fitted with deterministic trend models, namely the linear, quadratic, seasonal, and cosine models. As expected, these models did not perform well on the data. A stochastic trend model, the SARIMA model, is then fitted since the time series also shows seasonality. A residual approach is followed to specify the SARIMA model, and the possible SARIMA models are found using the ACF, PACF, EACF, and BIC tables. For each possible model, the coefficients of the parameters are checked for significance and the models with significant coefficients are selected. These selected models are then put through residual analysis, namely checks of normality, the ACF, and the Ljung-Box test. The model whose residuals form a stationary (white-noise) series is selected to forecast the number of passengers for the next 24 months. From the forecast, it is inferred that the number of airline passengers is expected to increase.