Air Passengers Forecast

Introduction

The number of international passengers per month on an airline (Pan Am) in the united states were obtained from the Fedral Aviation Administration for the period 1946-1960. The company used the data to predict future demand before ordering new aircraft and training aircrew. The data are available as a time series in R and is named AirPassengers.

Exploratory Time Series Analysis

library(forecast)
data(AirPassengers)
AP <- AirPassengers
AP

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432

All data in R are stored in objects, which have a range of methods available. The class of an object can be found using the class function:

   class(AP)

## [1] "ts"

   start(AP); end(AP); frequency(AP)

## [1] 1949    1

## [1] 1960   12

## [1] 12

   summary(AP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

Now, plot the time series…

plot(AP, ylab= "Passengers (1000's)")

plot(decompose(AirPassengers)) # time series decomposition

apts <- ts(AirPassengers, frequency=12)
f<- decompose(apts)

The above figure shows the time series decomposition into trend, seasonal and random (noise) . It is clear that the time series is non-stationary (has random walks) because of seasonal effects and a trend (linear trend).

Also, we may use Augemented Dickey-Fuller test (from tseries package) to check the stationarity of our timeseries. Rejecting the null hypothesis suggests that a time series is stationary

library(tseries)
adf.test(apts, alternative ="stationary", k=12)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  apts
## Dickey-Fuller = -1.5094, Lag order = 12, p-value = 0.7807
## alternative hypothesis: stationary

From the above p-value, we concluded that the time series is non-stationary (i.e. explosive). Now, We have to transfer the time series from non-stationary state to a stationary state in which it’s statistical properties (mean) do NOT change with time. Stationarity implies that E[x_t] must not depend on time (i.e. series has no underlying trend). We will use first-order differencing for such transformation. Differencing a series can remove trends.

Model Identification and Estimation

We need to find the appropriate values of p,d,q representing the AR order, the degress of differencing, and the MA order respectively. We will use auto.arima to find the best ARIMA model to univariate time series (i.e. a time series that consists of single (scalar) observations recorded sequentially over equal time increments.)

  findbest <- auto.arima(AirPassengers)
  findbest #Check the arma values: [1]  0  1  0  0 12  1  1

## Series: AirPassengers 
## ARIMA(0,1,1)(0,1,0)[12]                    
## 
## Coefficients:
##           ma1
##       -0.3184
## s.e.   0.0877
## 
## sigma^2 estimated as 137.3:  log likelihood=-508.32
## AIC=1020.64   AICc=1020.73   BIC=1026.39

  plot(forecast(findbest,h=20))

Create ARIMA prediction model

fit <- arima(AirPassengers, order=c(0, 1, 1), list(order=c(0, 1, 0), period = 12))
fit

## Series: AirPassengers 
## ARIMA(0,1,1)(0,1,0)[12]                    
## 
## Coefficients:
##           ma1
##       -0.3184
## s.e.   0.0877
## 
## sigma^2 estimated as 137.3:  log likelihood=-508.32
## AIC=1020.64   AICc=1020.73   BIC=1026.39

Compute prediction intervals of 95% confidence level for each prediction

fore <- predict(fit, n.ahead=24)
# calculate upper (U) and lower (L) prediction intervals
U <- fore$pred + 2*fore$se # se: standard error (quantile is 2 as mean=0)
L <- fore$pred - 2*fore$se
# plot observed and predicted values
ts.plot(AirPassengers, fore$pred, U, L, col=c(1, 2, 4, 4), lty=c(1, 1, 2, 2))
library(graphics)
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% prediction interval)"), 
       col=c(1, 2, 4),lty=c(1, 1, 2))

In the above figure, the red solid line shows the forecasted values, and the blue dotted lines are error bounds at a confidence level of 95%.

Residual Analysis

The sample autocorrelation function (ACF) for a series gives correlations between the series xt and lagged values of the series for lags of 1, 2, 3, and so on. The lagged values can be written as xt-1, xt-2, xt-3,and so on. The ACF gives correlations between xt and xt-1, xt and xt-2, and so on. The ideal for a sample ACF of residuals is that there aren’t any significant correlations for any lag.
The partial autocorrelation function (PACF) plays an important role in data analyses aimed at identifying the extent of the lag in an autoregressive model. By plotting the partial autocorrelative functions one could determine the appropriate lags p in an AR (p) model or in an extended ARIMA (p,d,q) model.

   res <- residuals(fit)  # get residuals from fit
   # check acf and pacf of residuals
   acf(res)

   pacf(res)

The above figure shows the ACF of the residuals for a model. The “lag” (time span between observations) is shown along the horizontal, and the autocorrelation is on the vertical. The red lines indicated bounds for statistical significance. This is a good ACF for residuals. Nothing is significant; that’s what we want for residuals.

The lag at which the PACF cuts off is the indicated number of AR terms (i.e. p).
The lag at which the ACF cuts off is the indicated number of MA terms (i.e. q)

Check normality using Q-Q plot

arima() fits the model using maximum likelihood estimation (assuming Gaussian residual series) Now, plot the Q–Q plots, which measures the agreement of a fitted distribution with observed data…

# qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
   qqnorm(residuals(fit))
   qqline(residuals(fit))

The linearity of the points suggests that the data are normally distributed with mean = 0.

Test for stationarity using ADF test

To cross check the stationarity of timeseries, use Augemented Dickey-Fuller test (from tseries package). Rejecting the null hypothesis suggests that a time series is stationary

library(tseries)
adf.test(fit$residuals, alternative ="stationary")

## Warning in adf.test(fit$residuals, alternative = "stationary"): p-value
## smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  fit$residuals
## Dickey-Fuller = -5.0027, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

From the above p-value, we can conclude that the residuals of our ARIMA prediction model is stationary.