This document focuses on the analysis of the AirPassengers dataset.
The AirPassengers dataset in R provides monthly totals of international airline passengers from 1949 to 1960.
A description of the AirPassengers dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/AirPassengers.html
I will assess whether a linear regression or an ARIMA model is the better fit for the time series forecast as follows:
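The code in this document assumes the following packages are loaded (autoplot and ggtsdiag come from ggplot2/ggfortify, adf.test from tseries, and auto.arima and forecast from the forecast package):
# packages assumed throughout this analysis
library(ggplot2)    # plotting
library(ggfortify)  # autoplot methods for ts/acf objects, ggtsdiag
library(tseries)    # adf.test
library(forecast)   # auto.arima, forecast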
data(AirPassengers)
AP <- AirPassengers
# Take a look at the class of the dataset AirPassengers
class(AP)
## [1] "ts"
The dataset is already a time series (ts) object.
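Since AP is a ts object, its start and end dates can be checked directly (a quick optional check, not part of the original write-up):
# optional: confirm the start and end of the series
start(AP)
end(AP)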
# preview of data
AP
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
Passenger numbers are in thousands ('000) per month for each year from 1949 to 1960.
# test for missing values
sum(is.na(AP))
## [1] 0
There are no missing values, great!
# test frequency
frequency(AP)
## [1] 12
The frequency is 12, i.e. one observation per calendar month.
# test cycle
cycle(AP)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 1 2 3 4 5 6 7 8 9 10 11 12
## 1950 1 2 3 4 5 6 7 8 9 10 11 12
## 1951 1 2 3 4 5 6 7 8 9 10 11 12
## 1952 1 2 3 4 5 6 7 8 9 10 11 12
## 1953 1 2 3 4 5 6 7 8 9 10 11 12
## 1954 1 2 3 4 5 6 7 8 9 10 11 12
## 1955 1 2 3 4 5 6 7 8 9 10 11 12
## 1956 1 2 3 4 5 6 7 8 9 10 11 12
## 1957 1 2 3 4 5 6 7 8 9 10 11 12
## 1958 1 2 3 4 5 6 7 8 9 10 11 12
## 1959 1 2 3 4 5 6 7 8 9 10 11 12
## 1960 1 2 3 4 5 6 7 8 9 10 11 12
# dataset summary
summary(AP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
Summary statistics of the monthly passenger numbers (in '000).
# plot the raw data using autoplot
autoplot(AP) + labs(x="Time", y="Passenger numbers ('000)", title="Air Passengers from 1949 to 1960") +
theme_classic()
boxplot(AP~cycle(AP), xlab="Passenger Numbers ('000)", ylab="Months", col=rgb(0.1,0.9,0.3,0.4),
        main="Monthly Air Passengers Boxplot from 1949 to 1960", horizontal=TRUE, notch=FALSE)
Observations:
I’ll decompose the time series into estimates of the trend, seasonal, and random components using the moving average method.
The multiplicative model is:
Y[t] = T[t] * S[t] * e[t]
where:
- Y[t] is the number of passengers at time t,
- T[t] is the trend component at time t,
- S[t] is the seasonal component at time t,
- e[t] is the random error component at time t.
decomposeAP <- decompose(AP,"multiplicative")
autoplot(decomposeAP) +
theme_classic()
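As a quick sanity check on the decomposition (an optional sketch, not part of the original analysis), the product of the three estimated components should reproduce the observed series wherever the moving-average trend is defined:
# optional check: trend * seasonal * random should recover the observed series
recomposed <- decomposeAP$trend * decomposeAP$seasonal * decomposeAP$random
max(abs(recomposed - AP), na.rm = TRUE)  # essentially zero, up to floating-point error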
Observations:
A stationary time series is one whose mean, variance and autocovariance are not functions of time. In order to fit ARIMA models, the time series is required to be stationary. I'll use two methods to test stationarity.
To test the stationarity of the time series, let's run the Augmented Dickey-Fuller (ADF) test using the adf.test function from the tseries R package.
First, set up the hypothesis test:
- Null hypothesis: the time series is non-stationary.
- Alternative hypothesis: the time series is stationary.
adf.test(AP)
## Warning in adf.test(AP): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: AP
## Dickey-Fuller = -7.3186, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
As a rule of thumb, when the p-value is less than 5% we reject the null hypothesis. Since the reported p-value is 0.01 (the warning indicates the true p-value is even smaller), which is below 0.05, we reject the null in favour of the alternative hypothesis that the time series is stationary.
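As an additional, optional check (not in the original analysis), the forecast package can estimate how many regular and seasonal differences would be needed to make the series stationary:
# optional: suggested number of differences for stationarity
ndiffs(AP)   # regular differences suggested by a unit-root test
nsdiffs(AP)  # seasonal differences suggested by a seasonal-strength measure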
Another way to test for stationarity is to use autocorrelation. I'll use the autocorrelation function (acf). This function plots the correlation between a series and its lags, i.e. previous observations, with a 95% confidence interval shown as dashed blue lines. If the autocorrelation at a given lag crosses the dashed blue line, that lag is significantly correlated with the current series.
autoplot(acf(AP, plot=FALSE)) + labs(title="Correlogram of Air Passengers from 1949 to 1960") +
theme_classic()
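For completeness (an optional sketch, not in the original analysis), the partial autocorrelation function (pacf) can be plotted the same way; it shows the correlation at each lag after removing the effect of shorter lags:
# optional: partial autocorrelation of the raw series
autoplot(pacf(AP, plot=FALSE)) + labs(title="Partial Correlogram of Air Passengers from 1949 to 1960") +
  theme_classic()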
Observations:
Since we have already created the decomposeAP list object with a random component, we can plot the acf of decomposeAP$random.
# review random time series for any missing values
decomposeAP$random
## Jan Feb Mar Apr May Jun Jul
## 1949 NA NA NA NA NA NA 0.9516643
## 1950 0.9626030 1.0714668 1.0374474 1.0140476 0.9269030 0.9650406 0.9835566
## 1951 1.0138446 1.0640180 1.0918541 1.0176651 1.0515825 0.9460444 0.9474041
## 1952 1.0258814 1.0939696 1.0134734 0.9695596 0.9632673 1.0003735 0.9468562
## 1953 0.9976684 1.0151646 1.0604644 1.0802327 1.0413329 0.9718056 0.9551933
## 1954 0.9829785 0.9232032 1.0044417 0.9943899 1.0119479 0.9978740 1.0237753
## 1955 1.0154046 0.9888241 0.9775844 1.0015732 0.9878755 1.0039635 1.0385512
## 1956 1.0066157 0.9970250 0.9876248 0.9968224 0.9985644 1.0275560 1.0217685
## 1957 0.9937293 0.9649918 0.9881769 0.9867637 0.9924177 1.0328601 1.0261250
## 1958 0.9954212 0.9522762 0.9469115 0.9383993 0.9715785 1.0261340 1.0483841
## 1959 0.9825176 0.9505736 0.9785278 0.9746440 1.0177637 0.9968613 1.0373136
## 1960 1.0039279 0.9590794 0.8940857 1.0064948 1.0173588 1.0120790 NA
## Aug Sep Oct Nov Dec
## 1949 0.9534014 1.0022198 1.0040278 1.0062701 1.0118119
## 1950 0.9733720 1.0225047 0.9721928 0.9389527 1.0067914
## 1951 0.9397599 0.9888637 0.9938809 1.0235337 1.0250824
## 1952 0.9931171 0.9746302 1.0046687 1.0202797 1.0115407
## 1953 0.9894989 0.9934337 1.0192680 1.0009392 0.9915039
## 1954 0.9845184 0.9881036 0.9927613 0.9995143 0.9908692
## 1955 0.9831117 1.0032501 1.0003084 0.9827720 1.0125535
## 1956 1.0004765 1.0008730 0.9835071 0.9932761 0.9894251
## 1957 1.0312668 1.0236147 1.0108432 1.0212995 1.0005263
## 1958 1.0789695 0.9856540 0.9977971 0.9802940 0.9405687
## 1959 1.0531001 0.9974447 1.0013371 1.0134608 0.9999192
## 1960 NA NA NA NA NA
# autoplot the acf of the random component (observations 7:138, which excludes the NA values)
autoplot(acf(decomposeAP$random[7:138], plot=FALSE)) + labs(title="Correlogram of Air Passengers Random Component from 1949 to 1960") +
theme_classic()
Observations:
1. Linear Regression Model
Given that there is an upward trend, we'll look at a linear model first for comparison.
autoplot(AP) + geom_smooth(method="lm") + labs(x="Time", y="Passenger numbers ('000)", title="Air Passengers from 1949 to 1960") +
theme_classic()
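For an explicit model fit rather than just the fitted line in the plot (a minimal sketch, not part of the original code; lmAP is a name introduced here for illustration), the tslm function from the forecast package can regress the series on a linear trend and seasonal dummy variables:
# illustrative linear model with trend and seasonal dummies (hypothetical object name lmAP)
lmAP <- tslm(AP ~ trend + season)
summary(lmAP)  # coefficients for the trend and each month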
Observations:
2. ARIMA Model
I'll use the auto.arima function from the forecast R package to fit the best model and coefficients, given the default parameters (including seasonal = TRUE).
arimaAP <- auto.arima(AP)
arimaAP
## Series: AP
## ARIMA(2,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1
## 0.5960 0.2143 -0.9819
## s.e. 0.0888 0.0880 0.0292
##
## sigma^2 estimated as 132.3: log likelihood=-504.92
## AIC=1017.85 AICc=1018.17 BIC=1029.35
The ARIMA(2,1,1)(0,1,0)[12] model has one order of non-seasonal differencing (d = 1), an autoregressive term of order 2 (p = 2) and a moving average term of order 1 (q = 1). The seasonal part applies one order of seasonal differencing (D = 1) with no seasonal AR or MA terms, at a seasonal period of 12 units, in this case months.
ggtsdiag(arimaAP) +
theme_classic()
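As an additional residual diagnostic (an optional sketch, not part of the original analysis), the checkresiduals function from the forecast package plots the residuals and runs a Ljung-Box test for remaining autocorrelation:
# optional residual diagnostics including a Ljung-Box test
checkresiduals(arimaAP)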
Observations:
Plot a forecast of the time series using the forecast function, again from the forecast R package, with a 95% confidence interval, where h is the forecast horizon in months.
forecastAP <- forecast(arimaAP, level = c(95), h = 36)
autoplot(forecastAP) + labs(x="Time", y="Passenger numbers ('000)") +
theme_classic()
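To inspect the forecast values numerically (an optional step, not in the original write-up), the forecast object can be summarised; this lists the point forecasts and their 95% intervals for each of the 36 forecast months:
# optional: tabulate point forecasts and 95% intervals
summary(forecastAP)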