QUESTION ONE
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
## Warning: package 'tseries' was built under R version 4.3.3
data("AirPassengers")
dataset <- AirPassengers
dataset
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
QUESTION TWO
# Plot the time series
plot(dataset, main="AirPassengers Time Series", ylab="Number of Passengers", xlab="Year")
The plot shows an increase in the number of passengers over time, indicating a clear upward trend. This implies that the mean of the series is not constant, so the series is non-stationary in its mean (this is a visual judgement from the plot; an ADF test would confirm it formally). This will need to be addressed when modelling.
Regarding seasonality, the series exhibits a repeating pattern with a consistent yearly cycle. This can be seen in the peaks and troughs that occur at roughly the same time each year, with more passengers around mid-year, most likely because of the summer vacation period.
No apparent anomalies or outliers are observed in the data. The series maintains consistent behaviour over time, showing no unexpected spikes or drops.
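To make the trend and seasonal cycle described above explicit, a classical decomposition could be plotted as a quick supplementary check (this sketch uses the base R decompose() and stl() functions and is not part of the question's required output):
# Supplementary check: split the series into trend, seasonal and
# remainder components to visualise the patterns described above
plot(decompose(dataset))
# STL gives a similar picture and handles the growing seasonal
# amplitude a little more gracefully
plot(stl(dataset, s.window = "periodic"))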
QUESTION THREE
To check for stationarity, we conduct an Augmented Dickey-Fuller (ADF) test.
adf_test <- adf.test(dataset)
## Warning in adf.test(dataset): p-value smaller than printed p-value
adf_test
##
## Augmented Dickey-Fuller Test
##
## data: dataset
## Dickey-Fuller = -7.3186, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
if (adf_test$p.value < 0.05) {
print("Reject the null hypothesis: The series is stationary.")
} else {
print("Fail to reject the null hypothesis: The series is non-stationary.")
}
## [1] "Reject the null hypothesis: The series is stationary."
Somewhat surprisingly, the p-value is less than 0.05. Taken at face value, we should reject the null hypothesis and conclude that the series is stationary, yet the plot clearly suggests a non-stationary series.
There are several likely reasons for this disagreement. The adf.test() regression includes a deterministic trend term, so rejecting its null only indicates stationarity around a trend rather than strict stationarity. In addition, the data set is fairly small, the default lag length may be too short to capture the strong yearly seasonality, and the test does not account for the changing variance of the series, which is clearly visible in the plot.
In conclusion, the series is treated as non-stationary, despite the statistically significant result provided by the ADF test.
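As an optional cross-check on this conclusion (a minimal sketch, not part of the required output), the KPSS test from the same tseries package reverses the hypotheses, taking stationarity as the null, so a small p-value there points to non-stationarity:
# KPSS test: the null hypothesis is (level) stationarity, so a small
# p-value indicates non-stationarity, complementing the ADF test
kpss.test(dataset)
kpss.test(dataset, null = "Trend")  # test stationarity around a deterministic trend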
QUESTION FOUR
To transform the data, we apply a log transformation (to stabilize the variance) and then difference the series once before fitting our ARMA model.
DS_diff <- diff(log(dataset),differences = 1)
plot(DS_diff, main="Differenced Log AirPassengers", ylab="Differenced Log Passengers", xlab="Year")
adf_test <- adf.test(DS_diff)
## Warning in adf.test(DS_diff): p-value smaller than printed p-value
adf_test
##
## Augmented Dickey-Fuller Test
##
## data: DS_diff
## Dickey-Fuller = -6.4313, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Both the plot of the transformed series and the ADF test suggest that the series is now stationary: the upward trend has been removed and the series fluctuates around a constant mean.
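As a further sanity check (a small sketch using the forecast package's ndiffs() helper, not required by the question), one could ask how many ordinary differences the log series appears to need:
# Estimated number of ordinary differences required for stationarity;
# a value of 1 would agree with the single difference applied above
ndiffs(log(dataset))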
QUESTION FIVE
# Plot ACF and PACF
par(mfrow=c(1,2))
acf(DS_diff, main="ACF of Differenced Log Transformed Series")
pacf(DS_diff, main="PACF of Differenced Log Transformed Series")
According to the ACF plot, we see a significant spike at lag 1, suggesting strong correlation at that lag, while the subsequent lags fall within the confidence bounds and are therefore not significant. We therefore opt for an MA(1) component, i.e. q = 1.
For the PACF plot, we also see a significant spike at lag 1, with the subsequent lags falling within the confidence bounds and hence non-significant, suggesting an AR(1) component, i.e. p = 1.
Based on the plots, our suggested model is an ARMA(1,1), with AR order p = 1 and MA order q = 1.
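To sanity-check the orders read off the plots (an optional sketch, not required by the question), a few low-order ARMA candidates could be fitted to the differenced series and compared by AICc, where smaller is better:
# Compare a handful of low-order candidates on the differenced log series
Arima(DS_diff, order = c(1, 0, 1))$aicc  # suggested ARMA(1,1)
Arima(DS_diff, order = c(1, 0, 0))$aicc  # pure AR(1)
Arima(DS_diff, order = c(0, 0, 1))$aicc  # pure MA(1)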
QUESTION 6
Seasonality indicates regular, repeating patterns in the data, suggesting the need for seasonal differencing or seasonal components in the model. Seasonal peaks and troughs in the ACF and PACF plots point to seasonal AR and MA terms, so SAR and SMA components need to be considered when fitting the model.
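A quick way to confirm that a seasonal term is warranted (a supplementary sketch using the forecast package's nsdiffs() helper; seasonal_diff is a working variable introduced here for illustration) is to estimate the number of seasonal differences needed and re-examine the ACF/PACF after seasonal differencing:
# Estimated number of seasonal differences required; a value of 1
# would suggest one seasonal difference at lag 12
nsdiffs(log(dataset))
# Seasonally difference the log series and re-inspect ACF/PACF for
# hints about seasonal AR (SAR) and MA (SMA) terms
seasonal_diff <- diff(log(dataset), lag = 12)
par(mfrow = c(1, 2))
acf(seasonal_diff, main = "ACF after seasonal differencing")
pacf(seasonal_diff, main = "PACF after seasonal differencing")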
QUESTION 7
# Seasonally difference the log-transformed series to remove seasonality
# (note: this overwrites the earlier first-differenced DS_diff)
DS_diff <- diff(log(dataset), lag = 12) # Seasonal differencing at lag 12
# Fit an ARMA(1,1) model to the differenced series
arma_model1 <- Arima(DS_diff, order = c(1,0,1))
summary(arma_model1)
## Series: DS_diff
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.8565 -0.2844 0.1147
## s.e. 0.0588 0.1074 0.0174
##
## sigma^2 = 0.001755: log likelihood = 232.59
## AIC=-457.18 AICc=-456.86 BIC=-445.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001026603 0.04141181 0.03102731 -Inf Inf 0.4026692 -0.02241652
AR(1) coefficient (ar1): 0.8565. This indicates that the series has positive autocorrelation at lag 1; the size of the coefficient suggests a strong influence of the previous observation on the current one.
MA(1) coefficient (ma1): -0.2844. This shows that the model accounts for the effect of the lagged error term, with a negative influence.
Mean: 0.1147. The model includes a constant term, which is the mean level around which the differenced series fluctuates.
sigma²: 0.001755. This is the estimated variance of the residuals; the low value indicates a close fit of the model to the data.
Log-likelihood: 232.59. The log-likelihood measures how well the model fits the data; a higher value indicates a better fit.
Overall, the model seems to fit the data quite well, with low values for RMSE and MAE and a MASE below 1, indicating better performance than a naïve forecast.
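As a quick supplementary check of these fit claims (mirroring the diagnostics applied to the automatic model in Questions 10 and 11), the residuals of this manually specified model could also be inspected directly:
# Residual plot, residual ACF and a Ljung-Box test for the manual model
checkresiduals(arma_model1)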
QUESTION 8
# Let auto.arima() select a model for the original series automatically
arima_model <- auto.arima(AirPassengers)
arima_model
## Series: AirPassengers
## ARIMA(2,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1
## 0.5960 0.2143 -0.9819
## s.e. 0.0888 0.0880 0.0292
##
## sigma^2 = 132.3: log likelihood = -504.92
## AIC=1017.85 AICc=1018.17 BIC=1029.35
TO COMPARE THE TWO MODELS:
Fit Statistics:
Model 1 has a significantly lower sigma², indicating that the residual variance is much smaller, which suggests that the model fits the data better. Model 1 also has much lower AIC, AICc, and BIC values than Model 2; lower values of these criteria indicate a better model fit relative to its complexity.
Error Measures:
Model 1 shows much lower values for ME, RMSE, and MAE compared to Model 2, which suggests that Model 1 has smaller prediction errors.
IN CONCLUSION, Model 1 is the better model, for the following reasons:
Better Fit Statistics: lower sigma², AIC, AICc, and BIC values indicate a better fit. Lower Error Measures: lower ME, RMSE, and MAE suggest more accurate predictions.
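These information criteria can also be pulled from the fitted objects programmatically; a small sketch using the aic and bic elements stored on the fitted models (note that Model 1 is fitted to the seasonally differenced log series, so its scale-dependent measures are not on exactly the same footing as Model 2's):
# Side-by-side information criteria for the two fitted models
data.frame(
  model = c("Model 1: ARMA(1,1) on seasonally differenced log", "Model 2: auto.arima"),
  AIC = c(arma_model1$aic, arima_model$aic),
  BIC = c(arma_model1$bic, arima_model$bic)
)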
Alternatively, we can compare the accuracy of the two models using the forecast package.
library(forecast)
accuracy_model1 <- accuracy(fitted(arma_model1), DS_diff)
accuracy_model2 <- accuracy(fitted(arima_model), dataset)
cat("Model 1 Accuracy:\n")
## Model 1 Accuracy:
print(accuracy_model1)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 0.001026603 0.04141181 0.03102731 -Inf Inf -0.02241652 0
cat("Model 2 Accuracy:\n")
## Model 2 Accuracy:
print(accuracy_model2)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 1.3423 10.84619 7.86754 0.420698 2.800458 -0.00124847 0.3755164
This check suggests the same conclusion: Model 1 performs better overall, based on lower errors and better fit indicators.
QUESTION 9
summary(arima_model)
## Series: AirPassengers
## ARIMA(2,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1
## 0.5960 0.2143 -0.9819
## s.e. 0.0888 0.0880 0.0292
##
## sigma^2 = 132.3: log likelihood = -504.92
## AIC=1017.85 AICc=1018.17 BIC=1029.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.3423 10.84619 7.86754 0.420698 2.800458 0.245628 -0.00124847
Model Fit:
The model has reasonable fit statistics given the AIC and BIC values, though they are not extremely low. This suggests that while the model is relatively good, there might be room for improvement.
Error Metrics:
The RMSE and MAE are relatively high, which suggests that while the model captures the data well, there is still some significant residual error. The MAPE is low, indicating that in terms of percentage error, the model performs reasonably well. The MASE being less than 1 indicates that the model performs better than a naive forecast, which is good.
Model Specifications:
The chosen ARIMA(2,1,1) model with seasonal differencing (12) seems reasonable given that the AirPassengers dataset has a clear seasonal pattern (monthly data with annual seasonality). The coefficients suggest that the model captures both the autoregressive and moving average components effectively.
Overall, the model fits the series reasonably well. HOWEVER, based on the fit statistics and error measures, the ARIMA(1,0,1) model with a non-zero mean appears to be a better choice than the ARIMA(2,1,1)(0,1,0)[12] model. The ARIMA(1,0,1) model has lower AIC and BIC values, indicating a better fit relative to its complexity, and its lower RMSE and MAE suggest it provides more accurate forecasts with smaller residuals.
While the ARIMA(2,1,1)(0,1,0)[12] model incorporates additional parameters and seasonal components, it does not perform as well in terms of fit and prediction accuracy as the ARIMA(1,0,1) model in this case.
As the data analyst here, based on these ARIMA specifications and results, I would therefore not agree with the automated model and would opt to stick with the manual model, as it appears to fit better and forecast more accurately.
QUESTION 10 and 11
# Perform diagnostic checks
checkresiduals(arima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,0)[12]
## Q* = 37.784, df = 21, p-value = 0.01366
##
## Model df: 3. Total lags used: 24
# Ljung-Box test
Box.test(residuals(arima_model), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(arima_model)
## X-squared = 0.00022916, df = 1, p-value = 0.9879
Residuals Over Time:
The residuals seem to fluctuate around the zero line without any clear pattern, which suggests that the model has captured the time series data well, with no obvious unmodeled trends or seasonality.
Autocorrelation Function (ACF): It shows that most of the autocorrelations for different lags are within the significance boundaries, indicating that there are no significant autocorrelations in the residuals.
Histogram of Residuals: The close match between the histogram and the normal curve suggests that the residuals are normally distributed, which is a good sign for the model’s validity.
The Ljung-Box results are mixed: the single-lag Box-Ljung test gives a p-value close to 1 (0.9879), indicating no first-order autocorrelation in the residuals, while the 24-lag test reported by checkresiduals() gives a p-value of about 0.014, hinting at some mild remaining autocorrelation at longer lags.
Generally, though, the residual plots show no strong hidden patterns, supporting the use of the model for forecasting purposes.
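One refinement worth noting (a sketch, not part of the original output): Box.test() with its default lag of 1 only examines first-order autocorrelation, so running it over 24 lags with the degrees of freedom adjusted for the three estimated coefficients should reproduce the test reported by checkresiduals():
# Ljung-Box test over 24 lags, adjusting df for the 3 fitted ARMA terms
Box.test(residuals(arima_model), lag = 24, fitdf = 3, type = "Ljung-Box")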
QUESTION 12 and 13
# Generate forecast
forecast_values <- forecast(arima_model, h=12)
forecast_values
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1961 445.6349 430.8903 460.3795 423.0851 468.1847
## Feb 1961 420.3950 403.0907 437.6993 393.9304 446.8596
## Mar 1961 449.1983 429.7726 468.6240 419.4892 478.9074
## Apr 1961 491.8399 471.0270 512.6529 460.0092 523.6707
## May 1961 503.3945 481.5559 525.2330 469.9953 536.7937
## Jun 1961 566.8624 544.2637 589.4612 532.3007 601.4242
## Jul 1961 654.2602 631.0820 677.4383 618.8122 689.7081
## Aug 1961 638.5975 614.9704 662.2246 602.4630 674.7320
## Sep 1961 540.8837 516.9028 564.8647 504.2081 577.5594
## Oct 1961 494.1266 469.8624 518.3909 457.0177 531.2356
## Nov 1961 423.3327 398.8381 447.8273 385.8715 460.7939
## Dec 1961 465.5076 440.8229 490.1923 427.7556 503.2596
plot(forecast_values, main="12-Month Forecast", xlab = "Year", ylab = "Passengers")
The historical data demonstrate a distinct rising trend with a clear seasonal pattern, and the predicted values for the following 12 months follow the same pattern: the rising trend and seasonality of the historical data are maintained in the forecast. This indicates that, generally, the model has captured the trend and seasonality of the time series. However, a limitation of the model is that it assumes historical patterns will continue into the future, and long-term forecasts become less certain since the forecast intervals widen as the horizon increases.
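The point forecasts and the widening of the prediction intervals mentioned above can be read directly off the forecast object; a small sketch:
# Point forecasts for the next 12 months
forecast_values$mean
# Width of the 95% interval (second column of lower/upper);
# note how it grows with the forecast horizon
forecast_values$upper[, 2] - forecast_values$lower[, 2]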
QUESTION 14 and 15
# Fit a seasonal ARIMA model
seasonal_arima_model <- auto.arima(AirPassengers, seasonal=TRUE)
summary(seasonal_arima_model)
## Series: AirPassengers
## ARIMA(2,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1
## 0.5960 0.2143 -0.9819
## s.e. 0.0888 0.0880 0.0292
##
## sigma^2 = 132.3: log likelihood = -504.92
## AIC=1017.85 AICc=1018.17 BIC=1029.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.3423 10.84619 7.86754 0.420698 2.800458 0.245628 -0.00124847
non_seasonal<- auto.arima(AirPassengers, seasonal=FALSE)
summary(non_seasonal)
## Series: AirPassengers
## ARIMA(4,1,2) with drift
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 drift
## 0.2243 0.3689 -0.2567 -0.2391 -0.0971 -0.8519 2.6809
## s.e. 0.1047 0.1147 0.0985 0.0919 0.0866 0.0877 0.1711
##
## sigma^2 = 706.3: log likelihood = -670.07
## AIC=1356.15 AICc=1357.22 BIC=1379.85
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.228696 25.82793 20.59211 -1.665245 7.476447 0.6428946
## ACF1
## Training set 0.0009861078
# Compare forecast accuracy
seasonal_forecast <- forecast(seasonal_arima_model, h=12)
plot(seasonal_forecast, main="12-Month Seasonal Forecast")
non_seasonal_forecast <- forecast(non_seasonal, h=12)
plot(non_seasonal_forecast, main="12-Month Non-Seasonal Forecast")
The seasonal model has a lower AIC value of 1017.85 compared to 1356.15 for the non-seasonal model, suggesting it is a better fit given the number of parameters. Similarly, the BIC is lower for the seasonal model (1029.35) than for the non-seasonal model (1379.85), indicating that after adjusting for the number of parameters, the seasonal model still provides a better fit.
The seasonal ARIMA model explicitly accounts for seasonality with its seasonal differencing term, which is designed to capture and adjust for seasonal effects. The non-seasonal model does not have a seasonal component, and while it includes a drift term, this may not be sufficient to capture the seasonal patterns in the data.
Based on the error measures provided, the seasonal model has a lower RMSE (10.84619) and MAE (7.86754), which points to more accurate forecasts compared to the non-seasonal model with an RMSE of 25.82793 and MAE of 20.59211. The MAPE for the seasonal model is also lower (2.800458), indicating that the percentage errors are smaller, which is desirable for forecast accuracy.
IN CONCLUSION, the seasonal model is better as it explicitly accounts for seasonality. Its lower AIC and BIC values, in comparison to the non-seasonal model, suggest the same, and the RMSE, MAE and MAPE values back this up, showing lower forecast accuracy for the second (non-seasonal) model.
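The in-sample error measures quoted above can also be obtained directly from the two fitted models with the forecast package's accuracy() helper; a minimal sketch:
# Training-set accuracy measures for the seasonal and non-seasonal fits
accuracy(seasonal_arima_model)
accuracy(non_seasonal)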