Introduction

In this report, we analyze the classic AirPassengers dataset in R. To keep forecasts strictly positive, we apply a log transformation before modeling with ARIMA and exponentiate the results back to the original scale.


Step 1: Load Libraries and Dataset

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)

# Load dataset
data(AirPassengers)

# Log transform to avoid negative forecasts
log_air <- log(AirPassengers)

Explanation: We model log(AirPassengers) because passenger numbers grow roughly exponentially and the seasonal swings widen with the level. The log transform stabilizes the variance, and exponentiating the forecasts at the end guarantees positive values on the original scale.
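
For readers who prefer the forecast package's own transformation helpers: the log is the Box-Cox transform with lambda = 0, so BoxCox() and InvBoxCox() give the same round trip. A minimal sketch (log_air2 is a hypothetical name used only here):

log_air2 <- BoxCox(AirPassengers, lambda = 0)   # identical to log(AirPassengers)
all.equal(as.numeric(log_air2), as.numeric(log(AirPassengers)))
head(InvBoxCox(log_air2, lambda = 0))           # back on the passenger scale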


Step 2: Basic Info

class(AirPassengers)
## [1] "ts"
start(AirPassengers)
## [1] 1949    1
end(AirPassengers)
## [1] 1960   12
frequency(AirPassengers)
## [1] 12
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0
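
The same metadata is available more compactly: tsp() returns the start and end as decimal years followed by the frequency, and length() gives the number of monthly observations (144 here, 12 years of 12 months).

tsp(AirPassengers)      # start, end (decimal years), frequency
length(AirPassengers)   # 144 monthly observations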

Step 3: Visualization

plot(AirPassengers, main="Original AirPassengers Time Series", col="blue")

plot(log_air, main="Log-Transformed AirPassengers", col="red")

boxplot(AirPassengers ~ cycle(AirPassengers), main="Monthly Distribution", col="lightblue")

Explanation: The log transform turns the exponential trend into an approximately linear one and makes the seasonal amplitude roughly constant, so the seasonal pattern is easier to read. The boxplot shows the clear within-year cycle, with traffic peaking in the summer months.
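
A complementary view is a month-wise subseries plot; a quick sketch using base R's monthplot(), which draws one subseries per calendar month along with its mean:

monthplot(log_air, main = "Month-wise Subseries of Log Series",
          ylab = "log(passengers)")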


Step 4: Stationarity Check

# ADF test on log data
adf.test(log_air, alternative="stationary", k=12)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  log_air
## Dickey-Fuller = -1.5325, Lag order = 12, p-value = 0.7711
## alternative hypothesis: stationary
# First differencing of log data
diff_log <- diff(log_air, differences=1)
plot(diff_log, main="1st Differenced Log Series", col="darkred")

# Re-test
adf.test(diff_log, alternative="stationary", k=12)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_log
## Dickey-Fuller = -3.3656, Lag order = 12, p-value = 0.06313
## alternative hypothesis: stationary

Explanation: One regular difference moves the ADF statistic from -1.53 (p = 0.77) to -3.37 (p = 0.063), so we can reject the unit-root null at the 10% level, though not quite at 5%. The differenced log series is close to stationary; the structure that remains is seasonal, which the seasonal difference in the Step 7 model handles.
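
As a cross-check on the differencing orders, the forecast package provides ndiffs() and nsdiffs(), which apply unit-root and seasonal-strength tests; for this series both should suggest one difference each, matching the (d = 1, D = 1) model chosen in Step 7.

ndiffs(log_air)    # regular differences suggested (expect 1)
nsdiffs(log_air)   # seasonal differences suggested (expect 1)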


Step 5: Autocorrelation Analysis (ACF & PACF)

# Plot ACF and PACF
par(mfrow = c(1,2))
acf(log_air, main="ACF of Log(AirPassengers)")
pacf(log_air, main="PACF of Log(AirPassengers)")

par(mfrow = c(1,1)) # reset

Explanation:


# Plot ACF and PACF
par(mfrow = c(1,2))
acf(diff_log, main="ACF of Differenced Log Series")
pacf(diff_log, main="PACF of Differenced Log Series")

par(mfrow = c(1,1)) # reset

Explanation:

ACF: a significant spike at lag 1 (short-term correlation remains) and a pronounced spike at the seasonal lag 12 (seasonality remains).

PACF: a strong spike at lag 1, suggesting a low-order AR or MA term.

Apart from the seasonal lag, the differenced log series looks stationary; a seasonal ARIMA model is appropriate, and a sketch applying the seasonal difference follows below.
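
To see what the model's seasonal difference does, here is a sketch taking one additional difference at lag 12 (diff2 is a hypothetical name); this is exactly the d = 1, D = 1 configuration auto.arima selects in Step 7:

diff2 <- diff(diff_log, lag = 12)
par(mfrow = c(1, 2))
acf(diff2,  main = "ACF after Regular + Seasonal Diff")
pacf(diff2, main = "PACF after Regular + Seasonal Diff")
par(mfrow = c(1, 1)) # reset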

Step 6: Decomposition

# The log transform converts multiplicative seasonality to additive
ddata <- decompose(log_air, type = "additive")
plot(ddata)
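
Classical decomposition uses fixed moving averages; STL is a more flexible alternative that estimates the seasonal component robustly. A minimal sketch (stl_fit is a hypothetical name):

stl_fit <- stl(log_air, s.window = "periodic")
plot(stl_fit, main = "STL Decomposition of Log Series")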


Step 7: Fit ARIMA Model

mymodel <- auto.arima(log_air)
mymodel
## Series: log_air 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 = 0.001371:  log likelihood = 244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77
# Alternative check
auto.arima(log_air, ic="aic", trace=TRUE)
## 
##  ARIMA(2,1,2)(1,1,1)[12]                    : Inf
##  ARIMA(0,1,0)(0,1,0)[12]                    : -434.83
##  ARIMA(1,1,0)(1,1,0)[12]                    : -474.8188
##  ARIMA(0,1,1)(0,1,1)[12]                    : -483.3991
##  ARIMA(0,1,1)(0,1,0)[12]                    : -449.9794
##  ARIMA(0,1,1)(1,1,1)[12]                    : -481.9131
##  ARIMA(0,1,1)(0,1,2)[12]                    : -481.9625
##  ARIMA(0,1,1)(1,1,0)[12]                    : -477.4053
##  ARIMA(0,1,1)(1,1,2)[12]                    : Inf
##  ARIMA(0,1,0)(0,1,1)[12]                    : -467.5581
##  ARIMA(1,1,1)(0,1,1)[12]                    : -481.8995
##  ARIMA(0,1,2)(0,1,1)[12]                    : -481.6165
##  ARIMA(1,1,0)(0,1,1)[12]                    : -481.4896
##  ARIMA(1,1,2)(0,1,1)[12]                    : -482.0433
## 
##  Best model: ARIMA(0,1,1)(0,1,1)[12]
## Series: log_air 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 = 0.001371:  log likelihood = 244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77
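
To make the selected specification explicit, the same model can be refit directly with forecast::Arima(); the coefficients should match the auto.arima output above (fit_manual is a hypothetical name):

fit_manual <- Arima(log_air, order = c(0, 1, 1), seasonal = c(0, 1, 1))
fit_manual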

Step 8: Residual Diagnostics

plot.ts(residuals(mymodel), main="Residuals of ARIMA Model")

Box.test(residuals(mymodel), lag=5, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(mymodel)
## X-squared = 4.8694, df = 5, p-value = 0.432
Box.test(residuals(mymodel), lag=10, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(mymodel)
## X-squared = 8.8097, df = 10, p-value = 0.5503
Box.test(residuals(mymodel), lag=15, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(mymodel)
## X-squared = 9.9068, df = 15, p-value = 0.8256

Explanation: All Ljung-Box p-values are well above 0.05, so we fail to reject the null of no residual autocorrelation: the residuals behave like white noise and the model fits adequately. (Strictly, Box.test should be given fitdf = 2 here to account for the two estimated MA coefficients; doing so does not change the conclusion.)
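
The forecast package bundles these diagnostics into one call, checkresiduals(), which plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test whose degrees of freedom already account for the estimated coefficients:

checkresiduals(mymodel)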


Step 9: Forecasting

# Forecast on log scale
myforecast <- forecast(mymodel, level=c(95), h=24)

# Exponentiate every component back to the original scale
# (including the history and fitted values, so the plot uses one scale)
myforecast$mean   <- exp(myforecast$mean)
myforecast$lower  <- exp(myforecast$lower)
myforecast$upper  <- exp(myforecast$upper)
myforecast$x      <- exp(myforecast$x)
myforecast$fitted <- exp(myforecast$fitted)

# Plot forecast
plot(myforecast, main="2-Year Forecast (Positive Values Only)", col="darkgreen")

Explanation: Back-transforming with exp() keeps passenger numbers strictly positive; exponentiating the historical series as well keeps the plot on a single scale. Note that exp() of the log-scale mean is the median, not the mean, of the forecast distribution on the original scale.
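
A less error-prone route is to let the forecast package manage the transformation itself via a Box-Cox lambda of 0; biasadj = TRUE then returns mean rather than median point forecasts. A sketch under those assumptions (fit_bc and fc_bc are hypothetical names):

fit_bc <- auto.arima(AirPassengers, lambda = 0, biasadj = TRUE)
fc_bc  <- forecast(fit_bc, h = 24, level = 95)
plot(fc_bc, main = "2-Year Forecast via Box-Cox (lambda = 0)")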


Step 10: Accuracy

# Training-set accuracy on the log scale, taken from the fitted model
# (myforecast was back-transformed above, so we use the model directly)
accuracy(mymodel)
##                        ME       RMSE        MAE        MPE      MAPE      MASE
## Training set 0.0005730622 0.03504883 0.02626034 0.01098898 0.4752815 0.2169522
##                    ACF1
## Training set 0.01443892
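
Training-set errors are optimistic because the model has already seen these observations. A sketch of a genuine out-of-sample check, holding out the last two years (train, test, fit_tr, and fc_tr are hypothetical names):

train  <- window(log_air, end = c(1958, 12))
test   <- window(log_air, start = c(1959, 1))
fit_tr <- auto.arima(train)
fc_tr  <- forecast(fit_tr, h = length(test))
accuracy(fc_tr, test)   # adds a "Test set" row (log scale)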