In this report, I analyze the classic AirPassengers dataset using R. To ensure forecasts remain positive, we apply a log transformation before modeling with ARIMA.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
# Load dataset
data(AirPassengers)
# Log transform to avoid negative forecasts
log_air <- log(AirPassengers)
Explanation: We use log(AirPassengers)
because
the passenger numbers grow exponentially. This stabilizes the variance
and ensures forecasts are always positive once we
exponentiate back.
class(AirPassengers)
## [1] "ts"
start(AirPassengers)
## [1] 1949 1
end(AirPassengers)
## [1] 1960 12
frequency(AirPassengers)
## [1] 12
summary(AirPassengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
plot(AirPassengers, main="Original AirPassengers Time Series", col="blue")
plot(log_air, main="Log-Transformed AirPassengers", col="red")
boxplot(AirPassengers ~ cycle(AirPassengers), main="Monthly Distribution", col="lightblue")
Explanation: The log-transformed series reduces the growth rate and makes seasonal patterns clearer.
# ADF test on log data
adf.test(log_air, alternative="stationary", k=12)
##
## Augmented Dickey-Fuller Test
##
## data: log_air
## Dickey-Fuller = -1.5325, Lag order = 12, p-value = 0.7711
## alternative hypothesis: stationary
# First differencing of log data
diff_log <- diff(log_air, differences=1)
plot(diff_log, main="1st Differenced Log Series", col="darkred")
# Re-test
adf.test(diff_log, alternative="stationary", k=12)
##
## Augmented Dickey-Fuller Test
##
## data: diff_log
## Dickey-Fuller = -3.3656, Lag order = 12, p-value = 0.06313
## alternative hypothesis: stationary
Explanation: After differencing, the log series becomes stationary, suitable for ARIMA modeling.
# Plot ACF and PACF
par(mfrow = c(1,2))
acf(log_air, main="ACF of Log(AirPassengers)")
pacf(log_air, main="PACF of Log(AirPassengers)")
par(mfrow = c(1,1)) # reset
Explanation:
# Plot ACF and PACF
par(mfrow = c(1,2))
acf(diff_log, main="ACF of diff_log(AirPassengers)")
pacf(diff_log, main="PACF of diff_log(AirPassengers)")
par(mfrow = c(1,1)) # reset
Explanation:
ACF : Strong spike at lag 1 → short-term correlation remains.
PACF : Strong lag 1 spike → suggests AR(1) component.
The differenced log series looks stationary, suitable for ARIMA modeling.
ddata <- decompose(log_air, "multiplicative")
plot(ddata)
mymodel <- auto.arima(log_air)
mymodel
## Series: log_air
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 = 0.001371: log likelihood = 244.7
## AIC=-483.4 AICc=-483.21 BIC=-474.77
# Alternative check
auto.arima(log_air, ic="aic", trace=TRUE)
##
## ARIMA(2,1,2)(1,1,1)[12] : Inf
## ARIMA(0,1,0)(0,1,0)[12] : -434.83
## ARIMA(1,1,0)(1,1,0)[12] : -474.8188
## ARIMA(0,1,1)(0,1,1)[12] : -483.3991
## ARIMA(0,1,1)(0,1,0)[12] : -449.9794
## ARIMA(0,1,1)(1,1,1)[12] : -481.9131
## ARIMA(0,1,1)(0,1,2)[12] : -481.9625
## ARIMA(0,1,1)(1,1,0)[12] : -477.4053
## ARIMA(0,1,1)(1,1,2)[12] : Inf
## ARIMA(0,1,0)(0,1,1)[12] : -467.5581
## ARIMA(1,1,1)(0,1,1)[12] : -481.8995
## ARIMA(0,1,2)(0,1,1)[12] : -481.6165
## ARIMA(1,1,0)(0,1,1)[12] : -481.4896
## ARIMA(1,1,2)(0,1,1)[12] : -482.0433
##
## Best model: ARIMA(0,1,1)(0,1,1)[12]
## Series: log_air
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 = 0.001371: log likelihood = 244.7
## AIC=-483.4 AICc=-483.21 BIC=-474.77
plot.ts(mymodel$residuals, main="Residuals of ARIMA Model")
Box.test(residuals(mymodel), lag=5, type="Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(mymodel)
## X-squared = 4.8694, df = 5, p-value = 0.432
Box.test(residuals(mymodel), lag=10, type="Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(mymodel)
## X-squared = 8.8097, df = 10, p-value = 0.5503
Box.test(residuals(mymodel), lag=15, type="Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(mymodel)
## X-squared = 9.9068, df = 15, p-value = 0.8256
Explanation: If p-values > 0.05 → residuals are white noise → model is good.
# Forecast on log scale
myforecast <- forecast(mymodel, level=c(95), h=24)
# Exponentiate back to original scale
myforecast$mean <- exp(myforecast$mean)
myforecast$lower <- exp(myforecast$lower)
myforecast$upper <- exp(myforecast$upper)
# Plot forecast
plot(myforecast, main="2-Year Forecast (Positive Values Only)", col="darkgreen")
Explanation: Forecasts are back-transformed to the original scale, so passenger numbers are always positive.
accuracy(myforecast)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0005730622 0.03504883 0.02626034 0.01098898 0.4752815 0.2169522
## ACF1
## Training set 0.01443892