Data Preparation
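The chunks below never show package setup, so as a minimal sketch (assuming portland.bus is a data.table, which the bracket-style indexing implies), the libraries used throughout would be loaded first:

library(data.table)  # data.table-style indexing of portland.bus
library(forecast)    # ets(), Arima(), auto.arima(), ggAcf(), ggPacf(), accuracy(), checkresiduals()
library(tseries)     # adf.test()
library(lmtest)      # coeftest()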

# order by date, then split into training (through Dec 1966) and test (Jan 1967 onward) sets
portland.bus <- portland.bus[order(date),]

portland.train <- portland.bus[date <= "1966-12-01",]

portland.test <- portland.bus[date > "1966-12-01",]

# convert to ts; the test series starts in Jan 1967, right where the training period ends
portland.train.ts <- ts(portland.train$bus_ridership, frequency = 12, start = c(1960,1))

portland.test.ts <- ts(portland.test$bus_ridership, frequency = 12, start = c(1967,1))

Plots & Diagnostics

There is a clear upward trend leading up to the peak around 1967. There also appears to be a seasonal effect: bus ridership plummets during winter and summer and recovers after those two seasons. My guess is that students have no school during the summer and winter breaks, so they do not purchase bus passes; summer vacation lasts longer (about three months), so its dip is deeper than the winter one (about one month). A seasonal subseries plot, sketched after the time plot below, is one way to check this pattern.

# plot data
plot.ts(portland.train.ts)
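
To look at the suspected school-calendar seasonality more directly, a seasonal subseries plot works well; this is a sketch using base R's monthplot() or the forecast package's ggsubseriesplot(), not part of the original analysis:

# seasonal subseries plot: one segment per calendar month (sketch)
monthplot(portland.train.ts)
# or, with the forecast package's ggplot2-based version:
ggsubseriesplot(portland.train.ts)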

To check for trend and seasonal effects, I decompose the data both additively and multiplicatively and compare the remainder (random) components. The multiplicative remainder averages very close to its neutral value of 1, while the additive remainder's mean sits further from its neutral value of 0, so I go with the multiplicative decomposition:

# decompose - additive
portland.train.decomp <- decompose(portland.train.ts)
plot(portland.train.decomp)

# decompose - multiplicative
portland.bus.decomp.mult <- decompose(portland.train.ts, type="multiplicative")
plot(portland.bus.decomp.mult)

# which one is better?
mean(na.omit(portland.bus.decomp.mult$random))
## [1] 0.9980424
mean(na.omit(portland.train.decomp$random))
## [1] -1.594907
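
An alternative cross-check, not run in the original analysis, is an STL decomposition of the log series, which approximates a multiplicative decomposition and is generally more robust than decompose():

# STL on the log series approximates a multiplicative decomposition (sketch)
plot(stl(log(portland.train.ts), s.window = "periodic"))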

Modelling

ETS

First I try ETS(M,A,N), i.e. Holt's linear method with multiplicative errors, and then let ets() choose its own specification, which returns ETS(M,Ad,M). The damped-trend ETS(M,Ad,M) model has a clearly smaller AICc.

ets(portland.train.ts, model="MAN", damped=NULL, alpha=NULL, beta=NULL,
    gamma=NULL, phi=NULL, lambda=NULL, biasadj=FALSE,
    additive.only=FALSE, restrict=TRUE,
    allow.multiplicative.trend=FALSE)
## ETS(M,A,N) 
## 
## Call:
##  ets(y = portland.train.ts, model = "MAN", damped = NULL, alpha = NULL,  
## 
##  Call:
##      beta = NULL, gamma = NULL, phi = NULL, additive.only = FALSE,  
## 
##  Call:
##      lambda = NULL, biasadj = FALSE, restrict = TRUE, allow.multiplicative.trend = FALSE) 
## 
##   Smoothing parameters:
##     alpha = 0.9164 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 631.2429 
##     b = 11.0333 
## 
##   sigma:  0.0539
## 
##      AIC     AICc      BIC 
## 1046.187 1046.956 1058.341
ets(portland.train.ts, model="ZZZ")
## ETS(M,Ad,M) 
## 
## Call:
##  ets(y = portland.train.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.9998 
##     beta  = 0.0025 
##     gamma = 1e-04 
##     phi   = 0.98 
## 
##   Initial states:
##     l = 605.7555 
##     b = 15.1207 
##     s=0.973 1.0251 1.022 0.9639 0.9016 0.9399
##            0.9707 1.0045 1.0401 1.0224 1.0707 1.0661
## 
##   sigma:  0.0303
## 
##       AIC      AICc       BIC 
##  959.9084  970.4315 1003.6631
fit.ets <- ets(portland.train.ts, model="ZZZ")
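
For an explicit side-by-side comparison of the two AICc values, they can be pulled straight out of the fitted objects; fit.man below is an illustrative name not used elsewhere:

# compare AICc values directly (sketch; fit.man is a hypothetical refit of ETS(M,A,N))
fit.man <- ets(portland.train.ts, model = "MAN")
c(MAN = fit.man$aicc, MAdM = fit.ets$aicc)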

ARIMA

For ARIMA, we start with a stationarity check. An augmented Dickey-Fuller unit-root test fails to reject the unit-root null, i.e. the data look non-stationary, and the ACF and PACF plots also show strong autocorrelation.

# augmented Dickey-Fuller unit-root test (adf.test() from the tseries package), with 12 lags
adf.test(portland.train[, bus_ridership], alternative = "stationary", k = 12)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  portland.train[, bus_ridership]
## Dickey-Fuller = -2.3403, Lag order = 12, p-value = 0.4363
## alternative hypothesis: stationary
# First look at ACF/PACF
ggAcf(portland.train.ts)

ggPacf(portland.train.ts)
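
The amount of differencing implied by unit-root tests can also be read off directly with the forecast package's helpers; a sketch, and the result may or may not agree with the manual choice below:

# suggested differencing orders from unit-root tests (sketch)
nsdiffs(portland.train.ts)  # seasonal differences
ndiffs(portland.train.ts)   # non-seasonal differences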

DIY ARIMA

So I take a seasonal difference (lag 12) to move toward stationarity and look at the ACF to decide whether MA terms are needed. The significant spike at lag 1 in the ACF suggests a non-seasonal MA(1) component, and the significant spike at lag 12 suggests a seasonal MA(1) component. The seasonally differenced data may also need one further non-seasonal difference.

Eventually I go with an ARIMA(0,1,1)(0,1,1)[12] model, i.e. one non-seasonal and one seasonal difference plus non-seasonal and seasonal MA(1) components. The residuals look like white noise and are approximately normally distributed.

# seasonal difference (lag 12)
diff12 <- diff(portland.train.ts, 12)
plot.ts(diff12)

ggAcf(diff12)

ggPacf(diff12)

# first (non-seasonal) difference of the seasonally differenced series
diff121 <- diff(diff12)
plot.ts(diff121)

ggAcf(diff121)

ggPacf(diff121)

fit.arima <- Arima(portland.train.ts, order=c(0,1,1), seasonal=c(0,1,1))
summary(fit.arima)  
## Series: portland.train.ts 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.0942  -0.6250
## s.e.   0.1369   0.2697
## 
## sigma^2 estimated as 1260:  log likelihood=-356.13
## AIC=718.27   AICc=718.63   BIC=725.06
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.3010665 32.17515 22.67038 -0.1228739 2.078184 0.1923029
##                     ACF1
## Training set 0.004873151
tsdisplay(residuals(fit.arima))

# check residual autocorrelation with a Ljung-Box test
Box.test(fit.arima$residuals, type="Ljung-Box") 
## 
##  Box-Ljung test
## 
## data:  fit.arima$residuals
## X-squared = 0.0020669, df = 1, p-value = 0.9637
ggAcf(fit.arima$residuals) 

ggPacf(fit.arima$residuals)

checkresiduals(fit.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,1)[12]
## Q* = 20.345, df = 22, p-value = 0.5615
## 
## Model df: 2.   Total lags used: 24
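
Two further quick checks of the claims above, sketched here with output omitted: coefficient significance via coeftest() from lmtest (used again for the auto.arima fit below), and residual normality via a Shapiro-Wilk test:

# coefficient z-tests and residual normality check (sketch; output not shown)
coeftest(fit.arima)
shapiro.test(residuals(fit.arima))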

Using the auto.arima() function from the forecast package

However, the AICc of the hand-specified ARIMA is lower, so we go with it rather than the auto.arima model. (Strictly speaking, information criteria are only comparable between models with the same amount of differencing, so the out-of-sample comparison in the next section is the more reliable guide.) auto.arima's default stepwise search also handles this seasonal, non-stationary series poorly: the fact that it suggests a model with d = 0 casts doubt on its choice.

auto.bus <- auto.arima(portland.train.ts, seasonal = TRUE)
summary(auto.bus)
## Series: portland.train.ts 
## ARIMA(1,0,0)(1,1,0)[12] with drift 
## 
## Coefficients:
##          ar1     sar1    drift
##       0.8624  -0.3047  10.2218
## s.e.  0.0597   0.1371   1.8956
## 
## sigma^2 estimated as 1325:  log likelihood=-360.65
## AIC=729.31   AICc=729.91   BIC=738.41
## 
## Training set error measures:
##                      ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -0.3158989 32.99042 22.15516 -0.08935202 2.045891 0.1879326
##                    ACF1
## Training set 0.03680954
coeftest(auto.bus)
## 
## z test of coefficients:
## 
##        Estimate Std. Error z value  Pr(>|z|)    
## ar1    0.862383   0.059748 14.4337 < 2.2e-16 ***
## sar1  -0.304709   0.137134 -2.2220   0.02628 *  
## drift 10.221800   1.895613  5.3923 6.954e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
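
One common way to give auto.arima a fairer chance is to switch off its stepwise and approximation shortcuts so it searches the model space more exhaustively; a sketch that was not run in the original analysis, with auto.bus.full as an illustrative name:

# slower but more thorough auto.arima search (sketch)
auto.bus.full <- auto.arima(portland.train.ts, seasonal = TRUE,
                            stepwise = FALSE, approximation = FALSE)
summary(auto.bus.full)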

Model Selection

The ETS model not only fits the training set better, it also forecasts the held-out test set better:

# fit.arima
a1 <- fit.arima %>% forecast(h = 30) %>%
  accuracy(portland.test.ts)
a1[,c("RMSE","MAE","MAPE","MASE")]
##                   RMSE       MAE      MAPE      MASE
## Training set  32.17515  22.67038  2.078184 0.1923029
## Test set     206.37015 194.30883 14.216977 1.6482370
# fit.ets
a2 <- fit.ets %>% forecast(h = 30) %>%
  accuracy(portland.test.ts)
a2[,c("RMSE","MAE","MAPE","MASE")]
##                   RMSE       MAE     MAPE      MASE
## Training set  29.71399  21.97627 2.097950 0.1864151
## Test set     133.31501 131.36991 9.530178 1.1143536
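
To see the same comparison visually, both sets of forecasts can be overlaid on the held-out data; a sketch using the forecast package's autoplot()/autolayer() methods (ggplot2 attached explicitly to be safe):

# overlay both forecasts on the training series and the test set (sketch)
library(ggplot2)
autoplot(portland.train.ts) +
  autolayer(forecast(fit.arima, h = 30), series = "ARIMA", PI = FALSE) +
  autolayer(forecast(fit.ets, h = 30), series = "ETS", PI = FALSE) +
  autolayer(portland.test.ts, series = "Test set")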

Forecast