8 a) Plot the series. what can you learn from examining this plot?

get the data in: (make a new spreadsheet without formatting. change the date to a short date.)

data <- read.csv("C:/Users/Owner/Documents/ch7data.csv")

it looks kinda ugly:

head(data)
##    ï..Date Total.TEU.s
## 1 1/1/1995      123723
## 2 2/1/1995       99368
## 3 3/1/1995      118549
## 4 4/1/1995      123411
## 5 5/1/1995      114514
## 6 6/1/1995      114468

The date is a factor. I’m going to have to tell R that a date is a date. Also, it looks like there was some metadata that made R think some numbers were floating point.

data =edit(data)
data$Date <- as.Date(as.character.Date(data$Date), "%m/%d/%Y")

…This took so long to make lol.

head(data)
##         Date TotalTEU
## 1 1995-01-01   123723
## 2 1995-02-01    99368
## 3 1995-03-01   118549
## 4 1995-04-01   123411
## 5 1995-05-01   114514
## 6 1995-06-01   114468

now plot:

plot(data$TotalTEU ~ data$Date, type = "l", ylab = "Total TEUs", xlab = "Date", main = "Total TEUs by Date")

From examining this plot, it appears the data is non-stationary, cyclical, seasonal, and with a trend.

throwing this in here because further calculations require a time series object.

TEU <- data$TotalTEU
datats <- ts(data = TEU, start = c(1995,1), end = c(2007, 12), frequency = 12)
datats
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1995 123723.0  99368.0 118549.0 123411.0 114514.0 114468.0 125412.0
## 1996 111494.0  99785.0  96906.0 111204.0 115513.0 119422.0 129984.0
## 1997 127065.0 112733.0 113063.0 129797.0 136712.0 140220.0 143756.0
## 1998 125930.0 122976.0 154947.0 154522.0 167204.0 159638.0 158948.0
## 1999 142116.0 142080.0 141926.0 153559.0 182975.0 169682.0 185017.0
## 2000 194180.0 175890.0 188438.0 220157.0 217749.0 220071.0 243695.0
## 2001 212323.0 163332.0 217284.0 221465.0 213860.0 243053.0 250344.0
## 2002 220810.0 244167.0 229954.0 276373.0 284385.0 301447.0 271933.0
## 2003 276482.0 274740.0 298495.0 326709.0 348276.0 305892.0 331741.0
## 2004 345412.0 247710.0 340748.0 345339.0 367128.0 347056.0 365901.0
## 2005 305102.2 294022.5 262173.0 336086.9 319472.0 340582.0 356716.0
## 2006 327009.1 251811.9 345401.0 370170.6 368863.5 387956.8 413357.2
## 2007 367095.7 358600.8 323472.1 375511.5 368874.3 393187.1 387573.2
##           Aug      Sep      Oct      Nov      Dec
## 1995 122866.0 115473.0 121523.0 104880.0 103821.0
## 1996 134296.0 134657.0 144430.0 128521.0 122428.0
## 1997 143389.0 143700.0 144425.0 131877.0 134315.0
## 1998 171152.0 157267.0 169364.0 158255.0 140165.0
## 1999 188281.0 187081.0 208163.0 184662.0 178493.0
## 2000 250551.0 227848.0 260469.0 210209.0 203021.0
## 2001 261705.0 275559.0 274954.0 241730.0 225886.0
## 2002 339690.0 330967.0 265218.0 301333.0 306099.0
## 2003 360046.0 350476.0 372112.0 338379.0 306984.0
## 2004 344109.0 324346.0 352718.0 340051.0 283268.0
## 2005 349654.5 356912.4 375050.8 332037.0 328244.1
## 2006 414004.3 431282.9 421693.7 390208.5 365590.8
## 2007 379027.1 407914.8 393948.0 383240.8 346140.5
  1. Calculate and display the first 24 autocorrelations for the series. What do the ACF and PACF suggest about the series?

calculate and display the first 24 autocorrelations:

acf(data$TotalTEU, lag.max = 24, plot = 0)
## 
## Autocorrelations of series 'data$TotalTEU', by lag
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 1.000 0.956 0.932 0.909 0.870 0.848 0.831 0.814 0.810 0.808 0.808 0.808 
##    12    13    14    15    16    17    18    19    20    21    22    23 
## 0.796 0.775 0.742 0.706 0.667 0.641 0.616 0.597 0.590 0.584 0.583 0.589 
##    24 
## 0.580
acf(data$TotalTEU, lag.max = 24, plot = 1)

pacf(data$TotalTEU)

be aware the acf function begins at lag 0… disregard this spike.

There appears to be a trend in the data. I should take a difference.

  1. Suggest and estimate an optimal set of differencing to use with the series.

first difference:

acf(diff(data$TotalTEU), lag.max = 24)

MA(1) is suggested.

pacf(diff(data$TotalTEU))

AR(1) is suggested.

It looks like there is one dominant spike in the differenced ACF, and 1 really good spike in the differenced partial ACF. This suggests an ARIMA(1,1,1).

  1. Estimate the ARIMA model that you believe to be a good candidate for forecasting container shipments. It may help to specify the seasonality as 12. Test the Ljung-Box statistic and report your findings. Finally, Plot the first 24 autocorrelations of the residuals to your best model.

ARIMA(1,1,1)

TEU <- data$TotalTEU
datats <- ts(data = TEU, start = c(1995,1), end = c(2007, 12), frequency = 12)
datats
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1995 123723.0  99368.0 118549.0 123411.0 114514.0 114468.0 125412.0
## 1996 111494.0  99785.0  96906.0 111204.0 115513.0 119422.0 129984.0
## 1997 127065.0 112733.0 113063.0 129797.0 136712.0 140220.0 143756.0
## 1998 125930.0 122976.0 154947.0 154522.0 167204.0 159638.0 158948.0
## 1999 142116.0 142080.0 141926.0 153559.0 182975.0 169682.0 185017.0
## 2000 194180.0 175890.0 188438.0 220157.0 217749.0 220071.0 243695.0
## 2001 212323.0 163332.0 217284.0 221465.0 213860.0 243053.0 250344.0
## 2002 220810.0 244167.0 229954.0 276373.0 284385.0 301447.0 271933.0
## 2003 276482.0 274740.0 298495.0 326709.0 348276.0 305892.0 331741.0
## 2004 345412.0 247710.0 340748.0 345339.0 367128.0 347056.0 365901.0
## 2005 305102.2 294022.5 262173.0 336086.9 319472.0 340582.0 356716.0
## 2006 327009.1 251811.9 345401.0 370170.6 368863.5 387956.8 413357.2
## 2007 367095.7 358600.8 323472.1 375511.5 368874.3 393187.1 387573.2
##           Aug      Sep      Oct      Nov      Dec
## 1995 122866.0 115473.0 121523.0 104880.0 103821.0
## 1996 134296.0 134657.0 144430.0 128521.0 122428.0
## 1997 143389.0 143700.0 144425.0 131877.0 134315.0
## 1998 171152.0 157267.0 169364.0 158255.0 140165.0
## 1999 188281.0 187081.0 208163.0 184662.0 178493.0
## 2000 250551.0 227848.0 260469.0 210209.0 203021.0
## 2001 261705.0 275559.0 274954.0 241730.0 225886.0
## 2002 339690.0 330967.0 265218.0 301333.0 306099.0
## 2003 360046.0 350476.0 372112.0 338379.0 306984.0
## 2004 344109.0 324346.0 352718.0 340051.0 283268.0
## 2005 349654.5 356912.4 375050.8 332037.0 328244.1
## 2006 414004.3 431282.9 421693.7 390208.5 365590.8
## 2007 379027.1 407914.8 393948.0 383240.8 346140.5
plot(datats)
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.1
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

a function of the forecast package that tells me the number of differences to make a time series stationary:

ndiffs(datats)
## [1] 1

The suggested ARIMA model:

arima(x = datats,order = c(1,1,1))
## 
## Call:
## arima(x = datats, order = c(1, 1, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.1633  -0.2210
## s.e.   0.1894   0.1831
## 
## sigma^2 estimated as 601868320:  log likelihood = -1786.71,  aic = 3579.43
modelOne <- arima(x = datats,order = c(1,1,1))

a more accurate model may be given by the auto.arima() function of the forecast package.

auto.arima(datats)
## Series: datats 
## ARIMA(1,0,2)(0,1,1)[12] with drift 
## 
## Coefficients:
##          ar1      ma1     ma2     sma1      drift
##       0.9094  -0.7698  0.3171  -0.7603  1860.8150
## s.e.  0.0535   0.0909  0.0922   0.0739   260.7463
## 
## sigma^2 estimated as 3.82e+08:  log likelihood=-1630.08
## AIC=3272.16   AICc=3272.77   BIC=3289.98
modelTwo <- auto.arima(datats)

the ljung-box test:

Box.test(datats, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  datats
## X-squared = 145.26, df = 1, p-value < 2.2e-16

now I dont know if i should use one or two as the degrees of freedom in the Ljung-Box test…

checkresiduals(modelOne, plot = 1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 131.73, df = 22, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 24
checkresiduals(modelTwo)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,2)(0,1,1)[12] with drift
## Q* = 25.345, df = 19, p-value = 0.1495
## 
## Model df: 5.   Total lags used: 24

it appears the auto.arima() function creates a more accurate model, although it could be overspecified.