8 a) Plot the series. what can you learn from examining this plot?
get the data in: (make a new spreadsheet without formatting. change the date to a short date.)
data <- read.csv("C:/Users/Owner/Documents/ch7data.csv")
it looks kinda ugly:
head(data)
## ï..Date Total.TEU.s
## 1 1/1/1995 123723
## 2 2/1/1995 99368
## 3 3/1/1995 118549
## 4 4/1/1995 123411
## 5 5/1/1995 114514
## 6 6/1/1995 114468
The date is a factor. I’m going to have to tell R that a date is a date. Also, it looks like there was some metadata that made R think some numbers were floating point.
data =edit(data)
data$Date <- as.Date(as.character.Date(data$Date), "%m/%d/%Y")
…This took so long to make lol.
head(data)
## Date TotalTEU
## 1 1995-01-01 123723
## 2 1995-02-01 99368
## 3 1995-03-01 118549
## 4 1995-04-01 123411
## 5 1995-05-01 114514
## 6 1995-06-01 114468
now plot:
plot(data$TotalTEU ~ data$Date, type = "l", ylab = "Total TEUs", xlab = "Date", main = "Total TEUs by Date")
From examining this plot, it appears the data is non-stationary, cyclical, seasonal, and with a trend.
throwing this in here because further calculations require a time series object.
TEU <- data$TotalTEU
datats <- ts(data = TEU, start = c(1995,1), end = c(2007, 12), frequency = 12)
datats
## Jan Feb Mar Apr May Jun Jul
## 1995 123723.0 99368.0 118549.0 123411.0 114514.0 114468.0 125412.0
## 1996 111494.0 99785.0 96906.0 111204.0 115513.0 119422.0 129984.0
## 1997 127065.0 112733.0 113063.0 129797.0 136712.0 140220.0 143756.0
## 1998 125930.0 122976.0 154947.0 154522.0 167204.0 159638.0 158948.0
## 1999 142116.0 142080.0 141926.0 153559.0 182975.0 169682.0 185017.0
## 2000 194180.0 175890.0 188438.0 220157.0 217749.0 220071.0 243695.0
## 2001 212323.0 163332.0 217284.0 221465.0 213860.0 243053.0 250344.0
## 2002 220810.0 244167.0 229954.0 276373.0 284385.0 301447.0 271933.0
## 2003 276482.0 274740.0 298495.0 326709.0 348276.0 305892.0 331741.0
## 2004 345412.0 247710.0 340748.0 345339.0 367128.0 347056.0 365901.0
## 2005 305102.2 294022.5 262173.0 336086.9 319472.0 340582.0 356716.0
## 2006 327009.1 251811.9 345401.0 370170.6 368863.5 387956.8 413357.2
## 2007 367095.7 358600.8 323472.1 375511.5 368874.3 393187.1 387573.2
## Aug Sep Oct Nov Dec
## 1995 122866.0 115473.0 121523.0 104880.0 103821.0
## 1996 134296.0 134657.0 144430.0 128521.0 122428.0
## 1997 143389.0 143700.0 144425.0 131877.0 134315.0
## 1998 171152.0 157267.0 169364.0 158255.0 140165.0
## 1999 188281.0 187081.0 208163.0 184662.0 178493.0
## 2000 250551.0 227848.0 260469.0 210209.0 203021.0
## 2001 261705.0 275559.0 274954.0 241730.0 225886.0
## 2002 339690.0 330967.0 265218.0 301333.0 306099.0
## 2003 360046.0 350476.0 372112.0 338379.0 306984.0
## 2004 344109.0 324346.0 352718.0 340051.0 283268.0
## 2005 349654.5 356912.4 375050.8 332037.0 328244.1
## 2006 414004.3 431282.9 421693.7 390208.5 365590.8
## 2007 379027.1 407914.8 393948.0 383240.8 346140.5
calculate and display the first 24 autocorrelations:
acf(data$TotalTEU, lag.max = 24, plot = 0)
##
## Autocorrelations of series 'data$TotalTEU', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1.000 0.956 0.932 0.909 0.870 0.848 0.831 0.814 0.810 0.808 0.808 0.808
## 12 13 14 15 16 17 18 19 20 21 22 23
## 0.796 0.775 0.742 0.706 0.667 0.641 0.616 0.597 0.590 0.584 0.583 0.589
## 24
## 0.580
acf(data$TotalTEU, lag.max = 24, plot = 1)
pacf(data$TotalTEU)
be aware the acf function begins at lag 0… disregard this spike.
There appears to be a trend in the data. I should take a difference.
first difference:
acf(diff(data$TotalTEU), lag.max = 24)
MA(1) is suggested.
pacf(diff(data$TotalTEU))
AR(1) is suggested.
It looks like there is one dominant spike in the differenced ACF, and 1 really good spike in the differenced partial ACF. This suggests an ARIMA(1,1,1).
ARIMA(1,1,1)
TEU <- data$TotalTEU
datats <- ts(data = TEU, start = c(1995,1), end = c(2007, 12), frequency = 12)
datats
## Jan Feb Mar Apr May Jun Jul
## 1995 123723.0 99368.0 118549.0 123411.0 114514.0 114468.0 125412.0
## 1996 111494.0 99785.0 96906.0 111204.0 115513.0 119422.0 129984.0
## 1997 127065.0 112733.0 113063.0 129797.0 136712.0 140220.0 143756.0
## 1998 125930.0 122976.0 154947.0 154522.0 167204.0 159638.0 158948.0
## 1999 142116.0 142080.0 141926.0 153559.0 182975.0 169682.0 185017.0
## 2000 194180.0 175890.0 188438.0 220157.0 217749.0 220071.0 243695.0
## 2001 212323.0 163332.0 217284.0 221465.0 213860.0 243053.0 250344.0
## 2002 220810.0 244167.0 229954.0 276373.0 284385.0 301447.0 271933.0
## 2003 276482.0 274740.0 298495.0 326709.0 348276.0 305892.0 331741.0
## 2004 345412.0 247710.0 340748.0 345339.0 367128.0 347056.0 365901.0
## 2005 305102.2 294022.5 262173.0 336086.9 319472.0 340582.0 356716.0
## 2006 327009.1 251811.9 345401.0 370170.6 368863.5 387956.8 413357.2
## 2007 367095.7 358600.8 323472.1 375511.5 368874.3 393187.1 387573.2
## Aug Sep Oct Nov Dec
## 1995 122866.0 115473.0 121523.0 104880.0 103821.0
## 1996 134296.0 134657.0 144430.0 128521.0 122428.0
## 1997 143389.0 143700.0 144425.0 131877.0 134315.0
## 1998 171152.0 157267.0 169364.0 158255.0 140165.0
## 1999 188281.0 187081.0 208163.0 184662.0 178493.0
## 2000 250551.0 227848.0 260469.0 210209.0 203021.0
## 2001 261705.0 275559.0 274954.0 241730.0 225886.0
## 2002 339690.0 330967.0 265218.0 301333.0 306099.0
## 2003 360046.0 350476.0 372112.0 338379.0 306984.0
## 2004 344109.0 324346.0 352718.0 340051.0 283268.0
## 2005 349654.5 356912.4 375050.8 332037.0 328244.1
## 2006 414004.3 431282.9 421693.7 390208.5 365590.8
## 2007 379027.1 407914.8 393948.0 383240.8 346140.5
plot(datats)
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.1
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
a function of the forecast package that tells me the number of differences to make a time series stationary:
ndiffs(datats)
## [1] 1
The suggested ARIMA model:
arima(x = datats,order = c(1,1,1))
##
## Call:
## arima(x = datats, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## -0.1633 -0.2210
## s.e. 0.1894 0.1831
##
## sigma^2 estimated as 601868320: log likelihood = -1786.71, aic = 3579.43
modelOne <- arima(x = datats,order = c(1,1,1))
a more accurate model may be given by the auto.arima() function of the forecast package.
auto.arima(datats)
## Series: datats
## ARIMA(1,0,2)(0,1,1)[12] with drift
##
## Coefficients:
## ar1 ma1 ma2 sma1 drift
## 0.9094 -0.7698 0.3171 -0.7603 1860.8150
## s.e. 0.0535 0.0909 0.0922 0.0739 260.7463
##
## sigma^2 estimated as 3.82e+08: log likelihood=-1630.08
## AIC=3272.16 AICc=3272.77 BIC=3289.98
modelTwo <- auto.arima(datats)
the ljung-box test:
Box.test(datats, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: datats
## X-squared = 145.26, df = 1, p-value < 2.2e-16
now I dont know if i should use one or two as the degrees of freedom in the Ljung-Box test…
checkresiduals(modelOne, plot = 1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 131.73, df = 22, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 24
checkresiduals(modelTwo)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,2)(0,1,1)[12] with drift
## Q* = 25.345, df = 19, p-value = 0.1495
##
## Model df: 5. Total lags used: 24
it appears the auto.arima() function creates a more accurate model, although it could be overspecified.