Background

This project is based on the dataset presented by U.S. Energy Information Administration about petroleum sales volumes. The data covers a time span from 1983 to 2017 and can be downloaded respectively by states.
I’m particularly interested in the sales volumes in Massachusetts, New York and California. So I collected data on these three states.

Read in Data

total <- read.csv("total.csv")
ny <- read.csv("NY.csv")
ma <- read.csv("MA.csv")
ca <- read.csv("CA.csv")
data <- data.frame(time = rev(ma$Month), 
                   total = rev(total$Total[1:210]), 
                   ny = rev(ny$NY[1:210]), 
                   ma = rev(ma$Ma[1:210]), 
                   ca = rev(ca$Ca[1:210]))
data$total <- as.numeric(data$total)
data$ny <- as.numeric(data$ny)
data$ma <- as.numeric(data$ma)
data$ca <- as.numeric(data$ca)

A Look at the Data

  • Green: California; Blue: New York; Brown: Massachusetts
plot(data$ca, type = "l", col = "green", ylim=c(1, 9000), ylab = "", 
     xlab = "Starting from Oct, 1993")
par(new = TRUE)
plot(data$ny, type = "l", col = "blue", ylim=c(1, 9000), ylab = "", xlab = "")
par(new = TRUE)
plot(data$ma, type = "l", col = "brown", ylim=c(1, 9000), ylab = "", xlab = "")

Because there’s missing data for recent years, I decide to use the first part of the NY dataset as the data for this project.

ARIMA model

Overview

data <- data[1:169,]
ny <- data$ny
ny <- ts(data$ny, start=c(1993, 10), frequency=12) 
plot(ny, type = "l", col = "blue", ylim=c(1, 4000), ylab = "")

We can see that there’s definitely a seasonal component. The trend is going up first then going down.

Decomposition

fit <- stl(ny, s.window = "per")
plot(fit)

Differencing

diff1 <- diff(ny, 12)
plot(diff1)

diff2 <- diff(diff1, 1)
plot(diff2)

The data after differencing looks like white noise.

ACF and PACF

par(mfrow = c(1,2))
acf(diff2, lag.max = 36)
pacf(diff2, lag.max = 36)

For non-seasonality, based on the single spike in ACF and PACF exponentially decaying to zero, it seems like a MA(1) would work.

For seasonality, look at the ACF of lag multipled by 12. There’s a peak at lag 1, 12 and 24 so maybe a seasonal MA(3) works. PACF exponentially decays to zero,

Fitting the ARIMA model

fit <- arima(ny, order = c(0,1,1), seasonal = list(order = c(0, 1, 3), period = 12))
plot(residuals(fit))

The residual looks like white noise.

Forecasting: Holt Winters vs ARIMA

Holt Winters

fit1 <- HoltWinters(ny,seasonal="multiplicative")
forecast1 <- forecast(fit1, h = 8,level = 0.95)
plot(forecast(fit1))

ARIMA

fit2 <- Arima(ny, order = c(0,1,1), seasonal = list(order = c(0, 1, 3), period = 12)) 
forecast2 <- forecast(fit2, h=8,level = 0.95)
plot(forecast2)