The dataset used for this time series analysis is the U.S city avaerage gasoline price dataset sourced from the Federal Reserve Economic Data (FRED) ,St. Louis, and the United States Bureau of Labor Statistics. It contains monthly average regular unleaded gasoline prices from January 1976 to December 2021.The two fields in the data are date of observation and the average price.
When I first came across this dataset on the FRED website, I was impressed with the amount of historical data recorded. I moved here from India, where I’ve driven for 10 years, and I plan to start driving in the US as soon as possible, so analyzing the trend of gasoline prices seemed like an interesting choice.
Fuel price fluctuations are common across the world.The concern is when the rates increase considerably and consistently over a short time period. The most common reasons listed for price fluctuations are market competitors, tax and transportation fees, inflation, supply and demand, etc.
The goal of this analysis, as of now, is to understand how gasoline prices have fluctuated over the years and to determine whether: (i) prices follow patterns based on time of the year and, (ii) the fluctuations can be attributed to any obvious external socio-economic factors. (iii) build a forecast model and assess performance
The dataset is continuous and does not have any gaps in monthly data points. it may not be easy to forecast the prices for very far into the future as there could be potential unforeseen changes in economic or social factors that affect the prices. However, predicting fluctuations for the near future should be closer to actuals.
For the purpose of this study, we will be considering data from January 2000 to December 2021.
#Loading the dataset
fueldata_raw <- read_csv("/Users/samishav/Documents/Spring 2022/Forecasting & Time Series/Forecasting/APU000074714.csv")
summary(fueldata_raw)
## DATE APU000074714
## Min. :1976-01-01 Min. :0.592
## 1st Qu.:1987-06-23 1st Qu.:1.119
## Median :1998-12-16 Median :1.323
## Mean :1998-12-16 Mean :1.767
## 3rd Qu.:2010-06-08 3rd Qu.:2.470
## Max. :2021-12-01 Max. :4.090
#Filter study period (2000-2021) , rename columns and remove any duplicate data entries per month.
fueldata <- fueldata_raw %>% filter(DATE >= ymd("2000=01-01")) %>% rename(obs_date = "DATE",avg_price ='APU000074714') %>% mutate(month_year = format(obs_date, "%Y-%m")) %>% distinct(month_year,.keep_all=TRUE)
as_tibble(fueldata)
## # A tibble: 264 × 3
## obs_date avg_price month_year
## <date> <dbl> <chr>
## 1 2000-01-01 1.30 2000-01
## 2 2000-02-01 1.37 2000-02
## 3 2000-03-01 1.54 2000-03
## 4 2000-04-01 1.51 2000-04
## 5 2000-05-01 1.50 2000-05
## 6 2000-06-01 1.62 2000-06
## 7 2000-07-01 1.59 2000-07
## 8 2000-08-01 1.51 2000-08
## 9 2000-09-01 1.58 2000-09
## 10 2000-10-01 1.56 2000-10
## # … with 254 more rows
# Trend chart
monthly_price_plot = ggplot(fueldata)+
geom_line(aes(obs_date,avg_price))+
theme_bw()+
xlab("Date")+
ylab("Average Price")+
labs(
title = 'Average Price of Unleaded Regular Gasoline, FRED',
subtitle = 'January 2000 - December 2021'
)
#monthly_price_plot
monthly_price_plot +
geom_smooth(aes(obs_date,avg_price),method='lm',color='orange')
The trend chart of average price over time reveals a gradual positive change in fuel price over the years.The peak in average fuel price of $4.09, during the period of study, is observed in 2008 which maybe attributed to the impact of the global recession. There was a sharp fall in prices immediately after, in 2009, and then again in 2015 and 2016. The next notable decrease in prices was in early 2020 followed again by a steep rise.
The time series of the average gasoline price seems to be mean non-stationary for the period between 2000 and 2020 as there is an evident upward movement in the price over the years. On the other hand, the variance looks to be slightly fluctuating unevenly at some points in the period, which makes it difficult to ascertain stationarity. If the plot is variance non-stationary, it can be fixed using the box cox transformation.
fueldata_boxcox = fueldata %>%
mutate(avg_price_boxcox = forecast::BoxCox(avg_price,lambda='auto'))
fueldata_boxcox %>%
ggplot()+
geom_line(aes(obs_date,avg_price_boxcox))+
theme_bw()+
ggtitle("Average Fuel Price over Time (Box-Cox Transformation)")+
ylab("Average Price (Box-Cox Tranformation)")+
xlab("Date")
After the box cox transformation to resolve any variance non-stationary behavior, the graph reveals no change visually. Hence, it can be concluded that the time series was originally variance stationary and does not require any transformation for variance stabilization.
We take the first lag difference of average price to attempt to stabilize the mean of the series.
fueldata_diff = fueldata%>%
mutate(avg_price_boxcox = forecast::BoxCox(avg_price,lambda='auto')) %>%
mutate(price_diff_box = avg_price_boxcox - lag(avg_price_boxcox)) %>%
mutate(price_diff = avg_price - lag(avg_price))
fueldata_diff %>%
ggplot()+
geom_line(aes(obs_date,price_diff))+
theme_bw()+
ggtitle("Fuel Price (First Difference)")+
ylab("Avg Price (Difference))")+
xlab("Date")
The transformation of average price using the first difference seems to have brought the mean of the time series to 0. The next step is to test original as well as transformed variable to ascertain stationarity/non-stationarity. For this we will be using the Augmented Dicky-Fuller test and the KPSS test.
fueldata_diff = fueldata_diff %>%
drop_na()
adf.test(fueldata_diff$avg_price)
##
## Augmented Dickey-Fuller Test
##
## data: fueldata_diff$avg_price
## Dickey-Fuller = -2.1983, Lag order = 6, p-value = 0.4924
## alternative hypothesis: stationary
kpss.test(fueldata_diff$avg_price)
##
## KPSS Test for Level Stationarity
##
## data: fueldata_diff$avg_price
## KPSS Level = 1.592, Truncation lag parameter = 5, p-value = 0.01
For the original price variable, the ADF test has showed a p value greater than 0.05 and the KPSS test value has a p value less than 0.05. This indicates that the average price is not mean stationary.
adf.test(fueldata_diff$price_diff)
##
## Augmented Dickey-Fuller Test
##
## data: fueldata_diff$price_diff
## Dickey-Fuller = -7.8846, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(fueldata_diff$price_diff)
##
## KPSS Test for Level Stationarity
##
## data: fueldata_diff$price_diff
## KPSS Level = 0.04479, Truncation lag parameter = 5, p-value = 0.1
On testing for the differenced average price, the ADF test showed a p value less than 0.05 and KPSS test showed a p value above 0.05, indicating that the null hypothesis can be rejected and that the differenced variable is mean stationary.
adf.test(fueldata_diff$price_diff_box)
##
## Augmented Dickey-Fuller Test
##
## data: fueldata_diff$price_diff_box
## Dickey-Fuller = -7.7629, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(fueldata_diff$price_diff_box)
##
## KPSS Test for Level Stationarity
##
## data: fueldata_diff$price_diff_box
## KPSS Level = 0.051774, Truncation lag parameter = 5, p-value = 0.1
On testing for the difference of box cox transformed variable, the same results were observed as for the raw differenced average price.
Since the original variable with difference is stationary as per the two tests, it can be concluded that, average price series is variance stationary and the first difference of average price is both mean and variance stationary.
The average price plotted over time earlier revealed an increasing trend over time. We take a closer look at the differenced average price from 2000 - 2010 and 2010-2020 to determine existence of seasonality.
fueldata_subset = fueldata_diff %>%
filter(obs_date < ymd("2010=01-01") & obs_date >= ymd("2000=01-01"))
fueldata_subset_2 = fueldata_diff %>%
filter(obs_date < ymd("2021-01-01") & obs_date >= ymd("2010-01-01"))
# Period : 2000-2010
fueldata_subset %>%
ggplot()+
geom_line(aes(obs_date,avg_price))+
theme_bw()+
ggtitle("Fuel Price (First Difference), 2000-2010")+
ylab("Avg Price (Difference))") +
xlab("Date")+
scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +theme(axis.text.x=element_text(angle=60, hjust=1))
# Period 2010-2020
fueldata_subset_2 %>%
ggplot()+
geom_line(aes(obs_date,avg_price))+
theme_bw()+
ggtitle("Fuel Price (First Difference), 2010-2020")+
ylab("Avg Price (Difference))") +
xlab("Date")+
scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") + theme(axis.text.x=element_text(angle=60, hjust=1))
Yearly seasonality patterns in some form can be observed in the average fuel price. The price tends to be higher in the second quarter of the year and decreases considerably in the fourth quarter. Historically gasoline prices do tend to be higher during the spring-summer periods when demand is highest. In winter months, the demand for gasoline goes down due to bad weather and poor driving conditions. This could be the driving for the seasonal patterns observed.
However, there are also variations observed in the pattern which could be attributed to other economic factors such as recession, global events and other macro-economic factors. Examples of this is the drastic drops in fuel prices around the second quarter of 2020, when the first Covid-19 wave was in its climb and in 2008 during the global recession. Another example is the dip in the second quarter of 2016 which is attributed to the global drop in crude oil prices by the end of 2015.
par(mfrow = c(1,2))
acf(fueldata_diff$price_diff_box,lag.max=30)
pacf(fueldata_diff$price_diff_box,lag.max=20)
The ACF plot of the first difference of average price shows resemblance of dampening over time in the first few periods, but it does not seem as evident as a pure AR model. It is possible that it is a mixed AR/MA model. If considered as an auto-regressive model, the PACF plot suggests two significant lags. Hence, the series can be considered to follow a second order auto-regressive model.
The analysis so far points towards the series following a ARIMA model of order (2,1,0) -a second order auto-regression element,with first order integration and no discernible behavior of a moving average model.
To determine whether it is a mixed model and the order of the auto-regression and moving average elements that best fits the time series, we try a few different variants as below. Since we were able to observe stationarity with first order differencing, we put the integration parameter as 1, and try different orders of AR and MA models.
arima(fueldata_diff$avg_price,order=c(0,1,2))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(0, 1, 2))
##
## Coefficients:
## ma1 ma2
## 0.5843 0.1532
## s.e. 0.0627 0.0651
##
## sigma^2 estimated as 0.02013: log likelihood = 139.69, aic = -273.38
BIC(arima(fueldata_diff$avg_price,order=c(0,1,2)))
## [1] -262.679
arima(fueldata_diff$avg_price,order=c(1,1,2))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(1, 1, 2))
##
## Coefficients:
## ar1 ma1 ma2
## -0.1839 0.7618 0.2431
## s.e. 0.2950 0.2841 0.1433
##
## sigma^2 estimated as 0.0201: log likelihood = 139.89, aic = -271.78
BIC(arima(fueldata_diff$avg_price,order=c(1,1,2)))
## [1] -257.506
arima(fueldata_diff$avg_price,order=c(1,1,1))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## 0.2147 0.3516
## s.e. 0.1026 0.0937
##
## sigma^2 estimated as 0.02024: log likelihood = 138.97, aic = -271.93
BIC(arima(fueldata_diff$avg_price,order=c(1,1,1)))
## [1] -261.2297
arima(fueldata_diff$avg_price,order=c(2,1,2))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(2, 1, 2))
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 1.1727 -0.3775 -0.6306 -0.1894
## s.e. 0.1123 0.1038 0.1186 0.1127
##
## sigma^2 estimated as 0.01947: log likelihood = 143.97, aic = -277.94
BIC(arima(fueldata_diff$avg_price,order=c(2,1,2)))
## [1] -260.1002
arima(fueldata_diff$avg_price,order=c(2,1,1))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(2, 1, 1))
##
## Coefficients:
## ar1 ar2 ma1
## 1.2662 -0.5103 -0.7691
## s.e. 0.1010 0.0559 0.1071
##
## sigma^2 estimated as 0.01967: log likelihood = 142.64, aic = -277.27
BIC(arima(fueldata_diff$avg_price,order=c(2,1,1)))
## [1] -262.9995
arima(fueldata_diff$avg_price,order=c(2,1,0))
##
## Call:
## arima(x = fueldata_diff$avg_price, order = c(2, 1, 0))
##
## Coefficients:
## ar1 ar2
## 0.5766 -0.2313
## s.e. 0.0602 0.0602
##
## sigma^2 estimated as 0.01999: log likelihood = 140.57, aic = -275.14
BIC(arima(fueldata_diff$avg_price,order=c(2,1,0)))
## [1] -264.4327
After trying different variations of AR and MA orders, we see that the AIC is lowest for the second order AR, first order integration, second order moving average model. However on calculating the BIC, the lowest value is observed for an ARIMA model of order (2,1,0). We will run an auto.arima to help determine best model.
auto.arima(fueldata_diff$avg_price,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE,max.order = 6,max.p = 7,max.q = 5)
## Series: fueldata_diff$avg_price
## ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 1.1727 -0.3775 -0.6306 -0.1894
## s.e. 0.1123 0.1038 0.1186 0.1127
##
## sigma^2 = 0.01977: log likelihood = 143.97
## AIC=-277.94 AICc=-277.71 BIC=-260.1
Auto.arima below also reveals that the ARIMA(2,1,2) model is the best fit based on lowest AIC and BIC value. Therefore, we proceed with the model ARIMA(2,1,2) for residual and in-sample performance testing.
best_mod = arima(fueldata_diff$avg_price,order=c(2,1,2))
resid = best_mod$residuals
pred = resid+fueldata_diff$avg_price
ggplot()+
geom_line(aes(fueldata_diff$obs_date,fueldata_diff$avg_price))+
geom_line(aes(fueldata_diff$obs_date,pred),color='blue',alpha=0.4)+
theme_bw()+
xlab("Date")+
ylab("Avg Price")
The in sample performance of the model is visualized above. The predicted values of the model for the same time period follow the same pattern of the observed values. Almost all fluctuations in observed values of the series are captured by the model predictions.
RMSE = sqrt(mean((expm1(pred) - expm1(fueldata_diff$avg_price))^2,na.rm=T))
RMSE
## [1] 3.098569
The RMSE for the in sample predictions is also favorable to the model selected. The RMSE reveals that the model predicts fuel prices within approximately 3 points of actual price on average in-sample.
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 6.6341, df = 6, p-value = 0.356
##
## Model df: 4. Total lags used: 10
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
The residuals of the selected ARIMA model visually resemble a white noise process. However, from the histogram, the residuals do not exactly follow a normal distribution on account of a slight left skew caused by the large dip in the average price in 2008.
The ACF and PACF plots show that the autocorrelation function for the series is generally very close to 0.
We further test for auto-correlation using the Ljung- Box test, to determine if the residuals are independent as expected.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.0050491, df = 1, p-value = 0.9434
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.80015, df = 5, p-value = 0.977
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = 6.6341, df = 10, p-value = 0.7595
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = 17.039, df = 20, p-value = 0.6505
The test is performed for different lags. The p value for lags 1,5,10 and 20 are all greater than 0.05. This indicates that we can reject the null hypothesis that the residuals are not independent. We conclude that the residuals for the best model selected are independent and resemble a white noise process.
prediction = predict(best_mod,n.ahead=6)
pred_data = data.frame(
pred = prediction,
obs_date = c(ymd(max(fueldata_diff$obs_date)),ymd(max(fueldata_diff$obs_date)+ months(1)),ymd(max(fueldata_diff$obs_date)+ months(2)),ymd(max(fueldata_diff$obs_date)+ months(3)),ymd(max(fueldata_diff$obs_date)+ months(4)),ymd(max(fueldata_diff$obs_date)+ months(5)))
)
pred_merge = fueldata_diff %>%
full_join(pred_data) %>%
mutate(
pred_high = pred.pred+2*pred.se,
pred_1se_high = pred.pred+1*pred.se,
pred_low = pred.pred - 2*pred.se,
pred_1se_low = pred.pred-1*pred.se,
pred.pred = pred.pred,
error = pred.pred - avg_price
)
pred_merge %>%
filter(obs_date >= ymd("2000-06-01")) %>%
ggplot()+
geom_line(aes(obs_date,avg_price))+
geom_line(aes(obs_date,pred.pred),color='blue')+
geom_ribbon(aes(x=obs_date,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)+
geom_ribbon(aes(x=obs_date,ymin=pred_1se_low,ymax = pred_1se_high),fill='blue',alpha=0.2)+
theme_bw()+
xlab("Date")+
ylab("Avg Price")
The predictions from the selected best model for five months into the future starting from the end of 2021 are as per the graph above. The predicted average price for the end of 2021 seem to follow the decreasing trend observed during the last quarter of 2021 and the values predicted are not extremely different from the previous period values. Thus the conclusion is that the projected values look reasonable in the period which we have forecast for.