Time Series Analysis of Fuel Prices in USA (2011

Introduction

According to the US Energy Information Administration, regular gas costs 58.7% more this month than it did a year ago - $3.491 a gallon last month vs $2.20 in November 2020. (EIA). I was very much interested to know the facts behind the rise in the fuel prices. The data was pulled from the EIA website.

Despite the fact that customers buy gasoline locally, the price of the fuel is mostly decided by the worldwide market for crude oil. When the price of crude oil fluctuates on the global market, whether due to geopolitical events, crude supply variations, or gasoline demand fluctuations, the prices that consumers pay at the pump fluctuate as well.

It is always difficult to forecast fuel prices as there are factors such as Covid 19 in the past few years where the prices have gone up over the time period.

Importing the data and analyzing various columns in it.

library(readxl)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tseries)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(ggplot2)
library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## Loading required package: TTR

library(mltools)

## 
## Attaching package: 'mltools'

## The following object is masked from 'package:tidyr':
## 
##     replace_na

library(tidyr)
library(forecast)

## Warning: package 'forecast' was built under R version 4.1.2

library(tseries)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(zoo)
library(sjPlot)

## #refugeeswelcome

library(sjmisc)

## 
## Attaching package: 'sjmisc'

## The following object is masked from 'package:mltools':
## 
##     replace_na

## The following object is masked from 'package:purrr':
## 
##     is_empty

## The following object is masked from 'package:tidyr':
## 
##     replace_na

## The following object is masked from 'package:tibble':
## 
##     add_case

library(sjlabelled)

## 
## Attaching package: 'sjlabelled'

## The following object is masked from 'package:forcats':
## 
##     as_factor

## The following object is masked from 'package:dplyr':
## 
##     as_label

## The following object is masked from 'package:ggplot2':
## 
##     as_label

library(prophet)

## Loading required package: Rcpp

## Loading required package: rlang

## 
## Attaching package: 'rlang'

## The following objects are masked from 'package:sjlabelled':
## 
##     as_character, as_label

## The following object is masked from 'package:sjmisc':
## 
##     is_empty

## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
##     splice

Fuel_prices_data <- read_excel("/Users/nishanthreddy/Downloads/Fuel_prices_data.xlsx")
head(Fuel_prices_data)

## # A tibble: 6 × 2
##   Week                Price
##   <dttm>              <dbl>
## 1 2022-01-17 00:00:00  3.40
## 2 2022-01-10 00:00:00  3.39
## 3 2022-01-03 00:00:00  3.38
## 4 2021-12-27 00:00:00  3.38
## 5 2021-12-20 00:00:00  3.40
## 6 2021-12-13 00:00:00  3.41

summary(Fuel_prices_data)

##       Week                         Price      
##  Min.   :1993-04-05 00:00:00   Min.   :0.949  
##  1st Qu.:2000-06-15 12:00:00   1st Qu.:1.329  
##  Median :2007-08-27 00:00:00   Median :2.284  
##  Mean   :2007-08-27 00:00:00   Mean   :2.242  
##  3rd Qu.:2014-11-06 12:00:00   3rd Qu.:2.913  
##  Max.   :2022-01-17 00:00:00   Max.   :4.165

str(Fuel_prices_data)

## tibble [1,503 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : POSIXct[1:1503], format: "2022-01-17" "2022-01-10" ...
##  $ Price: num [1:1503] 3.4 3.39 3.38 3.38 3.4 ...

We can see here the summary of the data which is pulled from past years. Mean, Median values are 2.42 and 2.28 respectively. The data has 1,503 rows and 2 columns.

Finding missing values if any…

colSums(is.na(Fuel_prices_data))

##  Week Price 
##     0     0

Fuel_data <- Fuel_prices_data[which(Fuel_prices_data$Week >= "2011-01-01" & Fuel_prices_data$Week <= "2021-12-31" ),]
Fuel_data

## # A tibble: 574 × 2
##    Week                Price
##    <dttm>              <dbl>
##  1 2021-12-27 00:00:00  3.38
##  2 2021-12-20 00:00:00  3.40
##  3 2021-12-13 00:00:00  3.41
##  4 2021-12-06 00:00:00  3.44
##  5 2021-11-29 00:00:00  3.48
##  6 2021-11-22 00:00:00  3.49
##  7 2021-11-15 00:00:00  3.50
##  8 2021-11-08 00:00:00  3.50
##  9 2021-11-01 00:00:00  3.48
## 10 2021-10-25 00:00:00  3.48
## # … with 564 more rows

head(Fuel_data)

## # A tibble: 6 × 2
##   Week                Price
##   <dttm>              <dbl>
## 1 2021-12-27 00:00:00  3.38
## 2 2021-12-20 00:00:00  3.40
## 3 2021-12-13 00:00:00  3.41
## 4 2021-12-06 00:00:00  3.44
## 5 2021-11-29 00:00:00  3.48
## 6 2021-11-22 00:00:00  3.49

We took the data from past 10 years as its more significant and also found that there are no missing values in the data.

par(mfrow=c(1,2))
plot(Fuel_data$Week,Fuel_data$Price, type = "h", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
plot(Fuel_data$Week,Fuel_data$Price, type = "o", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")

plot(Fuel_data$Week,Fuel_data$Price, type = "l", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
boxplot(Fuel_data$Price, main = "Fuel Price")

From the Histogram and graphs, We can see various trends that the fuel prices dipped in 2016. As per the statistics the United States became a monthly net exporter of natural gas for the first time since 1957 in November 2016. Infrastructure advancements, such as natural gas pipelines and facilities for liquefying natural gas for export, aided this by allowing supply to fulfill rising demand from international markets.Throughout 2016, pipeline shipments from the United States to Mexico increased, accounting for 87 percent of total natural gas exports from the United States.

In 2019 the prices again went down as there was sufficient amount of production in the United States for that period and in 2020, COVID 19 pandemic played a major role in the decline of the Fuel prices as the demand was less when compared to the previous years.

There are no outliers in the data as per the boxplot plotted above.

Linear Modeling

library(tidyverse)
sample_data <- sort(sample(nrow(Fuel_data), nrow(Fuel_data)*0.8))
train<-Fuel_data[sample_data,]
test<-Fuel_data[-sample_data,]
lm_train = lm(Price ~ Week,data = train)
summary(lm_train)

## 
## Call:
## lm(formula = Price ~ Week, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12605 -0.33764  0.05372  0.31161  1.12459 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.996e+00  3.109e-01   25.71   <2e-16 ***
## Week        -3.436e-09  2.114e-10  -16.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4565 on 457 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.365 
## F-statistic: 264.3 on 1 and 457 DF,  p-value: < 2.2e-16

Looking at the summary of the 80% data, we can see that the pvalue is <0.05 by which we can reject the null hypothesis testing and can conclude that there is a clear relationship between the Fuel prices and Years.

predict_price <- predict(lm_train,newdata = test[c('Week')] )
predicted_test_data <- test
predicted_test_data['predict_price'] <- predict_price

install.packages("mltools", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages

#install.packages("Metrics") 
#rmse(predicted_test_data$Price, predicted_test_data$predict_price)

The Root Mean Square Error value which we got for the predicted model is 0.45.

Plotting the Predict and Actual Prices using ggplot

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages

install.packages("ggplot2", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages

ggplot(train, aes(x = Week , y = Price )) +
  geom_line(colour = "Black") +
  geom_line(data = test, aes(x = Week, y = Price),colour = "green") + 
  geom_line(data = predicted_test_data, aes(x = Week, y = predict_price),colour = "red") +
  labs(title = "Fuel Prices over the Years(Predicted)",
       subtitle = "2011-2021",
       x = "Years", y = "Fuel Price per Gallon ($)")

The actual and predicted fuel prices over the past ten years are labelled in the above graph. The actual prices are shown in black color, the predicted prices are shown in the red color and followed by the test data prices which are labelled in the green color.The factors such as abundant production of fuel in the 2016,2019 years and followed by the COVID 19 pandemic in the 2020 are clearly responsible for the changes in the graphin those time periods. We can conclude by saying that Linear regression isnt the best technique for the time series data as it isnt predicting it accurately. Better techniques can be used such as ARIMA for better analysis.

#line chart
ggplot(data = Fuel_data, aes(x = Fuel_data$Week, y = Fuel_data$Price  )) + theme_bw()+
  geom_line() + labs(title = "Line Chart")

## Warning: Use of `Fuel_data$Week` is discouraged. Use `Week` instead.

## Warning: Use of `Fuel_data$Price` is discouraged. Use `Price` instead.

Time series appears to be variance stationary through out the time frame. However, we’ll test the transformations to assure.

Fuel_Data_roll <- Fuel_data %>%
  mutate(
   Price_mean = zoo::rollmean(
      Price, 
      k = 180, 
      fill = NA),
    Price_sd = zoo::rollapply(
      Price, 
      FUN = sd, 
      width = 180, 
      fill = NA)
  )
Price_roll_mean <- Fuel_Data_roll %>%
  ggplot() +
  geom_line(aes(Week, Price_mean)) +
  theme_bw() +
  ggtitle("Fuel Prices Mean over Time (6 month rolling window)") +
  ylab("Adjusted Fuel Price") +
  xlab("Years")
Price_roll_sd <-Fuel_Data_roll %>%
  ggplot() +
  geom_line(aes(Week, Price_sd)) +
  theme_bw() +
  ggtitle("Differenced Fuel Prices SD over Time (6 month rolling window)") +
  ylab("Adjusted Fuel Price") +
  xlab("Years")
Price_roll_mean

## Warning: Removed 179 row(s) containing missing values (geom_path).

Price_roll_sd

## Warning: Removed 179 row(s) containing missing values (geom_path).

ADF Test

print(adf.test(Fuel_data$Price))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data$Price
## Dickey-Fuller = -3.4053, Lag order = 8, p-value = 0.05253
## alternative hypothesis: stationary

From the above ADF Test output, we observe the p-value is greater than 0.05. So, we fail to reject the Null Hypothesis and infer that the time series is non-stationary.

kpss.test(Fuel_data$Price, null="Trend")

## Warning in kpss.test(Fuel_data$Price, null = "Trend"): p-value smaller than
## printed p-value

## 
##  KPSS Test for Trend Stationarity
## 
## data:  Fuel_data$Price
## KPSS Trend = 0.95139, Truncation lag parameter = 6, p-value = 0.01

From the above KPSS Test output, we observe the p-value is less than 0.05. So, we reject the Null Hypothesis and infer that the time series is non-stationary.

Log Transformation for Variance.

Fuel_data_log = Fuel_data %>%
  mutate(Price_log = log1p(Fuel_data$Price))
Fuel_data_log %>%
  ggplot()+
      geom_line(aes(Week,Price_log ))+
      theme_bw()+
      ggtitle("Fuel Prices Over Time (Log)")+
      ylab("Fuel Price (Log)")+
      xlab("Years")

The time series after being log-transformed and Box-Cox transformed are extremely identical.

Handling Mean Stationarity

Fuel_diff <- Fuel_data_log %>%
  mutate(Price_diff = Fuel_data$Price - lag(Fuel_data$Price))
Fuel_diff %>% select(Week, Price, Price_diff) %>%
head()

## # A tibble: 6 × 3
##   Week                Price Price_diff
##   <dttm>              <dbl>      <dbl>
## 1 2021-12-27 00:00:00  3.38    NA     
## 2 2021-12-20 00:00:00  3.40     0.0200
## 3 2021-12-13 00:00:00  3.41     0.0190
## 4 2021-12-06 00:00:00  3.44     0.0260
## 5 2021-11-29 00:00:00  3.48     0.0380
## 6 2021-11-22 00:00:00  3.49     0.0150

ggplot() + geom_line(aes(Fuel_diff$Week,Fuel_diff$Price_diff)) + theme_bw() + ggtitle("Fuel Price (First Difference)") + ylab("Price(Difference))") + xlab("Year")

## Warning: Removed 1 row(s) containing missing values (geom_path).

ACF

acf(Fuel_diff$Price)

We can see that the time series has no seasonality in the ACF plot above.

In the ACF plot, there is a Dampening effect. As a result, the time series is entirely an AR process.

PACF

pacf(Fuel_diff$Price)

We can see from the PACF that there are two statistically significant autocorrelations. As a result, the order of the AR process is 2. ARIMA is given as (2,1,0).

Log Transformation and Differencing

Fuel_data_log_diff <- Fuel_data %>%
 arrange(Fuel_data$Week) %>%
  mutate(Price_log = log1p(Fuel_data$Price), Price_diff = Price - lag(Fuel_data$Price),
         Price_log_diff = Price_log - lag(Price_log) ) %>%
  drop_na() 
Fuel_data_log_diff %>%
  ggplot() + geom_line(aes(Fuel_data_log_diff$Week,Fuel_data_log_diff$Price_diff)) + theme_bw() + ggtitle("Fuel Price (First Difference)") +    ylab("Price(Difference))") + xlab("Year")

## Warning: Use of `Fuel_data_log_diff$Week` is discouraged. Use `Week` instead.

## Warning: Use of `Fuel_data_log_diff$Price_diff` is discouraged. Use `Price_diff`
## instead.

Re-verifying using the ADF Test.

adf.test(Fuel_data_log_diff$Price)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price
## Dickey-Fuller = -2.2758, Lag order = 8, p-value = 0.4615
## alternative hypothesis: stationary

adf.test(Fuel_data_log_diff$Price_diff)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_diff
## Dickey-Fuller = -3.175, Lag order = 8, p-value = 0.09224
## alternative hypothesis: stationary

adf.test(Fuel_data_log_diff$Price_log)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_log
## Dickey-Fuller = -3.409, Lag order = 8, p-value = 0.0519
## alternative hypothesis: stationary

adf.test(Fuel_data_log_diff$Price_log_diff)

## Warning in adf.test(Fuel_data_log_diff$Price_log_diff): p-value smaller than
## printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_log_diff
## Dickey-Fuller = -6.9354, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

kpss.test(Fuel_data_log_diff$Price)

## Warning in kpss.test(Fuel_data_log_diff$Price): p-value smaller than printed p-
## value

## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price
## KPSS Level = 3.9025, Truncation lag parameter = 6, p-value = 0.01

kpss.test(Fuel_data_log_diff$Price_diff)

## Warning in kpss.test(Fuel_data_log_diff$Price_diff): p-value smaller than
## printed p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_diff
## KPSS Level = 6.0787, Truncation lag parameter = 6, p-value = 0.01

kpss.test(Fuel_data_log_diff$Price_log)

## Warning in kpss.test(Fuel_data_log_diff$Price_log): p-value smaller than printed
## p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_log
## KPSS Level = 3.7162, Truncation lag parameter = 6, p-value = 0.01

kpss.test(Fuel_data_log_diff$Price_log_diff)

## Warning in kpss.test(Fuel_data_log_diff$Price_log_diff): p-value greater than
## printed p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_log_diff
## KPSS Level = 0.10982, Truncation lag parameter = 6, p-value = 0.1

The results of the ADF and KPSS tests show that time series is stationary.

ARIMA

par(mfrow = c(1,2))
acf(Fuel_data_log_diff$Price_log_diff,lag.max=20)
pacf(Fuel_data_log_diff$Price_log_diff,lag.max=20)

AIC of ARIMA Models.

AIC(
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,2)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,3)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(2,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(3,1,0))
)

##                                                         df       AIC
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 1))  2 -3568.583
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 2))  3 -3612.266
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 3))  4 -3622.255
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 0))  2 -3636.575
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 1))  3 -3634.718
## arima(Fuel_data_log_diff$Price_log, order = c(2, 1, 0))  3 -3634.702
## arima(Fuel_data_log_diff$Price_log, order = c(3, 1, 0))  4 -3633.285

According to the above AIC values Models, ARIMA (1,1,0) is the best model with an AIC = -3636.578.

BIC of ARIMA Models

BIC(
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,2)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,3)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(2,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(3,1,0))
)

##                                                         df       BIC
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 1))  2 -3559.884
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 2))  3 -3599.219
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 3))  4 -3604.859
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 0))  2 -3627.877
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 1))  3 -3621.670
## arima(Fuel_data_log_diff$Price_log, order = c(2, 1, 0))  3 -3621.655
## arima(Fuel_data_log_diff$Price_log, order = c(3, 1, 0))  4 -3615.888

According to the above BIC values Models, ARIMA (1,1,1) is the best model with an BIC = -3621.676.

Auto ARIMA

auto.arima(Fuel_data_log_diff$Price_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)

## Series: Fuel_data_log_diff$Price_log 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.6031
## s.e.  0.0332
## 
## sigma^2 = 0.0001009:  log likelihood = 1820.29
## AIC=-3636.58   AICc=-3636.55   BIC=-3627.88

As per the output of the model, we can conclude that ARIMA(1,1,0) is the best model which can be used for further analysis.

Ljung - Box Test

Model_1 = arima(Fuel_data_log_diff$Price_log,order=c(1,1,0))
forecast::checkresiduals(Model_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)
## Q* = 7.5908, df = 9, p-value = 0.5758
## 
## Model df: 1.   Total lags used: 10

From the viz, we can say that the residuals seem to be white noise.

par(mfrow=c(1,2))
res = Model_1$resid
acf(res,lag.max=20)
pacf(res,lag.max=20)

Predicting Model Performance using Forecast.

plot(forecast(Model_1,h=150),main="ARIMA(1,1,0) Forecast",ylab="Fuel Price per Gallon ($)",xlab="Date")

prediction <- Arima(Fuel_data_log_diff$Price, model = Model_1)
accuracy(prediction)

##                        ME      RMSE        MAE         MPE      MAPE      MASE
## Training set 0.0001401753 0.0390067 0.02871679 0.009395186 0.9943248 0.7846193
##                      ACF1
## Training set -0.006829676

The RMSE of the corresponding model is 0.039

The Forecast shows us that there wont be any major difference in Fuel price for the upcoming days.

## Time series analysis using Prophet Model

Fuel_data <- Fuel_prices_data %>%
  filter(Week >= ymd("2000=01-01")) %>% rename(Year = "Week",Avg_price ='Price') %>% 
  mutate(Month_yr = format(Year, "%Y-%m")) %>% distinct(Month_yr,.keep_all=TRUE) 
as_tibble(Fuel_data)

## # A tibble: 265 × 3
##    Year                Avg_price Month_yr
##    <dttm>                  <dbl> <chr>   
##  1 2022-01-17 00:00:00      3.40 2022-01 
##  2 2021-12-27 00:00:00      3.38 2021-12 
##  3 2021-11-29 00:00:00      3.48 2021-11 
##  4 2021-10-25 00:00:00      3.48 2021-10 
##  5 2021-09-27 00:00:00      3.27 2021-09 
##  6 2021-08-30 00:00:00      3.24 2021-08 
##  7 2021-07-26 00:00:00      3.23 2021-07 
##  8 2021-06-28 00:00:00      3.18 2021-06 
##  9 2021-05-31 00:00:00      3.12 2021-05 
## 10 2021-04-26 00:00:00      2.96 2021-04 
## # … with 255 more rows

Prophet Model

The time series data is divided into two parts: train data (from January 2000 to December 2014) and test data (from January 2015 to January 2021). The prophet model is then built based on the train data and forecast. The future forecast period units have been adjusted to reflect the fact that this series is collected at the month level.

prophet_fueldata = Fuel_data %>% 
    rename(ds = Year,
    y = Avg_price)  

train = prophet_fueldata %>% 
  filter(ds<ymd("2015-01-01"))

test = prophet_fueldata %>%
  filter(ds>=ymd("2015-01-01"))

model = prophet(train)

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

future = make_future_dataframe(model,periods = 85, freq = 'month')
forecast = predict(model,future)

plot(model,forecast)+
ylab("Average Fuel Price")+xlab("Date")+labs(title = 'Forecast: Average Fuel Price',subtitle = 'January 2000 - December 2021')+theme_bw()

Time series Components plots

prophet_plot_components(model,forecast)

We can see taht there’s an upward trend in the Jan-Mid of March period and then there’s a fall in the April followed by few fluctuations till July then during Fall it dips and rises afterwards.

Finding Saturation points

After finding the summaries of both the test and train data, we need to find the saturation points which means the threshold of the price which is reasonable as seen in the last 20 years of the data.

summary(train)

##        ds                            y           Month_yr        
##  Min.   :2000-01-31 00:00:00   Min.   :1.137   Length:180        
##  1st Qu.:2003-10-20 00:00:00   1st Qu.:1.686   Class :character  
##  Median :2007-07-12 12:00:00   Median :2.654   Mode  :character  
##  Mean   :2007-07-13 07:36:00   Mean   :2.560                     
##  3rd Qu.:2011-04-04 00:00:00   3rd Qu.:3.373                     
##  Max.   :2014-12-29 00:00:00   Max.   :4.146

summary(test)

##        ds                            y           Month_yr        
##  Min.   :2015-01-26 00:00:00   Min.   :1.870   Length:85         
##  1st Qu.:2016-10-31 00:00:00   1st Qu.:2.341   Class :character  
##  Median :2018-07-30 00:00:00   Median :2.589   Mode  :character  
##  Mean   :2018-07-28 02:32:28   Mean   :2.604                     
##  3rd Qu.:2020-04-27 00:00:00   3rd Qu.:2.857                     
##  Max.   :2022-01-17 00:00:00   Max.   :3.478

#setting floor for train data
train$floor = 1.13
train$cap = 4.14

#setting floor for forecast data

future$floor = 1.8
future$cap = 3.4
sat_model = prophet(train,growth='logistic')

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

sat_forecast = predict(sat_model,future)

plot(sat_model,sat_forecast)+ylim(0,4.5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel ',subtitle = 'Saturation Points')+theme_bw()

We set these saturation thresholds for the prophet model, and the resulting figure shows that the predicted values are substantially different from the previous forecast, and that the increasing trend is not as sharp as it was before the limitations were established.

Examination for Change points

model = prophet(train, n.changepoints = 10)

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

future = make_future_dataframe(model,periods = 85, freq = 'month')
plot(model,forecast)+add_changepoints_to_plot(model)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Prices',subtitle = 'January 2000 - December 2021')+theme_bw()

These are the two change points observed in the Time series.

Chekcing the Type of Seasonilty

additive = prophet(train)

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

add_fcst = predict(additive,future)

plot(additive,add_fcst)+
ylim(0,5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel',subtitle = 'Additive Seasonality Model')+theme_bw()

prophet_plot_components(additive,add_fcst)

The additive model does not appear to accurately fit the increased seasonal changes in average price values towards the end of the training period, although it performs well in the beginning.

multi = prophet(train,seasonality.mode = 'multiplicative')

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(0,5) +theme_bw()+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel',subtitle = 'Multiplicative Seasonality Model')

prophet_plot_components(multi,multi_fcst)

Both the models seem to have the same trends.

Model Assessment

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  filter(ds>=ymd("2015-01-01"))

## Warning in base::check_tzones(e1, e2): 'tzone' attributes are inconsistent

RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))

## [1] "RMSE: 1.85"

print(paste("MAE:",round(MAE,2)))

## [1] "MAE: 1.76"

print(paste("MAPE:",round(MAPE,2)))

## [1] "MAPE: 0.71"

RMSE for the above model is ‘1.85’, whereas MAE for the same model is slightly lower which is ‘1.76’. The Mean Absolute percentage Error is 71%

Cross Validation

Model cross-validation will be performed by making the appropriate adjustments to incorporate monthly data. Setting the initial training time to 5 years and the rolling horizon to 1 year appears to produce relatively low error metrics after a variety of combinations.

df.cv <- cross_validation(model ,horizon=365,period = 365,initial = 5*365,units='days')

## Making 9 forecasts with cutoffs between 2005-12-31 and 2013-12-29

head(df.cv)

## # A tibble: 6 × 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  2.40 2006-01-30 00:00:00  2.34       2.19       2.46 2005-12-31 00:00:00
## 2  2.30 2006-02-27 00:00:00  2.40       2.27       2.52 2005-12-31 00:00:00
## 3  2.54 2006-03-27 00:00:00  2.53       2.40       2.67 2005-12-31 00:00:00
## 4  2.96 2006-04-24 00:00:00  2.52       2.39       2.65 2005-12-31 00:00:00
## 5  2.91 2006-05-29 00:00:00  2.66       2.52       2.79 2005-12-31 00:00:00
## 6  2.91 2006-06-26 00:00:00  2.68       2.55       2.81 2005-12-31 00:00:00

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Avg Price")+
  scale_color_discrete(name = 'Cutoff')+
  theme(legend.position = "None")+
  labs(
    title = 'Forecast: Average Price of Fuel',
    subtitle = 'Cross Validation'
  )

plot_cross_validation_metric(df.cv, metric = 'rmse')

plot_cross_validation_metric(df.cv, metric = 'mape')

The rolling window predictions are compared to actuals in the cross validation result plot. For a one-year forecast horizon, the RMSE is between 0.25 and 0.75. The average absolute percent inaccuracy varies from 10 to 25%.

According to the results, the ARIMA model appears to have fewer errors for predictions in the foreseeable term, whereas larger errors in later periods are similar to the Prophet model. For near-future predictions, Prophet has higher errors, but the variability in the error is lower. We can observe that the ARIMA outperforms the Prophet for this time series based on the RMSE and the data above.

Conclusion

We may conclude that the model has done a reasonable job of capturing the data generation process, and we can look at ways to improve the model’s performance by adding more data and external regression variables that could help explain the fluctuation of gasoline prices better. As a result, we’ve covered all of the bases, including determining whether average prices follow seasonal trends and whether changes can be caught through the data generation process. We also fitted two different time series models to the data in order to estimate prices over a specific time horizon and evaluated performance using relevant measures to determine which model was superior.

Time Series Analysis of Fuel Prices in USA (2011 - 2021)

Nishanth Reddy

1/19/2022

Introduction

It is always difficult to forecast fuel prices as there are factors such as Covid 19 in the past few years where the prices have gone up over the time period.

Importing the data and analyzing various columns in it.

We can see here the summary of the data which is pulled from past years. Mean, Median values are 2.42 and 2.28 respectively. The data has 1,503 rows and 2 columns.

Finding missing values if any…

We took the data from past 10 years as its more significant and also found that there are no missing values in the data.

In 2019 the prices again went down as there was sufficient amount of production in the United States for that period and in 2020, COVID 19 pandemic played a major role in the decline of the Fuel prices as the demand was less when compared to the previous years.

There are no outliers in the data as per the boxplot plotted above.

Linear Modeling

Looking at the summary of the 80% data, we can see that the pvalue is <0.05 by which we can reject the null hypothesis testing and can conclude that there is a clear relationship between the Fuel prices and Years.

The Root Mean Square Error value which we got for the predicted model is 0.45.

Plotting the Predict and Actual Prices using ggplot

Time series appears to be variance stationary through out the time frame. However, we’ll test the transformations to assure.

ADF Test

From the above ADF Test output, we observe the p-value is greater than 0.05. So, we fail to reject the Null Hypothesis and infer that the time series is non-stationary.

From the above KPSS Test output, we observe the p-value is less than 0.05. So, we reject the Null Hypothesis and infer that the time series is non-stationary.

Log Transformation for Variance.

The time series after being log-transformed and Box-Cox transformed are extremely identical.

Handling Mean Stationarity

ACF

We can see that the time series has no seasonality in the ACF plot above.

In the ACF plot, there is a Dampening effect. As a result, the time series is entirely an AR process.

PACF

We can see from the PACF that there are two statistically significant autocorrelations. As a result, the order of the AR process is 2. ARIMA is given as (2,1,0).

Log Transformation and Differencing

Re-verifying using the ADF Test.

The results of the ADF and KPSS tests show that time series is stationary.

ARIMA

AIC of ARIMA Models.

According to the above AIC values Models, ARIMA (1,1,0) is the best model with an AIC = -3636.578.

BIC of ARIMA Models

According to the above BIC values Models, ARIMA (1,1,1) is the best model with an BIC = -3621.676.

Auto ARIMA

As per the output of the model, we can conclude that ARIMA(1,1,0) is the best model which can be used for further analysis.

Ljung - Box Test

From the viz, we can say that the residuals seem to be white noise.

Predicting Model Performance using Forecast.

The RMSE of the corresponding model is 0.039

The Forecast shows us that there wont be any major difference in Fuel price for the upcoming days.

## Time series analysis using Prophet Model

Prophet Model

Time series Components plots

We can see taht there’s an upward trend in the Jan-Mid of March period and then there’s a fall in the April followed by few fluctuations till July then during Fall it dips and rises afterwards.

Finding Saturation points

After finding the summaries of both the test and train data, we need to find the saturation points which means the threshold of the price which is reasonable as seen in the last 20 years of the data.

We set these saturation thresholds for the prophet model, and the resulting figure shows that the predicted values are substantially different from the previous forecast, and that the increasing trend is not as sharp as it was before the limitations were established.

Examination for Change points

These are the two change points observed in the Time series.

Chekcing the Type of Seasonilty

The additive model does not appear to accurately fit the increased seasonal changes in average price values towards the end of the training period, although it performs well in the beginning.

Both the models seem to have the same trends.

Model Assessment

RMSE for the above model is ‘1.85’, whereas MAE for the same model is slightly lower which is ‘1.76’. The Mean Absolute percentage Error is 71%

Cross Validation

Model cross-validation will be performed by making the appropriate adjustments to incorporate monthly data. Setting the initial training time to 5 years and the rolling horizon to 1 year appears to produce relatively low error metrics after a variety of combinations.

The rolling window predictions are compared to actuals in the cross validation result plot. For a one-year forecast horizon, the RMSE is between 0.25 and 0.75. The average absolute percent inaccuracy varies from 10 to 25%.

Conclusion