Introduction

According to the US Energy Information Administration, regular gas costs 58.7% more this month than it did a year ago - $3.491 a gallon last month vs $2.20 in November 2020. (EIA). I was very much interested to know the facts behind the rise in the fuel prices. The data was pulled from the EIA website.

Despite the fact that customers buy gasoline locally, the price of the fuel is mostly decided by the worldwide market for crude oil. When the price of crude oil fluctuates on the global market, whether due to geopolitical events, crude supply variations, or gasoline demand fluctuations, the prices that consumers pay at the pump fluctuate as well.

It is always difficult to forecast fuel prices as there are factors such as Covid 19 in the past few years where the prices have gone up over the time period.

Importing the data and analyzing various columns in it.

library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## Loading required package: TTR
library(mltools)
## 
## Attaching package: 'mltools'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(tidyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.1.2
library(tseries)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(zoo)
library(sjPlot)
## #refugeeswelcome
library(sjmisc)
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:mltools':
## 
##     replace_na
## The following object is masked from 'package:purrr':
## 
##     is_empty
## The following object is masked from 'package:tidyr':
## 
##     replace_na
## The following object is masked from 'package:tibble':
## 
##     add_case
library(sjlabelled)
## 
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:forcats':
## 
##     as_factor
## The following object is masked from 'package:dplyr':
## 
##     as_label
## The following object is masked from 'package:ggplot2':
## 
##     as_label
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:sjlabelled':
## 
##     as_character, as_label
## The following object is masked from 'package:sjmisc':
## 
##     is_empty
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
##     splice
Fuel_prices_data <- read_excel("/Users/nishanthreddy/Downloads/Fuel_prices_data.xlsx")
head(Fuel_prices_data)
## # A tibble: 6 × 2
##   Week                Price
##   <dttm>              <dbl>
## 1 2022-01-17 00:00:00  3.40
## 2 2022-01-10 00:00:00  3.39
## 3 2022-01-03 00:00:00  3.38
## 4 2021-12-27 00:00:00  3.38
## 5 2021-12-20 00:00:00  3.40
## 6 2021-12-13 00:00:00  3.41
summary(Fuel_prices_data)
##       Week                         Price      
##  Min.   :1993-04-05 00:00:00   Min.   :0.949  
##  1st Qu.:2000-06-15 12:00:00   1st Qu.:1.329  
##  Median :2007-08-27 00:00:00   Median :2.284  
##  Mean   :2007-08-27 00:00:00   Mean   :2.242  
##  3rd Qu.:2014-11-06 12:00:00   3rd Qu.:2.913  
##  Max.   :2022-01-17 00:00:00   Max.   :4.165
str(Fuel_prices_data)
## tibble [1,503 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : POSIXct[1:1503], format: "2022-01-17" "2022-01-10" ...
##  $ Price: num [1:1503] 3.4 3.39 3.38 3.38 3.4 ...

We can see here the summary of the data which is pulled from past years. Mean, Median values are 2.42 and 2.28 respectively. The data has 1,503 rows and 2 columns.

Finding missing values if any…

colSums(is.na(Fuel_prices_data))
##  Week Price 
##     0     0
Fuel_data <- Fuel_prices_data[which(Fuel_prices_data$Week >= "2011-01-01" & Fuel_prices_data$Week <= "2021-12-31" ),]
Fuel_data                                       
## # A tibble: 574 × 2
##    Week                Price
##    <dttm>              <dbl>
##  1 2021-12-27 00:00:00  3.38
##  2 2021-12-20 00:00:00  3.40
##  3 2021-12-13 00:00:00  3.41
##  4 2021-12-06 00:00:00  3.44
##  5 2021-11-29 00:00:00  3.48
##  6 2021-11-22 00:00:00  3.49
##  7 2021-11-15 00:00:00  3.50
##  8 2021-11-08 00:00:00  3.50
##  9 2021-11-01 00:00:00  3.48
## 10 2021-10-25 00:00:00  3.48
## # … with 564 more rows
head(Fuel_data)
## # A tibble: 6 × 2
##   Week                Price
##   <dttm>              <dbl>
## 1 2021-12-27 00:00:00  3.38
## 2 2021-12-20 00:00:00  3.40
## 3 2021-12-13 00:00:00  3.41
## 4 2021-12-06 00:00:00  3.44
## 5 2021-11-29 00:00:00  3.48
## 6 2021-11-22 00:00:00  3.49

We took the data from past 10 years as its more significant and also found that there are no missing values in the data.

par(mfrow=c(1,2))
plot(Fuel_data$Week,Fuel_data$Price, type = "h", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
plot(Fuel_data$Week,Fuel_data$Price, type = "o", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")

plot(Fuel_data$Week,Fuel_data$Price, type = "l", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
boxplot(Fuel_data$Price, main = "Fuel Price")

In 2019 the prices again went down as there was sufficient amount of production in the United States for that period and in 2020, COVID 19 pandemic played a major role in the decline of the Fuel prices as the demand was less when compared to the previous years.

There are no outliers in the data as per the boxplot plotted above.

Linear Modeling

library(tidyverse)
sample_data <- sort(sample(nrow(Fuel_data), nrow(Fuel_data)*0.8))
train<-Fuel_data[sample_data,]
test<-Fuel_data[-sample_data,]
lm_train = lm(Price ~ Week,data = train)
summary(lm_train)
## 
## Call:
## lm(formula = Price ~ Week, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12605 -0.33764  0.05372  0.31161  1.12459 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.996e+00  3.109e-01   25.71   <2e-16 ***
## Week        -3.436e-09  2.114e-10  -16.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4565 on 457 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.365 
## F-statistic: 264.3 on 1 and 457 DF,  p-value: < 2.2e-16

Looking at the summary of the 80% data, we can see that the pvalue is <0.05 by which we can reject the null hypothesis testing and can conclude that there is a clear relationship between the Fuel prices and Years.

predict_price <- predict(lm_train,newdata = test[c('Week')] )
predicted_test_data <- test
predicted_test_data['predict_price'] <- predict_price
install.packages("mltools", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages
#install.packages("Metrics") 
#rmse(predicted_test_data$Price, predicted_test_data$predict_price)

The Root Mean Square Error value which we got for the predicted model is 0.45.

Plotting the Predict and Actual Prices using ggplot

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpfDJwyf/downloaded_packages
ggplot(train, aes(x = Week , y = Price )) +
  geom_line(colour = "Black") +
  geom_line(data = test, aes(x = Week, y = Price),colour = "green") + 
  geom_line(data = predicted_test_data, aes(x = Week, y = predict_price),colour = "red") +
  labs(title = "Fuel Prices over the Years(Predicted)",
       subtitle = "2011-2021",
       x = "Years", y = "Fuel Price per Gallon ($)")

The actual and predicted fuel prices over the past ten years are labelled in the above graph. The actual prices are shown in black color, the predicted prices are shown in the red color and followed by the test data prices which are labelled in the green color.The factors such as abundant production of fuel in the 2016,2019 years and followed by the COVID 19 pandemic in the 2020 are clearly responsible for the changes in the graphin those time periods. We can conclude by saying that Linear regression isnt the best technique for the time series data as it isnt predicting it accurately. Better techniques can be used such as ARIMA for better analysis.

#line chart
ggplot(data = Fuel_data, aes(x = Fuel_data$Week, y = Fuel_data$Price  )) + theme_bw()+
  geom_line() + labs(title = "Line Chart")
## Warning: Use of `Fuel_data$Week` is discouraged. Use `Week` instead.
## Warning: Use of `Fuel_data$Price` is discouraged. Use `Price` instead.

Time series appears to be variance stationary through out the time frame. However, we’ll test the transformations to assure.

Fuel_Data_roll <- Fuel_data %>%
  mutate(
   Price_mean = zoo::rollmean(
      Price, 
      k = 180, 
      fill = NA),
    Price_sd = zoo::rollapply(
      Price, 
      FUN = sd, 
      width = 180, 
      fill = NA)
  )
Price_roll_mean <- Fuel_Data_roll %>%
  ggplot() +
  geom_line(aes(Week, Price_mean)) +
  theme_bw() +
  ggtitle("Fuel Prices Mean over Time (6 month rolling window)") +
  ylab("Adjusted Fuel Price") +
  xlab("Years")
Price_roll_sd <-Fuel_Data_roll %>%
  ggplot() +
  geom_line(aes(Week, Price_sd)) +
  theme_bw() +
  ggtitle("Differenced Fuel Prices SD over Time (6 month rolling window)") +
  ylab("Adjusted Fuel Price") +
  xlab("Years")
Price_roll_mean
## Warning: Removed 179 row(s) containing missing values (geom_path).

Price_roll_sd
## Warning: Removed 179 row(s) containing missing values (geom_path).

ADF Test

print(adf.test(Fuel_data$Price))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data$Price
## Dickey-Fuller = -3.4053, Lag order = 8, p-value = 0.05253
## alternative hypothesis: stationary

From the above ADF Test output, we observe the p-value is greater than 0.05. So, we fail to reject the Null Hypothesis and infer that the time series is non-stationary.

kpss.test(Fuel_data$Price, null="Trend")
## Warning in kpss.test(Fuel_data$Price, null = "Trend"): p-value smaller than
## printed p-value
## 
##  KPSS Test for Trend Stationarity
## 
## data:  Fuel_data$Price
## KPSS Trend = 0.95139, Truncation lag parameter = 6, p-value = 0.01

From the above KPSS Test output, we observe the p-value is less than 0.05. So, we reject the Null Hypothesis and infer that the time series is non-stationary.

Log Transformation for Variance.

Fuel_data_log = Fuel_data %>%
  mutate(Price_log = log1p(Fuel_data$Price))
Fuel_data_log %>%
  ggplot()+
      geom_line(aes(Week,Price_log ))+
      theme_bw()+
      ggtitle("Fuel Prices Over Time (Log)")+
      ylab("Fuel Price (Log)")+
      xlab("Years")

The time series after being log-transformed and Box-Cox transformed are extremely identical.

Handling Mean Stationarity

Fuel_diff <- Fuel_data_log %>%
  mutate(Price_diff = Fuel_data$Price - lag(Fuel_data$Price))
Fuel_diff %>% select(Week, Price, Price_diff) %>%
head()
## # A tibble: 6 × 3
##   Week                Price Price_diff
##   <dttm>              <dbl>      <dbl>
## 1 2021-12-27 00:00:00  3.38    NA     
## 2 2021-12-20 00:00:00  3.40     0.0200
## 3 2021-12-13 00:00:00  3.41     0.0190
## 4 2021-12-06 00:00:00  3.44     0.0260
## 5 2021-11-29 00:00:00  3.48     0.0380
## 6 2021-11-22 00:00:00  3.49     0.0150
ggplot() + geom_line(aes(Fuel_diff$Week,Fuel_diff$Price_diff)) + theme_bw() + ggtitle("Fuel Price (First Difference)") + ylab("Price(Difference))") + xlab("Year")
## Warning: Removed 1 row(s) containing missing values (geom_path).

ACF

acf(Fuel_diff$Price)

We can see that the time series has no seasonality in the ACF plot above.

In the ACF plot, there is a Dampening effect. As a result, the time series is entirely an AR process.

PACF

pacf(Fuel_diff$Price)

We can see from the PACF that there are two statistically significant autocorrelations. As a result, the order of the AR process is 2. ARIMA is given as (2,1,0).

Log Transformation and Differencing

Fuel_data_log_diff <- Fuel_data %>%
 arrange(Fuel_data$Week) %>%
  mutate(Price_log = log1p(Fuel_data$Price), Price_diff = Price - lag(Fuel_data$Price),
         Price_log_diff = Price_log - lag(Price_log) ) %>%
  drop_na() 
Fuel_data_log_diff %>%
  ggplot() + geom_line(aes(Fuel_data_log_diff$Week,Fuel_data_log_diff$Price_diff)) + theme_bw() + ggtitle("Fuel Price (First Difference)") +    ylab("Price(Difference))") + xlab("Year")
## Warning: Use of `Fuel_data_log_diff$Week` is discouraged. Use `Week` instead.
## Warning: Use of `Fuel_data_log_diff$Price_diff` is discouraged. Use `Price_diff`
## instead.

Re-verifying using the ADF Test.

adf.test(Fuel_data_log_diff$Price)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price
## Dickey-Fuller = -2.2758, Lag order = 8, p-value = 0.4615
## alternative hypothesis: stationary
adf.test(Fuel_data_log_diff$Price_diff)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_diff
## Dickey-Fuller = -3.175, Lag order = 8, p-value = 0.09224
## alternative hypothesis: stationary
adf.test(Fuel_data_log_diff$Price_log)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_log
## Dickey-Fuller = -3.409, Lag order = 8, p-value = 0.0519
## alternative hypothesis: stationary
adf.test(Fuel_data_log_diff$Price_log_diff)
## Warning in adf.test(Fuel_data_log_diff$Price_log_diff): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Fuel_data_log_diff$Price_log_diff
## Dickey-Fuller = -6.9354, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(Fuel_data_log_diff$Price)
## Warning in kpss.test(Fuel_data_log_diff$Price): p-value smaller than printed p-
## value
## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price
## KPSS Level = 3.9025, Truncation lag parameter = 6, p-value = 0.01
kpss.test(Fuel_data_log_diff$Price_diff)
## Warning in kpss.test(Fuel_data_log_diff$Price_diff): p-value smaller than
## printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_diff
## KPSS Level = 6.0787, Truncation lag parameter = 6, p-value = 0.01
kpss.test(Fuel_data_log_diff$Price_log)
## Warning in kpss.test(Fuel_data_log_diff$Price_log): p-value smaller than printed
## p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_log
## KPSS Level = 3.7162, Truncation lag parameter = 6, p-value = 0.01
kpss.test(Fuel_data_log_diff$Price_log_diff)
## Warning in kpss.test(Fuel_data_log_diff$Price_log_diff): p-value greater than
## printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  Fuel_data_log_diff$Price_log_diff
## KPSS Level = 0.10982, Truncation lag parameter = 6, p-value = 0.1

The results of the ADF and KPSS tests show that time series is stationary.

ARIMA

par(mfrow = c(1,2))
acf(Fuel_data_log_diff$Price_log_diff,lag.max=20)
pacf(Fuel_data_log_diff$Price_log_diff,lag.max=20)

AIC of ARIMA Models.

AIC(
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,2)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,3)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(2,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(3,1,0))
)
##                                                         df       AIC
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 1))  2 -3568.583
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 2))  3 -3612.266
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 3))  4 -3622.255
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 0))  2 -3636.575
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 1))  3 -3634.718
## arima(Fuel_data_log_diff$Price_log, order = c(2, 1, 0))  3 -3634.702
## arima(Fuel_data_log_diff$Price_log, order = c(3, 1, 0))  4 -3633.285

According to the above AIC values Models, ARIMA (1,1,0) is the best model with an AIC = -3636.578.

BIC of ARIMA Models

BIC(
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,2)),
  arima(Fuel_data_log_diff$Price_log,order=c(0,1,3)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(1,1,1)),
  arima(Fuel_data_log_diff$Price_log,order=c(2,1,0)),
  arima(Fuel_data_log_diff$Price_log,order=c(3,1,0))
)
##                                                         df       BIC
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 1))  2 -3559.884
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 2))  3 -3599.219
## arima(Fuel_data_log_diff$Price_log, order = c(0, 1, 3))  4 -3604.859
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 0))  2 -3627.877
## arima(Fuel_data_log_diff$Price_log, order = c(1, 1, 1))  3 -3621.670
## arima(Fuel_data_log_diff$Price_log, order = c(2, 1, 0))  3 -3621.655
## arima(Fuel_data_log_diff$Price_log, order = c(3, 1, 0))  4 -3615.888

According to the above BIC values Models, ARIMA (1,1,1) is the best model with an BIC = -3621.676.

Auto ARIMA

auto.arima(Fuel_data_log_diff$Price_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: Fuel_data_log_diff$Price_log 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.6031
## s.e.  0.0332
## 
## sigma^2 = 0.0001009:  log likelihood = 1820.29
## AIC=-3636.58   AICc=-3636.55   BIC=-3627.88

As per the output of the model, we can conclude that ARIMA(1,1,0) is the best model which can be used for further analysis.

Ljung - Box Test

Model_1 = arima(Fuel_data_log_diff$Price_log,order=c(1,1,0))
forecast::checkresiduals(Model_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)
## Q* = 7.5908, df = 9, p-value = 0.5758
## 
## Model df: 1.   Total lags used: 10

From the viz, we can say that the residuals seem to be white noise.

par(mfrow=c(1,2))
res = Model_1$resid
acf(res,lag.max=20)
pacf(res,lag.max=20)

Predicting Model Performance using Forecast.

plot(forecast(Model_1,h=150),main="ARIMA(1,1,0) Forecast",ylab="Fuel Price per Gallon ($)",xlab="Date")

prediction <- Arima(Fuel_data_log_diff$Price, model = Model_1)
accuracy(prediction)
##                        ME      RMSE        MAE         MPE      MAPE      MASE
## Training set 0.0001401753 0.0390067 0.02871679 0.009395186 0.9943248 0.7846193
##                      ACF1
## Training set -0.006829676

The RMSE of the corresponding model is 0.039

The Forecast shows us that there wont be any major difference in Fuel price for the upcoming days.

## Time series analysis using Prophet Model

Fuel_data <- Fuel_prices_data %>%
  filter(Week >= ymd("2000=01-01")) %>% rename(Year = "Week",Avg_price ='Price') %>% 
  mutate(Month_yr = format(Year, "%Y-%m")) %>% distinct(Month_yr,.keep_all=TRUE) 
as_tibble(Fuel_data)
## # A tibble: 265 × 3
##    Year                Avg_price Month_yr
##    <dttm>                  <dbl> <chr>   
##  1 2022-01-17 00:00:00      3.40 2022-01 
##  2 2021-12-27 00:00:00      3.38 2021-12 
##  3 2021-11-29 00:00:00      3.48 2021-11 
##  4 2021-10-25 00:00:00      3.48 2021-10 
##  5 2021-09-27 00:00:00      3.27 2021-09 
##  6 2021-08-30 00:00:00      3.24 2021-08 
##  7 2021-07-26 00:00:00      3.23 2021-07 
##  8 2021-06-28 00:00:00      3.18 2021-06 
##  9 2021-05-31 00:00:00      3.12 2021-05 
## 10 2021-04-26 00:00:00      2.96 2021-04 
## # … with 255 more rows

Prophet Model

The time series data is divided into two parts: train data (from January 2000 to December 2014) and test data (from January 2015 to January 2021). The prophet model is then built based on the train data and forecast. The future forecast period units have been adjusted to reflect the fact that this series is collected at the month level.

prophet_fueldata = Fuel_data %>% 
    rename(ds = Year,
    y = Avg_price)  

train = prophet_fueldata %>% 
  filter(ds<ymd("2015-01-01"))

test = prophet_fueldata %>%
  filter(ds>=ymd("2015-01-01"))

model = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = 85, freq = 'month')
forecast = predict(model,future)

plot(model,forecast)+
ylab("Average Fuel Price")+xlab("Date")+labs(title = 'Forecast: Average Fuel Price',subtitle = 'January 2000 - December 2021')+theme_bw()

Time series Components plots

prophet_plot_components(model,forecast)

We can see taht there’s an upward trend in the Jan-Mid of March period and then there’s a fall in the April followed by few fluctuations till July then during Fall it dips and rises afterwards.

Finding Saturation points

After finding the summaries of both the test and train data, we need to find the saturation points which means the threshold of the price which is reasonable as seen in the last 20 years of the data.

summary(train)
##        ds                            y           Month_yr        
##  Min.   :2000-01-31 00:00:00   Min.   :1.137   Length:180        
##  1st Qu.:2003-10-20 00:00:00   1st Qu.:1.686   Class :character  
##  Median :2007-07-12 12:00:00   Median :2.654   Mode  :character  
##  Mean   :2007-07-13 07:36:00   Mean   :2.560                     
##  3rd Qu.:2011-04-04 00:00:00   3rd Qu.:3.373                     
##  Max.   :2014-12-29 00:00:00   Max.   :4.146
summary(test)
##        ds                            y           Month_yr        
##  Min.   :2015-01-26 00:00:00   Min.   :1.870   Length:85         
##  1st Qu.:2016-10-31 00:00:00   1st Qu.:2.341   Class :character  
##  Median :2018-07-30 00:00:00   Median :2.589   Mode  :character  
##  Mean   :2018-07-28 02:32:28   Mean   :2.604                     
##  3rd Qu.:2020-04-27 00:00:00   3rd Qu.:2.857                     
##  Max.   :2022-01-17 00:00:00   Max.   :3.478
#setting floor for train data
train$floor = 1.13
train$cap = 4.14

#setting floor for forecast data

future$floor = 1.8
future$cap = 3.4
sat_model = prophet(train,growth='logistic')
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
sat_forecast = predict(sat_model,future)

plot(sat_model,sat_forecast)+ylim(0,4.5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel ',subtitle = 'Saturation Points')+theme_bw()

We set these saturation thresholds for the prophet model, and the resulting figure shows that the predicted values are substantially different from the previous forecast, and that the increasing trend is not as sharp as it was before the limitations were established.

Examination for Change points

model = prophet(train, n.changepoints = 10)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = 85, freq = 'month')
plot(model,forecast)+add_changepoints_to_plot(model)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Prices',subtitle = 'January 2000 - December 2021')+theme_bw()

These are the two change points observed in the Time series.

Chekcing the Type of Seasonilty

additive = prophet(train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
add_fcst = predict(additive,future)

plot(additive,add_fcst)+
ylim(0,5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel',subtitle = 'Additive Seasonality Model')+theme_bw()

prophet_plot_components(additive,add_fcst)

The additive model does not appear to accurately fit the increased seasonal changes in average price values towards the end of the training period, although it performs well in the beginning.

multi = prophet(train,seasonality.mode = 'multiplicative')
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(0,5) +theme_bw()+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Fuel',subtitle = 'Multiplicative Seasonality Model')

prophet_plot_components(multi,multi_fcst)

Both the models seem to have the same trends.

Model Assessment

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  filter(ds>=ymd("2015-01-01"))
## Warning in base::check_tzones(e1, e2): 'tzone' attributes are inconsistent
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 1.85"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 1.76"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.71"

RMSE for the above model is ‘1.85’, whereas MAE for the same model is slightly lower which is ‘1.76’. The Mean Absolute percentage Error is 71%

Cross Validation

Model cross-validation will be performed by making the appropriate adjustments to incorporate monthly data. Setting the initial training time to 5 years and the rolling horizon to 1 year appears to produce relatively low error metrics after a variety of combinations.

df.cv <- cross_validation(model ,horizon=365,period = 365,initial = 5*365,units='days')
## Making 9 forecasts with cutoffs between 2005-12-31 and 2013-12-29
head(df.cv)
## # A tibble: 6 × 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  2.40 2006-01-30 00:00:00  2.34       2.19       2.46 2005-12-31 00:00:00
## 2  2.30 2006-02-27 00:00:00  2.40       2.27       2.52 2005-12-31 00:00:00
## 3  2.54 2006-03-27 00:00:00  2.53       2.40       2.67 2005-12-31 00:00:00
## 4  2.96 2006-04-24 00:00:00  2.52       2.39       2.65 2005-12-31 00:00:00
## 5  2.91 2006-05-29 00:00:00  2.66       2.52       2.79 2005-12-31 00:00:00
## 6  2.91 2006-06-26 00:00:00  2.68       2.55       2.81 2005-12-31 00:00:00
df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Avg Price")+
  scale_color_discrete(name = 'Cutoff')+
  theme(legend.position = "None")+
  labs(
    title = 'Forecast: Average Price of Fuel',
    subtitle = 'Cross Validation'
  )

plot_cross_validation_metric(df.cv, metric = 'rmse')

plot_cross_validation_metric(df.cv, metric = 'mape')

The rolling window predictions are compared to actuals in the cross validation result plot. For a one-year forecast horizon, the RMSE is between 0.25 and 0.75. The average absolute percent inaccuracy varies from 10 to 25%.

According to the results, the ARIMA model appears to have fewer errors for predictions in the foreseeable term, whereas larger errors in later periods are similar to the Prophet model. For near-future predictions, Prophet has higher errors, but the variability in the error is lower. We can observe that the ARIMA outperforms the Prophet for this time series based on the RMSE and the data above.

Conclusion