Time Series Analysis of Average Fuel Price, Part 3: Prophet Model

The Data

The dataset used for this time series analysis is the U.S. city average gasoline price dataset, sourced from the Federal Reserve Economic Data (FRED), St. Louis, and the United States Bureau of Labor Statistics. It contains monthly average regular unleaded gasoline prices from January 1976 to December 2021. The two fields in the data are the date of observation and the average price.

When I first came across this dataset on the FRED website, I was impressed with the amount of historical data recorded. I moved here from India, where I drove for 10 years, and I plan to start driving in the US as soon as possible, so analyzing the trend of gasoline prices seemed like an interesting choice.

Fuel price fluctuations are common across the world. The concern arises when rates increase considerably and consistently over a short time period. The most commonly cited reasons for price fluctuations include market competition, taxes and transportation fees, inflation, and supply and demand.

The goal of this analysis, as of now, is to understand how gasoline prices have fluctuated over the years and to: (i) determine whether prices follow patterns based on the time of year, (ii) determine whether the fluctuations can be attributed to any obvious external socio-economic factors, and (iii) build a forecast model and assess its performance.

The dataset is continuous and has no gaps in its monthly data points. Forecasting prices far into the future may be difficult, since unforeseen changes in economic or social factors could affect prices. However, forecasts for the near future should be closer to actuals.

For the purpose of this study, we will be considering data from January 2000 to December 2021.

We will attempt to fit a Prophet model to the time series, explore seasonality and trend, and assess the model fit and performance through error metrics.

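# Load the required packages and the dataset
library(tidyverse)   # read_csv, dplyr verbs, ggplot2
library(lubridate)   # ymd
library(prophet)

fueldata_raw <- read_csv("/Users/samishav/Documents/Spring 2022/Forecasting & Time Series/Forecasting/APU000074714.csv")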
summary(fueldata_raw)

##       DATE             APU000074714  
##  Min.   :1976-01-01   Min.   :0.592  
##  1st Qu.:1987-06-23   1st Qu.:1.119  
##  Median :1998-12-16   Median :1.323  
##  Mean   :1998-12-16   Mean   :1.767  
##  3rd Qu.:2010-06-08   3rd Qu.:2.470  
##  Max.   :2021-12-01   Max.   :4.090

# Filter the study period (2000-2021), rename columns, and remove any duplicate entries per month.

fueldata <- fueldata_raw %>%
  filter(DATE >= ymd("2000-01-01")) %>%
  rename(obs_date = "DATE", avg_price = "APU000074714") %>%
  mutate(month_year = format(obs_date, "%Y-%m")) %>%
  distinct(month_year, .keep_all = TRUE)
as_tibble(fueldata)

## # A tibble: 264 × 3
##    obs_date   avg_price month_year
##    <date>         <dbl> <chr>     
##  1 2000-01-01      1.30 2000-01   
##  2 2000-02-01      1.37 2000-02   
##  3 2000-03-01      1.54 2000-03   
##  4 2000-04-01      1.51 2000-04   
##  5 2000-05-01      1.50 2000-05   
##  6 2000-06-01      1.62 2000-06   
##  7 2000-07-01      1.59 2000-07   
##  8 2000-08-01      1.51 2000-08   
##  9 2000-09-01      1.58 2000-09   
## 10 2000-10-01      1.56 2000-10   
## # … with 254 more rows

Behaviour of the Series

# Trend chart
monthly_price_plot = ggplot(fueldata)+
  geom_line(aes(obs_date,avg_price))+
  theme_bw()+
  xlab("Date")+
  ylab("Average Price")+
  labs(
    title = 'Average Price of Unleaded Regular Gasoline, FRED',
    subtitle = 'January 2000 - December 2021'
  )

#monthly_price_plot

monthly_price_plot + 
  geom_smooth(aes(obs_date,avg_price),method='lm',color='orange')

Prophet Model

We split the time series data into train data, from January 2000 to December 2014, and test data, from January 2015 to December 2021. We then build the Prophet model on the train data and forecast. Since this series is recorded at month level, the frequency of the future forecast periods is set to months.
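As a quick sanity check on the split described above, the following base R sketch counts the monthly steps in the test window. The variable names are illustrative and not part of the original analysis. It suggests that the test window holds 84 months, so a forecast of 85 monthly periods from December 2014 would extend one month past the last observation.

```r
# Sketch: count the monthly observations in the test window (illustrative names)
test_months <- seq(as.Date("2015-01-01"), as.Date("2021-12-01"), by = "month")

n_periods <- length(test_months)   # monthly forecasts needed to cover the test set
print(n_periods)
print(range(test_months))
```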

prophet_fueldata = fueldata %>% 
    rename(ds = obs_date,
    y = avg_price)  

train = prophet_fueldata %>% 
  filter(ds<ymd("2015-01-01"))

test = prophet_fueldata %>%
  filter(ds>=ymd("2015-01-01"))

model = prophet(train)
future = make_future_dataframe(model,periods = 85, freq = 'month')
forecast = predict(model,future)

plot(model,forecast)+
ylab("Average Fuel Price")+xlab("Date")+labs(title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED',subtitle = 'January 2000 - December 2021')+theme_bw()

The model has a very steep upward trend, following the trend of the original time series. Since the model uses the default additive seasonality, it fails to capture some of the higher fluctuations from 2005 onward. Judging from the light blue band, which shows the variability in the predicted values, the maximum average price at the end of the forecast period is estimated to be above $5, well above the test-period maximum of about $3.48 in the actual data. We may need to tune the seasonality of the model to get forecast behaviour closer to reality. The actual and predicted values can be examined more closely with an interactive plot.

Time Series Components

We plot the components of the series separately. Since the data is at a monthly level, no weekly seasonality information is available. We also cannot incorporate holiday information, as it does not apply to data at this level.

prophet_plot_components(model,forecast)

The trend component shows a general upward trend in the average price. The yearly seasonality reveals that the average price of gasoline tends to increase around March and October, in anticipation of increased demand in summer and after the onset of fall. The average price dips most from January to February and from August to September.

Determination of Saturation Points

We look at summary statistics of the train and test data to determine whether there is a reasonable limit within which fuel prices tend to stay. We have 22 years of data in total (2000 to 2021), which should give a reliable estimate of the range of the fuel price. We will compare the forecast fuel price and the trend of the forecast to determine whether we need to apply minimum and maximum saturation points for more realistic estimates.

# Using summary statistics to determine the range of values for train and test data
summary(train)

##        ds                   y          month_year       
##  Min.   :2000-01-01   Min.   :1.130   Length:180        
##  1st Qu.:2003-09-23   1st Qu.:1.655   Class :character  
##  Median :2007-06-16   Median :2.591   Mode  :character  
##  Mean   :2007-06-16   Mean   :2.530                     
##  3rd Qu.:2011-03-08   3rd Qu.:3.323                     
##  Max.   :2014-12-01   Max.   :4.090

summary(test)

##        ds                   y          month_year       
##  Min.   :2015-01-01   Min.   :1.767   Length:84         
##  1st Qu.:2016-09-23   1st Qu.:2.240   Class :character  
##  Median :2018-06-16   Median :2.474   Mode  :character  
##  Mean   :2018-06-16   Mean   :2.513                     
##  3rd Qu.:2020-03-08   3rd Qu.:2.775                     
##  Max.   :2021-12-01   Max.   :3.482

# Set the floor and cap in the training data
train$floor = 1.10
train$cap = 4.10

# Set the floor and cap in the forecast data
future$floor = 1.70
future$cap = 3.5
sat_model = prophet(train,growth='logistic')
sat_forecast = predict(sat_model,future)

plot(sat_model,sat_forecast)+ylim(0,4.5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED ',subtitle = 'Saturation Points')+theme_bw()

The summary statistics reveal that prices stay between about $1.1 and $4.1 in the training period, while the range is tighter, between about $1.7 and $3.5, in the test data for the forecast period. We therefore set these saturation thresholds for the Prophet model. The resulting plot shows a forecast that differs markedly from the earlier one and does not trend upward as steeply as it did before the limits were set. Note that the forecast-period bounds are taken from the test data, so this forecast is not strictly out-of-sample.
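To illustrate why a floor and a cap keep the trend bounded, here is a minimal base R sketch of a logistic trend curve. It is a simplified form without changepoints, and the growth rate `k` and midpoint `m` are made-up values rather than anything fitted by Prophet.

```r
# Simplified logistic trend with a floor and a cap (no changepoints).
# k (growth rate) and m (midpoint, in months) are illustrative, not fitted values.
logistic_trend <- function(t, floor, cap, k, m) {
  floor + (cap - floor) / (1 + exp(-k * (t - m)))
}

t <- 0:263   # month index over a 22-year span
trend <- logistic_trend(t, floor = 1.10, cap = 4.10, k = 0.03, m = 90)

# The curve rises monotonically but never leaves the (floor, cap) interval
print(round(range(trend), 3))
```

However steep the growth rate, a curve of this shape flattens as it approaches the cap, which is consistent with the saturated forecast no longer showing the steep upward trend.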

Examination for Changepoints

We will examine the series for significant changepoints in the trend, using the training data.

model = prophet(train, n.changepoints = 10)
future = make_future_dataframe(model,periods = 85, freq = 'month')
# Predict with the refitted model so that the plot shows its own forecast
cp_forecast = predict(model,future)
plot(model,cp_forecast)+add_changepoints_to_plot(model)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED',subtitle = 'January 2000 - December 2021')+theme_bw()

We observe that only two significant changepoints have been identified for this series, indicating where the trend changes drastically.
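The idea of a changepoint can be sketched as a piecewise-linear trend whose slope shifts at given points in time. The base R example below is a simplified illustration with two hypothetical changepoints; the function name and all values are assumptions for demonstration, not output from the fitted model.

```r
# Piecewise-linear trend: the slope changes by delta[j] at changepoint s[j].
# k = base slope, m = base offset; all values below are hypothetical.
piecewise_trend <- function(t, k, m, s, delta) {
  sapply(t, function(ti) {
    active <- s <= ti                                # changepoints already passed
    slope  <- k + sum(delta[active])                 # slope after the adjustments
    offset <- m - sum(s[active] * delta[active])     # keeps the line continuous
    slope * ti + offset
  })
}

t <- 0:10
trend <- piecewise_trend(t, k = 1, m = 0, s = c(4, 8), delta = c(-0.5, -1))
print(trend)
```

The offset adjustment is what prevents a jump at each changepoint: only the slope changes, so the trend bends rather than breaks.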

Type of Seasonality - Additive or Multiplicative?

additive = prophet(train)
add_fcst = predict(additive,future)

plot(additive,add_fcst)+
ylim(0,5)+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED',subtitle = 'Additive Seasonality Model')+theme_bw()

prophet_plot_components(additive,add_fcst)

The additive model does not seem to closely fit the higher seasonal fluctuations in average price values towards the later part of the training period but behaves well for the earlier time frame.

multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(0,5) +theme_bw()+xlab("Date")+ylab("Avg Price")+labs(title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED ',subtitle = 'Multiplicative Seasonality Model')

prophet_plot_components(multi,multi_fcst)

The multiplicative model visibly seems to capture the higher fluctuations towards the end of the period but does not fit the earlier period with lower fluctuations well.

From visual assessment, the additive model seems likely to perform better overall, especially since we are not able to conclude with certainty on the existence of multiplicative seasonality.
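The difference between the two seasonality modes can be seen on synthetic data. The sketch below uses a made-up rising trend and a made-up yearly cycle: in the additive case the seasonal swing stays constant, while in the multiplicative case it grows with the level of the trend.

```r
# Synthetic example (all numbers are illustrative, not from the fuel data)
months   <- 0:47
trend    <- 1.5 + 0.05 * months                 # rising trend
seasonal <- 0.10 * sin(2 * pi * months / 12)    # yearly cycle

additive       <- trend + seasonal              # seasonal effect added to the trend
multiplicative <- trend * (1 + seasonal)        # seasonal effect scales with the trend

# Seasonal swing (max minus min of the detrended series) within each year
swing <- function(y) {
  as.numeric(tapply(y - trend, months %/% 12, function(x) diff(range(x))))
}

add_swing  <- swing(additive)
mult_swing <- swing(multiplicative)
print(round(add_swing, 3))
print(round(mult_swing, 3))
```

If the seasonal fluctuations in a real series widen as the level rises, that points toward multiplicative seasonality; a roughly constant swing points toward additive.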

Model Assessment

We calculate the error metrics to assess model performance.

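# predict() returns ds as a date-time, so convert it to Date and keep only
# the months present in the test set (the forecast runs one month longer)
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  filter(as.Date(ds) %in% test$ds)
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))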
print(paste("RMSE:",round(RMSE,2)))

## [1] "RMSE: 1.81"

print(paste("MAE:",round(MAE,2)))

## [1] "MAE: 1.77"

print(paste("MAPE:",round(MAPE,2)))

## [1] "MAPE: 0.73"

The RMSE for the fitted model is 1.81 and the MAE is slightly lower at 1.77, both in dollars. The mean absolute percentage error is quite high at 73%. Because the average price values are small (less than $5), even a half-dollar difference in the predicted price translates into a large percentage error.
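The effect of small price values on the percentage error can be shown with a worked example. The prices below are made up for illustration: every forecast is off by exactly 50 cents, which gives an RMSE and MAE of 0.50 but a MAPE of roughly 20% because the actual prices are near $2.50.

```r
# Worked example with made-up prices (USD)
actual    <- c(2.50, 2.40, 2.60, 2.50)
predicted <- actual + 0.50            # every forecast is 50 cents too high

rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))
mape <- mean(abs((actual - predicted) / actual))

print(c(RMSE = rmse, MAE = mae, MAPE = round(mape, 3)))
```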

Cross Validation

We perform model cross-validation, adjusting the settings to accommodate monthly data. After trying multiple combinations, an initial training period of 5 years with a rolling horizon of 1 year gave reasonably low error metrics.

df.cv <- cross_validation(model ,horizon=365,period = 365,initial = 5*365,units='days')
head(df.cv)

## # A tibble: 6 × 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  2.32 2006-01-01 00:00:00  2.33       2.19       2.47 2005-12-03 00:00:00
## 2  2.31 2006-02-01 00:00:00  2.42       2.27       2.56 2005-12-03 00:00:00
## 3  2.40 2006-03-01 00:00:00  2.56       2.42       2.70 2005-12-03 00:00:00
## 4  2.76 2006-04-01 00:00:00  2.65       2.51       2.79 2005-12-03 00:00:00
## 5  2.95 2006-05-01 00:00:00  2.66       2.51       2.80 2005-12-03 00:00:00
## 6  2.92 2006-06-01 00:00:00  2.66       2.51       2.81 2005-12-03 00:00:00

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Avg Price")+
  scale_color_discrete(name = 'Cutoff')+
  theme(legend.position = "None")+
  labs(
    title = 'Forecast: Average Price of Unleaded Regular Gasoline, FRED ',
    subtitle = 'Cross Validation'
  )

plot_cross_validation_metric(df.cv, metric = 'rmse')

plot_cross_validation_metric(df.cv, metric = 'mape')

The cross validation result plot reveals how the rolling window predictions compare against actuals. The RMSE ranges between 0.5 and 0.6 for the forecast horizon of 1 year. The mean absolute percent error ranges between 12 and 22 percent.
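The rolling cutoffs used in the cross-validation above can be approximated by hand. The base R sketch below is a simplified reconstruction of the scheme, not Prophet's internal code: it steps back from the last training date by the horizon, then by the period, and stops once less than the initial window of history would remain. Under these assumptions the earliest cutoff falls on 2005-12-03, which agrees with the cutoff column in the output shown earlier.

```r
# Simplified sketch of rolling-origin cutoffs (not Prophet's internal implementation)
first_obs <- as.Date("2000-01-01")   # first training observation
last_obs  <- as.Date("2014-12-01")   # last training observation
initial   <- 5 * 365                 # days of history required before the first cutoff
period    <- 365                     # spacing between cutoffs, in days
horizon   <- 365                     # forecast horizon, in days

latest_cutoff    <- last_obs - horizon
earliest_allowed <- first_obs + initial

# Step backwards from the latest cutoff in steps of `period` days
cutoffs <- rev(seq(latest_cutoff, earliest_allowed, by = -period))
print(cutoffs)
```

Because the step is a fixed 365 days, the cutoffs drift by a day each time a leap day is crossed, which is why they do not land on the first of the month.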

Comparison of Additive versus Multiplicative Fit for Seasonality

mod1 = prophet(train,seasonality.mode='additive')
forecast1 = predict(mod1)
df_cv1 <- cross_validation(mod1, horizon=365,period = 365,initial = 5*365,units='days')
metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'mod1')

mod2 = prophet(train,seasonality.mode='multiplicative')
forecast2 = predict(mod2)
df_cv2 <- cross_validation(mod2, horizon=365,period = 365,initial = 5*365,units='days')
metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "mod2")
metrics1 %>% 
  bind_rows(metrics2) %>% 
  ggplot()+
  geom_line(aes(horizon,rmse,color=model))+
  theme_bw()+
  labs(
    title = 'RMSE Comparison for Additive and Multiplicative Model Fits',
    subtitle = 'Data: Average Price of Unleaded Regular Gasoline, FRED'
  )

The forecast RMSE is generally much higher for the multiplicative model, which further supports the earlier indication that the seasonality in the average fuel price series is additive.

The Prophet model provides an acceptable fit for the time series data, even though the data is monthly. In cross-validation over a one-year horizon, the model has acceptable ranges of mean absolute percent error and RMSE. It also allows for a qualitative and quantitative assessment of the seasonality effect, thus enabling more accurate forecasts.