Forecasting with Prophet

Author

Lasya Pinnamaneni

Read library

Code
suppressWarnings({
  suppressMessages({
library(dplyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(forecast)
library(modeltime)
library(fabletools)
library(feasts)
library(tsibble)
library(tibble)
library(fable)
library(data.table) 
library(knitr) 
library(prophet)  
library(fable.prophet)
library(tidyr)


options(warn=-1)
})
})

Read data and convert to time series format

Code
vehicle_df <- read.csv('/Users/lasyapinnamaneni/Desktop/MS BANA/Time series/total_vehicle_sales.csv')

 vehicle_sales_tbl_ts <- vehicle_df %>% select(date, vehicle_sales) %>%
  mutate(value = vehicle_sales) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)

Section 1

Fit and assess a Facebook Prophet model to the time series you have been assigned. Briefly discuss the methodology behind the Prophet model as it applies to time-series forecasting. An initial model can be specified using the default parameters.

Train Test Split

Code
# Set the seed for reproducibility
set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
vehicle_train <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
vehicle_test <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)

# Visualize the training and test sets
ggplot() +
  geom_line(data = vehicle_train, aes(x = date, y = value, color = "Training"), linewidth = 1) +
  geom_line(data = vehicle_test, aes(x = date, y = value, color = "Test"), linewidth = 1) +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
  theme_minimal()

Forecast using prophet model

Code
vehicle_train %>%
    model(prophet = fable.prophet::prophet(vehicle_sales)) %>%
    forecast(h=115) %>%
    autoplot(vehicle_train %>% bind_rows(vehicle_test))+
    ylab('vehicle sales')+
    xlab('Month')+
    theme_bw()

PROPHET

  • Developed by Facebook around 2017

  • Widely used in business/data science for a number of different forecasting problems

  • Key benefits:

    • Handles multiple types of seasonality

    • Easy to implement

    • Very flexible

    • Fast

  • Works best on daily data, but suitable for all types of time series

  • Not especially sensitive to outliers

The algorithm draws from time-series decomposition, breaking down the time-series into:

  • Seasonal components: Daily, weekly, monthly, yearly, etc.

  • Holidays: For daily data

  • Trend: Estimated along the data with unique slopes identified using changepoint detection

The basic model is fit as:

yt = gt + st +ht + et

where gts the trend, stis seasonality, and ht are holidays.

Section 2

Decompose and visualize the elements of the time-series (trend, seasonality, etc.) as identified by your initial model. Examine the changepoints identified for the “trend” part of the time series.

Code
model = vehicle_train %>%
    model(prophet = fable.prophet::prophet(vehicle_sales))

model %>%
components() %>%
autoplot()

The decomposition of the time series data reveals an upward trend until early 2000, followed by a downward slope thereafter. Additionally, an additive seasonality component is evident, with no indication of any multiplicative seasonality present.

Do these detected changepoints make sense for your time-series? If not, adjust the hyperparameters of the model by specifying more or fewer changepoints (n_changepoints), or by changing the proportion of the time-series through which changepoints can be identified (changepoint_range), or the prior scale (changepoint_prior_scale). You should also assess whether a linear or logistic trend makes sense for your time-series.

Code
changepoints = model %>%
glance() %>%
pull(changepoints) %>%
bind_rows() %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_vline(xintercept=as.Date(changepoints),color='red',linetype='dashed')

n_changepoints

Code
model2 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 25) + season(period = 'year') + season(period = 'month', order = 12)),
        prophet_10 = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 10) + season(period = 'year') + season(period = 'month', order = 12))
        )


changepoints_orig = model2 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

changepoints_10 = model2 %>%
glance() %>%
filter(.model == 'prophet_10') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
# geom_vline(aes(xintercept=ymd('2000-01-01')))
geom_vline(xintercept=as.Date(changepoints_orig),color='red')+
geom_vline(xintercept=as.Date(changepoints_10),color='blue',linetype='dashed')

changepoint_range

Code
model1 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_90_range = fable.prophet::prophet(vehicle_sales~growth(changepoint_range=0.9)+season(period='year')+ season(period = 'month', order = 12)),

        )

changepoints_orig = model1 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

changepoints_90_range = model1 %>%
glance() %>%
filter(.model == 'prophet_90_range') %>%
unnest(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

model1 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_90_range),color='red',linetype='dashed')

changepoint_prior_scale

Code
model3 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_less_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.01)+season(period='year')),
        prophet_more_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.5)+season(period='year'))
        )

changepoints_orig = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>% 
filter(.model == 'prophet_orig') %>%
filter(abs(adjustment)>=0.05) %>%
.$changepoints

changepoints_more_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_more_flexible') %>%
filter(abs(adjustment)>=0.50) %>%
.$changepoints

changepoints_less_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_less_flexible') %>%
filter(abs(adjustment)>=0.01) %>%
.$changepoints

model3 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')

changepoints_less_flexible shows superior forecast predictions compared to other changepoint options, indicating that a less flexible approach may better capture the underlying patterns in the data. This suggests that simpler models with fewer changepoints may provide more reliable forecasts.

Finally, assess whether the model should take into account a saturating minimum/maximum point - if so, specify a floor and cap.

Code
model3 %>%
components() %>%
autoplot(trend)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')

Code
vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_saturating = fable.prophet::prophet(vehicle_sales~growth(type='linear',capacity=3.7,floor=2.8)+season('year'))
        ) %>%
    forecast(h=68) %>%
    autoplot(vehicle_train)


Despite adjusting the floor and capacity parameters significantly, the resulting forecasts show minimal deviation from the base Prophet model. This indicates that while these parameters theoretically influence the model’s behavior, their practical impact on forecasting accuracy may be limited in this particular dataset. Further exploration and experimentation might be warranted to better understand their effect.

Section 3

Discuss any seasonality identified by the model (daily, weekly, yearly, etc.). Does your time series seem to contain additive or multiplicative seasonality? If appropriate, and examining daily data, assess whether holidays should be included in your model. If there is no seasonality, ensure that your model is specified without seasonality.

Code
model %>%
  components()
# A dable: 460 x 10 [1M]
# Key:     .model [1]
# :        vehicle_sales = trend * (1 + multiplicative_terms) + additive_terms
#   + .resid
   .model      date vehicle_sales additive_terms multiplicative_terms trend
   <chr>      <mth>         <dbl>          <dbl>                <dbl> <dbl>
 1 prophet 1976 Jan          885.          -36.3                    0  929.
 2 prophet 1976 Feb          995.           59.3                    0  929.
 3 prophet 1976 Mar         1244.          337.                     0  929.
 4 prophet 1976 Apr         1191.          235.                     0  930.
 5 prophet 1976 May         1203.          319.                     0  930.
 6 prophet 1976 Jun         1255.          300.                     0  930.
 7 prophet 1976 Jul         1162.          226.                     0  931.
 8 prophet 1976 Aug         1026.          181.                     0  931.
 9 prophet 1976 Sep         1058.          131.                     0  932.
10 prophet 1976 Oct         1129.          140.                     0  932.
# ℹ 450 more rows
# ℹ 4 more variables: yearly <dbl>, weekly <dbl>, daily <dbl>, .resid <dbl>
Code
yearly_seasonality = model %>%
components() %>%
autoplot(yearly)

yearly_seasonality 

Code
model4 = vehicle_train %>%
    model(
      additive = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='additive')),
      multiplicative = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='multiplicative')))

model4 %>%
components() %>%
autoplot()

The decomposition graph and table indicate the presence of additive seasonality in the time series, as opposed to multiplicative seasonality. This suggests that the seasonal variations in the data have a constant magnitude throughout the series, rather than growing or shrinking proportionally with the trend.

Section 4

Validate your model using the techniques discussed in Assignment 4, including dividing your sample into a training and test set. You should conduct a rolling window cross-validation to assess performance of the model at meaningful thresholds. Use tables/visualizations as appropriate to assess the model across various metrics (RMSE, MAE, MAPE, etc.). You are encouraged to compare models with different parameters (e.g. additive/multiplicative seasonality, linear vs logistic growth, number of changepoints, etc.).

Finally, assess the performance of the best-performing model on the test set. How does the model perform? Does it seem to overfit the training data? If so, how might you adjust the model to improve performance?

Code
vehicle_train %>%
bind_rows(vehicle_test) %>%
    ggplot()+
    geom_line(aes(date, vehicle_sales))+
    geom_vline(aes(xintercept=ymd('2014-05-01')),color='red')+
  annotate("text", x = as.Date("1976-01-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Train", color = "blue", size = 4) +
  annotate("text", x = as.Date("2023-11-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Test", color = "blue", size = 4) +
  labs(x = "Week", y = "vehicle sales", title = "Training and Testing Dataset") +
    theme_bw()

Code
vehicle_train_data <- vehicle_train %>%
  stretch_tsibble(.init = 22*12, .step = 12)

vehicle_train_data %>%
    ggplot()+
    geom_point(aes(date,factor(.id),color=factor(.id)))+
    ylab('Iteration')+
    ggtitle('Samples included in each CV Iteration')

Code
vehicle_train_cv_forecast <- vehicle_train_data %>%
  model(
    arima = ARIMA(vehicle_sales~pdq(2,0,2)),
    prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year')),
    naive = NAIVE(vehicle_sales~drift())
  ) %>%
  forecast(h=12)


vehicle_train_cv_forecast %>%
  as_tsibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(
      vehicle_train
  ) %>%
    ggplot()+
    geom_line(aes(date,vehicle_sales))+
    geom_line(aes(date,.mean,color=factor(.id),linetype=.model))+
    scale_color_discrete(name='Iteration')+
    ylab('Ridership Count')+
    theme_bw()
Joining with `by = join_by(date)`

Code
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:12)
  ) %>%
  ungroup() %>%
  mutate(error = abs(vehicle_sales - .mean)) %>%
  ggplot()+
  geom_boxplot(aes(factor(months_ahead),error))+
  facet_wrap(~.model,ncol=1)+
  guides(color='none')+
  ylab('Absolute Error')+
  xlab('Days Ahead')+
  ggtitle('Absolute Error by Iteration')+
  ylim(0,600)
Joining with `by = join_by(date)`

Code
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>% 
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:12)
  ) %>%
  ungroup() %>% 
  filter(!is.na(vehicle_sales)) %>%
  group_by(months_ahead,.model) %>%
  summarize(
    rmse = sqrt(mean((vehicle_sales - .mean)^2,na.rm=T)),
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,rmse,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,rmse,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed RMSE")
Joining with `by = join_by(date)`
`summarise()` has grouped output by 'months_ahead'. You can override using the
`.groups` argument.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:12)
  ) %>%
  ungroup() %>%
  group_by(months_ahead,.model) %>%
  mutate(
    mape = mean(abs((vehicle_sales - .mean)/vehicle_sales),na.rm=T)
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,mape,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,mape,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed MAPE")
Joining with `by = join_by(date)`
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
vehicle_train_cv_forecast %>%
  accuracy(vehicle_train)
# A tibble: 3 × 10
  .model  .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima   Test   -9.43  117.  87.7 -1.61  7.20 0.784 0.794 0.463
2 naive   Test  -42.5   192. 149.  -5.41 12.4  1.33  1.30  0.444
3 prophet Test  -25.5   219. 171.  -4.15 14.9  1.53  1.48  0.804

While the forecast model effectively captures the variation and trend of the data, it falls short in accurately capturing the magnitude of the data. This suggests that while the model tracks the overall patterns well, it may underestimate the actual values, impacting its performance on the test set. ARIMA model has lowest error rate among all the models when compared with RMSE or MAPE

Code
vehicle_train_model <- vehicle_train %>%
  model(
    prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year', type = 'additive'))
  )
Code
vehicle_train_model %>%
    forecast(h=115) %>%
    autoplot(vehicle_train %>%
    bind_rows(vehicle_test))+
    ylab('vehicle sales')+
    theme_bw()

Code
# Produce forecasts for the training set
train_forecast <- vehicle_train_model %>%
  forecast(new_data = vehicle_train)

# Visualize actual versus forecasted values for the training set
autoplot(train_forecast) +
  autolayer(vehicle_train) +
  labs(title = "Actual vs Forecasted Values for Training Set", y = "vehicle sales", x = "Date") +
  theme_minimal()
Plot variable not specified, automatically selected `.vars = vehicle_sales`

When plotting the fitted values against the actual values, it’s apparent that the model exhibits low bias, suggesting a good fit to the training data. However, upon comparing the forecast with the test data, it becomes evident that the test forecast demonstrates low variance, indicating a potential overfitting issue. This discrepancy suggests that while the model captures the underlying patterns well, it may struggle to generalize to unseen data, warranting further investigation and potential adjustments to address the overfitting problem.