Data Manipulation and Extraction

# load the libraries used throughout
library(tidyverse)   # dplyr, tidyr, ggplot2
library(lubridate)   # date parsing (ymd)
library(prophet)
library(patchwork)   # combining ggplots with +

# pivot sales data from wide to long format
sales_long <- sales %>% 
  pivot_longer(cols = 6:ncol(sales),
               names_to = "Time",
               values_to = "sales",
               values_drop_na = TRUE) %>% 
  mutate(Month = ymd(Time)) %>%
  select(RegionName, Month, sales)

# extract sales for Seattle
seattle_sales <- sales_long %>% 
  filter(RegionName == "Seattle, WA") %>%  
  select(Month, sales)

# prophet requirements
prophet_data = seattle_sales %>% 
  rename(ds = Month, # Have to name our date variable "ds"
         y = sales)  # Have to name our time series "y"

# split data into train and test
train = prophet_data %>% 
  filter(ds<ymd("2021-01-01"))

test = prophet_data %>%
  filter(ds>=ymd("2021-01-01"))

Looking at the plot of sales through time, we can see that there have been a few changes in the trend of home sales in Seattle. Most noticeably, sales appear to have flattened in recent years. Some of this change can be attributed to COVID-19; however, the flattening appears to have begun before 2019. There may also be multiplicative seasonality present, as the peaks are getting taller and the troughs deeper over time (a quick numeric check of this follows the plot).

ggplot(seattle_sales, aes(x = Month, y = sales)) +
  geom_line(color = "blue") + 
  xlab("Year") + ylab("Sales (Count)") +
  ggtitle("Home Sales through the years") + theme_bw()

Modelling Sales with Prophet

# fit the prophet model on the training data
model = prophet(train)
# extend the data frame 12 months past the end of the training window
future = make_future_dataframe(model, periods = 12, freq = 'month')
forecast = predict(model, future)

dyplot.prophet(model,forecast)

The plot above shows that the Prophet model does a decent job of capturing the trend of sales through the years, and on visual inspection there is no obvious saturation that would call for setting a floor or cap (i.e., switching to logistic growth).
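
For reference, if the series did look like it was saturating, we could switch to logistic growth by supplying a cap column (and optionally a floor) on both the training and future data frames. A minimal sketch, with a purely illustrative cap of 10,000 sales:

# hypothetical sketch: logistic growth with a saturating cap
# (the cap of 10,000 sales is an assumption for illustration only)
train_capped <- train %>% mutate(cap = 10000)
capped_model <- prophet(train_capped, growth = 'logistic')
future_capped <- make_future_dataframe(capped_model, periods = 12, freq = 'month') %>% 
  mutate(cap = 10000)
capped_fcst <- predict(capped_model, future_capped)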

Decomposition of the time series components (trend, seasonality, etc.)

prophet_plot_components(model,forecast)

As Prophet is an additive model, it can capture the underlying trend and seasonality present in the time series. Since the sales data is a monthly series, the seasonal component shows the variation across the year.
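
Concretely, the data frame returned by predict() carries these components as columns. Since this is monthly data, yearly seasonality is the only seasonal term fitted, so for the additive model the point forecast should equal the trend plus the yearly effect:

# for this additive model, yhat is the sum of the trend and yearly components
forecast %>% 
  as_tibble() %>% 
  select(ds, trend, yearly, yhat) %>% 
  mutate(check = trend + yearly - yhat) %>%  # should be ~0 throughout
  head()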

The trend component agrees with our earlier assessment that sales have flattened in recent years.

The seasonality component shows an interesting pattern in the variation of sales: sales dip early in the year and peak around the summer.
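
To pin down which months those are, we can average the fitted yearly component by calendar month, giving a quick tabular summary of the seasonality panel above:

# average yearly seasonal effect by calendar month
forecast %>% 
  as_tibble() %>% 
  mutate(month = month(ds, label = TRUE)) %>% 
  group_by(month) %>% 
  summarise(seasonal_effect = mean(yearly), .groups = "drop") %>% 
  arrange(desc(seasonal_effect))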

Changepoints

plot(model, forecast) + add_changepoints_to_plot(model) +
  ylab("Sales") + xlab("Date") + theme_bw()

Here we can see the changepoints automatically identified by the prophet fitting procedure. It has identified nine changepoints, which cluster into two major periods where the trend shifted.
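
We can also read the changepoints off the model object instead of the plot. A hedged sketch: model$changepoints holds the candidate dates and model$params$delta the fitted rate adjustments (a one-row matrix under the default MAP fit); add_changepoints_to_plot only draws those whose adjustment exceeds a small threshold:

# candidate changepoint dates and their fitted rate adjustments
changepoint_effects <- tibble(
  changepoint = model$changepoints,
  rate_change = model$params$delta[1, ]
)
# keep the ones large enough to be drawn on the plot (default threshold 0.01)
changepoint_effects %>% filter(abs(rate_change) >= 0.01)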

Additive and multiplicative seasonality

Seasonality is an important component of the Prophet procedure, and the component plot above already showed it in play. Although the seasonality here looks multiplicative rather than additive, we will see that both in- and out-of-sample performance is similar under either mode.

# seasonality modes
# additive
additive = prophet(train, seasonality.mode = 'additive')
add_fcst = predict(additive, future)
add_p <- plot(additive, add_fcst) + 
  ylab("Sales") + xlab("Date") + ggtitle("Additive seasonality") +
  theme_bw()

# multiplicative
multi = prophet(train, seasonality.mode = 'multiplicative')
multi_fcst = predict(multi, future)
m_p <- plot(multi, multi_fcst) +
  ylab("Sales") + xlab("Date") + ggtitle("Multiplicative seasonality") +
  theme_bw()

# side-by-side comparison (patchwork)
add_p + m_p

The plots above show that the model with multiplicative seasonality performs similarly to the model with additive seasonality in sample.

# plot the additive forecast against the actual test data
forecast_plot_data_add = add_fcst %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))
a_test <- ggplot() +
  geom_line(data = test, aes(ds, y)) +
  geom_line(data = forecast_plot_data_add, aes(ds, yhat), color = 'red') +
  ylab("Sales") + xlab("Date") + ggtitle("Additive seasonality") +
  theme_bw()

# plot the multiplicative forecast against the actual test data
forecast_plot_data_multi = multi_fcst %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))
m_test <- ggplot() +
  geom_line(data = test, aes(ds, y)) +
  geom_line(data = forecast_plot_data_multi, aes(ds, yhat), color = 'red') +
  ylab("Sales") + xlab("Date") + ggtitle("Multiplicative seasonality") +
  theme_bw()

a_test + m_test

Out of sample as well, both models seem to perform similarly.
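
To put numbers behind the visual comparison, we can score both forecasts on the test window with the same metrics used later for the original model; score_forecast is a small helper written here just for this comparison:

# helper to score a forecast against the 2021 test window
score_forecast <- function(fcst, actual) {
  pred <- fcst %>% 
    as_tibble() %>% 
    mutate(ds = as.Date(ds)) %>% 
    filter(ds >= ymd("2021-01-01"))
  tibble(
    RMSE = sqrt(mean((actual$y - pred$yhat)^2)),
    MAPE = mean(abs((actual$y - pred$yhat) / actual$y))
  )
}
bind_rows(
  additive = score_forecast(add_fcst, test),
  multiplicative = score_forecast(multi_fcst, test),
  .id = "model"
)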

Cross Validation

We can also conduct rolling-window cross-validation with Prophet.

# rolling-window CV: ~2 years of initial training, cutoffs stepped 365 days apart, 365-day horizon
df.cv <- cross_validation(model, initial = 731, period = 365, horizon = 365, units = "days")
head(df.cv)
## # A tibble: 6 x 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  1733 2011-01-31 00:00:00 1194.       666.      1758. 2011-01-03 00:00:00
## 2  1794 2011-02-28 00:00:00  962.       400.      1475. 2011-01-03 00:00:00
## 3  2622 2011-03-31 00:00:00 2484.      1978.      3034. 2011-01-03 00:00:00
## 4  2641 2011-04-30 00:00:00 2990.      2445.      3545. 2011-01-03 00:00:00
## 5  2841 2011-05-31 00:00:00 3326.      2802.      3902. 2011-01-03 00:00:00
## 6  3329 2011-06-30 00:00:00 4036.      3508.      4608. 2011-01-03 00:00:00
df.cv %>% 
  ggplot() +
  geom_line(aes(ds, y)) +
  geom_point(aes(ds, yhat, color = factor(cutoff))) +
  theme_bw() +
  xlab("Date") +
  ylab("Home Sales (count)") +
  scale_color_discrete(name = 'Cutoff')

Cross-validation shows that the model has instances of high error early in the time series. These points sit around the changepoints identified earlier. However, Prophet corrects for this reasonably well as the cutoff moves later.
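
We can check this by computing the RMSE within each cutoff; if the early changepoints are driving the error, the earliest cutoffs should score worst:

# RMSE within each cross-validation cutoff
df.cv %>% 
  group_by(cutoff) %>% 
  summarise(rmse = sqrt(mean((y - yhat)^2)), .groups = "drop")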

Model performance

In-sample performance with cross validation

With Prophet we can plot the RMSE across the forecast horizon, aggregated over the cross-validation cutoffs.

p1<-plot_cross_validation_metric(df.cv, metric = 'rmse')
p2<-plot_cross_validation_metric(df.cv,metric = 'mae')
p1+p2

The RMSE increases slightly as the horizon grows but stays mostly flat. Since the RMSE doesn't vary much, the model doesn't perform substantially worse further into the future than it does early on. The MAE plot corroborates this.
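
The prophet package can also tabulate these metrics directly: performance_metrics() aggregates the cross-validation results over a rolling window of the horizon and returns RMSE, MAE, and MAPE, among others, per horizon:

# tabulated horizon-wise error metrics from the cross-validation results
cv_metrics <- performance_metrics(df.cv)
head(cv_metrics)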

plot_cross_validation_metric(df.cv,metric = 'mape')

The MAPE, if anything, declines as the horizon increases, reinforcing that model performance does not degrade at longer horizons.

Out-of-sample performance of original model

We can also get an understanding of how the model performs on the test data by looking at the same metrics.

# performance metrics on the test window
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))  # as a fraction, not a percentage
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 471.91"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 402.43"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.07"

On average, the model forecasts sales to within about 7% of the actual values (MAPE ≈ 0.07).

Out-of-sample performance

# plot the original model's forecast against the actual test data
forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds >= ymd("2021-01-01"))
forecast_not_scaled = ggplot() +
  geom_line(data = test, aes(ds, y)) +
  geom_line(data = forecast_plot_data, aes(ds, yhat), color = 'red') +
  ylab("Sales") + xlab("Date") + ggtitle("Out of sample performance") +
  theme_bw()
# zoom the y-axis so the test window is easier to compare
forecast_scaled = forecast_not_scaled + 
  ylim(2000, 10000)

forecast_scaled

Upon visual inspection, of the three models we have looked at so far, the original Prophet model seems to perform the best out of sample.