Introduction

In this document I show my use of my own functions to specify a model to forecast a new market. I use wikipedia pageview statistics for their page “Monday”.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
source("Functions.R")
df <- read_csv("Data/First Market.csv")

Exploratory Data Analysis

First, I produce an interactive line graph of my data using eda_dygraph.

eda_dygraph(df)

Next, I evaluate possible seasonality patterns using eda_boxplot.

eda_boxplot(df, "Week")

There is obviously a weekly pattern, with Mondays being much higher than any other day.

eda_boxplot(df, "Month")

eda_boxplot(df, "Year")

It seems that their is not a strong monthly seasonality. There are a few outliers in August, so we will test for yearly seasonality, but it seems like it would not be a big factor.

eda_bogo(df)
## # A tibble: 1,668 x 5
##    ds             y wday  wday_mean   res
##    <date>     <dbl> <ord>     <dbl> <dbl>
##  1 2017-08-21  2447 Mon        716. 1731.
##  2 2019-08-12  2068 Mon        716. 1352.
##  3 2017-07-12  1061 Wed        375.  686.
##  4 2018-01-21   911 Sun        425.  486.
##  5 2017-08-22   893 Tue        426.  467.
##  6 2017-07-13   810 Thu        374.  436.
##  7 2018-01-22  1072 Mon        716.  356.
##  8 2015-09-28  1046 Mon        716.  330.
##  9 2017-05-01  1008 Mon        716.  292.
## 10 2017-10-23   997 Mon        716.  281.
## # ... with 1,658 more rows

It seems that there are a few outliers in the data, but since there is no way to find out a reason for that, I will leave them in for now.

Model Testing

Now that we have done some Exploratory Data Analysis, we will test some models. We will not need a monthly seasonality component or an external regressor dataframe since this time series is more simple than others I forecast for. We will test yearly seasonality and different transformations of our response variable.

m <- list(
  no_yearly = prophet(
    daily.seasonality = FALSE,
    weekly.seasonality = TRUE,
    yearly.seasonality = FALSE
  ),
  yearly = prophet(
    daily.seasonality = FALSE,
    weekly.seasonality = TRUE,
    yearly.seasonality = TRUE
  )
)

bind_rows(
  og = m %>% 
    map(fit.prophet, df) %>% 
    map_df(cv_df, .id = "model") %>% 
    cv_accuracy,
  sqrt = m %>% 
    map(fit.prophet, df %>% mutate(y = sqrt(y))) %>% 
    map_df(cv_df, function(x) x^2, .id = "model") %>% 
    cv_accuracy,
  log = m %>% 
    map(fit.prophet, df %>% mutate(y = log(y + 1))) %>% 
    map_df(cv_df, function(x) exp(x) - 1, .id = "model") %>% 
    cv_accuracy,
  .id = "transform"
) %>% 
  arrange(desc(cutoff), rmse)
## # A tibble: 132 x 10
##    transform cutoff model      me   mae  rmse    sd `rmse/sd` coverage
##    <chr>     <chr>  <chr>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 log       Total  year~   2.14   45.8  89.5 157.      0.571    0.736
##  2 sqrt      Total  year~   0.529  47.3  90.7 157.      0.579    0.772
##  3 og        Total  year~   0.794  50.7  93.5 157.      0.597    0.826
##  4 sqrt      Total  no_y~  -2.26   51.7  94.6 157.      0.604    0.768
##  5 log       Total  no_y~  -0.821  51.6  95.0 157.      0.606    0.717
##  6 og        Total  no_y~  -2.76   54.2  96.4 157.      0.615    0.822
##  7 log       2019-~ year~ -16.6    35.7  51.8  74.8     0.693    0.717
##  8 log       2019-~ no_y~ -15.1    38.5  56.3  74.8     0.753    0.767
##  9 sqrt      2019-~ year~ -24.8    44.8  67.1  74.8     0.896    0.783
## 10 sqrt      2019-~ no_y~ -25.6    46.8  70.9  74.8     0.948    0.783
## # ... with 122 more rows, and 1 more variable: width <dbl>

It seems that the log transformation of the response variable with a yearly seasonality component had the lowest root mean square error in the 21 cross validations, so we will use that model.

Prediction

m <- prophet(
  daily.seasonality = FALSE,
  weekly.seasonality = TRUE,
  yearly.seasonality = TRUE
) %>% 
  fit.prophet(df %>% mutate(y = log(y + 1)))

future <- tibble(ds = seq(last(df$ds) + 1, as_date("2020-04-30"), "day"))
fc <- predict(m, future)

fc %>% 
  transmute(ds = as_date(ds), yhat_lower, yhat, yhat_upper) %>% 
  mutate_if(is.numeric, function(x) exp(x) - 1) %>% 
  full_join(df, "ds") %>% 
  xts::xts(x = select(., -ds), order.by = .$ds) %>% 
  dygraphs::dygraph() %>% 
  dygraphs::dySeries("y", "Actual", "black") %>% 
  dygraphs::dySeries(c("yhat_lower", "yhat", "yhat_upper"), "Prediction", "blue") %>% 
  dygraphs::dyRangeSelector()