In this document I show my use of my own functions to specify a model to forecast a new market. I use wikipedia pageview statistics for their page “Monday”.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
source("Functions.R")
df <- read_csv("Data/First Market.csv")
First, I produce an interactive line graph of my data using eda_dygraph.
eda_dygraph(df)
Next, I evaluate possible seasonality patterns using eda_boxplot.
eda_boxplot(df, "Week")
There is obviously a weekly pattern, with Mondays being much higher than any other day.
eda_boxplot(df, "Month")
eda_boxplot(df, "Year")
It seems that their is not a strong monthly seasonality. There are a few outliers in August, so we will test for yearly seasonality, but it seems like it would not be a big factor.
eda_bogo(df)
## # A tibble: 1,668 x 5
## ds y wday wday_mean res
## <date> <dbl> <ord> <dbl> <dbl>
## 1 2017-08-21 2447 Mon 716. 1731.
## 2 2019-08-12 2068 Mon 716. 1352.
## 3 2017-07-12 1061 Wed 375. 686.
## 4 2018-01-21 911 Sun 425. 486.
## 5 2017-08-22 893 Tue 426. 467.
## 6 2017-07-13 810 Thu 374. 436.
## 7 2018-01-22 1072 Mon 716. 356.
## 8 2015-09-28 1046 Mon 716. 330.
## 9 2017-05-01 1008 Mon 716. 292.
## 10 2017-10-23 997 Mon 716. 281.
## # ... with 1,658 more rows
It seems that there are a few outliers in the data, but since there is no way to find out a reason for that, I will leave them in for now.
Now that we have done some Exploratory Data Analysis, we will test some models. We will not need a monthly seasonality component or an external regressor dataframe since this time series is more simple than others I forecast for. We will test yearly seasonality and different transformations of our response variable.
m <- list(
no_yearly = prophet(
daily.seasonality = FALSE,
weekly.seasonality = TRUE,
yearly.seasonality = FALSE
),
yearly = prophet(
daily.seasonality = FALSE,
weekly.seasonality = TRUE,
yearly.seasonality = TRUE
)
)
bind_rows(
og = m %>%
map(fit.prophet, df) %>%
map_df(cv_df, .id = "model") %>%
cv_accuracy,
sqrt = m %>%
map(fit.prophet, df %>% mutate(y = sqrt(y))) %>%
map_df(cv_df, function(x) x^2, .id = "model") %>%
cv_accuracy,
log = m %>%
map(fit.prophet, df %>% mutate(y = log(y + 1))) %>%
map_df(cv_df, function(x) exp(x) - 1, .id = "model") %>%
cv_accuracy,
.id = "transform"
) %>%
arrange(desc(cutoff), rmse)
## # A tibble: 132 x 10
## transform cutoff model me mae rmse sd `rmse/sd` coverage
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 log Total year~ 2.14 45.8 89.5 157. 0.571 0.736
## 2 sqrt Total year~ 0.529 47.3 90.7 157. 0.579 0.772
## 3 og Total year~ 0.794 50.7 93.5 157. 0.597 0.826
## 4 sqrt Total no_y~ -2.26 51.7 94.6 157. 0.604 0.768
## 5 log Total no_y~ -0.821 51.6 95.0 157. 0.606 0.717
## 6 og Total no_y~ -2.76 54.2 96.4 157. 0.615 0.822
## 7 log 2019-~ year~ -16.6 35.7 51.8 74.8 0.693 0.717
## 8 log 2019-~ no_y~ -15.1 38.5 56.3 74.8 0.753 0.767
## 9 sqrt 2019-~ year~ -24.8 44.8 67.1 74.8 0.896 0.783
## 10 sqrt 2019-~ no_y~ -25.6 46.8 70.9 74.8 0.948 0.783
## # ... with 122 more rows, and 1 more variable: width <dbl>
It seems that the log transformation of the response variable with a yearly seasonality component had the lowest root mean square error in the 21 cross validations, so we will use that model.
m <- prophet(
daily.seasonality = FALSE,
weekly.seasonality = TRUE,
yearly.seasonality = TRUE
) %>%
fit.prophet(df %>% mutate(y = log(y + 1)))
future <- tibble(ds = seq(last(df$ds) + 1, as_date("2020-04-30"), "day"))
fc <- predict(m, future)
fc %>%
transmute(ds = as_date(ds), yhat_lower, yhat, yhat_upper) %>%
mutate_if(is.numeric, function(x) exp(x) - 1) %>%
full_join(df, "ds") %>%
xts::xts(x = select(., -ds), order.by = .$ds) %>%
dygraphs::dygraph() %>%
dygraphs::dySeries("y", "Actual", "black") %>%
dygraphs::dySeries(c("yhat_lower", "yhat", "yhat_upper"), "Prediction", "blue") %>%
dygraphs::dyRangeSelector()