My Midterm Process

It’s been a long and winding road, but I know so much more at the end than I did when I started! And that I believe is the goal. Anyways, let’s get started on the code here, shall we?

### DON'T LOAD PACKAGES YOU DON'T NEED

setwd("C:/Users/andre/Documents/Boston College/2. Summer 2020/Forecasting/midterm")

# data manipulation
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr) # load last
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#visualization
library(ggplot2)
library(stargazer) # for making tables
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
#time series
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(smoother)
## Loading required package: TTR

I had some issues with packages loading on top of each other and causing errors, so I’ve tried to pare down the libraries that I have loaded to a minimum.

So, the first thing to address here is what exactly does the data look like? Let’s take a look.

# import data
air_visits <- read.csv("C:/Users/andre/Documents/Boston College/2. Summer 2020/Forecasting/midterm/air_visit_data.csv")

#descriptive statistics
summary(air_visits)
##  air_store_id        visit_date           visitors     
##  Length:252108      Length:252108      Min.   :  1.00  
##  Class :character   Class :character   1st Qu.:  9.00  
##  Mode  :character   Mode  :character   Median : 17.00  
##                                        Mean   : 20.97  
##                                        3rd Qu.: 29.00  
##                                        Max.   :877.00
str(air_visits)
## 'data.frame':    252108 obs. of  3 variables:
##  $ air_store_id: chr  "air_ba937bf13d40fb24" "air_ba937bf13d40fb24" "air_ba937bf13d40fb24" "air_ba937bf13d40fb24" ...
##  $ visit_date  : chr  "2016-01-13" "2016-01-14" "2016-01-15" "2016-01-16" ...
##  $ visitors    : int  25 32 29 22 6 9 31 21 18 26 ...

Right now, the visit_date variable is listed as a character object, and that needs to be changed to a factor of some sort. So, the following code was used to transfoorm visit_date so that r would recognize it as a date.

air_visits$visit_date <- as.Date(air_visits$visit_date)

Cool, now it’s in the date format. This was something new that I learned through playing with this data set. That and group_by() have been new and integral for dealing with this data set, because there are many different restaurants all at the same date in these files.

Next on the agenda is getting a visualization of what’s happening with the data.

p1 <- air_visits %>% ##line graph number of visitors by date
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors)) %>%
  ggplot(aes(x = visit_date, y = all_visitors)) +
  geom_line(col = "blue") +
  ggtitle("Historical AIR Visit Data January 13, 2016 to April 22, 2017") +
  labs(x = "Date", y = "Visitors")
## `summarise()` ungrouping output (override with `.groups` argument)
p1

What this project has taught me a lot about is how to use pipe operators (%>%). This is my first exposure to using it consistently within a project, and I have to say they have been very helpful. In the discussion for the first week of class, somebody likened these pipe operators to the equivalent of “then”, and this thought process has stuck with me and helped me to understand the processes of code that others have posted. For example, in this code, I called air_visits THEN grouped by the visit_date THEN created a new data frame with the vairable all_visitors… Long story short, this has really helped me.

Moving on, let’s look at some other visualizations.

p2 <- air_visits %>% ##histogram number of visitors
  ggplot(aes(visitors)) +
  geom_vline(xintercept = 20, color = "red") +
  geom_histogram(fill = "orange", bins = 30) +
  scale_x_log10() + 
  ggtitle("Histogram of Historical AIR Visits, Number of VIsitors")
p2

p3 <- air_visits %>%   
  mutate(wday = wday(visit_date, label = TRUE, week_start = 1)) %>%   
  group_by(wday) %>%
  summarise(visits = median(visitors)) %>%
  ggplot(aes(wday, visits, fill = wday)) +
  geom_col() +  
  ggtitle("Visitors by Day of the Week") +
  labs(x = "Day of the week", y = "Median Visitors")
## `summarise()` ungrouping output (override with `.groups` argument)
p3

p4 <- air_visits %>%
  mutate(month = month(visit_date, label = TRUE)) %>%
  group_by(month) %>%
  summarise(visits = median(visitors)) %>%
  ggplot(aes(month, visits, fill = month)) + 
  geom_col() +
  ggtitle("Visitors by Month") + 
  labs(x = "Month", y = "Median Visitors")
## `summarise()` ungrouping output (override with `.groups` argument)
p4

Some things that we notice from these visualizations of the historic data: the first figure, the line graph, has a large uptick in visitors in July 2016, which stays fairly constant until a large spick right around Janurary of 2017. Next, we see from the histogram that the most common number of visitors per day peaks at around 20, and extends past 100 in extremely rare cases. The days of the week seems to increase in visitors from Monday to Saturday and then fall on Sunday (which is consistent with the literature review). Finally, March-May have consistent visitor numbers, with a lull in the summer and fall.

Now, I wanted to make sure that I had something modeled in case I couldn’t figure how to work with the reservation data. The literature tells us that reservations do make for a good indicator for future predictions. But first, I made forecasts solely with the historical data from the AIR restaurants to see how that would look (and to test my knowledge of the code of the models that I want to run.)

So, to being that process, I have to make my data into a time series.

#time series
by_date <- air_visits %>% 
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors))
## `summarise()` ungrouping output (override with `.groups` argument)
by_dateTS <- ts(by_date$all_visitors, frequency = 7) ## why can't I add the date to this?
autoplot(by_dateTS)

This plot looks a lot like the plot that I made earlier, which means I think that I did it right. However, when I try to add date into it train_air <- ts(air_reserve$reserve_visitors, start = c(2016, 01, 01), end = c(2017, 05, 31), frequency = 7) I get a weird time series that only has 11 points to it. But, if I leave it alone, I have 478 observations, which is what I want.

Before I try to model anything, lets decompose the data to get a sense of the trends in visitors that this AIR restaurant has seen.

by_dateTS %>%
  decompose(type = "additive") %>%
  autoplot() + 
  xlab("Time") + 
  ggtitle("Classical Additive Decomposition")

by_dateTS %>%
  decompose(type = "multiplicative") %>% # multiplicative
  autoplot() + 
  xlab("Time") + 
  ggtitle("Classical Multiplicative Decomposition")

Based off of these, visuals, it’s really hard for me to tell whether this is more additive or multiplicative. This informs my decision for running models in that I should run at least one that has multiplicative components and one that has additive components, and then that way I can decide on metrics and not by my eyes.

On to setting up training and testing data.

#train & test 
train <- ts(by_dateTS[1:382], frequency = 7)
test <- ts(by_dateTS[383:478], frequency = 7)

Moving on to models, it’s about time! I’m running ETS and weighted moving average models. First is ETS

ets <- ets(train) # ets model
ets
## ETS(M,Ad,M) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.5916 
##     beta  = 1e-04 
##     gamma = 1e-04 
##     phi   = 0.9641 
## 
##   Initial states:
##     l = 2156.8817 
##     b = 321.3936 
##     s = 0.934 0.9336 0.8502 0.7356 0.9386 1.3499
##            1.2582
## 
##   sigma:  0.1448
## 
##      AIC     AICc      BIC 
## 7755.280 7756.269 7806.571
fc_ets <- forecast(ets,14) #forecast
autoplot(fc_ets) #plot

checkresiduals(fc_ets) #checking residuals

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,M)
## Q* = 41.684, df = 3, p-value = 4.682e-09
## 
## Model df: 12.   Total lags used: 15
acc_ets <- accuracy(fc_ets, test[1:14]) #accuracy of model
acc_ets
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  -6.032786 1596.076 822.7383 -1.895786 9.064925 0.4109862
## Test set     917.368225 1302.465 949.0908  5.751102 5.984893 0.4741036
##                   ACF1
## Training set 0.2953002
## Test set            NA

The parameters for ETS associated with this ETS models are M,Ad, M. So, there seems to be multiplicative error and seasonality but additive trend. When I decomposed the data, I wasn’t sure which one it was, but the software chose for me the multiplicative route. What happens to my metrics when I run something that is more additive?

ets2 <- ets(train, model = "AAA") #forced ets model
ets2
## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = train, model = "AAA") 
## 
##   Smoothing parameters:
##     alpha = 0.6474 
##     beta  = 1e-04 
##     gamma = 0.1773 
##     phi   = 0.9177 
## 
##   Initial states:
##     l = 1058.0137 
##     b = 444.9493 
##     s = -416.2249 -593.1969 -1519.982 -2563.953 -498.3742 3458.829
##            2132.902
## 
##   sigma:  1606.427
## 
##      AIC     AICc      BIC 
## 7924.629 7925.618 7975.919
fc_ets2 <- forecast(ets2,14) #forecast
checkresiduals(fc_ets2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 88.169, df = 3, p-value < 2.2e-16
## 
## Model df: 12.   Total lags used: 15
acc_ets2 <- accuracy(fc_ets2, test[1:14]) #accuracy 
acc_ets2
##                      ME     RMSE       MAE       MPE      MAPE      MASE
## Training set   26.98695 1580.994  912.9441 -1.149958 10.760563 0.4560471
## Test set     1111.24922 1676.820 1169.7199  6.640796  7.200702 0.5843155
##                   ACF1
## Training set 0.1976004
## Test set            NA

Looking at the metrics, though the difference is small, the additive model actually has a lower RMSE! But, before I get too excited, I realize that this is for the training set, and not the test set, which is of course more important.

autoplot(train) +
  autolayer(fc_ets, series = "Multiplicative Forecast") +
  autolayer(fc_ets2, series = "Additive Forecast")

Foretunately, at least both ETS models do better than a naive forecast, given their MASE values are less than 1.

And then weighted moving average

mywma <- na.omit(WMA(train,14)) # weighted moving average, loaded with smoother
fc_wma <- forecast(mywma,14) # forecast
autoplot(fc_wma) #plot

checkresiduals(fc_wma) #checking residuals

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,N)
## Q* = 340.93, df = 5, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 10
acc_wma <- accuracy(fc_wma, test[1:14]) # accuracy of model
acc_wma
##                       ME      RMSE       MAE        MPE      MAPE       MASE
## Training set    3.490081  340.1146  254.3016  0.0823534  2.516448  0.9432135
## Test set     2080.725828 4014.3363 2842.3760 10.1770561 17.609637 10.5424702
##                   ACF1
## Training set 0.1471929
## Test set            NA

The next steps are to interpret these findings, but then to see if I can figure out how to analyze the other valuable data that has been made available for this project. I guess we will see!

Next Steps

For me, it was most important to make sure that I had some kind of forecast down. This is the most amount of information I’ve coded with before, and now that I’ve been able to code with just the historical data, I want to branch out and see what more I can do with the information at hand. So, the next question that I would want to figure out is the difference between reservations and actual visitors. So, here is the plot for the reservations and expected number of visitors.

# further attempts
air_reserve <- read.csv("C:/Users/andre/Documents/Boston College/2. Summer 2020/Forecasting/midterm/air_reserve.csv")

# make dates
air_reserve$reserve_datetime <- as.Date(air_reserve$reserve_datetime)

#visualizations
q1 <- air_reserve %>% ##line graph number of reservations by date
  group_by(reserve_datetime) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(x = reserve_datetime, y = all_visitors)) +
  geom_line(col = "blue") +
  ggtitle("Historical AIR Reservation Data January 1, 2016 to April 22, 2017") +
  labs(x = "Date", y = "Reserved Visitors")
## `summarise()` ungrouping output (override with `.groups` argument)
q1 

Even eyeballing this and the historical visitor data, there seems to be differences in reservations and actual visitors. While eyeballing is a good place to start, the next steps are to combine the two datasets so that the difference can be measured. However, I had a lot of trouble determine how to merge the files. If I could have merged the files, I would have tried to graph the difference between the number of people who reserved and the number of people who actually visited the restaurant. I feel like that is a very valuable metric for a restauranteur to have.

# time series
by_date_reserve <- air_reserve %>% 
  group_by(reserve_datetime) %>%
  summarise(all_visitors = sum(reserve_visitors))
## `summarise()` ungrouping output (override with `.groups` argument)
reserve_dateTS <- ts(by_date$all_visitors, frequency = 7) ## why can't I add the date to this?

autoplot(reserve_dateTS, series = "Reservations (Expected Visitors)") + 
  autolayer(by_dateTS, series = "Visitors") ##why do these overlap exactly?

When I tried to make a time series of the reservation data, it looked exactly like my visitor data and not like the graph that I created in the last chunk of code and I could not figure out why that was the case.

Another limitation that I talked about in my paper was how important it would be to include the holiday in with the forecast, as it makes sense that you’d want to plan special for a holiday and understand what the expectations are around that. However, I couldn’t determine how to merge files, or even if I did, how to code that into a time series. Does that mean that it would be a multi-variate time series? Or a multi-step forecast like the ones that I was reading about in the literature review?

To wrap this up, I know that I learned a lot from this project. I think that it really pushed me to understand the data I was working with and how to make what I had into a valuable forecast. It also enabled me to learn more about some pretty useful functions, like the group_by(), summarise(), as.Date(), and %>% functions/operators. I’m excited to be building this foundation and also excited to continue to learn even more.