Libraries

library(readxl)
library(ggplot2)
library(fpp3)
library(dplyr)
library(fst)
library(tsibbledata)
library(tsibble)
library(feasts)
library(lubridate)
library(seasonal)
library(zoo)
library(fable)
library(readxl)
library(ggplot2)
library(forecast)
library(tidyverse)
library(kableExtra)
library(Amelia)
library(reshape2)
library(reticulate)

Problem

A problem for restaurants looking to optimize their operations is the unknown. Preparing for customers on a given day means either spending on surplus, not meeting your needs, or ideally, operating efficiently. The analysis done in this report and the problem being addressed aims to reduce the unknown by predicting the number of daily visitors for many Japanese restaurants.

Significance

Performing optimally has obvious implications on whether a restaurant will stay in business and continue providing their customers. However, if data of this sort consisted of every restaurant in each city, this type of analysis could give insight into general consumer foot traffic and spending habits. General trends regarding which parts of the city are busiest would be incredibly valuable for many businesses for a variety of reasons.

Literature

The two pieces of work most like this analysis are by Ma et al. (2018) and Tanizaki et al. (2019) who both make predictions on restaurant visitors/demand using ML techniques. Another similar study by Shi and Xu (2021) studies dine-in and take-out trends in demand during COVID.

The remaining two studies utilize user reviews to analyze customer preferences (Luo & Xu, 2019) and to map restaurant quality (Grimaldi et al., 2021).

The models considered by Ma et al. (2018) considered the following machine learning: Support Vector Machines (SVM), Random Forests (RF), Deep Neural Networks (DNN), K-Nearest Neighbor (KNN), and XGBoost. With consideration of the computation power necessary to run some of these models, they decided to use KNN, RF and XGBoost. These three models were combined to avoid any bias brought from a single regressor by creating a weighted regressor, which used the mean of all predictions for one input instance from the three regressors. This resulted in their lowest RMSLE test error at (0.490).

As for Tanizaki et al. (2019), the models used consist of Bayesian Linear Regression, Boosted Decision Tree Regression, and Decision Forest Regression for machine learning and Stepwise method used for statistical analysis. There was no major difference seen between the methods.

With Shi and Xu (2021), a probabilistic learning model was used after analyzing the aggregate foot traffic volumes in Washington DC throughout various stages of COVID. While this study did not aim to predict visitors, it found that in the short term, limited-service and budget restaurants saw a more speedy recovery given their comparative advantage. In the long-run, the demand was preserved for dine-in and full-service restaurants.

For the last two studies, ML techniques were only used by Luo and Xu (2019) and consist of Latent Dirichlet Allocation (LDA), Support Vector Machine (SVM) with a Fuzzy Domain Ontology (FDO) algorithm, Naïve Bayes (NB), and K-Nearest Neighbor (KNN). The chosen model was a SVM as it was one of the most effective methods to categorize unlabeled data at the time of study. Use of SVM with FDO algorithm achieved highest prediction accuracy (79.59%) and precision rate (81.62%) to predict restaurant review usefulness of three U.S.-based cities on Yelp. Since a 2017 TripAdvisor Restaurant Marketing Survey reported that 94% of diners choose a restaurant based on online reviews, this is an insightful prediction. Lastly, Grimaldi et al., 2021 simply plotted heat maps on restaurant ratings in Manhattan. No models were used.

Data

- Reading

sample_submission <- read_csv("data/Final/sample_submission.csv")
head(sample_submission)

# A tibble: 6 x 2
  id                              visitors
  <chr>                              <dbl>
1 air_00a91d42b08b08d9_2017-04-23        0
2 air_00a91d42b08b08d9_2017-04-24        0
3 air_00a91d42b08b08d9_2017-04-25        0
4 air_00a91d42b08b08d9_2017-04-26        0
5 air_00a91d42b08b08d9_2017-04-27        0
6 air_00a91d42b08b08d9_2017-04-28        0

air_store_info <- read_csv("data/Final/air_store_info.csv")
air_visit_data <- read_csv("data/Final/air_visit_data.csv")
date_info <- read_csv("data/Final/date_info.csv")


# The omitted datasets:
#
#  air_reserve <- read_csv("data/Final/air_reserve.csv")
#  hpg_reserve <- read_csv("data/Final/hpg_reserve.csv")
#  hpg_store_info <- read_csv("data/Final/hpg_store_info.csv")
#  store_id_relation <- read_csv("data/Final/store_id_relation.csv")

- Exploring

######################################
# sample_submission
######################################

# Finding locations of '_"
which(strsplit(toString(sample_submission[1,1]), "")[[1]] == "_")

[1]  4 21

nchar(toString(sample_submission[1,1]))

[1] 31

sample_submission['air_id'] = substr(sample_submission$id,1,20)
sample_submission['Date'] = substr(sample_submission$id,22,31)

head(sample_submission)

# A tibble: 6 x 4
  id                              visitors air_id               Date      
  <chr>                              <dbl> <chr>                <chr>     
1 air_00a91d42b08b08d9_2017-04-23        0 air_00a91d42b08b08d9 2017-04-23
2 air_00a91d42b08b08d9_2017-04-24        0 air_00a91d42b08b08d9 2017-04-24
3 air_00a91d42b08b08d9_2017-04-25        0 air_00a91d42b08b08d9 2017-04-25
4 air_00a91d42b08b08d9_2017-04-26        0 air_00a91d42b08b08d9 2017-04-26
5 air_00a91d42b08b08d9_2017-04-27        0 air_00a91d42b08b08d9 2017-04-27
6 air_00a91d42b08b08d9_2017-04-28        0 air_00a91d42b08b08d9 2017-04-28

length(unique(sample_submission$air_id))

[1] 821

range(sample_submission$Date)

[1] "2017-04-23" "2017-05-31"

# 831 unique ID's and date range from 2017-04-23 to 2017-05-31


######################################
# air_store_info
######################################
head(air_store_info)

# A tibble: 6 x 5
  air_store_id         air_genre_name air_area_name           latitude longitude
  <chr>                <chr>          <chr>                      <dbl>     <dbl>
1 air_0f0cdeee6c9bf3d7 Italian/French Hyogo-ken Kobe-shi Kum~     34.7      135.
2 air_7cc17a324ae5c7dc Italian/French Hyogo-ken Kobe-shi Kum~     34.7      135.
3 air_fee8dcf4d619598e Italian/French Hyogo-ken Kobe-shi Kum~     34.7      135.
4 air_a17f0778617c76e2 Italian/French Hyogo-ken Kobe-shi Kum~     34.7      135.
5 air_83db5aff8f50478e Italian/French Tokyo-to Minato-ku Shi~     35.7      140.
6 air_99c3eae84130c1cb Italian/French Tokyo-to Minato-ku Shi~     35.7      140.

unique(air_store_info$air_genre_name)

 [1] "Italian/French"               "Dining bar"                  
 [3] "Yakiniku/Korean food"         "Cafe/Sweets"                 
 [5] "Izakaya"                      "Okonomiyaki/Monja/Teppanyaki"
 [7] "Bar/Cocktail"                 "Japanese food"               
 [9] "Creative cuisine"             "Other"                       
[11] "Western food"                 "International cuisine"       
[13] "Asian"                        "Karaoke/Party"

length(unique(air_store_info$air_genre_name))

[1] 14

length(unique(air_store_info$air_area_name))

[1] 103

length(unique(air_store_info$air_store_id))

[1] 829

# 14 unique genres, 103 unique locations, 829 unique ID's 

info <- air_store_info[(match(unique(sample_submission$air_id), air_store_info$air_store_id)),]
info['visitors'] <- air_visit_data[(match(unique(sample_submission$air_id), air_store_info$air_store_id)),3]

genre_dat <- info %>% 
  group_by(air_genre_name) %>%
  summarize(Visitors = sum(visitors))

ggplot(data=genre_dat, aes(x=reorder(air_genre_name, Visitors), y=Visitors)) + 
  geom_bar(stat="identity", fill="steelblue")+ coord_flip() + 
  theme(axis.title.y=element_blank()) + 
  ggtitle("Visitors by Genre")

######################################
# air_visit_data
######################################
head(air_visit_data)

# A tibble: 6 x 3
  air_store_id         visit_date visitors
  <chr>                <date>        <dbl>
1 air_ba937bf13d40fb24 2016-01-13       25
2 air_ba937bf13d40fb24 2016-01-14       32
3 air_ba937bf13d40fb24 2016-01-15       29
4 air_ba937bf13d40fb24 2016-01-16       22
5 air_ba937bf13d40fb24 2016-01-18        6
6 air_ba937bf13d40fb24 2016-01-19        9

length(unique(air_visit_data$air_store_id))

[1] 829

range(air_visit_data$visit_date)

[1] "2016-01-01" "2017-04-22"

# 829 unique ID's, date range from 2016-01-01 to 2017-04-22

ggplot(data=air_visit_data, aes(x=visitors)) + 
  geom_histogram(aes(y=stat(density)), color="blue", fill="lightblue", bins=25) + 
  geom_density(col="darkblue", size=1) + 
  scale_x_continuous(breaks=scales::pretty_breaks(n = 15))+
  xlim(0,100) + ggtitle("Daily Visitors")

ggplot(data=air_visit_data,  aes(x = visitors)) + 
  geom_boxplot(color="cadetblue4", fill="cadetblue1") + coord_flip() +
  ggtitle("Daily Visitors")

# Checking and replacing outliers
bp <- boxplot(air_visit_data$visitors, id=list(method="y"))

par(mfrow=c(1,2))
plot(bp$out, main="Sum of Outliers")
boxplot(bp$out, main="Sum of Outliers")

summary(bp$out)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  60.00   63.00   68.00   75.16   78.50  877.00

length(bp$out)

[1] 6959

q2_o <- quantile(bp$out, 0.25)
#q4_o <- quantile(bp$out, 0.75)

m = min(bp$out)-5
s = 5

air_visit_data$visitors[air_visit_data$visitors > q2_o] <- rnorm(30,m,s)

ggplot(data=air_visit_data,  aes(x = visitors)) + 
  geom_boxplot(color="cadetblue4", fill="cadetblue1") + coord_flip() +
  ggtitle("Daily Visitors")

######################################
# date_info
######################################
head(date_info)

# A tibble: 6 x 3
  calendar_date day_of_week holiday_flg
  <date>        <chr>             <dbl>
1 2016-01-01    Friday                1
2 2016-01-02    Saturday              1
3 2016-01-03    Sunday                1
4 2016-01-04    Monday                0
5 2016-01-05    Tuesday               0
6 2016-01-06    Wednesday             0

range(date_info$calendar_date)

[1] "2016-01-01" "2017-05-31"

# date range from 2016-01-01 to 2017-05-31

by_date <- air_visit_data %>% 
  group_by(visit_date) %>%
  summarize(visitors = sum(visitors))

ggplot(data=by_date, aes(x=visit_date)) + 
  geom_line(aes(y = visitors), color = "coral4") +
  ggtitle("Total Visitors by Date")

# The number of stores included in dataset increase July 2016
nrow(unique(air_visit_data[which(air_visit_data$visit_date == "2016-06-03"),1]))

[1] 297

nrow(unique(air_visit_data[which(air_visit_data$visit_date == "2016-08-03"),1]))

[1] 703

import plotly_express as px

# Reading data from R
#data = r.data  
data = r.air_store_info

# Creating Map
fig = px.scatter_mapbox(data, lat="latitude", lon="longitude", hover_name="air_area_name",
                        hover_data={"air_genre_name"},
                        color_discrete_sequence=["fuchsia"], zoom=4, height=800,
                        title = 'All Restaurant Locations (interactive)')

# Map Type
fig.update_layout(mapbox_style="open-street-map")

- Preparing

# Matching and populating dates for all 'air' ID's.
# If there is no data for a given date, a value of 0 is assigned.
date.prep = air_visit_data %>%
  group_by(air_store_id) %>%
  mutate(Date = visit_date) %>%
  summarize(Date = Date, 
            Visitors = visitors) %>%
  complete(Date = seq.Date(min(date_info$calendar_date),
                           max(date_info$calendar_date),
                           by="day")) %>%
  mutate(Visitors = replace_na(Visitors, -1))

# log1p transformation
data.trs = date.prep
data.trs[,3] = as.data.frame(
  ifelse(data.trs$Visitors == -1, 
         0,
         log1p(data.trs$Visitors)))


############
data = recast(date.prep, air_store_id ~ Date, id.var = c("Date", "air_store_id"))
data = cbind(data, air_store_info[,c(2,3,4,5)])

data = data %>% filter(data$air_store_id %in% sample_submission$air_id)
############
data_log = recast(data.trs, air_store_id ~ Date, id.var = c("Date", "air_store_id"))
data_log = cbind(data_log, air_store_info[,c(2,3,4,5)])

data_log = data_log %>% filter(data_log$air_store_id %in% sample_submission$air_id)
############

##################################################
#
# Encoding area_name and genre_name
#          #########     ##########
##################################################
data_num = data
data_num$air_genre_name = as.numeric(factor(data_num$air_genre_name))
data_num$air_area_name = as.numeric(factor(data_num$air_area_name))

data_num[data_num == -1] <- 0
#
#
#
# All data of value is now aggregated into data_num. 
#

data = r.data_num
data.head()

           air_store_id  2016-01-01  ...   latitude   longitude
0  air_00a91d42b08b08d9         0.0  ...  34.695124  135.197853
1  air_0164b9927d20bcc3         0.0  ...  34.695124  135.197853
2  air_0241aa3964b7f861         0.0  ...  34.695124  135.197853
3  air_0328696196e46f18         0.0  ...  34.695124  135.197853
4  air_034a3d5b40d5b1b1         0.0  ...  35.658068  139.751599

[5 rows x 522 columns]

# Making Training Data
last_date_loc = grep("2017-04-22", colnames(data_num))

train = data_num[,1:round(last_date_loc*0.9)]
test = data_num[,(last_date_loc+1):518]
valid = data_num[,1:(last_date_loc+1)]

Discuss

After many hours trying to understand the data, it was learned that much of the data provided by kaggle is unnecessary for analysis. As the final results are to be submitted with the data’s ‘air_ID’ number of visitors, the data collected through ‘hpg’ systems has no direct relevance. The same is also true for reservation data with both ‘air’ and ‘hpg’ systems. For the sake of cleanliness, only the relevant datasets were imported and used.

The total data (as used in this analysis) consists of a unique restaurant ID, the genre, area, coordinates, and the amount of daily visitors. The data is collected from 2016-01-01 to 2017-04-22 and is to be predicted from 2017-04-23 to 2017-05-31. All data was aggregated into a matrix with dates for columns and ID’s for rows. The values consisted of the number of visitors. The coordinates, genre and area name were appended as the last columns and genre and area were encoded to their respective numeric value. Aggregating the data in this way seemed to suggest that a neural network might be best, however given our time constrain, one wasn’t implemented.

Models

- Functions and ETS v ARIMA

# Takes data from aggregated dataset and formats it for modeling. 
data_prep = function(iter, train, valid){
  current = as.data.frame(t(rbind(colnames(train), train[iter,1:ncol(train)])))
  current = current[-1,]
  names(current) = c("Date", "Value")
  current$Date = as.Date(current$Date)
  current$Value = as.numeric(current$Value)
  current = current %>% as_tsibble(index = Date)
  
  valid = as.data.frame(t(rbind(colnames(valid), valid[iter,1:ncol(valid)])))
  valid = valid[-1,]
  names(valid) = c("Date", "Value")
  valid$Date = as.Date(valid$Date)
  valid$Value = as.numeric(valid$Value)
  valid = valid %>% as_tsibble(index = "Date")
  
  out = list()
  out$current = current
  out$valid = valid
  out[is.na(out)] <- 0
  return(out)
}


# This function is called, it calls the data_prep function above for data, then
# calls the ets and arima functions below to receive accuracy measures for both
# model types. 
val_prep = function(iter, train, valid){
  all = data_prep(iter, train, valid)
  current = all$current
  valid = all$valid
  
  out = list()
  out$ets = receive_ets(iter, current, valid)
  out$arima = receive_arima(iter, current, valid)
  return(out)
}

# Models and returns ETS accuracy measures
receive_ets = function(iter, current, valid){
  ets = current %>%
    model(
      ETS(Value)
    )
  
  m = bind_rows(
    ets %>% accuracy(),
    ets %>% forecast(h=39) %>% accuracy(valid)
  ) %>% select(-MPE,-MAPE,-ACF1)
  return(m)
}

# Models and returns ARIMA accuracy measures
receive_arima = function(iter, current, valid){
  arima = current %>%
    model(
      ARIMA(Value)
    )
  
  m = bind_rows(
    arima %>% accuracy(),
    arima %>% forecast(h=39) %>% accuracy(valid)
  ) %>% select(-MPE,-MAPE,-ACF1)
  return(m)
}


# This loop calls the above functions to produce comparison accuracy measures for
# ETS and ARIMA models. 
ets_totals = list()
arima_totals = list()
for(i in 1:2){
  my_rand = sample(1:nrow(train), 75)
  for(s in 1:75){

    vals = val_prep(my_rand[s], train, valid)
    
    ets_totals = bind_rows(ets_totals, vals$ets)
    arima_totals = bind_rows(arima_totals, vals$arima)
  }
}
head(ets_totals)

# A tibble: 6 x 7
  .model     .type        ME  RMSE   MAE  MASE RMSSE
  <chr>      <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
1 ETS(Value) Training -0.200  7.73  5.54 0.920 0.758
2 ETS(Value) Test     -1.30   5.41  3.94 0.654 0.530
3 ETS(Value) Training  0.403  7.72  5.97 0.756 0.740
4 ETS(Value) Test      0.859  8.17  6.53 0.827 0.783
5 ETS(Value) Training  0.359  9.25  6.28 0.852 0.741
6 ETS(Value) Test      0.866  8.77  7.40 1.00  0.702

ets_vals = ets_totals %>%
  group_by(.type) %>%
  summarize(RMSE = mean(RMSE))

arima_vals = arima_totals %>%
  group_by(.type) %>%
  summarize(RMSE = mean(RMSE))

hist(ets_totals$RMSE, breaks = 100, main="ETS", xlab="RMSE", col="blue")

boxplot(ets_totals$RMSE, col="blue", main="ETS")

hist(arima_totals$RMSE, breaks = 100, main="ARIMA", xlab="RMSE", col="green")

boxplot(arima_totals$RMSE, col="green", main="ARIMA")

ets_vals # average of all ets models

# A tibble: 2 x 2
  .type     RMSE
  <chr>    <dbl>
1 Test      9.08
2 Training  7.98

arima_vals # average of all arima models

# A tibble: 2 x 2
  .type     RMSE
  <chr>    <dbl>
1 Test     10.1 
2 Training  8.43

# The better RMSE score has so far alternated between the two models. It does seem 
# that ETS model performs just slightly better.

- Finding best Arima

# This function produces accuracy measures for each combination
arimaIterate <- function(train_t, valid_t, my_rand, p, d, q){
  myarimaC = train_t %>%
    model(
      `Auto` = ARIMA(Value~ pdq(p,d,q) )
    )
  
  measures = bind_rows(
    myarimaC %>% accuracy(),
    myarimaC %>% forecast(h = 39) %>% accuracy(valid_t) 
  ) %>% select(-MPE,-MAPE,-ACF1)
  
  measures['combo'] = toString(unlist(rbind(p,d,q)))
  measures['p'] = as.numeric(unlist(p))
  measures['d'] = as.numeric(unlist(d))
  measures['q'] = as.numeric(unlist(q))
  measures['index'] = as.numeric(unlist(my_rand))
  return(measures)
}

# this loop sends pdq combinations to above function
totals_a = list()
my_rand = sample(1:nrow(train), 5)
for(i in 1:5){
  for (p in 0:4){
    for(d in 0:2){
      for(q in 0:4){
        
        dat = data_prep(my_rand[i], train, valid)
        train_t = dat$current
        valid_t = dat$valid
        
        vals = arimaIterate(train_t, valid_t, my_rand[i], p,d,q)
        totals_a = bind_rows(totals_a, vals)
      }
    }
  }
}
totals_a_cond = na.omit(totals_a)
head(totals_a_cond)

# A tibble: 6 x 12
  .model .type         ME  RMSE   MAE  MASE RMSSE combo      p     d     q index
  <chr>  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>
1 Auto   Training 0.0217   2.26  1.60  1.04 0.801 0, 0,~     0     0     0   518
2 Auto   Test     0.719    2.17  1.67  1.09 0.768 0, 0,~     0     0     0   518
3 Auto   Training 0.00755  2.24  1.63  1.06 0.793 0, 0,~     0     0     1   518
4 Auto   Test     0.775    2.18  1.64  1.07 0.772 0, 0,~     0     0     1   518
5 Auto   Training 0.00699  2.20  1.55  1.01 0.778 0, 0,~     0     0     2   518
6 Auto   Test     0.763    2.18  1.64  1.07 0.770 0, 0,~     0     0     2   518

measure_test = totals_a_cond %>%
  filter(.type == "Test")
mean(measure_test$RMSE)

[1] 15.06245

# Finds best 10 combinations for each index
top10_avg = totals_a_cond %>% filter(.type == "Test") %>% 
  group_by(index) %>% slice_min(order_by = RMSE, n = 10)

tab1 = top10_avg[,c(9,10,11)]
tab = table(apply(tab1, 1, paste, collapse = ", "))

tab = sort(tab, decreasing = TRUE)
tab = tab[1:10]
#tab[1:10]

kbl(tab, caption = "test", booktabs=T) %>%
  kable_styling(latex_options = c("striped", "hold_position"))

test
Var1	Freq
0, 0, 4	3
1, 0, 3	3
2, 0, 2	3
0, 0, 3	2
0, 1, 0	2
0, 1, 2	2
1, 0, 0	2
1, 1, 1	2
1, 1, 2	2
1, 1, 3	2

# these values change based on which observation in the data is being modeled but 
# follow the general trend seen in the variables below. 

p = c(0, 1)
d = c(0, 1)
q = c(0, 1, 2, 3, 4)

- Evaluating best Arima

# does the same as the loop above with further limits on pdq combinations
totals_b = list()
my_rand = sample(1:nrow(train), 10)
for(i in 1:10){
  for (p in 0:1){
    for(d in 0:1){
      for(q in 0:4){
        
        dat = data_prep(my_rand[i], train, valid)
        train_t = dat$current
        valid_t = dat$valid
        
        vals = arimaIterate(train_t, valid_t, my_rand[i], p,d,q)
        totals_b = bind_rows(totals_b, vals)
      }
    }
  }
}
totals_b_cond = na.omit(totals_b)
head(totals_b_cond)

# A tibble: 6 x 12
  .model .type         ME  RMSE   MAE  MASE RMSSE combo      p     d     q index
  <chr>  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>
1 Auto   Training  0.104   7.89  6.22 0.901 0.791 0, 0,~     0     0     0   187
2 Auto   Test     -0.0933  7.56  6.27 0.908 0.758 0, 0,~     0     0     0   187
3 Auto   Training  0.0457  8.24  6.81 0.987 0.826 0, 0,~     0     0     1   187
4 Auto   Test      0.197   7.89  6.76 0.980 0.791 0, 0,~     0     0     1   187
5 Auto   Training  0.0456  8.24  6.81 0.987 0.826 0, 0,~     0     0     2   187
6 Auto   Test      0.213   7.88  6.75 0.978 0.790 0, 0,~     0     0     2   187

measure_test2 = totals_b_cond %>%
  filter(.type == "Test")
mean(measure_test2$RMSE) # getting better but doesnt beat autoarima above

[1] 11.59281

# Finds best 10 combinations for each index
top10_avg2 = totals_b_cond %>% filter(.type == "Test") %>% 
  group_by(index) %>% slice_min(order_by = RMSE, n = 10)

tab2 = top10_avg2[,c(9,10,11)]
tab = table(apply(tab2, 1, paste, collapse = ", "))

tab = sort(tab, decreasing = TRUE)
tab = tab[1:10]

kbl(tab, caption = "test", booktabs=T) %>%
  kable_styling(latex_options = c("striped", "hold_position"),
                position="center")

test
Var1	Freq
0, 0, 0	8
0, 1, 1	8
1, 0, 0	8
0, 1, 2	7
0, 1, 3	7
0, 1, 4	7
1, 0, 1	7
1, 1, 1	7
1, 1, 2	7
1, 1, 3	7

# better off sticking with basic autoarima model unless we find and apply best model
# to each observation but that would be extremely time-consuming.

- ETS v ARIMA - log transform

last_date_loc = grep("2017-04-22", colnames(data_log))

train = data_log[,1:round(last_date_loc*0.9)]
test = data_log[,(last_date_loc+1):518]
valid = data_log[,1:(last_date_loc+1)]


# Same ETS v ARIMA comparison as above -- with log transformed data
ets_log_totals = list()
arima_log_totals = list()
for(i in 1:2){
  my_rand = sample(1:nrow(train), 75)
  for(s in 1:75){
    
    vals = val_prep(my_rand[s], train, valid)
    
    ets_log_totals = bind_rows(ets_log_totals, vals$ets)
    arima_log_totals = bind_rows(arima_log_totals, vals$arima)
  }
}
head(ets_log_totals)

# A tibble: 6 x 7
  .model     .type          ME  RMSE   MAE  MASE RMSSE
  <chr>      <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
1 ETS(Value) Training  0.0516  1.23  0.834 0.988 0.765
2 ETS(Value) Test      0.346   1.27  1.08  1.27  0.787
3 ETS(Value) Training  0.00969 0.954 0.747 0.821 0.657
4 ETS(Value) Test     -0.701   1.14  0.833 0.916 0.783
5 ETS(Value) Training  0.0153  0.502 0.282 0.741 0.654
6 ETS(Value) Test      0.340   0.597 0.468 1.23  0.778

ets_log_vals = ets_log_totals %>%
  group_by(.type) %>%
  summarize(RMSE = mean(RMSE))

arima_log_vals = arima_log_totals %>%
  group_by(.type) %>%
  summarize(RMSE = mean(RMSE))

hist(ets_log_totals$RMSE, breaks = 100, col="blue", main="logETS")

boxplot(ets_log_totals$RMSE, col="blue", main="logETS")

hist(arima_log_totals$RMSE, breaks = 100, col="green", main="logARIMA")

boxplot(arima_log_totals$RMSE, col="green", main="logARIMA")

ets_log_vals

# A tibble: 2 x 2
  .type     RMSE
  <chr>    <dbl>
1 Test     0.733
2 Training 0.681

arima_log_vals

# A tibble: 2 x 2
  .type     RMSE
  <chr>    <dbl>
1 Test     0.792
2 Training 0.710

# transforming data reduced severity of outlier

- Fitting Model

# fitting onto test data
# combining train and validation data
train = data_num[,1:last_date_loc]
test = sample_submission

# receives data and ID
# formats it to meet submission requirements
myfit = function(trainx, id){
  mymodel = trainx %>%
    model(
      ETS(Value)
    )
  fc = mymodel %>% forecast(h = 39)
  fc['id'] = id
  
  f = fc[,c(5,2,4)]
  return(f)
}

# calls data_prep function at the beginning to receive formatted data
# call function above to receive properly formatted data to bind. 
submit = list()
for(i in 1:nrow(train)){
  id = train[i,1]
  vals = data_prep(i, train, valid)
  trainx = vals$current
  
  myRes = myfit(trainx, id)
  myRes = as.data.frame(myRes)
  submit = dplyr::bind_rows(submit, myRes)
}



head(submit) # results

                    id       Date     .mean
1 air_00a91d42b08b08d9 2017-04-23  3.546309
2 air_00a91d42b08b08d9 2017-04-24 26.878185
3 air_00a91d42b08b08d9 2017-04-25 33.839474
4 air_00a91d42b08b08d9 2017-04-26 32.646970
5 air_00a91d42b08b08d9 2017-04-27 36.282076
6 air_00a91d42b08b08d9 2017-04-28 39.866960

dim(submit)

[1] 32019     3

head(sample_submission) # correct format

# A tibble: 6 x 4
  id                              visitors air_id               Date      
  <chr>                              <dbl> <chr>                <chr>     
1 air_00a91d42b08b08d9_2017-04-23        0 air_00a91d42b08b08d9 2017-04-23
2 air_00a91d42b08b08d9_2017-04-24        0 air_00a91d42b08b08d9 2017-04-24
3 air_00a91d42b08b08d9_2017-04-25        0 air_00a91d42b08b08d9 2017-04-25
4 air_00a91d42b08b08d9_2017-04-26        0 air_00a91d42b08b08d9 2017-04-26
5 air_00a91d42b08b08d9_2017-04-27        0 air_00a91d42b08b08d9 2017-04-27
6 air_00a91d42b08b08d9_2017-04-28        0 air_00a91d42b08b08d9 2017-04-28

dim(sample_submission)

[1] 32019     4

# Both datasets have the same amount of columns. 


# All 'air_id's in both datasets have a match in the other.
sum(submit$id %in% sample_submission$air_id)

[1] 32019

# Combining ID and Date
submit$combined <- str_c(submit$id, "_", submit$Date)

final_submit <- submit[,c(4,3)]
names(final_submit) <- c("id", "visitors")
head(final_submit)

                               id  visitors
1 air_00a91d42b08b08d9_2017-04-23  3.546309
2 air_00a91d42b08b08d9_2017-04-24 26.878185
3 air_00a91d42b08b08d9_2017-04-25 33.839474
4 air_00a91d42b08b08d9_2017-04-26 32.646970
5 air_00a91d42b08b08d9_2017-04-27 36.282076
6 air_00a91d42b08b08d9_2017-04-28 39.866960

# write.csv(final_submit, "data/Final/final_submit2.csv", row.names = FALSE)

Formulation

The data being formatted as it was required additional manipulation down the line. The basic method used was to randomly sample observations within the set to model and compare accuracy measures to find what performed best. This was done through functions and loops.

Performance / Accuracy

Both ETS and ARIMA models performed similarly. The submitted data received a private score of 0.58908.

Limitations

A major limitation of the deployed models is the amount of information potentially going unused with such a large dataset. There are likely Deep Learning models that can perform much better than the models used here. There was no best model that could be applied to all data which inevitably led to higher variation.

Learning

“Users can change the kernel functions of SVM tocomplete learning tasks on non-linear distributions of trainingdata. However, the quality of features heavily affect the perfor-mance of SVM. If training data has irrelevant features, SVMmay have very low accuracy.” - Ma et al.(2019)

References

MA, XU & TIAN, YANSHAN & Luo, Chu & Zhang, Yuehui. (2018). Predicting Future Visitors Of Restaurants Using Big Data. 269-274. 10.1109/ICMLC.2018.8526963.

Luo, Y.; Xu, X. Predicting the Helpfulness of Online Restaurant Reviews Using Different Machine Learning Algorithms: A Case Study of Yelp. Sustainability 2019, 11, 5254. https://doi.org/10.3390/su11195254

Takashi Tanizaki, Tomohiro Hoshino, Takeshi Shimmura, Takeshi Takenaka, Demand forecasting in restaurants using machine learning and statistical analysis, Procedia CIRP, Volume 79, 2019, Pages 679-683, ISSN 2212-8271, https://doi.org/10.1016/j.procir.2019.02.042.

Grimaldi, D.; Collins, C.; Acosta, S.G. Dynamic Restaurants Quality Mapping Using Online User Reviews. Smart Cities 2021, 4, 1104-1112. https://doi.org/10.3390/smartcities4030058

Shi, Linxuan and Xu, Zhengtian, Dine in or Take out? Trends on Restaurant Service Demand amid the COVID-19 Pandemic (April 21, 2021). Available at SSRN: https://ssrn.com/abstract=3934589 or http://dx.doi.org/10.2139/ssrn.3934589

Final

Silas Selfe

2022-05-03