1 Introduction

The data comes in the shape of 8 relational files which are derived from two separate Japanese websites that collect user information: “Hot Pepper Gourmet (hpg): similar to Yelp” (search and reserve) and “AirREGI / Restaurant Board (air): similar to Square” (reservation control and cash register). The training data is based on the time range of Jan 2016 - most of Apr 2017, while the test set includes the last week of Apr plus May 2017.

The individual files’ details:

  • air_visit_data.csv: historical visit data for the air restaurants. This is essentially the main training data set.

  • air_reserve.csv / hpg_reserve.csv: reservations made through the air / hpg systems.

  • air_store_info.csv / hpg_store_info.csv: details about the air / hpg restaurants including genre and location.

  • store_id_relation.csv: connects the air and hpg ids.

  • date_info.csv: essentially flags the Japanese holidays.

The relationships between the tables are shown in the following graph.

This plot shows that the same restaurant has a different id in each of the two reservation systems. We can merge the two parts of the data by id for later analysis.
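A minimal sketch of how the two id systems could be bridged, assuming toy data frames that use the column names of the real files (the ids and counts below are invented for illustration). It uses base R `merge`, which is not necessarily the join used elsewhere in this analysis:

```r
# Toy stand-ins for store_id_relation and hpg_reserve (values are invented).
store_id_relation <- data.frame(
  air_store_id = c("air_A", "air_B"),
  hpg_store_id = c("hpg_X", "hpg_Y"),
  stringsAsFactors = FALSE
)
hpg_reserve <- data.frame(
  hpg_store_id     = c("hpg_X", "hpg_X", "hpg_Z"),
  reserve_visitors = c(2L, 4L, 3L),
  stringsAsFactors = FALSE
)

# Inner join: keep only hpg reservations whose store also has an air id.
hpg_with_air <- merge(hpg_reserve, store_id_relation, by = "hpg_store_id")

print(hpg_with_air)
```

In this toy example the reservation for "hpg_Z" is dropped, because that store has no matching air id.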

2 Overview: File structure and relations

2.1 air_reserve

glimpse(air_reserve)
## Observations: 92,378
## Variables: 4
## $ air_store_id     <chr> "air_877f79706adbfb06", "air_db4b38ebe7a7ceff...
## $ visit_datetime   <date> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-...
## $ reserve_datetime <date> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-...
## $ reserve_visitors <int> 1, 3, 6, 2, 5, 2, 4, 2, 2, 2, 3, 3, 2, 6, 7, ...
length(unique(air_reserve$air_store_id))
## [1] 314

There are 314 different stores. In later analysis, we could derive additional features from visit_datetime and reserve_datetime.
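As a sketch of what such feature engineering might look like, the example below derives the lead time between booking and visit plus two calendar features. It assumes raw timestamps with a time component and uses invented values; the new column names (`lead_hours`, `visit_hour`, `visit_wday`) are illustrative only:

```r
# Toy reservations (invented values); timestamps parsed as UTC for reproducibility.
res <- data.frame(
  visit_datetime   = as.POSIXct(c("2016-01-02 19:00:00", "2016-01-05 12:00:00"), tz = "UTC"),
  reserve_datetime = as.POSIXct(c("2016-01-01 16:00:00", "2016-01-05 09:00:00"), tz = "UTC")
)

# Lead time between booking and visit, in hours.
res$lead_hours <- as.numeric(difftime(res$visit_datetime, res$reserve_datetime, units = "hours"))

# Calendar features of the visit.
res$visit_hour <- as.integer(format(res$visit_datetime, "%H"))
res$visit_wday <- as.integer(format(res$visit_datetime, "%u"))  # 1 = Monday ... 7 = Sunday

print(res)
```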

2.2 air_store_info

glimpse(air_store_info)
## Observations: 829
## Variables: 5
## $ air_store_id   <chr> "air_0f0cdeee6c9bf3d7", "air_7cc17a324ae5c7dc",...
## $ air_genre_name <chr> "Italian/French", "Italian/French", "Italian/Fr...
## $ air_area_name  <chr> "Hyōgo-ken Kōbe-shi Kumoidōri", "Hyōgo-ken Kōbe...
## $ latitude       <dbl> 34.69512, 34.69512, 34.69512, 34.69512, 35.6580...
## $ longitude      <dbl> 135.1979, 135.1979, 135.1979, 135.1979, 139.751...
length(unique(air_store_info$air_store_id))
## [1] 829
length(unique(air_store_info$air_genre_name))
## [1] 14
length(unique(air_store_info$air_area_name))
## [1] 103

This file contains 829 different stores. They are divided into 14 different food genres and located in 103 areas. We will look at this information in more detail later.

2.3 air_visit_data

glimpse(air_visit_data)
## Observations: 252,108
## Variables: 3
## $ air_store_id <chr> "air_ba937bf13d40fb24", "air_ba937bf13d40fb24", "...
## $ visit_date   <date> 2016-01-13, 2016-01-14, 2016-01-15, 2016-01-16, ...
## $ visitors     <int> 25, 32, 29, 22, 6, 9, 31, 21, 18, 26, 21, 11, 24,...
length(unique(air_visit_data$air_store_id))
## [1] 829

2.4 hpg_reserve

glimpse(hpg_reserve)
## Observations: 1,048,575
## Variables: 4
## $ hpg_store_id     <chr> "hpg_c63f6f42e088e50f", "hpg_dac72789163a3f47...
## $ visit_datetime   <dttm> 2016-01-01 11:00:00, 2016-01-01 13:00:00, 20...
## $ reserve_datetime <dttm> 2016-01-01 09:00:00, 2016-01-01 06:00:00, 20...
## $ reserve_visitors <int> 1, 3, 2, 5, 13, 2, 2, 2, 2, 6, 2, 2, 2, 2, 5,...
length(unique(hpg_reserve$hpg_store_id))
## [1] 12721

This file contains 1,048,575 records and 12,721 different stores. Based on these two numbers, Hot Pepper Gourmet (hpg) contains more records and more stores than Restaurant Board (air). As with the air data, we will need to do feature engineering on reserve_datetime and visit_datetime later.
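As a rough worked example of this comparison, the counts printed in Sections 2.1 and 2.4 can be turned into reservation records per store. This is only a sketch based on the numbers in the output above; the variable names are illustrative:

```r
# Counts taken from the glimpse() and unique() outputs above.
air_records <- 92378;   air_stores <- 314
hpg_records <- 1048575; hpg_stores <- 12721

air_per_store <- air_records / air_stores
hpg_per_store <- hpg_records / hpg_stores

cat(sprintf("air: %.1f reservation records per store\n", air_per_store))
cat(sprintf("hpg: %.1f reservation records per store\n", hpg_per_store))
```

Under these counts, hpg has roughly eleven times as many reservation records as air in total, but fewer records per store.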

2.5 hpg_store_info

glimpse(hpg_store_info)
## Observations: 4,690
## Variables: 5
## $ hpg_store_id   <chr> "hpg_6622b62385aec8bf", "hpg_e9e068dd49c5fa00",...
## $ hpg_genre_name <chr> "Japanese style", "Japanese style", "Japanese s...
## $ hpg_area_name  <chr> "Tōkyō-to Setagaya-ku Taishidō", "Tōkyō-to Seta...
## $ latitude       <dbl> 35.64367, 35.64367, 35.64367, 35.64367, 35.6436...
## $ longitude      <dbl> 139.6682, 139.6682, 139.6682, 139.6682, 139.668...
length(unique(hpg_store_info$hpg_store_id))
## [1] 4690
length(unique(hpg_store_info$hpg_genre_name))
## [1] 34
length(unique(hpg_store_info$hpg_area_name))
## [1] 119

The 4,690 stores are divided into 34 different food genres and located in 119 areas.

2.6 date_info

glimpse(date_info)
## Observations: 517
## Variables: 3
## $ calendar_date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04,...
## $ day_of_week   <chr> "Friday", "Saturday", "Sunday", "Monday", "Tuesd...
## $ holiday_flg   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...

2.7 store_id_relation

glimpse(store_id_relation)
## Observations: 150
## Variables: 2
## $ air_store_id <chr> "air_63b13c56b7201bd9", "air_a24bf50c3e90d583", "...
## $ hpg_store_id <chr> "hpg_4bc649e72e2a239a", "hpg_c34b496d0305a809", "...

This is a relational file that connects the air and hpg ids. There are only 150 pairs, which is less than 20% of all air stores and a far smaller share of the hpg stores.
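The shares quoted above can be checked with the store counts printed in Sections 2.2, 2.5 and 2.7. This small sketch assumes that each pair refers to a distinct store; the variable names are illustrative:

```r
# Counts taken from the outputs above.
n_pairs      <- 150
n_air_stores <- 829    # stores in air_store_info
n_hpg_stores <- 4690   # stores in hpg_store_info

# Assumes every pair maps one distinct air store to one distinct hpg store.
share_air <- n_pairs / n_air_stores
share_hpg <- n_pairs / n_hpg_stores

cat(sprintf("air stores with an hpg id: %.1f%%\n", 100 * share_air))
cat(sprintf("hpg stores with an air id: %.1f%%\n", 100 * share_hpg))
```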

3 Individual feature visualisations

3.1 Air Reservations

The first thing we want to check is the relationship among visit_datetime, reserve_datetime and reserve_visitors. We will also transform this data further to extract more information.

p1 = air_reserve %>%
  group_by(reserve_datetime) %>%
  summarise(total_visitor = sum(reserve_visitors)) %>%
  ggplot() +
  geom_line(aes(x = reserve_datetime, y = total_visitor), size = 0.4, col = 'steelblue') +
  labs(title = "Time Series plot of visitors in Reserve Date", x = "Reserve Date", y = "Number of reserve visitors" ) + 
  scale_x_date(date_breaks = "1 month") + # set the x-axis breaks
  theme_linedraw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7)) # make the x-axis labels vertical

p1

Figure 1. The time series plot of the count of visitors in Visit Date and Reserve Date.

blank_theme <- theme_minimal()+
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    panel.border = element_blank(),
    panel.grid=element_blank(),
    axis.ticks = element_blank(),
    plot.title=element_text(size=14, face="bold")
  )
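# air_reserve_summary is not defined above; it is assumed here to mirror hpg_reserve_summary (Section 3.4)
air_reserve_summary = air_reserve %>%
  group_by(reserve_datetime) %>%
  summarise(total_visitor = sum(reserve_visitors)) %>%
  dplyr::mutate(year = year(reserve_datetime),
                month = month(reserve_datetime),
                week = wday(reserve_datetime))

air_year_reserve = air_reserve_summary %>%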
  select(year, total_visitor) %>% 
  group_by(year) %>%
  summarise(total = sum(total_visitor)) %>%
  data.frame() 

air_year_reserve$year = factor(air_year_reserve$year)

pie_plot = air_year_reserve %>%
  ggplot(aes(x = '', y = total, fill = year)) +
  geom_bar(width = 1, stat = 'identity') + 
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Blues", guide = guide_legend(title = "Year")) + 
  geom_text(aes(y = total/3 + c(0, cumsum(total)[-length(total)]), 
                label = percent(air_year_reserve$total/sum(air_year_reserve$total)))) +
  blank_theme +
  labs(title = "The ratio of the number of visitors in each year")

Figure 2. The distribution for reserve visitors based on month and week days.

pie_plot

Figure 3. The pie plot for the count of visitors in different years.

We find:

  • Based on Figure 1, the number of reserved visitors in 2016 is much lower than in 2017, and 2017 shows a clear seasonal trend. A possible reason for this increase is that the system collected information from more stores starting in 2017. Note also that from 2016-08 to 2016-11 the number of visitors is 0 by visit date and close to 0 by reserve date. In real life this is implausible. We suspect that the reservation system underwent maintenance or an update during this period and that this part of the data was lost.

  • Based on Figure 2, the visitor counts per month follow a “U” shape. Setting aside the potential system issue, we might conclude that users of this system prefer going to restaurants outside of summer. Across the days of the week, however, the counts are fairly even; only Monday and Sunday are clearly lower than the other days.

  • Based on Figure 3, the total numbers of visitors in 2016 and 2017 are comparable.
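A gap like the one noted for Figure 1 could be located programmatically. The sketch below uses invented daily totals and assumes a single contiguous gap in which days without reservations are simply absent from the data; the names `daily`, `all_days` and `missing_days` are illustrative:

```r
# Toy daily reservation totals (invented); days without reservations are absent.
daily <- data.frame(
  reserve_date  = as.Date(c("2016-07-30", "2016-07-31", "2016-11-01", "2016-11-02")),
  total_visitor = c(12L, 8L, 5L, 9L)
)

# Build the full calendar and keep the days that have no record at all.
all_days     <- seq(min(daily$reserve_date), max(daily$reserve_date), by = "day")
missing_days <- all_days[!all_days %in% daily$reserve_date]

# Summarise the gap (assumes one contiguous gap): first day, last day, length.
gap <- c(from = format(min(missing_days)),
         to   = format(max(missing_days)),
         days = length(missing_days))
print(gap)
```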

3.2 Air Visits

lay = rbind(c(1,1,1,1),
            c(2,3,4,4))

grid.arrange(p1, p2, p3, p4, layout_matrix = lay)

Figure 4.

We find:

  • There is an interesting long-term step structure in the overall time series. This might be related to new restaurants being added to the database. In addition, we already see a periodic pattern that most likely corresponds to a weekly cycle.

  • The number of guests per visit per restaurant per day peaks at 19 (Dec.). The distribution extends up to 100 and, in rare cases, beyond.

  • Friday and the weekend appear to be the most popular days, which is to be expected. Monday and Tuesday have the lowest numbers of average visitors.

  • Also during the year there is a certain amount of variation. Dec appears to be the most popular month for restaurant visits. The period of Mar - May is consistently busy.

We will be forecasting for the last week of April plus May 2017, so let’s look at this time range in our 2016 training data:

p5

Figure 5.

Here, the black line is the data and the blue line corresponds to a smoothing fit with a corresponding grey confidence area. We see again the weekly period and also the impact of Golden Week, which in 2016 happened between Apr 29 and May 5.

# combine date information
new_air_visit =  air_visit_data %>%
  left_join(date_info, by=c("visit_date"="calendar_date"))

p1 = new_air_visit %>%
  group_by(holiday_flg) %>%
  summarise(total_visitor = sum(visitors)) %>%
  data.frame() %>%
  ggplot(aes(x = '', y = total_visitor, fill = factor(holiday_flg))) +
  geom_bar(width = 1, stat = 'identity') + 
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Blues", guide = guide_legend(title = "Holiday or Not")) + 
  geom_text(aes(y = total_visitor/2 + c(0, cumsum(total_visitor)[-length(total_visitor)]), 
                label = percent(total_visitor/sum(total_visitor)))) +
  blank_theme +
  labs(title = "The ratio of the number of visitors: holiday vs non-holiday")

week_day = c("Sunday", 'Monday', "Tuesday", 'Wednesday', 'Thursday', 'Friday', 'Saturday')

holiday_air_visit = new_air_visit %>%
  filter(holiday_flg == 1) %>%
  select(visit_date, visitors, day_of_week) %>%
  group_by(day_of_week) %>%
  summarise(total = sum(visitors)) 

holiday_air_visit$day_of_week = factor(holiday_air_visit$day_of_week, levels = week_day)

p2 = holiday_air_visit %>%
  arrange(day_of_week) %>%
  ggplot(aes(x = day_of_week, y = total)) + 
  geom_line(group = 1, col = 'steelblue', size = 0.4) +
  geom_text(aes(label = total), position=position_dodge(width=0.9), vjust=0.5) +
  labs(title = "Visitors in holidays", x = '') +
  theme_linedraw()

Not_holiday_air_visit = new_air_visit %>%
  filter(holiday_flg == 0) %>%
  select(visit_date, visitors, day_of_week) %>%
  group_by(day_of_week) %>%
  summarise(total = sum(visitors)) 

Not_holiday_air_visit$day_of_week = factor(Not_holiday_air_visit$day_of_week, levels = week_day)

p3 = Not_holiday_air_visit %>%
  arrange(day_of_week) %>%
  ggplot(aes(x = day_of_week, y = total)) +
  geom_line(group = 1, col = 'steelblue', size = 0.4) +
  geom_text(aes(label = total), position=position_dodge(width=0.9), vjust=0.5) +
  labs(title = 'Visitors in Non_holidays', x = '') +
  theme_linedraw()
p1

Figure 6.

grid.arrange(p2, p3)

Figure 7.

We find:

  • In this data set, most records come from non-holidays; only a small part comes from holidays.
  • In Figure 7, holidays show no clear trend or pattern, and the peak falls on Monday. On non-holidays, however, there is an increasing trend from Sunday to Saturday. In real life, people tend to go to restaurants more as the weekend approaches.
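Because there are far fewer holidays than ordinary days, totals such as those in Figures 6 and 7 are partly driven by the number of days in each group. A minimal sketch of the difference between totals and per-day means, using invented values and the column names of the joined data:

```r
# Toy daily visitor totals with a holiday flag (invented values).
visits <- data.frame(
  visit_date  = as.Date("2016-01-01") + 0:5,
  holiday_flg = c(1L, 0L, 0L, 0L, 0L, 0L),
  visitors    = c(30L, 20L, 22L, 18L, 25L, 15L)
)

# Totals are dominated by non-holidays simply because there are more of them.
totals <- tapply(visits$visitors, visits$holiday_flg, sum)

# The mean per day puts both groups on the same footing.
per_day <- tapply(visits$visitors, visits$holiday_flg, mean)

print(rbind(total = totals, mean_per_day = per_day))
```

In this toy example the non-holiday total is larger, while the per-day mean is larger for the holiday.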

3.3 Air store information

leaflet(air_store_info) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude,
             popup = ~air_store_id, label = ~air_genre_name,
             clusterOptions = markerClusterOptions())

Figure 8. This is an interactive map, click it for more information.

We find:

  • Most of the restaurants are located around big cities in Japan, such as Tokyo, Osaka and Fukuoka.

p1 = air_store_info %>%
  group_by(air_genre_name) %>%
  count() %>%
  ggplot(aes(reorder(air_genre_name, n, FUN = min), n)) +
  geom_col(fill = 'steelblue') + 
  geom_text(aes(label = n), position=position_dodge(width=0.9), hjust=-0.25) +
  labs(y = "Number of Restaurants in Air System", x = '') +
  coord_flip() + 
  theme_linedraw()
p2 = air_store_info %>%
  group_by(air_area_name) %>%
  count() %>%
  ungroup() %>%
  top_n(15, n) %>%
  ggplot(aes(reorder(air_area_name, n, FUN = min), n)) + 
  geom_col(fill = 'steelblue') + 
  geom_text(aes(label = n), position = position_dodge(width = 0.9), hjust = -0.25) +
  labs(y = 'Number of Restaurants in Air System',
       x = '',
       title = 'TOP 15 areas') + 
  coord_flip() +
  theme_linedraw()
p1

Figure 9.

p2

Figure 10.

3.4 HPG Reservations

hpg_reserve$reserve_datetime = as.Date(hpg_reserve$reserve_datetime)
hpg_reserve$visit_datetime = as.Date(hpg_reserve$visit_datetime)

hpg_data = hpg_reserve %>%
  group_by(reserve_datetime) %>%
  summarise(total_visitor = sum(reserve_visitors)) 

p1 = hpg_data %>%
  ggplot() +
  geom_line(aes(x = reserve_datetime, y = total_visitor), size = 0.4, col = 'steelblue') +
  labs(title = "Time Series plot of visitors in Reserve Date", x = "Reserve Date", y = "Number of reserve visitors" ) + 
  scale_x_date(date_breaks = "1 month") + # set the x-axis breaks
  theme_linedraw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7)) # make the x-axis labels vertical

p1

Figure 11.

hpg_reserve_summary = hpg_reserve %>%
  select(reserve_datetime, reserve_visitors) %>%
  group_by(reserve_datetime) %>%
  summarise(total_visitor = sum(reserve_visitors)) %>%
  dplyr::mutate(year = year(reserve_datetime),
                month = month(reserve_datetime),
                week = wday(reserve_datetime))

hpg_month_reserve = hpg_reserve_summary %>%
  select(month, total_visitor) %>% 
  group_by(month) %>%
  summarise(total = sum(total_visitor)) %>%
  data.frame() 

hpg_month_reserve$month = factor(hpg_month_reserve$month)

p1 = hpg_month_reserve %>%
  ggplot(aes(x = month, y = total)) +
  geom_bar(stat = 'identity',fill = "steelblue") + 
  theme_linedraw() + 
  geom_text(aes(label = hpg_month_reserve$total), position=position_dodge(width=0.9), vjust=0.5) +
  labs(title = "The number of reserved visitors in each month")

# week
hpg_week_reserve = hpg_reserve_summary %>%
  select(week, total_visitor) %>% 
  group_by(week) %>%
  summarise(total = sum(total_visitor)) %>%
  data.frame() 

hpg_week_reserve$week = factor(hpg_week_reserve$week)

p2 = hpg_week_reserve %>%
  ggplot(aes(x = week, y = total)) +
  geom_bar(stat = 'identity',fill = "steelblue") + 
  theme_linedraw() + 
  geom_text(aes(label = hpg_week_reserve$total), position=position_dodge(width=0.9), vjust=0.5) +
  labs(title = "The number of reserved visitors on each day of the week")
grid.arrange(p1, p2, nrow = 2)

Figure 12.

3.5 HPG store information

leaflet(hpg_store_info) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude,
             popup = ~hpg_store_id, label = ~hpg_genre_name,
             clusterOptions = markerClusterOptions())

Figure 13.

hpg_store_info %>%
  group_by(hpg_genre_name) %>%
  count() %>%
  ggplot(aes(reorder(hpg_genre_name, n, FUN = min), n)) +
  geom_col(fill = 'steelblue') + 
  geom_text(aes(label = n), position=position_dodge(width=0.9), hjust=-0.25) +
  labs(y = "Number of Restaurants in HPG  System", x = '') +
  coord_flip() + 
  theme_linedraw()

Figure 14.

hpg_store_info %>%
  group_by(hpg_area_name) %>%
  count() %>%
  ungroup() %>%
  top_n(15, n) %>%
  ggplot(aes(reorder(hpg_area_name, n, FUN = min), n)) + 
  geom_col(fill = 'steelblue') + 
  geom_text(aes(label = n), position = position_dodge(width = 0.9), hjust = -0.25) +
  labs(y = 'Number of Restaurants in HPG System',
       x = '',
       title = 'TOP 15 areas') + 
  coord_flip() +
  theme_linedraw()

Figure 15.

3.6 Holiday information

p1 = date_info %>%
  ggplot(aes(factor(holiday_flg), fill = factor(holiday_flg))) + 
  geom_bar() + 
  labs(x = '') + 
  theme_linedraw() +
  theme(legend.position = 'none')
p2 = date_info %>%
  mutate( Year = year(calendar_date)) %>%
  filter(Year == 2016) %>%
  ggplot(aes(x = calendar_date, y = factor(holiday_flg), color = factor(holiday_flg))) +
  geom_point(size = 0.5) + 
  labs(x = 'Date', y = 'Holiday or not', title = 'Scatter point for Holiday distribution in 2016') +
  theme_linedraw() +
  theme(legend.position = 'none')
p3 = date_info %>%
  mutate( Year = year(calendar_date)) %>%
  filter(Year == 2017) %>%
  ggplot(aes(x = calendar_date, y = factor(holiday_flg), color = factor(holiday_flg))) +
  geom_point(size = 0.5) + 
  labs(x = 'Date', y = 'Holiday or not', title = 'Scatter point for Holiday distribution in 2017') +
  theme_linedraw() +
  theme(legend.position = 'none') 
lay = rbind(c(1,2,2),
            c(1,3,3))

grid.arrange(p1, p2, p3, layout_matrix = lay)

Figure 16.

4 Feature relations

4.1 Air visitors per genre

vis_data =  air_visit_data %>%
  left_join(air_store_info, by = 'air_store_id')
vis_data %>%
  group_by(visit_date, air_genre_name) %>%
  summarise(mean_visitors = mean(visitors)) %>%
  ungroup() %>%
  ggplot(aes(visit_date, mean_visitors, color = air_genre_name)) +
  geom_line() +
  labs(y = "Average number of visitors in Air system", x = "Date") +
  theme(legend.position = "none") +
  scale_y_log10() +
  facet_wrap(~ air_genre_name) + 
  theme_linedraw() + 
  theme(legend.position = 'none')

Figure 17.

Note: For this plot, we combine air_visit_data and air_store_info by air_store_id.

We find:

  • There are 14 different genres in total; we use a logarithmic y-axis to make the plot clearer.
  • Not all genres have data going back to 2016-01.
  • The mean for each genre lies between 10 and 100 visitors per day. Most genres show no clear trend; only Karaoke/party changes dramatically.

Having seen how the different food genres develop over time, we now explore them in more detail.

p1 = vis_data %>%
  mutate(wday = wday(visit_date, label = TRUE)) %>%
  group_by(wday, air_genre_name) %>%
  summarise(mean_visitors = mean(visitors)) %>%
  ggplot(aes(air_genre_name, mean_visitors, color = wday)) +
  geom_point(size = 4) +
  coord_flip() +
  scale_x_discrete(position = "top") +
  ggtitle("air_genre_name") +
  scale_color_hue() + 
  theme_linedraw() +
  theme(legend.position = "left", axis.text.y = element_blank(),
        plot.title = element_text(size = 14)) +
  labs(y = 'Mean Visitors', x = ' ', title ='')

p2 = vis_data %>%
  ggplot(aes(visitors, air_genre_name, fill = air_genre_name)) +
  geom_density_ridges(bandwidth = 0.1, alpha = 0.8) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(y = "") +
  scale_fill_cyclical(values = c("#1FBFC3", "#F57670")) +
  theme_linedraw()
grid.arrange(p1, p2, ncol = 2)

Figure 18.

Note: These two plots use the same y-axis labels. For the density plot we use a logarithmic x-axis.

We find:

  • The biggest difference between weekend and weekdays exists for the “Karaoke” bars, which rule the weekend. A similar trend, although with a considerably smaller gap, can be seen for the “International” cuisine.

  • No genre really goes against the trend of busier weekends. The smallest variations are in the generic “Other” category, the “Japanese” food, and also the “Korean” cuisine, which is the only category where Fridays are the busiest days. General “Bars/Cocktail” are notably unpopular overall.

  • The density curves confirm the impression we got from the week-day distribution: the “Asian” restaurants rarely have fewer than 10 visitors per date, and the “Karaoke” places show a very broad distribution due to the strong impact of the weekends.

  • The Asian food is pretty popular!

4.2 Holiday

We would like to know the influence of holidays on our visitor numbers by comparing the statistics for days with holidays vs those without holiday flag:

vis_data = air_visit_data %>%
  left_join(date_info, by = c("visit_date" = "calendar_date"))


p1 = vis_data %>%
  ggplot(aes(x = factor(holiday_flg), visitors, color = holiday_flg)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "Holiday or not") +
  theme_linedraw() +
  theme(legend.position = "none")

p2 = vis_data %>%
  group_by(day_of_week, holiday_flg) %>%
  summarise(mean_visitors = mean(visitors)) %>%
  ggplot(aes(day_of_week, mean_visitors, color = factor(holiday_flg))) +
  geom_point(size = 4) +
  theme(legend.position = "none") +
  labs(y = "Average number of visitors", x = ' ') + 
  theme_linedraw() +
  theme(legend.position = 'none', axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))
grid.arrange(p1, p2, ncol = 2)

Figure 19.

Note: In the boxplot, 0 means not a holiday and 1 means holiday, and the y-axis is logarithmic. In the dot plot, blue dots mean holiday and red dots mean not a holiday.

We find:

  • The distributions for holidays and non-holidays are almost the same, so it seems that the holiday flag does not have any influence. We will check this in more detail later.

  • While a weekend holiday has little impact on the visitor numbers, and even decreases them slightly, there is a much more pronounced effect for the weekdays; especially Monday and Tuesday.

4.3 Restaurants per area and the effect on visitor numbers

Our next exploration follows from a simple thought: if gastropubs are popular and we own the only gastropub in the area then we can expect lots of customers. If there are twelve other gastropubs in our street then, try as we might, some of those customers will venture into other establishments. Economists tell us that we can ultimately expect a convergence towards an equilibrium between supply and demand. But for snapshots like our data set, and for relatively localised areas, there might still be merit in investigating restaurant clustering. Therefore, let’s study the number of restaurants of a certain genre per area and their impact on visitor numbers.

We begin with an overview plot of the frequency of certain genres per area for the two data sets of air and hpg stores. The following count plots show which genres exist in which areas (names truncated). The size of the dots is proportional to the number of cases:

air_store_info %>%
  mutate(area = str_sub(air_area_name, 1, 12)) %>%
  ggplot(aes(area, air_genre_name)) +
  geom_count(colour = "#1FBFC3", alpha = 0.8) +
  labs(y = 'Air genre name') +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.text.x  = element_text(angle=90, hjust=1, vjust=0.9))

Figure 20.

We find:

  • Some areas have lots of restaurants and much variety, whereas others contain only a single air restaurant. Large parts of the parameter space are empty.

  • Similarly, some cuisines like “Izakaya” or “Cafe” are pretty ubiquitous, whereas others can only be found in a few areas. Note that the only 2 Karaoke bars in the air sample are in “Hokkaido Sapporo-shi Minami 3 Jonishi”, whereas the only 2 “International cuisine” restaurants as well as the only two “Asian” places can be found in “Tokyo-to Shibuya-ku Shibuya”.

The same kind of plot for the hpg data looks similar albeit more busy due to the larger number of genres:

hpg_store_info %>%
  mutate(area = str_sub(hpg_area_name, 1, 10)) %>%
  ggplot(aes(area, hpg_genre_name)) +
  geom_count(colour = "#F57670") +
  labs(y = 'HPG genre name') +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.text.x  = element_text(angle=90, hjust=1, vjust=0.9))

Figure 21.

We find:

  • Also here there are busy areas and those with only a few restaurants. Unsurprisingly, Tokyo features prominently in the areas with a lot of culinary diversity.

  • “Japanese style” and “International cuisine” are popular pretty much everywhere. “Amusement bars” and “Udon/Soba” places are rare, as are “Shanghai food” or “Dim Sum”.

The count plots tell us that there is a distribution of how many restaurants of a certain genre can be found per area. Here we look at these distributions in detail via box plots with overlaid jitter plots. The genres are ordered by decreasing mean cases per area, i.e. the mean of a horizontal sequence of dots in a count plot. Then we overlay the individual data points and assign each dot a random jitter to visually separate otherwise overlapping data. Here, the y-axis (i.e. “Occurrences per area”) corresponds to the size of the dots in the count plots above. We’re using single plots here, instead of panels, because these plots are quite detailed. Note the logarithmic y-axes.
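The counting and ordering described above can be sketched in base R as follows. The store table is invented and only reuses the column names of air_store_info; `counts` and `genre_means` are illustrative names:

```r
# Toy store table (invented values) with the column names of air_store_info.
stores <- data.frame(
  air_genre_name = c("Cafe", "Cafe", "Cafe", "Izakaya", "Izakaya", "Cafe"),
  air_area_name  = c("Area 1", "Area 1", "Area 2", "Area 1", "Area 2", "Area 1"),
  stringsAsFactors = FALSE
)

# Number of restaurants per (genre, area) pair: one dot in the count plot.
counts <- aggregate(n ~ air_genre_name + air_area_name,
                    data = transform(stores, n = 1L), FUN = sum)

# Mean occurrences per area for each genre, sorted in decreasing order.
genre_means <- tapply(counts$n, counts$air_genre_name, mean)
genre_means <- genre_means[order(genre_means, decreasing = TRUE)]

print(counts)
print(genre_means)
```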

We start with the air data:

air_store_info %>%
  group_by(air_genre_name, air_area_name) %>%
  count() %>%
  ggplot(aes(reorder(air_genre_name, n, FUN = mean), n)) +
  geom_jitter(color = "steelblue", size = 1) +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Air genre", y = "Occurrences per air area") + 
  theme_linedraw()

Figure 22.

We find:

  • Only a few genres have medians of more than 2 restaurants per area. Examples are “Italian/French” restaurants or “Bar/Cocktail” places, which are more likely to be found in groups of more than 2 per area.

  • For the majority of genres the distribution is firmly clustered around 2 cases per area with a bit of scatter towards higher numbers. “Cafes” have the highest number with 26 occurrences in a single area (Fukuoka-ken Fukuoka-shi Daimyō).

Now we look at the same distribution for the HPG restaurants:

vis_data <- hpg_store_info %>%
  group_by(hpg_genre_name, hpg_area_name) %>%
  count()

vis_data %>%
  ggplot(aes(reorder(hpg_genre_name, n, FUN = mean), n)) +
  geom_jitter(color = "#F57670") +
  scale_y_log10() +
  coord_flip() +
  labs(x = "hpg genre", y = "Cases per hpg area") +
  theme_linedraw()

Figure 23.

We find:

  • Here we clearly have a minimum of 1 genre per area, and also much more variety in median cases due to the higher overall numbers.

  • The most extreme genre is “Japanese style”, for which the median is just above 10 restaurants per area. Alongside this, there are a number of other genres for which the lower box hinge is not touching the minimum of 1 case per area.

4.4 Reservations vs Visits

Next we will turn our attention to the reservation numbers in the air_reserve and hpg_reserve data sets. We have seen their time series and distributions back in Sections 3.1 and 3.4; now we will compare the reservation numbers to the actual visitor numbers.

For this, we compute the sum of reserve_visitors per day (i.e. the number of people reservations were made for) for each restaurant id and then join these summaries to the air_visit_data file. In order to include the hpg reservations we need to use the store_id_relation data to join the hpg_store_ids from the hpg_reserve file to the corresponding air_store_ids:

foo <- air_reserve %>%
  mutate(visit_date = date(visit_datetime)) %>%
  group_by(air_store_id,visit_date) %>%
  summarise(reserve_visitors_air = sum(reserve_visitors))

bar <- hpg_reserve %>%
  mutate(visit_date = date(visit_datetime)) %>%
  group_by(hpg_store_id,visit_date) %>%
  summarise(reserve_visitors_hpg = sum(reserve_visitors)) %>%
  inner_join(store_id_relation, by = "hpg_store_id")

all_reserve <- air_visit_data %>%
  inner_join(foo, by = c("air_store_id", "visit_date")) %>%
  inner_join(bar, by = c("air_store_id", "visit_date")) %>%
  mutate(reserve_visitors = reserve_visitors_air + reserve_visitors_hpg)

p <- all_reserve %>%
  filter(reserve_visitors < 120) %>%
  ggplot(aes(reserve_visitors, visitors)) +
  geom_point(color = "black", alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "grey60") +
  geom_smooth(method = "lm", color = "steelblue") + 
  theme_linedraw()

ggMarginal(p, type="histogram", fill = "steelblue", bins=50)

Figure 24.

We find:

  • The histograms show that the reserve_visitors and visitors numbers peak below ~20 and are largely confined to the range below 100.

  • The scatter points fall largely above the line of identity, indicating that there were more visitors that day than had reserved a table. This is not surprising, since a certain number of people will always be walk-in customers.

  • A notable fraction of the points is below the line, which probably indicates that some people made a reservation but changed their mind and didn’t go. That kind of effect is probably to be expected and taking it into account will be one of the challenges in this competition.

  • The linear fit suggests a trend in which larger numbers of reserve_visitors are more likely to overestimate the eventual visitor numbers. This is not surprising either, since I can imagine that it is more likely that (a) a large reservation is cancelled than (b) a large group of people walk into a restaurant without a reservation.
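The position of the points relative to the line of identity and the slope of the linear fit can both be quantified. The sketch below uses invented counts rather than the real all_reserve data; `share_above` and `slope` are illustrative names:

```r
# Toy reservation and visitor counts (invented values).
d <- data.frame(
  reserve_visitors = c(2, 5, 10, 20, 40, 80),
  visitors         = c(12, 14, 18, 25, 38, 60)
)

# Share of days on which more people visited than had reserved.
share_above <- mean(d$visitors > d$reserve_visitors)

# Slope of the least-squares line through the points.
fit   <- lm(visitors ~ reserve_visitors, data = d)
slope <- unname(coef(fit)["reserve_visitors"])

cat(sprintf("share above identity: %.2f, slope: %.2f\n", share_above, slope))
```

A slope below 1 combined with a positive intercept means the fitted line starts above the line of identity and falls below it for sufficiently large reservation counts.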

Now we will break down the discrepancy between visitors - reserve_visitors over time, look at the overall histograms, and visualise the air_reserve vs hpg_reserve numbers separately. Here, the time series for air (blue) and hpg (red) are offset vertically by 150 and -250 (see the solid black baselines):

p1 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors)) +
  geom_histogram(binwidth = 5, fill = 'black', color = "grey60") +
  labs(x = "") +
  theme_linedraw()

p2 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors_air)) +
  geom_histogram(binwidth = 5, fill = "#1FBFC3", color = 'grey60') +
  labs(x = "") + 
  theme_linedraw()

p3 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors_hpg)) +
  geom_histogram(binwidth = 5, fill = "#F57670", color = 'grey60') +
  labs(x = "") +
  theme_linedraw()

p4 <- all_reserve %>%
  ggplot(aes(visit_date, visitors - reserve_visitors)) +
  geom_hline(yintercept = c(150, 0, -250)) +
  geom_line(color = 'black') +
  geom_line(aes(visit_date, visitors - reserve_visitors_air + 150), color = "#1FBFC3") +
  geom_line(aes(visit_date, visitors - reserve_visitors_hpg - 250), color = "#F57670") +
  ggtitle("Visitors - Reserved: all (black), air (blue), hpg (red)") +
  theme_linedraw()

lay = rbind(c(1, 2, 3),
            c(4, 4, 4))

grid.arrange(p1, p2, p3, p4, layout_matrix = lay)

Figure 25.

We find:

  • The time series show significant scatter throughout the training time range. While the air (blue) and hpg (red) curves are predominantly above the baseline (i.e. more visitors than reservations), combining the two data sets brings the mean of the distribution closer to the zero line. This can also be seen in the corresponding histograms in the top row.

  • We see the gap where there are no air reservations (compare Sect. 3.1). We could only look at the hpg reservations here (for which this gap does not exist, Sect. 3.4) but it appears safe to assume that they would follow the same trend and can be used as a proxy for the air reservations. Feel free to check this assumption for the gap.

  • The histograms in the 3 top panels correspond to the three time series in the bottom panel (all, air and hpg, in the same colours). They demonstrate how much the distributions are skewed towards larger visitor numbers than reserve_visitors numbers. We might see a mix here between two distributions: a (probably normal) spread due to cancellations plus a tail from walk-in visitors, which should follow a Poisson distribution.
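The suggested mixture can be illustrated with a small simulation. This is only a sketch: the sample size, the standard deviation of the cancellation spread and the Poisson rate of walk-ins are arbitrary assumptions, not values estimated from the data:

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 10000

# Assumed parameters, chosen purely for illustration.
cancellation_spread <- rnorm(n, mean = 0, sd = 5)  # symmetric spread around zero
walk_ins            <- rpois(n, lambda = 10)       # non-negative walk-in counts

# Simulated "visitors - reserve_visitors".
difference <- cancellation_spread + walk_ins

# The walk-ins shift the distribution towards positive values.
cat(sprintf("mean: %.2f, share > 0: %.2f\n", mean(difference), mean(difference > 0)))
```

Calling `hist(difference, breaks = 50)` on the result gives a histogram that can be compared by eye with the ones in Figure 25.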