601_HW5

Sheng Zhang
2022/1/9

Hotel Bookings

Read in the hotel_bookings dataset from the sample data and view its structure.

library(ggplot2)
library(dplyr)
hb <- read.csv("D:/601_workspace/hotel_bookings.csv")
dim(hb)
[1] 119390     32
head(hb)
         hotel is_canceled lead_time arrival_date_year
1 Resort Hotel           0       342              2015
2 Resort Hotel           0       737              2015
3 Resort Hotel           0         7              2015
4 Resort Hotel           0        13              2015
5 Resort Hotel           0        14              2015
6 Resort Hotel           0        14              2015
  arrival_date_month arrival_date_week_number
1               July                       27
2               July                       27
3               July                       27
4               July                       27
5               July                       27
6               July                       27
  arrival_date_day_of_month stays_in_weekend_nights
1                         1                       0
2                         1                       0
3                         1                       0
4                         1                       0
5                         1                       0
6                         1                       0
  stays_in_week_nights adults children babies meal country
1                    0      2        0      0   BB     PRT
2                    0      2        0      0   BB     PRT
3                    1      1        0      0   BB     GBR
4                    1      1        0      0   BB     GBR
5                    2      2        0      0   BB     GBR
6                    2      2        0      0   BB     GBR
  market_segment distribution_channel is_repeated_guest
1         Direct               Direct                 0
2         Direct               Direct                 0
3         Direct               Direct                 0
4      Corporate            Corporate                 0
5      Online TA                TA/TO                 0
6      Online TA                TA/TO                 0
  previous_cancellations previous_bookings_not_canceled
1                      0                              0
2                      0                              0
3                      0                              0
4                      0                              0
5                      0                              0
6                      0                              0
  reserved_room_type assigned_room_type booking_changes deposit_type
1                  C                  C               3   No Deposit
2                  C                  C               4   No Deposit
3                  A                  C               0   No Deposit
4                  A                  A               0   No Deposit
5                  A                  A               0   No Deposit
6                  A                  A               0   No Deposit
  agent company days_in_waiting_list customer_type adr
1  NULL    NULL                    0     Transient   0
2  NULL    NULL                    0     Transient   0
3  NULL    NULL                    0     Transient  75
4   304    NULL                    0     Transient  75
5   240    NULL                    0     Transient  98
6   240    NULL                    0     Transient  98
  required_car_parking_spaces total_of_special_requests
1                           0                         0
2                           0                         0
3                           0                         0
4                           0                         0
5                           0                         1
6                           0                         1
  reservation_status reservation_status_date
1          Check-Out              2015-07-01
2          Check-Out              2015-07-01
3          Check-Out              2015-07-02
4          Check-Out              2015-07-02
5          Check-Out              2015-07-03
6          Check-Out              2015-07-03

Descriptive Statistics

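We may be concerned with the relevant numerical variables for each booking across the different years. The three summaries below give the median, mean, and standard deviation of each variable by arrival year.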

hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(median, na.rm=TRUE)
# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015           0        56                 0      2
2              2016           0        68                 0      2
3              2017           0        80                 0      2
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>
hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(mean, na.rm=TRUE)
# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015       0.370      97.2            0.0291   1.83
2              2016       0.359     103.             0.0314   1.85
3              2017       0.387     109.             0.0342   1.88
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>
hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(sd, na.rm=TRUE)
# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015       0.483      105.             0.168  0.851
2              2016       0.480      107.             0.174  0.498
3              2017       0.487      108.             0.182  0.496
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>
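The three summaries above repeat the same pipeline with a different function each time. As a minimal sketch, assuming dplyr 1.0.0 or later, the same statistics could be computed in one call with across(). The data frame toy and its values are hypothetical stand-ins for hb, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for hb (not the real bookings)
toy <- data.frame(
  arrival_date_year = c(2015, 2015, 2016, 2016, 2016),
  lead_time         = c(10, 30, 5, 50, 95),
  adults            = c(2, 1, 2, 2, 3)
)

# One pass: median, mean, and sd of every selected column, per year.
# Result columns are named <column>_<statistic>, e.g. lead_time_median.
yearly_stats <- toy %>%
  group_by(arrival_date_year) %>%
  summarise(
    across(
      c(lead_time, adults),
      list(
        median = ~ median(.x, na.rm = TRUE),
        mean   = ~ mean(.x, na.rm = TRUE),
        sd     = ~ sd(.x, na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )

print(yearly_stats)
```

With the real data, toy would be replaced by hb and the column list by the eight variables selected above.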

Next is a summary of frequencies for the nominal (categorical) variables. Because there are many nominal variables, I will give only one example. We can use the count() function to calculate the frequency of each level of a nominal variable.

count(hb,meal)
       meal     n
1        BB 92310
2        FB   798
3        HB 14463
4        SC 10650
5 Undefined  1169
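Raw counts are easier to compare when they are also shown as shares of the total. The sketch below adds a share column to the output of count(). The data frame toy is a hypothetical stand-in for hb, and the column name share is my own choice.

```r
library(dplyr)

# Hypothetical toy data standing in for hb
toy <- data.frame(
  meal = c("BB", "BB", "BB", "HB", "SC"),
  stringsAsFactors = FALSE
)

# Count each meal type, most frequent first, then add its share of all rows
meal_freq <- toy %>%
  count(meal, sort = TRUE) %>%
  mutate(share = n / sum(n))

print(meal_freq)
```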

Visualization

The plots below are two univariate plots.

  1. This plot focuses on the number of adults in each booking. It can be used to plan the hotel's room arrangement.

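# Histogram of adults per booking; binwidth = 1 gives one bar per whole number
adults_hist <- ggplot(data = hb, aes(x = adults)) +
  xlim(0, 5) +  # restrict the x-axis to 0-5 adults
  geom_histogram(binwidth = 1)
adults_hist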

As a result, we can arrange rooms for 1, 2, and 3 people in a ratio of about 3:12:1. One problem with the earlier version of this plot was the x-axis: it was a little hard to tell the different x values apart. In Homework 5 I used xlim() to solve this problem.
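Another way to make the x-axis readable is to put a tick on every whole number and zoom with coord_cartesian(). As far as I understand the ggplot2 documentation, xlim() sets scale limits and can drop data outside the range, while coord_cartesian() only zooms the view. This is a sketch, not the plot used above, and toy is a hypothetical stand-in for hb.

```r
library(ggplot2)

# Hypothetical toy data standing in for hb
toy <- data.frame(adults = c(1, 2, 2, 2, 2, 3, 1, 2, 0, 4))

adults_bar <- ggplot(toy, aes(x = adults)) +
  geom_bar() +                            # one bar per distinct adult count
  scale_x_continuous(breaks = 0:5) +      # a tick label on every whole number
  coord_cartesian(xlim = c(-0.5, 5.5)) +  # zoom the view without removing rows
  labs(x = "Adults", y = "Bookings")

adults_bar
```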

  2. This plot shows meal frequency. We can use it to plan the meals.

meal_bar <- ggplot(data=hb, aes(x=meal, fill=meal))+
  geom_bar()+
  labs(x="Meal Type", y="Frequency", title="Meal Frequency")
meal_bar

As a result, we should prepare more BB meals than any of the other meal types.

  3. Advanced plot

I drew another plot, which shows adults per booking separated by year. I used facet_wrap() to make one panel per year and fill to color each bar by hotel.

# Adults per booking, colored by hotel and split into one panel per arrival year
adults_by_year <- ggplot(data = hb, aes(x = adults, fill = hotel)) +
  xlim(0, 5) +
  geom_histogram(binwidth = 1) +
  facet_wrap(vars(arrival_date_year), scales = "free_y") +  # each year gets its own y scale
  labs(x = "Adults", y = "Bookings", title = "Adults per Booking, Separated by Year")
adults_by_year
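The counts behind a faceted plot like this can also be tabulated directly, which makes it easier to compare years and hotels by number rather than by eye. A minimal sketch, where toy is a hypothetical stand-in for hb and the column name bookings is my own choice:

```r
library(dplyr)

# Hypothetical toy data standing in for hb
toy <- data.frame(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel",
            "City Hotel", "Resort Hotel"),
  arrival_date_year = c(2015, 2016, 2016, 2016, 2017),
  stringsAsFactors = FALSE
)

# Bookings for every year and hotel combination
bookings_by_year_hotel <- count(toy, arrival_date_year, hotel, name = "bookings")
print(bookings_by_year_hotel)

# Overall bookings per hotel, for comparing the two hotel types
hotel_totals <- count(toy, hotel, name = "bookings")
print(hotel_totals)
```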

I will answer the questions here:

  1. What is missing in your analysis process so far? I think nothing is missing from my process.

  2. What conclusions can you make about your research questions at this point? Most bookings have 2 adults, and the ratio of bookings with 1, 2, and 3 adults is about 3:12:1. There are many more bookings in 2016 than in 2015 or 2017. The City Hotel has about 3 times as many bookings as the Resort Hotel.

  3. What do you think a naive reader would need to fully understand your graphs? I think we need to label the x-axis and y-axis. We also need a brief and clear title. It also helps to use different colors to separate different groups of data.

  4. Is there anything you want to answer with your dataset, but can’t? Generally speaking, I want to separate the countries into several groups, but I don’t know how to solve this problem.
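One possible way to group the countries, sketched here under my own assumptions, is to keep the most frequent countries as their own groups and combine the rest into "Other". The data frame toy, the cutoff n_top, and the column name country_group are all hypothetical. Grouping by region would need an extra lookup table that is not part of this dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for hb
toy <- data.frame(
  country = c("PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "ESP", "DEU"),
  stringsAsFactors = FALSE
)

n_top <- 2  # hypothetical cutoff: how many countries keep their own group

# The n_top countries with the most bookings
top_countries <- toy %>%
  count(country, sort = TRUE) %>%
  slice_head(n = n_top) %>%
  pull(country)

# Keep frequent countries as they are; label everything else "Other"
toy_grouped <- toy %>%
  mutate(country_group = if_else(country %in% top_countries, country, "Other"))

print(count(toy_grouped, country_group))
```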

  4. The plot below is a bivariate plot.

This plot shows lead time for the different countries. I think it can also be used for planning.

leadtime_country <- ggplot(data=hb, aes(x=country, y=lead_time))+
  geom_point()+
  labs(x="Country", y="Lead time", title="Country by lead time")
leadtime_country

We can observe that, in general, different countries have different lead times: the points are concentrated in different ranges for different countries. However, one problem with this plot is that there are too many countries, so the x-axis is too dense to focus on any one country.
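One way to thin out the x-axis is to keep only the countries with the most bookings and draw a boxplot for each of them. The sketch below is only one option. The data frame toy, the cutoff of 3 countries, and the names keep and top_rows are hypothetical choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for hb
toy <- data.frame(
  country   = c("PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "FRA", "FRA", "ESP"),
  lead_time = c(10, 40, 120, 60, 90, 200, 15, 30, 5),
  stringsAsFactors = FALSE
)

# The 3 countries with the most bookings (hypothetical cutoff)
keep <- toy %>%
  count(country, sort = TRUE) %>%
  slice_head(n = 3) %>%
  pull(country)

# Rows for those countries only
top_rows <- filter(toy, country %in% keep)

# One box per country, ordered by median lead time
leadtime_top <- ggplot(top_rows,
                       aes(x = reorder(country, lead_time, FUN = median),
                           y = lead_time)) +
  geom_boxplot() +
  labs(x = "Country", y = "Lead time",
       title = "Lead time for the most frequent countries")

leadtime_top
```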