601_FinalProject

Introduction

This project studies hotel bookings. The data describe the booking details of two hotels over three years. I am interested in the relationships among these booking details, such as how the number of adults per booking changes from year to year and whether lead time is related to the guest’s country. Answering these questions can support predictions for the future and help the hotels plan for the next year (including room arrangement, meal planning, etc.).

Data

As mentioned in the introduction, this data set contains the booking details of two hotels over three years. Let’s first look at its structure.

library(ggplot2)
library(dplyr)
hb <- read.csv("D:/601_workspace/hotel_bookings.csv")
dim(hb)

[1] 119390     32

head(hb)

         hotel is_canceled lead_time arrival_date_year
1 Resort Hotel           0       342              2015
2 Resort Hotel           0       737              2015
3 Resort Hotel           0         7              2015
4 Resort Hotel           0        13              2015
5 Resort Hotel           0        14              2015
6 Resort Hotel           0        14              2015
  arrival_date_month arrival_date_week_number
1               July                       27
2               July                       27
3               July                       27
4               July                       27
5               July                       27
6               July                       27
  arrival_date_day_of_month stays_in_weekend_nights
1                         1                       0
2                         1                       0
3                         1                       0
4                         1                       0
5                         1                       0
6                         1                       0
  stays_in_week_nights adults children babies meal country
1                    0      2        0      0   BB     PRT
2                    0      2        0      0   BB     PRT
3                    1      1        0      0   BB     GBR
4                    1      1        0      0   BB     GBR
5                    2      2        0      0   BB     GBR
6                    2      2        0      0   BB     GBR
  market_segment distribution_channel is_repeated_guest
1         Direct               Direct                 0
2         Direct               Direct                 0
3         Direct               Direct                 0
4      Corporate            Corporate                 0
5      Online TA                TA/TO                 0
6      Online TA                TA/TO                 0
  previous_cancellations previous_bookings_not_canceled
1                      0                              0
2                      0                              0
3                      0                              0
4                      0                              0
5                      0                              0
6                      0                              0
  reserved_room_type assigned_room_type booking_changes deposit_type
1                  C                  C               3   No Deposit
2                  C                  C               4   No Deposit
3                  A                  C               0   No Deposit
4                  A                  A               0   No Deposit
5                  A                  A               0   No Deposit
6                  A                  A               0   No Deposit
  agent company days_in_waiting_list customer_type adr
1  NULL    NULL                    0     Transient   0
2  NULL    NULL                    0     Transient   0
3  NULL    NULL                    0     Transient  75
4   304    NULL                    0     Transient  75
5   240    NULL                    0     Transient  98
6   240    NULL                    0     Transient  98
  required_car_parking_spaces total_of_special_requests
1                           0                         0
2                           0                         0
3                           0                         0
4                           0                         0
5                           0                         1
6                           0                         1
  reservation_status reservation_status_date
1          Check-Out              2015-07-01
2          Check-Out              2015-07-01
3          Check-Out              2015-07-02
4          Check-Out              2015-07-02
5          Check-Out              2015-07-03
6          Check-Out              2015-07-03

As shown above, this data set has 119390 rows and 32 columns. Each row is one booking described by 32 variables.
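Beyond the dimensions, it can help to check the type of each column and to look for placeholder values. The sketch below is only an illustration: `hb_demo` is a hypothetical stand-in built from a few columns of the first three rows printed by head(hb), and treating the text "NULL" in the agent column as a missing-value marker is an assumption based on that printout.

```r
# Hypothetical stand-in for hb: first three rows of head(hb), a few columns only.
hb_demo <- data.frame(
  hotel     = c("Resort Hotel", "Resort Hotel", "Resort Hotel"),
  lead_time = c(342L, 737L, 7L),
  adults    = c(2L, 2L, 1L),
  meal      = c("BB", "BB", "BB"),
  agent     = c("NULL", "NULL", "NULL"),
  stringsAsFactors = FALSE
)

# One class per column: separates numeric columns from text (categorical) ones.
col_types <- sapply(hb_demo, class)
print(col_types)

# Count cells holding the text "NULL" in each column (assumed missing-value marker).
null_counts <- colSums(hb_demo == "NULL")
print(null_counts)
```

On the full data the same two calls could be applied to hb directly.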

We may be interested in how the numerical variables of each booking differ across years:

hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(median, na.rm=TRUE)

# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015           0        56                 0      2
2              2016           0        68                 0      2
3              2017           0        80                 0      2
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>

hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(mean, na.rm=TRUE)

# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015       0.370      97.2            0.0291   1.83
2              2016       0.359     103.             0.0314   1.85
3              2017       0.387     109.             0.0342   1.88
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>

hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled","lead_time","is_repeated_guest","adults","children","babies","required_car_parking_spaces","total_of_special_requests") %>%
  summarise_all(sd, na.rm=TRUE)

# A tibble: 3 x 9
  arrival_date_year is_canceled lead_time is_repeated_guest adults
              <int>       <dbl>     <dbl>             <dbl>  <dbl>
1              2015       0.483      105.             0.168  0.851
2              2016       0.480      107.             0.174  0.498
3              2017       0.487      108.             0.182  0.496
# ... with 4 more variables: children <dbl>, babies <dbl>,
#   required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>

The tables above show the median, mean and standard deviation of the selected numerical variables for each year.
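The three summaries can also be computed side by side, which makes it easier to compare them. In the output above, for example, the mean lead time is well above the median in every year, which points to a right-skewed distribution. The following is a minimal base R sketch; `hb_demo` and its values are made up for illustration, and `summarise_by_year` is a hypothetical helper, not part of the original analysis.

```r
# Made-up example data; the column names follow the hotel bookings data set.
hb_demo <- data.frame(
  arrival_date_year = c(2015, 2015, 2015, 2016, 2016, 2017, 2017, 2017),
  lead_time         = c(342, 7, 14, 60, 100, 20, 80, 200),
  adults            = c(2, 1, 2, 2, 2, 1, 2, 3)
)

# Hypothetical helper: median, mean and sd of x within each year.
# Returns a matrix with one row per year and one column per statistic.
summarise_by_year <- function(x, year) {
  t(sapply(split(x, year), function(v) {
    c(median = median(v), mean = mean(v), sd = sd(v))
  }))
}

lead_stats <- summarise_by_year(hb_demo$lead_time, hb_demo$arrival_date_year)
print(lead_stats)
```

A year whose mean is far above its median (2015 in this toy data) has a few very long lead times pulling the mean upward.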

Because this data set contains many categorical variables, I will not list the frequencies for all of them. Here is an example for one categorical variable, “meal”, which I use later in the analysis.

count(hb,meal)

       meal     n
1        BB 92310
2        FB   798
3        HB 14463
4        SC 10650
5 Undefined  1169

The meal column has five values: BB, FB, HB, SC and Undefined. SC means self-catering (no meals are included). BB means bed and breakfast. HB means half board, in which breakfast and dinner are included. FB means full board, in which breakfast, lunch and dinner are included.
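The counts are easier to compare as percentages. This small sketch re-enters the counts printed by count(hb, meal) above by hand; the variable names are chosen here for illustration only.

```r
# Counts copied from the count(hb, meal) output above.
meal_counts <- c(BB = 92310, FB = 798, HB = 14463, SC = 10650, Undefined = 1169)

# Share of each meal type in percent, rounded to one decimal place.
meal_share <- round(100 * meal_counts / sum(meal_counts), 1)
print(meal_share)
```

The counts add up to the 119390 rows of the data set, and BB accounts for roughly 77% of bookings.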

Visualization

This plot focuses on the number of adults in each booking. It can be used to plan the hotel’s room arrangement.

# Histogram of adults per booking; the x-axis is limited to 0-5 adults
adults_hist <- ggplot(data=hb, aes(x=adults))+
  xlim(0,5) +
  geom_histogram(binwidth=1)+
  labs(x="Adults", y="Bookings", title="Adults by Bookings")
adults_hist

Based on this plot, rooms for 1, 2 and 3 people can be arranged in a ratio of about 3:12:1.
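A ratio like this can be read off the histogram, but it can also be computed from the counts. The sketch below shows one way to do that; the `adults` vector is made-up data chosen so that the result is exactly 3:12:1, not the real column.

```r
# Made-up adults-per-booking values (not the real data).
adults <- c(rep(1, 30), rep(2, 120), rep(3, 10), rep(4, 1))

# Count bookings with 1, 2 and 3 adults, then scale so the smallest count is 1.
counts <- table(adults)[c("1", "2", "3")]
ratio <- round(as.numeric(counts) / min(counts), 1)
names(ratio) <- names(counts)
print(ratio)
```

Replacing the toy vector with hb$adults would give the ratio for the actual bookings.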

This plot shows meal frequency. We can use it to arrange the meal plan.

meal_bar <- ggplot(data=hb, aes(x=meal, fill=meal))+
  geom_bar()+
  labs(x="Meal Type", y="Frequency", title="Meal Frequency")
meal_bar

As we can see, most guests choose BB (bed and breakfast), so the hotels can prepare more supplies for breakfast.

I drew another plot of adults per booking, separated by year. I used facet_wrap to create one panel per year and fill to color the bars by hotel.

# Same histogram, with one panel per arrival year and bars colored by hotel
adults_by_year <- ggplot(data=hb, aes(x=adults, fill=hotel))+
  xlim(0,5) +
  geom_histogram(binwidth=1) +
  facet_wrap(vars(arrival_date_year), scales = "free_y") +
  labs(x="Adults", y="Bookings", title="Adults by Bookings separated by Years")
adults_by_year

From this plot we can see that there are many more bookings in 2016 than in 2015 or 2017, and that the City Hotel has about three times as many bookings as the Resort Hotel.
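Because the panels use scales = "free_y", each year has its own y-axis, so bar heights are not directly comparable across panels. A count table is a simple way to check comparisons like these. The sketch below uses made-up rows in `hb_demo`; only the column names come from the data set.

```r
# Made-up bookings; column names follow the hotel bookings data set.
hb_demo <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "Resort Hotel",
            "City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  arrival_date_year = c(2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017),
  stringsAsFactors = FALSE
)

# Bookings per hotel (rows) and year (columns).
bookings <- table(hb_demo$hotel, hb_demo$arrival_date_year)
print(bookings)

# Total bookings per year, and the overall City-to-Resort ratio.
per_year <- colSums(bookings)
city_to_resort <- sum(bookings["City Hotel", ]) / sum(bookings["Resort Hotel", ])
print(per_year)
print(city_to_resort)
```

Running the same calls on hb would show how close the real ratio is to the one estimated from the plot.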

The plot below is a bivariate plot.

It shows lead time by country. First, I want to know from this plot whether lead time is related to guests’ nationality. It could also be used to plan around lead time.

leadtime_country <- ggplot(data=hb, aes(x=country, y=lead_time))+
  geom_point()+
  labs(x="Country", y="Lead time", title="Country by Lead time")
leadtime_country

We can observe that lead time generally differs between countries: the points are concentrated in different ranges for different countries. I think this suggests that lead time and country are related.
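With many countries on the x-axis, a scatter plot can be hard to read, so a per-country summary may help support this observation. The following sketch compares the median lead time of countries that have enough bookings. The data in `hb_demo` are made up, and the cutoff of 3 bookings is an arbitrary choice for this toy example.

```r
# Made-up bookings; column names follow the hotel bookings data set.
hb_demo <- data.frame(
  country   = c("PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "FRA"),
  lead_time = c(10, 30, 20, 100, 150, 200, 5),
  stringsAsFactors = FALSE
)

# Keep only countries with at least 3 bookings (arbitrary cutoff).
n_by_country <- table(hb_demo$country)
frequent <- names(n_by_country)[n_by_country >= 3]
subset_demo <- hb_demo[hb_demo$country %in% frequent, ]

# Median lead time per remaining country, longest first.
median_lead <- sort(
  tapply(subset_demo$lead_time, subset_demo$country, median),
  decreasing = TRUE
)
print(median_lead)
```

Large gaps between the medians of frequently occurring countries would be consistent with an association, although a summary like this does not establish one on its own.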

Reflection

This is my first project using the R language, and I learned a lot from it. Throughout the course, I think the most meaningful thing was learning to think carefully about the data. My earlier machine learning projects came with a prepared data set, and I almost never worked out the relationships within the data myself.

The most challenging part of this project was the syntax of R, especially some particular details, such as the difference between `` and ’’ that I mentioned in the tutorial. After figuring out how to use these, I found that R is a powerful tool for data cleaning and visualization.

In this project, I selected several common variables for analysis, such as adults, meal and bookings. The main reason is that I think these directly influence the number of bookings, and they are easy for the hotels to act on.

I was also curious about lead time. I think it may be influenced by the language guests speak, their culture, etc. I therefore used a plot to examine whether lead time and country are related.

Conclusion

I draw the following conclusions:

1. Most bookings have 2 adults, and the ratio of bookings with 1, 2 and 3 adults is about 3:12:1.
2. There are many more bookings in 2016 than in 2015 or 2017.
3. The City Hotel has about three times as many bookings as the Resort Hotel.
4. Lead time and country appear to be related.

I think these findings can be used to plan rooms, meals and lead time in the future.

Bibliography

[1] Programming language: R language

[2] Hotel Booking data set

[3] Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O’Reilly Media.

[4] DACSS 601: Data Science Fundamentals