Read in the hotel_bookings dataset from sampledata and view its structure. The dataset has 119390 rows and 32 columns.
[1] 119390 32
head(hb)
hotel is_canceled lead_time arrival_date_year
1 Resort Hotel 0 342 2015
2 Resort Hotel 0 737 2015
3 Resort Hotel 0 7 2015
4 Resort Hotel 0 13 2015
5 Resort Hotel 0 14 2015
6 Resort Hotel 0 14 2015
arrival_date_month arrival_date_week_number
1 July 27
2 July 27
3 July 27
4 July 27
5 July 27
6 July 27
arrival_date_day_of_month stays_in_weekend_nights
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
stays_in_week_nights adults children babies meal country
1 0 2 0 0 BB PRT
2 0 2 0 0 BB PRT
3 1 1 0 0 BB GBR
4 1 1 0 0 BB GBR
5 2 2 0 0 BB GBR
6 2 2 0 0 BB GBR
market_segment distribution_channel is_repeated_guest
1 Direct Direct 0
2 Direct Direct 0
3 Direct Direct 0
4 Corporate Corporate 0
5 Online TA TA/TO 0
6 Online TA TA/TO 0
previous_cancellations previous_bookings_not_canceled
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
reserved_room_type assigned_room_type booking_changes deposit_type
1 C C 3 No Deposit
2 C C 4 No Deposit
3 A C 0 No Deposit
4 A A 0 No Deposit
5 A A 0 No Deposit
6 A A 0 No Deposit
agent company days_in_waiting_list customer_type adr
1 NULL NULL 0 Transient 0
2 NULL NULL 0 Transient 0
3 NULL NULL 0 Transient 75
4 304 NULL 0 Transient 75
5 240 NULL 0 Transient 98
6 240 NULL 0 Transient 98
required_car_parking_spaces total_of_special_requests
1 0 0
2 0 0
3 0 0
4 0 0
5 0 1
6 0 1
reservation_status reservation_status_date
1 Check-Out 2015-07-01
2 Check-Out 2015-07-01
3 Check-Out 2015-07-02
4 Check-Out 2015-07-02
5 Check-Out 2015-07-03
6 Check-Out 2015-07-03
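The head() output above shows the literal string NULL in the agent and company columns. Below is a minimal sketch of how those could be read as missing values, assuming the data come from a CSV file. The temporary file and the name toy_hb are made up for illustration and are not the real dataset.

```r
# Write a tiny CSV that imitates the agent/company columns shown above.
# The file and its contents are a hypothetical stand-in for hotel_bookings.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("hotel,agent,company,adr",
             "Resort Hotel,NULL,NULL,0",
             "Resort Hotel,304,NULL,75",
             "Resort Hotel,240,NULL,98"), csv_path)

# na.strings tells read.csv to treat the text "NULL" as NA.
toy_hb <- read.csv(csv_path, na.strings = "NULL")

dim(toy_hb)              # number of rows and columns
str(toy_hb)              # column types
colSums(is.na(toy_hb))   # missing values per column
```

With this option agent is read as a number instead of text, so the summaries further down would not have to deal with the string "NULL".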
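We may be interested in how the relevant numerical variables of each booking differ across years. The three tables below give the median, mean, and standard deviation of these variables by arrival year.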
hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled", "lead_time", "is_repeated_guest", "adults", "children",
         "babies", "required_car_parking_spaces", "total_of_special_requests") %>%
  summarise_all(median, na.rm = TRUE)
# A tibble: 3 x 9
arrival_date_year is_canceled lead_time is_repeated_guest adults
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 0 56 0 2
2 2016 0 68 0 2
3 2017 0 80 0 2
# ... with 4 more variables: children <dbl>, babies <dbl>,
# required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>
hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled", "lead_time", "is_repeated_guest", "adults", "children",
         "babies", "required_car_parking_spaces", "total_of_special_requests") %>%
  summarise_all(mean, na.rm = TRUE)
# A tibble: 3 x 9
arrival_date_year is_canceled lead_time is_repeated_guest adults
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 0.370 97.2 0.0291 1.83
2 2016 0.359 103. 0.0314 1.85
3 2017 0.387 109. 0.0342 1.88
# ... with 4 more variables: children <dbl>, babies <dbl>,
# required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>
hb %>%
  group_by(arrival_date_year) %>%
  select("is_canceled", "lead_time", "is_repeated_guest", "adults", "children",
         "babies", "required_car_parking_spaces", "total_of_special_requests") %>%
  summarise_all(sd, na.rm = TRUE)
# A tibble: 3 x 9
arrival_date_year is_canceled lead_time is_repeated_guest adults
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 0.483 105. 0.168 0.851
2 2016 0.480 107. 0.174 0.498
3 2017 0.487 108. 0.182 0.496
# ... with 4 more variables: children <dbl>, babies <dbl>,
# required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>
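The three pipelines above differ only in the summary function. As a sketch, they could be combined into one call with across(), assuming dplyr 1.0.0 or later. The toy data frame below uses made-up values and only two of the columns, so it is not the real hb.

```r
library(dplyr)

# Toy stand-in for hb; the values are invented for illustration only.
toy <- data.frame(
  arrival_date_year = c(2015, 2015, 2016, 2016),
  lead_time         = c(10, 30, 20, 60),
  adults            = c(2, 1, 2, 2)
)

# One pass that computes median, mean, and sd for every selected column.
# Result columns are named <column>_<function>, e.g. lead_time_mean.
toy_summary <- toy %>%
  group_by(arrival_date_year) %>%
  summarise(
    across(c(lead_time, adults),
           list(median = ~ median(.x, na.rm = TRUE),
                mean   = ~ mean(.x, na.rm = TRUE),
                sd     = ~ sd(.x, na.rm = TRUE))),
    .groups = "drop"
  )
toy_summary
```

On the real data the same idea would apply by replacing toy with hb and listing all eight columns inside c(). The result is one wide table instead of three separate ones.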
Next is a frequency summary for the categorical (nominal) variables. Because there are many of them, I give only one example. We can use the count() function to calculate the frequency of each category.
count(hb,meal)
meal n
1 BB 92310
2 FB 798
3 HB 14463
4 SC 10650
5 Undefined 1169
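Raw counts are easier to compare as shares of all bookings. The sketch below reuses the counts printed above (typed in by hand here so that it runs without hb) and adds a share column. The names meal_counts and meal_share are made up for this example.

```r
library(dplyr)

# Counts copied from the count(hb, meal) output above.
meal_counts <- data.frame(
  meal = c("BB", "FB", "HB", "SC", "Undefined"),
  n    = c(92310, 798, 14463, 10650, 1169),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent and add each meal's share of all bookings.
meal_share <- meal_counts %>%
  arrange(desc(n)) %>%
  mutate(share = round(n / sum(n), 3))
meal_share
```

The counts add up to the 119390 rows of the dataset, and BB accounts for about 77% of the bookings. On the real data, count(hb, meal, sort = TRUE) followed by the same mutate() should give the same table.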
The two plots below are univariate. The first plot shows the number of adults in each booking, which can be used to plan the hotel's room arrangement.
weight <- ggplot(data=hb, aes(x=adults))+
geom_histogram(data=hb, aes(x=adults), binwidth=1)
weight
As a result, we can provide rooms for 1, 2, and 3 people in a ratio of roughly 3:12:1. One problem with this plot is the x-axis: it is a little hard to tell the individual x values apart, and I will fix this problem later.
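One possible way to make the x-axis easier to read is to treat the number of adults as a discrete category, so that every value gets its own labelled bar. This is only a sketch on a made-up toy data frame, not necessarily the fix used later.

```r
library(ggplot2)

# Toy stand-in for hb$adults; the values are invented for illustration.
toy <- data.frame(adults = c(1, 2, 2, 2, 3, 2, 1, 2))

# factor() turns the counts into categories, and geom_bar() draws one bar
# per category with its own tick label on the x-axis.
adults_bar <- ggplot(toy, aes(x = factor(adults))) +
  geom_bar() +
  labs(x = "Adults per booking", y = "Number of bookings")
adults_bar
```

An alternative that keeps the histogram is to add scale_x_continuous() with an explicit breaks argument, which places a tick at every whole number.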
meal_bar <- ggplot(data=hb, aes(x=meal, fill=meal))+
geom_bar()+
labs(x="Meal Type", y="Frequency", title="Meal Frequency")
meal_bar
As a result, the hotel should prepare more BB (bed and breakfast) meals than any other meal type.
This plot shows lead time by country. I think it can also be used for planning.
leadtime_country <- ggplot(data=hb, aes(x=country, y=lead_time))+
geom_point()+
labs(x="Country", y="Lead time", title="Country by lead time")
leadtime_country
As we can observe, lead times generally differ between countries: the points are concentrated in different ranges for different countries. However, one problem with this plot is that there are too many countries, so the x-axis is too dense to focus on any single country.
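One way to thin out the x-axis is to keep only the countries with the most bookings and draw one boxplot per country. The sketch below uses a made-up toy data frame, and the cut-off top_n_countries is a hypothetical choice, not a value from the analysis above.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for hb; countries and lead times are invented.
toy <- data.frame(
  country   = c("PRT", "PRT", "PRT", "PRT", "GBR", "GBR", "GBR",
                "FRA", "FRA", "ESP"),
  lead_time = c(342, 737, 7, 13, 14, 14, 0, 9, 85, 75),
  stringsAsFactors = FALSE
)

top_n_countries <- 2  # hypothetical cut-off; something like 10 may suit real data

# Countries ordered by number of bookings, keeping only the first few.
top_countries <- toy %>%
  count(country, sort = TRUE) %>%
  slice_head(n = top_n_countries) %>%
  pull(country)

toy_top <- toy %>% filter(country %in% top_countries)

# One box per remaining country instead of a dense column of points.
leadtime_top <- ggplot(toy_top, aes(x = country, y = lead_time)) +
  geom_boxplot() +
  labs(x = "Country", y = "Lead time",
       title = "Lead time for the most frequent countries")
leadtime_top
```

With fewer countries on the axis, each label stays readable, and the boxes make it easier to compare the typical lead time between countries.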