It is a Hotel Group with offering resort and city hotel. From the given dataset, the management is interested in the following questions and looking for answers and insights:
Which type of hotel has more reservation among all during the given period?
Which type of hotel is favored by family with children?
How many channels the group has cooperated with?
How do diverse channels perform?
library(tidyverse) ## this is the code for pipe %>% (shift+ctrl+m)
## āā Attaching core tidyverse packages āāāāāāāāāāāāāāāāāāāāāāāā tidyverse 2.0.0 āā
## ā dplyr 1.1.3 ā readr 2.1.4
## ā forcats 1.0.0 ā stringr 1.5.0
## ā ggplot2 3.4.4 ā tibble 3.2.1
## ā lubridate 1.9.3 ā tidyr 1.3.0
## ā purrr 1.0.2
## āā Conflicts āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse_conflicts() āā
## ā dplyr::filter() masks stats::filter()
## ā dplyr::lag() masks stats::lag()
## ā¹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr) ## the package for summarize()
library(ggplot2) ## the package for visualization
dataframe = read_csv("hotel_booking.csv")
## Rows: 119390 Columns: 32
## āā Column specification āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ā¹ Use `spec()` to retrieve the full column specification for this data.
## ā¹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataframe)
## # A tibble: 6 Ć 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## # ā¹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ā¦
colnames(dataframe) ## all names of the column are in right format
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
summary(dataframe) ## check main statistical features of the column values
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
sum(is.na(dataframe))
## [1] 4
dataframe2 = na.omit(dataframe) ##clean the dataframe, deleting all n/a value
glimpse(dataframe2)
## Rows: 119,386
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resortā¦
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ā¦
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ā¦
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201ā¦
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",ā¦
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,ā¦
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ā¦
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ā¦
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ā¦
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BBā¦
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBRā¦
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporatā¦
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporatā¦
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",ā¦
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",ā¦
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Depositā¦
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "ā¦
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",ā¦
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ customer_type <chr> "Transient", "Transient", "Transient", ā¦
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,ā¦
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā¦
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ā¦
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ā¦
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20ā¦
There are now 119386 booking histories and 32 variables in the cleaned up data frame
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ā¹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
n_cityhotel = dataframe2 %>%
filter(hotel == "City Hotel") %>%
summarize(n_cityhotel = n())
percentage_cityhotel = round(n_cityhotel / nrow(dataframe2), 2)
1 - percentage_cityhotel
## n_cityhotel
## 1 0.34
There are almost 80,000 orders on City Hotel, taking up 66% of all reservations; while about 40,000 orders on Resort Hotel, counting 34%.
Add the number of children and babies to make a new varibale for number of kids
dataframe_children = dataframe2 %>%
mutate(no_kids = children + babies)
summary(dataframe_children$no_kids)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1118 0.0000 10.0000
Minimum Number of Kids is 0, Maxinum number of Kids is 10
The percentage of no-kids family staying in city hotel and resort hotel
is separately 67% and 33%, which doesnāt contradict general proportion
of reservations on two types of hotel. We cannot deduce that in the
given period and data, either of hotel is favored by family
customers.
unique(dataframe2$distribution_channel)
## [1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
The channels include Direct, Corporate, TA/TO, GDS and undefined, 5 in total
unique(dataframe2$customer_type)
## [1] "Transient" "Contract" "Transient-Party" "Group"
The types of customer include Transient, Contract, Transient-Party and Group, four in total.
Channel in contrast to Reservation
nTATO = dataframe2 %>%
filter(dataframe2$distribution_channel == "TA/TO") %>%
summarise(nTATO = n())
nTATO / nrow(dataframe2)
## nTATO
## 1 0.8197779
The TATO is the dominant source of booking, which takes up 82% among all.
Channels with different hotel types
Channel Corporate and Direct are significant for resort hotel reservation. Almost 50% reservations from the two channels are on resort hotel, which are higher than average 34% in amount of general reservation for this type.
There are lots of information and insights can be drawn from this data set, such as:
The channel performance in contrast to the different customer type.
Which segment contributes the most revenue?
The lead time of booking for various customer types.
Many moreā¦