The hotel sector is a data-rich industry that collects massive amounts of various forms of data. Analytics can be used to segment passengers based on booking patterns, behavior, and other criteria, revealing their tedency to respond to discounts and new travel trends. Hence, using our Hotel data we plan to find out insights which can be useful in future business planning and to optimize revenue. Below are some of the patterns we going to look into:
Some specific questions, we will be answering
We have used below mentioned packages for our analysis:
library(readxl) #### To import the excel files into R
library(DT) #### To display the data attractively
library(ggplot2) #### To visualize
library(tidyverse) #### To tidy/clean the data
library(SmartEDA) #### To visualize
library(dplyr) #### To tidy/clean the data
library(skimr) #### To describe the data
library(lubridate) ### To deal with dates
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_dictionary <- read_excel("D:/Course/Data Wrangling/Mid Term Project/Data_Dictionary.xlsx")
attach(data_dictionary)
datatable(
head(data_dictionary,40),
extensions = 'FixedColumns',
options = list(
scrollY = "500px",
scrollX = TRUE,
fixedColumns = TRUE
)
)
dim(hotels)
## [1] 119390 32
head(hotels, 10)
## # A tibble: 10 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
summary(hotels)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
hotels$is_canceled<-as.factor(hotels$is_canceled)
hotels$is_repeated_guest<-as.factor(hotels$is_repeated_guest)
sum(is.na(hotels))
## [1] 4
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
getmode = function(v) {
uniqv = unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]}
getmode(hotels$children)
## [1] 0
hotels$children[is.na(hotels$children)] <- getmode(hotels$children)
head(hotels, 10)
## # A tibble: 10 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <fct> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <fct>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 0 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
There are no missing value now.
Frequency for some categorical variables
table(hotels$arrival_date_month)
##
## April August December February January July June March
## 11089 13877 6780 8068 5929 12661 10939 9794
## May November October September
## 11791 6794 11160 10508
table(hotels$market_segment)
##
## Aviation Complementary Corporate Direct Groups
## 237 743 5295 12606 19811
## Offline TA/TO Online TA Undefined
## 24219 56477 2
Studying some Numerical Variables
boxplot(hotels$lead_time, main = "lead_time")
boxplot(hotels$arrival_date_year, main = "arrival_date_year")
boxplot(hotels$arrival_date_week_number, main = "arrival_date_week_number")
Significant outliers in lead_time but not for other two. Lead time is an important part of data hence can not be replaced or removed.
The data is obtained from github website and link to the site is Data source
The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
To begin with our analysis, the first thing we are going to look for is the booking curve over the time. What is the time period when most number of bookings are received.
Booking Pattern
hotel1 <- hotels %>%
mutate(arrival_date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),arrival_date = parse_date(arrival_date, format = "%Y-%B-%d"))
p <- ggplot(hotel1, aes(x=hotel1$lead_time)) + geom_area(stat="bin",color="cornflowerblue",fill="lightblue")
p + labs(x="Number of Days", y="Count", title="Bookings: number of days out") + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can observe from the graph, maximum number of bookings are received within 0-30 and 31-60 window.
Customer Behavior
Once, we have the idea how the booking curve works, we should understand more about our customers. Their preferences while making a booking.
Below graph indicates the popular choice among customers for City Hotel or Resort Hotel.
hotels_by_type <- hotels %>% select(arrival_date_year, hotel) %>% group_by(arrival_date_year, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count)*100)
ggplot(hotels_by_type, aes(x = arrival_date_year, y = pct, fill = factor(hotel))) +
geom_bar(stat = "identity" , width=0.25) +
scale_x_continuous(breaks = seq(2015,2017,1)) +ggtitle("Preferred Type")
It’s clearly visible from the above graph that customers prefer City Hotel. The number of city hotels booked in 2015 is 62.5% of total number of bookings and the share increased by approximately 6% in 2016.
hotels %>%
group_by(is_repeated_guest) %>%
summarise(Count=n())%>% mutate(Percent = round(Count/ sum(Count)*100,2))
## # A tibble: 2 x 3
## is_repeated_guest Count Percent
## <fct> <int> <dbl>
## 1 0 115580 96.8
## 2 1 3810 3.19
Here, 0 represents the not repeated guest and 1 repeated guest. 96.81% of customers are not repeated, indicating customers pereference to book a new place every time they visit outside.
Further, we want to dig in to know the type of repeated and unrepeated guest
Customers_vs_repeatedguests <- hotels %>% group_by(customer_type, is_repeated_guest)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.
ggplot(Customers_vs_repeatedguests, aes(x = customer_type, y = pct, fill = factor(is_repeated_guest))) + geom_bar(stat = "identity" , width=0.5, position = "dodge")+ ggtitle("Customer Type")
People prefer visiting in groups while contracts has the least.
hotels %>% select (deposit_type, is_canceled) %>% group_by(deposit_type) %>%
summarise(Total_Cancellations=sum(is_canceled==1)) %>% mutate(PercentCancellation = Total_Cancellations/ sum(Total_Cancellations)*100)
## # A tibble: 3 x 3
## deposit_type Total_Cancellations PercentCancellation
## <chr> <int> <dbl>
## 1 No Deposit 29694 67.1
## 2 Non Refund 14494 32.8
## 3 Refundable 36 0.0814
Looking at the cancellation trend, we observe customer is more tend to cancel the bookings when it is ‘No deposit’ that is 67%
hotels %>% select (meal) %>% group_by(meal) %>%
summarise(Count=n()) %>% ggplot(aes(x=meal,y=Count))+ geom_bar(stat = 'identity',fill = 'Steelblue')+
geom_text(aes(label = meal),vjust = 1.5)+ggtitle("Preferred Room/Meal")
The most preferred room type among customers is BB i.e. ‘Bed and Breakfast’
One interesting thing to note is, not getting desired room is not the reasons for people cancelling the room.
hotels %>% filter(is_canceled==1, reserved_room_type != assigned_room_type) %>% summarise(n())
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 802
Only 0.6% cases of cancellation where customer does not get the desired room.
Promotional Activity
hotels %>% filter(arrival_date_year == '2016') %>% select (arrival_date_month, distribution_channel) %>% group_by(arrival_date_month,distribution_channel) %>%
summarise(Bookings_distribution_channel=n()) %>% mutate(PercentBooking = Bookings_distribution_channel/sum(Bookings_distribution_channel)*100) %>% arrange(desc(PercentBooking))
## `summarise()` has grouped output by 'arrival_date_month'. You can override using the `.groups` argument.
## # A tibble: 47 x 4
## # Groups: arrival_date_month [12]
## arrival_date_month distribution_channel Bookings_distribution~ PercentBooking
## <chr> <chr> <int> <dbl>
## 1 August TA/TO 4334 85.6
## 2 September TA/TO 4572 84.8
## 3 June TA/TO 4471 84.5
## 4 July TA/TO 3858 84.4
## 5 May TA/TO 4614 84.2
## 6 April TA/TO 4561 84.0
## 7 December TA/TO 3127 81.0
## 8 November TA/TO 3590 80.6
## 9 March TA/TO 3854 79.9
## 10 October TA/TO 4948 79.8
## # ... with 37 more rows
Above table shows us the monthwise distribution channel bookings which can be used for promotional activities. In general, booking share is like below:
bookings_by_distribution_channel <- hotels %>% group_by(distribution_channel, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'distribution_channel'. You can override using the `.groups` argument.
ggplot(bookings_by_distribution_channel, aes(x = distribution_channel, y = pct, fill = factor(hotel))) +
geom_bar(stat = "identity" , width=0.5, position = "dodge")+ggtitle("Booking through Distribution channel")
From this plot we can notice that distribution channel GDS does bookings only at City hotels and TA/TO has higher bookings at City hotels while Corporate and Direct have equal weights.
But we should also look for Percentage of cancellations by distribution Channel and Type
cancellation_by_distribution_channel <- hotels %>% group_by(distribution_channel, is_canceled)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'distribution_channel'. You can override using the `.groups` argument.
ggplot(cancellation_by_distribution_channel, aes(x = distribution_channel, y = pct, fill = factor(is_canceled))) +
geom_bar(stat = "identity" , width=0.5, position = "dodge")+ggtitle("Cancellation through Distribution channel")
From this plot it can be observed that distribution channel TA/TO cancelled around 40% of the bookings while others cancelled around 20%.
Some specific questions we look to answer
data_country <- hotels %>% group_by(country) %>% summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>%
ggplot(.,aes(country, booking_count)) +
geom_bar(stat = "identity", fill = "Blue")+ggtitle("Top 10 countries")
Portugal has highest number of bookings followed by Great Britain and France.
Cancellations_by_type <- hotels %>% filter(is_canceled == 1) %>% group_by(arrival_date_year, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'arrival_date_year'. You can override using the `.groups` argument.
ggplot(Cancellations_by_type, aes(x = arrival_date_year, y = pct, fill = factor(hotel))) +
geom_bar(stat = "identity" , width=0.25) +
scale_x_continuous(breaks = seq(2015,2017,1))+ ggtitle("Cancellation %")
Of the total number of cancellations city hotels has higher share than resort hotels which is expected as the number of bookings at city hotel is higher, but bookings percentage of 62.5 vs cancellation percentage of 75 are not exactly proportional.
df1=(hotels %>% group_by(lead_time) %>%
summarise(Bookings= n()))
df2= hotels %>% filter(is_canceled==1) %>% group_by(lead_time) %>%
summarise(cancellations= n())
df = merge(x=df1,y=df2,by="lead_time")
df3 = df %>% mutate(cancellationBooking = cancellations/Bookings*100)
p <- ggplot(df3, aes(lead_time,cancellationBooking)) + geom_area(color="cornflowerblue",fill="cornflowerblue")
p + labs(x="Lead Time", y="Cancellation to Booking %", title="Cancellation to Bookings") + theme_bw()
It is evident from the plot, 25% of the bookings are only cancelled closer to stay but bookings made one month out have higher chances of getting cancelled.