From the first day of 2020 till now, the whole world is shrouded in the shadow of Covid-19. Social distance, wear mask, work/study from home becomes our new life style. Do you ever miss the life before? We can travel at any time, to any where, with any one. Do you remember what we concern at that time? Maybe is something like when is a good time to book hotel, how long is the optimal length of stay in order to get best daily rate.
This assignment will provide some visualization and analysis on hotel’s booking behavior based on data set from 1 Jul 2015 to 30 Jun 2017 downloaded from Kaggle.
The data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies. The time period covered is from 1 Jul 2015 to 31 Aug 2017 (2 years plus two months). In order to have a better visualization, we will only focus on the period from 1 Jul 2015 to 30 Jun 2017(2 years). Further, we will only focus on Europe country to limit the scope of analysis.
| Column_names | Descriptions |
|---|---|
| hotel | Hotel type - city or resort |
| is_canceled | Indicate whether the booking is canceled or not, 1 means canceled |
| lead_time | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | Year of booking |
| arrival_date_month | Month of booking |
| arrival_date_week_number | Week of booking |
| arrival_date_day_of_month | Day of booking |
| stays_in_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked |
| stays_in_week_nights | Number of weekdat nights (Monday to Friday) the guest stayed or booked |
| adults | Number of adults per booking |
| children | Number of children per booking |
| babies | Number of babies per booking |
| meal | Type of meal booked |
| country | Country of origin |
| market_segment | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | Booking distribution channel |
| is_repeated_guest | Whether is a repeated guest, 1 means yes, 0 means no |
| previous_cancellations | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | Code of room type reserved |
| assigned_room_type | Code for the type of room assigned to the booking |
| booking_changes | Number of changes/amendments made |
| deposit_type | Indication on if the customer made a deposit to guarantee the booking |
| agent | ID of the travel agency that made the booking |
| company | ID of the company/entity that made the booking or responsible for paying the booking |
| days_in_waiting_list | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | Type of the customer |
| adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | Number of car parking spaces required by the customer |
| total_of_special_requests | Number of special requests made by the customer |
| reservation_status | Reservation last status |
| reservation_status_date | Date at which the last status was set |
This assignment will perform visualization and analysis on hotel’s booking behavior like peak of booking period, number of customers per booking based on data set from 1 Jul 2015 to 30 Jun 2017 (2 years) downloaded from Kaggle.
| Problem_faced | Description | Propsed_solution |
|---|---|---|
| Arrival date | In orginal data set, the arrival day, month and year are separated into three columns, this is not convenient to visualise data in a time series |
To use approprate function such as as.POSIXct to combine and format as a date type.
|
| Incorrect data and outliers | Noted “adr” have negative value which is not reasonable. Further we need to check whether there is outliers in data set. If have, some treatments need to be done. | To exclude incorrect data and outliers based on 3IQR method. Refer to Wikipedia (link), outlier is defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. |
| High amouont of canceled bookings | Noted there are huge amount of canceled booking. | We will analysis canceled booking percentage by country first, and then exclude those records. For the main part of analysis and visualisation, we will only focus on non-canceled bookings. |
| Define scope of the work | As mentioned before, we will only focus on Europe country. To filter out the data set we need, other information such as continent is required. | To join this data set with other data set in order to get continent information |
| New variables to be derived | For analysis and visulisation purpose, several new variables are needed. | To derive new variables such as total no. of days booked, arrival date. |
The data set is in csv format, we will use read_csv to import it. As mentioned in previous section, we need to join “continent” information to the data set. Thus another data set “continent2.csv” which downloaded from Kaggle link is also imported.
data("World")
hotel<-read_csv('hotel_bookings.csv')
code<-read_csv('continents2.csv')
##### Glimpse the data set #####
glimpse(hotel)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <chr> "Transient", "Transient", "Transient...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02,...
##### View summary statistics #####
summary(hotel)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
From the summary above, it shows “adr” have negative value which is not reasonable. Also the mean of “is_canceled” amounting to 0.37 indicates there is quite numbers of bookings were canceled. Thus in the following sections, we will remove unreasonable “adr” value and analysis canceled bookings.
As mentioned in section A.3, this step is to join in continent information, and at the same time to exclude unreasonable “adr” records, extract the records pertains to European country.
#### Join in continent information ####
hotel_EU <- hotel%>%
left_join(.,code,by = c("country" = "alpha-3")) %>%
filter(region == 'Europe',customer_type %in% c('Transient', 'Transient-Party')) %>%
filter(adr >0)
cols <- c(1:33, 37)
hotel_EU <- hotel_EU[,cols]
#### Check missing data ####
missing = sum(is.na(hotel_EU$name))
missing
## [1] 0
In order to confirm the continent information is correctly joined into hotel data set, we used is.na function to check whether have missing value. The result 0 means the continent information is correctly joined in and no missing value noted.
As mentioned in section A.3, arrival date split into three variables - “arrival_date_year”, “arrival_date_month”, “arrival_date_day_of_month”. To better deal with data, a new variable “arrival_date” will be derived. After this, the data set will be subset to only contains records from 1 Jul 2015 to 30 Jun 2017.
####Arrival time ####
hotel_EU <- hotel_EU %>%
mutate(arrival_date = as.POSIXct(paste(arrival_date_year,
arrival_date_month,
arrival_date_day_of_month,
sep = '-'), format = '%Y-%B-%d')) %>%
filter(arrival_date >= '2015-07-01' & arrival_date <= '2017-06-30' )
1) Visualize and analysis
Noted from the output of summary performed in section B.3.1. The mean of “is_canceled” is 0.37 which is an indication that there are numbers of canceled bookings in data set. In this section, we will analysis the cancellation percentage by country.
Please noted that since some countries only have small amount of records (less than 100), which might not representative. Thus we will exclude those countries for visualization and analysis in this section.
##### Remove records less than 100 #####
df1<- hotel_EU%>%
group_by(country)%>%
summarise(no_of_records = n())
hotel_morethan_100 <- hotel_EU %>%
group_by(country)%>%
mutate(no_of_records = n())%>%
filter(no_of_records >=100)
##### Calculate cancellation rate #####
cancel <- hotel_morethan_100 %>%
group_by(name)%>%
summarise(cancel_perc = sum(`is_canceled`)/n())
##### Visualization #####
avg = mean(cancel$cancel_perc)
Q3 = quantile(cancel$cancel_perc,probs = 0.75)
ggplot(cancel, aes(x=reorder(name,-cancel_perc), y = cancel_perc))+
geom_bar(stat = 'identity', color = 'black', fill = 'light blue')+
coord_flip()+
geom_hline(aes(yintercept = Q3), color = 'red')+
geom_text(aes(21,Q3,label = '75%', vjust = -1.5, hjust = -.5))+
geom_hline(aes(yintercept = avg), color = 'red', linetype = 'dashed')+
geom_text(aes(21,avg,label = '50%',vjust = -1.5, hjust = 1))+
ylab('Cancellation Percentage')+
scale_y_continuous(labels = function(x) paste0(x*100, "%"))+
ggtitle('Cancellation percentage by country')+
labs(caption = "Note: Only includes countries with more than 100 records")+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.y = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.y = element_blank(),
axis.title.x = element_text(size = 9),
plot.caption = element_text(size = 7, colour = '#404040', face = 'italic'))
The chart above shows 5 countries have much higher cancellation rate compared with others, especially Portugal with close to 60%. Noted that the high cancellation rate probably due to many people booked the hotel is just for Visa application purpose. Showed on Schengen Visa Application, people need to provide hotel booking reservation for the whole duration of the intended stay in Portugal. Thus many people booked the hotel before Visa application and cancel it later.
Canceled bookings might not be reflectable to the actual booking behavior. Thus for the following visualization and analysis, we will only focus on those are not canceled.
2) Subset data set
##### Filter out not canceled #####
hotel_used <- hotel_EU %>%
filter(is_canceled == 0)
Three new variables are derived for analysis purpose.
#### Total number of days booked ####
hotel_used <- hotel_used %>%
mutate(total_days_booked = stays_in_weekend_nights+stays_in_week_nights)%>%
filter(total_days_booked !=0) %>%
mutate(weekend_perc=scales::percent(stays_in_weekend_nights/total_days_booked)) %>%
mutate(weekday_perc=scales::percent(stays_in_week_nights/total_days_booked))
In original data set, there are 7 market segment. To have a better visualization, three segments, namely Complementary, Aviation, Corporate will be grouped together as “other”.
##### Market segment #####
hotel_used$market_segment[hotel_used$market_segment %in% c('Complementary','Aviation','Corporate')] <- 'Other'
As mentioned in section A.3, we will check whether have outliers first.
Plot variables
##### Plot see distribution #####
g_price <- ggplot(hotel_used, aes(x= adr)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_day <- ggplot(hotel_used, aes(x= total_days_booked)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_leadtime <- ggplot(hotel_used, aes(x= lead_time)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_request <- ggplot(hotel_used, aes(x= total_of_special_requests)) +
geom_histogram(bins=40, color="black", fill="light blue")
ggarrange(g_price, g_day,g_leadtime,g_request, ncol = 2, nrow = 2)
From the plots above, we can see there are extreme high amount in “adr”, “total_days_booked”, “lead_time”.
Deal with outliers
3 IQR method will be used to identify outliers. The outliers are identified based on 3 variables - “No. of units”, “Area” and “Transacted Price”.
First IQR value for each variables and upper limit (Q3 +1.5IQR) will be calculated. Then filter function will be applied to exclude those higher than upper limit calculated.
##### Calculate IQR and upper limit for outliers #####
IQR_unit_days = IQR(hotel_used$total_days_booked)
IQR_unit_price = IQR(hotel_used$adr)
IQR_lead = IQR(hotel_used$lead_time)
days_upper = quantile(hotel_used$total_days_booked,probs = 0.75)+1.5*IQR_unit_days
price_upper = quantile(hotel_used$adr,probs = 0.75)+1.5*IQR_unit_price
lead_upper = quantile(hotel_used$lead_time,probs = 0.75)+1.5*IQR_lead
##### Filter out outliers #####
hotel_used <- hotel_used %>%
filter((total_days_booked<=days_upper)& (adr<=price_upper) & (lead_time<=lead_upper))
Plot variables again
##### Plot see distribution #####
g_price2 <- ggplot(hotel_used, aes(x= adr)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_day2 <- ggplot(hotel_used, aes(x= total_days_booked)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_leadtime2 <- ggplot(hotel_used, aes(x= lead_time)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_request2 <- ggplot(hotel_used, aes(x= total_of_special_requests)) +
geom_histogram(bins=40, color="black", fill="light blue")
ggarrange(g_price2, g_day2,g_leadtime2,g_request2, ncol = 2, nrow = 2)
After remove outliers using 3IQR method, we can see the range of variables is not abnormal.
There are two hotel types in the data set - city hotel and resort. One question people might ask is “Whether there is seasonality of hotel booking for city hotel and resort?”.
##### Month with hotel type #####
hotel_used$arrival_date_month = factor(hotel_used$arrival_date_month, levels = month.name)
ggplot(hotel_used, aes(x=arrival_date_month, fill = hotel))+
geom_bar(position = 'dodge')+
scale_fill_manual(values=c("#FFDAB9", "#E6E6FA"))+
ylab('No. of records')+
xlab('Month')+
ggtitle('No. of booking by hotel type and month')+
labs(fill='Hotel type')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 9),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank())
If we aggregate by month, we can see that generally city hotel has higher amount of bookings than resort hotel. Two types of hotels have similar trend that the bookings increase from Jan to May, and slightly decrease in Jun and Jul, followed by another increase from Aug till Oct. After that, the bookings decrease again.
##### Number of booking ~ Hotel type #####
df_no_customer <- hotel_used %>%
mutate(arrival_month_year = format(as.POSIXct(arrival_date), "%Y-%m"))%>%
group_by(hotel, arrival_month_year) %>%
summarise(no_booking = n())
g1 <- ggplot(df_no_customer, aes(x=arrival_month_year, y = no_booking, colour = hotel,
text = paste("Arrival period: ",arrival_month_year,
"\nNo. of booking: ", no_booking,
"\nHotel type: ", hotel)))+
geom_line(group = 1)+
geom_point()+
coord_cartesian(ylim = c(0,2500))+
ylab('No. of booking')+
ggtitle('No. of booking from Jul 2015 to Jun 2017')+
theme_classic()+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9))
ggplotly(g1, tooltip = "text")
Do you ever wonder when children usually travel with their parent? In which month people will book longer period in hotel? Is the price fluctuated with the month? Following analysis will provide visualization based on data set.
##### Month with No. of adult/children/baby #####
df_avg_customer <- hotel_used %>%
group_by(arrival_date_month) %>%
summarise(no_children = sum(children)/n(),
no_baby = sum(babies)/n(),
no_adult = sum(adults)/n())
g2 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_children))+
geom_bar(stat = 'identity', fill = 'wheat')+
scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
"July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
ylab('Children')+
ggtitle('Number of children')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 10))
g3 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_baby))+
geom_bar(stat = 'identity', fill = 'wheat')+
scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
"July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
ylab('Baby')+
ggtitle('Number of babies')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 10))
g4 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_adult))+
geom_bar(stat = 'identity', fill = 'wheat')+
scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
"July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
ylab('Adult')+
ggtitle('Number of adults')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 10))
##### Arrange three plots together #####
figure1 <- ggarrange(g4,g2,g3, ncol = 3, nrow = 1,
labels = c('i)', 'ii)', 'iii)'),
font.label = list(size = 10, face = 'italic'))
annotate_figure(figure1,
top = text_grob('Booking behavior',
face = 'bold', size = 13),
bottom = text_grob('Note: Data is from 1 July 2015 to 30 June 2017',
face = 'italic', color = '#404040', size = 7, hjust = 1, x=1 ))
Generally the average size of the team per booking increase in Jul and Aug given that we can see similar increase pattern in average number of adults/children/babies in Jul and Aug.
1. Average daily rate
g5 <- ggplot(hotel_used, aes(x=arrival_date_month, y = adr))+
geom_boxplot()+
facet_wrap(hotel~.)+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
ylab('Average daily rate')+
ggtitle('Average price per day by month')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
strip.background = element_blank())
ggplotly(g5)
Price for resort hotel diverse much larger compared with the price for city hotel. The highest average daily is almost three times of the lowest one (average $151 in Aug vs average $50 in Jan). City hotel also have similar trend that price is higher during Apr to Oct. But the fluctuation is much smaller than resort hotel.
2. Lead time
g6 <- ggplot(hotel_used, aes(x=arrival_date_month, y = lead_time))+
geom_boxplot()+
facet_wrap(hotel~.)+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
ylab('No. of lead time')+
ggtitle('Lead time by month')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
strip.background = element_blank())
ggplotly(g6)
Lead time is number of days people book in advance. It can reflect whether people make proper planning for the travel. We can see that people prefer to make the longer advance booking for the period from May to Sep.
3. Length of the booking
g7 <- ggplot(hotel_used, aes(x=arrival_date_month, y = total_days_booked))+
geom_boxplot()+
facet_wrap(hotel~.)+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
ylab('No. of days booked')+
ggtitle('Length of booking by month')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
strip.background = element_blank())
ggplotly(g7)
Compared city hotel with resort, it can obviously find out most of people (75%) will book less than 4 days in city hotel, no matter in which month. While people might book longer time for resort hotel, especially in Jul and Aug. This might be due to the nature of hotel.
In this section, we will visualize the data using map. Before plotting the map by using tmap package, we need to join in geographic information.
Join in geographic information
##### Map #####
df_map <- hotel_used %>%
group_by(country) %>%
summarise(no_booking = n(),
avg_price = sum(adr)/no_booking)
hotel_map <- World%>%
left_join(., df_map, by = c("iso_a3"="country"))%>%
filter(is.na(no_booking)==FALSE)
Plot interactive map
tmap_mode("view")
tm_shape(hotel_map)+
tm_polygons("no_booking",n = 6,
style = "quantile",
palette = "Blues",title = "No. of bookings")+
tm_bubbles(size = "avg_price", col = "avg_price",
border.col = "black", border.alpha = .5,title.col = "Average daily rate")+
tm_layout(legend.title.size = 0.7,
legend.text.size = 0.4,
legend.bg.color = "white",
legend.bg.alpha = 1)
From the map, we can see Southern Europe countries people made higher amount of bookings compared with the rest. In terms of price of the hotel, people from certain countries like Luxembourg (LUX), Norway(NOR) have higher average daily rate.
In this section, we will visualize by market segment.
1) Repeated customer percentage
hotel_used$is_repeated_guest <- as.factor(hotel_used$is_repeated_guest)
ggplot(hotel_used, aes(x=market_segment, fill = is_repeated_guest))+
geom_bar(position = 'fill')+
scale_fill_manual(values=c("light salmon", "pale violet red"))+
xlab('Market Segment')+
scale_y_continuous(labels = function(x) paste0(x*100, "%"))+
ggtitle('Repeated customer percentage by market segment')+
labs(fill='Repeated customer')+
theme_classic()+
theme(axis.text.x = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.y = element_blank(),
axis.title.x = element_text(size = 9),
plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
legend.title = element_text(size = 9))
From above chart, we can see that customer from other segment have the highest probability of making repeat booking. This might be because there are some corporate customers in this category, and the booking is for certain purpose such as business trip.
From the data visualization performed above, we can have following observation:
1) As the government need hotel reservation when foreigners apply visa in some countries, some people will book the hotel first, and cancel the booking after visa approved. This result of high cancelation percentage, which might bring troubles to the hotels as well as other customers.
2) Number of bookings increase from Jan to May, and slightly decrease in Jun and Jul, followed by another increase from Aug till Oct. After that, the bookings decrease again.
3) Generally average size of the booking is larger in Jul and Aug.
4) Average daily rate will increase from Apr and reach the peak in Aug. Resort hotel have bigger fluctuation in terms of the price compared with city hotel.
5) People prefer to make longer advance booking for the period from May to Sep.
6) Compared city hotel with resort, it can obviously find out most of people (75%) will book less than 4 days in city hotel, no matter in which month. While people might book longer time for resort hotel, especially in Jul and Aug. This might be due to the nature of hotel.
7) Southern Europe countries people made higher amount of bookings compared with the rest. In terms of price of the hotel, people from certain countries like Luxembourg (LUX), Norway(NOR) have higher average daily rate.
8) Customers from other segment have the highest probability to make repeated bookings.