A. Introduction



From the first day of 2020 till now, the whole world is shrouded in the shadow of Covid-19. Social distance, wear mask, work/study from home becomes our new life style. Do you ever miss the life before? We can travel at any time, to any where, with any one. Do you remember what we concern at that time? Maybe is something like when is a good time to book hotel, how long is the optimal length of stay in order to get best daily rate.

This assignment will provide some visualization and analysis on hotel’s booking behavior based on data set from 1 Jul 2015 to 30 Jun 2017 downloaded from Kaggle.

A.1 Description of dataset


The data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies. The time period covered is from 1 Jul 2015 to 31 Aug 2017 (2 years plus two months). In order to have a better visualization, we will only focus on the period from 1 Jul 2015 to 30 Jun 2017(2 years). Further, we will only focus on Europe country to limit the scope of analysis.

Column_names Descriptions
hotel Hotel type - city or resort
is_canceled Indicate whether the booking is canceled or not, 1 means canceled
lead_time Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year Year of booking
arrival_date_month Month of booking
arrival_date_week_number Week of booking
arrival_date_day_of_month Day of booking
stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the guest stayed or booked
stays_in_week_nights Number of weekdat nights (Monday to Friday) the guest stayed or booked
adults Number of adults per booking
children Number of children per booking
babies Number of babies per booking
meal Type of meal booked
country Country of origin
market_segment Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel Booking distribution channel
is_repeated_guest Whether is a repeated guest, 1 means yes, 0 means no
previous_cancellations Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type Code of room type reserved
assigned_room_type Code for the type of room assigned to the booking
booking_changes Number of changes/amendments made
deposit_type Indication on if the customer made a deposit to guarantee the booking
agent ID of the travel agency that made the booking
company ID of the company/entity that made the booking or responsible for paying the booking
days_in_waiting_list Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type Type of the customer
adr Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces Number of car parking spaces required by the customer
total_of_special_requests Number of special requests made by the customer
reservation_status Reservation last status
reservation_status_date Date at which the last status was set


A.2 Objective and problem statement

This assignment will perform visualization and analysis on hotel’s booking behavior like peak of booking period, number of customers per booking based on data set from 1 Jul 2015 to 30 Jun 2017 (2 years) downloaded from Kaggle.

A.3 Challenges faced and solution

There are some data quality issue in the data set such as missing data, useless variables, outliers, unreasonable data etc. Below table lists down the problems faced and proposed solution.

Problem_faced Description Propsed_solution
Arrival date In orginal data set, the arrival day, month and year are separated into three columns, this is not convenient to visualise data in a time series To use approprate function such as as.POSIXct to combine and format as a date type.
Incorrect data and outliers Noted “adr” have negative value which is not reasonable. Further we need to check whether there is outliers in data set. If have, some treatments need to be done. To exclude incorrect data and outliers based on 3IQR method. Refer to Wikipedia (link), outlier is defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.
High amouont of canceled bookings Noted there are huge amount of canceled booking. We will analysis canceled booking percentage by country first, and then exclude those records. For the main part of analysis and visualisation, we will only focus on non-canceled bookings.
Define scope of the work As mentioned before, we will only focus on Europe country. To filter out the data set we need, other information such as continent is required. To join this data set with other data set in order to get continent information
New variables to be derived For analysis and visulisation purpose, several new variables are needed. To derive new variables such as total no. of days booked, arrival date.


A.4 Proposed sketched design





B. Step by step data visualization

B.2 Import data set

The data set is in csv format, we will use read_csv to import it. As mentioned in previous section, we need to join “continent” information to the data set. Thus another data set “continent2.csv” which downloaded from Kaggle link is also imported.

data("World")
hotel<-read_csv('hotel_bookings.csv')
code<-read_csv('continents2.csv')

B.3 Data preparation

B.3.1 Overview of data statistics summary

##### Glimpse the data set #####
glimpse(hotel)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month             <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type                  <chr> "Transient", "Transient", "Transient...
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02,...
##### View summary statistics #####
summary(hotel)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 


From the summary above, it shows “adr” have negative value which is not reasonable. Also the mean of “is_canceled” amounting to 0.37 indicates there is quite numbers of bookings were canceled. Thus in the following sections, we will remove unreasonable “adr” value and analysis canceled bookings.

B.3.2 Join and filter the data

As mentioned in section A.3, this step is to join in continent information, and at the same time to exclude unreasonable “adr” records, extract the records pertains to European country.

#### Join in continent information ####
hotel_EU <- hotel%>%
  left_join(.,code,by = c("country" = "alpha-3")) %>%
  filter(region == 'Europe',customer_type %in% c('Transient', 'Transient-Party')) %>%
  filter(adr >0)

cols <- c(1:33, 37)
hotel_EU <- hotel_EU[,cols]

#### Check missing data ####
missing = sum(is.na(hotel_EU$name))
missing
## [1] 0


In order to confirm the continent information is correctly joined into hotel data set, we used is.na function to check whether have missing value. The result 0 means the continent information is correctly joined in and no missing value noted.

B.3.3 Format arrival date and subset data

As mentioned in section A.3, arrival date split into three variables - “arrival_date_year”, “arrival_date_month”, “arrival_date_day_of_month”. To better deal with data, a new variable “arrival_date” will be derived. After this, the data set will be subset to only contains records from 1 Jul 2015 to 30 Jun 2017.

####Arrival time ####
hotel_EU <- hotel_EU %>%
  mutate(arrival_date = as.POSIXct(paste(arrival_date_year, 
                                         arrival_date_month, 
                                         arrival_date_day_of_month, 
                                         sep = '-'), format = '%Y-%B-%d')) %>%
  filter(arrival_date >= '2015-07-01' & arrival_date <= '2017-06-30' )


B.3.4 Booking cancelation

1) Visualize and analysis

Noted from the output of summary performed in section B.3.1. The mean of “is_canceled” is 0.37 which is an indication that there are numbers of canceled bookings in data set. In this section, we will analysis the cancellation percentage by country.

Please noted that since some countries only have small amount of records (less than 100), which might not representative. Thus we will exclude those countries for visualization and analysis in this section.

##### Remove records less than 100 #####
df1<- hotel_EU%>%
  group_by(country)%>%
  summarise(no_of_records = n())


hotel_morethan_100 <- hotel_EU %>%
  group_by(country)%>%
  mutate(no_of_records = n())%>%
  filter(no_of_records >=100)

##### Calculate cancellation rate #####
cancel <- hotel_morethan_100 %>%
  group_by(name)%>%
  summarise(cancel_perc = sum(`is_canceled`)/n())

##### Visualization #####
avg = mean(cancel$cancel_perc)
Q3 = quantile(cancel$cancel_perc,probs = 0.75)

ggplot(cancel, aes(x=reorder(name,-cancel_perc), y = cancel_perc))+
  geom_bar(stat = 'identity', color = 'black', fill = 'light blue')+
  coord_flip()+
  geom_hline(aes(yintercept = Q3), color = 'red')+
  geom_text(aes(21,Q3,label = '75%', vjust = -1.5, hjust = -.5))+
  geom_hline(aes(yintercept = avg), color = 'red', linetype = 'dashed')+
  geom_text(aes(21,avg,label = '50%',vjust = -1.5, hjust = 1))+
  ylab('Cancellation Percentage')+
  scale_y_continuous(labels = function(x) paste0(x*100, "%"))+
  ggtitle('Cancellation percentage by country')+
  labs(caption = "Note: Only includes countries with more than 100 records")+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.y = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.y = element_blank(),
        axis.title.x = element_text(size = 9),
        plot.caption = element_text(size = 7, colour = '#404040', face = 'italic'))


The chart above shows 5 countries have much higher cancellation rate compared with others, especially Portugal with close to 60%. Noted that the high cancellation rate probably due to many people booked the hotel is just for Visa application purpose. Showed on Schengen Visa Application, people need to provide hotel booking reservation for the whole duration of the intended stay in Portugal. Thus many people booked the hotel before Visa application and cancel it later.

Canceled bookings might not be reflectable to the actual booking behavior. Thus for the following visualization and analysis, we will only focus on those are not canceled.

2) Subset data set

##### Filter out not canceled #####
hotel_used <- hotel_EU %>%
  filter(is_canceled == 0)


B.3.5 Add new variables

Three new variables are derived for analysis purpose.

#### Total number of days booked ####
hotel_used <- hotel_used %>%
  mutate(total_days_booked = stays_in_weekend_nights+stays_in_week_nights)%>%
  filter(total_days_booked !=0) %>%
  mutate(weekend_perc=scales::percent(stays_in_weekend_nights/total_days_booked)) %>%
  mutate(weekday_perc=scales::percent(stays_in_week_nights/total_days_booked))


B.3.6 Regroup market segment

In original data set, there are 7 market segment. To have a better visualization, three segments, namely Complementary, Aviation, Corporate will be grouped together as “other”.

##### Market segment #####
hotel_used$market_segment[hotel_used$market_segment %in% c('Complementary','Aviation','Corporate')] <- 'Other'


B.3.7 Outliers

As mentioned in section A.3, we will check whether have outliers first.

Plot variables

##### Plot see distribution #####
g_price <- ggplot(hotel_used, aes(x= adr)) +
  geom_histogram(bins=40, color="black", fill="light blue")


g_day <- ggplot(hotel_used, aes(x= total_days_booked)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_leadtime <- ggplot(hotel_used, aes(x= lead_time)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_request <- ggplot(hotel_used, aes(x= total_of_special_requests)) +
  geom_histogram(bins=40, color="black", fill="light blue")

ggarrange(g_price, g_day,g_leadtime,g_request, ncol = 2, nrow = 2)



From the plots above, we can see there are extreme high amount in “adr”, “total_days_booked”, “lead_time”.

Deal with outliers

3 IQR method will be used to identify outliers. The outliers are identified based on 3 variables - “No. of units”, “Area” and “Transacted Price”.

First IQR value for each variables and upper limit (Q3 +1.5IQR) will be calculated. Then filter function will be applied to exclude those higher than upper limit calculated.

##### Calculate IQR and upper limit for outliers #####
IQR_unit_days = IQR(hotel_used$total_days_booked)
IQR_unit_price = IQR(hotel_used$adr)
IQR_lead = IQR(hotel_used$lead_time)


days_upper = quantile(hotel_used$total_days_booked,probs = 0.75)+1.5*IQR_unit_days
price_upper = quantile(hotel_used$adr,probs = 0.75)+1.5*IQR_unit_price
lead_upper = quantile(hotel_used$lead_time,probs = 0.75)+1.5*IQR_lead

##### Filter out outliers #####
hotel_used <- hotel_used %>%
  filter((total_days_booked<=days_upper)& (adr<=price_upper) & (lead_time<=lead_upper))


Plot variables again

##### Plot see distribution #####
g_price2 <- ggplot(hotel_used, aes(x= adr)) +
  geom_histogram(bins=40, color="black", fill="light blue")


g_day2 <- ggplot(hotel_used, aes(x= total_days_booked)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_leadtime2 <- ggplot(hotel_used, aes(x= lead_time)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_request2 <- ggplot(hotel_used, aes(x= total_of_special_requests)) +
  geom_histogram(bins=40, color="black", fill="light blue")

ggarrange(g_price2, g_day2,g_leadtime2,g_request2, ncol = 2, nrow = 2)


After remove outliers using 3IQR method, we can see the range of variables is not abnormal.

B.4 Exploratory data analysis

B.4.1 Seasonality of different hotel type

There are two hotel types in the data set - city hotel and resort. One question people might ask is “Whether there is seasonality of hotel booking for city hotel and resort?”.

##### Month with hotel type #####
hotel_used$arrival_date_month = factor(hotel_used$arrival_date_month, levels = month.name)

ggplot(hotel_used, aes(x=arrival_date_month, fill = hotel))+
  geom_bar(position = 'dodge')+
  scale_fill_manual(values=c("#FFDAB9", "#E6E6FA"))+
  ylab('No. of records')+
  xlab('Month')+
  ggtitle('No. of booking by hotel type and month')+
  labs(fill='Hotel type')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
        legend.position = "bottom",
        legend.title = element_blank())


If we aggregate by month, we can see that generally city hotel has higher amount of bookings than resort hotel. Two types of hotels have similar trend that the bookings increase from Jan to May, and slightly decrease in Jun and Jul, followed by another increase from Aug till Oct. After that, the bookings decrease again.

##### Number of booking ~ Hotel type #####

df_no_customer <- hotel_used %>%
  mutate(arrival_month_year = format(as.POSIXct(arrival_date), "%Y-%m"))%>%
  group_by(hotel, arrival_month_year) %>%
  summarise(no_booking = n())


g1 <- ggplot(df_no_customer, aes(x=arrival_month_year, y = no_booking, colour = hotel,
                                 text = paste("Arrival period: ",arrival_month_year,
                                              "\nNo. of booking: ", no_booking, 
                                              "\nHotel type: ", hotel)))+
  geom_line(group = 1)+
  geom_point()+
  coord_cartesian(ylim = c(0,2500))+
  ylab('No. of booking')+
  ggtitle('No. of booking from Jul 2015 to Jun 2017')+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9))

ggplotly(g1, tooltip = "text")


When see the data on a time series, we can notice the fluctuation of city hotel booking is much bigger than resort hotel. It has three peaks for city hotel bookings - Sep to Oct 15, Mar to Oct 16 and Mar to Jun 17.

B.4.2 Booking behavior by month

Do you ever wonder when children usually travel with their parent? In which month people will book longer period in hotel? Is the price fluctuated with the month? Following analysis will provide visualization based on data set.

##### Month with No. of adult/children/baby #####
df_avg_customer <- hotel_used %>%
  group_by(arrival_date_month) %>%
  summarise(no_children = sum(children)/n(),
            no_baby = sum(babies)/n(),
            no_adult = sum(adults)/n())


g2 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_children))+
  geom_bar(stat = 'identity', fill = 'wheat')+
  scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
                            "July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
  ylab('Children')+
  ggtitle('Number of children')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 10))


g3 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_baby))+
  geom_bar(stat = 'identity', fill = 'wheat')+
  scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
                            "July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
  ylab('Baby')+
  ggtitle('Number of babies')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 10))


g4 <- ggplot(df_avg_customer, aes(x=arrival_date_month, y = no_adult))+
  geom_bar(stat = 'identity', fill = 'wheat')+
  scale_x_discrete(labels=c("January" = "1", "February" = "2","March" = "3", "April" = "4", "May" = "5","June" = "6",
                            "July" = "7","August" = "8","September" = "9","October" = "10","November" = "11","December" = "12"))+
  ylab('Adult')+
  ggtitle('Number of adults')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 10))


##### Arrange three plots together #####
figure1 <- ggarrange(g4,g2,g3, ncol = 3, nrow = 1,
                     labels = c('i)', 'ii)', 'iii)'),
                     font.label = list(size = 10, face = 'italic'))
annotate_figure(figure1,
                top = text_grob('Booking behavior',
                                face = 'bold', size = 13),
                bottom = text_grob('Note: Data is from 1 July 2015 to 30 June 2017',
                                   face = 'italic', color = '#404040', size = 7, hjust = 1, x=1 ))



Generally the average size of the team per booking increase in Jul and Aug given that we can see similar increase pattern in average number of adults/children/babies in Jul and Aug.


1. Average daily rate

g5 <- ggplot(hotel_used, aes(x=arrival_date_month, y = adr))+
  geom_boxplot()+
  facet_wrap(hotel~.)+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
  ylab('Average daily rate')+
  ggtitle('Average price per day by month')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
        strip.background = element_blank())
ggplotly(g5)


From the plot above, we can easily find out some trend of daily price for different month. Generally the price is higher during Apr to Oct. 

Price for resort hotel diverse much larger compared with the price for city hotel. The highest average daily is almost three times of the lowest one (average $151 in Aug vs average $50 in Jan). City hotel also have similar trend that price is higher during Apr to Oct. But the fluctuation is much smaller than resort hotel.

2. Lead time

g6 <- ggplot(hotel_used, aes(x=arrival_date_month, y = lead_time))+
  geom_boxplot()+
  facet_wrap(hotel~.)+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
  ylab('No. of lead time')+
  ggtitle('Lead time by month')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
        strip.background = element_blank())

ggplotly(g6)


Lead time is number of days people book in advance. It can reflect whether people make proper planning for the travel. We can see that people prefer to make the longer advance booking for the period from May to Sep.

3. Length of the booking

g7 <- ggplot(hotel_used, aes(x=arrival_date_month, y = total_days_booked))+
  geom_boxplot()+
  facet_wrap(hotel~.)+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  scale_x_discrete(labels=c("January" = "Jan", "February" = "Feb","March" = "Mar", "April" = "Apr", "May" = "May","June" = "Jun","July" = "Jul","August" = "Aug","September" = "Sep","October" = "Oct","November" = "Nov","December" = "Dec"))+
  ylab('No. of days booked')+
  ggtitle('Length of booking by month')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
        strip.background = element_blank())

ggplotly(g7)


Compared city hotel with resort, it can obviously find out most of people (75%) will book less than 4 days in city hotel, no matter in which month. While people might book longer time for resort hotel, especially in Jul and Aug. This might be due to the nature of hotel.

B.4.3 Analysis by country

In this section, we will visualize the data using map. Before plotting the map by using tmap package, we need to join in geographic information.

Join in geographic information

##### Map #####
df_map <- hotel_used %>%
  group_by(country) %>%
  summarise(no_booking = n(),
            avg_price = sum(adr)/no_booking)

hotel_map <- World%>%
  left_join(., df_map, by = c("iso_a3"="country"))%>%
  filter(is.na(no_booking)==FALSE)

Plot interactive map

tmap_mode("view")

tm_shape(hotel_map)+
  tm_polygons("no_booking",n = 6,
          style = "quantile", 
          palette = "Blues",title = "No. of bookings")+
  tm_bubbles(size = "avg_price", col = "avg_price", 
               border.col = "black", border.alpha = .5,title.col = "Average daily rate")+
  tm_layout(legend.title.size = 0.7,
          legend.text.size = 0.4,
          legend.bg.color = "white",
          legend.bg.alpha = 1)


From the map, we can see Southern Europe countries people made higher amount of bookings compared with the rest. In terms of price of the hotel, people from certain countries like Luxembourg (LUX), Norway(NOR) have higher average daily rate.

B.4.4 Analysis by Market segment

In this section, we will visualize by market segment.

1) Repeated customer percentage

hotel_used$is_repeated_guest <- as.factor(hotel_used$is_repeated_guest)


ggplot(hotel_used, aes(x=market_segment, fill = is_repeated_guest))+
  geom_bar(position = 'fill')+
  scale_fill_manual(values=c("light salmon", "pale violet red"))+
  xlab('Market Segment')+
  scale_y_continuous(labels = function(x) paste0(x*100, "%"))+
  ggtitle('Repeated customer percentage by market segment')+
  labs(fill='Repeated customer')+
  theme_classic()+
  theme(axis.text.x = element_text(size = 8),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8),
        axis.title.y = element_blank(),
        axis.title.x = element_text(size = 9),
        plot.title = element_text(size = 12, face = 'bold', hjust = 0.5),
        legend.title = element_text(size = 9))


From above chart, we can see that customer from other segment have the highest probability of making repeat booking. This might be because there are some corporate customers in this category, and the booking is for certain purpose such as business trip.

C. Description and observations

From the data visualization performed above, we can have following observation:
1) As the government need hotel reservation when foreigners apply visa in some countries, some people will book the hotel first, and cancel the booking after visa approved. This result of high cancelation percentage, which might bring troubles to the hotels as well as other customers.
2) Number of bookings increase from Jan to May, and slightly decrease in Jun and Jul, followed by another increase from Aug till Oct. After that, the bookings decrease again.
3) Generally average size of the booking is larger in Jul and Aug.
4) Average daily rate will increase from Apr and reach the peak in Aug. Resort hotel have bigger fluctuation in terms of the price compared with city hotel.
5) People prefer to make longer advance booking for the period from May to Sep.
6) Compared city hotel with resort, it can obviously find out most of people (75%) will book less than 4 days in city hotel, no matter in which month. While people might book longer time for resort hotel, especially in Jul and Aug. This might be due to the nature of hotel.
7) Southern Europe countries people made higher amount of bookings compared with the rest. In terms of price of the hotel, people from certain countries like Luxembourg (LUX), Norway(NOR) have higher average daily rate.
8) Customers from other segment have the highest probability to make repeated bookings.