Introduction

Through this analysis, we were able to answer key business questions which are stated below:

Dataset description

This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 31 variables describing the 40,060 observations and 79,330 observations. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.

Libraries Used

library(tidyverse)
library(ggplot2)

Data Preparation

Data Source

The data originally comes from an open hotel booking demand dataset on the website Antonio, Almeida and Nunes, 2019. It was published on Twitter as a part of weekly project in R named TidyTuesday.

Explanation of the dataset

The data was acquired from hotels’ Property Management System (PMS) SQL databases using TSQL queries. This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 32 variables describing the 40,060 observations and 79,330 observations respectively. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.

Data Importing and Cleaning

Importing datset

data<- read.csv("C:/Users/srhrs/OneDrive/Documents/Wrangling/Week 3/hotels.csv",stringsAsFactors = TRUE)

Viewing the structure of data

str(data)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...

The dataset contains 32 variables and 119390 records.

Generating a table for the first 10 rows of the data.

# Generating a table for the first 10 rows in the dataset.

knitr::kable(head(data,n = 10), "pandoc")
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Resort Hotel 0 342 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 737 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 7 2015 July 27 1 0 1 1 0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NULL NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 13 2015 July 27 1 0 1 1 0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304 NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 0 2015 July 27 1 0 2 2 0 0 BB PRT Direct Direct 0 0 0 C C 0 No Deposit NULL NULL 0 Transient 107.0 0 0 Check-Out 2015-07-03
Resort Hotel 0 9 2015 July 27 1 0 2 2 0 0 FB PRT Direct Direct 0 0 0 C C 0 No Deposit 303 NULL 0 Transient 103.0 0 1 Check-Out 2015-07-03
Resort Hotel 1 85 2015 July 27 1 0 3 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 82.0 0 1 Canceled 2015-05-06
Resort Hotel 1 75 2015 July 27 1 0 3 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 15 NULL 0 Transient 105.5 0 0 Canceled 2015-04-22

Calculating Missing values for all columns

# Calculating null value count for all columns

colSums(is.na(data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
  • There is only one column i.e., children with 4 missing values.
  • We shall find out the summary statistics of children column.
# Finding summary statistics for 'children'

summary(data$children)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4
  • Since, the minimum and median values are 0, we impute the 4 NA values with 0.
# Imputing missing values in 'children' column

data$children[is.na(data$children)] <- 0

After imputing, the number of missing values in the dataset is 0.

sum(is.na(data))
## [1] 0

Summary statistics of each variable in the dataset

summary(data)
##           hotel        is_canceled       lead_time   arrival_date_year
##  City Hotel  :79330   Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Resort Hotel:40060   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##                       Median :0.0000   Median : 69   Median :2016     
##                       Mean   :0.3704   Mean   :104   Mean   :2016     
##                       3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                       Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                       
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13877      Min.   : 1.00            Min.   : 1.0             
##  July   :12661      1st Qu.:16.00            1st Qu.: 8.0             
##  May    :11791      Median :28.00            Median :16.0             
##  October:11160      Mean   :27.17            Mean   :15.8             
##  April  :11089      3rd Qu.:38.00            3rd Qu.:23.0             
##  June   :10939      Max.   :53.00            Max.   :31.0             
##  (Other):47873                                                        
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies                 meal          country     
##  Min.   : 0.0000   Min.   : 0.000000   BB       :92310   PRT    :48590  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB       :  798   GBR    :12129  
##  Median : 0.0000   Median : 0.000000   HB       :14463   FRA    :10415  
##  Mean   : 0.1039   Mean   : 0.007949   SC       :10650   ESP    : 8568  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000   Undefined: 1169   DEU    : 7287  
##  Max.   :10.0000   Max.   :10.000000                     ITA    : 3766  
##                                                          (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      Min.   :0.00000  
##  Offline TA/TO:24219   Direct   :14645      1st Qu.:0.00000  
##  Groups       :19811   GDS      :  193      Median :0.00000  
##  Direct       :12606   TA/TO    :97870      Mean   :0.03191  
##  Corporate    : 5295   Undefined:    5      3rd Qu.:0.00000  
##  Complementary:  743                        Max.   :1.00000  
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1118     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes       deposit_type        agent      
##  A      :74053      Min.   : 0.0000   No Deposit:104641   9      :31961  
##  D      :25322      1st Qu.: 0.0000   Non Refund: 14587   NULL   :16340  
##  E      : 7806      Median : 0.0000   Refundable:   162   240    :13922  
##  F      : 3751      Mean   : 0.2211                       1      : 7191  
##  G      : 2553      3rd Qu.: 0.0000                       14     : 3640  
##  C      : 2375      Max.   :21.0000                       7      : 3539  
##  (Other): 3530                                            (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  NULL   :112593   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89613  
##  67     :   267   Mean   :  2.321      Transient-Party:25124  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date
##  Canceled :43017    2015-10-21:  1461      
##  Check-Out:75166    2015-07-06:   805      
##  No-Show  : 1207    2016-11-25:   790      
##                     2015-01-01:   763      
##                     2016-01-18:   625      
##                     2015-07-02:   469      
##                     (Other)   :114477

Summary for variables of concern

  • There are two types of hotels. Resort Hotel with 40060 records and City Hotel with 79330 records.
  • Median value of arrival_date_week_number is 28. This shows that there are more number of bookings in the 2nd half of the year.
  • On an average, weekdays from Monday to Friday(stays_in_weekend_nights) observe longer stays than those in weekends(stays_in_week_nights) i.e., on Saturday, Sunday.
  • adults column has a minimum value of 0. We must inspect those records.
  • distribution_channel has 5 Undefined values.
  • Type A rooms have 85994 bookings registered against them. Also, there are 74053 bookings assigned for type A rooms.
  • agent and company columns have records stored under the category NULL.
  • Average Daily Rate adr column has negative values. These must be investigated.

Visualization

Histogram of yearly bookings

hist(data$arrival_date_year, prob = T, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")

Majority bookings can be seen in 2016

Histogram of Avg. Daily Price

hist(data$adr, prob = T, breaks = 50,xlim = c(min(data$adr) - 1,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")

There seem some outliers in the adr column

Finding outliers

Below are the plots of various numeric variables:

par(mfrow = c(3,3), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(data$is_canceled, main = "is_canceled")
boxplot(data$lead_time, main = "lead_time")
boxplot(data$arrival_date_year, main = "arrival_date_year")
boxplot(data$arrival_date_week_number, main = "arrival_date_week_number")
boxplot(data$arrival_date_day_of_month, main = "arrival_date_day_of_month")
boxplot(data$stays_in_weekend_nights, main = "stays_in_weekend_nights")
boxplot(data$stays_in_week_nights, main = "stays_in_week_nights")
boxplot(data$children, main = "children")
boxplot(data$adults, main = "adults")

boxplot(data$babies, main = "babies")
boxplot(data$is_repeated_guest, main = "is_repeated_gues")
boxplot(data$previous_cancellations, main = "previous_cancellation")
boxplot(data$previous_bookings_not_canceled, main ="previous_bookings_not_canceled")
boxplot(data$booking_changes, main = "booking_changes")
boxplot(data$days_in_waiting_list, main = "days_in_waiting_list")
boxplot(data$adr, main = "adr")
boxplot(data$required_car_parking_spaces, main = "required_car_parking_spaces")
boxplot(data$total_of_special_requests, main = "total_of_special_requests")

Observations from boxplots

Treating the outliers

data <- data[!(data$adr < 0 | data$adr == 5400),]

boxplot(data$adr, main = "Boxplot : Avg. Daily Price")

  • Let us now check the summary of adr column.
summary(data$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   69.29   94.58  101.79  126.00  510.00
  • Let us also check the distribution of adr
hist(data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(data$adr) - 1 ,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")

* Let us check the final dataset.

knitr::kable(head(data,n=10), "pandoc")
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Resort Hotel 0 342 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 737 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 7 2015 July 27 1 0 1 1 0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NULL NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 13 2015 July 27 1 0 1 1 0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304 NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 0 2015 July 27 1 0 2 2 0 0 BB PRT Direct Direct 0 0 0 C C 0 No Deposit NULL NULL 0 Transient 107.0 0 0 Check-Out 2015-07-03
Resort Hotel 0 9 2015 July 27 1 0 2 2 0 0 FB PRT Direct Direct 0 0 0 C C 0 No Deposit 303 NULL 0 Transient 103.0 0 1 Check-Out 2015-07-03
Resort Hotel 1 85 2015 July 27 1 0 3 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 82.0 0 1 Canceled 2015-05-06
Resort Hotel 1 75 2015 July 27 1 0 3 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 15 NULL 0 Transient 105.5 0 0 Canceled 2015-04-22

Exploratory Data Analysis

Number of Bookings by Hotel Type

Let us analyze the number of bookings by the type of hotel for each available year.

ggplot(data = data,aes(arrival_date_year,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  ylab("Number Of Bookings") +
  xlab("Years") +
  ggtitle("Total bookings across years") +
  labs(fill = 'Hotel Type')

Total number of bookings in 2016 seems to be higher compared to 2015 & 2017. We can use the arrival_date_month field to view the data across every month for the availabe years.

data_2015 <- data[data$arrival_date_year == 2015,]
data_2016 <- data[data$arrival_date_year == 2016,]
data_2017 <- data[data$arrival_date_year == 2017,]
ggplot(data = data_2015,aes(arrival_date_month,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  ylab("Number Of Bookings") +
  xlab("Years") +
  ggtitle("Total bookings for year 2015") +
  labs(fill = 'Hotel Type')

ggplot(data = data_2016,aes(arrival_date_month,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  ylab("Number Of Bookings") +
  xlab("Years") +
  ggtitle("Total bookings for year 2016") +
  labs(fill = 'Hotel Type')

ggplot(data = data_2017,aes(arrival_date_month,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  ylab("Number Of Bookings") +
  xlab("Years") +
  ggtitle("Total bookings for years 2017") +
  labs(fill = 'Hotel Type')

Looking at the above plots we can see that complete data is available only for 2 quarters for years 2015 and 2017 and the data is missing for remaining quarters whereas, for 2016 the data is available for the entire year. Due to this the number of bookings for 2016 is showing a much higher value than the rest of the years.

Revenue Trend

data.weeks <- (data %>% 
  group_by(arrival_date_year) %>%
  summarize(min = min(arrival_date_week_number),
            max = max(arrival_date_week_number),
            weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))
knitr::kable(data.weeks)
arrival_date_year min max weeks
2015 27 53 27
2016 1 53 53
2017 1 35 35

From the above table we can see the data for number of weeks is not same in each available year. So we find the average number of bookings and average revenue generated per week in the 2 types of hotel segments and compare them.

data.summary <- data %>%
  filter(is_canceled == 0) %>% 
  group_by(hotel,arrival_date_year) %>%
  summarize(hotel.count = n(),
            hotel.rates = sum(adr)) %>%
  transmute(arrival_date_year,
            avg_weekly_bookings = case_when(arrival_date_year == '2015' ~ (hotel.count/27),
            arrival_date_year == '2017' ~ (hotel.count/35),
            TRUE ~ (hotel.count/53)),
            avg_weekly_earnings = case_when(arrival_date_year == '2015' ~ (hotel.rates/27),
            arrival_date_year == '2017' ~ (hotel.rates/35),
            TRUE ~ round(hotel.rates/53)))
knitr::kable(data.summary)
hotel arrival_date_year avg_weekly_bookings avg_weekly_earnings
City Hotel 2015 284.3704 24988.60
City Hotel 2016 428.9245 44642.00
City Hotel 2017 451.9143 52791.06
Resort Hotel 2015 228.7407 20531.64
Resort Hotel 2016 257.3019 21589.00
Resort Hotel 2017 260.6857 26533.26

From the above table we can say the total number of bookings and earnings are increasing year over year for both types of hotels. But, this increase is more in City Hotels compared to Resort Hotels. Also the number of bookings increased by upto 56% from 2015 to 2016 for City Hotels but there is only a 6% increase in bookings from 2016 to 2017.

Booking Analysis by Country

We will now explore the countries from which the guests belong to.

data_country <- data %>% group_by(country) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>% 
  ggplot(.,aes(country, booking_count)) +
  geom_bar(stat = "identity", width = 0.25, fill ="dark blue")

The above graph shows the top 10 countries from where the guests come from. Majority of the guests are from Portugal followed by Great Britain and France.

Preferred Distribution Channel

Now we will analyse the distribution channel which customers generally choose for their bookings.

data_DC <- data %>% filter(!grepl("Undefined", distribution_channel)) %>% group_by(distribution_channel) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
ggplot(data=data_DC,aes(distribution_channel, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="dark blue")

As shown above, majority of the bookings come from travel agents/tour operators, therefore the hotel company can tie up with more number of such agents/operators and retain the existing ones with lucrative offers and commissions.

Preferred Time in a Week

We will now analyze the bookings on weekdays and weekends for different hotel types.

week_summary <- data %>%
  filter(is_canceled == 0) %>% 
  group_by(hotel,arrival_date_year) %>%
  summarize(avg_weekend_stay = round(sum(stays_in_weekend_nights)/2),
            avg_weekday_stay = round(sum(stays_in_week_nights)/5))

knitr::kable(week_summary)
hotel arrival_date_year avg_weekend_stay avg_weekday_stay
City Hotel 2015 2862 3091
City Hotel 2016 9024 9555
City Hotel 2017 6620 6982
Resort Hotel 2015 3678 3954
Resort Hotel 2016 7369 7848
Resort Hotel 2017 5358 5612

Here we can see the number of bookings on weekdays is slightly higher compared to weekend bookings across hotel types. This is interesting as we would generally believe the hotel bookings to be higher on weekends compared to weekdays. So the hotel company can further investigate by interacting with the customers and identifying the customers who prefer weekday bookings & those who prefer weekends so that they can target the respective type of customers by a much more streamlined process.

Preferred Room Type

We will now study the room type preferred by guests.

room_summary <- data %>%
  filter(is_canceled == 0) %>% 
  group_by(reserved_room_type) %>%
  summarize(room_count = n()) %>% 
  arrange(-room_count)

knitr::kable(room_summary)
reserved_room_type room_count
A 52363
D 13099
E 4621
F 2017
G 1331
B 750
C 624
H 356
L 4
ggplot(data = data,aes(reserved_room_type,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  ylab("Number Of Bookings") +
  xlab("Room Type") +
  ggtitle("Room Type Preferred") +
  labs(fill = 'Hotel Type')

From the above table and graph, we can say that room type A is in high demand for both city and resort hotels. And therefore the company can think of strategies to improve the booking rate in the remaining room types and ensure that the number of rooms of type A in the hotel is much higher than the remaining room types.

Preferred Meal Type

Studying the meal type preferred by the guests.

ggplot(data, aes(meal)) +
  geom_bar(width=0.25, fill = "dark blue")

The above plots shows that “Bed & Breakfast” is the most preferred meal type by the guests. And therefore the hotel company should further investigate the reasons for the low number of bookings with other type of meal offerings and devise strategies to make them more attractive.

Cancellation Rate by Hotel type

We will now study the cancellation trend across hotel types.

ggplot(data, aes(x = is_canceled, fill = factor(hotel))) +
  geom_histogram(binwidth = 0.2) +
  scale_x_continuous(breaks = seq(0, 1, 1))

The above plot shows that there are more cancellations in City Hotels than Resort Hotels.

Linear Modeling

data$total_stay_nights <- rowSums(cbind(data$stays_in_weekend_nights, data$stays_in_week_nights), na.rm = TRUE)
data <- data %>% mutate(revenue = (total_stay_nights * adr))
data_hotel_revenue <- data %>% filter(is_canceled == 0) %>%  group_by(hotel) %>% dplyr::summarise(total_bookings= n(), total_revenue = sum(revenue))
data_lm <- data %>% filter(is_canceled == 0) %>%  group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% dplyr::summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
## `summarise()` has grouped output by 'arrival_date_week_number', 'hotel', 'customer_type', 'meal'. You can override using the `.groups` argument.
model <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_lm)

summary(model)
## 
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) + 
##     as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + 
##     week_adr, data = data_lm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55215 -20804  -5367  10432 242978 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               7434.26    3706.77   2.006   0.0451
## as.factor(hotel)Resort Hotel             -1644.98    2212.95  -0.743   0.4574
## as.factor(customer_type)Group            -3607.78    3597.32  -1.003   0.3161
## as.factor(customer_type)Transient        41462.05    3017.20  13.742  < 2e-16
## as.factor(customer_type)Transient-Party  14504.96    3108.08   4.667 3.36e-06
## as.factor(meal)FB                       -59550.92    4570.39 -13.030  < 2e-16
## as.factor(meal)HB                       -32974.17    2495.85 -13.212  < 2e-16
## as.factor(meal)SC                       -32095.51    2730.19 -11.756  < 2e-16
## as.factor(meal)Undefined                -53234.85    4497.99 -11.835  < 2e-16
## as.factor(is_repeated_guest)1           -33540.02    2258.14 -14.853  < 2e-16
## average_stay_nights                        656.51     345.97   1.898   0.0580
## week_adr                                   208.88      20.75  10.068  < 2e-16
##                                            
## (Intercept)                             *  
## as.factor(hotel)Resort Hotel               
## as.factor(customer_type)Group              
## as.factor(customer_type)Transient       ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB                       ***
## as.factor(meal)HB                       ***
## as.factor(meal)SC                       ***
## as.factor(meal)Undefined                ***
## as.factor(is_repeated_guest)1           ***
## average_stay_nights                     .  
## week_adr                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34580 on 1378 degrees of freedom
## Multiple R-squared:  0.3765, Adjusted R-squared:  0.3716 
## F-statistic: 75.66 on 11 and 1378 DF,  p-value: < 2.2e-16

Conclusion

The conclusions from the analysis of the Hotel data are :