Introduction

Hotels are considered to be the most volatile industry,since there is a lot of uncertainty in demand. Hence they need to plan the best strategies always to keep the numbers moving and keep the business growing.

The objective of this project is to analyze the open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019 and find the critical factors from the past data which may help for the future planning of the business and thus to improve their revenue.

By this analysis, I would like to help the hotels understand the following

1. Which is the most preferred Hotel type by the guests ?
2. Which is the Busiest Month?
3. Which is the most preferred Meal Type?
4. Which country are most of the visitors from ?
5. What is the most preferred Room Type?
6. What are the factors to concentrate on to increase the Revenue?
7. What has been the Cancellation percentage?

I would start by Importing the data, Cleaning it and checking for completeness and then find the relationships between various factors, visually analyze the relationship between various columns and build models utilizing various functions and packages available in R.

This analysis would help the consumer to understand all the factors affecting the Business, prepare for the future bookings accordingly and thus improve their revenue.

Packages

The packages used for the analysis are

  • readxl : To import the excel files into R
  • tidyverse : To tidy/clean the data
  • lubridate : To manipulate the dates
  • psych : To plot multiple histograms
  • DT : To display the data attractively
library(readxl)
library(tidyverse)
library(lubridate)
library(psych)
library(DT)

Data preparation

Data Source

  • The data contains the open hotel booking demand dataset from Antonio, Almeida and Nunes 2019
  • The source data contains the complete booking information of the guests..It has 32 variables. It helps us understand everything from the source of booking the hotel for the guest to the preferred meal type and everything in between.
  • The whole information about the dataset is as follows.
data_dictionary <- read_excel("D:/UC BANA/Fall 2020/Data Wrangling/Final Project/data_dictionary_hotels.xlsx")
attach(data_dictionary)

datatable(
  head(data_dictionary,40),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

To import the data

data <- read.csv("D:/UC BANA/Fall 2020/Data Wrangling/Final Project/hotels.csv", stringsAsFactors = FALSE )

To check the structure of the data

str(data)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

There are 119390 observations of 32 variables. All the Data types seem appropriate. Hence I am not changing any Data types.

To view the raw data

head(data, 10)
##           hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1  Resort Hotel           0       342              2015               July
## 2  Resort Hotel           0       737              2015               July
## 3  Resort Hotel           0         7              2015               July
## 4  Resort Hotel           0        13              2015               July
## 5  Resort Hotel           0        14              2015               July
## 6  Resort Hotel           0        14              2015               July
## 7  Resort Hotel           0         0              2015               July
## 8  Resort Hotel           0         9              2015               July
## 9  Resort Hotel           1        85              2015               July
## 10 Resort Hotel           1        75              2015               July
##    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                        27                         1                       0
## 2                        27                         1                       0
## 3                        27                         1                       0
## 4                        27                         1                       0
## 5                        27                         1                       0
## 6                        27                         1                       0
## 7                        27                         1                       0
## 8                        27                         1                       0
## 9                        27                         1                       0
## 10                       27                         1                       0
##    stays_in_week_nights adults children babies meal country market_segment
## 1                     0      2        0      0   BB     PRT         Direct
## 2                     0      2        0      0   BB     PRT         Direct
## 3                     1      1        0      0   BB     GBR         Direct
## 4                     1      1        0      0   BB     GBR      Corporate
## 5                     2      2        0      0   BB     GBR      Online TA
## 6                     2      2        0      0   BB     GBR      Online TA
## 7                     2      2        0      0   BB     PRT         Direct
## 8                     2      2        0      0   FB     PRT         Direct
## 9                     3      2        0      0   BB     PRT      Online TA
## 10                    3      2        0      0   HB     PRT  Offline TA/TO
##    distribution_channel is_repeated_guest previous_cancellations
## 1                Direct                 0                      0
## 2                Direct                 0                      0
## 3                Direct                 0                      0
## 4             Corporate                 0                      0
## 5                 TA/TO                 0                      0
## 6                 TA/TO                 0                      0
## 7                Direct                 0                      0
## 8                Direct                 0                      0
## 9                 TA/TO                 0                      0
## 10                TA/TO                 0                      0
##    previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                               0                  C                  C
## 2                               0                  C                  C
## 3                               0                  A                  C
## 4                               0                  A                  A
## 5                               0                  A                  A
## 6                               0                  A                  A
## 7                               0                  C                  C
## 8                               0                  C                  C
## 9                               0                  A                  A
## 10                              0                  D                  D
##    booking_changes deposit_type agent company days_in_waiting_list
## 1                3   No Deposit  NULL    NULL                    0
## 2                4   No Deposit  NULL    NULL                    0
## 3                0   No Deposit  NULL    NULL                    0
## 4                0   No Deposit   304    NULL                    0
## 5                0   No Deposit   240    NULL                    0
## 6                0   No Deposit   240    NULL                    0
## 7                0   No Deposit  NULL    NULL                    0
## 8                0   No Deposit   303    NULL                    0
## 9                0   No Deposit   240    NULL                    0
## 10               0   No Deposit    15    NULL                    0
##    customer_type   adr required_car_parking_spaces total_of_special_requests
## 1      Transient   0.0                           0                         0
## 2      Transient   0.0                           0                         0
## 3      Transient  75.0                           0                         0
## 4      Transient  75.0                           0                         0
## 5      Transient  98.0                           0                         1
## 6      Transient  98.0                           0                         1
## 7      Transient 107.0                           0                         0
## 8      Transient 103.0                           0                         1
## 9      Transient  82.0                           0                         1
## 10     Transient 105.5                           0                         0
##    reservation_status reservation_status_date
## 1           Check-Out              2015-07-01
## 2           Check-Out              2015-07-01
## 3           Check-Out              2015-07-02
## 4           Check-Out              2015-07-02
## 5           Check-Out              2015-07-03
## 6           Check-Out              2015-07-03
## 7           Check-Out              2015-07-03
## 8           Check-Out              2015-07-03
## 9            Canceled              2015-05-06
## 10           Canceled              2015-04-22

To check for the missing values in the dataset

colSums(is.na(data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

It shows that there are 4 missing values in only one column “Number of children”. The number of incomplete rows are very less and can be neglected. Here I am planning to assume the number of children is 0.

data$children[is.na(data$children)] <- 0

Let us check for the missing values after imputing

sum(is.na(data))
## [1] 0

To check the summary of the data

summary(data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
## 

From the summary of the data I could only understand the integer columns, To understand the character columns I will further do the exploratory data analysis.

To show the clean data

head(data,10)
##           hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1  Resort Hotel           0       342              2015               July
## 2  Resort Hotel           0       737              2015               July
## 3  Resort Hotel           0         7              2015               July
## 4  Resort Hotel           0        13              2015               July
## 5  Resort Hotel           0        14              2015               July
## 6  Resort Hotel           0        14              2015               July
## 7  Resort Hotel           0         0              2015               July
## 8  Resort Hotel           0         9              2015               July
## 9  Resort Hotel           1        85              2015               July
## 10 Resort Hotel           1        75              2015               July
##    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                        27                         1                       0
## 2                        27                         1                       0
## 3                        27                         1                       0
## 4                        27                         1                       0
## 5                        27                         1                       0
## 6                        27                         1                       0
## 7                        27                         1                       0
## 8                        27                         1                       0
## 9                        27                         1                       0
## 10                       27                         1                       0
##    stays_in_week_nights adults children babies meal country market_segment
## 1                     0      2        0      0   BB     PRT         Direct
## 2                     0      2        0      0   BB     PRT         Direct
## 3                     1      1        0      0   BB     GBR         Direct
## 4                     1      1        0      0   BB     GBR      Corporate
## 5                     2      2        0      0   BB     GBR      Online TA
## 6                     2      2        0      0   BB     GBR      Online TA
## 7                     2      2        0      0   BB     PRT         Direct
## 8                     2      2        0      0   FB     PRT         Direct
## 9                     3      2        0      0   BB     PRT      Online TA
## 10                    3      2        0      0   HB     PRT  Offline TA/TO
##    distribution_channel is_repeated_guest previous_cancellations
## 1                Direct                 0                      0
## 2                Direct                 0                      0
## 3                Direct                 0                      0
## 4             Corporate                 0                      0
## 5                 TA/TO                 0                      0
## 6                 TA/TO                 0                      0
## 7                Direct                 0                      0
## 8                Direct                 0                      0
## 9                 TA/TO                 0                      0
## 10                TA/TO                 0                      0
##    previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                               0                  C                  C
## 2                               0                  C                  C
## 3                               0                  A                  C
## 4                               0                  A                  A
## 5                               0                  A                  A
## 6                               0                  A                  A
## 7                               0                  C                  C
## 8                               0                  C                  C
## 9                               0                  A                  A
## 10                              0                  D                  D
##    booking_changes deposit_type agent company days_in_waiting_list
## 1                3   No Deposit  NULL    NULL                    0
## 2                4   No Deposit  NULL    NULL                    0
## 3                0   No Deposit  NULL    NULL                    0
## 4                0   No Deposit   304    NULL                    0
## 5                0   No Deposit   240    NULL                    0
## 6                0   No Deposit   240    NULL                    0
## 7                0   No Deposit  NULL    NULL                    0
## 8                0   No Deposit   303    NULL                    0
## 9                0   No Deposit   240    NULL                    0
## 10               0   No Deposit    15    NULL                    0
##    customer_type   adr required_car_parking_spaces total_of_special_requests
## 1      Transient   0.0                           0                         0
## 2      Transient   0.0                           0                         0
## 3      Transient  75.0                           0                         0
## 4      Transient  75.0                           0                         0
## 5      Transient  98.0                           0                         1
## 6      Transient  98.0                           0                         1
## 7      Transient 107.0                           0                         0
## 8      Transient 103.0                           0                         1
## 9      Transient  82.0                           0                         1
## 10     Transient 105.5                           0                         0
##    reservation_status reservation_status_date
## 1           Check-Out              2015-07-01
## 2           Check-Out              2015-07-01
## 3           Check-Out              2015-07-02
## 4           Check-Out              2015-07-02
## 5           Check-Out              2015-07-03
## 6           Check-Out              2015-07-03
## 7           Check-Out              2015-07-03
## 8           Check-Out              2015-07-03
## 9            Canceled              2015-05-06
## 10           Canceled              2015-04-22

Visualization

To show the distribution of the data

Histograms are used to show the distribution of the data for the numeric variables.

multi.hist(data[,sapply(data, is.numeric)])

The observations made from the above histograms and the numerical data summary are :

  • The most busiest month is August.
  • The year “2016” had the most number of bookings.
  • The second half of the year has most bookings.

To check for the outliers visually

Boxplots are used to identify the outliers.

par(mfrow = c(5,4), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

boxplot(data$is_canceled, main = "is_canceled")
boxplot(data$lead_time, main = "lead_time")
boxplot(data$arrival_date_year, main = "arrival_date_year")
boxplot(data$arrival_date_week_number, main = "arrival_date_week_number")
boxplot(data$arrival_date_day_of_month, main = "arrival_date_day_of_month")
boxplot(data$stays_in_weekend_nights, main = "stays_in_weekend_nights")
boxplot(data$stays_in_week_nights, main = "stays_in_week_nights")
boxplot(data$children, main = "children")
boxplot(data$adults, main = "adults")
boxplot(data$babies, main = "babies")
boxplot(data$is_repeated_guest, main = "is_repeated_gues")
boxplot(data$previous_cancellations, main = "previous_cancellation")
boxplot(data$previous_bookings_not_canceled, main ="previous_bookings_not_canceled")
boxplot(data$booking_changes, main = "booking_changes")
boxplot(data$days_in_waiting_list, main = "days_in_waiting_list")
boxplot(data$adr, main = "adr")
boxplot(data$required_car_parking_spaces, main = "required_car_parking_spaces")
boxplot(data$total_of_special_requests, main = "total_of_special_requests")

The following outliers can be seen:

  • lead_time : There are lead times in the range 500 - 700. I would check this with the hotel to know if there are guests who made reservations 2 years prior to the arrival before doing anything to the data since there are many such observations.

  • days_in_waiting_list : There are observations with days in waiting list above 350. I would check with the hotel if they are really not able to assign a reservation to a guest for more than a year before doing anything to the data since there are many such observations.

  • adr : There is an observation with a negative adr and one with adr of 5400. The observation with negative adr can be removed without any questions. I would remove the adr of 5400 also since we are using the data for the analysis and not for reporting.

To remove the outliers

data <- data[ !(data$adr < 0 | data$adr == 5400), ]

Exploratory Data Analysis

I would now go through each variable of concern and understand how it affects the revenue

Most Preferred Hotel Type

To find the total number of nights the guest stayed including weekday nights and weekend nights

data$total_stay_nights <- rowSums(cbind(data$stays_in_weekend_nights, data$stays_in_week_nights), na.rm = TRUE)

To find the total revenue earned by each booking , this can be calculated by multiplying total number of nights and average daily price

data <- data %>% mutate(revenue = (total_stay_nights * adr))

To make a table of Total booking and Revenue for each Hotel type. Here I am only considering the reservations which are not canceled.

data_hotel_revenue <- data %>% filter(is_canceled == 0) %>%  group_by(hotel) %>% summarise(total_bookings= n(), total_revenue = sum(revenue))

To visually understand this observation

ggplot(data_hotel_revenue, aes(y=total_bookings, x=hotel)) + 
  geom_bar(position="dodge", stat = "identity" , width=0.25, fill = "blue" )

ggplot(data_hotel_revenue, aes(y=total_revenue, x=hotel)) + 
  geom_bar(position="dodge", stat = "identity", width=0.25, fill = "blue")

This observation shows that more bookings were made in City Hotel than the Resort Hotel and hence City Hotel got more Revenue between 2015 and 2017.

Cancellation

To see the cancellation trends over the years

ggplot(data, aes(x = is_canceled, fill = factor(hotel))) +
  geom_histogram(binwidth = 0.2) +
  scale_x_continuous(breaks = seq(0, 1, 1))

This shows that more cancellations are made in City Hotel. Also, the cancellation percentage is more than 50%.

To find which year gave the most revenue

For this calculation , I am considering only the bookings which were not canceled

data_not_canceled <- data %>% filter(is_canceled == 0)
ggplot(data_not_canceled, aes(x = arrival_date_year, y = revenue, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25) +
  scale_x_continuous(breaks = seq(2015,2017,1))

This shows the most revenue was made in 2016 and the most revenue was made through City Hotel.

To find which month gave the most revenue

For this calculation , I am considering only the bookings which were not canceled

data_not_canceled$arrival_date_month <- factor(data_not_canceled$arrival_date_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
ggplot(data_not_canceled, aes(x = arrival_date_month, y = revenue, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25)

This shows August has most bookings and hence gives most revenue.

To find the most preferred meal type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(data, aes(meal)) +
  geom_bar(width=0.25, fill = "blue")

This shows “Bed & Breakfast” is the most preferred meal type

To find the most preferred Room type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(data, aes(reserved_room_type , fill = factor(hotel))) +
  geom_bar(width=0.25, fill = "blue")

Most preferred room type is “A”.

To find the most common customer (guest) type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(data, aes(customer_type)) +
  geom_bar(width=0.25, fill = "blue")

Most common customer type is “Transient”.

To find which country most of the Guests are from

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference. Since there are guests from almost 182 countries, Just showing the top 10 countries the guests are from, here.

data_country <- data %>% group_by(country) %>%  summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>% 
  ggplot(.,aes(country, booking_count)) +
  geom_bar(stat = "identity", width = 0.25, fill ="blue")

Most common country is “PRT”.

Linear Modelling

To build the model , I would first prepare the data week wise. I would like to include Week number of arrival, hotel type, customer type, meal type, is_repeated_guest, total number of bookings, total revenue for the week, average adr and average stay nights.

data_model <- data %>% filter(is_canceled == 0) %>%  group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))

model_1 <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_model)

summary(model_1)
## 
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) + 
##     as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + 
##     week_adr, data = data_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55215 -20804  -5367  10432 242978 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               7434.26    3706.77   2.006   0.0451
## as.factor(hotel)Resort Hotel             -1644.98    2212.95  -0.743   0.4574
## as.factor(customer_type)Group            -3607.78    3597.32  -1.003   0.3161
## as.factor(customer_type)Transient        41462.05    3017.20  13.742  < 2e-16
## as.factor(customer_type)Transient-Party  14504.96    3108.08   4.667 3.36e-06
## as.factor(meal)FB                       -59550.92    4570.39 -13.030  < 2e-16
## as.factor(meal)HB                       -32974.17    2495.85 -13.212  < 2e-16
## as.factor(meal)SC                       -32095.51    2730.19 -11.756  < 2e-16
## as.factor(meal)Undefined                -53234.85    4497.99 -11.835  < 2e-16
## as.factor(is_repeated_guest)1           -33540.02    2258.14 -14.853  < 2e-16
## average_stay_nights                        656.51     345.97   1.898   0.0580
## week_adr                                   208.88      20.75  10.068  < 2e-16
##                                            
## (Intercept)                             *  
## as.factor(hotel)Resort Hotel               
## as.factor(customer_type)Group              
## as.factor(customer_type)Transient       ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB                       ***
## as.factor(meal)HB                       ***
## as.factor(meal)SC                       ***
## as.factor(meal)Undefined                ***
## as.factor(is_repeated_guest)1           ***
## average_stay_nights                     .  
## week_adr                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34580 on 1378 degrees of freedom
## Multiple R-squared:  0.3765, Adjusted R-squared:  0.3716 
## F-statistic: 75.66 on 11 and 1378 DF,  p-value: < 2.2e-16

Summary

The Hotels ultimately want to increase their revenue and they would like to understand and concentrate on each and everything that improves their revenue.

The conclusions from my analysis of the data from 2015 to 2017 are :

From the Exploratory Data Analysis

  • Most of the bookings are done for the City Hotel and the major revenue is from that as well. So the consumer can plan appropriate marketing strategies.
  • The most busiest month has been August. So the consumer can be prepared to expect more guests, make necessary arrangements, do appropriate marketing. Also this helps them understand which months have the least bookings and can plan to give some deals to attract customers in the off season .
  • The most preferred meal type has been Bed & Breakfast. The consumer can be prepared to have more resources for meal time.
  • Most visitors are from PRT. This helps the consumer to know his target customers and can attract them during their holiday seasons.
  • The most preferred room type is A. This helps the consumer understand what his guests are looking for.
  • The cancellation percentage has been more than 50%. The consumer would definitely look for ways to reduce this. Like following up with reserved guests prior.
  • It can be concluded that the Revenue is depending on various factors. The consumer can plan for the appropriate marketing strategies.

From the Linear Model

  • The Revenue is positively affected by factors such as adr, Total nights stayed by the guest .
  • The Revenue is slightly increased by Transient and Transient Party customers.