DATA ANALYSIS

INTRODUCTION

The analysis focuses on addressing the following points based on the available hotel booking information.
- Customer preferences in the type of hotel
- Booking cancellations
- Customer footprint over time
- Busiest time of the year for the hotels
- Nationalities of guests
- Typical length of stay of customers

We will be using an open hotel booking demand data from July 2015 to August 2017.
This dataset contains booking information for two types of hotels - city and resort - and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

The data is imported into R and subjected to preliminary data analysis to understand the structure and variable types. Summary statistics are computed to reveal distributions of each variable, missing values and outliers. The outliers and missing values are treated as required.

Univariate and bivariate plots help determine any hidden patterns or relationships within the variables. We will also derive features from existing variables that shed light on customer behavior using feature engineering.

Based on the inferences from the aforesaid exploratory analysis, we will attempt to implement a regression model to attempt to estimate the length of stay and/or booking cancellations. We will also see the feasibility of using clustering techniques to identify customer groups that follow certain booking patterns.

The objective of this study is to help the end-user understand any underlying pattern in hotel bookings considered for the study and provide the proposed models to estimate length of booking or to predict whether a particular booking will be cancelled.

PACKAGES

suppressPackageStartupMessages(library(tidyverse,warn.conflicts = FALSE))

As you can see above,we have suppressed the messages and warnings resulting from loading the package tidyverse

Tidyverse package is a collection of multiple packages like dplyr, ggplot2, tidyr, tibble etc. Dplyr is used for data manipulation and contains functions like mutate(), select(), filter() etc. ggplot2 is used to have nice visualization plots. Tidyr is also used to tidy data i.e. like changing the shape of the data or to work with missing values etc.

hotel_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

DATA PREPARATION

Data Source

We chose the Hotel Bookings data, and downloaded the original data from here

This data set contains information about clients hotel bookings and reservations including cancellations information. It has two types of hotels: * City Hotel * Resort Hotel

Data includes details about type of hotel, number of adults and children/babies, booking information, length of stay, arrival dates, if the customer is repeated or not, cancellation details etc.

Bookings data represents the bookings due to arrive between 1st of July, 2015 and 31st August, 2017, including bookings that effectively arrived and cancelled. Data source is of hotels located in Portugal.

str(hotel_data)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Original data has total 119390 observations with 32 variables; Data has around 17 variables that are integers and 14 variables that are character data type, whereas average daily rate is of num data type.

Data types for all variables seems to be correct, there is no need of correction here.

head(hotel_data,5)
## # A tibble: 5 x 32
##   hotel        is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
##   <chr>              <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 Resort Hotel           0       342             2015 July                           27
## 2 Resort Hotel           0       737             2015 July                           27
## 3 Resort Hotel           0         7             2015 July                           27
## 4 Resort Hotel           0        13             2015 July                           27
## 5 Resort Hotel           0        14             2015 July                           27
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, ...
tail(hotel_data,5)
## # A tibble: 5 x 32
##   hotel      is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
##   <chr>            <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 City Hotel           0        23             2017 August                         35
## 2 City Hotel           0       102             2017 August                         35
## 3 City Hotel           0        34             2017 August                         35
## 4 City Hotel           0       109             2017 August                         35
## 5 City Hotel           0       205             2017 August                         35
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, ...
attach(hotel_data)
summary(hotel_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

Average Daily Rate has 1 negative value (-6.38) which is wrong.

Days_in_waiting_list has maximum value as 391, which is also not correct, as this variable represents no.of.days booking was in the waiting list which doesn’t seem right. Also the average value is around 2.32 which is very far from maximum value.

Removing NA’s

colSums(is.na(hotel_data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Children has 4 NA values. This is very small compared to the total number of observations. So we can directly remove them or replace with mean value. Here we are replacing these NAs with mean values of children.

hotel_data$children[is.na(hotel_data$children)] = round(mean(hotel_data$children, na.rm = TRUE),0)
sum(is.na(hotel_data$children))
## [1] 0
colSums(is.na(hotel_data))

Removing/Replacing NULL Values

table(arrival_date_month)
table(meal)
table(country)
table(market_segment)
table(distribution_channel)
table(reserved_room_type)
table(assigned_room_type)
table(agent)
table(company)
table(customer_type)

Variables Agent 16340 NULL values and company has 112593 NULL values.

This is very high observations. So we cannot remove these NULL values as that will remove important data.

So we are replacing these NULL values with 0, as the agent and company values are just IDs replacing with 0 won’t create any problem.

But in Country also we have 488 NULL values, as this is not very big number we can either delete these rows, or replace NULL values with most repeating country.

Here we are replacing NULL values with maximum repeating country.

hotel_data$agent[hotel_data$agent== "NULL"] <- 0
#table(hotel_data$agent)
hotel_data$company[hotel_data$company== "NULL"] <- 0
#table(hotel_data$company)
hotel_data$country[hotel_data$country== "NULL"] <- max(hotel_data$country)
#table(hotel_data$country)
hotel_data <- filter(hotel_data, adr >= 0)
#str(hotel_data)

Also the negative value in Average daily rate -6.38, as it is just 1 observation we are directly removing that observation.

Now after removing we have 119389 observations instead of 119390

boxplot(days_in_waiting_list)

Days_in_waiting_list max value 391, I am leaving it as of now, as it seems to have values close to 300 from the box plot below. We can remove the outliers while doing any calculations if required.

Data Sanity Check

hotel_data <- mutate(hotel_data, 
                     total_guests = adults + children + babies)
str(hotel_data)
## spec_tbl_df [119,389 x 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119389] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119389] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119389] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119389] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119389] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119389] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119389] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119389] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119389] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119389] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119389] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119389] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119389] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119389] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119389] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119389] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119389] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119389] "0" "0" "0" "304" ...
##  $ company                       : chr [1:119389] "0" "0" "0" "0" ...
##  $ days_in_waiting_list          : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119389] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119389] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119389] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119389] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119389], format: "2015-07-01" "2015-07-01" ...
##  $ total_guests                  : num [1:119389] 2 2 1 1 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
table(hotel_data$total_guests)
## 
##     0     1     2     3     4     5     6    10    12    20    26    27    40 
##   180 22581 82050 10495  3929   137     1     2     2     2     5     2     1 
##    50    55 
##     1     1

For each booking, there should be at-least 1 guest, here guests are of three categories: * Adults * Children * Babies We have combined all three and added a new column called “total_guests”.

But in data we have around 180 rows which have 0 guests. So we are removing those rows directly, as booking without guests doesn’t make sense.

hotel_data <- filter(hotel_data, total_guests > 0)
count(hotel_data[duplicated(hotel_data),])
## # A tibble: 1 x 1
##       n
##   <int>
## 1 31980
count(unique(hotel_data[duplicated(hotel_data),]))
## # A tibble: 1 x 1
##       n
##   <int>
## 1  8161

There 31994 duplicate values in hotel data set. Every column for these duplicate values are same. But we are not exactly sure of the reason for the duplicacy. As we don’t have any unique identifier like customer name, we are not removing these duplicate rows from the data. In real world scenario, we can check with the client the reason for the duplicates and decide on how to deal with them.

Outlier Detection

Plots of Various numeric variables

par(mfrow=c(3,3))

boxplot(hotel_data$adults,main = "Boxplot : # of Adults")

boxplot(hotel_data$children,main = "Boxplot : # of Children")

boxplot(hotel_data$babies,main = "Boxplot : # of Babies")

boxplot(hotel_data$stays_in_weekend_nights,main = "Boxplot : # of Weekend Night Stays")

boxplot(hotel_data$stays_in_week_nights,main = "Boxplot : # of Weekday Night Stays")

boxplot(hotel_data$days_in_waiting_list,main = "Boxplot : # of Days in Waiting List")

boxplot(hotel_data$required_car_parking_spaces,main = "Boxplot : # of Car Parking Spaces Req.")

boxplot(hotel_data$total_of_special_requests,main = "Boxplot : # of Spl. Requests")

boxplot(hotel_data$previous_cancellations,main = "Boxplot : # of Previous Cancellations")

boxplot(hotel_data$previous_bookings_not_canceled,main = "Boxplot : # of Previous Non-Cancellations")

boxplot(hotel_data$booking_changes,main = "Boxplot : # of Booking Changes")

boxplot(hotel_data$adr,main = "Boxplot : Average Daily Rate ")

Observations

  • In the Average Daily Rate column,one value equals 5400 which mostly might be an outlier. The minimum value of this column is 0.26 which is again possibly an outlier. But before the removal, it should be discussed it with the client.

  • Days in waiting column has values greater than a year. In a real world scenario, it might be necessary to communicate to the client and further proceed.

  • In the Children column, there is one specific observation where the number of children is 10.No of adults corresponding to that booking is however only 2.Mostly all the other values are within 4. This probably needs to be pointed out.

  • Nothing unusual found in the rest of the columns.

Removal of Outliers

The summary of the Average Daily rate

summary(hotel_data$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   69.50   94.95  101.97  126.00 5400.00
  • The outlier discusses previously in the Average Daily Rate column, one with a value 5400 should be removed
hotel_data <- hotel_data[hotel_data$adr < max(hotel_data$adr)  ,]
  • Post removal of the outliers,this is how the boxplot looks,
boxplot(hotel_data$adr,main = "Boxplot : Average Daily Rate ")

The summary of the Average Daily rate would now look like below

summary(hotel_data$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   69.50   94.95  101.93  126.00  510.00

We can observe that the outliers are removed and the maximum value now is 510.

rmarkdown::paged_table(hotel_data)

EXPLORATORY DATA ANALYSIS

We can implement Exploratory data analysis tools and techniques to investigate, analyze, and summarize the main characteristics of datasets, often utilizing data visualization methodologies. EDA techniques allow for effective manipulation of data, to find the answers by discovering data patterns, spotting anomalies, checking assumptions, or testing a hypothesis. It can help detect obvious errors, identify outliers in datasets, understand relationships, unearth important factors, find patterns within data, and provide new insights.

The exploratory data analysis steps that analysts have in mind when performing EDA include:

  • Asking the right questions related to the purpose of data analysis like how market segment of Booking is affecting cancellation, what are the other factors that are affecting cancellation of booking, etc.
  • Obtaining in-depth knowledge about problem domains
  • Setting clear objectives that are aligned with the desired outcomes.

There are four exploratory data analysis techniques that data experts use, which include:

  • Graphical
    • Univariate
    • Multivariate
  • Non-Graphical
    • Univariate
    • Multivariate
  1. Univariate Non-Graphical-

This is the simplest type of EDA, where data has a single variable like hotel, is_canceled, arrival_date, etc. Since there is only one variable, we do not have to deal with relationships.

  1. Univariate Graphical-

Non-graphical techniques do not present the complete picture of data. Graphical methods are therefore, required. Common types of univariate graphics include: * Stem-and-leaf plots, which show all data values and the shape of the distribution. + Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. + Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum. Also, to detect outliers in the dataset.

  1. Multivariate Non-Graphical Multivariate data consists of several variables. Non-graphic multivariate EDA methods illustrate relationships between 2 or more data variables using statistics or cross-tabulation.

  2. Multivariate Graphical This EDA technique makes use of graphics to show relationships between 2 or more datasets like Reservation counts vs Number of special requests, Arrival Date Year vs Lead Time By Booking Cancellation Status, etc. Other common types of multivariate graphics include:

    • Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
    • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
    • Run chart, which is a line graph of data plotted over time.
    • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
    • Heat map, which is a graphical representation of data where values are depicted by color.

Different ways to look at the data

Different ways to look at the hotel booking data include clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables. K-means Clustering can also be used to look at the market segmentation of the hotel booking. Predictive models, such as linear regression, can be used as well that use statistics and data to predict hotel booking predictions.

Feature Engineering

We plan to perform Feature Engineering on the Hotel Booking dataset. It is the art of selecting the important features and transforming them into refined and meaningful features that suit the needs of the model. An effective Feature Engineering implies:

  • Higher efficiency of the model
  • Easier Algorithms that fit the data
  • Easier for Algorithms to detect patterns in the data

We plan to create some new features from the data. Also, remove unwanted features like reservation_status.

  • We have two features in our dataset reserved_room_type and another is assigned_room_type. We will make the new feature let us call it Room which will contain 1 if the guest was assigned the same room that was reserved else 0. Guest can cancel the booking if he did not get the same room.
  • Another feature will be net_cancelled. It will contain 1 If the current customer has canceled more bookings in the past than the number of bookings he did not cancel, else 0.

Further, we plan to look at the correlation between the variables. It can be done with the help of a heatmap. Methods of correlation summarize the relationship between two variables in a single number called the correlation coefficient. The correlation coefficient is usually represented using the symbol r, and it ranges from -1 to +1.

A correlation coefficient quite close to 0, but either positive or negative, implies little or no relationship between the two variables. A correlation coefficient close to plus 1 means a positive relationship between the two variables, with increases in one of the variables being associated with increases in the other variable.

A correlation coefficient close to -1 indicates a negative relationship between two variables, with an increase in one of the variables being associated with a decrease in the other variable. A correlation coefficient can be produced for ordinal, interval or ratio level variables, but has little meaning for variables which are measured on a scale which is no more than nominal.

Summarize the data

To summarize the data, we need to simplify the data. The distribution of a variable shows what values the variable takes and how often the variable takes these values. The two most useful ways of describing the distribution of data are: 1. The typical: This describes the center or middle of the data. This way of describing the center is also called a measure of central tendency. The three common ways of looking at the center are average (also called mean), mode and median. All three summarize a distribution of the data by describing the typical value of a variable (average), the most frequently repeated number (mode), or the number in the middle of all the other numbers in a data set (median) 2. The spread of the values around the center: This describes how densely the data is distributed around the center. This is also called a measure of dispersion. Looking at the spread of the distribution of data tells us about the amount of variation, or diversity, within the data. The three measures of the spread of the data are the range, the standard deviation, and the variance. These two ways of describing the data are also referred to as descriptive statistics.

Illustration

We would now go through each variable of concern and understand how it affects the revenue.

Preferred Hotel Type

To find the total number of nights the guest stayed including weekday nights and weekend nights

hotel_data$total_stay_nights <- rowSums(cbind(hotel_data$stays_in_weekend_nights, hotel_data$stays_in_week_nights), na.rm = TRUE)

To find the total revenue earned by each booking , this can be calculated by multiplying total number of nights and average daily price

hotel_data <- hotel_data %>% mutate(revenue = (total_stay_nights * adr))

To make a table of Total booking and Revenue for each Hotel type. Here I am only considering the reservations which are not canceled.

data_hotel_revenue <- hotel_data %>% filter(is_canceled == 0) %>%  group_by(hotel) %>% summarise(total_bookings= n(), total_revenue = sum(revenue))

To visually understand this observation,

ggplot(data_hotel_revenue, aes(y=total_revenue, x=hotel)) + 
  geom_bar(position="dodge", stat = "identity", width=0.25, fill = "orange")

ggplot(data_hotel_revenue, aes(y=total_revenue, x=hotel)) + 
  geom_bar(position="dodge", stat = "identity", width=0.25, fill = "orange")

This observation shows that more bookings were made in City Hotel than the Resort Hotel and hence City Hotel got more Revenue between 2015 and 2017.

Cancellation

To see the cancellation trends over the years,

ggplot(hotel_data, aes(x = is_canceled, fill = factor(hotel))) +
  geom_histogram(binwidth = 0.2) +
  scale_x_continuous(breaks = seq(0, 1, 1))

This shows that more cancellations are made in City Hotel. Also, the cancellation percentage is more than 50%. #### Revenue

To find which year gave the most revenue

For this calculation , we are considering only the bookings which were not cancelled

data_not_canceled <- hotel_data %>% filter(is_canceled == 0)
ggplot(data_not_canceled, aes(x = arrival_date_year, y = revenue, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25) +
  scale_x_continuous(breaks = seq(2015,2017,1))

This shows the most revenue was made in 2016 and the most revenue was made through City Hotel.

To find which month gave the most revenue

For this calculation , we are considering only the bookings which were not cancelled

data_not_canceled$arrival_date_month <- factor(data_not_canceled$arrival_date_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
ggplot(data_not_canceled, aes(x = arrival_date_month, y = revenue, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25)

This shows August has most bookings and hence gives most revenue.

Stays by Month

Here, we are looking at the distribution of hotel bookings by months of the year. Bookings are plotted in the below bar plot.

month_vector <- factor(hotel_data$arrival_date_month)
month<-factor(month_vector, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
levels(month)<-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
plot(month,col="coral1",ylab="Count",xlab="Month")

Hence, there are more bookings in summer months than colder months. August has the most bookings and January has the least.

Preferred Meal Type

To find the most preferred meal type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(hotel_data, aes(meal)) +
  geom_bar(width=0.25, fill = "orange")

This shows “Bed & Breakfast” is the most preferred meal type.

Preferred Room Type

To find the most preferred Room type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(hotel_data, aes(reserved_room_type , fill = factor(hotel))) +
  geom_bar(width=0.25, fill = "orange")

Most preferred room type is “A”.

Common Guest Type

To find the most common customer (guest) type

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.

ggplot(hotel_data, aes(customer_type)) +
  geom_bar(width=0.25, fill = "orange")

Most common customer type is “Transient”.

Guest Country Analysis

To find which country most of the Guests are from

Considering the whole data here (including cancellations) since this analysis is to understand the guest preference. Since there are guests from almost 182 countries, Just showing the top 10 countries the guests are from, here.

data_country <- hotel_data %>% group_by(country) %>%  summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>% 
  ggplot(.,aes(country, booking_count)) +
  geom_bar(stat = "identity", width = 0.25, fill ="orange")

Most common country is “Portugal”.

MODELLING

We are incorporating Linear Modeling to understand the impact of the other variables on Revenue.

To build the model , we would first prepare the data week wise. We would like to include Week number of arrival, hotel type, customer type, meal type, is_repeated_guest, total number of bookings, total revenue for the week, average adr and average stay nights.

data_model <- hotel_data %>% filter(is_canceled == 0) %>%  group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
model_1 <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_model)
summary(model_1)
## 
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) + 
##     as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + 
##     week_adr, data = data_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56660 -20718  -5261  10370 241913 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               6595.58    3762.78   1.753   0.0799
## as.factor(hotel)Resort Hotel             -1051.02    2249.44  -0.467   0.6404
## as.factor(customer_type)Group            -3424.20    3654.23  -0.937   0.3489
## as.factor(customer_type)Transient        41707.90    3055.52  13.650  < 2e-16
## as.factor(customer_type)Transient-Party  14657.35    3154.51   4.646  3.7e-06
## as.factor(meal)FB                       -60125.95    4596.53 -13.081  < 2e-16
## as.factor(meal)HB                       -33364.72    2514.58 -13.269  < 2e-16
## as.factor(meal)SC                       -32748.98    2776.48 -11.795  < 2e-16
## as.factor(meal)Undefined                -53859.64    4525.11 -11.902  < 2e-16
## as.factor(is_repeated_guest)1           -34169.85    2278.98 -14.994  < 2e-16
## average_stay_nights                        664.05     354.42   1.874   0.0612
## week_adr                                   214.85      21.16  10.151  < 2e-16
##                                            
## (Intercept)                             .  
## as.factor(hotel)Resort Hotel               
## as.factor(customer_type)Group              
## as.factor(customer_type)Transient       ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB                       ***
## as.factor(meal)HB                       ***
## as.factor(meal)SC                       ***
## as.factor(meal)Undefined                ***
## as.factor(is_repeated_guest)1           ***
## average_stay_nights                     .  
## week_adr                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34720 on 1361 degrees of freedom
## Multiple R-squared:  0.3776, Adjusted R-squared:  0.3725 
## F-statistic: 75.05 on 11 and 1361 DF,  p-value: < 2.2e-16

Linear Modeling Results and Way Forward

Of the variables included in the model, resort hotel, group and average_stay_nights are found to be not significant at the 95% confidence interval. Therefore we can conclude that based on this model, these variables do not have a statistically significant influence on revenue.

Out of the predictors that have been found statistically significant, we observe that average daily rate and transient customers are features that have a positive impact on the revenue.

Linear regression has revealed that atleast some of the variables have an influence on revenue and can be used to control and predict the revenue generation. Using this as a first step, we can explore other techniques to determine the most important variables contributing to revenue generation and find the best method to build a model that can predict revenue based on the supporting features.

SUMMARY

INSIGHTS

  • Most of the bookings are done for the City Hotel and the major revenue is from that as well. So the consumer can plan appropriate marketing strategies.

  • The most busiest month has been August. So the consumer can be prepared to expect more guests, make necessary arrangements, do appropriate marketing. Also this helps them understand which months have the least bookings and can plan to give some deals to attract customers in the off season.

  • The most preferred meal type has been Bed & Breakfast. The consumer can be prepared to have more resources for meal time.

  • Most visitors are from Portugal. This helps the consumer to know his target customers and can attract them during their holiday seasons.

  • The most preferred room type is A. This helps the consumer understand what his guests are looking for.

  • The cancellation percentage has been more than 50%. The consumer would definitely look for ways to reduce this. Like following up with reserved guests prior.

  • It can be concluded that the Revenue is depending on various factors. The consumer can plan for the appropriate marketing strategies.

LIMITATIONS

  • The definition of new customers is not very well described. A new customer this year will be existing next year, or they can be existing customer from the 2nd booking. A deeper analysis in required based on definition.
  • The forecasting aspect can be further drilled to analyze the residuals and split the model across years and various factors such as customer type or hotel type
  • The classification model uses few variables. We can tune the model with new variables and adjusting the cost of misclassification.

CONCLUSION

The study on this data thus far using established wrangling techniques has revealed answers to the questions we initially set out to answer. The next steps for us, if we choose to pursue this, would be to model revenue accurately using other techniques, as well as to use the available data to derive insights such as clear customer segments that reveals where the business is performing well versus where it is not. This could help the business identify where to focus on to improve revenue and reduce cancellations.