The analysis focuses on addressing the following points based on the available hotel booking information.
- Customer preferences in the type of hotel
- Booking cancellations
- Customer footprint over time
- Busiest time of the year for the hotels
- Nationalities of guests
- Typical length of stay of customers
We will be using an open hotel booking demand data from July 2015 to August 2017.
This dataset contains booking information for two types of hotels - city and resort - and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
The data is imported into R and subjected to preliminary data analysis to understand the structure and variable types. Summary statistics are computed to reveal distributions of each variable, missing values and outliers. The outliers and missing values are treated as required.
Univariate and bivariate plots help determine any hidden patterns or relationships within the variables. We will also derive features from existing variables that shed light on customer behavior using feature engineering.
Based on the inferences from the aforesaid exploratory analysis, we will attempt to implement a regression model to attempt to estimate the length of stay and/or booking cancellations. We will also see the feasibility of using clustering techniques to identify customer groups that follow certain booking patterns.
The objective of this study is to help the end-user understand any underlying pattern in hotel bookings considered for the study and provide the proposed models to estimate length of booking or to predict whether a particular booking will be cancelled.
suppressPackageStartupMessages(library(tidyverse,warn.conflicts = FALSE))
As you can see above,we have suppressed the messages and warnings resulting from loading the package tidyverse
Tidyverse package is a collection of multiple packages like dplyr, ggplot2, tidyr, tibble etc. Dplyr is used for data manipulation and contains functions like mutate(), select(), filter() etc. ggplot2 is used to have nice visualization plots. Tidyr is also used to tidy data i.e. like changing the shape of the data or to work with missing values etc.
hotel_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
We chose the Hotel Bookings data, and downloaded the original data from here
This data set contains information about clients hotel bookings and reservations including cancellations information. It has two types of hotels: * City Hotel * Resort Hotel
Data includes details about type of hotel, number of adults and children/babies, booking information, length of stay, arrival dates, if the customer is repeated or not, cancellation details etc.
Bookings data represents the bookings due to arrive between 1st of July, 2015 and 31st August, 2017, including bookings that effectively arrived and cancelled. Data source is of hotels located in Portugal.
str(hotel_data)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
Original data has total 119390 observations with 32 variables; Data has around 17 variables that are integers and 14 variables that are character data type, whereas average daily rate is of num data type.
Data types for all variables seems to be correct, there is no need of correction here.
head(hotel_data,5)
## # A tibble: 5 x 32
## hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Resort Hotel 0 342 2015 July 27
## 2 Resort Hotel 0 737 2015 July 27
## 3 Resort Hotel 0 7 2015 July 27
## 4 Resort Hotel 0 13 2015 July 27
## 5 Resort Hotel 0 14 2015 July 27
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, ...
tail(hotel_data,5)
## # A tibble: 5 x 32
## hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~ arrival_date_we~
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 City Hotel 0 23 2017 August 35
## 2 City Hotel 0 102 2017 August 35
## 3 City Hotel 0 34 2017 August 35
## 4 City Hotel 0 109 2017 August 35
## 5 City Hotel 0 205 2017 August 35
## # ... with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, ...
attach(hotel_data)
summary(hotel_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
Average Daily Rate has 1 negative value (-6.38) which is wrong.
Days_in_waiting_list has maximum value as 391, which is also not correct, as this variable represents no.of.days booking was in the waiting list which doesn’t seem right. Also the average value is around 2.32 which is very far from maximum value.
colSums(is.na(hotel_data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
Children has 4 NA values. This is very small compared to the total number of observations. So we can directly remove them or replace with mean value. Here we are replacing these NAs with mean values of children.
hotel_data$children[is.na(hotel_data$children)] = round(mean(hotel_data$children, na.rm = TRUE),0)
sum(is.na(hotel_data$children))
## [1] 0
colSums(is.na(hotel_data))
table(arrival_date_month)
table(meal)
table(country)
table(market_segment)
table(distribution_channel)
table(reserved_room_type)
table(assigned_room_type)
table(agent)
table(company)
table(customer_type)
Variables Agent 16340 NULL values and company has 112593 NULL values.
This is very high observations. So we cannot remove these NULL values as that will remove important data.
So we are replacing these NULL values with 0, as the agent and company values are just IDs replacing with 0 won’t create any problem.
But in Country also we have 488 NULL values, as this is not very big number we can either delete these rows, or replace NULL values with most repeating country.
Here we are replacing NULL values with maximum repeating country.
hotel_data$agent[hotel_data$agent== "NULL"] <- 0
#table(hotel_data$agent)
hotel_data$company[hotel_data$company== "NULL"] <- 0
#table(hotel_data$company)
hotel_data$country[hotel_data$country== "NULL"] <- max(hotel_data$country)
#table(hotel_data$country)
hotel_data <- filter(hotel_data, adr >= 0)
#str(hotel_data)
Also the negative value in Average daily rate -6.38, as it is just 1 observation we are directly removing that observation.
Now after removing we have 119389 observations instead of 119390
boxplot(days_in_waiting_list)
Days_in_waiting_list max value 391, I am leaving it as of now, as it seems to have values close to 300 from the box plot below. We can remove the outliers while doing any calculations if required.
hotel_data <- mutate(hotel_data,
total_guests = adults + children + babies)
str(hotel_data)
## spec_tbl_df [119,389 x 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119389] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119389] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119389] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119389] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119389] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119389] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119389] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119389] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119389] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119389] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119389] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119389] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119389] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119389] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119389] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119389] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119389] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119389] "0" "0" "0" "304" ...
## $ company : chr [1:119389] "0" "0" "0" "0" ...
## $ days_in_waiting_list : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119389] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119389] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119389] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119389] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119389] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119389], format: "2015-07-01" "2015-07-01" ...
## $ total_guests : num [1:119389] 2 2 1 1 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
table(hotel_data$total_guests)
##
## 0 1 2 3 4 5 6 10 12 20 26 27 40
## 180 22581 82050 10495 3929 137 1 2 2 2 5 2 1
## 50 55
## 1 1
For each booking, there should be at-least 1 guest, here guests are of three categories: * Adults * Children * Babies We have combined all three and added a new column called “total_guests”.
But in data we have around 180 rows which have 0 guests. So we are removing those rows directly, as booking without guests doesn’t make sense.
hotel_data <- filter(hotel_data, total_guests > 0)
count(hotel_data[duplicated(hotel_data),])
## # A tibble: 1 x 1
## n
## <int>
## 1 31980
count(unique(hotel_data[duplicated(hotel_data),]))
## # A tibble: 1 x 1
## n
## <int>
## 1 8161
There 31994 duplicate values in hotel data set. Every column for these duplicate values are same. But we are not exactly sure of the reason for the duplicacy. As we don’t have any unique identifier like customer name, we are not removing these duplicate rows from the data. In real world scenario, we can check with the client the reason for the duplicates and decide on how to deal with them.
Plots of Various numeric variables
par(mfrow=c(3,3))
boxplot(hotel_data$adults,main = "Boxplot : # of Adults")
boxplot(hotel_data$children,main = "Boxplot : # of Children")
boxplot(hotel_data$babies,main = "Boxplot : # of Babies")
boxplot(hotel_data$stays_in_weekend_nights,main = "Boxplot : # of Weekend Night Stays")
boxplot(hotel_data$stays_in_week_nights,main = "Boxplot : # of Weekday Night Stays")
boxplot(hotel_data$days_in_waiting_list,main = "Boxplot : # of Days in Waiting List")
boxplot(hotel_data$required_car_parking_spaces,main = "Boxplot : # of Car Parking Spaces Req.")
boxplot(hotel_data$total_of_special_requests,main = "Boxplot : # of Spl. Requests")
boxplot(hotel_data$previous_cancellations,main = "Boxplot : # of Previous Cancellations")
boxplot(hotel_data$previous_bookings_not_canceled,main = "Boxplot : # of Previous Non-Cancellations")
boxplot(hotel_data$booking_changes,main = "Boxplot : # of Booking Changes")
boxplot(hotel_data$adr,main = "Boxplot : Average Daily Rate ")
In the Average Daily Rate column,one value equals 5400 which mostly might be an outlier. The minimum value of this column is 0.26 which is again possibly an outlier. But before the removal, it should be discussed it with the client.
Days in waiting column has values greater than a year. In a real world scenario, it might be necessary to communicate to the client and further proceed.
In the Children column, there is one specific observation where the number of children is 10.No of adults corresponding to that booking is however only 2.Mostly all the other values are within 4. This probably needs to be pointed out.
Nothing unusual found in the rest of the columns.
The summary of the Average Daily rate
summary(hotel_data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.50 94.95 101.97 126.00 5400.00
hotel_data <- hotel_data[hotel_data$adr < max(hotel_data$adr) ,]
boxplot(hotel_data$adr,main = "Boxplot : Average Daily Rate ")
The summary of the Average Daily rate would now look like below
summary(hotel_data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.50 94.95 101.93 126.00 510.00
We can observe that the outliers are removed and the maximum value now is 510.
rmarkdown::paged_table(hotel_data)
We can implement Exploratory data analysis tools and techniques to investigate, analyze, and summarize the main characteristics of datasets, often utilizing data visualization methodologies. EDA techniques allow for effective manipulation of data, to find the answers by discovering data patterns, spotting anomalies, checking assumptions, or testing a hypothesis. It can help detect obvious errors, identify outliers in datasets, understand relationships, unearth important factors, find patterns within data, and provide new insights.
The exploratory data analysis steps that analysts have in mind when performing EDA include:
There are four exploratory data analysis techniques that data experts use, which include:
This is the simplest type of EDA, where data has a single variable like hotel, is_canceled, arrival_date, etc. Since there is only one variable, we do not have to deal with relationships.
Non-graphical techniques do not present the complete picture of data. Graphical methods are therefore, required. Common types of univariate graphics include: * Stem-and-leaf plots, which show all data values and the shape of the distribution. + Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. + Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum. Also, to detect outliers in the dataset.
Multivariate Non-Graphical Multivariate data consists of several variables. Non-graphic multivariate EDA methods illustrate relationships between 2 or more data variables using statistics or cross-tabulation.
Multivariate Graphical This EDA technique makes use of graphics to show relationships between 2 or more datasets like Reservation counts vs Number of special requests, Arrival Date Year vs Lead Time By Booking Cancellation Status, etc. Other common types of multivariate graphics include:
Different ways to look at the hotel booking data include clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables. K-means Clustering can also be used to look at the market segmentation of the hotel booking. Predictive models, such as linear regression, can be used as well that use statistics and data to predict hotel booking predictions.
We plan to perform Feature Engineering on the Hotel Booking dataset. It is the art of selecting the important features and transforming them into refined and meaningful features that suit the needs of the model. An effective Feature Engineering implies:
We plan to create some new features from the data. Also, remove unwanted features like reservation_status.
Further, we plan to look at the correlation between the variables. It can be done with the help of a heatmap. Methods of correlation summarize the relationship between two variables in a single number called the correlation coefficient. The correlation coefficient is usually represented using the symbol r, and it ranges from -1 to +1.
A correlation coefficient quite close to 0, but either positive or negative, implies little or no relationship between the two variables. A correlation coefficient close to plus 1 means a positive relationship between the two variables, with increases in one of the variables being associated with increases in the other variable.
A correlation coefficient close to -1 indicates a negative relationship between two variables, with an increase in one of the variables being associated with a decrease in the other variable. A correlation coefficient can be produced for ordinal, interval or ratio level variables, but has little meaning for variables which are measured on a scale which is no more than nominal.
To summarize the data, we need to simplify the data. The distribution of a variable shows what values the variable takes and how often the variable takes these values. The two most useful ways of describing the distribution of data are: 1. The typical: This describes the center or middle of the data. This way of describing the center is also called a measure of central tendency. The three common ways of looking at the center are average (also called mean), mode and median. All three summarize a distribution of the data by describing the typical value of a variable (average), the most frequently repeated number (mode), or the number in the middle of all the other numbers in a data set (median) 2. The spread of the values around the center: This describes how densely the data is distributed around the center. This is also called a measure of dispersion. Looking at the spread of the distribution of data tells us about the amount of variation, or diversity, within the data. The three measures of the spread of the data are the range, the standard deviation, and the variance. These two ways of describing the data are also referred to as descriptive statistics.
We would now go through each variable of concern and understand how it affects the revenue.
To find the total number of nights the guest stayed including weekday nights and weekend nights
hotel_data$total_stay_nights <- rowSums(cbind(hotel_data$stays_in_weekend_nights, hotel_data$stays_in_week_nights), na.rm = TRUE)
To find the total revenue earned by each booking , this can be calculated by multiplying total number of nights and average daily price
hotel_data <- hotel_data %>% mutate(revenue = (total_stay_nights * adr))
To make a table of Total booking and Revenue for each Hotel type. Here I am only considering the reservations which are not canceled.
data_hotel_revenue <- hotel_data %>% filter(is_canceled == 0) %>% group_by(hotel) %>% summarise(total_bookings= n(), total_revenue = sum(revenue))
To visually understand this observation,
ggplot(data_hotel_revenue, aes(y=total_revenue, x=hotel)) +
geom_bar(position="dodge", stat = "identity", width=0.25, fill = "orange")
ggplot(data_hotel_revenue, aes(y=total_revenue, x=hotel)) +
geom_bar(position="dodge", stat = "identity", width=0.25, fill = "orange")
This observation shows that more bookings were made in City Hotel than the Resort Hotel and hence City Hotel got more Revenue between 2015 and 2017.
To see the cancellation trends over the years,
ggplot(hotel_data, aes(x = is_canceled, fill = factor(hotel))) +
geom_histogram(binwidth = 0.2) +
scale_x_continuous(breaks = seq(0, 1, 1))
This shows that more cancellations are made in City Hotel. Also, the cancellation percentage is more than 50%. #### Revenue
To find which year gave the most revenue
For this calculation , we are considering only the bookings which were not cancelled
data_not_canceled <- hotel_data %>% filter(is_canceled == 0)
ggplot(data_not_canceled, aes(x = arrival_date_year, y = revenue, fill = factor(hotel))) +
geom_bar(stat = "identity" , width=0.25) +
scale_x_continuous(breaks = seq(2015,2017,1))
This shows the most revenue was made in 2016 and the most revenue was made through City Hotel.
To find which month gave the most revenue
For this calculation , we are considering only the bookings which were not cancelled
data_not_canceled$arrival_date_month <- factor(data_not_canceled$arrival_date_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
ggplot(data_not_canceled, aes(x = arrival_date_month, y = revenue, fill = factor(hotel))) +
geom_bar(stat = "identity" , width=0.25)
This shows August has most bookings and hence gives most revenue.
Stays by Month
Here, we are looking at the distribution of hotel bookings by months of the year. Bookings are plotted in the below bar plot.
month_vector <- factor(hotel_data$arrival_date_month)
month<-factor(month_vector, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
levels(month)<-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
plot(month,col="coral1",ylab="Count",xlab="Month")
Hence, there are more bookings in summer months than colder months. August has the most bookings and January has the least.
To find the most preferred meal type
Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.
ggplot(hotel_data, aes(meal)) +
geom_bar(width=0.25, fill = "orange")
This shows “Bed & Breakfast” is the most preferred meal type.
To find the most preferred Room type
Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.
ggplot(hotel_data, aes(reserved_room_type , fill = factor(hotel))) +
geom_bar(width=0.25, fill = "orange")
Most preferred room type is “A”.
To find the most common customer (guest) type
Considering the whole data here (including cancellations) since this analysis is to understand the guest preference.
ggplot(hotel_data, aes(customer_type)) +
geom_bar(width=0.25, fill = "orange")
Most common customer type is “Transient”.
To find which country most of the Guests are from
Considering the whole data here (including cancellations) since this analysis is to understand the guest preference. Since there are guests from almost 182 countries, Just showing the top 10 countries the guests are from, here.
data_country <- hotel_data %>% group_by(country) %>% summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>%
ggplot(.,aes(country, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="orange")
Most common country is “Portugal”.
We are incorporating Linear Modeling to understand the impact of the other variables on Revenue.
To build the model , we would first prepare the data week wise. We would like to include Week number of arrival, hotel type, customer type, meal type, is_repeated_guest, total number of bookings, total revenue for the week, average adr and average stay nights.
data_model <- hotel_data %>% filter(is_canceled == 0) %>% group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
model_1 <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_model)
summary(model_1)
##
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) +
## as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights +
## week_adr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56660 -20718 -5261 10370 241913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6595.58 3762.78 1.753 0.0799
## as.factor(hotel)Resort Hotel -1051.02 2249.44 -0.467 0.6404
## as.factor(customer_type)Group -3424.20 3654.23 -0.937 0.3489
## as.factor(customer_type)Transient 41707.90 3055.52 13.650 < 2e-16
## as.factor(customer_type)Transient-Party 14657.35 3154.51 4.646 3.7e-06
## as.factor(meal)FB -60125.95 4596.53 -13.081 < 2e-16
## as.factor(meal)HB -33364.72 2514.58 -13.269 < 2e-16
## as.factor(meal)SC -32748.98 2776.48 -11.795 < 2e-16
## as.factor(meal)Undefined -53859.64 4525.11 -11.902 < 2e-16
## as.factor(is_repeated_guest)1 -34169.85 2278.98 -14.994 < 2e-16
## average_stay_nights 664.05 354.42 1.874 0.0612
## week_adr 214.85 21.16 10.151 < 2e-16
##
## (Intercept) .
## as.factor(hotel)Resort Hotel
## as.factor(customer_type)Group
## as.factor(customer_type)Transient ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB ***
## as.factor(meal)HB ***
## as.factor(meal)SC ***
## as.factor(meal)Undefined ***
## as.factor(is_repeated_guest)1 ***
## average_stay_nights .
## week_adr ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34720 on 1361 degrees of freedom
## Multiple R-squared: 0.3776, Adjusted R-squared: 0.3725
## F-statistic: 75.05 on 11 and 1361 DF, p-value: < 2.2e-16
Of the variables included in the model, resort hotel, group and average_stay_nights are found to be not significant at the 95% confidence interval. Therefore we can conclude that based on this model, these variables do not have a statistically significant influence on revenue.
Out of the predictors that have been found statistically significant, we observe that average daily rate and transient customers are features that have a positive impact on the revenue.
Linear regression has revealed that atleast some of the variables have an influence on revenue and can be used to control and predict the revenue generation. Using this as a first step, we can explore other techniques to determine the most important variables contributing to revenue generation and find the best method to build a model that can predict revenue based on the supporting features.
Most of the bookings are done for the City Hotel and the major revenue is from that as well. So the consumer can plan appropriate marketing strategies.
The most busiest month has been August. So the consumer can be prepared to expect more guests, make necessary arrangements, do appropriate marketing. Also this helps them understand which months have the least bookings and can plan to give some deals to attract customers in the off season.
The most preferred meal type has been Bed & Breakfast. The consumer can be prepared to have more resources for meal time.
Most visitors are from Portugal. This helps the consumer to know his target customers and can attract them during their holiday seasons.
The most preferred room type is A. This helps the consumer understand what his guests are looking for.
The cancellation percentage has been more than 50%. The consumer would definitely look for ways to reduce this. Like following up with reserved guests prior.
It can be concluded that the Revenue is depending on various factors. The consumer can plan for the appropriate marketing strategies.
The study on this data thus far using established wrangling techniques has revealed answers to the questions we initially set out to answer. The next steps for us, if we choose to pursue this, would be to model revenue accurately using other techniques, as well as to use the available data to derive insights such as clear customer segments that reveals where the business is performing well versus where it is not. This could help the business identify where to focus on to improve revenue and reduce cancellations.