After analysis of “Hotel Booking Demand dataset”, there are several burning questions which needs to be addressed that will form the basis of our further analysis.
This dataset is interesting because after finding the answers to our above problem statements, hotels will be able to manage their customer bookings more efficiently and effectively.
This dataset describes the booking demand of two hotels i.e. Resort Hotel and City Hotel.Both the hotels have 31 deciding variables. City Hotel and Resort Hotel has total 50,812 and 40,061 observations respectively. The dataset shows observations of 3 years i.e. from Aug 2015 to Aug 2017.
Data Source: Booking demand dataset of hotels is from the website
Antonio, Almeida and Nunes, 2019.
We will use R code to do our analysis. To handle these problem statements, we need to load below packages/libraries in RStudio.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
# Importing the dataset
Hotel_Demand_Data <- read.csv("C:/Sai Anne/MSBA/7025_001_Data Wrangling/Homework-3/hotels.csv",stringsAsFactors = TRUE)
# Structure of the Data
str(Hotel_Demand_Data)
## 'data.frame': 90873 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 13 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 6 levels "","BB","FB","HB",..: 2 2 2 2 2 2 2 3 2 4 ...
## $ country : Factor w/ 160 levels "","AGO","ALB",..: 124 124 54 54 54 54 124 124 124 124 ...
## $ market_segment : Factor w/ 9 levels "","Aviation",..: 5 5 5 4 8 8 5 5 8 7 ...
## $ distribution_channel : Factor w/ 6 levels "","Corporate",..: 3 3 3 2 5 5 3 3 5 5 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 11 levels "","A","B","C",..: 4 4 2 2 2 2 4 4 2 5 ...
## $ assigned_room_type : Factor w/ 13 levels "","A","B","C",..: 4 4 4 2 2 2 4 4 2 5 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 4 levels "","No Deposit",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ agent : Factor w/ 293 levels "","1","10","103",..: 293 293 293 140 100 100 293 139 100 39 ...
## $ company : Factor w/ 312 levels "","10","100",..: 312 312 312 312 312 312 312 312 312 312 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 5 levels "","Contract",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 4 levels "","Canceled",..: 3 3 3 3 3 3 3 3 2 2 ...
## $ reservation_status_date : Factor w/ 927 levels "","2014-10-17",..: 123 123 124 124 125 125 125 125 74 63 ...
Hotel_Demand_Data_Update <- Hotel_Demand_Data %>% slice(-c(90873))
# Calculating Missing/Null values from all columns
colSums(is.na(Hotel_Demand_Data_Update))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
There is only one column i.e., children with 4 missing values.
Since, children column is the only column with 4 missing values, we will impute it with the median value since it is a numerical data. First, we will find out the summary statistics of children column to find out median.
# Finding summary statistics for 'children'
summary(Hotel_Demand_Data_Update$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.00000 0.00000 0.09909 0.00000 10.00000 4
# Imputing missing values in 'children' column
Hotel_Demand_Data_Update$children[is.na(Hotel_Demand_Data_Update$children)] <- 0
After imputing, the number of missing values in the children column is 0.
sum(is.na(Hotel_Demand_Data_Update))
## [1] 0
summary(Hotel_Demand_Data_Update)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :50812 Min. :0.0000 Min. : 0 Min. :2015
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Median :0.0000 Median : 71 Median :2016
## Mean :0.4863 Mean :107 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:165 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August : 9809 Min. : 1.00 Min. : 1.00
## May : 9490 1st Qu.:16.00 1st Qu.: 8.00
## April : 9107 Median :27.00 Median :16.00
## October : 8944 Mean :27.02 Mean :15.72
## July : 8480 3rd Qu.:39.00 3rd Qu.:23.00
## September: 8339 Max. :53.00 Max. :31.00
## (Other) :36703
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.0000 Median : 2.000 Median : 2.000
## Mean : 0.9558 Mean : 2.598 Mean : 1.851
## 3rd Qu.: 2.0000 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.000 Max. :55.000
##
## children babies meal country
## Min. : 0.00000 Min. : 0.000000 : 0 PRT :44041
## 1st Qu.: 0.00000 1st Qu.: 0.000000 BB :71212 GBR : 9201
## Median : 0.00000 Median : 0.000000 FB : 795 ESP : 6743
## Mean : 0.09908 Mean : 0.008286 HB :12090 FRA : 5873
## 3rd Qu.: 0.00000 3rd Qu.: 0.000000 SC : 5606 DEU : 3646
## Max. :10.00000 Max. :10.000000 Undefined: 1169 IRL : 2744
## (Other):18624
## market_segment distribution_channel is_repeated_guest
## Online TA :39489 : 0 Min. :0.00000
## Offline TA/TO:19083 Corporate: 5519 1st Qu.:0.00000
## Groups :17881 Direct :11181 Median :0.00000
## Direct : 9355 GDS : 62 Mean :0.03585
## Corporate : 4396 TA/TO :74105 3rd Qu.:0.00000
## Complementary: 547 Undefined: 5 Max. :1.00000
## (Other) : 121
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.000 Min. : 0.0000 A :65040
## 1st Qu.: 0.000 1st Qu.: 0.0000 D :13829
## Median : 0.000 Median : 0.0000 E : 5702
## Mean : 0.113 Mean : 0.1621 F : 2151
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 G : 1802
## Max. :26.000 Max. :72.0000 C : 927
## (Other): 1421
## assigned_room_type booking_changes deposit_type agent
## A :55338 Min. : 0.000 : 0 9 :18851
## D :18730 1st Qu.: 0.000 No Deposit:76132 240 :13906
## E : 6748 Median : 0.000 Non Refund:14579 NULL :13574
## F : 2908 Mean : 0.204 Refundable: 161 1 : 6952
## C : 2252 3rd Qu.: 0.000 250 : 2869
## G : 2144 Max. :21.000 6 : 2184
## (Other): 2752 (Other):32536
## company days_in_waiting_list customer_type adr
## NULL :85404 Min. : 0.000 : 0 Min. : -6.38
## 223 : 784 1st Qu.: 0.000 Contract : 4013 1st Qu.: 64.00
## 40 : 665 Median : 0.000 Group : 476 Median : 89.10
## 67 : 267 Mean : 2.836 Transient :66979 Mean : 97.77
## 45 : 185 3rd Qu.: 0.000 Transient-Party:19404 3rd Qu.: 120.00
## 281 : 138 Max. :391.000 Max. :5400.00
## (Other): 3429
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 : 0
## 1st Qu.:0.00000 1st Qu.:0.0000 Canceled :42990
## Median :0.00000 Median :0.0000 Check-Out:46684
## Mean :0.06903 Mean :0.4803 No-Show : 1198
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## 2015-10-21: 1461
## 2015-07-06: 805
## 2015-01-01: 763
## 2016-11-25: 757
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :85992
Visualization
We can use histograms and box plots to visualize our data for further analysis.