Introduction to Problem statement:

After analysis of “Hotel Booking Demand dataset”, there are several burning questions which needs to be addressed that will form the basis of our further analysis.

This dataset is interesting because after finding the answers to our above problem statements, hotels will be able to manage their customer bookings more efficiently and effectively.

Explanation to address the problem statement with the required packages

This dataset describes the booking demand of two hotels i.e. Resort Hotel and City Hotel.Both the hotels have 31 deciding variables. City Hotel and Resort Hotel has total 50,812 and 40,061 observations respectively. The dataset shows observations of 3 years i.e. from Aug 2015 to Aug 2017.

Data Source: Booking demand dataset of hotels is from the website
Antonio, Almeida and Nunes, 2019.

We will use R code to do our analysis. To handle these problem statements, we need to load below packages/libraries in RStudio.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)

Preparing the data with the Proposed Approach to this problem.

# Importing the dataset

Hotel_Demand_Data <- read.csv("C:/Sai Anne/MSBA/7025_001_Data Wrangling/Homework-3/hotels.csv",stringsAsFactors = TRUE)
# Structure of the Data

str(Hotel_Demand_Data)
## 'data.frame':    90873 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 13 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 6 levels "","BB","FB","HB",..: 2 2 2 2 2 2 2 3 2 4 ...
##  $ country                       : Factor w/ 160 levels "","AGO","ALB",..: 124 124 54 54 54 54 124 124 124 124 ...
##  $ market_segment                : Factor w/ 9 levels "","Aviation",..: 5 5 5 4 8 8 5 5 8 7 ...
##  $ distribution_channel          : Factor w/ 6 levels "","Corporate",..: 3 3 3 2 5 5 3 3 5 5 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 11 levels "","A","B","C",..: 4 4 2 2 2 2 4 4 2 5 ...
##  $ assigned_room_type            : Factor w/ 13 levels "","A","B","C",..: 4 4 4 2 2 2 4 4 2 5 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 4 levels "","No Deposit",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ agent                         : Factor w/ 293 levels "","1","10","103",..: 293 293 293 140 100 100 293 139 100 39 ...
##  $ company                       : Factor w/ 312 levels "","10","100",..: 312 312 312 312 312 312 312 312 312 312 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 5 levels "","Contract",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 4 levels "","Canceled",..: 3 3 3 3 3 3 3 3 2 2 ...
##  $ reservation_status_date       : Factor w/ 927 levels "","2014-10-17",..: 123 123 124 124 125 125 125 125 74 63 ...
This dataset contains 31 variables and 90873 observation of two hotels.
By looking at our dataset, we found that there is one observation for which almost all variables are blank and thus, no significant loss of data will happen if we remove that observation. So, our first step is to remove that observation.
Hotel_Demand_Data_Update <- Hotel_Demand_Data %>% slice(-c(90873))
# Calculating Missing/Null values from all columns

colSums(is.na(Hotel_Demand_Data_Update))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
  • There is only one column i.e., children with 4 missing values.

  • Since, children column is the only column with 4 missing values, we will impute it with the median value since it is a numerical data. First, we will find out the summary statistics of children column to find out median.

# Finding summary statistics for 'children'

summary(Hotel_Demand_Data_Update$children)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00000  0.00000  0.00000  0.09909  0.00000 10.00000        4
  • Since, the median value is 0, we impute the 4 NA values with 0.
# Imputing missing values in 'children' column

Hotel_Demand_Data_Update$children[is.na(Hotel_Demand_Data_Update$children)] <- 0

After imputing, the number of missing values in the children column is 0.

sum(is.na(Hotel_Demand_Data_Update))
## [1] 0
After checking again for NA values, we found there is no longer any missing values. Thus, missing values are handled.
Summary statistics of each variable in the dataset
summary(Hotel_Demand_Data_Update)
##           hotel        is_canceled       lead_time   arrival_date_year
##  City Hotel  :50812   Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Resort Hotel:40060   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##                       Median :0.0000   Median : 71   Median :2016     
##                       Mean   :0.4863   Mean   :107   Mean   :2016     
##                       3rd Qu.:1.0000   3rd Qu.:165   3rd Qu.:2017     
##                       Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                       
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August   : 9809    Min.   : 1.00            Min.   : 1.00            
##  May      : 9490    1st Qu.:16.00            1st Qu.: 8.00            
##  April    : 9107    Median :27.00            Median :16.00            
##  October  : 8944    Mean   :27.02            Mean   :15.72            
##  July     : 8480    3rd Qu.:39.00            3rd Qu.:23.00            
##  September: 8339    Max.   :53.00            Max.   :31.00            
##  (Other)  :36703                                                      
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.000       Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.000       1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.000       Median : 2.000  
##  Mean   : 0.9558         Mean   : 2.598       Mean   : 1.851  
##  3rd Qu.: 2.0000         3rd Qu.: 3.000       3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.000       Max.   :55.000  
##                                                               
##     children            babies                 meal          country     
##  Min.   : 0.00000   Min.   : 0.000000            :    0   PRT    :44041  
##  1st Qu.: 0.00000   1st Qu.: 0.000000   BB       :71212   GBR    : 9201  
##  Median : 0.00000   Median : 0.000000   FB       :  795   ESP    : 6743  
##  Mean   : 0.09908   Mean   : 0.008286   HB       :12090   FRA    : 5873  
##  3rd Qu.: 0.00000   3rd Qu.: 0.000000   SC       : 5606   DEU    : 3646  
##  Max.   :10.00000   Max.   :10.000000   Undefined: 1169   IRL    : 2744  
##                                                           (Other):18624  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :39489            :    0      Min.   :0.00000  
##  Offline TA/TO:19083   Corporate: 5519      1st Qu.:0.00000  
##  Groups       :17881   Direct   :11181      Median :0.00000  
##  Direct       : 9355   GDS      :   62      Mean   :0.03585  
##  Corporate    : 4396   TA/TO    :74105      3rd Qu.:0.00000  
##  Complementary:  547   Undefined:    5      Max.   :1.00000  
##  (Other)      :  121                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.000         Min.   : 0.0000                A      :65040     
##  1st Qu.: 0.000         1st Qu.: 0.0000                D      :13829     
##  Median : 0.000         Median : 0.0000                E      : 5702     
##  Mean   : 0.113         Mean   : 0.1621                F      : 2151     
##  3rd Qu.: 0.000         3rd Qu.: 0.0000                G      : 1802     
##  Max.   :26.000         Max.   :72.0000                C      :  927     
##                                                        (Other): 1421     
##  assigned_room_type booking_changes      deposit_type       agent      
##  A      :55338      Min.   : 0.000             :    0   9      :18851  
##  D      :18730      1st Qu.: 0.000   No Deposit:76132   240    :13906  
##  E      : 6748      Median : 0.000   Non Refund:14579   NULL   :13574  
##  F      : 2908      Mean   : 0.204   Refundable:  161   1      : 6952  
##  C      : 2252      3rd Qu.: 0.000                      250    : 2869  
##  G      : 2144      Max.   :21.000                      6      : 2184  
##  (Other): 2752                                          (Other):32536  
##     company      days_in_waiting_list         customer_type        adr         
##  NULL   :85404   Min.   :  0.000                     :    0   Min.   :  -6.38  
##  223    :  784   1st Qu.:  0.000      Contract       : 4013   1st Qu.:  64.00  
##  40     :  665   Median :  0.000      Group          :  476   Median :  89.10  
##  67     :  267   Mean   :  2.836      Transient      :66979   Mean   :  97.77  
##  45     :  185   3rd Qu.:  0.000      Transient-Party:19404   3rd Qu.: 120.00  
##  281    :  138   Max.   :391.000                              Max.   :5400.00  
##  (Other): 3429                                                                 
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000                     :    0   
##  1st Qu.:0.00000             1st Qu.:0.0000            Canceled :42990   
##  Median :0.00000             Median :0.0000            Check-Out:46684   
##  Mean   :0.06903             Mean   :0.4803            No-Show  : 1198   
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  2015-10-21: 1461       
##  2015-07-06:  805       
##  2015-01-01:  763       
##  2016-11-25:  757       
##  2016-01-18:  625       
##  2015-07-02:  469       
##  (Other)   :85992

Proposed Exploratory Data Analysis to help the consumer

Visualization

We can use histograms and box plots to visualize our data for further analysis.