Authored by: Robert Culligan

Introduction

One kind of project I wanted to conduct was a simple Exploratory Data Analysis. This would generally mean getting an overall summary of the dataset, clean some of the columns, outline the of the numerical and categorical variables, as well as outlinte the deviations in numerical groups.

This is a fictional dataset that describes the customer base of a fictional hotel chain. There are two branches in this chain - “City Hotel” and “Resort Hotel”. This data set records a number of variables - such as occupation, children in Party, nights stayed, and reservation status.

Considering the data available and the scope of the project, I’d like to understand the distribution of the categorical and numerical factors. I’d particularly like to understand how stay length varies throughout the year - along with the deviation of those variables. Moreover, I’d like to understand how the family dynamic influences these factors.

Loading packages & datasets

library(ggplot2)
library(forcats)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(datasets)

These are the packages I intend to use when analyzing this dataset. You can also see me loading th dataset below.

hotel_demand <- read.csv('/Users/robertculligan/Desktop/R/Datasets/Hotel_Demand/hotel_bookings.csv')

I have stored this information on a password protected cloud.

Preliminary Analysis

This first function will highlight the shape of our dataset.

dim(hotel_demand)
## [1] 119390     32

The next function will highlight the column titles of the dataset.

names(hotel_demand)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"

What is apparent in this assessment is that the the date, month, and year are all in different columns. This change will make it difficult to assess data when time is a variable. What also is causing pause are the stay length variables being separated into weekday and weekend columns. Total nights stayed can be acquired by adding these columns together.

str(hotel_demand)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

What’s also clear is that the reservation date is also presented as a string, meaning that they will not be recognised as a date - further hampering analysis.

Preliminary Cleaning

This function will convert the month column to a month, meaning the column will convert the month into an integer.

hotel_demand$arrival_date_month <- sapply(hotel_demand$arrival_date_month, function(x) grep(paste("(?i)",x,sep=""), month.name))

Consequently, I am now going to merge the three date columns into a single string called “arrivaldate”.

hotel_demand$arrivaldate <- paste(hotel_demand$arrival_date_day_of_month,"-",hotel_demand$arrival_date_month,"-",hotel_demand$arrival_date_year)
hotel_demand$arrivaldate <- gsub('\\s+','',hotel_demand$arrivaldate)
hotel_demand$staylength <- (hotel_demand$stays_in_week_nights+hotel_demand$stays_in_weekend_nights)

Now that we have a string date for both factors, we can now convert them to the date format.

hotel_demand$arrivaldate <- as.Date(hotel_demand$arrivaldate, format = "%d-%m-%Y")
hotel_demand$reservation_status_date <- as.Date(hotel_demand$reservation_status_date, format = "%Y-%m-%d")

Finally, I am going to remove some duplicates.

hotel_demand[!duplicated(hotel_demand$Sepal.Width), ]

Secondary Assessment

This function will outline the general distribution of values across the data frame.

summary(hotel_demand)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Min.   : 1.000     Min.   : 1.00            Min.   : 1.0             
##  1st Qu.: 4.000     1st Qu.:16.00            1st Qu.: 8.0             
##  Median : 7.000     Median :28.00            Median :16.0             
##  Mean   : 6.552     Mean   :27.17            Mean   :15.8             
##  3rd Qu.: 9.000     3rd Qu.:38.00            3rd Qu.:23.0             
##  Max.   :12.000     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date  arrivaldate           staylength    
##  Min.   :2014-10-17      Min.   :2015-07-01   Min.   : 0.000  
##  1st Qu.:2016-02-01      1st Qu.:2016-03-13   1st Qu.: 2.000  
##  Median :2016-08-07      Median :2016-09-06   Median : 3.000  
##  Mean   :2016-07-30      Mean   :2016-08-28   Mean   : 3.428  
##  3rd Qu.:2017-02-08      3rd Qu.:2017-03-18   3rd Qu.: 4.000  
##  Max.   :2017-09-14      Max.   :2017-08-31   Max.   :69.000  
## 

Visualising the data, addressing outliers, and invalid inputs

Distribution of Numerical Factors

Histograms

This first histogram describes the distribution of stay lengths at both hotels.

data2<-hotel_demand[!(hotel_demand$staylength==0),]
ggplot(data2, aes(x=staylength))+
  geom_histogram(bins=30)

In this histogram, I decided to exclude stay lengths labelled ‘O’. You can’t stay at a hotel for 0 days - so I believe its an invalid that skews the data.

The next graph presents the distribution of family size choosing to stay at both hotels.

data3<-hotel_demand[!(hotel_demand$children==0),]
ggplot(data3, aes(x=children))+
  geom_histogram(bins = 30)
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

Similarly, I also excluded rows with “0” children. This graph specifically identifies distributions in families.

Distribution of Catagorical Factors

Bar Graphs

This is a simple bar graph that outlines how stays are distributed between the “city” and “resort” hotels.

ggplot(data2, aes(x=hotel))+
  geom_bar()

Similarly, this graph outlines the distribution within the Market segment variable.

ggplot(data2, aes(x=market_segment))+
  geom_bar()

Similarly, this graph outlines the distribution within the Customer_Type variable.

ggplot(data2, aes(x=customer_type))+
  geom_bar()

Similarly, this graph outlines the distribution within the reservation_status variable.

ggplot(data2, aes(x=reservation_status))+
  geom_bar()

Similarly, this graph outlines the distribution stays throughout the year, separated between the “city” and “resort” hotels.

data2 <- 
  hotel_demand %>% 
  mutate(
    month = format(arrivaldate, "%m"),
    label = format(arrivaldate, "%b"),
    label = fct_reorder(label, month, min)
  )
ggplot(data2, aes(x = label, fill = hotel)) +
  geom_bar()

Moreover, this boxplot outlines the distribution of stays throughout the year - with both “city” and “resort” being represented.

ggplot(data2, aes(x = label, y = staylength, fill = hotel)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot()+
  xlab("Month") +
  ylab("Nights Stayed") +
  labs(title = "Montly stay Boxplot") +
  theme(plot.title = element_text(hjust = 0.5))