Authored by: Robert Culligan
One kind of project I wanted to conduct was a simple Exploratory Data Analysis. This would generally mean getting an overall summary of the dataset, clean some of the columns, outline the of the numerical and categorical variables, as well as outlinte the deviations in numerical groups.
This is a fictional dataset that describes the customer base of a fictional hotel chain. There are two branches in this chain - “City Hotel” and “Resort Hotel”. This data set records a number of variables - such as occupation, children in Party, nights stayed, and reservation status.
Considering the data available and the scope of the project, I’d like to understand the distribution of the categorical and numerical factors. I’d particularly like to understand how stay length varies throughout the year - along with the deviation of those variables. Moreover, I’d like to understand how the family dynamic influences these factors.
library(ggplot2)
library(forcats)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(datasets)
These are the packages I intend to use when analyzing this dataset. You can also see me loading th dataset below.
hotel_demand <- read.csv('/Users/robertculligan/Desktop/R/Datasets/Hotel_Demand/hotel_bookings.csv')
I have stored this information on a password protected cloud.
This first function will highlight the shape of our dataset.
dim(hotel_demand)
## [1] 119390 32
The next function will highlight the column titles of the dataset.
names(hotel_demand)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
What is apparent in this assessment is that the the date, month, and year are all in different columns. This change will make it difficult to assess data when time is a variable. What also is causing pause are the stay length variables being separated into weekday and weekend columns. Total nights stayed can be acquired by adding these columns together.
str(hotel_demand)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
What’s also clear is that the reservation date is also presented as a string, meaning that they will not be recognised as a date - further hampering analysis.
This function will convert the month column to a month, meaning the column will convert the month into an integer.
hotel_demand$arrival_date_month <- sapply(hotel_demand$arrival_date_month, function(x) grep(paste("(?i)",x,sep=""), month.name))
Consequently, I am now going to merge the three date columns into a single string called “arrivaldate”.
hotel_demand$arrivaldate <- paste(hotel_demand$arrival_date_day_of_month,"-",hotel_demand$arrival_date_month,"-",hotel_demand$arrival_date_year)
hotel_demand$arrivaldate <- gsub('\\s+','',hotel_demand$arrivaldate)
hotel_demand$staylength <- (hotel_demand$stays_in_week_nights+hotel_demand$stays_in_weekend_nights)
Now that we have a string date for both factors, we can now convert them to the date format.
hotel_demand$arrivaldate <- as.Date(hotel_demand$arrivaldate, format = "%d-%m-%Y")
hotel_demand$reservation_status_date <- as.Date(hotel_demand$reservation_status_date, format = "%Y-%m-%d")
Finally, I am going to remove some duplicates.
hotel_demand[!duplicated(hotel_demand$Sepal.Width), ]
This function will outline the general distribution of values across the data frame.
summary(hotel_demand)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Min. : 1.000 Min. : 1.00 Min. : 1.0
## 1st Qu.: 4.000 1st Qu.:16.00 1st Qu.: 8.0
## Median : 7.000 Median :28.00 Median :16.0
## Mean : 6.552 Mean :27.17 Mean :15.8
## 3rd Qu.: 9.000 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :12.000 Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date arrivaldate staylength
## Min. :2014-10-17 Min. :2015-07-01 Min. : 0.000
## 1st Qu.:2016-02-01 1st Qu.:2016-03-13 1st Qu.: 2.000
## Median :2016-08-07 Median :2016-09-06 Median : 3.000
## Mean :2016-07-30 Mean :2016-08-28 Mean : 3.428
## 3rd Qu.:2017-02-08 3rd Qu.:2017-03-18 3rd Qu.: 4.000
## Max. :2017-09-14 Max. :2017-08-31 Max. :69.000
##
This first histogram describes the distribution of stay lengths at both hotels.
data2<-hotel_demand[!(hotel_demand$staylength==0),]
ggplot(data2, aes(x=staylength))+
geom_histogram(bins=30)
In this histogram, I decided to exclude stay lengths labelled ‘O’. You
can’t stay at a hotel for 0 days - so I believe its an invalid that
skews the data.
The next graph presents the distribution of family size choosing to stay at both hotels.
data3<-hotel_demand[!(hotel_demand$children==0),]
ggplot(data3, aes(x=children))+
geom_histogram(bins = 30)
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
Similarly, I also excluded rows with “0” children. This graph
specifically identifies distributions in families.
This is a simple bar graph that outlines how stays are distributed between the “city” and “resort” hotels.
ggplot(data2, aes(x=hotel))+
geom_bar()
Similarly, this graph outlines the distribution within the Market
segment variable.
ggplot(data2, aes(x=market_segment))+
geom_bar()
Similarly, this graph outlines the distribution within the Customer_Type
variable.
ggplot(data2, aes(x=customer_type))+
geom_bar()
Similarly, this graph outlines the distribution within the
reservation_status variable.
ggplot(data2, aes(x=reservation_status))+
geom_bar()
Similarly, this graph outlines the distribution stays throughout the
year, separated between the “city” and “resort” hotels.
data2 <-
hotel_demand %>%
mutate(
month = format(arrivaldate, "%m"),
label = format(arrivaldate, "%b"),
label = fct_reorder(label, month, min)
)
ggplot(data2, aes(x = label, fill = hotel)) +
geom_bar()
Moreover, this boxplot outlines the distribution of stays throughout the
year - with both “city” and “resort” being represented.
ggplot(data2, aes(x = label, y = staylength, fill = hotel)) +
stat_boxplot(geom = "errorbar") +
geom_boxplot()+
xlab("Month") +
ylab("Nights Stayed") +
labs(title = "Montly stay Boxplot") +
theme(plot.title = element_text(hjust = 0.5))