The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).
More about the dataset: https://www.kaggle.com/jessemostipak/hotel-booking-demand
Reading the file ‘hotel_bookings.csv’ into a data frame:
hotel_bookings <- read.csv("hotel_bookings.csv")
Getting initial impressions of the dataset:
head(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
colnames(hotel_bookings)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
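The head() output above prints NULL in the agent and company columns, which suggests the file stores the literal text NULL rather than a true missing value. A minimal sketch of one way to handle this at import time, assuming that is the case; the csv_text rows below are made up for illustration and only the column names come from the dataset:

```r
# Toy CSV text standing in for hotel_bookings.csv (made-up rows, real column names).
csv_text <- "hotel,agent,company,adr
Resort Hotel,NULL,NULL,0
Resort Hotel,304,NULL,75
City Hotel,240,NULL,98"

# Default import: "NULL" is kept as text, so agent is not numeric
# (character or factor, depending on the R version).
raw <- read.csv(text = csv_text)
class(raw$agent)

# Treat "NULL" as a missing-value marker so agent parses as a number with NA.
clean <- read.csv(text = csv_text, na.strings = c("NA", "NULL"))
class(clean$agent)
sum(is.na(clean$company))  # number of rows with no company recorded
```

The same na.strings argument could be passed when reading the real file, for example read.csv("hotel_bookings.csv", na.strings = c("NA", "NULL")).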
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
Using the filter() function to create a data set with only city hotels booked through the Online TA market segment:
onlineta_city_hotels <- filter(hotel_bookings,
                               (hotel == "City Hotel" &
                                market_segment == "Online TA"))
Viewing the new data frame onlineta_city_hotels:
View(onlineta_city_hotels)
Creating the same filtered data frame with the pipe operator and saving it as onlineta_city_hotels_v2:
onlineta_city_hotels_v2 <- hotel_bookings %>%
  filter(hotel == "City Hotel") %>%
  filter(market_segment == "Online TA")
Viewing the onlineta_city_hotels_v2 dataset:
View(onlineta_city_hotels_v2)
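The two filtering approaches above should select the same rows: one filter() call with both conditions joined by &, or two filter() calls chained with the pipe. A minimal sketch that checks this on a small made-up data frame (toy_bookings and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Made-up bookings with the two columns used in the filters.
toy_bookings <- data.frame(
  hotel          = c("City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  market_segment = c("Online TA", "Direct", "Online TA", "Online TA"),
  lead_time      = c(10, 20, 30, 40)
)

# Approach 1: both conditions in a single filter() call.
v1 <- filter(toy_bookings, hotel == "City Hotel" & market_segment == "Online TA")

# Approach 2: one condition per filter() call, chained with the pipe.
v2 <- toy_bookings %>%
  filter(hotel == "City Hotel") %>%
  filter(market_segment == "Online TA")

nrow(v1)            # rows kept by approach 1
all.equal(v1, v2)   # compares the two results
```

The same all.equal() check could be run on onlineta_city_hotels and onlineta_city_hotels_v2 to confirm they match.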
Finding the earliest and latest arrival years in the data and saving them for use in the chart subtitles:
min(hotel_bookings$arrival_date_year)
## [1] 2015
max(hotel_bookings$arrival_date_year)
## [1] 2017
mindate <- min(hotel_bookings$arrival_date_year)
maxdate <- max(hotel_bookings$arrival_date_year)
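As an aside, range() returns the minimum and maximum in one call, which avoids scanning the column twice. A minimal sketch on a made-up vector of years (the names years, min_year, max_year, and subtitle_text are illustrative only), also showing how paste0() assembles the subtitle string used in the charts:

```r
# Made-up arrival years standing in for hotel_bookings$arrival_date_year.
years <- c(2015, 2016, 2017, 2016)

# range() returns c(minimum, maximum).
year_range <- range(years)
min_year <- year_range[1]
max_year <- year_range[2]

# Build the subtitle text once so every chart can reuse it.
subtitle_text <- paste0("Data from: ", min_year, " to ", max_year)
subtitle_text
# [1] "Data from: 2015 to 2017"
```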
(1) Initial chart
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel))+
labs(title = "Bookings vs Distribution channel", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
(2) Hotel Bookings with respect to other Factors
Checking whether the number of bookings for each distribution channel differs depending on whether a deposit was made or which market segment the booking represents.
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill=deposit_type))+
labs(title="Deposit type with respect to No. of bookings", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
(3) Market Segment with respect to No. of bookings
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill=market_segment))+
labs(title = "Market Segment with respect to No. of bookings", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
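A stacked bar chart shows relative sizes well, but exact numbers are easier to read from a table. A minimal sketch of the tally behind such a chart, using dplyr's count() on a made-up data frame (toy_bookings and its rows are hypothetical; only the column names match the dataset):

```r
library(dplyr)

# Made-up bookings with the two columns mapped to x and fill.
toy_bookings <- data.frame(
  distribution_channel = c("TA/TO", "TA/TO", "Direct", "Direct", "Corporate"),
  deposit_type         = c("No Deposit", "Non Refund", "No Deposit",
                           "No Deposit", "No Deposit")
)

# One row per channel/deposit combination, with the number of bookings in each.
channel_counts <- toy_bookings %>%
  count(distribution_channel, deposit_type, name = "bookings")

channel_counts
```

Swapping deposit_type for market_segment would give the counts behind the market segment chart.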
(4) Visualizing Charts with each feature
Creating separate charts for each deposit type and market segment to make the differences easier to see.
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
facet_wrap(~deposit_type)+
labs(title = "plot for each deposit type", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
(5) Creating a plot for each Market segment
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
facet_wrap(~market_segment)+
labs(title = "plot for each Market segment", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
(6) Create a single plot with both deposit type and market segment and explore the differences
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
facet_wrap(~deposit_type~market_segment)+
labs(title = "plot with both deposit type and market segment", subtitle =paste0("Data from: ", mindate, " to ", maxdate))
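When two variables are faceted together, facet_grid() is an alternative worth considering: it lays the panels out as rows by columns, so each deposit type gets its own row and each market segment its own column. A minimal sketch on a made-up data frame (toy_bookings is hypothetical; this is a possible variation, not the layout used above):

```r
library(ggplot2)

# Made-up bookings with the three columns used in the plot.
toy_bookings <- data.frame(
  distribution_channel = c("TA/TO", "TA/TO", "Direct", "Direct", "Corporate", "TA/TO"),
  deposit_type         = c("No Deposit", "Non Refund", "No Deposit",
                           "No Deposit", "No Deposit", "No Deposit"),
  market_segment       = c("Online TA", "Groups", "Direct",
                           "Direct", "Corporate", "Online TA")
)

# Rows are deposit types, columns are market segments.
p <- ggplot(data = toy_bookings) +
  geom_bar(mapping = aes(x = distribution_channel)) +
  facet_grid(deposit_type ~ market_segment) +
  # Rotate the x-axis labels so they do not overlap in narrow panels.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```

Unlike facet_wrap(), facet_grid() also draws empty panels for combinations that have no bookings, which makes gaps in the data visible.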
(1) Comparison of market segments by hotel type for hotel bookings
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel) +
labs(title="Comparison of market segments by hotel type for hotel bookings",
subtitle =paste0("Data from: ", mindate, " to ", maxdate),
x="Market Segment",
y="Number of Bookings")
(2) Market Segment with respect to Types of hotels
Creating a bar chart of each hotel type, with a different color for each market segment:
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel, fill = market_segment))+
labs(title = "Market Segment with respect to Types of hotels",
subtitle =paste0("Data from: ", mindate, " to ", maxdate))
(3) Market Segment with respect to Types of hotels in grid View
Using the facet_wrap() function to create a separate plot for each market segment:
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel)) +
facet_wrap(~market_segment)+
labs(title="Each Market Segment with respect to Types of hotels",
subtitle =paste0("Data from: ", mindate, " to ", maxdate))
Visualizing the filtered dataset from step 4
Creating a scatter plot from the new filtered data, using either onlineta_city_hotels or onlineta_city_hotels_v2:
ggplot(data = onlineta_city_hotels) +
geom_point(mapping = aes(x = lead_time, y = children))+
labs(title="Online City Hotels",
subtitle =paste0("Data from: ", mindate, " to ", maxdate))
## Warning: Removed 1 rows containing missing values (geom_point).
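The warning above means that one row had a missing value in a plotted column (lead_time or children) and was dropped from the chart. A minimal sketch, on a made-up data frame (toy_hotels is hypothetical), of how such rows could be located and removed before plotting:

```r
# Made-up data with one missing value in children.
toy_hotels <- data.frame(
  lead_time = c(5, 12, 40, 90),
  children  = c(0, 2, NA, 1)
)

# Count missing values per plotted column.
colSums(is.na(toy_hotels[, c("lead_time", "children")]))

# Keep only rows where both plotted columns are present.
keep <- complete.cases(toy_hotels[, c("lead_time", "children")])
plot_ready <- toy_hotels[keep, ]

nrow(plot_ready)
# [1] 3
```

Filtering explicitly like this keeps the row count under the analyst's control instead of leaving it to the plotting layer.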