Executive Overview :
This report contains the results of the analysis of the hotel booking dataset.
Background :
This EDA was motivated by the management’s plan to provide promotions for prospective hotel guests. The promotions were expected to increase the number of bookings and hotel occupancy. The promotion must be given right on the target, so it is necessary to analyze the right promotion segment. For this reason, the analysis is carried out on several important variables that are estimated to affect the hotel bookings.
Purpose :
The stakeholder is interested in developing promotions based on several data analysis on booking variables.
Limitation :
Due to a number of data analysts available in the meantime, the analysis was conducted only on four variables.
Method :
Analysis conducted using the Exploratory Data Analysis.
Data Sources:
Dataset sources from Kaggle public dataset, named: hotel_bookings.csv, click this link
Tools:
Using R Programming Language in RStudio. Report generated using RMarkdown package.
References:
Coursera: Google Data Analytics Professional Certificate
RMarkdown Cheat Sheet, learn more
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
Load the dataset directly from Kaggle dataset URL Set “hotel_bookings” as the data frame
hotel_bookings <- read.csv("hotel_bookings.csv")
Summarize main information from the dataset
# View top rows from the dataset
head(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
# Overview column on the dataset
colnames(hotel_bookings)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
# Show the structure of the dataset
str(hotel_bookings)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
# Show the structure in pivot view of the dataset
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <chr> "Transient", "Transient", "Transient", ~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date <chr> "2015-07-01", "2015-07-01", "2015-07-02~
Spesific Purpose: To know how many transactions are occurring for each different distribution type
# Create a bar chart viewing "distribution channel" variable
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
labs(title = "Comparison of Distribution Channel", subtitle = "Number of transaction in each channel")
TA/TO is the highest number of bookings channel compared to corporate, direct, and GDS
# Segment the bar chart group by "deposit_type"
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill = deposit_type)) +
labs(title = "Comparison of Distribution Channel", subtitle = "Segmented by Deposit Type")
No Deposit type is the most type of booking in all of the distibution channel, including the highest numbers of booking channel, TA/TO.
Now, lets segment the market of distribution channel by comparing the “market_segment” variable with the “distribution_channel”
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill = market_segment)) +
labs(title = "Comparison of Distribution Channel", subtitle = "Group By Market Segment")
Online TA became the highest market segment than any other segments.
Separate chart by deposit type and market segment to understand the differences clearly.
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill = distribution_channel)) +
facet_wrap(~deposit_type) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Comparison of Three Types of Deposit", subtitle = "Values Showed By Distribution Channel")
Now compare more variable in facet of the graph by insert the market segment to the chart.
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel, fill = distribution_channel)) +
facet_wrap(~deposit_type ~market_segment) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Segmented Plot of Distribution Channel", subtitle = "Segment Based on Deposit Types and Distribution Channels")
Hard to see. Seem overwhelming. But it’s useful to explore the dataset through visualization.
Spesific Purpose: To run a family-friendly promotion that targetting key market segments
# Show the correlation between booking's lead time and guests who bring their children
ggplot(data = hotel_bookings) +
geom_point(mapping = aes(x = lead_time, y = children)) +
labs(title = "Any Correlation Between Lead Time and Guest's Children?", subtitle = "Lead Time Versus Number of Children Bringing By The Guest") +
annotate("text", x = 300, y = 3.5, label = "Shorter lead times for guests with fewer children", color = "Blue", size =4.5)
The scatter chart above shows that most of the guest who booked the hotel earlier didn’t have or did’t bring their children.
# Show the hotel types and the market segments for each hotel type
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel, fill = market_segment)) +
labs(title = "What Hotel Type Mostly Chosen?", subtitle = "Based on Market Segment")
At a glance, City Hotel is the most hotel type chosen by the guest and the Online Travel Agent is the biggest segment in both types of hotel.
Lets see more clearly, More detail about time period
This is useful to target the promotion in the future.
mindate <- min(hotel_bookings$arrival_date_year)
maxdate <- max(hotel_bookings$arrival_date_year)
# Show Market Segmentation based on Hotel Types Chosen.
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Market Segmentation Based On Hotel Types",
caption = paste0("Data From:", " ", mindate, " ", "to", " ", maxdate),
x = "Market Segment",
y = "Number of Bookings")
# Show the hotel types and the market segments for each hotel type
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel)) +
facet_wrap(~market_segment)
Now we can segment it more clearly. Online Travel Agent is the biggest market segment and City Hotel is the most chosen hotel types.
Lets take a look more deeply, about the spesific timing for the promotion
Need to filtering our data first. Then, plotting filtered data.
Specify our segment to Online TA.
Create new data frame named onlineta_city_hotels
# Filtering Data to create a data set that only includes the spesific variable, "hotel type" and "market segment"
onlineta_city_hotels <- hotel_bookings %>%
filter(hotel == "City Hotel") %>%
filter(market_segment == "Online TA")
View(onlineta_city_hotels)
# Show correlation between booking's lead time with number of children whom guest have.
ggplot(data = onlineta_city_hotels) +
geom_point(mapping = aes(x = lead_time, y = children, color = children))+
facet_wrap(~children) +
labs(title = "Lead Time and Number of Guest's Children")
Above plot reveals that:
Bookings with children tend to have a shorter lead time
Bookings with 3 children have significantly shorter lead time (less than 200 days)