SUMMARY OF THE ANALYSIS:


  1. Executive Overview :

    This report contains the results of the analysis of the hotel booking dataset.

    • Purpose: The analysis is provided recommendations for promotion for prospective hotel guests in the hotel booking process.
    • Main Point: The booking process is influenced by several variables, including lead time, the number of children visiting with the guests, distribution channel, and market segment. For this reason, in this report, an analysis of the correlation between these variables is carried out.
    • Results:
      1. The largest distribution channels in the booking process are Travel Agents and Tour Operators.
      2. The highest market segments are bookings made through online travel agents.
      3. No Deposit is the most common type used in bookings.
      4. City Hotel is the most booked hotel type
      5. Hotel guests who bring a large number of children (equal to or more than three children) have a much shorter lead time.
    • Recommendation:
      1. Management needs to focus its attention on the channel for making online bookings. Prospective guests should have a pleasant experience when booking hotels online.
      2. Management can provide certain incentives for TA/TO with the most customers. Management also needs to ensure that online booking channels are reliable and available.
      3. The stakeholder should decide to send the promotion to families that make online bookings for the city hotels.
      4. Promotion that targetting the families can be made closer to the valid booking dates.
  2. Background :

    This EDA was motivated by the management’s plan to provide promotions for prospective hotel guests. The promotions were expected to increase the number of bookings and hotel occupancy. The promotion must be given right on the target, so it is necessary to analyze the right promotion segment. For this reason, the analysis is carried out on several important variables that are estimated to affect the hotel bookings.

  3. Purpose :

    The stakeholder is interested in developing promotions based on several data analysis on booking variables.

  4. Limitation :

    Due to a number of data analysts available in the meantime, the analysis was conducted only on four variables.

  5. Method :

    Analysis conducted using the Exploratory Data Analysis.

  6. Data Sources:

    Dataset sources from Kaggle public dataset, named: hotel_bookings.csv, click this link

  7. Tools:

    Using R Programming Language in RStudio. Report generated using RMarkdown package.

  8. References:

    • Coursera: Google Data Analytics Professional Certificate

    • RMarkdown Cheat Sheet, learn more


DETAILED OF ANALYSIS:


Step 1: Load the packages: tidyverse and ggplot2

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)

Step 2: Load and Examine the Dataset

Load the dataset directly from Kaggle dataset URL Set “hotel_bookings” as the data frame

hotel_bookings <- read.csv("hotel_bookings.csv")

Summarize main information from the dataset

# View top rows from the dataset
head(hotel_bookings)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
# Overview column on the dataset
colnames(hotel_bookings)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
# Show the structure of the dataset
str(hotel_bookings)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
# Show the structure in pivot view of the dataset
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <chr> "Transient", "Transient", "Transient", ~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date        <chr> "2015-07-01", "2015-07-01", "2015-07-02~

Step 3: Analyze the Booking’s Distribution Channel

Spesific Purpose: To know how many transactions are occurring for each different distribution type

# Create a bar chart viewing "distribution channel" variable
ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = distribution_channel)) +
    labs(title = "Comparison of Distribution Channel", subtitle = "Number of transaction in each channel")

TA/TO is the highest number of bookings channel compared to corporate, direct, and GDS

# Segment the bar chart group by "deposit_type"
ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = distribution_channel, fill = deposit_type)) +
    labs(title = "Comparison of Distribution Channel", subtitle = "Segmented by Deposit Type")

No Deposit type is the most type of booking in all of the distibution channel, including the highest numbers of booking channel, TA/TO.

Now, lets segment the market of distribution channel by comparing the “market_segment” variable with the “distribution_channel”

ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = distribution_channel, fill = market_segment)) +
    labs(title = "Comparison of Distribution Channel", subtitle = "Group By Market Segment")

Online TA became the highest market segment than any other segments.

Separate chart by deposit type and market segment to understand the differences clearly.

ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = distribution_channel, fill = distribution_channel)) +
    facet_wrap(~deposit_type) +
    theme(axis.text.x = element_text(angle = 45)) +
    labs(title = "Comparison of Three Types of Deposit", subtitle = "Values Showed By Distribution Channel")

Now compare more variable in facet of the graph by insert the market segment to the chart.

ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = distribution_channel, fill = distribution_channel)) +
    facet_wrap(~deposit_type ~market_segment) +
    theme(axis.text.x = element_text(angle = 45)) +
    labs(title = "Segmented Plot of Distribution Channel", subtitle = "Segment Based on Deposit Types and Distribution Channels")

Hard to see. Seem overwhelming. But it’s useful to explore the dataset through visualization.

1st Conclusion:

  • Online TA is the highest distibution channel in booking the hotels in our dataset. The second is offline TA/TO.
  • No Deposit is the highest way choosen in booking the hotels in our dataset.

Step 4: Analyze the Dataset to Make a Family-Friendly Promotion’s Market Segments

Spesific Purpose: To run a family-friendly promotion that targetting key market segments

Show the correlation between booking’s lead time and guests who bring their children

# Show the correlation between booking's lead time and guests who bring their children
ggplot(data = hotel_bookings) +
    geom_point(mapping = aes(x = lead_time, y = children)) +
    labs(title = "Any Correlation Between Lead Time and Guest's Children?", subtitle = "Lead Time Versus Number of Children Bringing By The Guest") +
    annotate("text", x = 300, y = 3.5, label = "Shorter lead times for guests with fewer children", color = "Blue", size =4.5)

The scatter chart above shows that most of the guest who booked the hotel earlier didn’t have or did’t bring their children.

# Show the hotel types and the market segments for each hotel type
ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = hotel, fill = market_segment)) +
    labs(title = "What Hotel Type Mostly Chosen?", subtitle = "Based on Market Segment")

At a glance, City Hotel is the most hotel type chosen by the guest and the Online Travel Agent is the biggest segment in both types of hotel.

Lets see more clearly, More detail about time period

This is useful to target the promotion in the future.

mindate <- min(hotel_bookings$arrival_date_year)
maxdate <- max(hotel_bookings$arrival_date_year)
# Show Market Segmentation based on Hotel Types Chosen.
ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = market_segment)) +
    facet_wrap(~hotel) +
    theme(axis.text.x = element_text(angle = 45)) +
    labs(title = "Market Segmentation Based On Hotel Types",
        caption = paste0("Data From:", " ", mindate, " ", "to", " ", maxdate),
        x = "Market Segment",
        y = "Number of Bookings")

# Show the hotel types and the market segments for each hotel type
ggplot(data = hotel_bookings) +
    geom_bar(mapping = aes(x = hotel)) +
    facet_wrap(~market_segment)

Now we can segment it more clearly. Online Travel Agent is the biggest market segment and City Hotel is the most chosen hotel types.

Lets take a look more deeply, about the spesific timing for the promotion

Need to filtering our data first. Then, plotting filtered data.

Specify our segment to Online TA.

Create new data frame named onlineta_city_hotels

# Filtering Data to create a data set that only includes the spesific variable, "hotel type" and "market segment"
onlineta_city_hotels <- hotel_bookings %>%
    filter(hotel == "City Hotel") %>%
    filter(market_segment == "Online TA")
View(onlineta_city_hotels)
# Show correlation between booking's lead time with number of children whom guest have.
ggplot(data = onlineta_city_hotels) +
    geom_point(mapping = aes(x = lead_time, y = children, color = children))+
    facet_wrap(~children) +
    labs(title = "Lead Time and Number of Guest's Children")

Above plot reveals that:

Bookings with children tend to have a shorter lead time

Bookings with 3 children have significantly shorter lead time (less than 200 days)

2nd Conclusion:

  • The online segment is the fastest growing segment.
  • Families (guests with children) tend to spent more time at the city hotels than other type of guests.