The goal of this assignment is to give you practive in preparing different datasets for downstram analysis work.

Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets
  1. Please include in your homework submission, for each of the three chosen datasets:

Note: My collaborators were each responsible to tidy one data set which we would then share and join into our project 2. Below you’ll see the data set I chose to tidy.


Load the Libraries

#Import libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(tidyr)
library(ggplot2)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

About the Dataset

The dataset was taken from Kaggle.com which was originally taken from an article called Hotel Booking Demands Datasets https://www.sciencedirect.com/science/article/pii/S2352340918315191.

It contains 32 columns listed below with a brief description:

Columns Description
Hotel H1 = Resort Hotel, H2 = City Hotel
is_canceled 1 = cancelled, 0 = not cancelled
lead_time number of days that elapsed between entering the data into the Property Management System (PMS) and arrival date
arrival_date_year year of arrival date
arrival_date_month month of arrival date
arrival_date_week_number week number of year for arrival date
arrival_date_day_of_month day of arrival date
stays_in_weekend_nights number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_night number of weeknights (Monday - Friday) the guest stayed or booked to stay at the hotel
adults number of adults
children number of children
babies number of babies
meal types of meals booked: Undefined/SC = no meal package, BB = Bed & Breakfast, HB = Half board (breakfast and one other meal - usual dinner), FB = Full board (breakfast, lunch and dinner)
country country of origin, categories are represented in the ISO 355-3.2013 format
market_segment market segment designation, TA = Travel Agent, TO = Tour Operators
distribution_channel booking distribution channel, TA = Travel Agent, TO = Tour
is_repeated_guest Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations number of previous bookings that were cancelled by the customer prior to the current booking
previous_booking_not_canceled number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type code of room type reserved; Code is presented instead of designation for anonymity reasons
assigned_room_type code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
booking_changes number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type indication on if the customer made a deposit to guarantee the booking. No Deposit = no deposit was made, Non Refund = a deposit was made in the value of the total stay cost, Refundable = a deposit was made with a value under the total cost of stay
agent ID of the travel agency that made the booking
company ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list number of days the booking was in the waiting list before it was confirmed to the customer
customer_type type of booking; Contract = when the booking has an allotment or other type of contract associate to it, Group = when the booking is associated to a group, Transient = when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party = when the booking is transient, but is associated to at least other transient booking
adr Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces number of car parking spaces required by the customer
total_of_special_requests number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status reservation last status; Canceled = booking was canceled by the customer, Check-Out = customer has checked in but already departed, No-Show = customer did not check-in and did inform the hotel of the reason why
reservation_status_date date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

With this dataset I am looking to analyze the following:
  • Cancellations of hotel with the arrival date month
  • Cancellations of hotel with arrival date year
I would also like to analyze the following
  • What type of hotels do repeated guest book?
  • What type of customers tend to be repeated guests?
  • Do repeated guests have a preferred meal choice?
  • Do guests have special requests?
  • How do total stays compare with total costs at hotels?

Load the data set

theUrl <- "https://raw.githubusercontent.com/letisalba/Data607_Project2/main/hotel_bookings%202.csv"

hotel_booking <- read.csv(file = theUrl, header = TRUE, sep = ",")
head(hotel_booking)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03

Data Wrangling

#Get column names
colnames(hotel_booking)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
#Rename column names for more clarity
colnames(hotel_booking) <-  c("Hotel", "Cancellation_Status", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Adults", "Children", "Babies", "Meal", "Country", "Market_Segment", "Distribution_Channel", "Repeated_Guest", "Previous_Cancellations", "Prev_Non_Cancellations", "Reserved_Room", "Assigned_Room", "Booking_Changes", "Deposit_Type", "Agent", "Company", "Days_Waitlisted", "Customer_Type", "ADR", "Required_Parking_Spaces", "Special_Requests", "Reservation_Status", "Reservation_Status_Date")
head(hotel_booking)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel                   0       342         2015          July
## 2 Resort Hotel                   0       737         2015          July
## 3 Resort Hotel                   0         7         2015          July
## 4 Resort Hotel                   0        13         2015          July
## 5 Resort Hotel                   0        14         2015          July
## 6 Resort Hotel                   0        14         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
## 5           27                    1                    0                 2
## 6           27                    1                    0                 2
##   Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1      2        0      0   BB     PRT         Direct               Direct
## 2      2        0      0   BB     PRT         Direct               Direct
## 3      1        0      0   BB     GBR         Direct               Direct
## 4      1        0      0   BB     GBR      Corporate            Corporate
## 5      2        0      0   BB     GBR      Online TA                TA/TO
## 6      2        0      0   BB     GBR      Online TA                TA/TO
##   Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1              0                      0                      0             C
## 2              0                      0                      0             C
## 3              0                      0                      0             A
## 4              0                      0                      0             A
## 5              0                      0                      0             A
## 6              0                      0                      0             A
##   Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1             C               3   No Deposit  NULL    NULL               0
## 2             C               4   No Deposit  NULL    NULL               0
## 3             C               0   No Deposit  NULL    NULL               0
## 4             A               0   No Deposit   304    NULL               0
## 5             A               0   No Deposit   240    NULL               0
## 6             A               0   No Deposit   240    NULL               0
##   Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1     Transient   0                       0                0          Check-Out
## 2     Transient   0                       0                0          Check-Out
## 3     Transient  75                       0                0          Check-Out
## 4     Transient  75                       0                0          Check-Out
## 5     Transient  98                       0                1          Check-Out
## 6     Transient  98                       0                1          Check-Out
##   Reservation_Status_Date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
#Glimpse of data
glimpse(hotel_booking)
## Rows: 119,390
## Columns: 32
## $ Hotel                   <chr> "Resort Hotel", "Resort Hotel", "Resort Hotel"…
## $ Cancellation_Status     <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Lead_Time               <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, 23, 35,…
## $ Arrival_Year            <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
## $ Arrival_Month           <chr> "July", "July", "July", "July", "July", "July"…
## $ Arrival_Week            <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
## $ Arrival_Day_of_Month    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Stays_Weekend_Nights    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Stays_Week_Nights       <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4…
## $ Adults                  <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Children                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Babies                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Meal                    <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB", "FB"…
## $ Country                 <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR", "PRT…
## $ Market_Segment          <chr> "Direct", "Direct", "Direct", "Corporate", "On…
## $ Distribution_Channel    <chr> "Direct", "Direct", "Direct", "Corporate", "TA…
## $ Repeated_Guest          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Previous_Cancellations  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Prev_Non_Cancellations  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Reserved_Room           <chr> "C", "C", "A", "A", "A", "A", "C", "C", "A", "…
## $ Assigned_Room           <chr> "C", "C", "C", "A", "A", "A", "C", "C", "A", "…
## $ Booking_Changes         <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Deposit_Type            <chr> "No Deposit", "No Deposit", "No Deposit", "No …
## $ Agent                   <chr> "NULL", "NULL", "NULL", "304", "240", "240", "…
## $ Company                 <chr> "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"…
## $ Days_Waitlisted         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Customer_Type           <chr> "Transient", "Transient", "Transient", "Transi…
## $ ADR                     <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00, 107.00…
## $ Required_Parking_Spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Special_Requests        <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, 1, 0, 3…
## $ Reservation_Status      <chr> "Check-Out", "Check-Out", "Check-Out", "Check-…
## $ Reservation_Status_Date <chr> "2015-07-01", "2015-07-01", "2015-07-02", "201…
#Summary of columns in data
summary(hotel_booking)
##     Hotel           Cancellation_Status   Lead_Time    Arrival_Year 
##  Length:119390      Min.   :0.0000      Min.   :  0   Min.   :2015  
##  Class :character   1st Qu.:0.0000      1st Qu.: 18   1st Qu.:2016  
##  Mode  :character   Median :0.0000      Median : 69   Median :2016  
##                     Mean   :0.3704      Mean   :104   Mean   :2016  
##                     3rd Qu.:1.0000      3rd Qu.:160   3rd Qu.:2017  
##                     Max.   :1.0000      Max.   :737   Max.   :2017  
##                                                                     
##  Arrival_Month       Arrival_Week   Arrival_Day_of_Month Stays_Weekend_Nights
##  Length:119390      Min.   : 1.00   Min.   : 1.0         Min.   : 0.0000     
##  Class :character   1st Qu.:16.00   1st Qu.: 8.0         1st Qu.: 0.0000     
##  Mode  :character   Median :28.00   Median :16.0         Median : 1.0000     
##                     Mean   :27.17   Mean   :15.8         Mean   : 0.9276     
##                     3rd Qu.:38.00   3rd Qu.:23.0         3rd Qu.: 2.0000     
##                     Max.   :53.00   Max.   :31.0         Max.   :19.0000     
##                                                                              
##  Stays_Week_Nights     Adults          Children           Babies         
##  Min.   : 0.0      Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000  
##  1st Qu.: 1.0      1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000  
##  Median : 2.0      Median : 2.000   Median : 0.0000   Median : 0.000000  
##  Mean   : 2.5      Mean   : 1.856   Mean   : 0.1039   Mean   : 0.007949  
##  3rd Qu.: 3.0      3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000  
##  Max.   :50.0      Max.   :55.000   Max.   :10.0000   Max.   :10.000000  
##                                     NA's   :4                            
##      Meal             Country          Market_Segment     Distribution_Channel
##  Length:119390      Length:119390      Length:119390      Length:119390       
##  Class :character   Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character    
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  Repeated_Guest    Previous_Cancellations Prev_Non_Cancellations
##  Min.   :0.00000   Min.   : 0.00000       Min.   : 0.0000       
##  1st Qu.:0.00000   1st Qu.: 0.00000       1st Qu.: 0.0000       
##  Median :0.00000   Median : 0.00000       Median : 0.0000       
##  Mean   :0.03191   Mean   : 0.08712       Mean   : 0.1371       
##  3rd Qu.:0.00000   3rd Qu.: 0.00000       3rd Qu.: 0.0000       
##  Max.   :1.00000   Max.   :26.00000       Max.   :72.0000       
##                                                                 
##  Reserved_Room      Assigned_Room      Booking_Changes   Deposit_Type      
##  Length:119390      Length:119390      Min.   : 0.0000   Length:119390     
##  Class :character   Class :character   1st Qu.: 0.0000   Class :character  
##  Mode  :character   Mode  :character   Median : 0.0000   Mode  :character  
##                                        Mean   : 0.2211                     
##                                        3rd Qu.: 0.0000                     
##                                        Max.   :21.0000                     
##                                                                            
##     Agent             Company          Days_Waitlisted   Customer_Type     
##  Length:119390      Length:119390      Min.   :  0.000   Length:119390     
##  Class :character   Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Mode  :character   Median :  0.000   Mode  :character  
##                                        Mean   :  2.321                     
##                                        3rd Qu.:  0.000                     
##                                        Max.   :391.000                     
##                                                                            
##       ADR          Required_Parking_Spaces Special_Requests Reservation_Status
##  Min.   :  -6.38   Min.   :0.00000         Min.   :0.0000   Length:119390     
##  1st Qu.:  69.29   1st Qu.:0.00000         1st Qu.:0.0000   Class :character  
##  Median :  94.58   Median :0.00000         Median :0.0000   Mode  :character  
##  Mean   : 101.83   Mean   :0.06252         Mean   :0.5714                     
##  3rd Qu.: 126.00   3rd Qu.:0.00000         3rd Qu.:1.0000                     
##  Max.   :5400.00   Max.   :8.00000         Max.   :5.0000                     
##                                                                               
##  Reservation_Status_Date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
#Convert columns to factor
hotel_bookings <- hotel_booking %>% 
   mutate(Cancellation_Status = as.factor(Cancellation_Status), 
          Meal = as.factor(Meal),
          Repeated_Guest = as.factor(Repeated_Guest),
         )
head(hotel_booking, n = 4)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel                   0       342         2015          July
## 2 Resort Hotel                   0       737         2015          July
## 3 Resort Hotel                   0         7         2015          July
## 4 Resort Hotel                   0        13         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
##   Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1      2        0      0   BB     PRT         Direct               Direct
## 2      2        0      0   BB     PRT         Direct               Direct
## 3      1        0      0   BB     GBR         Direct               Direct
## 4      1        0      0   BB     GBR      Corporate            Corporate
##   Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1              0                      0                      0             C
## 2              0                      0                      0             C
## 3              0                      0                      0             A
## 4              0                      0                      0             A
##   Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1             C               3   No Deposit  NULL    NULL               0
## 2             C               4   No Deposit  NULL    NULL               0
## 3             C               0   No Deposit  NULL    NULL               0
## 4             A               0   No Deposit   304    NULL               0
##   Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1     Transient   0                       0                0          Check-Out
## 2     Transient   0                       0                0          Check-Out
## 3     Transient  75                       0                0          Check-Out
## 4     Transient  75                       0                0          Check-Out
##   Reservation_Status_Date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
#Calculate total cost of stays and the total number of stays for each reservation and create two separate columns for each

hotel_bookings <- hotel_booking %>% 
  mutate(Stays_Total = Stays_Weekend_Nights + Stays_Week_Nights,
       Stays_Total_Cost = ADR * Stays_Total)
head(hotel_bookings, n = 4)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel                   0       342         2015          July
## 2 Resort Hotel                   0       737         2015          July
## 3 Resort Hotel                   0         7         2015          July
## 4 Resort Hotel                   0        13         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
##   Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1      2        0      0   BB     PRT         Direct               Direct
## 2      2        0      0   BB     PRT         Direct               Direct
## 3      1        0      0   BB     GBR         Direct               Direct
## 4      1        0      0   BB     GBR      Corporate            Corporate
##   Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1              0                      0                      0             C
## 2              0                      0                      0             C
## 3              0                      0                      0             A
## 4              0                      0                      0             A
##   Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1             C               3   No Deposit  NULL    NULL               0
## 2             C               4   No Deposit  NULL    NULL               0
## 3             C               0   No Deposit  NULL    NULL               0
## 4             A               0   No Deposit   304    NULL               0
##   Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1     Transient   0                       0                0          Check-Out
## 2     Transient   0                       0                0          Check-Out
## 3     Transient  75                       0                0          Check-Out
## 4     Transient  75                       0                0          Check-Out
##   Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1              2015-07-01           0                0
## 2              2015-07-01           0                0
## 3              2015-07-02           1               75
## 4              2015-07-02           1               75
#Find missing values
colSums(is.na(hotel_bookings))
##                   Hotel     Cancellation_Status               Lead_Time 
##                       0                       0                       0 
##            Arrival_Year           Arrival_Month            Arrival_Week 
##                       0                       0                       0 
##    Arrival_Day_of_Month    Stays_Weekend_Nights       Stays_Week_Nights 
##                       0                       0                       0 
##                  Adults                Children                  Babies 
##                       0                       4                       0 
##                    Meal                 Country          Market_Segment 
##                       0                       0                       0 
##    Distribution_Channel          Repeated_Guest  Previous_Cancellations 
##                       0                       0                       0 
##  Prev_Non_Cancellations           Reserved_Room           Assigned_Room 
##                       0                       0                       0 
##         Booking_Changes            Deposit_Type                   Agent 
##                       0                       0                       0 
##                 Company         Days_Waitlisted           Customer_Type 
##                       0                       0                       0 
##                     ADR Required_Parking_Spaces        Special_Requests 
##                       0                       0                       0 
##      Reservation_Status Reservation_Status_Date             Stays_Total 
##                       0                       0                       0 
##        Stays_Total_Cost 
##                       0
#There are 4 values missing in children
#Replace missing values in Children by using the Median
#By using the Median it will allow for those missing values to be added without trying to guess what number to input. Using the number 0 would have also worked without skewing the data.

children_median <- median(hotel_bookings$Children, na.rm = TRUE)

hotel_bookings$Children[is.na(hotel_bookings$Children)] <- children_median

head(hotel_bookings, n = 4)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel                   0       342         2015          July
## 2 Resort Hotel                   0       737         2015          July
## 3 Resort Hotel                   0         7         2015          July
## 4 Resort Hotel                   0        13         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
##   Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1      2        0      0   BB     PRT         Direct               Direct
## 2      2        0      0   BB     PRT         Direct               Direct
## 3      1        0      0   BB     GBR         Direct               Direct
## 4      1        0      0   BB     GBR      Corporate            Corporate
##   Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1              0                      0                      0             C
## 2              0                      0                      0             C
## 3              0                      0                      0             A
## 4              0                      0                      0             A
##   Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1             C               3   No Deposit  NULL    NULL               0
## 2             C               4   No Deposit  NULL    NULL               0
## 3             C               0   No Deposit  NULL    NULL               0
## 4             A               0   No Deposit   304    NULL               0
##   Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1     Transient   0                       0                0          Check-Out
## 2     Transient   0                       0                0          Check-Out
## 3     Transient  75                       0                0          Check-Out
## 4     Transient  75                       0                0          Check-Out
##   Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1              2015-07-01           0                0
## 2              2015-07-01           0                0
## 3              2015-07-02           1               75
## 4              2015-07-02           1               75
#Check to see if there's any other missing values
colSums(is.na(hotel_bookings))
##                   Hotel     Cancellation_Status               Lead_Time 
##                       0                       0                       0 
##            Arrival_Year           Arrival_Month            Arrival_Week 
##                       0                       0                       0 
##    Arrival_Day_of_Month    Stays_Weekend_Nights       Stays_Week_Nights 
##                       0                       0                       0 
##                  Adults                Children                  Babies 
##                       0                       0                       0 
##                    Meal                 Country          Market_Segment 
##                       0                       0                       0 
##    Distribution_Channel          Repeated_Guest  Previous_Cancellations 
##                       0                       0                       0 
##  Prev_Non_Cancellations           Reserved_Room           Assigned_Room 
##                       0                       0                       0 
##         Booking_Changes            Deposit_Type                   Agent 
##                       0                       0                       0 
##                 Company         Days_Waitlisted           Customer_Type 
##                       0                       0                       0 
##                     ADR Required_Parking_Spaces        Special_Requests 
##                       0                       0                       0 
##      Reservation_Status Reservation_Status_Date             Stays_Total 
##                       0                       0                       0 
##        Stays_Total_Cost 
##                       0
#Replace all NULL values in data set
hotel_bookings2 <- hotel_bookings %>% replace(.=="NULL", 0) # replace with 0
head(hotel_bookings2, n = 4)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel                   0       342         2015          July
## 2 Resort Hotel                   0       737         2015          July
## 3 Resort Hotel                   0         7         2015          July
## 4 Resort Hotel                   0        13         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
##   Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1      2        0      0   BB     PRT         Direct               Direct
## 2      2        0      0   BB     PRT         Direct               Direct
## 3      1        0      0   BB     GBR         Direct               Direct
## 4      1        0      0   BB     GBR      Corporate            Corporate
##   Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1              0                      0                      0             C
## 2              0                      0                      0             C
## 3              0                      0                      0             A
## 4              0                      0                      0             A
##   Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1             C               3   No Deposit     0       0               0
## 2             C               4   No Deposit     0       0               0
## 3             C               0   No Deposit     0       0               0
## 4             A               0   No Deposit   304       0               0
##   Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1     Transient   0                       0                0          Check-Out
## 2     Transient   0                       0                0          Check-Out
## 3     Transient  75                       0                0          Check-Out
## 4     Transient  75                       0                0          Check-Out
##   Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1              2015-07-01           0                0
## 2              2015-07-01           0                0
## 3              2015-07-02           1               75
## 4              2015-07-02           1               75
#Replacing values in Cancellation_Status column so that 0 = not cancelled and 1 = cancelled
hotel_bookings3<- recode(hotel_bookings2$Cancellation_Status,
                              "0" = "Not_Cancelled",
                              "1" = "Cancelled")    
head(hotel_bookings3)
## [1] "Not_Cancelled" "Not_Cancelled" "Not_Cancelled" "Not_Cancelled"
## [5] "Not_Cancelled" "Not_Cancelled"
#Replacing values in Repeated_Guest column so that 0 = new guest and 1 = repeated guest
hotel_bookings4 <- recode(hotel_bookings2$Repeated_Guest,
                              "0" = "New_Guest",
                              "1" = "Repeated_Guest")    
head(hotel_bookings4)
## [1] "New_Guest" "New_Guest" "New_Guest" "New_Guest" "New_Guest" "New_Guest"
#cbind 3 tables 
hotelbookings <-  cbind(hotel_bookings2, hotel_bookings3, hotel_bookings4)

#Drop rows not needed
hotelbookings <- hotelbookings[, -c(2, 15:17, 20, 21, 23:25, 29, 32)]

head(hotelbookings)
##          Hotel Lead_Time Arrival_Year Arrival_Month Arrival_Week
## 1 Resort Hotel       342         2015          July           27
## 2 Resort Hotel       737         2015          July           27
## 3 Resort Hotel         7         2015          July           27
## 4 Resort Hotel        13         2015          July           27
## 5 Resort Hotel        14         2015          July           27
## 6 Resort Hotel        14         2015          July           27
##   Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights Adults Children
## 1                    1                    0                 0      2        0
## 2                    1                    0                 0      2        0
## 3                    1                    0                 1      1        0
## 4                    1                    0                 1      1        0
## 5                    1                    0                 2      2        0
## 6                    1                    0                 2      2        0
##   Babies Meal Country Previous_Cancellations Prev_Non_Cancellations
## 1      0   BB     PRT                      0                      0
## 2      0   BB     PRT                      0                      0
## 3      0   BB     GBR                      0                      0
## 4      0   BB     GBR                      0                      0
## 5      0   BB     GBR                      0                      0
## 6      0   BB     GBR                      0                      0
##   Booking_Changes Days_Waitlisted Customer_Type ADR Special_Requests
## 1               3               0     Transient   0                0
## 2               4               0     Transient   0                0
## 3               0               0     Transient  75                0
## 4               0               0     Transient  75                0
## 5               0               0     Transient  98                1
## 6               0               0     Transient  98                1
##   Reservation_Status Stays_Total Stays_Total_Cost hotel_bookings3
## 1          Check-Out           0                0   Not_Cancelled
## 2          Check-Out           0                0   Not_Cancelled
## 3          Check-Out           1               75   Not_Cancelled
## 4          Check-Out           1               75   Not_Cancelled
## 5          Check-Out           2              196   Not_Cancelled
## 6          Check-Out           2              196   Not_Cancelled
##   hotel_bookings4
## 1       New_Guest
## 2       New_Guest
## 3       New_Guest
## 4       New_Guest
## 5       New_Guest
## 6       New_Guest
#Get column names
colnames(hotelbookings)
##  [1] "Hotel"                  "Lead_Time"              "Arrival_Year"          
##  [4] "Arrival_Month"          "Arrival_Week"           "Arrival_Day_of_Month"  
##  [7] "Stays_Weekend_Nights"   "Stays_Week_Nights"      "Adults"                
## [10] "Children"               "Babies"                 "Meal"                  
## [13] "Country"                "Previous_Cancellations" "Prev_Non_Cancellations"
## [16] "Booking_Changes"        "Days_Waitlisted"        "Customer_Type"         
## [19] "ADR"                    "Special_Requests"       "Reservation_Status"    
## [22] "Stays_Total"            "Stays_Total_Cost"       "hotel_bookings3"       
## [25] "hotel_bookings4"
#Rename column names
colnames(hotelbookings) <-  c("Hotel", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Adults", "Children", "Babies", "Meal", "Country", "Previous_Cancellations", "Prev_Non_Cancellations", "Booking_Changes", "Days_Waitlisted", "Customer_Type", "ADR", "Special_Requests", "Reservation_Status", "Stays_Total", "Stays_Total_Cost", "Cancellation_Status", "Is_Repeated_Guest")

#Reorder columns
col_order <- c("Hotel", "Cancellation_Status", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Stays_Total", "Stays_Total_Cost", "Adults", "Children", "Babies", "Meal", "Country", "Is_Repeated_Guest", "Previous_Cancellations", "Prev_Non_Cancellations", "Booking_Changes", "Days_Waitlisted", "Customer_Type", "ADR", "Special_Requests", "Reservation_Status")

hotelbookings <- hotelbookings[, col_order]
head(hotelbookings)
##          Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel       Not_Cancelled       342         2015          July
## 2 Resort Hotel       Not_Cancelled       737         2015          July
## 3 Resort Hotel       Not_Cancelled         7         2015          July
## 4 Resort Hotel       Not_Cancelled        13         2015          July
## 5 Resort Hotel       Not_Cancelled        14         2015          July
## 6 Resort Hotel       Not_Cancelled        14         2015          July
##   Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1           27                    1                    0                 0
## 2           27                    1                    0                 0
## 3           27                    1                    0                 1
## 4           27                    1                    0                 1
## 5           27                    1                    0                 2
## 6           27                    1                    0                 2
##   Stays_Total Stays_Total_Cost Adults Children Babies Meal Country
## 1           0                0      2        0      0   BB     PRT
## 2           0                0      2        0      0   BB     PRT
## 3           1               75      1        0      0   BB     GBR
## 4           1               75      1        0      0   BB     GBR
## 5           2              196      2        0      0   BB     GBR
## 6           2              196      2        0      0   BB     GBR
##   Is_Repeated_Guest Previous_Cancellations Prev_Non_Cancellations
## 1         New_Guest                      0                      0
## 2         New_Guest                      0                      0
## 3         New_Guest                      0                      0
## 4         New_Guest                      0                      0
## 5         New_Guest                      0                      0
## 6         New_Guest                      0                      0
##   Booking_Changes Days_Waitlisted Customer_Type ADR Special_Requests
## 1               3               0     Transient   0                0
## 2               4               0     Transient   0                0
## 3               0               0     Transient  75                0
## 4               0               0     Transient  75                0
## 5               0               0     Transient  98                1
## 6               0               0     Transient  98                1
##   Reservation_Status
## 1          Check-Out
## 2          Check-Out
## 3          Check-Out
## 4          Check-Out
## 5          Check-Out
## 6          Check-Out

Data Exploration

City Hotels have the hightest number of bookings
#Count of Hotels
Hotel <- hotel_bookings2%>%
  group_by(Hotel)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Hotel)
## # A tibble: 2 × 2
##   Hotel          num
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
There has been more non cancellations than cancellations when booking
#Count of Cancellation_Status
Cancellation_Status <- hotel_bookings2%>%
  group_by(Cancellation_Status)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Cancellation_Status)
## # A tibble: 2 × 2
##   Cancellation_Status   num
##                 <int> <int>
## 1                   0 75166
## 2                   1 44224
August and July are the two highest months when hotels are booked
#Count of Arrival_Month
Arrival_Month <- hotel_bookings2%>%
  group_by(Arrival_Month)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Arrival_Month)
## # A tibble: 6 × 2
##   Arrival_Month   num
##   <chr>         <int>
## 1 August        13877
## 2 July          12661
## 3 May           11791
## 4 October       11160
## 5 April         11089
## 6 June          10939
Zero and two weekend nights are amongst the highest guest have stayed in a hotel
#Count of Stays_Weekend_Nights
Stays_Weekend <- hotel_bookings2%>%
  group_by(Stays_Weekend_Nights)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Stays_Weekend)
## # A tibble: 6 × 2
##   Stays_Weekend_Nights   num
##                  <int> <int>
## 1                    0 51998
## 2                    2 33308
## 3                    1 30626
## 4                    4  1855
## 5                    3  1259
## 6                    6   153
Two and one week nights are the most guest stay or book to stay at the hotel.
#Count of Stays_Week_Nights
Stays_Week <- hotel_bookings2%>%
  group_by(Stays_Week_Nights)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Stays_Week)
## # A tibble: 6 × 2
##   Stays_Week_Nights   num
##               <int> <int>
## 1                 2 33684
## 2                 1 30310
## 3                 3 22258
## 4                 5 11077
## 5                 4  9563
## 6                 0  7645
Bed and Breakfast is the most common meal option.
#Count of Meals
Meal <- hotel_bookings2%>%
  group_by(Meal)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Meal)
## # A tibble: 5 × 2
##   Meal        num
##   <chr>     <int>
## 1 BB        92310
## 2 HB        14463
## 3 SC        10650
## 4 Undefined  1169
## 5 FB          798
Non repeated guest tend to book stays more often than repeated guests.
#Count of Repeated_Guest
Repeated_Guest <- hotel_bookings2%>%
  group_by(Repeated_Guest)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Repeated_Guest)
## # A tibble: 2 × 2
##   Repeated_Guest    num
##            <int>  <int>
## 1              0 115580
## 2              1   3810
Transient customers book the most hotel stays..
#Count on Customer_Type
Customer <- hotel_bookings2%>%
  group_by(Customer_Type)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Customer)
## # A tibble: 4 × 2
##   Customer_Type     num
##   <chr>           <int>
## 1 Transient       89613
## 2 Transient-Party 25124
## 3 Contract         4076
## 4 Group             577
Zero and 1 special requests are made by the customer
#Count on Special_Requests
Special_Requests <- hotel_bookings2%>%
  group_by(Special_Requests)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Special_Requests)
## # A tibble: 6 × 2
##   Special_Requests   num
##              <int> <int>
## 1                0 70318
## 2                1 33226
## 3                2 12969
## 4                3  2497
## 5                4   340
## 6                5    40
When combining the stays for weekend nights and week nights, the two days is the highest number of stays.
#Count on Stays_Total
Stays_Total <- hotel_bookings2%>%
  group_by(Stays_Total)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Stays_Total)
## # A tibble: 6 × 2
##   Stays_Total   num
##         <int> <int>
## 1           2 27643
## 2           3 27076
## 3           1 21020
## 4           4 17383
## 5           7  8655
## 6           5  7784
The most common total cost in a hotel stay is $124 (assuming the currency is dollars)
#Count on Stays_Total_Cost
Stays_Total_Cost <- hotel_bookings2%>%
  group_by(Stays_Total_Cost)%>%
  summarise(num=n())%>%
  arrange(desc(num))
head(Stays_Total_Cost)
## # A tibble: 6 × 2
##   Stays_Total_Cost   num
##              <dbl> <int>
## 1              124  2913
## 2                0  1959
## 3              180  1310
## 4              270  1090
## 5              240  1046
## 6              300  1030

Data Analysis:

Cancellations of hotel with Arrival Month

Cancellations in April is the highest.

# Bar Plot 
hotel_bookings2 %>% 
ggplot(aes(x = Cancellation_Status, fill = Arrival_Month)) +
  geom_bar()

Cancellations of hotel with arrival date year

The year 2016 had the most cancellations/non-cancellations.

# Bar Plot 
hotel_bookings2 %>% 
ggplot(aes(x = Arrival_Year, fill = Cancellation_Status)) +
  geom_bar()

What type of hotels do repeated guest book?

Repeated guests book City Hotels the most.

# Bar Plot 
hotel_bookings2 %>% 
ggplot(aes(x = Repeated_Guest, fill = Hotel)) +
  geom_bar()

What type of customers tend to be repeated guests?

Transient customers tend to be repeated guests.

# Plot
hotel_bookings2%>%
  ggplot(aes(x = Repeated_Guest, fill = Customer_Type)) +
  geom_bar()

Do repeated guests have a preferred meal choice?

Bed and Breakfast is the preferred meal of choice for repeated guest.

# Plot
hotel_bookings2%>%
  ggplot(aes(x = Repeated_Guest, fill = Meal)) +
  geom_bar()

Do guests have special requests?

Guests in general have no special requests during hotel stay.

# Plot
hotel_bookings2%>%
  ggplot(aes(x = Special_Requests, fill = Repeated_Guest)) +
  geom_bar()

How do total stays compare with total costs at hotels?

Whether it’s a City or Resort hotel, majority of the booking stays are between 0 to 20 days with a few outliers.

#Scatter plot
hotel_bookings2 %>% 
ggplot(aes(x = Stays_Total, y = Stays_Total_Cost, color = Hotel)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method=lm , color="red", se=FALSE) + # with linear trend
  theme_ipsum()