Note: My collaborators were each responsible to tidy one data set which we would then share and join into our project 2. Below you’ll see the data set I chose to tidy.
#Import libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(tidyr)
library(ggplot2)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
Columns | Description |
---|---|
Hotel | H1 = Resort Hotel, H2 = City Hotel |
is_canceled | 1 = cancelled, 0 = not cancelled |
lead_time | number of days that elapsed between entering the data into the Property Management System (PMS) and arrival date |
arrival_date_year | year of arrival date |
arrival_date_month | month of arrival date |
arrival_date_week_number | week number of year for arrival date |
arrival_date_day_of_month | day of arrival date |
stays_in_weekend_nights | number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
stays_in_week_night | number of weeknights (Monday - Friday) the guest stayed or booked to stay at the hotel |
adults | number of adults |
children | number of children |
babies | number of babies |
meal | types of meals booked: Undefined/SC = no meal package, BB = Bed & Breakfast, HB = Half board (breakfast and one other meal - usual dinner), FB = Full board (breakfast, lunch and dinner) |
country | country of origin, categories are represented in the ISO 355-3.2013 format |
market_segment | market segment designation, TA = Travel Agent, TO = Tour Operators |
distribution_channel | booking distribution channel, TA = Travel Agent, TO = Tour |
is_repeated_guest | Value indicating if the booking name was from a repeated guest (1) or not (0) |
previous_cancellations | number of previous bookings that were cancelled by the customer prior to the current booking |
previous_booking_not_canceled | number of previous bookings not cancelled by the customer prior to the current booking |
reserved_room_type | code of room type reserved; Code is presented instead of designation for anonymity reasons |
assigned_room_type | code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons. |
booking_changes | number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
deposit_type | indication on if the customer made a deposit to guarantee the booking. No Deposit = no deposit was made, Non Refund = a deposit was made in the value of the total stay cost, Refundable = a deposit was made with a value under the total cost of stay |
agent | ID of the travel agency that made the booking |
company | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
days_in_waiting_list | number of days the booking was in the waiting list before it was confirmed to the customer |
customer_type | type of booking; Contract = when the booking has an allotment or other type of contract associate to it, Group = when the booking is associated to a group, Transient = when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party = when the booking is transient, but is associated to at least other transient booking |
adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
required_car_parking_spaces | number of car parking spaces required by the customer |
total_of_special_requests | number of special requests made by the customer (e.g. twin bed or high floor) |
reservation_status | reservation last status; Canceled = booking was canceled by the customer, Check-Out = customer has checked in but already departed, No-Show = customer did not check-in and did inform the hotel of the reason why |
reservation_status_date | date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
theUrl <- "https://raw.githubusercontent.com/letisalba/Data607_Project2/main/hotel_bookings%202.csv"
hotel_booking <- read.csv(file = theUrl, header = TRUE, sep = ",")
head(hotel_booking)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
#Get column names
colnames(hotel_booking)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
#Rename column names for more clarity
colnames(hotel_booking) <- c("Hotel", "Cancellation_Status", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Adults", "Children", "Babies", "Meal", "Country", "Market_Segment", "Distribution_Channel", "Repeated_Guest", "Previous_Cancellations", "Prev_Non_Cancellations", "Reserved_Room", "Assigned_Room", "Booking_Changes", "Deposit_Type", "Agent", "Company", "Days_Waitlisted", "Customer_Type", "ADR", "Required_Parking_Spaces", "Special_Requests", "Reservation_Status", "Reservation_Status_Date")
head(hotel_booking)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## 5 27 1 0 2
## 6 27 1 0 2
## Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1 2 0 0 BB PRT Direct Direct
## 2 2 0 0 BB PRT Direct Direct
## 3 1 0 0 BB GBR Direct Direct
## 4 1 0 0 BB GBR Corporate Corporate
## 5 2 0 0 BB GBR Online TA TA/TO
## 6 2 0 0 BB GBR Online TA TA/TO
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1 0 0 0 C
## 2 0 0 0 C
## 3 0 0 0 A
## 4 0 0 0 A
## 5 0 0 0 A
## 6 0 0 0 A
## Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1 C 3 No Deposit NULL NULL 0
## 2 C 4 No Deposit NULL NULL 0
## 3 C 0 No Deposit NULL NULL 0
## 4 A 0 No Deposit 304 NULL 0
## 5 A 0 No Deposit 240 NULL 0
## 6 A 0 No Deposit 240 NULL 0
## Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1 Transient 0 0 0 Check-Out
## 2 Transient 0 0 0 Check-Out
## 3 Transient 75 0 0 Check-Out
## 4 Transient 75 0 0 Check-Out
## 5 Transient 98 0 1 Check-Out
## 6 Transient 98 0 1 Check-Out
## Reservation_Status_Date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
#Glimpse of data
glimpse(hotel_booking)
## Rows: 119,390
## Columns: 32
## $ Hotel <chr> "Resort Hotel", "Resort Hotel", "Resort Hotel"…
## $ Cancellation_Status <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Lead_Time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, 23, 35,…
## $ Arrival_Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
## $ Arrival_Month <chr> "July", "July", "July", "July", "July", "July"…
## $ Arrival_Week <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
## $ Arrival_Day_of_Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Stays_Weekend_Nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Stays_Week_Nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4…
## $ Adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB", "FB"…
## $ Country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR", "PRT…
## $ Market_Segment <chr> "Direct", "Direct", "Direct", "Corporate", "On…
## $ Distribution_Channel <chr> "Direct", "Direct", "Direct", "Corporate", "TA…
## $ Repeated_Guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Previous_Cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Prev_Non_Cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Reserved_Room <chr> "C", "C", "A", "A", "A", "A", "C", "C", "A", "…
## $ Assigned_Room <chr> "C", "C", "C", "A", "A", "A", "C", "C", "A", "…
## $ Booking_Changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Deposit_Type <chr> "No Deposit", "No Deposit", "No Deposit", "No …
## $ Agent <chr> "NULL", "NULL", "NULL", "304", "240", "240", "…
## $ Company <chr> "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"…
## $ Days_Waitlisted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Customer_Type <chr> "Transient", "Transient", "Transient", "Transi…
## $ ADR <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00, 107.00…
## $ Required_Parking_Spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Special_Requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, 1, 0, 3…
## $ Reservation_Status <chr> "Check-Out", "Check-Out", "Check-Out", "Check-…
## $ Reservation_Status_Date <chr> "2015-07-01", "2015-07-01", "2015-07-02", "201…
#Summary of columns in data
summary(hotel_booking)
## Hotel Cancellation_Status Lead_Time Arrival_Year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## Arrival_Month Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights
## Length:119390 Min. : 1.00 Min. : 1.0 Min. : 0.0000
## Class :character 1st Qu.:16.00 1st Qu.: 8.0 1st Qu.: 0.0000
## Mode :character Median :28.00 Median :16.0 Median : 1.0000
## Mean :27.17 Mean :15.8 Mean : 0.9276
## 3rd Qu.:38.00 3rd Qu.:23.0 3rd Qu.: 2.0000
## Max. :53.00 Max. :31.0 Max. :19.0000
##
## Stays_Week_Nights Adults Children Babies
## Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. : 0.000000
## 1st Qu.: 1.0 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000
## Median : 2.0 Median : 2.000 Median : 0.0000 Median : 0.000000
## Mean : 2.5 Mean : 1.856 Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 3.0 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :50.0 Max. :55.000 Max. :10.0000 Max. :10.000000
## NA's :4
## Meal Country Market_Segment Distribution_Channel
## Length:119390 Length:119390 Length:119390 Length:119390
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations
## Min. :0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
## Median :0.00000 Median : 0.00000 Median : 0.0000
## Mean :0.03191 Mean : 0.08712 Mean : 0.1371
## 3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :1.00000 Max. :26.00000 Max. :72.0000
##
## Reserved_Room Assigned_Room Booking_Changes Deposit_Type
## Length:119390 Length:119390 Min. : 0.0000 Length:119390
## Class :character Class :character 1st Qu.: 0.0000 Class :character
## Mode :character Mode :character Median : 0.0000 Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## Agent Company Days_Waitlisted Customer_Type
## Length:119390 Length:119390 Min. : 0.000 Length:119390
## Class :character Class :character 1st Qu.: 0.000 Class :character
## Mode :character Mode :character Median : 0.000 Mode :character
## Mean : 2.321
## 3rd Qu.: 0.000
## Max. :391.000
##
## ADR Required_Parking_Spaces Special_Requests Reservation_Status
## Min. : -6.38 Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median : 94.58 Median :0.00000 Median :0.0000 Mode :character
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## Reservation_Status_Date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
#Convert columns to factor
hotel_bookings <- hotel_booking %>%
mutate(Cancellation_Status = as.factor(Cancellation_Status),
Meal = as.factor(Meal),
Repeated_Guest = as.factor(Repeated_Guest),
)
head(hotel_booking, n = 4)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1 2 0 0 BB PRT Direct Direct
## 2 2 0 0 BB PRT Direct Direct
## 3 1 0 0 BB GBR Direct Direct
## 4 1 0 0 BB GBR Corporate Corporate
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1 0 0 0 C
## 2 0 0 0 C
## 3 0 0 0 A
## 4 0 0 0 A
## Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1 C 3 No Deposit NULL NULL 0
## 2 C 4 No Deposit NULL NULL 0
## 3 C 0 No Deposit NULL NULL 0
## 4 A 0 No Deposit 304 NULL 0
## Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1 Transient 0 0 0 Check-Out
## 2 Transient 0 0 0 Check-Out
## 3 Transient 75 0 0 Check-Out
## 4 Transient 75 0 0 Check-Out
## Reservation_Status_Date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
#Calculate total cost of stays and the total number of stays for each reservation and create two separate columns for each
hotel_bookings <- hotel_booking %>%
mutate(Stays_Total = Stays_Weekend_Nights + Stays_Week_Nights,
Stays_Total_Cost = ADR * Stays_Total)
head(hotel_bookings, n = 4)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1 2 0 0 BB PRT Direct Direct
## 2 2 0 0 BB PRT Direct Direct
## 3 1 0 0 BB GBR Direct Direct
## 4 1 0 0 BB GBR Corporate Corporate
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1 0 0 0 C
## 2 0 0 0 C
## 3 0 0 0 A
## 4 0 0 0 A
## Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1 C 3 No Deposit NULL NULL 0
## 2 C 4 No Deposit NULL NULL 0
## 3 C 0 No Deposit NULL NULL 0
## 4 A 0 No Deposit 304 NULL 0
## Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1 Transient 0 0 0 Check-Out
## 2 Transient 0 0 0 Check-Out
## 3 Transient 75 0 0 Check-Out
## 4 Transient 75 0 0 Check-Out
## Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1 2015-07-01 0 0
## 2 2015-07-01 0 0
## 3 2015-07-02 1 75
## 4 2015-07-02 1 75
#Find missing values
colSums(is.na(hotel_bookings))
## Hotel Cancellation_Status Lead_Time
## 0 0 0
## Arrival_Year Arrival_Month Arrival_Week
## 0 0 0
## Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 0 0 0
## Adults Children Babies
## 0 4 0
## Meal Country Market_Segment
## 0 0 0
## Distribution_Channel Repeated_Guest Previous_Cancellations
## 0 0 0
## Prev_Non_Cancellations Reserved_Room Assigned_Room
## 0 0 0
## Booking_Changes Deposit_Type Agent
## 0 0 0
## Company Days_Waitlisted Customer_Type
## 0 0 0
## ADR Required_Parking_Spaces Special_Requests
## 0 0 0
## Reservation_Status Reservation_Status_Date Stays_Total
## 0 0 0
## Stays_Total_Cost
## 0
#There are 4 values missing in children
#Replace missing values in Children by using the Median
#By using the Median it will allow for those missing values to be added without trying to guess what number to input. Using the number 0 would have also worked without skewing the data.
children_median <- median(hotel_bookings$Children, na.rm = TRUE)
hotel_bookings$Children[is.na(hotel_bookings$Children)] <- children_median
head(hotel_bookings, n = 4)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1 2 0 0 BB PRT Direct Direct
## 2 2 0 0 BB PRT Direct Direct
## 3 1 0 0 BB GBR Direct Direct
## 4 1 0 0 BB GBR Corporate Corporate
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1 0 0 0 C
## 2 0 0 0 C
## 3 0 0 0 A
## 4 0 0 0 A
## Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1 C 3 No Deposit NULL NULL 0
## 2 C 4 No Deposit NULL NULL 0
## 3 C 0 No Deposit NULL NULL 0
## 4 A 0 No Deposit 304 NULL 0
## Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1 Transient 0 0 0 Check-Out
## 2 Transient 0 0 0 Check-Out
## 3 Transient 75 0 0 Check-Out
## 4 Transient 75 0 0 Check-Out
## Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1 2015-07-01 0 0
## 2 2015-07-01 0 0
## 3 2015-07-02 1 75
## 4 2015-07-02 1 75
#Check to see if there's any other missing values
colSums(is.na(hotel_bookings))
## Hotel Cancellation_Status Lead_Time
## 0 0 0
## Arrival_Year Arrival_Month Arrival_Week
## 0 0 0
## Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 0 0 0
## Adults Children Babies
## 0 0 0
## Meal Country Market_Segment
## 0 0 0
## Distribution_Channel Repeated_Guest Previous_Cancellations
## 0 0 0
## Prev_Non_Cancellations Reserved_Room Assigned_Room
## 0 0 0
## Booking_Changes Deposit_Type Agent
## 0 0 0
## Company Days_Waitlisted Customer_Type
## 0 0 0
## ADR Required_Parking_Spaces Special_Requests
## 0 0 0
## Reservation_Status Reservation_Status_Date Stays_Total
## 0 0 0
## Stays_Total_Cost
## 0
#Replace all NULL values in data set
hotel_bookings2 <- hotel_bookings %>% replace(.=="NULL", 0) # replace with 0
head(hotel_bookings2, n = 4)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## Adults Children Babies Meal Country Market_Segment Distribution_Channel
## 1 2 0 0 BB PRT Direct Direct
## 2 2 0 0 BB PRT Direct Direct
## 3 1 0 0 BB GBR Direct Direct
## 4 1 0 0 BB GBR Corporate Corporate
## Repeated_Guest Previous_Cancellations Prev_Non_Cancellations Reserved_Room
## 1 0 0 0 C
## 2 0 0 0 C
## 3 0 0 0 A
## 4 0 0 0 A
## Assigned_Room Booking_Changes Deposit_Type Agent Company Days_Waitlisted
## 1 C 3 No Deposit 0 0 0
## 2 C 4 No Deposit 0 0 0
## 3 C 0 No Deposit 0 0 0
## 4 A 0 No Deposit 304 0 0
## Customer_Type ADR Required_Parking_Spaces Special_Requests Reservation_Status
## 1 Transient 0 0 0 Check-Out
## 2 Transient 0 0 0 Check-Out
## 3 Transient 75 0 0 Check-Out
## 4 Transient 75 0 0 Check-Out
## Reservation_Status_Date Stays_Total Stays_Total_Cost
## 1 2015-07-01 0 0
## 2 2015-07-01 0 0
## 3 2015-07-02 1 75
## 4 2015-07-02 1 75
#Replacing values in Cancellation_Status column so that 0 = not cancelled and 1 = cancelled
hotel_bookings3<- recode(hotel_bookings2$Cancellation_Status,
"0" = "Not_Cancelled",
"1" = "Cancelled")
head(hotel_bookings3)
## [1] "Not_Cancelled" "Not_Cancelled" "Not_Cancelled" "Not_Cancelled"
## [5] "Not_Cancelled" "Not_Cancelled"
#Replacing values in Repeated_Guest column so that 0 = new guest and 1 = repeated guest
hotel_bookings4 <- recode(hotel_bookings2$Repeated_Guest,
"0" = "New_Guest",
"1" = "Repeated_Guest")
head(hotel_bookings4)
## [1] "New_Guest" "New_Guest" "New_Guest" "New_Guest" "New_Guest" "New_Guest"
#cbind 3 tables
hotelbookings <- cbind(hotel_bookings2, hotel_bookings3, hotel_bookings4)
#Drop rows not needed
hotelbookings <- hotelbookings[, -c(2, 15:17, 20, 21, 23:25, 29, 32)]
head(hotelbookings)
## Hotel Lead_Time Arrival_Year Arrival_Month Arrival_Week
## 1 Resort Hotel 342 2015 July 27
## 2 Resort Hotel 737 2015 July 27
## 3 Resort Hotel 7 2015 July 27
## 4 Resort Hotel 13 2015 July 27
## 5 Resort Hotel 14 2015 July 27
## 6 Resort Hotel 14 2015 July 27
## Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights Adults Children
## 1 1 0 0 2 0
## 2 1 0 0 2 0
## 3 1 0 1 1 0
## 4 1 0 1 1 0
## 5 1 0 2 2 0
## 6 1 0 2 2 0
## Babies Meal Country Previous_Cancellations Prev_Non_Cancellations
## 1 0 BB PRT 0 0
## 2 0 BB PRT 0 0
## 3 0 BB GBR 0 0
## 4 0 BB GBR 0 0
## 5 0 BB GBR 0 0
## 6 0 BB GBR 0 0
## Booking_Changes Days_Waitlisted Customer_Type ADR Special_Requests
## 1 3 0 Transient 0 0
## 2 4 0 Transient 0 0
## 3 0 0 Transient 75 0
## 4 0 0 Transient 75 0
## 5 0 0 Transient 98 1
## 6 0 0 Transient 98 1
## Reservation_Status Stays_Total Stays_Total_Cost hotel_bookings3
## 1 Check-Out 0 0 Not_Cancelled
## 2 Check-Out 0 0 Not_Cancelled
## 3 Check-Out 1 75 Not_Cancelled
## 4 Check-Out 1 75 Not_Cancelled
## 5 Check-Out 2 196 Not_Cancelled
## 6 Check-Out 2 196 Not_Cancelled
## hotel_bookings4
## 1 New_Guest
## 2 New_Guest
## 3 New_Guest
## 4 New_Guest
## 5 New_Guest
## 6 New_Guest
#Get column names
colnames(hotelbookings)
## [1] "Hotel" "Lead_Time" "Arrival_Year"
## [4] "Arrival_Month" "Arrival_Week" "Arrival_Day_of_Month"
## [7] "Stays_Weekend_Nights" "Stays_Week_Nights" "Adults"
## [10] "Children" "Babies" "Meal"
## [13] "Country" "Previous_Cancellations" "Prev_Non_Cancellations"
## [16] "Booking_Changes" "Days_Waitlisted" "Customer_Type"
## [19] "ADR" "Special_Requests" "Reservation_Status"
## [22] "Stays_Total" "Stays_Total_Cost" "hotel_bookings3"
## [25] "hotel_bookings4"
#Rename column names
colnames(hotelbookings) <- c("Hotel", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Adults", "Children", "Babies", "Meal", "Country", "Previous_Cancellations", "Prev_Non_Cancellations", "Booking_Changes", "Days_Waitlisted", "Customer_Type", "ADR", "Special_Requests", "Reservation_Status", "Stays_Total", "Stays_Total_Cost", "Cancellation_Status", "Is_Repeated_Guest")
#Reorder columns
col_order <- c("Hotel", "Cancellation_Status", "Lead_Time", "Arrival_Year", "Arrival_Month", "Arrival_Week", "Arrival_Day_of_Month", "Stays_Weekend_Nights", "Stays_Week_Nights", "Stays_Total", "Stays_Total_Cost", "Adults", "Children", "Babies", "Meal", "Country", "Is_Repeated_Guest", "Previous_Cancellations", "Prev_Non_Cancellations", "Booking_Changes", "Days_Waitlisted", "Customer_Type", "ADR", "Special_Requests", "Reservation_Status")
hotelbookings <- hotelbookings[, col_order]
head(hotelbookings)
## Hotel Cancellation_Status Lead_Time Arrival_Year Arrival_Month
## 1 Resort Hotel Not_Cancelled 342 2015 July
## 2 Resort Hotel Not_Cancelled 737 2015 July
## 3 Resort Hotel Not_Cancelled 7 2015 July
## 4 Resort Hotel Not_Cancelled 13 2015 July
## 5 Resort Hotel Not_Cancelled 14 2015 July
## 6 Resort Hotel Not_Cancelled 14 2015 July
## Arrival_Week Arrival_Day_of_Month Stays_Weekend_Nights Stays_Week_Nights
## 1 27 1 0 0
## 2 27 1 0 0
## 3 27 1 0 1
## 4 27 1 0 1
## 5 27 1 0 2
## 6 27 1 0 2
## Stays_Total Stays_Total_Cost Adults Children Babies Meal Country
## 1 0 0 2 0 0 BB PRT
## 2 0 0 2 0 0 BB PRT
## 3 1 75 1 0 0 BB GBR
## 4 1 75 1 0 0 BB GBR
## 5 2 196 2 0 0 BB GBR
## 6 2 196 2 0 0 BB GBR
## Is_Repeated_Guest Previous_Cancellations Prev_Non_Cancellations
## 1 New_Guest 0 0
## 2 New_Guest 0 0
## 3 New_Guest 0 0
## 4 New_Guest 0 0
## 5 New_Guest 0 0
## 6 New_Guest 0 0
## Booking_Changes Days_Waitlisted Customer_Type ADR Special_Requests
## 1 3 0 Transient 0 0
## 2 4 0 Transient 0 0
## 3 0 0 Transient 75 0
## 4 0 0 Transient 75 0
## 5 0 0 Transient 98 1
## 6 0 0 Transient 98 1
## Reservation_Status
## 1 Check-Out
## 2 Check-Out
## 3 Check-Out
## 4 Check-Out
## 5 Check-Out
## 6 Check-Out
#Count of Hotels
Hotel <- hotel_bookings2%>%
group_by(Hotel)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Hotel)
## # A tibble: 2 × 2
## Hotel num
## <chr> <int>
## 1 City Hotel 79330
## 2 Resort Hotel 40060
#Count of Cancellation_Status
Cancellation_Status <- hotel_bookings2%>%
group_by(Cancellation_Status)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Cancellation_Status)
## # A tibble: 2 × 2
## Cancellation_Status num
## <int> <int>
## 1 0 75166
## 2 1 44224
#Count of Arrival_Month
Arrival_Month <- hotel_bookings2%>%
group_by(Arrival_Month)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Arrival_Month)
## # A tibble: 6 × 2
## Arrival_Month num
## <chr> <int>
## 1 August 13877
## 2 July 12661
## 3 May 11791
## 4 October 11160
## 5 April 11089
## 6 June 10939
#Count of Stays_Weekend_Nights
Stays_Weekend <- hotel_bookings2%>%
group_by(Stays_Weekend_Nights)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Stays_Weekend)
## # A tibble: 6 × 2
## Stays_Weekend_Nights num
## <int> <int>
## 1 0 51998
## 2 2 33308
## 3 1 30626
## 4 4 1855
## 5 3 1259
## 6 6 153
#Count of Stays_Week_Nights
Stays_Week <- hotel_bookings2%>%
group_by(Stays_Week_Nights)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Stays_Week)
## # A tibble: 6 × 2
## Stays_Week_Nights num
## <int> <int>
## 1 2 33684
## 2 1 30310
## 3 3 22258
## 4 5 11077
## 5 4 9563
## 6 0 7645
#Count of Meals
Meal <- hotel_bookings2%>%
group_by(Meal)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Meal)
## # A tibble: 5 × 2
## Meal num
## <chr> <int>
## 1 BB 92310
## 2 HB 14463
## 3 SC 10650
## 4 Undefined 1169
## 5 FB 798
#Count of Repeated_Guest
Repeated_Guest <- hotel_bookings2%>%
group_by(Repeated_Guest)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Repeated_Guest)
## # A tibble: 2 × 2
## Repeated_Guest num
## <int> <int>
## 1 0 115580
## 2 1 3810
#Count on Customer_Type
Customer <- hotel_bookings2%>%
group_by(Customer_Type)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Customer)
## # A tibble: 4 × 2
## Customer_Type num
## <chr> <int>
## 1 Transient 89613
## 2 Transient-Party 25124
## 3 Contract 4076
## 4 Group 577
#Count on Special_Requests
Special_Requests <- hotel_bookings2%>%
group_by(Special_Requests)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Special_Requests)
## # A tibble: 6 × 2
## Special_Requests num
## <int> <int>
## 1 0 70318
## 2 1 33226
## 3 2 12969
## 4 3 2497
## 5 4 340
## 6 5 40
#Count on Stays_Total
Stays_Total <- hotel_bookings2%>%
group_by(Stays_Total)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Stays_Total)
## # A tibble: 6 × 2
## Stays_Total num
## <int> <int>
## 1 2 27643
## 2 3 27076
## 3 1 21020
## 4 4 17383
## 5 7 8655
## 6 5 7784
#Count on Stays_Total_Cost
Stays_Total_Cost <- hotel_bookings2%>%
group_by(Stays_Total_Cost)%>%
summarise(num=n())%>%
arrange(desc(num))
head(Stays_Total_Cost)
## # A tibble: 6 × 2
## Stays_Total_Cost num
## <dbl> <int>
## 1 124 2913
## 2 0 1959
## 3 180 1310
## 4 270 1090
## 5 240 1046
## 6 300 1030
Cancellations in April is the highest.
# Bar Plot
hotel_bookings2 %>%
ggplot(aes(x = Cancellation_Status, fill = Arrival_Month)) +
geom_bar()
The year 2016 had the most cancellations/non-cancellations.
# Bar Plot
hotel_bookings2 %>%
ggplot(aes(x = Arrival_Year, fill = Cancellation_Status)) +
geom_bar()
Repeated guests book City Hotels the most.
# Bar Plot
hotel_bookings2 %>%
ggplot(aes(x = Repeated_Guest, fill = Hotel)) +
geom_bar()
Transient customers tend to be repeated guests.
# Plot
hotel_bookings2%>%
ggplot(aes(x = Repeated_Guest, fill = Customer_Type)) +
geom_bar()
Bed and Breakfast is the preferred meal of choice for repeated guest.
# Plot
hotel_bookings2%>%
ggplot(aes(x = Repeated_Guest, fill = Meal)) +
geom_bar()
Guests in general have no special requests during hotel stay.
# Plot
hotel_bookings2%>%
ggplot(aes(x = Special_Requests, fill = Repeated_Guest)) +
geom_bar()
Whether it’s a City or Resort hotel, majority of the booking stays are between 0 to 20 days with a few outliers.
#Scatter plot
hotel_bookings2 %>%
ggplot(aes(x = Stays_Total, y = Stays_Total_Cost, color = Hotel)) +
geom_point() +
geom_smooth(formula = y ~ x, method=lm , color="red", se=FALSE) + # with linear trend
theme_ipsum()