The packages we are using are tidyverse ,janitor and dplyr. Tidyverse is a collection of R packages designed for data science. All packages share a similar design principle and data structure. The core packages within Tidyverse are ggplot2, dplyr, readr, purr, tibble, stringr and forcats, which help transform, model and visualize data. Dplyr and tidyr are used to solve data manipulation and data tidying challeneges respectively. Dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. The janitor package has a wide range of functions that facilitate data cleaning and exploration. The package is designed to be compatible with the tidyverse, and can therefore be seamlessly integrated into most data prep workflows
| Library | Description |
|---|---|
| ‘tidyverse’ | Used for data manipulation. |
| ‘dplyr’ | Used for data wrangling & manipulation. |
| ‘ggplot2’ | Used for creating data visualizations. |
| ‘janitor’ | Facilitates data cleaning and exploration. |
# Install Packages
library("dplyr")
library("magrittr")
library("tidyverse")
library("tidytuesdayR")
library("tidyr")
library("here")
library("janitor")
library("ggplot2")
This is an open hotel booking demand dataset published on Science Direct website by Antonio, Almeida and Nunes in 2019 and we can obtain it from Github.
Source of the dataset is derived from two hotels (resort & city) located in Portugal. Most overlying purpose of this dataset was to study the revenue management by observing data for the time period 1 July,2015 - 31 Aug,2017. Data is extracted from hotel’s Property Management System SQL database using TSQL queries. There are a total of 32 variables and 119,390 records.
x <-read.csv('/Users/sarangsh/Documents/Subjects for Fall /6030 Data Management Prof. Chen/Week 6/hotel_bookings.csv')
hotels <- as.data.frame(x)
dim(hotels)
## [1] 119390 32
hotels %>%
janitor::clean_names()
Find structure and data type of each variable
str(hotels)
Some of the variables are converted to factors for modelling purpose in case
columns <- c("hotel", "market_segment", "distribution_channel",
"reserved_room_type", "assigned_room_type", "deposit_type", "customer_type")
hotels[columns] <- lapply(hotels[columns], as.factor)
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
Label the levels of some variables
hotels$is_canceled <- factor(hotels$is_canceled, levels=c(0, 1),
labels=c("Not canceled", "Canceled"))
Creating a new variable with months as factors
hotels$arrival_date_month<- factor(hotels$arrival_date_month, levels =
c("January", "February", "March", "April", "May", "June", "July",
"August", "September", "October", "November", "December"))
Add new column as total_stay = stays_in_week_nights+stays_in_weekend_nights
hotels <- hotels %>%
mutate(total_stay = stays_in_week_nights + stays_in_weekend_nights) %>%
select(-c(stays_in_week_nights, stays_in_weekend_nights))
Add new column as total_guests
hotels <- hotels %>%
mutate(total_guests = adults + children + babies) %>%
select(-c(adults , children , babies))
Change data type of total_guests from double to int as total number of people can’t be a fraction
hotels$total_guests <- as.integer(hotels$total_guests)
Identify NA
lapply(hotels, function(x) {length(which(is.na(x)))}) # There are 4 NA in the data set
## $hotel
## [1] 0
##
## $is_canceled
## [1] 0
##
## $lead_time
## [1] 0
##
## $arrival_date_year
## [1] 0
##
## $arrival_date_month
## [1] 0
##
## $arrival_date_week_number
## [1] 0
##
## $arrival_date_day_of_month
## [1] 0
##
## $meal
## [1] 0
##
## $country
## [1] 0
##
## $market_segment
## [1] 0
##
## $distribution_channel
## [1] 0
##
## $is_repeated_guest
## [1] 0
##
## $previous_cancellations
## [1] 0
##
## $previous_bookings_not_canceled
## [1] 0
##
## $reserved_room_type
## [1] 0
##
## $assigned_room_type
## [1] 0
##
## $booking_changes
## [1] 0
##
## $deposit_type
## [1] 0
##
## $agent
## [1] 0
##
## $company
## [1] 0
##
## $days_in_waiting_list
## [1] 0
##
## $customer_type
## [1] 0
##
## $adr
## [1] 0
##
## $required_car_parking_spaces
## [1] 0
##
## $total_of_special_requests
## [1] 0
##
## $reservation_status
## [1] 0
##
## $reservation_status_date
## [1] 0
##
## $total_stay
## [1] 0
##
## $total_guests
## [1] 4
Identify null values
lapply(hotels, function(x) {length(which(is.null(x)))}) # There are no nulls in the data set.
## $hotel
## [1] 0
##
## $is_canceled
## [1] 0
##
## $lead_time
## [1] 0
##
## $arrival_date_year
## [1] 0
##
## $arrival_date_month
## [1] 0
##
## $arrival_date_week_number
## [1] 0
##
## $arrival_date_day_of_month
## [1] 0
##
## $meal
## [1] 0
##
## $country
## [1] 0
##
## $market_segment
## [1] 0
##
## $distribution_channel
## [1] 0
##
## $is_repeated_guest
## [1] 0
##
## $previous_cancellations
## [1] 0
##
## $previous_bookings_not_canceled
## [1] 0
##
## $reserved_room_type
## [1] 0
##
## $assigned_room_type
## [1] 0
##
## $booking_changes
## [1] 0
##
## $deposit_type
## [1] 0
##
## $agent
## [1] 0
##
## $company
## [1] 0
##
## $days_in_waiting_list
## [1] 0
##
## $customer_type
## [1] 0
##
## $adr
## [1] 0
##
## $required_car_parking_spaces
## [1] 0
##
## $total_of_special_requests
## [1] 0
##
## $reservation_status
## [1] 0
##
## $reservation_status_date
## [1] 0
##
## $total_stay
## [1] 0
##
## $total_guests
## [1] 0
Identifying unique values
unique(hotels$company)
## [1] "NULL" "110" "113" "270" "178" "240" "154" "144" "307" "268"
## [11] "59" "204" "312" "318" "94" "174" "274" "195" "223" "317"
## [21] "281" "118" "53" "286" "12" "47" "324" "342" "373" "371"
## [31] "383" "86" "82" "218" "88" "31" "397" "392" "405" "331"
## [41] "367" "20" "83" "416" "51" "395" "102" "34" "84" "360"
## [51] "394" "457" "382" "461" "478" "386" "112" "486" "421" "9"
## [61] "308" "135" "224" "504" "269" "356" "498" "390" "513" "203"
## [71] "263" "477" "521" "169" "515" "445" "337" "251" "428" "292"
## [81] "388" "130" "250" "355" "254" "543" "531" "528" "62" "120"
## [91] "42" "81" "116" "530" "103" "39" "16" "92" "61" "501"
## [101] "165" "291" "290" "43" "325" "192" "108" "200" "465" "287"
## [111] "297" "490" "482" "207" "282" "437" "225" "329" "272" "28"
## [121] "77" "338" "72" "246" "319" "146" "159" "380" "323" "511"
## [131] "407" "278" "80" "403" "399" "14" "137" "343" "346" "347"
## [141] "349" "289" "351" "353" "54" "99" "358" "361" "362" "366"
## [151] "372" "365" "277" "109" "377" "379" "22" "378" "330" "364"
## [161] "401" "232" "255" "384" "167" "212" "514" "391" "400" "376"
## [171] "402" "396" "302" "398" "6" "370" "369" "409" "168" "104"
## [181] "408" "413" "148" "10" "333" "419" "415" "424" "425" "423"
## [191] "422" "435" "439" "442" "448" "443" "454" "444" "52" "459"
## [201] "458" "456" "460" "447" "470" "466" "484" "184" "485" "32"
## [211] "487" "491" "494" "193" "516" "496" "499" "29" "78" "520"
## [221] "507" "506" "512" "126" "64" "242" "518" "523" "539" "534"
## [231] "436" "525" "541" "40" "455" "410" "45" "38" "49" "48"
## [241] "67" "68" "65" "91" "37" "8" "179" "209" "219" "221"
## [251] "227" "153" "186" "253" "202" "216" "275" "233" "280" "309"
## [261] "321" "93" "316" "85" "107" "350" "279" "334" "348" "150"
## [271] "73" "385" "418" "197" "450" "452" "115" "46" "76" "96"
## [281] "100" "105" "101" "122" "11" "139" "142" "127" "143" "140"
## [291] "149" "163" "160" "180" "238" "183" "222" "185" "217" "215"
## [301] "213" "237" "230" "234" "35" "245" "158" "258" "259" "260"
## [311] "411" "257" "271" "18" "106" "210" "273" "71" "284" "301"
## [321] "305" "293" "264" "311" "304" "313" "288" "320" "314" "332"
## [331] "341" "352" "243" "368" "393" "132" "220" "412" "420" "426"
## [341] "417" "429" "433" "446" "357" "479" "483" "489" "229" "481"
## [351] "497" "451" "492"
"NULL" %in% hotels$company
## [1] TRUE
"NULL" %in% hotels$country
## [1] TRUE
"NULL" %in% hotels$agent
## [1] TRUE
Remove average daily rate < 0}
hotels <- filter(hotels, hotels$adr >= 0)
summary(hotels$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.29 94.59 101.83 126.00 5400.00
quantile(hotels$adr, probs = c(0.25,0.50,0.75,0.95,0.975,1.00), na.rm = TRUE)
## 25% 50% 75% 95% 97.5% 100%
## 69.290 94.590 126.000 193.500 221.383 5400.000
% of repeated guests Only 3.19% of total customers are repeated guests
hotels %>%
group_by(is_repeated_guest) %>%
count(is_repeated_guest) %>%
summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 × 3
## is_repeated_guest percent total
## <fct> <dbl> <int>
## 1 0 96.8 115580
## 2 1 3.19 3809
Investigating the percentage of guests that repeat Only 3.19% of the guests are repeated guests.
hotels %>% ggplot(aes(x = hotel, fill = is_repeated_guest)) +
geom_bar(position = "dodge") +
geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.5) +
scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("Non repeated guest", "Repeated guest")) + # labels the filter
guides(fill = guide_legend(title = NULL)) + # removes the guide title
ggtitle("Frequent visitors in two types of hotel")
How many bookings were cancelled - 37% of bookings were cancelled
hotels %>%
group_by(is_canceled) %>%
count(is_canceled) %>%
summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 × 3
## is_canceled percent total
## <fct> <dbl> <int>
## 1 Not canceled 63.0 75165
## 2 Canceled 37.0 44224
hotels %>% ggplot(aes(x = hotel, fill = is_canceled)) +
geom_bar(position = "dodge") +
geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.5) +
scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("Not canceled", "Canceled")) + # labels the filter
guides(fill = guide_legend(title = NULL)) + # removes the guide title
ggtitle("Bookings cancellations in two types of hotel")
Booking cancellations in two types of hotels
hotels %>% ggplot(aes(x = is_repeated_guest, fill = is_canceled)) +
geom_bar(position = "dodge") +
geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.0) +
scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("No previous cancellations", "Previous cancellations")) + # labels the filter
guides(fill = guide_legend(title = NULL)) + # removes the guide title
ggtitle("Booking cancellations in two types of hotel")
To find the most common customer (guest) type
ggplot(hotels, aes(customer_type)) +
geom_bar( fill = "#C2A5CF")
Bookings wrt room type Room A was the most popular Room
ggplot(hotels, aes(reserved_room_type , fill = factor(hotel))) +
geom_bar(fill= "#C2A5CF")
Hotel type preferred by repeated guests
ggplot(data = hotels, aes(x = hotel)) +
geom_bar() # customize ylim
Months with lowest / highest bookings August has highest bookings January, Dec, Non has seen 50% less bookings than August; might be because people prefer to spend holiday at their home or the prices are too high
ggplot(data = hotels, aes(x = arrival_date_month)) +
geom_bar()
To dig deeper we tried to find co-relation between arrival_date_week_number and arrival_date_year.
Negavtive co relation
cor(hotels$arrival_date_week_number, hotels$arrival_date_year)
## [1] -0.5405556
Most common days of booking 8-24 are the days within the month that have most number of bookings coming in
quantile(hotels$arrival_date_day_of_month, probs = c(0.25,0.50,0.75))
## 25% 50% 75%
## 8 16 23
ggplot(data = hotels, aes(x = hotel, y = arrival_date_day_of_month)) +
geom_boxplot()
Meal choices Bread and breakfast is the most popular meal choice among the customers
ggplot(data = hotels, aes(x = meal)) +
geom_bar()
Customer types
hotels %>%
group_by(hotels$customer_type) %>%
count()
## # A tibble: 4 × 2
## # Groups: hotels$customer_type [4]
## `hotels$customer_type` n
## <fct> <int>
## 1 Contract 4076
## 2 Group 577
## 3 Transient 89613
## 4 Transient-Party 25123