Data 607 - Assignment 7

The assignment is to present a use case for the tidyverse packages and sdemonstrate how to use one or more of the capabilities TidyVerse package with your selected dataset

1) Libraries and Data

Load needed libraries

# The easiest way to get all libraries is to load the whole tidyverse but we will load just the packages we need

#library(tidyverse)

# Alternatively, loading all packages that we use:
library(readr)
library(lubridate)
library(dplyr)
library(knitr)

create my github path

urlRemote  <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Tidyverse/"

2) Readr

We start of by downloading our csv file from my Githib (originally from https://www.kaggle.com/jessemostipak/hotel-booking-demand) and turning it into a dataframe format:

#create HTML URL
fileNamecsv   <- "hotels.csv"
csv_URL <- paste0(urlRemote, pathGithub, fileNamecsv)

#We read the CSV
hotels_raw <- readr::read_csv(csv_URL)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   hotel = col_character(),
##   arrival_date_month = col_character(),
##   meal = col_character(),
##   country = col_character(),
##   market_segment = col_character(),
##   distribution_channel = col_character(),
##   reserved_room_type = col_character(),
##   assigned_room_type = col_character(),
##   deposit_type = col_character(),
##   agent = col_character(),
##   company = col_character(),
##   customer_type = col_character(),
##   reservation_status = col_character(),
##   reservation_status_date = col_character()
## )

## See spec(...) for full column specifications.

3) Dplyr::filter

We want to do some early filtering on the data to exclude some special cases from our data set:

#we exclude all data rows that have no weekend and no weekday stays:

hotels <- dplyr::filter(hotels_raw, stays_in_weekend_nights != 0 | stays_in_week_nights != 0 )
names(hotels)

##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"

4) Lubridate

As we can see the dataframe has individual columns for the arrival year, month and day so we use lubridate to make an new arrival date column in date format and create a column that hows the check-out date based on adding days stayed in the hotel.

#lubridat lets us easily create a date object out of three columns that have year in yyyy, moonths in text format and days in dd
hotels$arrival_date <- paste(hotels$arrival_date_year  , hotels$arrival_date_month, hotels$arrival_date_day_of_month, sep="-")  %>% lubridate::ymd() %>% as.Date()

#we can easily add days to the date to get a cehck-out date column (some )
hotels$checkout_date <- ymd(hotels$arrival_date) + days(hotels$stays_in_weekend_nights) + days(hotels$stays_in_week_nights)

5) Dplyr::select

We want to have a look at just the columns we used and created in the above section so we use dplyr::select

kable(head(hotels %>% 
  dplyr::select(arrival_date_year:arrival_date_day_of_month, arrival_date : checkout_date)))

arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	arrival_date	checkout_date
2015	July	27	1	2015-07-01	2015-07-02
2015	July	27	1	2015-07-01	2015-07-02
2015	July	27	1	2015-07-01	2015-07-03
2015	July	27	1	2015-07-01	2015-07-03
2015	July	27	1	2015-07-01	2015-07-03
2015	July	27	1	2015-07-01	2015-07-03

6) Dplyr::summarise/group_by/count/arrange

Lastly we want see how we can use summarise, group by and count to show some overview statistics:

#We want to see how the reservation status behaves with the deposit type:

hotels %>% 
  dplyr::group_by(deposit_type) %>% 
  dplyr::count(reservation_status)

#To demonstrate the powerful pipe operator in combination with some dplyr functions we look at the average length of stay grouped by deposit type

hotels %>%
  dplyr::group_by(deposit_type) %>%
  dplyr::summarise(mean = mean(checkout_date-arrival_date), n = n()/nrow(hotels)) %>%
  dplyr::arrange(-mean)

From the first table we can actually see a large amount of cancellations even within the non-refundable bookings.

The second table shows us that the refunable bookings are on average the longest stay (which makes sense since they are probably the most expensive ones) however there are only very few of them (about 0.1%) We can see that the next longest stays are from bookings without deposits which make up about 88% of bookings. Lastly, no refund bookings are on average the shortest.

7) Summary

dplyr has great functions to summarise, an access certain fields and pivot them around to show any desired permutation of the data

GitHub: https://github.com/chilleundso/DATA607/blob/master/Tidyverse/Data607_Tidyverse_Manolis.Rmd