The assignment is to present a use case for the tidyverse and demonstrate how to use one or more capabilities of a tidyverse package with a selected dataset.
Load needed libraries
# The easiest way to get all libraries is to load the whole tidyverse but we will load just the packages we need
#library(tidyverse)
# Alternatively, loading all packages that we use:
library(readr)
library(lubridate)
library(dplyr)
library(knitr)
Create my GitHub path
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Tidyverse/"
We start off by downloading our CSV file from my GitHub (originally from https://www.kaggle.com/jessemostipak/hotel-booking-demand) and reading it into a data frame:
# build the raw file URL
fileNamecsv <- "hotels.csv"
csv_URL <- paste0(urlRemote, pathGithub, fileNamecsv)
#We read the CSV
hotels_raw <- readr::read_csv(csv_URL)
## Parsed with column specification:
## cols(
## .default = col_double(),
## hotel = col_character(),
## arrival_date_month = col_character(),
## meal = col_character(),
## country = col_character(),
## market_segment = col_character(),
## distribution_channel = col_character(),
## reserved_room_type = col_character(),
## assigned_room_type = col_character(),
## deposit_type = col_character(),
## agent = col_character(),
## company = col_character(),
## customer_type = col_character(),
## reservation_status = col_character(),
## reservation_status_date = col_character()
## )
## See spec(...) for full column specifications.
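The message above lists the column types that readr guessed. As a minimal sketch, assuming a small hypothetical sample file rather than the real hotels.csv, the types could instead be declared up front with `col_types`. Treating the text "NULL" as a missing value is an assumption made for this illustration.

```r
library(readr)

# Hypothetical three-row sample standing in for hotels.csv
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,is_canceled,agent,reservation_status_date",
  "Resort Hotel,0,NULL,2015-07-01",
  "Resort Hotel,0,304,2015-07-02",
  "City Hotel,1,9,2015-07-03"
), sample_csv)

# Declare the types we care about; columns not listed would still be guessed
sample_hotels <- readr::read_csv(
  sample_csv,
  col_types = cols(
    hotel = col_character(),
    is_canceled = col_integer(),
    agent = col_integer(),
    reservation_status_date = col_date(format = "%Y-%m-%d")
  ),
  na = c("", "NA", "NULL")  # assumption: the text "NULL" marks a missing agent
)

str(sample_hotels)
```

With explicit types, a column such as agent is read as a number with missing values instead of falling back to character.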
We want to do some early filtering on the data to exclude some special cases from our data set:
#we exclude all data rows that have no weekend and no weekday stays:
hotels <- dplyr::filter(hotels_raw, stays_in_weekend_nights != 0 | stays_in_week_nights != 0 )
names(hotels)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
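The filter above keeps a booking when either night count is non-zero. A minimal sketch on a hypothetical four-row tibble shows which rows survive, and that summing the two columns gives the same result as long as the counts are non-negative and not missing.

```r
library(dplyr)

# Toy rows (hypothetical values) covering the weekend/week-night combinations
toy_stays <- dplyr::tibble(
  stays_in_weekend_nights = c(0, 0, 2, 1),
  stays_in_week_nights    = c(0, 3, 0, 4)
)

# Same condition as in the text: at least one of the two counts is non-zero
kept_or <- dplyr::filter(
  toy_stays,
  stays_in_weekend_nights != 0 | stays_in_week_nights != 0
)

# Alternative wording of the same idea: total nights must be positive
kept_sum <- dplyr::filter(
  toy_stays,
  stays_in_weekend_nights + stays_in_week_nights > 0
)

nrow(kept_or)   # 3: only the 0/0 row is dropped
nrow(kept_sum)  # 3
```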
As we can see, the data frame has separate columns for the arrival year, month and day, so we use lubridate to build a new arrival date column in date format, and then create a column that shows the check-out date by adding the number of nights stayed in the hotel.
# lubridate lets us easily create a date object out of three columns that hold the year as yyyy, the month as text and the day as dd
hotels$arrival_date <- paste(hotels$arrival_date_year, hotels$arrival_date_month, hotels$arrival_date_day_of_month, sep = "-") %>% lubridate::ymd()
# we can easily add days to the date to get a check-out date column
hotels$checkout_date <- hotels$arrival_date + days(hotels$stays_in_weekend_nights) + days(hotels$stays_in_week_nights)
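Parsing month names from text depends on the language of the current locale. As a sketch of one alternative, assuming the months are always written in English, `lubridate::make_date()` can be combined with base R's `month.name` lookup. The two bookings below are hypothetical.

```r
library(lubridate)

# Hypothetical bookings using the same column layout as the hotels data
toy_bookings <- data.frame(
  arrival_date_year         = c(2015, 2016),
  arrival_date_month        = c("July", "February"),
  arrival_date_day_of_month = c(1, 28),
  stays_in_weekend_nights   = c(0, 1),
  stays_in_week_nights      = c(2, 2)
)

# match() against month.name turns English month names into the numbers 1 to 12
toy_bookings$arrival_date <- lubridate::make_date(
  year  = toy_bookings$arrival_date_year,
  month = match(toy_bookings$arrival_date_month, month.name),
  day   = toy_bookings$arrival_date_day_of_month
)

# Adding a days() period moves the date forward by whole calendar days
toy_bookings$checkout_date <- toy_bookings$arrival_date +
  lubridate::days(toy_bookings$stays_in_weekend_nights + toy_bookings$stays_in_week_nights)

toy_bookings[, c("arrival_date", "checkout_date")]
# checkout_date: 2015-07-03 and 2016-03-02 (2016 is a leap year)
```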
We want to look at just the columns we used and created in the section above, so we use dplyr::select:
kable(head(hotels %>%
dplyr::select(arrival_date_year:arrival_date_day_of_month, arrival_date : checkout_date)))
| arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | arrival_date | checkout_date |
|---|---|---|---|---|---|
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-02 |
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-02 |
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-03 |
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-03 |
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-03 |
| 2015 | July | 27 | 1 | 2015-07-01 | 2015-07-03 |
Lastly, we want to see how we can use summarise, group_by and count to show some overview statistics:
#We want to see how the reservation status behaves with the deposit type:
hotels %>%
dplyr::group_by(deposit_type) %>%
dplyr::count(reservation_status)
# To demonstrate the powerful pipe operator in combination with some dplyr functions, we look at the average length of stay and the share of bookings, grouped by deposit type
hotels %>%
dplyr::group_by(deposit_type) %>%
dplyr::summarise(mean_stay = mean(checkout_date - arrival_date), share = dplyr::n() / nrow(hotels)) %>%
dplyr::arrange(dplyr::desc(mean_stay))
From the first table we can see a large number of cancellations, even among the non-refundable bookings.
The second table shows that refundable bookings have the longest average stay (which makes sense, since they are probably the most expensive ones); however, there are very few of them (about 0.1%). The next longest stays are bookings without deposits, which make up about 88% of bookings. Lastly, non-refundable bookings are on average the shortest.
dplyr has great functions to summarise data, access certain fields and rearrange them to show the data from any desired angle.
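The grouping and summarising steps described above can be tried on a small scale. This is a minimal sketch on a hypothetical four-row tibble, where `nights` is an illustrative column and not part of the hotels data.

```r
library(dplyr)

# Hypothetical bookings; `nights` is an illustrative column
toy_deposits <- dplyr::tibble(
  deposit_type       = c("No Deposit", "No Deposit", "No Deposit", "Non Refund"),
  reservation_status = c("Check-Out", "Check-Out", "Canceled", "Canceled"),
  nights             = c(2, 4, 3, 1)
)

# count() with two columns gives the same counts as group_by() followed by count()
status_counts <- toy_deposits %>%
  dplyr::count(deposit_type, reservation_status)

# Average stay and share of bookings per deposit type, longest stays first
stay_summary <- toy_deposits %>%
  dplyr::group_by(deposit_type) %>%
  dplyr::summarise(
    mean_nights = mean(nights),
    share       = dplyr::n() / nrow(toy_deposits)
  ) %>%
  dplyr::arrange(dplyr::desc(mean_nights))

status_counts
stay_summary
```

Naming the share column explicitly (here `share`) keeps it from being mistaken for a raw count.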
GitHub: https://github.com/chilleundso/DATA607/blob/master/Tidyverse/Data607_Tidyverse_Manolis.Rmd
This is a really clear explanation of the tidyverse packages by Manolis Manoli.
Below I use the stringr, tibble, ggplot2 and tidyr packages with the hotels dataset.
library(stringr)
library(tibble)
library(ggplot2)
library(tidyr)
Using stringr we can manipulate the strings in the dataset. From the hotels dataset, we select the market_segment column, which has the character data type.
str_length() returns a numeric vector giving the number of characters (code points) in each element of the character vector. A missing string has a missing length.
# select the market segment column
hotels_str <- hotels %>% select(market_segment)
# unique values of market segment
market_segment <- unique(hotels_str$market_segment)
# length of each market segment
stringr::str_length(market_segment)
## [1] 6 9 9 13 13 6 9 8
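To see the missing-value behaviour described above, here is a small sketch on a hypothetical vector that includes an `NA`.

```r
library(stringr)

# Hypothetical vector with one missing value
toy_segments <- c("Direct", "Online TA", NA)

segment_lengths <- stringr::str_length(toy_segments)
segment_lengths
# 6 9 NA  (the space in "Online TA" counts as a character)
```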
str_subset() keeps the elements that match a pattern. It is vectorised over string and pattern.
# market segment which starts with "U"
stringr::str_subset(market_segment, "^U")
## [1] "Undefined"
# market segments which do not start with "U"
stringr::str_subset(market_segment, "^U", negate = TRUE)
## [1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
## [5] "Complementary" "Groups" "Aviation"
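A short sketch, on a hypothetical four-element vector, of how `str_subset()` relates to `str_detect()` and what "vectorised over pattern" means in practice:

```r
library(stringr)

# Hypothetical subset of the market segments
toy_segments <- c("Direct", "Online TA", "Offline TA/TO", "Undefined")

# str_detect() returns one TRUE/FALSE per element
starts_with_o <- stringr::str_detect(toy_segments, "^O")

# str_subset() keeps the elements for which that flag is TRUE
subset_o <- stringr::str_subset(toy_segments, "^O")
subset_o                        # "Online TA" "Offline TA/TO"
toy_segments[starts_with_o]     # same elements, selected by hand

# Vectorised over pattern: element i is tested against pattern i
pairwise <- stringr::str_detect(toy_segments, c("^D", "^D", "^O", "^O"))
pairwise                        # TRUE FALSE TRUE FALSE
```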
str_locate() returns an integer matrix: the first column gives the start position of the first match, and the second column gives the end position.
# search position for "i"
stringr::str_locate(market_segment, "i")
## start end
## [1,] 2 2
## [2,] NA NA
## [3,] 4 4
## [4,] 5 5
## [5,] NA NA
## [6,] NA NA
## [7,] 6 6
## [8,] 3 3
# search position for "o"
stringr::str_locate(market_segment, "o")
## start end
## [1,] NA NA
## [2,] 2 2
## [3,] NA NA
## [4,] NA NA
## [5,] 2 2
## [6,] 3 3
## [7,] NA NA
## [8,] 7 7
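Note that `str_locate()` only reports the first match in each string. As a small sketch, `str_locate_all()` returns every match, shown here for the single value "Aviation", which contains two "i" characters.

```r
library(stringr)

# First match only: one row per input string
first_i <- stringr::str_locate("Aviation", "i")
first_i[, "start"]         # 3

# All matches: a list with one matrix per input string
all_i <- stringr::str_locate_all("Aviation", "i")
all_i[[1]][, "start"]      # 3 6
```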
str_sort() sorts a character vector alphabetically.
# sort the string from A to Z
stringr::str_sort(market_segment, decreasing = FALSE)
## [1] "Aviation" "Complementary" "Corporate" "Direct"
## [5] "Groups" "Offline TA/TO" "Online TA" "Undefined"
# reverse the order
stringr::str_sort(market_segment, decreasing = TRUE)
## [1] "Undefined" "Online TA" "Offline TA/TO" "Groups"
## [5] "Direct" "Corporate" "Complementary" "Aviation"
as_tibble() converts a data frame that is passed to it into a tibble
tibble::as_tibble(hotels)
In the printed tibble, the data type is shown below each column name and the table is displayed in a structured way.
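To make the difference between the two functions concrete, here is a minimal sketch with hypothetical values: `as_tibble()` converts an existing data frame, while `tibble()` builds a tibble from column vectors.

```r
library(tibble)

# Hypothetical two-row data frame
toy_df <- data.frame(
  hotel = c("City Hotel", "Resort Hotel"),
  adr   = c(75.5, 98)
)

# Convert an existing data frame
tb_converted <- tibble::as_tibble(toy_df)

# Build a tibble directly from vectors
tb_built <- tibble::tibble(
  hotel = c("City Hotel", "Resort Hotel"),
  adr   = c(75.5, 98)
)

class(tb_converted)              # "tbl_df" "tbl" "data.frame"

# Selecting one column with [ , ] drops to a vector for a data frame,
# but stays a tibble for a tibble
class(toy_df[, "adr"])           # "numeric"
class(tb_converted[, "adr"])     # "tbl_df" "tbl" "data.frame"
```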
The tidyr package makes it easy to work with tidy data. There are many functions available in this package; I will use spread().
spread(): takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider.
# counts per deposit type and reservation status, from dplyr::group_by/count
data <- hotels %>%
dplyr::group_by(deposit_type) %>%
dplyr::count(reservation_status)
# change the table structure: each deposit_type value becomes its own column
data_tidy <- tidyr::spread(data, deposit_type, n)
tibble::as_tibble(data_tidy)
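spread() still works, but the tidyr documentation marks it as superseded by pivot_wider(). As a sketch of the same reshaping with the newer function, assuming tidyr 1.0.0 or later and using hypothetical counts:

```r
library(tidyr)
library(tibble)

# Hypothetical counts in the same long shape as `data`
toy_counts <- tibble::tibble(
  deposit_type       = c("No Deposit", "No Deposit", "Non Refund"),
  reservation_status = c("Canceled", "Check-Out", "Canceled"),
  n                  = c(10L, 30L, 5L)
)

# names_from plays the role of spread()'s key, values_from the role of its value
toy_wide <- tidyr::pivot_wider(
  toy_counts,
  names_from  = deposit_type,
  values_from = n,
  values_fill = list(n = 0L)   # combinations that never occur become 0 instead of NA
)

toy_wide
```

Each deposit type becomes a column and each reservation status a row, which is the same layout that spread() produces.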
ggplot2 is a system for declaratively creating graphics from data. Visualizations made with ggplot() are easy to understand and construct, thanks to an API that allows visualizations to be “built” by layering graphics and other visual elements.
# inside aes() we refer to columns by bare name; ggplot looks them up in the `data` argument
# canceled bookings per deposit type
data_1 <- data %>% filter(reservation_status == "Canceled")
ggplot(data = data_1, aes(x = deposit_type, y = n)) + geom_col(fill = "steelblue") +
xlab("Deposit type") + ylab("Canceled status count (#)") +
ggtitle("Deposit type count with canceled reservation status") + theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = n), hjust = 0.2, vjust = -0.2, color = "black", size = 4.5)
# no-show bookings per deposit type
data_2 <- data %>% filter(reservation_status == "No-Show")
ggplot(data = data_2, aes(x = deposit_type, y = n)) + geom_col(fill = "steelblue") +
xlab("Deposit type") + ylab("No-show status count (#)") +
ggtitle("Deposit type count with no-show reservation status") + theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = n), hjust = 0.2, vjust = -0.2, color = "black", size = 4.5)
# checked-out bookings per deposit type
data_3 <- data %>% filter(reservation_status == "Check-Out")
ggplot(data = data_3, aes(x = deposit_type, y = n)) + geom_col(fill = "steelblue") +
xlab("Deposit type") + ylab("Check-out status count (#)") +
ggtitle("Deposit type count with check-out reservation status") + theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = n), hjust = 0.2, vjust = -0.2, color = "black", size = 4.5)
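The three plots above differ only in the reservation status they filter on. As a sketch of one possible alternative, `facet_wrap()` can draw one panel per status from a single data frame. The counts below are hypothetical and only mimic the shape of `data`.

```r
library(ggplot2)

# Hypothetical counts in the same shape as `data` (deposit_type, reservation_status, n)
toy_status_counts <- data.frame(
  deposit_type       = rep(c("No Deposit", "Non Refund", "Refundable"), times = 3),
  reservation_status = rep(c("Canceled", "Check-Out", "No-Show"), each = 3),
  n                  = c(40, 25, 2, 120, 1, 4, 6, 1, 1)
)

status_plot <- ggplot(toy_status_counts, aes(x = deposit_type, y = n)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = n), vjust = -0.2, size = 4.5) +
  # one panel per reservation status, each with its own y-axis range
  facet_wrap(~ reservation_status, scales = "free_y") +
  labs(
    x = "Deposit type",
    y = "Number of bookings",
    title = "Bookings by deposit type and reservation status"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

status_plot
```

Free y-axis scales keep the small no-show counts readable next to the much larger check-out counts, at the cost of making bar heights harder to compare across panels.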
The sections above give examples of the stringr, tidyr, tibble and ggplot2 packages; with these packages we can do a lot more data analysis. stringr helps manipulate string data, tidyr helps work with tidy data, tibble provides a modern take on the data frame, and ggplot2 helps with visualisation.