The packages we are using are tidyverse ,janitor and dplyr. Tidyverse is a collection of R packages designed for data science. All packages share a similar design principle and data structure. The core packages within Tidyverse are ggplot2, dplyr, readr, purr, tibble, stringr and forcats, which help transform, model and visualize data. Dplyr and tidyr are used to solve data manipulation and data tidying challeneges respectively. Dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. The janitor package has a wide range of functions that facilitate data cleaning and exploration. The package is designed to be compatible with the tidyverse, and can therefore be seamlessly integrated into most data prep workflows
| Library | Description |
|---|---|
| ‘tidyverse’ | Used for data manipulation. |
| ‘dplyr’ | Used for data wrangling & manipulation. |
| ‘ggplot2’ | Used for creating data visualizations. |
| ‘janitor’ | Facilitates data cleaning and exploration. |
# Install Packages
library(tidyverse)
library("feasts")
library("dplyr")
library("skimr")
This is an open hotel booking demand dataset published on Science Direct website by Antonio, Almeida and Nunes in 2019 and we can obtain it from Github.
Source of the dataset is derived from two hotels (resort & city) located in Portugal. Most overlying purpose of this dataset was to study the revenue management by observing data for the time period 1 July,2015 - 31 Aug,2017. Data is extracted from hotel’s Property Management System SQL database using TSQL queries. There are a total of 32 variables and 119,390 records.
tuesdata <- tidytuesdayR::tt_load(2020, week = 7)
hotels <- tuesdata$hotels
attach(hotels)
dim(hotels)
## [1] 119390 32
hotels %>%
janitor::clean_names()
hotels_1 <- subset(hotels %>%
mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
date = parse_date(date, format = "%Y-%B-%d")) ,
select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month))
lapply(hotels_1, function(x) {length(which(is.na(x)))}) # There are 4 NA in the data set
lapply(hotels, function(x) {length(which(is.null(x)))}) # There are no nulls in the data set.
summary(hotels_1$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
# Mean and median of children variable is same i.e. 0. Thus, we can replace NA's with 0
hotels_1$children[is.na(hotels_1$children)] <- 0
unique(hotels_1$company)
"NULL" %in% hotels_1$company
"NULL" %in% hotels_1$country
"NULL" %in% hotels_1$agent
hotels_1$hotel[hotels_1$hotel %in% c('Resort Hotel')] <- 0
hotels_1$hotel[(hotels_1$hotel == 'City Hotel')] <- 1
Skim() gives a summary of data set and a sneak peak into their statistics
skim(hotels_1)
| Name | hotels_1 |
| Number of rows | 119390 |
| Number of columns | 30 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| Date | 2 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| hotel | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
| country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
| market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
| distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
| reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
| deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
| agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
| company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
| customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
| reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
| date | 0 | 1 | 2015-07-01 | 2017-08-31 | 2016-09-06 | 793 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
| lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
| arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
| stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
| stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
| adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
| children | 0 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
| previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
| previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
| booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
| days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
| adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
| required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
| total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |