Case Study

We will be looking at a large data set containing information on the details of hotel bookings from around the United States. Data sets like this contain a lot of useful information to help business understand their customers and the services they require. However, it is too big of a data set to manipulate and investigate directly in a spreadsheet.

A definition of all variables is given in the Data dictionary section at the end, though you don’t need to familiarize yourself with all variables in order to work through these exercises.

Learning Goals

Load data into R from an external file
Learn how to see a quick summary of data sets in R
Select variables of interest from a data set
Filter a data set by desired characteristics of variables
Group and count variables
Mutate data to create new variables
Summarize variables by values like the mean, median, etc.

Instructions

On lines 3 and 4 replace the existing text with your name and data
Except for the YAML and code needed to load the packages and the data, delete all the text prior to Exercises
After Exercises the only text that should appear are the questions and your answers to them
Delete all the comments that have been left for you in the code chunks
Include all code in the final submission

Getting Started

We will start by loading the package required to complete this lab:

library(tidyverse)
library(skimr)

Loading Data into R

In this lab we will import data into R, as opposed to using a data set that is already stored in a package. The functions used to load data are specific to the type of file we have saved our data as. For example:

read_csv: loads data saved as a .csv (comma dilimited) file
read_tsv: loads data saved as a .tsv (tab-seperated value) file
read_delim: loads data saved with other deliminators, often used with .txt (text) files

The most common file type we will use is a .csv file, and therefore we will most regularly use the read_csv function. One of the reasons for this is that comma delimited files can be used to move information between programs that aren’t ordinarily able to exchange data. For example, even if we create and save our data in Excel, we will still be able to share it with someone who uses Numbers.

In order for R to load the data set, we need to tell it where to look. This can be any number of places, from folder on our computer to a website. For this lab, we will direct R to install a data set called hotels.csv.

The hotel.csv file can be found on Canvas. To load this data into R, we will need download the file and save it to the same place that we have saved the Lab 1 template. The location where we save the files is important. If we don’t save it to the right place, R won’t be able to find it and we will get the following error message:

The “working directory” is where R is looking for the file. If we get this error message we can change the working directory by going to Session, choosing Set Working Directory, and then Choose Directory… after which we can select the location in which we have saved the data file.

We need to be careful if we decide to change the working directory, though, as this will also change the place where we save our RMarkdown file for the lab as well as the document it produces. It is generally easier to save the data file in the correct location in the first place.

We will also need to be careful with our spelling. R will only recognize a file name exactly as we have saved it, including spelling errors, extra spaces and capitalizations.

hotels <- read_csv("hotels.csv")

R makes a determination as to the type of variables we are loading into the program.R stores the data as either double, integer, complex, logical or character.

int: integer - whole numbers
dbl: double - any real number (i.e., decimals are possible)
chr: character - character vectors or strings
dttm: date-times - a date and a time
lgl: logical - contains only TRUE or FALSE values
fctr: factors - categorical variables with fixed possible values
date: date - a calendar date

When we read data into R, it automatically assigns each variable to a data type. When we loaded the data into R using read_csv we can see in the output that it automatically tells us how many columns have been assigned to each data type. For example, there are 13 character variables, the first of which is hotel. It is possible to override R’s assignment of variable type if necessary, but we won’t get into that quite yet.

Exercises

Data Summaries

Whenever we work with a data set, the first thing we want to do is look at a summary of it. If the data set is new to us, this informs us about what the data set contains, while if it is our own data set it gives us a chance to make sure it has been loaded into R correctly.

There are three main R commands that we can use for the initial exploration of a data set:

dim The dimensions of a data set. The number of rows (i.e., the number of observations) is always given first, followed by the number of columns (i.e., the number of variables)
str The structure of the data set. This gives the dimensions of the data table, as well as the name of all of the columns preceded by a $ sign, the type of variable and examples of the first couple of entries.
glimpse A summary similar to that provided by str, but a little neater.

NOTE: Delete eval=FALSE before knitting the document

dim(hotels)

How many variables are there in the data set?

NOTE: Delete eval=FALSE before kniting the document

str(hotels)

How many observations of hotel bookings are there in our data set?

NOTE: Delete eval=FALSE before kniting the document

glimpse(hotels)

How many variables in the data set has R recorded as a character?

Selecting Variables

When working with a data set we sometimes what to select different variables to work with. We can select to keep variables, exclude variable and identify variables with certain characteristics. For example, we can select to keep the variable lead_time, or exclude the variable agent:

# This code should be run without making any changes to it
hotels %>%
  select(lead_time) # Keep s the variable lead_time

hotels %>%
  select(-agent)   # Excludes the variable agent

We might also want to retain a range of variables:

hotels %>%
  select(hotel:arrival_date_month)

Note: Delete `eval=FALSE and the underline, ___.

# Select the variables country (country) and meal (meal)
hotels %>%
  select(____)

Filtering

Often times we are interested in looking at only parts of our data set, such as only bookings from certain countries or where people paid less than a specified amount. We can filter our data set in R using logical operators.

> - greater than
< - less than
<= - less than or equal to
>= - greater than or equal to
== - exactly equal to
!= - not equal to
!x - not x (where x is replaced with the name of a variable we want to avoid)
x|y - x OR y (where x and y are replaced with the names of variables we are interested in)
x&y: x AND y (where x and y are replaced with the names of variables we are interested in)

These operators can also be combined to create a sequence of logical arguments. For example:

# Run this code without making any changes to it
onetoten <- c(1:10)                 # The values 1 to 10 
onetoten[onetoten>=8]               # The values in onetoten greater than or equal to 8
onetoten[(onetoten>8)|(onetoten<5)] # The values in onetoten greater than 8 or less than 5

In the example, we could have determined the answers by writing out the numbers one through ten and counting the ones that met each condition. However, in our hotels data set there are way more observations then we would ever want to count manually. Therefore we will often want to filter our data to only look at specific components. For example, we create a new data set where we look only at those travelers that are from the United States, and then look at the number of rows (nrow) to determine how many observations we have in this category and the number of columns (ncol) to determine the number of variables.

UStravelers <- hotels %>%
  filter(country=="USA")
nrow(UStravelers)
ncol(UStravelers)

How many travelers from the US are in our data set?

Now let’s look at whether people are traveling on a whim.

Note: You will need to delete eval=FALSE, as well as completely delete the underlines, ___, completely.

# Fill in the blanks to filter hotel bookings where the guest is 
#   1. NOT from the US (USA)  
#   2. The lead time for the booking (lead_time) is less than 1.
hotels %>%
  filter(
    country ____ "USA", 
    lead_time ___ ___
    )

How many travelers from outside the USA made their hotel reservations less than one day before their trip?

Next we will look at bookings involving children.

Note: You will need to delete eval=FALSEand completely delete [AT LEAST] and [OR].

#Replace:
#  1. [AT LEAST] with the logical operator for "at least" (in two places)
#  2. [OR] with the logical operator for "or"
hotels %>%
  filter(
    children [AT LEAST] 1 [OR] babies [AT LEAST] 1
    )

How many bookings included at least one child OR baby?

Let’s explore the number of bookings at resort hotels that included children or babies. When filtering by the name of a qualitative variable, it is necessary to include the name of the category we are interested in quotation marks in the R code. So we would use “Resort Hotel” in the following code.

Note: You will need to delete eval=FALSE before knitting the document.

#Replace:
#  1. [AT LEAST] with the logical operator for "at least" (in two places)
#  2. [OR] with the logical operator for "or"
#  3. [EXACTLY EQUALS] with the logical operator for "exactly equals"
#  4. [HOTEL TYPE] to indicate the type is "Resort Hotel"
hotels %>% 
  filter(
    children [AT LEAST] 1 [OR] babies [AT LEAST] 1,
    hotel [EXACTLY EQUALS] [HOTEL TYPE]
  )

How many bookings in resort hotels have at least one child or baby?

Finally, it is also possible to combine filter with the selection of variables. Lets look at how many children traveled without adults.

Note: You will need to deleteeval=FALSE before knitting the document

#Define the following conditions for the hotels data:
#  1. Filter the hotel bookings by those with no adults (adults) AND at least one child 
#     (children)
#  2. Select only the data on adults, children and babies
hotels %>% 
  filter() %>%
  select()

How many hotel bookings involved no adults, but at least one child?

Counting

When dealing with qualitative variables, we often want to count how often the different categories appear, building what is known as a frequency table.

hotels %>%
  count(market_segment)

We can also count quantitative outputs. Let’s look at how many times bookings of different numbers of adults occur.

Note: Delete eval=FALSE before knitting the document

# Count the number of adults (adults) to create a
#    frequency table
hotels %>%
  count()

What was the largest number of adults in a single booking?

It is also possible to count the observations that fall into multiple categories simultaneously. We will expand the frequency table for adult bookings to include canceled bookings (is_canceled, where a cancelled booking is indicated by a 1).

Note: Delete eval=FALSE before knitting the document

#  Count the number of adults (adults) and cancelled bookings (is_canceled) to create a 
#    frequency table 
hotels %>%
  count()

All bookings larger than what size were cancelled?

Mutating, Summarizing an Grouping

In addition to rearranging data sets, we can also add new variables (mutate), calculate summary statistics (summarise) and group our observations (group_by).

Let’s first look at mutate:

hotels %>%
  mutate(little_ones = children + babies) %>%
  select(children, babies, little_ones)

In the data set, the number of weekend nights (stays_in_weekend_nights) and weekday nights (stays_in_week_nights) were given, but we would like to know the total length of stay.

Note: Makes sure to delete eval=FALSE

# Mutate the data set to create a column called total_stay that is the sum of the number 
#    of weekend nights (stays_in_weekend_nights) and weekday nights (stays_in_week_nights)
#    for each booking
# Select the column just created (total_stay)
hotels %>%
  mutate() %>%
  select()

What is the longest stay out of those listed in the first 10 rows of the data?

When we summarize data, we often want to look at values such as minimum or maximum value of a variable or its mean and median.

min: finds the minimum value of a variable
max: finds the maximum value of a variable
mean: calculates the mean of a variable
median: calculates the median of a variable

When we use summarise we want to provide a useful column name of our choice prior to telling R what to calculate for the summary. For example, we can find the minimum and maximum values for the lead time (lead_time) before a reservation:

hotels %>%
  summarise(max_lead_time=max(lead_time),
            min_lead_time=min(lead_time))

The names of the columns in this case are max_lead_time and min_lead_time and should be changed to be informative of what is being calculated and for which variable.

We can also group these summaries by the categories in a qualitative variable.

hotels %>%
  group_by(customer_type) %>%
  summarise(max_lead_time=max(lead_time),
            min_lead_time=min(lead_time))

Which customers provide the longest lead times?

Now lets try that ourselves. Calculate the minimum, mean, median, and maximum average daily rate (adr) grouped by hotel type so that we can get these statistics separately for resort and city hotels.

#Do the following:
#  1. Type the name of the data set you wish to wrangle (hotels), and then (%>%)
#  2. Group by the hotel type (hotel), and then (%>%)
#  3. Summarize the average daily rate (adr) by its minimum, mean, median and maximum 
#     value.
#  4. Make sure that the name of each column is informative

Which two values in the summary appear to be unusual?

Data Dictionary

Below is the full data dictionary. Note that it is long (there are lots of variables in the data), but we will be using a limited set of the variables for our analysis.

variable	class	description
hotel	character	Hotel (Resort Hotel or City Hotel)
is_canceled	double	Value indicating if the booking was canceled (1) or not (0)
lead_time	double	Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year	double	Year of arrival date
arrival_date_month	character	Month of arrival date
arrival_date_week_number	double	Week number of year for arrival date
arrival_date_day_of_month	double	Day of arrival date
stays_in_weekend_nights	double	Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights	double	Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults	double	Number of adults
children	double	Number of children
babies	double	Number of babies
meal	character	Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
country	character	Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment	character	Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel	character	Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest	double	Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations	double	Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled	double	Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type	character	Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type	character	Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes	double	Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type	character	Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
agent	character	ID of the travel agency that made the booking
company	character	ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list	double	Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type	character	Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr	double	Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces	double	Number of car parking spaces required by the customer
total_of_special_requests	double	Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status	character	Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date	double	Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Lab 1: Hotel Bookings - Data Wrangling

Mine Çetinkaya-Rundel

Modified by Leslie New