In this activity, you’ll review a scenario and practice creating data visualizations with ggplot2. You will learn how to use filters and facets to create custom visualizations based on different criteria.
Throughout this activity, you will also have the opportunity to practice writing your own code by making changes to the code chunks yourself. If you encounter an error or get stuck, you can always check the Lesson3_Filters_Solutions.rmd file in the Solutions folder under Week 4 for the complete, correct code.
As a junior data analyst for a hotel booking company, you have been asked to clean hotel booking data, create visualizations with ggplot2 to gain insight into the data, and present different facets of the data through visualization. Now, you are going to build on the work you performed previously to apply filters to your data visualizations in ggplot2.
If you haven’t exited RStudio since importing this data last time, you can skip these steps. Rerunning these code chunks won’t cause any problems, though, if you want to run them just in case.
If the read.csv() line below causes an error, run the line setwd("projects/Course 7/Week 4") before it.
Run the code below to read the file ‘hotel_bookings.csv’ into a data frame:
hotel_bookings <- read.csv("../data/hotel_bookings.csv")
By now, you are pretty familiar with this data set. But you can refresh your memory with the head() and colnames() functions. Run the two code chunks below to get a sample of the data and preview all the column names:
head(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
colnames(hotel_bookings)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
If you haven’t already installed and loaded the ggplot2 package, you will need to do that before you can use the ggplot() function. You only have to do this once, though, not every time you call ggplot().
You can also skip this step if you haven’t closed your RStudio session since doing the last activity. If you aren’t sure, you can run the code chunk and click ‘cancel’ if a warning message pops up telling you that you have already downloaded the ggplot2 package.
Run the code chunk below to install and load ggplot2. This may take a few minutes!
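The chunk for this step is not reproduced in the text. A minimal version, assuming you install from CRAN, might look like the following. The requireNamespace() check is an addition here, not part of the activity; it simply skips the download when the package is already installed.

```r
# Install ggplot2 only if it is not already available. This check is an
# addition to the activity; a plain install.packages("ggplot2") also works.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so that ggplot() can be called in this session.
library(ggplot2)
```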
Earlier, you created a scatterplot to explore the relationship between booking lead time and guests traveling with children. As a refresher, here’s the code:
ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = lead_time, y = children))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
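This warning means that a few rows could not be drawn, most likely because one of the plotted columns has no value in those rows. One way to check is to count the missing values in each column. The sketch below uses a small made-up data frame; toy_bookings and its values are illustrative and are not taken from hotel_bookings.csv.

```r
# Made-up stand-in for hotel_bookings; the values are illustrative only.
toy_bookings <- data.frame(
  lead_time = c(342, 7, 13, 88, 65),
  children  = c(0, NA, 1, 0, NA)
)

# Count the rows where each plotted column is missing.
missing_children  <- sum(is.na(toy_bookings$children))
missing_lead_time <- sum(is.na(toy_bookings$lead_time))

missing_children   # 2
missing_lead_time  # 0
```

Running sum(is.na(hotel_bookings$children)) on the real data would show whether the count matches the number in the warning.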
Your stakeholder asked about the group of guests who typically make early bookings, and this plot showed that many of these guests do not have children.
Now, your stakeholder wants to run a family-friendly promotion targeting key market segments. She wants to know which market segments generate the largest number of bookings, and where these bookings are made (city hotels or resort hotels).
First, you decide to create a bar chart showing each hotel type and market segment. You use different colors to represent each market segment:
ggplot(data = hotel_bookings) +
  geom_bar(mapping = aes(x = hotel, fill = market_segment))
The geom_bar() function creates a bar chart. The chart has ‘hotel’ on the x-axis and ‘count’ on the y-axis (note: if you don’t specify a variable for the y-axis, the code defaults to ‘count’). The code maps the ‘fill’ aesthetic to the variable ‘market_segment’ to generate color-coded sections inside each bar.
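To see what ‘count’ means here, it can help to compute the same numbers as a table. The sketch below does this with base R on a small made-up data frame (toy_bookings is illustrative, not the real data): the per-hotel counts correspond to the bar heights, and the per-hotel, per-segment counts correspond to the colored sections.

```r
# Made-up stand-in for hotel_bookings; the values are illustrative only.
toy_bookings <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel"),
  market_segment = c("Online TA", "Online TA", "Direct",
                     "Direct", "Corporate")
)

# Rows per hotel: these correspond to the total bar heights.
hotel_counts <- table(toy_bookings$hotel)

# Rows per hotel and market segment: these correspond to the colored sections.
segment_counts <- table(toy_bookings$hotel, toy_bookings$market_segment)

hotel_counts
segment_counts
```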
After creating this bar chart, you realize that it’s difficult to compare the size of the market segments at the top of the bars. You want your stakeholder to be able to clearly compare each segment.
You decide to use the facet_wrap() function to create a separate plot for each market segment. In the parentheses of the facet_wrap() function, add the variable ‘market_segment’ after the tilde symbol (~):
ggplot(data = hotel_bookings) +
  geom_bar(mapping = aes(x = hotel)) +
  facet_wrap(~market_segment)
Now you have a separate bar chart for each market segment. Your stakeholder has a clearer idea of the size of each market segment, as well as the corresponding data for each hotel type.
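As a rough illustration of how facet_wrap() splits one chart into panels, the sketch below builds the same kind of plot from a small made-up data frame (toy_bookings is illustrative only) and then uses ggplot2's layer_data() helper to count the panels that were created.

```r
library(ggplot2)

# Made-up stand-in for hotel_bookings; the values are illustrative only.
toy_bookings <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  market_segment = c("Online TA", "Online TA", "Direct",
                     "Direct", "Corporate", "Online TA")
)

# One panel per market segment, each with a bar per hotel type present.
facet_plot <- ggplot(data = toy_bookings) +
  geom_bar(mapping = aes(x = hotel)) +
  facet_wrap(~market_segment)

# The computed layer data records which panel each bar belongs to.
panel_count <- length(unique(layer_data(facet_plot)$PANEL))
panel_count  # 3, one per distinct market segment in the toy data

facet_plot
```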
For the next step, you will need to have the tidyverse package installed and loaded. You may see a pop-up asking if you want to install; if that’s the case, click ‘Install.’ This may take a few minutes!
If you have already done this because you’re using the tidyverse package on your own, you can skip this code chunk.
install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
After considering all the data, your stakeholder decides to send the promotion to families that make online bookings for city hotels. The online segment is the fastest growing segment, and families tend to spend more at city hotels than other types of guests.
Your stakeholder asks if you can create a plot that shows the relationship between lead time and guests traveling with children for online bookings at city hotels. This will give her a better idea of the specific timing for the promotion.
You think about it, and realize you have all the tools you need to fulfill the request. You break it down into the following two steps: 1) filtering your data; 2) plotting your filtered data.
For the first step, you can use the filter() function to create a data set that only includes the data you want. Input ‘City Hotel’ in the first set of quotation marks and ‘Online TA’ in the second set of quotation marks to specify your criteria:
onlineta_city_hotels <- filter(hotel_bookings,
                               (hotel == "City Hotel" &
                                  hotel_bookings$market_segment == "Online TA"))
Note that you can use the ‘&’ character to indicate that two different conditions must both be true. Also, you can use the ‘$’ character to specify which column in the data frame ‘hotel_bookings’ you are referencing (for example, ‘market_segment’).
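Inside filter(), columns can also be referenced by their bare names, and separate conditions passed as additional arguments are combined with AND. The sketch below compares the two spellings on a small made-up data frame (toy_bookings and the result names are illustrative only).

```r
library(dplyr)

# Made-up stand-in for hotel_bookings; the values are illustrative only.
toy_bookings <- data.frame(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  market_segment = c("Online TA", "Direct", "Online TA", "Online TA"),
  lead_time = c(88, 65, 92, 100)
)

# Both conditions joined with '&', using bare column names.
with_and <- filter(toy_bookings,
                   hotel == "City Hotel" & market_segment == "Online TA")

# The same conditions as separate arguments; filter() combines them with AND.
with_comma <- filter(toy_bookings,
                     hotel == "City Hotel",
                     market_segment == "Online TA")

identical(with_and, with_comma)  # TRUE
nrow(with_and)                   # 2
```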
You can use the View() function to check out your new data frame:
View(onlineta_city_hotels)
## Rows: 38,748
## Columns: 32
## $ hotel <chr> "City Hotel", "City Hotel", "City Hotel…
## $ is_canceled <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ lead_time <int> 88, 65, 92, 100, 79, 63, 62, 62, 80, 60…
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <int> 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ stays_in_weekend_nights <int> 0, 0, 2, 0, 0, 1, 2, 2, 1, 2, 1, 2, 2, …
## $ stays_in_week_nights <int> 4, 4, 4, 2, 3, 3, 3, 3, 2, 5, 1, 1, 1, …
## $ adults <int> 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, …
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "PRT", "PRT", "PRT", "PRT…
## $ market_segment <chr> "Online TA", "Online TA", "Online TA", …
## $ distribution_channel <chr> "TA/TO", "TA/TO", "TA/TO", "TA/TO", "TA…
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "A", "A", "A", "A", "A", "A", "A", "A",…
## $ assigned_room_type <chr> "A", "A", "A", "A", "A", "A", "A", "A",…
## $ booking_changes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "9", "9", "9", "9", "9", "9", "8", "8",…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 76.50, 68.00, 76.50, 76.50, 76.50, 68.0…
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <int> 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, …
## $ reservation_status <chr> "Canceled", "Canceled", "Canceled", "Ca…
## $ reservation_status_date <chr> "2015-07-01", "2015-04-30", "2015-06-23…
There is also another way to do this. You can use the pipe operator (%>%) to apply the filters in steps!
You name this data frame onlineta_city_hotels_v2:
onlineta_city_hotels_v2 <- hotel_bookings %>%
  filter(hotel == "City Hotel") %>%
  filter(market_segment == "Online TA")
Notice how in the code chunk above, the %>% symbol is used to note the logical steps of this code. The first line names the new data frame, onlineta_city_hotels_v2, and tells R to start with the original data frame hotel_bookings. AND THEN it tells R to filter on the ‘hotel’ column; finally, it tells R to filter on the ‘market_segment’ column.
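One way to read the pipe is that it passes whatever is on its left as the first argument of the function on its right. The sketch below checks this on a small made-up data frame (toy_bookings, piped, and direct are illustrative names, not part of the activity).

```r
library(dplyr)

# Made-up stand-in for hotel_bookings; the values are illustrative only.
toy_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  market_segment = c("Online TA", "Online TA", "Direct")
)

# Piped form: toy_bookings becomes the first argument of filter().
piped <- toy_bookings %>% filter(hotel == "City Hotel")

# Equivalent direct call with the data frame written as the first argument.
direct <- filter(toy_bookings, hotel == "City Hotel")

identical(piped, direct)  # TRUE
```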
You can confirm that this code chunk generates the same data frame by using the View() function:
View(onlineta_city_hotels_v2)
## Rows: 38,748
## Columns: 32
## $ hotel <chr> "City Hotel", "City Hotel", "City Hotel…
## $ is_canceled <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, …
## $ lead_time <int> 88, 65, 92, 100, 79, 63, 62, 62, 80, 60…
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <int> 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ stays_in_weekend_nights <int> 0, 0, 2, 0, 0, 1, 2, 2, 1, 2, 1, 2, 2, …
## $ stays_in_week_nights <int> 4, 4, 4, 2, 3, 3, 3, 3, 2, 5, 1, 1, 1, …
## $ adults <int> 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, …
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "PRT", "PRT", "PRT", "PRT…
## $ market_segment <chr> "Online TA", "Online TA", "Online TA", …
## $ distribution_channel <chr> "TA/TO", "TA/TO", "TA/TO", "TA/TO", "TA…
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "A", "A", "A", "A", "A", "A", "A", "A",…
## $ assigned_room_type <chr> "A", "A", "A", "A", "A", "A", "A", "A",…
## $ booking_changes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "9", "9", "9", "9", "9", "9", "8", "8",…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 76.50, 68.00, 76.50, 76.50, 76.50, 68.0…
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <int> 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, …
## $ reservation_status <chr> "Canceled", "Canceled", "Canceled", "Ca…
## $ reservation_status_date <chr> "2015-07-01", "2015-04-30", "2015-06-23…
You can use either of the data frames you created above for your new plots because they are the same.
Using the code for your previous scatterplot, replace variable_name in the code chunk below with either onlineta_city_hotels or onlineta_city_hotels_v2 to plot the data your stakeholder requested:
ggplot(data = onlineta_city_hotels) +
  geom_point(mapping = aes(x = lead_time, y = children))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(data = onlineta_city_hotels_v2) +
  geom_point(mapping = aes(x = lead_time, y = children))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Based on your previous filter, this scatterplot shows data for online bookings for city hotels. The plot reveals that bookings with children tend to have a shorter lead time, and bookings with 3 children have a significantly shorter lead time (<200 days). So, promotions targeting families can be made closer to the valid booking dates.
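A numeric summary can back up what the scatterplot suggests. The sketch below groups a small made-up data frame by number of children and reports the longest and median lead time in each group. The name toy_online_city and its values are illustrative only, so the numbers say nothing about the real bookings.

```r
library(dplyr)

# Made-up stand-in for the filtered data; the values are illustrative only.
toy_online_city <- data.frame(
  children  = c(0, 0, 0, 1, 1, 2, 3),
  lead_time = c(400, 150, 30, 200, 60, 90, 45)
)

# One row per number of children, with the booking count and lead time stats.
lead_time_by_children <- toy_online_city %>%
  group_by(children) %>%
  summarise(
    bookings         = n(),
    max_lead_time    = max(lead_time),
    median_lead_time = median(lead_time)
  )

lead_time_by_children
```

Applied to onlineta_city_hotels, the same pipeline would give one row per value of ‘children’; any rows with a missing ‘children’ value would appear as their own NA group.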
Filters allow you to create different views of your data and investigate more specific relationships within it. You can practice these skills by modifying the code chunks in the .rmd file, or use this code as a starting point in your own project console. Now that your stakeholder has had a chance to review these plots, she is interested in adding annotations she can use to explain the data in a presentation. Luckily, ggplot2 has a function that will allow you to do just that. You will learn more about ggplot2 in the next activity!