In this activity, you’ll review a scenario, and focus on manipulating and changing real data in R. You will learn more about functions you can use to manipulate your data, use statistical summaries to explore your data, and gain initial insights for your stakeholders.
Throughout this activity, you will also have the opportunity to practice writing your own code by making changes to the code chunks yourself. If you encounter an error or get stuck, you can always check the Lesson3_Change_Solutions .rmd file in the Solutions folder under Week 3 for the complete, correct code.
In this scenario, you are a junior data analyst working for a hotel booking company. You have been asked to clean a .csv file that was created after querying a database to combine two different tables from different hotels. You have already performed some basic cleaning functions on this data; this activity will focus on using functions to conduct basic data manipulation.
Start by installing the required packages. If you have already
installed and loaded tidyverse
, skimr
, and
janitor
in this session, feel free to skip the code chunks
in this step. This may take a few minutes to run, and you may get a
pop-up window asking if you want to proceed. Click yes to continue
installing the packages.
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
Once a package is installed, you can load it by running the
library()
function with the package name inside the
parentheses:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
In the chunk below, you will use the read_csv()
function
to import data from a .csv in the project folder called
“hotel_bookings.csv” and save it as a data frame called
hotel_bookings
.
If this line causes an error, copy in the line setwd(“projects/Course 7/Week 3”) before it.
Type the file name in the correct place to read it into your R console:
hotel_bookings <- read_csv("../data/hotel_bookings.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Like you have been doing in other examples, you are going to use
summary functions to get to know your data. This time, you are going to
complete the code chunks below in order to use these different
functions. You can use the head()
function to preview the
columns and the first several rows of data. Finish the code chunk below
and run it:
head(hotel_bookings)
## # A tibble: 6 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
How many columns are in this dataset? A: 45 B: 100 C: 32 D: 60
The ‘arrival_date_month’ variable is chr or character type
data.
A: True B: False
Now you know this dataset contains information on hotel bookings. Each booking is a row in the dataset, and each column contains information such as what type of hotel was booked, when the booking took place, and how far in advance the booking took place (the ‘lead_time’ column).
In addition to head()
you can also use the
str()
and glimpse()
functions to get summaries
of each column in your data arranged horizontally. You can try these two
functions by completing and running the code chunks below:
(hotel_bookings)
## # A tibble: 119,390 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ℹ 119,380 more rows
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
You can see the different column names and some sample values to the right of the colon.
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
You can also use colnames()
to get the names of the
columns in your dataset. Run the code chunk below to get the column
names:
colnames(hotel_bookings)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
Let’s say you want to arrange the data by most lead time to least
lead time because you want to focus on bookings that were made far in
advance. You decide you want to try using the arrange()
function; input the correct column name after the comma and run this
code chunk:
arrange(hotel_bookings, lead_time)
## # A tibble: 119,390 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 0 2015 July
## 2 Resort Hotel 0 0 2015 July
## 3 Resort Hotel 0 0 2015 July
## 4 Resort Hotel 0 0 2015 July
## 5 Resort Hotel 0 0 2015 July
## 6 Resort Hotel 0 0 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 0 2015 July
## 9 Resort Hotel 0 0 2015 July
## 10 Resort Hotel 0 0 2015 July
## # ℹ 119,380 more rows
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
But why are there so many zeroes? That’s because
arrange()
automatically orders by ascending order, and you
need to specifically tell it when to order by descending order, like the
below code chunk below:
arrange(hotel_bookings, desc(lead_time))
## # A tibble: 119,390 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 737 2015 July
## 2 Resort Hotel 0 709 2016 February
## 3 City Hotel 1 629 2017 March
## 4 City Hotel 1 629 2017 March
## 5 City Hotel 1 629 2017 March
## 6 City Hotel 1 629 2017 March
## 7 City Hotel 1 629 2017 March
## 8 City Hotel 1 629 2017 March
## 9 City Hotel 1 629 2017 March
## 10 City Hotel 1 629 2017 March
## # ℹ 119,380 more rows
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Now it is in the order you needed. You can click on the different pages of results to see additional rows of data, too.
What is the highest lead time for a hotel booking in this
dataset?
A: 737
B: 709
C: 629
D: 0
Notice that when you just run arrange()
without saving
your data to a new data frame, it does not alter the existing data
frame. Check it out by running head()
again to find out if
the highest lead times are first:
head(hotel_bookings)
## # A tibble: 6 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
This will be true of all the functions you will be using in this activity. If you wanted to create a new data frame that had those changes saved, you would use the assignment operator, <- , as written in the code chunk below to store the arranged data in a data frame named ‘hotel_bookings_v2’:
hotel_bookings_v2 <-
arrange(hotel_bookings, desc(lead_time))
Run head()
to check it out:
head(hotel_bookings_v2)
## # A tibble: 6 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 737 2015 July
## 2 Resort Hotel 0 709 2016 February
## 3 City Hotel 1 629 2017 March
## 4 City Hotel 1 629 2017 March
## 5 City Hotel 1 629 2017 March
## 6 City Hotel 1 629 2017 March
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
You can also find out the maximum and minimum lead times without
sorting the whole dataset using the arrange()
function. Try
it out using the max() and min() functions below:
max(hotel_bookings$lead_time)
## [1] 737
min(hotel_bookings$lead_time)
## [1] 0
Remember, in this case, you need to specify which dataset and which column using the $ symbol between their names. Try running the below to see what happens if you forget one of those pieces:
min(lead_time)
This is a common error that R users encounter. To correct this code chunk, you will need to add the data frame and the dollar sign in the appropriate places.
Now, let’s say you just want to know what the average lead time for
booking is because your boss asks you how early you should run
promotions for hotel rooms. You can use the mean()
function
to answer that question since the average of a set of number is also the
mean of the set of numbers:
mean(hotel_bookings$lead_time)
## [1] 104.0114
You should get the same answer even if you use the v2 dataset that
included the arrange()
function. This is because the
arrange()
function doesn’t change the values in the
dataset; it just re-arranges them.
mean(hotel_bookings_v2$lead_time)
## [1] 104.0114
What is the average lead time?
A: 100.0011
B: 45.0283
C: 14.0221
D: 104.0114
You were able to report to your boss what the average lead time before booking is, but now they want to know what the average lead time before booking is for just city hotels. They want to focus the promotion they’re running by targeting major cities.
You know that your first step will be creating a new dataset that
only contains data about city hotels. You can do that using the
filter()
function, and name your new data frame
‘hotel_bookings_city’:
hotel_bookings_city <-
filter(hotel_bookings, hotel_bookings$hotel=="City Hotel")
Check out your new dataset:
head(hotel_bookings_city)
## # A tibble: 6 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 City Hotel 0 6 2015 July
## 2 City Hotel 1 88 2015 July
## 3 City Hotel 1 65 2015 July
## 4 City Hotel 1 92 2015 July
## 5 City Hotel 1 100 2015 July
## 6 City Hotel 1 79 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
You quickly check what the average lead time for this set of hotels is, just like you did for all of hotels before:
mean(hotel_bookings_city$lead_time)
## [1] 109.7357
Now, your boss wants to know a lot more information about city
hotels, including the maximum and minimum lead time. They are also
interested in how they are different from resort hotels. You don’t want
to run each line of code over and over again, so you decide to use the
group_by()
andsummarize()
functions. You can
also use the pipe operator to make your code easier to follow. You will
store the new dataset in a data frame named ‘hotel_summary’:
hotel_summary <-
hotel_bookings %>%
group_by(hotel) %>%
summarise(average_lead_time=mean(lead_time),
min_lead_time=min(lead_time),
max_lead_time=max(lead_time))
Check out your new dataset using head() again:
head(hotel_summary)
## # A tibble: 2 × 4
## hotel average_lead_time min_lead_time max_lead_time
## <chr> <dbl> <dbl> <dbl>
## 1 City Hotel 110. 0 629
## 2 Resort Hotel 92.7 0 737
Being able to manipulate data is a key skill for working in
R
. After this activity, you should be more familiar with
functions that allow you to change your data, such as
arrange()
, group_by()
, and
filter()
. You also have some experience using statistical
summaries to make insights into your data. You can continue to practice
these skills by modifying the code chunks in the rmd file, or use this
code as a starting point in your own project console. As you practice,
consider how performing tasks is similar and different in R
compared to other tools you have learned throughout this program.