Course Description
Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions—it can also help with basic transformations of your data. You’ll also learn to aggregate your data and add, remove, or change the variables. Along the way, you’ll explore a dataset containing information about counties in the United States. You’ll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
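As a quick preview (a minimal sketch, not one of the exercises, assuming the counties data that the first exercise below loads; unemployed_people is just an illustrative column name), the four verbs can be chained together with the pipe:
counties %>%
  select(state, county, population, unemployment) %>%
  mutate(unemployed_people = population * unemployment / 100) %>%
  filter(population > 100000) %>%
  arrange(desc(unemployed_people))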
Take a look at the counties dataset using the glimpse() function. What is the first value in the income variable?
Select the following four columns from the counties dataset: state, county, population, and poverty. You don’t need to save the result to a variable.
# Load the tidyverse and read in the counties data
library(tidyverse)
counties <- readRDS("counties.rds")
counties %>%
# Select the columns
select(state, county, population, poverty)
Here you see the counties_selected dataset with a few interesting variables selected. These variables (private_work, public_work, and self_employed) describe whether people work for the government, for private companies, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
Add a verb to sort the observations of the public_work variable in descending order.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
counties_selected %>%
# Add a verb to sort in descending order of public_work
arrange(desc(public_work))
You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.
Find only the counties that have a population above one million (1000000).
Find only the counties in the state of California that also have a population above one million (1000000).
counties_selected <- counties %>%
select(state, county, population)
counties_selected %>%
# Filter for counties with a population above 1000000
filter(population > 1000000)
counties_selected <- counties %>%
select(state, county, population)
counties_selected %>%
# Filter for counties with a population above 1000000
filter(state == "California",
population > 1000000)
We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
Filter for counties in the state of Texas that have more than ten thousand people (10000), and sort them in descending order of the percentage of people employed in private work.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
counties_selected %>%
# Filter for Texas and more than 10000 people
filter(state == "Texas",
population > 10000) %>%
# Sort in descending order of private_work
arrange(desc(private_work))
In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.
The code provided already selects the state, county, population, and public_work columns.
Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.
Sort the new column in descending order.
counties_selected <- counties %>%
select(state, county, population, public_work)
counties_selected %>%
# Add a new column public_workers with the number of people employed in public work
mutate(public_workers = public_work * population / 100)
counties_selected %>%
mutate(public_workers = public_work * population / 100) %>%
# Sort in descending order of the public_workers column
arrange(desc(public_workers))
The dataset includes columns for the total number (not percentage) of
men and women in each county. You could use this, along with the
population
variable, to compute the fraction of men (or
women) within each county.
In this exercise, you’ll select the relevant columns yourself.
Select the columns state, county, population, men, and women.
Calculate proportion_women as the fraction of the county’s population made up of women.
counties_selected <- counties %>%
# Select the columns state, county, population, men, and women
select(state, county, population, men, women)
counties_selected %>%
# Calculate proportion_women as the fraction of the population made up of women
mutate(proportion_women = women / population)
In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter(), and arrange()) to find the counties with the highest proportion of men.
Select the columns state, county, population, men, and women.
Add the variable proportion_men with the fraction of the county’s population made up of men.
Filter for counties with a population of at least ten thousand (10000).
Arrange the table in descending order of the proportion of men.
counties %>%
# Select the five columns
select(state, county, population, men, women) %>%
# Add the proportion_men variable
mutate(proportion_men = men / population) %>%
# Filter for population of at least 10,000
filter(population >= 10000) %>%
# Arrange proportion of men in descending order
arrange(desc(proportion_men))
Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, slice_max, and slice_min.
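As a quick preview (a minimal sketch using the counties data from the previous chapter, not one of the exercises), the group_by() and summarize() pattern collapses many observations into one summary row per group:
counties %>%
  group_by(state) %>%
  summarize(total_population = sum(population))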
The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.
counties_selected <- counties %>%
select(county, region, state, population, citizens)
Use count() to find the number of counties in each region, using a second argument to sort in descending order.
counties_selected <- counties %>%
select(county, region, state, population, citizens)
# Use count to find the number of counties in each region
counties_selected %>%
count(region, sort = TRUE)
You can weight your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
select(county, region, state, population, citizens)
Count the number of counties in each state, weighted by the citizens column, and sorted in descending order.
# Find number of counties per state, weighted by citizens, sorted in descending order
counties_selected %>%
count(state, wt = citizens, sort = TRUE)
You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
select(county, region, state, population, walk)
Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county.
Use count() to find the total number of people who walk to work in each state.
counties_selected <- counties %>%
select(county, region, state, population, walk)
counties_selected %>%
# Add population_walk containing the total number of people who walk to work
mutate(population_walk = population * walk / 100) %>%
# Count weighted by the new column, sort in descending order
count(state, wt = population_walk, sort = TRUE)
The summarize()
verb is very useful for collapsing a
large dataset into a single observation.
counties_selected <- counties %>%
select(county, population, income, unemployment)
Summarize the counties dataset to find the following columns: min_population (with the smallest population), max_unemployment (with the maximum unemployment), and average_income (with the mean of the income variable).
counties_selected <- counties %>%
select(county, population, income, unemployment)
counties_selected %>%
# Summarize to find minimum population, maximum unemployment, and average income
summarize(min_population = min(population),
max_unemployment = max(unemployment),
average_income = mean(income))
Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).
counties_selected <- counties %>%
select(state, county, population, land_area)
Group the data by state, and summarize to create the columns total_area (with the total area in square miles) and total_population (with the total population).
Add a density column with the people per square mile, then arrange in descending order.
counties_selected <- counties %>%
select(state, county, population, land_area)
counties_selected %>%
# Group by state
group_by(state) %>%
# Find the total area and population
summarize(total_area = sum(land_area),
total_population = sum(population))
counties_selected %>%
group_by(state) %>%
summarize(total_area = sum(land_area),
total_population = sum(population)) %>%
# Add a density column
mutate(density = total_population / total_area) %>%
# Sort by density in descending order
arrange(desc(density))
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
select(region, state, county, population)
Summarize to find the total population, total_pop, in each combination of region and state, grouped in that order.
Notice the result is still grouped by region; use another summarize() step to calculate two new columns: the average state population in each region (average_pop) and the median state population in each region (median_pop).
counties_selected <- counties %>%
select(region, state, county, population)
counties_selected %>%
# Group and summarize to find the total population
group_by(region, state) %>%
summarize(total_pop = sum(population))
counties_selected %>%
# Group and summarize to find the total population
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>%
# Calculate the average_pop and median_pop columns
summarize(average_pop = mean(total_pop),
median_pop = median(total_pop))
Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count to find the total number of people who walk to work in each state.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
select(region, state, county, metro, population, walk)
counties_selected %>%
# Group by region
group_by(region) %>%
# Find the county with the highest percentage of people who walk to work
slice_max(walk, n = 1)
You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and slice_min() to find the state in each region with the lowest income.
When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
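A minimal sketch of this behavior (not part of the exercise), using dplyr’s group_vars() to inspect which grouping remains:
counties %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  group_vars()   # returns "region": the state grouping was peeled off, region remains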
counties_selected <- counties %>%
select(region, state, county, population, income)
Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you).
Find the state with the lowest average_income in each region.
counties_selected <- counties %>%
select(region, state, county, population, income)
counties_selected %>%
group_by(region, state) %>%
# Calculate average income
summarize(average_income = mean(income)) %>%
# Find the lowest income state in each region
slice_min(average_income, n = 1)
In this chapter, you’ve learned to use six dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), slice_max(), and slice_min(). In this exercise, you’ll combine them to answer a question:
In how many states do more people live in metro areas than non-metro areas?
Recall that the metro column has one of the two values "Metro" (for high-density city areas) or "Nonmetro" (for suburban and country areas).
counties_selected <- counties %>%
select(state, metro, population)
For each combination of state and metro, find the total population as total_pop.
Extract the most populated row from each state, which will be either Metro or Nonmetro.
Ungroup, then count how often Metro or Nonmetro appears to see how many states have more people living in those areas.
counties_selected <- counties %>%
select(state, metro, population)
counties_selected %>%
# Find the total population for each combination of state and metro
group_by(state, metro) %>%
summarize(total_pop = sum(population))
counties_selected %>%
# Find the total population for each combination of state and metro
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
# Extract the most populated row for each state
slice_max(total_pop, n = 1)
counties_selected %>%
# Find the total population for each combination of state and metro
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
# Extract the most populated row for each state
slice_max(total_pop, n = 1) %>%
# Count the states with more people in Metro or Nonmetro areas
ungroup() %>%
count(metro)
Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.
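As a quick preview (a minimal sketch, not one of the exercises), select helpers such as contains() choose columns by a pattern in their names; contains() also appears in the final exercise of this chapter:
counties %>%
  select(state, county, contains("work"))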
Using the select() verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for getting many columns at a time.
Use glimpse() to examine all the variables in the counties table.
Select the columns state, county, population, and (using a colon) all five of those industry-related variables; there are five consecutive variables in the table related to the industry of people’s work: professional, service, office, construction, and production.
Arrange the table in descending order of service to find which counties have the highest rates of working in the service industry.
# Glimpse the counties table
glimpse(counties)
counties %>%
# Select state, county, population, and industry-related columns
select(state, county, population, professional:production) %>%
# Arrange service in descending order
arrange(desc(service))
In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.
Select the columns state, county, population, and all those that end with work.
Filter for the counties that have at least 50% of people engaged in public work.
counties %>%
# Select the state, county, population, and those ending with "work"
select(state, county, population, ends_with("work")) %>%
# Filter for counties that have at least 50% of people engaged in public work
filter(public_work >= 50)
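For comparison, here is a minimal sketch of starts_with(), the helper mentioned above (this assumes the counties table has columns whose names begin with "income", such as the income column used earlier):
counties %>%
  select(state, county, starts_with("income"))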
The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the default n column generated from count() to something more descriptive.
Use count() to determine how many counties are in each state.
Notice the n column in the output; use rename() to rename that to num_counties.
counties %>%
# Count the number of counties in each state
count(state)
counties %>%
# Count the number of counties in each state
count(state) %>%
# Rename the n column to num_counties
rename(num_counties = n)
rename() isn’t the only way you can choose a new name for a column; you can also choose a name as part of a select().
Select the columns state, county, and poverty from the counties dataset; in the same step, rename the poverty column to poverty_rate.
counties %>%
# Select state, county, and poverty as poverty_rate
select(state, county, poverty_rate = poverty)
Which of the following verbs would you use to calculate new columns while dropping other columns?
As you learned in the video, the transmute
verb allows
you to control which variables you keep, which variables you calculate,
and which variables you drop.
Keep only the state, county, and population columns, and add a new column, density, that contains the population per land_area.
Filter for counties with a population greater than one million.
Sort the table in ascending order of density.
counties %>%
# Keep the state, county, and population columns, and add a density column
transmute(state, county, population, density = population / land_area) %>%
# Filter for counties with a population greater than one million
filter(population > 1000000) %>%
# Sort density in ascending order
arrange(density)
We’ve learned a number of new verbs in this chapter that you can use to modify and change the variables you have. Here is how the four verbs compare:
select(): must mention all the columns you keep; doesn’t allow you to calculate or change values.
rename(): leaves the columns you don’t mention alone; doesn’t allow you to calculate or change values.
transmute(): must mention all the columns you keep; allows you to calculate or change values.
mutate(): leaves the columns you don’t mention alone; allows you to calculate or change values.
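A minimal sketch of the last two rows of that comparison, reusing the density calculation from the previous exercise (not a new exercise step):
# mutate() keeps every existing column and adds density
counties %>%
  mutate(density = population / land_area)
# transmute() keeps only the columns you name
counties %>%
  transmute(state, county, density = population / land_area)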
In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
Choose the right verb to rename the unemployment column to unemployment_rate.
Choose the right verb to keep only the columns state, county, and the ones containing poverty.
Calculate a new column called fraction_women with the fraction of the population made up of women, without dropping any other columns.
Keep only three columns: the state, county, and employed / population, which you’ll call employment_rate.
# Change the name of the unemployment column
counties %>%
rename(unemployment_rate = unemployment)
# Keep the state and county columns, and the columns containing poverty
counties %>%
select(state, county, contains("poverty"))
# Calculate the fraction_women column without dropping the other columns
counties %>%
mutate(fraction_women = women / population)
# Keep only the state, county, and employment_rate columns
counties %>%
transmute(state, county, employment_rate = employed / population)
Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.
The dplyr
verbs you’ve learned are useful for exploring
data. For instance, you could find out the most common names in a
particular year.
# Load the babynames data
babynames <- readRDS("babynames.rds")
babynames %>%
# Filter for the year 1990
filter(year == 1990) %>%
# Sort the number column in descending order
arrange(desc(number))
You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by() and slice_max() to find the most common name in every year.
Use group_by() and slice_max() to find the most common name for US babies in each year.
babynames %>%
# Find the most common name in each year
group_by(year) %>%
slice_max(number, n = 1)
The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.
Filter for only the names Steven, Thomas, and Matthew, and assign the result to an object called selected_names.
Visualize those names as a line plot over time, using a different color for each name.
selected_names <- babynames %>%
# Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew"))
selected_names <- babynames %>%
# Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with slice_max().
First, calculate the total number of people born in each year as year_total.
Next, use year_total to calculate the fraction of people born in each year that have each name.
Then use your new fraction column, in combination with slice_max(), to identify the year each name was most common.
# Calculate the fraction of people born each year with the same name
babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
# Calculate the fraction of people born each year with the same name
babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
slice_max(fraction, n = 1)
In the video, you learned how you could group by the year and use mutate() to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutate() slower.
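A minimal sketch of that point (not part of the exercise): dplyr’s n_groups() shows how many groups remain after a grouped mutate, and how ungroup() drops them.
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number)) %>%
  n_groups()   # one group per distinct name (about 48,000, per the note above)
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number)) %>%
  ungroup() %>%
  n_groups()   # 1: the grouping has been dropped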
Add two columns for each name: name_total, the sum of the number of babies born with that name in the entire dataset, and name_max, the maximum number of babies born with that name in any year.
Ungroup the table.
Add a column called fraction_max, containing the number in the year divided by name_max.
babynames %>%
# Add columns name_total and name_max for each name
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number))
babynames %>%
# Add columns name_total and name_max for each name
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
# Ungroup the table
ungroup() %>%
# Add the fraction_max column containing the number by the name maximum
mutate(fraction_max = number / name_max)
You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.
In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.
names_normalized <- babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
Filter the names_normalized table to limit it to the three names Steven, Thomas, and Matthew.
Visualize fraction_max over time, colored by name.
names_normalized <- babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
names_filtered <- names_normalized %>%
# Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
geom_line()
In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
You’ll start with the babynames_fraction data, so that you can consider the popularity of each name within each year.
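Before the exercise, a minimal sketch (not from the course) of how dplyr’s lag() works: it shifts a vector down by one position, so dividing each value by its lag compares it to the previous value.
x <- c(2, 4, 8, 8)
lag(x)       # NA 2 4 8
x / lag(x)   # NA 2 2 1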
Arrange the data in order of name, then year.
Group the data by name.
Add a column called ratio, containing the ratio (not difference) of fraction between each year.
babynames_fraction <- babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
babynames_fraction %>%
# Arrange the data in order of name, then year
arrange(name, year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio of fraction between each year
mutate(ratio = fraction / lag(fraction))
Previously, you added a ratio column, containing the ratio of the frequency of a baby name between consecutive years, to describe changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity between consecutive years.
babynames_ratios_filtered <- babynames_fraction %>%
arrange(name, year) %>%
group_by(name) %>%
mutate(ratio = fraction / lag(fraction)) %>%
filter(fraction >= 0.00001)
From each name in the data, extract the row with the largest ratio; note the data is already grouped by name.
Sort the ratio column in descending order.
Filter the babynames_ratios_filtered data further, keeping only rows where fraction is greater than or equal to 0.001.
babynames_ratios_filtered <- babynames_fraction %>%
arrange(name, year) %>%
group_by(name) %>%
mutate(ratio = fraction / lag(fraction)) %>%
filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
# Extract the largest ratio from each name
slice_max(ratio, n = 1) %>%
# Sort the ratio column in descending order
arrange(desc(ratio)) %>%
# Filter for fractions greater than or equal to 0.001
filter(fraction >= 0.001)
Congratulations! You’ve made it to the end of the course.
In this course, you’ve learned how to transform data with the dplyr package. You reviewed the select, filter, mutate, and arrange verbs for transforming data. You mastered the count verb, and learned the powerful group_by and summarize pattern for aggregating data.
You learned four verbs you can use for adding, removing, and changing variables in a table: select, rename, transmute, and mutate.
You also applied many of these tools to a new babynames dataset, and gained experience with grouped mutates and window functions.
You’ll find all of these skills valuable in your own projects, as well as in countless other DataCamp courses, including more intermediate and advanced dplyr courses, and courses that combine dplyr with other Tidyverse packages.
Congratulations again, and best of luck using your new-found dplyr skills in your own analyses!