Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr, which also helps with basic transformations of your data. You’ll also learn to aggregate your data and to add, remove, or change variables. Along the way, you’ll explore a dataset containing information about counties in the United States, and you’ll finish the course by applying these tools to the babynames dataset to explore trends in baby names.
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
Selecting columns
Select the following four columns from the counties variable:
You don’t need to save the result to a variable.
# Load packages
library(dplyr)
library(ggplot2)
counties <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/04 Data Manipulation with dplyr/counties.rds")
glimpse(counties)
## Rows: 3,138
## Columns: 40
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", ...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount...
## $ region <chr> "South", "South", "South", "South", "South", "So...
## $ metro <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", ...
## $ population <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err <dbl> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <dbl> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...
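The exercise’s select() call isn’t shown above. Here’s a minimal sketch, assuming the four requested columns are state, county, population, and poverty (any four columns from the glimpse() output work the same way):

counties %>%
  # Select four columns of interest
  select(state, county, population, poverty)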
Arranging observations
Here you see the counties_selected dataset with a few interesting variables selected. The variables private_work, public_work, and self_employed describe whether people work for private companies, for the government, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
# Add a verb to sort in descending order of public_work
counties_selected %>%
arrange(desc(public_work))

Filtering for conditions
You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.
counties_selected <- counties %>%
select(state, county, population)
# Filter for counties with a population above 1000000
counties_selected %>%
filter(population > 1000000)

# Filter for counties in the state of California that have a population above 1000000
counties_selected %>%
filter(state == "California", population > 1000000)

Filtering and arranging
We’re often interested in both filtering and sorting a dataset, to focus on the observations of particular interest. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
filter(state == "Texas", population > 10000) %>%
arrange(desc(private_work))

Calculating the number of government employees
You can use the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.
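As a quick illustration of the unemployment calculation described above (a sketch only; unemployed_people is a hypothetical column name, not part of the exercise):

counties %>%
  # unemployment is a percentage, so divide by 100
  mutate(unemployed_people = unemployment * population / 100)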
counties_selected <- counties %>%
select(state, county, population, public_work)
# Add a new column public_workers with the number of people employed in public work
# Sort in descending order of the public_workers column
counties_selected %>%
mutate(public_workers = public_work * population / 100) %>%
arrange(desc(public_workers))

Calculating the percentage of women in a county
The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.
# Select the columns state, county, population, men, and women
counties_selected <- counties %>%
select(state, county, population, men, women)
# Calculate proportion_women as the fraction of the population made up of women
counties_selected %>%
mutate(proportion_women = women / population)

Notice that the proportion_women variable was added as a column to the counties_selected dataset, and the data now has 6 columns instead of 5.
Select, mutate, filter, and arrange
In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.
counties %>%
# Select the five columns
select(state, county, population, men, women) %>%
# Add the proportion_men variable
mutate(proportion_men = men / population) %>%
# Filter for population of at least 10,000
filter(population >= 10000) %>%
# Arrange proportion of men in descending order
arrange(desc(proportion_men))

Notice that Sussex County in Virginia is more than two-thirds male: this is because of two men’s prisons in the county.
Now that you know how to transform your data, you’ll want to know how to aggregate it to make it more interpretable. You’ll learn a number of functions you can use to take many observations and summarize them, including count, group_by, summarize, ungroup, and top_n.
Counting by region
The counties dataset contains columns for region, state, population, and the number of citizens. In this exercise, you’ll focus on the region column.
counties_selected <- counties %>%
select(region, state, population, citizens)

# Use count to find the number of counties in each region
counties_selected %>%
count(region, sort = TRUE)

Since the results have been arranged, you can see that the South has the greatest number of counties.
Counting citizens by state
You can weight your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
select(region, state, population, citizens)
# Find the number of citizens in each state by weighting the count by citizens
counties_selected %>%
count(state, wt = citizens, sort = TRUE)

Mutating and counting
You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
select(region, state, population, walk)
counties_selected %>%
# Add population_walk containing the total number of people who walk to work
mutate(population_walk = population * walk / 100) %>%
# Count weighted by the new column
count(state, wt = population_walk, sort = TRUE)

We can see that while California has the largest total population, New York State has the largest number of people who walk to work.
Summarizing
The summarize() verb is very useful for collapsing a large dataset into a single observation.
counties_selected <- counties %>%
select(county, population, income, unemployment)
# Summarize to find minimum population, maximum unemployment, and average income
counties_selected %>%
summarize(min_population = min(population),
          max_unemployment = max(unemployment),
          average_income = mean(income))

If we wanted to take this a step further, we could use filter() to determine the specific counties that returned the value for min_population and max_unemployment.
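A minimal sketch of that follow-up, assuming we want the rows that hit those extremes:

counties_selected %>%
  # Keep the county with the smallest population or the highest unemployment
  filter(population == min(population) | unemployment == max(unemployment))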
Summarizing by state
Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the goal of finding the density (in people per square mile).
counties_selected <- counties %>%
select(state, county, population, land_area)
# Group by state and find the total area and population
# Add a density column, then sort in descending order
counties_selected %>%
group_by(state) %>%
summarize(total_area = sum(land_area),
total_population = sum(population)) %>%
mutate(density = total_population / total_area) %>%
arrange(desc(density))

## `summarise()` ungrouping output (override with `.groups` argument)
Looks like New Jersey and Rhode Island are the “most crowded” of the US states, with more than a thousand people per square mile.
Summarizing by state and region
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
select(region, state, county, population)
# Calculate the average_pop and median_pop columns
counties_selected %>%
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>%
summarize(average_pop = mean(total_pop),
          median_pop = median(total_pop))

## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
It looks like the South has the highest average_pop of 7370486, while the North Central region has the highest median_pop of 5580644.
Selecting a county from each region
Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count the total number of people who walk to work in each state.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
select(region, state, county, metro, population, walk)
df <- data.frame(x = c(6, 4, 1, 10, 3, 1, 1))
# Selecting the top two highest values
df %>% top_n(2)

## Selecting by x
# Group by region and find the greatest number of citizens who walk to work
counties_selected %>%
group_by(region) %>%
top_n(1)

## Selecting by walk
Notice that three of the places where lots of people walk to work are low-population nonmetro counties, but New York City also pops up!
Finding the highest-income state in each region
You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.
When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
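A minimal sketch of this “peeling” behavior, using dplyr’s group_vars() to inspect the groups that remain after summarize() (an illustration only, not part of the exercise):

counties %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  # Only the first grouping variable is left
  group_vars()
## [1] "region"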
counties_selected <- counties %>%
select(region, state, county, population, income)
counties_selected %>%
group_by(region, state) %>%
# Calculate average income
summarize(average_income = mean(income)) %>%
# Find the highest income state in each region
top_n(1)

## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## Selecting by average_income
From our results, we can see that New Jersey, in the Northeast, is the state with the highest average_income of 73014.
Using summarize, top_n, and count together
In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question:
In how many states do more people live in metro areas than non-metro areas?
Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).
counties_selected <- counties %>%
select(state, metro, population)
# Count the states with more people in Metro or Nonmetro areas
counties_selected %>%
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
top_n(1, total_pop) %>%
ungroup() %>%
count(metro)

## `summarise()` regrouping output by 'state' (override with `.groups` argument)
Notice that 44 states have more people living in Metro areas, and 6 states have more people living in Nonmetro areas.
Learn advanced methods to select and transform columns. You’ll also learn about select helpers, which are functions that specify criteria for the columns you want to choose, as well as the rename and transmute verbs.
Selecting columns
Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for selecting many columns at a time.
counties %>%
# Select state, county, population, and industry-related columns
select(state, county, population, professional:production) %>%
# Arrange service in descending order
arrange(desc(service))

Notice that when you select a group of related variables, it’s easy to find the insights you’re looking for.
Select helpers
starts_with() and ends_with() find the columns that start or end with a particular string.
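For instance, a quick sketch using starts_with() on the income-related columns (not part of the exercise):

counties %>%
  # Select the columns whose names start with "income"
  select(state, county, starts_with("income"))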
counties %>%
# Select the state, county, population, and those ending with "work"
select(state, county, population, ends_with("work")) %>%
# Filter for counties that have at least 50% of people engaged in public work
filter(public_work >= 50)

It looks like only a few counties have more than half the population working for the government.
Renaming a column after count
The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.
# Renaming a column
x <- c(12, 45, 7, 90)
y <- c("facebook", "youtube", "twitter", "linkedin")
df <- data.frame(x, y)
df <- rename(df, page_views = x, social_media = y)
df

Notice how the column names change from x and y to page_views and social_media. Don’t forget, using rename() isn’t the only way to choose a new name for a column!
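Applying the same idea to the counties data, here’s a sketch of renaming the n column that count() produces (num_counties is an illustrative name):

counties %>%
  count(state) %>%
  # Give the count column a more descriptive name
  rename(num_counties = n)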
Renaming a column as part of a select
rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().
# Select state, county, and poverty as poverty_rate
counties %>%
select(state, county, poverty_rate = poverty)

As you can see, we were able to select three columns of interest from our dataset, and rename one of them, using only the select() verb!
Using transmute
The transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
counties %>%
# Keep the state, county, and population columns, and add a density column
transmute(state, county, population, density = population / land_area) %>%
# Filter for counties with a population greater than one million
filter(population > 1000000) %>%
# Sort density in ascending order
arrange(density)

Looks like San Bernardino is the lowest-density county among those with a population above one million.
Choosing among the four verbs
In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
# Keep the state and county columns, and the columns containing poverty
counties %>%
select(state, county, contains("poverty"))

# Calculate the fraction_women column without dropping the other columns
counties %>%
mutate(fraction_women = women / population)

# Keep only the state, county, and employment_rate columns
counties %>%
transmute(state, county, employment_rate = employed / population)

Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data, and use a combination of dplyr and ggplot2 to make interesting graphs that explore it further.
Filtering and arranging for one year
The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.
babynames <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/04 Data Manipulation with dplyr/babynames.rds")
babynames %>%
# Filter for the year 1990
filter(year == 1990) %>%
# Sort the number column in descending order
arrange(desc(number))

It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.
Using top_n with babynames
You saw that you could use filter() and arrange() to find the most common names in one year. However, you can also use group_by() and top_n() to find the most common name in every year.
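The code for this step isn’t shown here; below is a minimal sketch consistent with the “Selecting by number” message that follows, assuming top_n() falls back to the last column (number) when no weighting column is given:

babynames %>%
  # Find the most common name in each year
  group_by(year) %>%
  top_n(1)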
## Selecting by number
It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
Visualizing names with ggplot2
The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.
# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()

It looks like names like Steven and Thomas were common in the 1950s, but Matthew became common more recently.
Finding the year each name is most common
Earlier, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with a top_n.
# Calculate the fraction of people born each year with the same name
babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
top_n(1, fraction)

Adding the total and maximum for each name
You have learned how you could group by the year and use mutate() to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which makes later steps like mutate() slower.
babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
# Ungroup the table
ungroup() %>%
# Add the fraction_max column containing the number divided by the name maximum
mutate(fraction_max = number / name_max)

This tells you, for example, that the name Abe was at 18.5% of its peak in the year 1880.
Visualizing the normalized change in popularity
You picked a few names and calculated each of them as a fraction of their peak. This is a way of “normalizing” a name, focusing on the relative change within each name rather than the overall popularity of the name.
In this exercise, you’ll visualize the normalized popularity of each name.
names_normalized <- babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
geom_line()As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
Using ratios to describe the frequency of a name
Previously, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
You’ll start with the babynames_fraction data, so that you can consider the popularity of each name within each year.
# The lag() function shifts the elements of a vector by one position. This can be used to calculate the difference between consecutive values
x <- c(1, 2, 6, 9)
lag(x)
## [1] NA 1 2 6

x - lag(x)
## [1] NA 1 4 3
babynames_fraction <- babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
babynames_fraction %>%
# Arrange the data in order of name, then year
arrange(name, year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio between each year
mutate(ratio = fraction / lag(fraction))

Notice that the first observation for each name is missing a ratio, since there is no previous year.
Biggest jumps in a name
Previously, you added a ratio column describing the ratio of a baby name’s frequency between consecutive years, to capture changes in the name’s popularity. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to find the names that experienced the biggest jumps in popularity between consecutive years.
babynames_ratios_filtered <- babynames_fraction %>%
arrange(name, year) %>%
group_by(name) %>%
mutate(ratio = fraction / lag(fraction)) %>%
filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
# Extract the largest ratio from each name
top_n(1, ratio) %>%
# Sort the ratio column in descending order
arrange(desc(ratio)) %>%
# Filter for fractions greater than or equal to 0.001
filter(fraction >= 0.001)

Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.