Data Manipulation with dplyr
Instructor: Chris Cardillo-Datacamp
0.1 Introducion
This course has some similarities to other DataCamp courses, especially Introduction to the Tidyverse, and if you’ve taken those courses, this material might be review.
1 Transforming Data with dplyr
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
1.1 The counties dataset
1.1.1 Understanding your data
Take a look at the counties
dataset using the glimpse()
function.
What is the first value in the income
variable?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
<- read_csv("acs2017_county_data.csv") counties
## Rows: 3220 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): State, County
## dbl (35): CountyId, TotalPop, Men, Women, Hispanic, White, Black, Native, As...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(counties)
## Rows: 3,220
## Columns: 37
## $ CountyId <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017,…
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ County <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ TotalPop <dbl> 55036, 203360, 26201, 22580, 57667, 10478, 20126, 115…
## $ Men <dbl> 26899, 99527, 13976, 12251, 28490, 5616, 9416, 55593,…
## $ Women <dbl> 28137, 103833, 12225, 10329, 29177, 4862, 10710, 5993…
## $ Hispanic <dbl> 2.7, 4.4, 4.2, 2.4, 9.0, 0.3, 0.3, 3.6, 2.2, 1.6, 7.7…
## $ White <dbl> 75.4, 83.1, 45.7, 74.6, 87.4, 21.6, 52.2, 72.7, 56.2,…
## $ Black <dbl> 18.9, 9.5, 47.8, 22.0, 1.5, 75.6, 44.7, 20.4, 39.3, 5…
## $ Native <dbl> 0.3, 0.8, 0.2, 0.4, 0.3, 1.0, 0.1, 0.2, 0.3, 0.5, 0.4…
## $ Asian <dbl> 0.9, 0.7, 0.6, 0.0, 0.1, 0.7, 1.1, 1.0, 1.0, 0.1, 0.4…
## $ Pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ VotingAgeCitizen <dbl> 41016, 155376, 20269, 17662, 42513, 8212, 15459, 8838…
## $ Income <dbl> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 4368…
## $ IncomeErr <dbl> 2838, 1348, 2551, 3431, 2630, 5376, 2701, 1491, 2011,…
## $ IncomePerCap <dbl> 27824, 29364, 17561, 20911, 22021, 20856, 19004, 2363…
## $ IncomePerCapErr <dbl> 2024, 735, 798, 1889, 850, 2355, 943, 793, 1205, 1354…
## $ Poverty <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8,…
## $ ChildPoverty <dbl> 20.1, 16.1, 44.9, 26.6, 25.4, 50.4, 34.8, 26.6, 29.1,…
## $ Professional <dbl> 35.3, 35.7, 25.0, 24.4, 28.5, 19.7, 26.9, 29.0, 24.3,…
## $ Service <dbl> 18.0, 18.2, 16.8, 17.6, 12.9, 17.1, 17.3, 17.5, 13.5,…
## $ Office <dbl> 23.2, 25.6, 22.6, 19.7, 23.3, 18.6, 18.5, 23.7, 23.0,…
## $ Construction <dbl> 8.1, 9.7, 11.5, 15.9, 15.8, 14.0, 11.6, 10.4, 11.6, 1…
## $ Production <dbl> 15.4, 10.8, 24.1, 22.4, 19.5, 30.6, 25.7, 19.4, 27.6,…
## $ Drive <dbl> 86.0, 84.7, 83.4, 86.4, 86.8, 73.1, 83.6, 85.0, 87.1,…
## $ Carpool <dbl> 9.6, 7.6, 11.1, 9.5, 10.2, 15.7, 12.6, 9.2, 9.7, 12.1…
## $ Transit <dbl> 0.1, 0.1, 0.3, 0.7, 0.1, 0.3, 0.0, 0.2, 0.2, 0.4, 0.1…
## $ Walk <dbl> 0.6, 0.8, 2.2, 0.3, 0.4, 6.2, 0.9, 1.3, 0.6, 0.3, 0.6…
## $ OtherTransp <dbl> 1.3, 1.1, 1.7, 1.7, 0.4, 1.7, 0.9, 1.1, 0.5, 0.3, 1.8…
## $ WorkAtHome <dbl> 2.5, 5.6, 1.3, 1.5, 2.1, 3.0, 2.0, 3.2, 2.0, 2.0, 1.7…
## $ MeanCommute <dbl> 25.8, 27.0, 23.4, 30.0, 35.0, 29.8, 23.2, 24.8, 23.6,…
## $ Employed <dbl> 24112, 89527, 8878, 8171, 21380, 4290, 7727, 47392, 1…
## $ PrivateWork <dbl> 74.1, 80.7, 74.1, 76.0, 83.9, 81.4, 79.1, 74.9, 84.5,…
## $ PublicWork <dbl> 20.2, 12.9, 19.1, 17.4, 11.9, 13.6, 15.3, 19.9, 11.8,…
## $ SelfEmployed <dbl> 5.6, 6.3, 6.5, 6.3, 4.0, 5.0, 5.3, 5.1, 3.7, 8.1, 4.5…
## $ FamilyWork <dbl> 0.1, 0.1, 0.3, 0.3, 0.1, 0.0, 0.3, 0.1, 0.0, 0.0, 0.4…
## $ Unemployment <dbl> 5.2, 5.5, 12.4, 8.2, 4.9, 12.1, 7.6, 10.1, 6.4, 5.3, …
1.1.2 Selecting calumns
Select the following four columns from the counties
variable: * state
* county
* population
* poverty
You don’t need to save the result to a variable.
counties %>%
select("state", "county", "population", "poverty")
1.2 The filter and arrange verbs
1.2.1 Arranging observations
Here you see the counties_selected
dataset with a few interesting variables selected. These variables: private_work
, public_work
, self_employed
describe whether people work for the government, for private companies, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
<- counties %>%
counties_selected select(State, County, TotalPop, PrivateWork, PublicWork, SelfEmployed, Walk)
# Add a verb to sort in descending order of public_work
%>%
counties_selected arrange(desc(PublicWork))
## # A tibble: 3,220 × 7
## State County TotalPop PrivateWork PublicWork SelfEmployed Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South Dakota Oglala Lakot… 14291 31.1 64.8 4.1 11.9
## 2 Wisconsin Menominee Co… 4506 34.3 62.3 3.2 4.6
## 3 Hawaii Kalawao Coun… 86 34.9 61.9 3.2 40.6
## 4 North Dakota Sioux County 4420 34.3 58.3 7.1 5.5
## 5 Alaska Kusilvak Cen… 8129 41 57.8 1.1 45.5
## 6 South Dakota Buffalo Coun… 2048 38.9 57.7 2.2 7
## 7 South Dakota Todd County 10016 34.4 57.4 6.8 5.7
## 8 Alaska Yukon-Koyuku… 5453 38.8 56.5 4.7 39.8
## 9 Alaska Lake and Pen… 1301 40.4 54.2 5 36.6
## 10 South Dakota Dewey County 5709 33.7 50.2 14.8 2.9
## # … with 3,210 more rows
1.2.2 Filtering for conditions
You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.
# Filter for counties with a population above 1000000
%>%
counties_selected filter(TotalPop > 1000000)
## # A tibble: 44 × 7
## State County TotalPop PrivateWork PublicWork SelfEmployed Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arizona Maricopa County 4155501 82.8 11.1 5.9 1.5
## 2 Arizona Pima County 1007257 75.7 17.6 6.5 2.3
## 3 California Alameda County 1629615 79.3 13.2 7.3 3.7
## 4 California Contra Costa C… 1123678 78 13.4 8.4 1.7
## 5 California Los Angeles Co… 10105722 79.3 11.2 9.3 2.7
## 6 California Orange County 3155816 82 10.1 7.7 1.9
## 7 California Riverside Coun… 2355002 77.7 14.7 7.5 1.6
## 8 California Sacramento Cou… 1495400 71.5 21.3 6.9 1.9
## 9 California San Bernardino… 2121220 77.3 16.2 6.4 1.7
## 10 California San Diego Coun… 3283665 77.8 14.2 7.8 2.9
## # … with 34 more rows
# Filter for counties in the state of California that have a population above 1000000
%>%
counties_selected filter(State == "California",
> 1000000) TotalPop
## # A tibble: 9 × 7
## State County TotalPop PrivateWork PublicWork SelfEmployed Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California Alameda County 1629615 79.3 13.2 7.3 3.7
## 2 California Contra Costa Co… 1123678 78 13.4 8.4 1.7
## 3 California Los Angeles Cou… 10105722 79.3 11.2 9.3 2.7
## 4 California Orange County 3155816 82 10.1 7.7 1.9
## 5 California Riverside County 2355002 77.7 14.7 7.5 1.6
## 6 California Sacramento Coun… 1495400 71.5 21.3 6.9 1.9
## 7 California San Bernardino … 2121220 77.3 16.2 6.4 1.7
## 8 California San Diego County 3283665 77.8 14.2 7.8 2.9
## 9 California Santa Clara Cou… 1911226 84.7 9.3 5.9 2.1
Now you know that there are 9 counties in the state of California with a population greater than one million. In the next exercise, you’ll practice filtering and then sorting a dataset to focus on specific observations!
1.2.3 Filtering and arranging
We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
# Filter for Texas and more than 10000 people; sort in descending order of private_work
%>%
counties_selected filter(State == "Texas", TotalPop > 10000) %>%
arrange(desc(PrivateWork))
## # A tibble: 168 × 7
## State County TotalPop PrivateWork PublicWork SelfEmployed Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Texas Andrews County 17577 85.3 8.7 5.7 0
## 2 Texas Collin County 914075 84.3 9.7 5.9 0.8
## 3 Texas Dallas County 2552213 84.3 9.1 6.4 1.5
## 4 Texas Gregg County 123402 84.3 10.1 5.5 2
## 5 Texas Calhoun County 21821 83.8 10.5 5.7 0.9
## 6 Texas Titus County 32664 83.5 9.9 6.5 0.6
## 7 Texas Harris County 4525519 83.4 9.8 6.6 1.5
## 8 Texas Jefferson County 254574 83 13 3.9 0.7
## 9 Texas Tarrant County 1983675 83 11.3 5.6 1.2
## 10 Texas Midland County 159883 82.9 9.7 7.3 0.8
## # … with 158 more rows
You’ve learned how to filter and sort a dataset to answer questions about the data. Notice that you only need to slightly modify your code if you are interested in sorting the observations by a different column.
1.3 Mutate
1.3.1 Calculating the number of government employees
Use mutate()
to add a column called public_workers
to the dataset, with the number of people employed in public (government) work.
# Add a new column public_workers with the number of people employed in public work
%>%
counties_selected mutate(public_workers = PublicWork * TotalPop / 100) %>%
arrange(desc(public_workers))
## # A tibble: 3,220 × 8
## State County TotalPop PrivateWork PublicWork SelfEmployed Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California Los Angeles Co… 10105722 79.3 11.2 9.3 2.7
## 2 Illinois Cook County 5238541 84.2 11.1 4.6 4.4
## 3 California San Diego Coun… 3283665 77.8 14.2 7.8 2.9
## 4 Arizona Maricopa County 4155501 82.8 11.1 5.9 1.5
## 5 Texas Harris County 4525519 83.4 9.8 6.6 1.5
## 6 New York Kings County 2635121 79.4 13.9 6.6 8.6
## 7 California Riverside Coun… 2355002 77.7 14.7 7.5 1.6
## 8 California San Bernardino… 2121220 77.3 16.2 6.4 1.7
## 9 New York Queens County 2339280 80.1 13.7 6.1 5.8
## 10 California Orange County 3155816 82 10.1 7.7 1.9
## # … with 3,210 more rows, and 1 more variable: public_workers <dbl>
It looks like Los Angeles is the county with the most government employees.
1.3.2 Calculating the percentage of women in a county
The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population
variable, to compute the fraction of men (or women) within each county.
# Select the columns state, county, population, men, and women
<- counties %>%
counties_selected select(State, County, TotalPop, Men, Women, Walk)
# Calculate proportion_women as the fraction of the population made up of women
%>%
counties_selected mutate(proportion_women = Women/TotalPop)
## # A tibble: 3,220 × 7
## State County TotalPop Men Women Walk proportion_women
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Autauga County 55036 26899 28137 0.6 0.511
## 2 Alabama Baldwin County 203360 99527 103833 0.8 0.511
## 3 Alabama Barbour County 26201 13976 12225 2.2 0.467
## 4 Alabama Bibb County 22580 12251 10329 0.3 0.457
## 5 Alabama Blount County 57667 28490 29177 0.4 0.506
## 6 Alabama Bullock County 10478 5616 4862 6.2 0.464
## 7 Alabama Butler County 20126 9416 10710 0.9 0.532
## 8 Alabama Calhoun County 115527 55593 59934 1.3 0.519
## 9 Alabama Chambers County 33895 16320 17575 0.6 0.519
## 10 Alabama Cherokee County 25855 12862 12993 0.3 0.503
## # … with 3,210 more rows
Notice that the
proportion_women
variable was added as a column to thecounties_selected
dataset, and the data now has 6 columns instead of 5.
1.3.3 Select, mutate, filter, and arrange
In this exercise, you’ll put together everything you’ve learned in this chapter (select()
, mutate()
, filter()
and arrange()
), to find the counties with the highest proportion of men.
%>%
counties # Select the five columns
select(State, County, TotalPop, Men, Women) %>%
# Add the proportion_men variable
mutate(proportion_men = Men/TotalPop) %>%
# Filter for population of at least 10,000
filter(TotalPop >= 10000) %>%
# Arrange proportion of men in descending order
arrange(desc(proportion_men))
## # A tibble: 2,509 × 6
## State County TotalPop Men Women proportion_men
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Georgia Chattahoochee County 11096 7455 3641 0.672
## 2 Louisiana West Feliciana Parish 15376 10210 5166 0.664
## 3 California Lassen County 31470 20786 10684 0.661
## 4 Virginia Sussex County 11595 7552 4043 0.651
## 5 Florida Union County 15300 9922 5378 0.648
## 6 Missouri DeKalb County 12564 8007 4557 0.637
## 7 Texas Jones County 19969 12658 7311 0.634
## 8 Virginia Greensville County 11606 7238 4368 0.624
## 9 Arkansas Lincoln County 13885 8643 5242 0.622
## 10 Texas Madison County 13979 8649 5330 0.619
## # … with 2,499 more rows
Notice Sussex County in Virginia is more than two thirds male: this is because of two men’s prisons in the county.
2 Aggregating Data
Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.
2.1 The count verb
2.1.1 Counting by region
The counties
dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected
table. In this exercise, you’ll focus on the region
column.
counties_selected <- counties %>%
select(county, region, state, population, citizens)
# Use count to find the number of counties in each state
%>%
counties_selected count(State, sort = TRUE)
## # A tibble: 52 × 2
## State n
## <chr> <int>
## 1 Texas 254
## 2 Georgia 159
## 3 Virginia 133
## 4 Kentucky 120
## 5 Missouri 115
## 6 Kansas 105
## 7 Illinois 102
## 8 North Carolina 100
## 9 Iowa 99
## 10 Tennessee 95
## # … with 42 more rows
2.1.2 Counting citizens by state
You can weigh your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
select(county, region, state, population, citizens)
# Find number of counties per state, weighted by population
%>%
counties_selected count(State, wt=TotalPop, sort = TRUE)
## # A tibble: 52 × 2
## State n
## <chr> <dbl>
## 1 California 38982847
## 2 Texas 27419612
## 3 Florida 20278447
## 4 New York 19798228
## 5 Illinois 12854526
## 6 Pennsylvania 12790505
## 7 Ohio 11609756
## 8 Georgia 10201635
## 9 North Carolina 10052564
## 10 Michigan 9925568
## # … with 42 more rows
2.1.3 Mutating and counting
You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk
column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
select(county, region, state, population, walk)
%>%
counties_selected # Add population_walk containing the total number of people who walk to work
mutate(population_walk = Walk * TotalPop/100) %>%
# Count weighted by the new column
count(State, wt = population_walk, sort = TRUE)
## # A tibble: 52 × 2
## State n
## <chr> <dbl>
## 1 New York 1214914.
## 2 California 1008788.
## 3 Pennsylvania 492609.
## 4 Texas 427780.
## 5 Illinois 391265.
## 6 Massachusetts 323859.
## 7 Florida 291701.
## 8 New Jersey 266857.
## 9 Ohio 260880.
## 10 Washington 250556.
## # … with 42 more rows
We can see that while California had the largest total population, New York state has the largest number of people who walk to work.
2.2 The group by, summarize and ungroup verbs
2.2.1 Summarizing
The summarize()
verb is very useful for collapsing a large dataset into a single observation.
counties_selected <- counties %>%
select(county, population, income, unemployment)
# Summarize to find minimum population, maximum unemployment, and average income
%>%
counties summarize(min_population = min(TotalPop), max_unemployment = max(Unemployment), average_income = mean(Income))
## # A tibble: 1 × 3
## min_population max_unemployment average_income
## <dbl> <dbl> <dbl>
## 1 74 40.9 48995.
If we wanted to take this a step further, we could use filter()
to determine the specific counties that returned the value for min_population
and max_unemployment.
2.2.2 Summarizing by state
We can see for example the percentage of men per state.
%>%
counties_selected group_by(State) %>%
summarize(total_men = sum(Men),
total_population = sum(TotalPop)) %>%
mutate(percentage_men = total_men / total_population) %>%
arrange(desc(percentage_men))
## # A tibble: 52 × 4
## State total_men total_population percentage_men
## <chr> <dbl> <dbl> <dbl>
## 1 Alaska 386319 738565 0.523
## 2 North Dakota 382121 745475 0.513
## 3 Wyoming 298301 583200 0.511
## 4 South Dakota 430587 855444 0.503
## 5 Utah 1506614 2993941 0.503
## 6 Montana 517860 1029862 0.503
## 7 Colorado 2731315 5436519 0.502
## 8 Hawaii 713981 1421658 0.502
## 9 Nevada 1450091 2887725 0.502
## 10 Idaho 830627 1657375 0.501
## # … with 42 more rows
2.2.3 Summarizing by state and region
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
select(region, state, county, population)
It looks like the South has the highest average_pop
of 7370486, while the North Central region has the highest median_pop
of 5580644.
2.3 The top_n verb
Selecting a county from each region Previously, you used the walk
column, which offers a percentage of people in each county that walk to work, to add a new column and count to find the total number of people who walk to work in each county.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
select(region, state, county, metro, population, walk)
%>%
counties_selected group_by(State) %>%
# Find the greatest number of citizens who walk to work
top_n(1, TotalPop)
## # A tibble: 52 × 6
## # Groups: State [52]
## State County TotalPop Men Women Walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Jefferson County 659460 312333 347127 1.4
## 2 Alaska Anchorage Municipality 298225 152311 145914 3.4
## 3 Arizona Maricopa County 4155501 2055464 2100037 1.5
## 4 Arkansas Pulaski County 392848 188310 204538 1.3
## 5 California Los Angeles County 10105722 4979641 5126081 2.7
## 6 Colorado Denver County 678467 339400 339067 4.4
## 7 Connecticut Fairfield County 947328 461380 485948 2.6
## 8 Delaware New Castle County 555036 268818 286218 2.4
## 9 District of Columbia District of Columbia 672391 319046 353345 13.2
## 10 Florida Miami-Dade County 2702602 1311997 1390605 2.1
## # … with 42 more rows
<- counties %>%
counties_selected select(State, County, Unemployment)
%>%
counties_selected group_by(State) %>%
top_n(3, Unemployment)
## # A tibble: 161 × 3
## # Groups: State [52]
## State County Unemployment
## <chr> <chr> <dbl>
## 1 Alabama Clarke County 15.2
## 2 Alabama Monroe County 21.6
## 3 Alabama Wilcox County 17.2
## 4 Alaska Kusilvak Census Area 28.8
## 5 Alaska Northwest Arctic Borough 20.1
## 6 Alaska Yukon-Koyukuk Census Area 19.7
## 7 Arizona Apache County 14.1
## 8 Arizona Graham County 12.2
## 9 Arizona Navajo County 16.3
## 10 Arkansas Desha County 12.2
## # … with 151 more rows
2.3.1 Finding the highest-income state in each region
You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by()
, summarize()
, and top_n()
to find the state in each region with the highest income.
When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y)
then summarize, the result will still be grouped by X.
counties_selected <- counties %>%
select(region, state, county, population, income)
<- counties %>%
counties_selected select(State, County, TotalPop, Income)
%>%
counties_selected group_by(County, State) %>%
# Calculate average income
summarize(average_income = mean(Income) ) %>%
# Find the highest income state in each region
top_n(1, average_income)
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
## # A tibble: 1,955 × 3
## # Groups: County [1,955]
## County State average_income
## <chr> <chr> <dbl>
## 1 Abbeville County South Carolina 35254
## 2 Acadia Parish Louisiana 40492
## 3 Accomack County Virginia 42260
## 4 Ada County Idaho 60151
## 5 Adair County Iowa 49477
## 6 Adams County Colorado 64087
## 7 Addison County Vermont 61875
## 8 Adjuntas Municipio Puerto Rico 11680
## 9 Aguada Municipio Puerto Rico 16199
## 10 Aguadilla Municipio Puerto Rico 16821
## # … with 1,945 more rows
From our results, we can see that the New Jersey in the Northeast is the state with the highest
average_income
of 73014.
2.3.2 Using summarize, top_n, and count together
Five dplyr verbs related to aggregation: count()
, group_by()
, summarize()
, ungroup()
, and top_n()
. We’ll use all of them to answer a question: In how many states women population is greater than men?
<- counties %>%
counties_selected select(State, County, Men, Women)
# Extract the most populated row for each state
%>%
counties_selected group_by(State, County) %>%
summarize(women_total = sum(Women),
men_total = sum(Men),
women_more_men = women_total > men_total) %>%
top_n(1, women_total) %>%
ungroup() %>%
count(women_more_men)
## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
## women_more_men n
## <lgl> <int>
## 1 FALSE 8
## 2 TRUE 44
Notice that 44 states have more women than men.
3 Selecting and Transforming Data
This chapter focuses on advanced methods of selecting and transforming columns. There are a ton of variables in the counties dataset, and often you only want to work with a subset of them.
3.1 Selecting
3.1.1 Selecting columns
Using the select()
verb, we can answer interesting questions about our dataset by focusing in on related groups of verbs. The colon (:
) is useful for getting many columns at a time.
%>%
counties # Select state, county, population, and industry-related columns
select(State, County, TotalPop, Professional:Production) %>%
# Arrange service in descending order
arrange(desc(Service))
## # A tibble: 3,220 × 8
## State County TotalPop Professional Service Office Construction Production
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Utah Dagge… 702 19.8 46.4 4.8 13 16
## 2 Texas Kinne… 3631 24.2 38.6 10.7 21.6 4.9
## 3 Texas Hudsp… 3702 17.8 38.4 14.1 17.5 12.2
## 4 Puerto R… Viequ… 8931 20.9 38.4 16.4 16.9 7.3
## 5 Hawaii Kalaw… 86 22.2 38.1 20.6 0 19
## 6 Wisconsin Menom… 4506 22 37.4 19.9 7.8 12.9
## 7 Texas Culbe… 2257 19.3 36 17.9 19.7 7.1
## 8 Alaska Denal… 2303 23.1 35.2 20.9 10.1 10.8
## 9 Kentucky Wolfe… 7251 18.7 35 16.5 15.3 14.5
## 10 New York Bronx… 1455846 25 33.4 23.5 7.1 11.1
## # … with 3,210 more rows
3.1.2 Select helpers
You learned about the select helper starts_with()
. Another select helper is ends_with(
), which finds the columns that end with a particular string.
Below a list of useful helpers: contains()
starts_with()
ends_with()
last_col()
You can see all typing ?select_helpers.
%>%
counties # Select the state, county, population, and those ending with "work"
select(State, County, TotalPop, ends_with("Work")) %>%
# Filter for counties that have at least 50% of people engaged in public work
filter(PublicWork >= 50)
## # A tibble: 10 × 6
## State County TotalPop PrivateWork PublicWork FamilyWork
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska Kusilvak Census Area 8129 41 57.8 0
## 2 Alaska Lake and Peninsula B… 1301 40.4 54.2 0.3
## 3 Alaska Yukon-Koyukuk Census… 5453 38.8 56.5 0
## 4 Hawaii Kalawao County 86 34.9 61.9 0
## 5 North Dakota Sioux County 4420 34.3 58.3 0.3
## 6 South Dakota Buffalo County 2048 38.9 57.7 1.2
## 7 South Dakota Dewey County 5709 33.7 50.2 1.4
## 8 South Dakota Oglala Lakota County 14291 31.1 64.8 0
## 9 South Dakota Todd County 10016 34.4 57.4 1.4
## 10 Wisconsin Menominee County 4506 34.3 62.3 0.3
3.2 The rename verb
3.2.1 Renaming a column after count
The rename()
verb is often useful for changing the name of a column that comes out of another verb, such as count()
. In this exercise, you’ll rename the n column from count()
(which you learned about in Chapter 2) to something more descriptive.
# Rename the n column to num_counties
%>%
counties count(State) %>%
rename(num_counties = n)
## # A tibble: 52 × 2
## State num_counties
## <chr> <int>
## 1 Alabama 67
## 2 Alaska 29
## 3 Arizona 15
## 4 Arkansas 75
## 5 California 58
## 6 Colorado 64
## 7 Connecticut 8
## 8 Delaware 3
## 9 District of Columbia 1
## 10 Florida 67
## # … with 42 more rows
Notice the difference between column names in the output from the first step to the second step. Don’t forget, using
rename()
isn’t the only way to choose a new name for a column!
3.2.2 Renaming a column as part of a select
rename()
isn’t the only way you can choose a new name for a column; you can also choose a name as part of a select()
.
# Select state, county, and poverty as poverty_rate
%>%
counties select(State, County, poverty_rate = Poverty)
## # A tibble: 3,220 × 3
## State County poverty_rate
## <chr> <chr> <dbl>
## 1 Alabama Autauga County 13.7
## 2 Alabama Baldwin County 11.8
## 3 Alabama Barbour County 27.2
## 4 Alabama Bibb County 15.2
## 5 Alabama Blount County 15.6
## 6 Alabama Bullock County 28.5
## 7 Alabama Butler County 24.4
## 8 Alabama Calhoun County 18.6
## 9 Alabama Chambers County 18.8
## 10 Alabama Cherokee County 16.1
## # … with 3,210 more rows
As you can see, we were able to select the four columns of interest from our dataset, and rename one of those columns, using only the
select()
verb!
3.3 The transmute verb
Recall, you can think of transmute()
as a combination of select()
and mutate()
, since you are getting back a subset of columns, but you are transforming and changing them at the same time.
3.3.1 Using transmute
As you learned in the video, the transmute
verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
%>%
counties # Keep the state, county, and populations columns, and add a active column
transmute(State, County, TotalPop, employment_rate = Employed/TotalPop) %>%
# Filter for counties with a population greater than one million
filter(TotalPop > 1000000) %>%
# Sort active in ascending order
arrange(employment_rate)
## # A tibble: 44 × 4
## State County TotalPop employment_rate
## <chr> <chr> <dbl> <dbl>
## 1 New York Bronx County 1455846 0.408
## 2 Michigan Wayne County 1763822 0.409
## 3 California San Bernardino County 2121220 0.410
## 4 California Riverside County 2355002 0.416
## 5 Pennsylvania Philadelphia County 1569657 0.424
## 6 Arizona Pima County 1007257 0.430
## 7 California Sacramento County 1495400 0.446
## 8 Texas Bexar County 1892004 0.459
## 9 Florida Palm Beach County 1426772 0.459
## 10 New York Kings County 2635121 0.462
## # … with 34 more rows
3.3.2 Choosing among the four verbs
In this chapter you’ve learned about the four verbs: select
, mutate
, transmute
, and rename.
Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
Examples using different verbs:
# Change the name of the unemployment column
%>%
counties rename(unemployment_rate = Unemployment)
## # A tibble: 3,220 × 37
## CountyId State County TotalPop Men Women Hispanic White Black Native Asian
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab… Autau… 55036 26899 28137 2.7 75.4 18.9 0.3 0.9
## 2 1003 Alab… Baldw… 203360 99527 103833 4.4 83.1 9.5 0.8 0.7
## 3 1005 Alab… Barbo… 26201 13976 12225 4.2 45.7 47.8 0.2 0.6
## 4 1007 Alab… Bibb … 22580 12251 10329 2.4 74.6 22 0.4 0
## 5 1009 Alab… Bloun… 57667 28490 29177 9 87.4 1.5 0.3 0.1
## 6 1011 Alab… Bullo… 10478 5616 4862 0.3 21.6 75.6 1 0.7
## 7 1013 Alab… Butle… 20126 9416 10710 0.3 52.2 44.7 0.1 1.1
## 8 1015 Alab… Calho… 115527 55593 59934 3.6 72.7 20.4 0.2 1
## 9 1017 Alab… Chamb… 33895 16320 17575 2.2 56.2 39.3 0.3 1
## 10 1019 Alab… Chero… 25855 12862 12993 1.6 91.8 5 0.5 0.1
## # … with 3,210 more rows, and 26 more variables: Pacific <dbl>,
## # VotingAgeCitizen <dbl>, Income <dbl>, IncomeErr <dbl>, IncomePerCap <dbl>,
## # IncomePerCapErr <dbl>, Poverty <dbl>, ChildPoverty <dbl>,
## # Professional <dbl>, Service <dbl>, Office <dbl>, Construction <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>, Walk <dbl>,
## # OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## # PrivateWork <dbl>, PublicWork <dbl>, SelfEmployed <dbl>, …
# Keep the state and county columns, and the columns containing poverty
%>%
counties select(State, County, contains("Poverty"))
## # A tibble: 3,220 × 4
## State County Poverty ChildPoverty
## <chr> <chr> <dbl> <dbl>
## 1 Alabama Autauga County 13.7 20.1
## 2 Alabama Baldwin County 11.8 16.1
## 3 Alabama Barbour County 27.2 44.9
## 4 Alabama Bibb County 15.2 26.6
## 5 Alabama Blount County 15.6 25.4
## 6 Alabama Bullock County 28.5 50.4
## 7 Alabama Butler County 24.4 34.8
## 8 Alabama Calhoun County 18.6 26.6
## 9 Alabama Chambers County 18.8 29.1
## 10 Alabama Cherokee County 16.1 20
## # … with 3,210 more rows
# Calculate the fraction_women column without dropping the other columns
%>%
counties mutate(fraction_women = Women / TotalPop)
## # A tibble: 3,220 × 38
## CountyId State County TotalPop Men Women Hispanic White Black Native Asian
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab… Autau… 55036 26899 28137 2.7 75.4 18.9 0.3 0.9
## 2 1003 Alab… Baldw… 203360 99527 103833 4.4 83.1 9.5 0.8 0.7
## 3 1005 Alab… Barbo… 26201 13976 12225 4.2 45.7 47.8 0.2 0.6
## 4 1007 Alab… Bibb … 22580 12251 10329 2.4 74.6 22 0.4 0
## 5 1009 Alab… Bloun… 57667 28490 29177 9 87.4 1.5 0.3 0.1
## 6 1011 Alab… Bullo… 10478 5616 4862 0.3 21.6 75.6 1 0.7
## 7 1013 Alab… Butle… 20126 9416 10710 0.3 52.2 44.7 0.1 1.1
## 8 1015 Alab… Calho… 115527 55593 59934 3.6 72.7 20.4 0.2 1
## 9 1017 Alab… Chamb… 33895 16320 17575 2.2 56.2 39.3 0.3 1
## 10 1019 Alab… Chero… 25855 12862 12993 1.6 91.8 5 0.5 0.1
## # … with 3,210 more rows, and 27 more variables: Pacific <dbl>,
## # VotingAgeCitizen <dbl>, Income <dbl>, IncomeErr <dbl>, IncomePerCap <dbl>,
## # IncomePerCapErr <dbl>, Poverty <dbl>, ChildPoverty <dbl>,
## # Professional <dbl>, Service <dbl>, Office <dbl>, Construction <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>, Walk <dbl>,
## # OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## # PrivateWork <dbl>, PublicWork <dbl>, SelfEmployed <dbl>, …
# Keep only the state, county, and employment_rate columns
%>%
counties transmute(State, County, employment_rate = Employed / TotalPop)
## # A tibble: 3,220 × 3
## State County employment_rate
## <chr> <chr> <dbl>
## 1 Alabama Autauga County 0.438
## 2 Alabama Baldwin County 0.440
## 3 Alabama Barbour County 0.339
## 4 Alabama Bibb County 0.362
## 5 Alabama Blount County 0.371
## 6 Alabama Bullock County 0.409
## 7 Alabama Butler County 0.384
## 8 Alabama Calhoun County 0.410
## 9 Alabama Chambers County 0.429
## 10 Alabama Cherokee County 0.382
## # … with 3,210 more rows
4 Case Study: The babynames Dataset
Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.
4.1 The babynames data
<- readRDS("babynames.rds")
babynames babynames
## # A tibble: 332,595 × 3
## year name number
## <dbl> <chr> <int>
## 1 1880 Aaron 102
## 2 1880 Ab 5
## 3 1880 Abbie 71
## 4 1880 Abbott 5
## 5 1880 Abby 6
## 6 1880 Abe 50
## 7 1880 Abel 9
## 8 1880 Abigail 12
## 9 1880 Abner 27
## 10 1880 Abraham 81
## # … with 332,585 more rows
4.1.1 Filtering and arranging for one year
The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.
%>%
babynames # Filter for the year 1990
filter(year == 1990) %>%
# Sort the number column in descending order
arrange(desc(number))
## # A tibble: 21,223 × 3
## year name number
## <dbl> <chr> <int>
## 1 1990 Michael 65560
## 2 1990 Christopher 52520
## 3 1990 Jessica 46615
## 4 1990 Ashley 45797
## 5 1990 Matthew 44925
## 6 1990 Joshua 43382
## 7 1990 Brittany 36650
## 8 1990 Amanda 34504
## 9 1990 Daniel 33963
## 10 1990 David 33862
## # … with 21,213 more rows
4.1.2 Using top_n with babynames
You saw that you could use filter()
and arrange()
to find the most common names in one year. However, you could also use group_by()
and top_n()
to find the most common name in every year.
%>%
babynames # Find the most common name in each year
group_by(year) %>%
top_n(1, number)
## # A tibble: 28 × 3
## # Groups: year [28]
## year name number
## <dbl> <chr> <int>
## 1 1880 John 9701
## 2 1885 Mary 9166
## 3 1890 Mary 12113
## 4 1895 Mary 13493
## 5 1900 Mary 16781
## 6 1905 Mary 16135
## 7 1910 Mary 22947
## 8 1915 Mary 58346
## 9 1920 Mary 71175
## 10 1925 Mary 70857
## # … with 18 more rows
4.1.3 Visualizing names with ggplot2
The dplyr
package is very useful for exploring data, but it’s especially useful when combined with other tidyverse
packages like ggplot2
<- babynames %>%
selected_names # Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew"))
<- babynames %>%
selected_names # Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()
4.2 Grouped mutates
4.2.1 Finding the year each name is most common
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with a top_n.
# Calculate the fraction of people born each year with the same name
%>%
babynames group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
top_n(1, fraction)
## # A tibble: 48,040 × 5
## # Groups: name [48,040]
## year name number year_total fraction
## <dbl> <chr> <int> <int> <dbl>
## 1 1880 Abbott 5 201478 0.0000248
## 2 1880 Abe 50 201478 0.000248
## 3 1880 Abner 27 201478 0.000134
## 4 1880 Adelbert 28 201478 0.000139
## 5 1880 Adella 26 201478 0.000129
## 6 1880 Adolf 6 201478 0.0000298
## 7 1880 Adolph 93 201478 0.000462
## 8 1880 Agustus 5 201478 0.0000248
## 9 1880 Albert 1493 201478 0.00741
## 10 1880 Albertina 7 201478 0.0000347
## # … with 48,030 more rows
Notice that the results are grouped by
year
, thenname
, so the first few entries are names that were most popular in the 1880’s that start with the letter A.
4.2.2 Adding the total and maximum for each name
In the video, you learned how you could group by the year and use mutate()
to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutates
slower.
%>%
babynames # Add columns name_total and name_max for each name
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
# Ungroup the table
ungroup() %>%
# Add the fraction_max column containing the number by the name maximum
mutate(fraction_max = number/name_max)
## # A tibble: 332,595 × 6
## year name number name_total name_max fraction_max
## <dbl> <chr> <int> <int> <int> <dbl>
## 1 1880 Aaron 102 114739 14635 0.00697
## 2 1880 Ab 5 77 31 0.161
## 3 1880 Abbie 71 4330 445 0.160
## 4 1880 Abbott 5 217 51 0.0980
## 5 1880 Abby 6 11272 1753 0.00342
## 6 1880 Abe 50 1832 271 0.185
## 7 1880 Abel 9 10565 3245 0.00277
## 8 1880 Abigail 12 72600 15762 0.000761
## 9 1880 Abner 27 1552 199 0.136
## 10 1880 Abraham 81 17882 2449 0.0331
## # … with 332,585 more rows
This tells you, for example, that the name Abe was at 18.5% of its peak in the year 1880.
4.2.3 Visualizing the normalized change in popularity
You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.
In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized
, has been provided for you.
<- babynames %>%
names_normalized group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
<- names_normalized %>%
names_filtered # Filter for the names Steven, Thomas, and Matthew
filter(name %in% c( "Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x= year, y= fraction_max, color = name)) + geom_line()
> As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
4.3 Window functions
4.3.1 Using ratios to describe the frequency of a name
You learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
<- babynames %>%
babynames_fraction group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
top_n(1, fraction)
%>%
babynames_fraction # Arrange the data in order of name, then year
arrange(name,year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio of fraction between each year
mutate(ratio = fraction/lag(fraction))
## # A tibble: 48,040 × 6
## # Groups: name [48,040]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 2015 Aaban 15 3648781 0.00000411 NA
## 2 2015 Aadam 22 3648781 0.00000603 NA
## 3 2010 Aadan 11 3672066 0.00000300 NA
## 4 2015 Aadarsh 15 3648781 0.00000411 NA
## 5 2010 Aaden 450 3672066 0.000123 NA
## 6 2015 Aadhav 31 3648781 0.00000850 NA
## 7 2015 Aadhavan 5 3648781 0.00000137 NA
## 8 2015 Aadhya 265 3648781 0.0000726 NA
## 9 2010 Aadi 54 3672066 0.0000147 NA
## 10 2005 Aadil 20 3828460 0.00000522 NA
## # … with 48,030 more rows
Notice that the first observation for each name is missing a ratio, since there is no previous year.
4.3.2 Biggest jumps in a name
Previously, you added a ratio column to describe the ratio of the frequency of a baby name between consecutive years to describe the changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.
<- babynames %>%
babynames_fraction group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
top_n(1, fraction)
%>%
babynames_fraction # Arrange the data in order of name, then year
arrange(name,year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio of fraction between each year
mutate(ratio = fraction/lag(fraction))
## # A tibble: 48,040 × 6
## # Groups: name [48,040]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 2015 Aaban 15 3648781 0.00000411 NA
## 2 2015 Aadam 22 3648781 0.00000603 NA
## 3 2010 Aadan 11 3672066 0.00000300 NA
## 4 2015 Aadarsh 15 3648781 0.00000411 NA
## 5 2010 Aaden 450 3672066 0.000123 NA
## 6 2015 Aadhav 31 3648781 0.00000850 NA
## 7 2015 Aadhavan 5 3648781 0.00000137 NA
## 8 2015 Aadhya 265 3648781 0.0000726 NA
## 9 2010 Aadi 54 3672066 0.0000147 NA
## 10 2005 Aadil 20 3828460 0.00000522 NA
## # … with 48,030 more rows
<- babynames_fraction %>%
babynames_ratios_filtered arrange(name, year) %>%
group_by(name) %>%
mutate(ratio = fraction / lag(fraction)) %>%
filter(fraction >= 0.00001)
%>%
babynames_ratios_filtered # Extract the largest ratio from each name
top_n(1, ratio) %>%
# Sort the ratio column in descending order
arrange(desc(ratio)) %>%
# Filter for fractions greater than or equal to 0.001
filter(fraction >= 0.001)
## # A tibble: 0 × 6
## # Groups: name [0]
## # … with 6 variables: year <dbl>, name <chr>, number <int>, year_total <int>,
## # fraction <dbl>, ratio <dbl>
Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.
5 Conclusions
5.1 Summary
select()
filter()
mutate()
arrange()
count()
group_by()
summarize()
5.2 Other DataCamp courses
- Exploratory Data Analysis in R: Case Study
- Working with Data in the Tidyverse
- Machine Learning in the Tidyverse
- Categorical Data in the Tidyverse