Libraries
library(tidyverse)
counties <- readRDS("counties.rds")
babynames <- readRDS("babynames.rds")
Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions—it can also help with basic transformations of your data. You’ll also learn to aggregate your data and add, remove, or change the variables. Along the way, you’ll explore a dataset containing information about counties in the United States. You’ll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
Here you see the counties_selected dataset with a few interesting variables selected. These variables: private_work, public_work, self_employed describe whether people work for the government, for private companies, or for themselves.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
counties_selected %>% arrange(desc(public_work))
## # A tibble: 3,138 × 6
## state county population private_work public_work self_employed
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Hawaii Kalawao 85 25 64.1 10.9
## 2 Alaska Yukon-Koyukuk… 5644 33.3 61.7 5.1
## 3 Wisconsin Menominee 4451 36.8 59.1 3.7
## 4 North Dakota Sioux 4380 32.9 56.8 10.2
## 5 South Dakota Todd 9942 34.4 55 9.8
## 6 Alaska Lake and Peni… 1474 42.2 51.6 6.1
## 7 California Lassen 32645 42.6 50.5 6.8
## 8 South Dakota Buffalo 2038 48.4 49.5 1.8
## 9 South Dakota Dewey 5579 34.9 49.2 14.7
## 10 Texas Kenedy 565 51.9 48.1 0
## # … with 3,128 more rows
You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.
counties_selected <- counties %>% select(state, county, population)
# Filter for counties with a population above 1000000
counties_selected %>% filter(population > 1000000)
## # A tibble: 41 × 3
## state county population
## <chr> <chr> <dbl>
## 1 Arizona Maricopa 4018143
## 2 California Alameda 1584983
## 3 California Contra Costa 1096068
## 4 California Los Angeles 10038388
## 5 California Orange 3116069
## 6 California Riverside 2298032
## 7 California Sacramento 1465832
## 8 California San Bernardino 2094769
## 9 California San Diego 3223096
## 10 California Santa Clara 1868149
## # … with 31 more rows
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>% filter(state == "Texas", population > 10000) %>%
arrange(desc(private_work))
## # A tibble: 169 × 6
## state county population private_work public_work self_employed
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Texas Gregg 123178 84.7 9.8 5.4
## 2 Texas Collin 862215 84.1 10 5.8
## 3 Texas Dallas 2485003 83.9 9.5 6.4
## 4 Texas Harris 4356362 83.4 10.1 6.3
## 5 Texas Andrews 16775 83.1 9.6 6.8
## 6 Texas Tarrant 1914526 83.1 11.4 5.4
## 7 Texas Titus 32553 82.5 10 7.4
## 8 Texas Denton 731851 82.2 11.9 5.7
## 9 Texas Ector 149557 82 11.2 6.7
## 10 Texas Moore 22281 82 11.7 5.9
## # … with 159 more rows
Sort in descending order of the public_workers column
counties_selected <- counties %>% select(state, county, population, public_work)
# Add a new column public_workers with the number of people employed in public work
counties_selected %>% mutate(public_workers = public_work * population / 100) %>%
arrange(desc(public_workers))
## # A tibble: 3,138 × 5
## state county population public_work public_workers
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 California Los Angeles 10038388 11.5 1154415.
## 2 Illinois Cook 5236393 11.5 602185.
## 3 California San Diego 3223096 14.8 477018.
## 4 Arizona Maricopa 4018143 11.7 470123.
## 5 Texas Harris 4356362 10.1 439993.
## 6 New York Kings 2595259 14.4 373717.
## 7 California San Bernardino 2094769 16.7 349826.
## 8 California Riverside 2298032 14.9 342407.
## 9 California Sacramento 1465832 21.8 319551.
## 10 California Orange 3116069 10.2 317839.
## # … with 3,128 more rows
## Select the columns state, county, population, men, and women. Calculate proportion_women as the fraction of the population made up of women
counties_selected <- counties %>% select(state, county, metro, population, men, women)
counties_selected %>% mutate(proportion_women = women / population)
## # A tibble: 3,138 × 7
## state county metro population men women proportion_women
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Autauga Metro 55221 26745 28476 0.516
## 2 Alabama Baldwin Metro 195121 95314 99807 0.512
## 3 Alabama Barbour Nonmetro 26932 14497 12435 0.462
## 4 Alabama Bibb Metro 22604 12073 10531 0.466
## 5 Alabama Blount Metro 57710 28512 29198 0.506
## 6 Alabama Bullock Nonmetro 10678 5660 5018 0.470
## 7 Alabama Butler Nonmetro 20354 9502 10852 0.533
## 8 Alabama Calhoun Metro 116648 56274 60374 0.518
## 9 Alabama Chambers Nonmetro 34079 16258 17821 0.523
## 10 Alabama Cherokee Nonmetro 26008 12975 13033 0.501
## # … with 3,128 more rows
## Select the five columns, add the proportion_men variable, filter for population of at least 10,000 , and arrange proportion of men in descending order
counties %>% select(state, county, population, men, women) %>% mutate(proportion_men = men / population) %>% filter(population >= 10000) %>% arrange(desc(proportion_men))
## # A tibble: 2,437 × 6
## state county population men women proportion_men
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Virginia Sussex 11864 8130 3734 0.685
## 2 California Lassen 32645 21818 10827 0.668
## 3 Georgia Chattahoochee 11914 7940 3974 0.666
## 4 Louisiana West Feliciana 15415 10228 5187 0.664
## 5 Florida Union 15191 9830 5361 0.647
## 6 Texas Jones 19978 12652 7326 0.633
## 7 Missouri DeKalb 12782 8080 4702 0.632
## 8 Texas Madison 13838 8648 5190 0.625
## 9 Virginia Greensville 11760 7303 4457 0.621
## 10 Texas Anderson 57915 35469 22446 0.612
## # … with 2,427 more rows
Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.
## Use count to find the number of counties in each region
counties_selected %>% count("region", sort = TRUE)
## # A tibble: 1 × 2
## `"region"` n
## <chr> <int>
## 1 region 3138
## Find number of counties per state, weighted by citizens, sorted in descending order
counties %>% count("state", wt=citizens, sort=TRUE)
## # A tibble: 1 × 2
## `"state"` n
## <chr> <dbl>
## 1 state 221965911
## Add population_walk containing the total number of people who walk to work, and count weighted by the new column, sort in descending order
counties %>% mutate(population_walk = population * walk / 100) %>% count("state", wt = population_walk, sort = TRUE)
## # A tibble: 1 × 2
## `"state"` n
## <chr> <dbl>
## 1 state 8579507.
## Summarize to find minimum population, maximum unemployment, and average income
counties %>%
summarize(min_population=min(population), max_unemployment=max(unemployment), average_income=mean(income))
## # A tibble: 1 × 3
## min_population max_unemployment average_income
## <dbl> <dbl> <dbl>
## 1 85 29.4 46832.
## Group by state, and find the total area and population
counties %>% group_by(state) %>%
summarize(total_area=sum(land_area), total_population=sum(population))
## # A tibble: 50 × 3
## state total_area total_population
## <chr> <dbl> <dbl>
## 1 Alabama 50645. 4830620
## 2 Alaska 553560. 725461
## 3 Arizona 113594. 6641928
## 4 Arkansas 52035. 2958208
## 5 California 155779. 38421464
## 6 Colorado 103642. 5278906
## 7 Connecticut 4842. 3593222
## 8 Delaware 1949. 926454
## 9 Florida 53625. 19645772
## 10 Georgia 57514. 10006693
## # … with 40 more rows
## Add a density column, sort by density in descending order
counties %>% group_by(state) %>%
summarize(total_area = sum(land_area), total_population = sum(population)) %>%
mutate(density=total_population/total_area) %>% arrange(desc(density))
## # A tibble: 50 × 4
## state total_area total_population density
## <chr> <dbl> <dbl> <dbl>
## 1 New Jersey 7354. 8904413 1211.
## 2 Rhode Island 1034. 1053661 1019.
## 3 Massachusetts 7800. 6705586 860.
## 4 Connecticut 4842. 3593222 742.
## 5 Maryland 9707. 5930538 611.
## 6 Delaware 1949. 926454 475.
## 7 New York 47126. 19673174 417.
## 8 Florida 53625. 19645772 366.
## 9 Pennsylvania 44743. 12779559 286.
## 10 Ohio 40861. 11575977 283.
## # … with 40 more rows
## Group and summarize to find the total population, calculate the average_pop and median_pop columns
counties %>%
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>%
summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## region average_pop median_pop
## <chr> <dbl> <dbl>
## 1 North Central 5627687. 5580644
## 2 Northeast 6221058. 3593222
## 3 South 7370486 4804098
## 4 West 5722755. 2798636
## Group by region and find the greatest number of citizens who walk to work
counties %>% group_by(region) %>% top_n(1, walk)
## # A tibble: 4 × 40
## # Groups: region [4]
## census_id state county region metro population men women hispanic white
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 Alaska Aleutia… West Nonm… 3304 2198 1106 12 15
## 2 36061 New Y… New York North… Metro 1629507 769434 860073 25.8 47.1
## 3 38051 North… McIntosh North… Nonm… 2759 1341 1418 0.9 95.8
## 4 51678 Virgi… Lexingt… South Nonm… 7071 4372 2699 3.9 75.4
## # … with 30 more variables: black <dbl>, native <dbl>, asian <dbl>,
## # pacific <dbl>, citizens <dbl>, income <dbl>, income_err <dbl>,
## # income_per_cap <dbl>, income_per_cap_err <dbl>, poverty <dbl>,
## # child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## # construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## # transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## # mean_commute <dbl>, employed <dbl>, private_work <dbl>, …
## Calculate average income and find the highest income state in each region
counties %>%
group_by(region, state) %>%
summarize(average_income = mean(income)) %>% top_n(1, average_income)
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: region [4]
## region state average_income
## <chr> <chr> <dbl>
## 1 North Central North Dakota 55575.
## 2 Northeast New Jersey 73014.
## 3 South Maryland 69200.
## 4 West Alaska 65125.
## Find the total population for each combination of state and metro
counties_selected %>%
group_by(state, metro) %>% summarize(total_pop=sum(population))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 97 × 3
## # Groups: state [50]
## state metro total_pop
## <chr> <chr> <dbl>
## 1 Alabama Metro 3671377
## 2 Alabama Nonmetro 1159243
## 3 Alaska Metro 494990
## 4 Alaska Nonmetro 230471
## 5 Arizona Metro 6295145
## 6 Arizona Nonmetro 346783
## 7 Arkansas Metro 1806867
## 8 Arkansas Nonmetro 1151341
## 9 California Metro 37587429
## 10 California Nonmetro 834035
## # … with 87 more rows
## Find the total population for each combination of state and metro, Extract the most populated row for each state
counties_selected %>%
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
top_n(1, total_pop)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 50 × 3
## # Groups: state [50]
## state metro total_pop
## <chr> <chr> <dbl>
## 1 Alabama Metro 3671377
## 2 Alaska Metro 494990
## 3 Arizona Metro 6295145
## 4 Arkansas Metro 1806867
## 5 California Metro 37587429
## 6 Colorado Metro 4590896
## 7 Connecticut Metro 3406918
## 8 Delaware Metro 926454
## 9 Florida Metro 18941821
## 10 Georgia Metro 8233886
## # … with 40 more rows
Find the total population for each combination of state and metro, Extract the most populated row for each state, Count the states with more people in Metro or Nonmetro areas
counties_selected %>%
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
top_n(1, total_pop) %>%
ungroup() %>%
count(metro)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
## metro n
## <chr> <int>
## 1 Metro 44
## 2 Nonmetro 6
Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.
## Glimpse the counties table
glimpse(counties)
## Rows: 3,138
## Columns: 40
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", "10…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", …
## $ region <chr> "South", "South", "South", "South", "South", "South…
## $ metro <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", "No…
## $ population <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 1…
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5627…
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 603…
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7…
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.…
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3,…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0…
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0…
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
## $ citizens <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41…
## $ income_err <dbl> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2949…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21…
## $ income_per_cap_err <dbl> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366, 15…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.…
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.…
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.…
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.…
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.…
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.…
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.…
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, …
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0…
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1…
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1…
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1…
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401,…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.…
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.…
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4…
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0…
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.…
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 77…
## Select state, county, population, and industry-related columns, Arrange service in descending order
counties %>%
select(state, county, population, professional:production) %>%
arrange(desc(service))
## # A tibble: 3,138 × 8
## state county population professional service office construction production
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Missis… Tunica 10477 23.9 36.6 21.5 3.5 14.5
## 2 Texas Kinney 3577 30 36.5 11.6 20.5 1.3
## 3 Texas Kenedy 565 24.9 34.1 20.5 20.5 0
## 4 New Yo… Bronx 1428357 24.3 33.3 24.2 7.1 11
## 5 Texas Brooks 7221 19.6 32.4 25.3 11.1 11.5
## 6 Colora… Fremo… 46809 26.6 32.2 22.8 10.7 7.6
## 7 Texas Culbe… 2296 20.1 32.2 24.2 15.7 7.8
## 8 Califo… Del N… 27788 33.9 31.5 18.8 8.9 6.8
## 9 Minnes… Mahno… 5496 26.8 31.5 18.7 13.1 9.9
## 10 Virgin… Lanca… 11129 30.3 31.2 22.8 8.1 7.6
## # … with 3,128 more rows
# Select the state, county, population, and those ending with "work", Filter for counties that have at least 50% of people engaged in public work
counties %>%
select(state, county, population, ends_with("work")) %>%
filter(public_work >= 50)
## # A tibble: 7 × 6
## state county population private_work public_work family_work
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska Lake and Peninsu… 1474 42.2 51.6 0.2
## 2 Alaska Yukon-Koyukuk Ce… 5644 33.3 61.7 0
## 3 California Lassen 32645 42.6 50.5 0.1
## 4 Hawaii Kalawao 85 25 64.1 0
## 5 North Dakota Sioux 4380 32.9 56.8 0.1
## 6 South Dakota Todd 9942 34.4 55 0.8
## 7 Wisconsin Menominee 4451 36.8 59.1 0.4
## Count the number of counties in each state, Rename the n column to num_counties
counties %>% count(state) %>% rename(num_counties=n)
## # A tibble: 50 × 2
## state num_counties
## <chr> <int>
## 1 Alabama 67
## 2 Alaska 28
## 3 Arizona 15
## 4 Arkansas 75
## 5 California 58
## 6 Colorado 64
## 7 Connecticut 8
## 8 Delaware 3
## 9 Florida 67
## 10 Georgia 159
## # … with 40 more rows
## Select state, county, and poverty as poverty_rate
counties %>% select(state, county, poverty_rate=poverty)
## # A tibble: 3,138 × 3
## state county poverty_rate
## <chr> <chr> <dbl>
## 1 Alabama Autauga 12.9
## 2 Alabama Baldwin 13.4
## 3 Alabama Barbour 26.7
## 4 Alabama Bibb 16.8
## 5 Alabama Blount 16.7
## 6 Alabama Bullock 24.6
## 7 Alabama Butler 25.4
## 8 Alabama Calhoun 20.5
## 9 Alabama Chambers 21.6
## 10 Alabama Cherokee 19.2
## # … with 3,128 more rows
The transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
## Keep the state, county, and populations columns, and add a density column, Filter for counties with a population greater than one million, Sort density in ascending order
counties %>%
transmute(state, county, population, density = population / land_area) %>%
filter(population > 1000000) %>%
arrange(density)
## # A tibble: 41 × 4
## state county population density
## <chr> <chr> <dbl> <dbl>
## 1 California San Bernardino 2094769 104.
## 2 Nevada Clark 2035572 258.
## 3 California Riverside 2298032 319.
## 4 Arizona Maricopa 4018143 437.
## 5 Florida Palm Beach 1378806 700.
## 6 California San Diego 3223096 766.
## 7 Washington King 2045756 967.
## 8 Texas Travis 1121645 1133.
## 9 Florida Hillsborough 1302884 1277.
## 10 Florida Orange 1229039 1360.
## # … with 31 more rows
## Change the name of the unemployment column
counties %>% rename(unemployment_rate = unemployment)
## # A tibble: 3,138 × 40
## census_id state county region metro population men women hispanic white
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alabama Autauga South Metro 55221 26745 28476 2.6 75.8
## 2 1003 Alabama Baldwin South Metro 195121 95314 99807 4.5 83.1
## 3 1005 Alabama Barbour South Nonm… 26932 14497 12435 4.6 46.2
## 4 1007 Alabama Bibb South Metro 22604 12073 10531 2.2 74.5
## 5 1009 Alabama Blount South Metro 57710 28512 29198 8.6 87.9
## 6 1011 Alabama Bullock South Nonm… 10678 5660 5018 4.4 22.2
## 7 1013 Alabama Butler South Nonm… 20354 9502 10852 1.2 53.3
## 8 1015 Alabama Calhoun South Metro 116648 56274 60374 3.5 73
## 9 1017 Alabama Chambers South Nonm… 34079 16258 17821 0.4 57.3
## 10 1019 Alabama Cherokee South Nonm… 26008 12975 13033 1.5 91.7
## # … with 3,128 more rows, and 30 more variables: black <dbl>, native <dbl>,
## # asian <dbl>, pacific <dbl>, citizens <dbl>, income <dbl>, income_err <dbl>,
## # income_per_cap <dbl>, income_per_cap_err <dbl>, poverty <dbl>,
## # child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## # construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## # transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## # mean_commute <dbl>, employed <dbl>, private_work <dbl>, …
## Keep the state and county columns, and the columns containing poverty
counties %>% select(state, county, contains("poverty"))
## # A tibble: 3,138 × 4
## state county poverty child_poverty
## <chr> <chr> <dbl> <dbl>
## 1 Alabama Autauga 12.9 18.6
## 2 Alabama Baldwin 13.4 19.2
## 3 Alabama Barbour 26.7 45.3
## 4 Alabama Bibb 16.8 27.9
## 5 Alabama Blount 16.7 27.2
## 6 Alabama Bullock 24.6 38.4
## 7 Alabama Butler 25.4 39.2
## 8 Alabama Calhoun 20.5 31.6
## 9 Alabama Chambers 21.6 37.2
## 10 Alabama Cherokee 19.2 30.1
## # … with 3,128 more rows
## Calculate the fraction_women column without dropping the other columns
counties %>% mutate(fraction_women = women / population)
## # A tibble: 3,138 × 41
## census_id state county region metro population men women hispanic white
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alabama Autauga South Metro 55221 26745 28476 2.6 75.8
## 2 1003 Alabama Baldwin South Metro 195121 95314 99807 4.5 83.1
## 3 1005 Alabama Barbour South Nonm… 26932 14497 12435 4.6 46.2
## 4 1007 Alabama Bibb South Metro 22604 12073 10531 2.2 74.5
## 5 1009 Alabama Blount South Metro 57710 28512 29198 8.6 87.9
## 6 1011 Alabama Bullock South Nonm… 10678 5660 5018 4.4 22.2
## 7 1013 Alabama Butler South Nonm… 20354 9502 10852 1.2 53.3
## 8 1015 Alabama Calhoun South Metro 116648 56274 60374 3.5 73
## 9 1017 Alabama Chambers South Nonm… 34079 16258 17821 0.4 57.3
## 10 1019 Alabama Cherokee South Nonm… 26008 12975 13033 1.5 91.7
## # … with 3,128 more rows, and 31 more variables: black <dbl>, native <dbl>,
## # asian <dbl>, pacific <dbl>, citizens <dbl>, income <dbl>, income_err <dbl>,
## # income_per_cap <dbl>, income_per_cap_err <dbl>, poverty <dbl>,
## # child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## # construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## # transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## # mean_commute <dbl>, employed <dbl>, private_work <dbl>, …
## Keep only the state, county, and employment_rate columns
counties %>% transmute(state, county, employment_rate = employed / population)
## # A tibble: 3,138 × 3
## state county employment_rate
## <chr> <chr> <dbl>
## 1 Alabama Autauga 0.434
## 2 Alabama Baldwin 0.441
## 3 Alabama Barbour 0.319
## 4 Alabama Bibb 0.367
## 5 Alabama Blount 0.384
## 6 Alabama Bullock 0.362
## 7 Alabama Butler 0.384
## 8 Alabama Calhoun 0.406
## 9 Alabama Chambers 0.402
## 10 Alabama Cherokee 0.390
## # … with 3,128 more rows
Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.
You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by() and top_n() to find the most common name in every year.
## Filter for the year 1990, Sort the number column in descending order
babynames %>% filter(year == 1990) %>% arrange(desc(number))
## # A tibble: 21,223 × 3
## year name number
## <dbl> <chr> <int>
## 1 1990 Michael 65560
## 2 1990 Christopher 52520
## 3 1990 Jessica 46615
## 4 1990 Ashley 45797
## 5 1990 Matthew 44925
## 6 1990 Joshua 43382
## 7 1990 Brittany 36650
## 8 1990 Amanda 34504
## 9 1990 Daniel 33963
## 10 1990 David 33862
## # … with 21,213 more rows
## Find the most common name in each year
babynames %>% group_by(year) %>% top_n(1, number)
## # A tibble: 28 × 3
## # Groups: year [28]
## year name number
## <dbl> <chr> <int>
## 1 1880 John 9701
## 2 1885 Mary 9166
## 3 1890 Mary 12113
## 4 1895 Mary 13493
## 5 1900 Mary 16781
## 6 1905 Mary 16135
## 7 1910 Mary 22947
## 8 1915 Mary 58346
## 9 1920 Mary 71175
## 10 1925 Mary 70857
## # … with 18 more rows
## Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>% filter(name %in% c("Steven", "Thomas", "Matthew"))
## Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
## Calculate the fraction of people born each year with the same name, and Find the year each name is most common
babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
group_by(name) %>%
top_n(1, fraction)
## # A tibble: 48,040 × 5
## # Groups: name [48,040]
## year name number year_total fraction
## <dbl> <chr> <int> <int> <dbl>
## 1 1880 Abbott 5 201478 0.0000248
## 2 1880 Abe 50 201478 0.000248
## 3 1880 Abner 27 201478 0.000134
## 4 1880 Adelbert 28 201478 0.000139
## 5 1880 Adella 26 201478 0.000129
## 6 1880 Adolf 6 201478 0.0000298
## 7 1880 Adolph 93 201478 0.000462
## 8 1880 Agustus 5 201478 0.0000248
## 9 1880 Albert 1493 201478 0.00741
## 10 1880 Albertina 7 201478 0.0000347
## # … with 48,030 more rows
## Add columns name_total and name_max for each name
babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number))
## # A tibble: 332,595 × 5
## # Groups: name [48,040]
## year name number name_total name_max
## <dbl> <chr> <int> <int> <int>
## 1 1880 Aaron 102 114739 14635
## 2 1880 Ab 5 77 31
## 3 1880 Abbie 71 4330 445
## 4 1880 Abbott 5 217 51
## 5 1880 Abby 6 11272 1753
## 6 1880 Abe 50 1832 271
## 7 1880 Abel 9 10565 3245
## 8 1880 Abigail 12 72600 15762
## 9 1880 Abner 27 1552 199
## 10 1880 Abraham 81 17882 2449
## # … with 332,585 more rows
## Add columns name_total and name_max for each name, Ungroup the table, and Add the fraction_max column containing the number by the name maximum
babynames %>%
group_by(name) %>%
mutate(name_total = sum(number), name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
## # A tibble: 332,595 × 6
## year name number name_total name_max fraction_max
## <dbl> <chr> <int> <int> <int> <dbl>
## 1 1880 Aaron 102 114739 14635 0.00697
## 2 1880 Ab 5 77 31 0.161
## 3 1880 Abbie 71 4330 445 0.160
## 4 1880 Abbott 5 217 51 0.0980
## 5 1880 Abby 6 11272 1753 0.00342
## 6 1880 Abe 50 1832 271 0.185
## 7 1880 Abel 9 10565 3245 0.00277
## 8 1880 Abigail 12 72600 15762 0.000761
## 9 1880 Abner 27 1552 199 0.136
## 10 1880 Abraham 81 17882 2449 0.0331
## # … with 332,585 more rows
In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
babynames_fraction <- babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
babynames_fraction %>%
# Arrange the data in order of name, then year
arrange(name, year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio of fraction between each year
mutate(ratio = fraction / lag(fraction))
## # A tibble: 332,595 × 6
## # Groups: name [48,040]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 2010 Aaban 9 3672066 0.00000245 NA
## 2 2015 Aaban 15 3648781 0.00000411 1.68
## 3 1995 Aadam 6 3652750 0.00000164 NA
## 4 2000 Aadam 6 3767293 0.00000159 0.970
## 5 2005 Aadam 6 3828460 0.00000157 0.984
## 6 2010 Aadam 7 3672066 0.00000191 1.22
## 7 2015 Aadam 22 3648781 0.00000603 3.16
## 8 2010 Aadan 11 3672066 0.00000300 NA
## 9 2015 Aadan 10 3648781 0.00000274 0.915
## 10 2000 Aadarsh 5 3767293 0.00000133 NA
## # … with 332,585 more rows
Previously, you added a ratio column to describe the ratio of the frequency of a baby name between consecutive years to describe the changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.
babynames_ratios_filtered <- babynames_fraction %>%
arrange(name, year) %>%
group_by(name) %>%
mutate(ratio = fraction / lag(fraction)) %>%
filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
# Extract the largest ratio from each name
top_n(1, ratio) %>%
# Sort the ratio column in descending order
arrange(desc(ratio)) %>%
# Filter for fractions greater than or equal to 0.001
filter(fraction >= 0.001)
## # A tibble: 291 × 6
## # Groups: name [291]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 1960 Tammy 14365 4152075 0.00346 70.1
## 2 2005 Nevaeh 4610 3828460 0.00120 45.8
## 3 1940 Brenda 5460 2301630 0.00237 37.5
## 4 1885 Grover 774 240822 0.00321 36.0
## 5 1945 Cheryl 8170 2652029 0.00308 24.9
## 6 1955 Lori 4980 4012691 0.00124 23.2
## 7 2010 Khloe 5411 3672066 0.00147 23.2
## 8 1950 Debra 6189 3502592 0.00177 22.6
## 9 2010 Bentley 4001 3672066 0.00109 22.4
## 10 1935 Marlene 4840 2088487 0.00232 16.8
## # … with 281 more rows
The End.
Thanks DataCamp
- My Favorite Team -