Data Manipulation with dplyr

Instructor: Chris Cardillo-Datacamp


0.1 Introducion

This course has some similarities to other DataCamp courses, especially Introduction to the Tidyverse, and if you’ve taken those courses, this material might be review.

1 Transforming Data with dplyr

Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.

1.1 The counties dataset

1.1.1 Understanding your data

Take a look at the counties dataset using the glimpse() function.

What is the first value in the income variable?

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(dplyr)
counties <- read_csv("acs2017_county_data.csv")
## Rows: 3220 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): State, County
## dbl (35): CountyId, TotalPop, Men, Women, Hispanic, White, Black, Native, As...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(counties)
## Rows: 3,220
## Columns: 37
## $ CountyId         <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017,…
## $ State            <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ County           <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ TotalPop         <dbl> 55036, 203360, 26201, 22580, 57667, 10478, 20126, 115…
## $ Men              <dbl> 26899, 99527, 13976, 12251, 28490, 5616, 9416, 55593,…
## $ Women            <dbl> 28137, 103833, 12225, 10329, 29177, 4862, 10710, 5993…
## $ Hispanic         <dbl> 2.7, 4.4, 4.2, 2.4, 9.0, 0.3, 0.3, 3.6, 2.2, 1.6, 7.7…
## $ White            <dbl> 75.4, 83.1, 45.7, 74.6, 87.4, 21.6, 52.2, 72.7, 56.2,…
## $ Black            <dbl> 18.9, 9.5, 47.8, 22.0, 1.5, 75.6, 44.7, 20.4, 39.3, 5…
## $ Native           <dbl> 0.3, 0.8, 0.2, 0.4, 0.3, 1.0, 0.1, 0.2, 0.3, 0.5, 0.4…
## $ Asian            <dbl> 0.9, 0.7, 0.6, 0.0, 0.1, 0.7, 1.1, 1.0, 1.0, 0.1, 0.4…
## $ Pacific          <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ VotingAgeCitizen <dbl> 41016, 155376, 20269, 17662, 42513, 8212, 15459, 8838…
## $ Income           <dbl> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 4368…
## $ IncomeErr        <dbl> 2838, 1348, 2551, 3431, 2630, 5376, 2701, 1491, 2011,…
## $ IncomePerCap     <dbl> 27824, 29364, 17561, 20911, 22021, 20856, 19004, 2363…
## $ IncomePerCapErr  <dbl> 2024, 735, 798, 1889, 850, 2355, 943, 793, 1205, 1354…
## $ Poverty          <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8,…
## $ ChildPoverty     <dbl> 20.1, 16.1, 44.9, 26.6, 25.4, 50.4, 34.8, 26.6, 29.1,…
## $ Professional     <dbl> 35.3, 35.7, 25.0, 24.4, 28.5, 19.7, 26.9, 29.0, 24.3,…
## $ Service          <dbl> 18.0, 18.2, 16.8, 17.6, 12.9, 17.1, 17.3, 17.5, 13.5,…
## $ Office           <dbl> 23.2, 25.6, 22.6, 19.7, 23.3, 18.6, 18.5, 23.7, 23.0,…
## $ Construction     <dbl> 8.1, 9.7, 11.5, 15.9, 15.8, 14.0, 11.6, 10.4, 11.6, 1…
## $ Production       <dbl> 15.4, 10.8, 24.1, 22.4, 19.5, 30.6, 25.7, 19.4, 27.6,…
## $ Drive            <dbl> 86.0, 84.7, 83.4, 86.4, 86.8, 73.1, 83.6, 85.0, 87.1,…
## $ Carpool          <dbl> 9.6, 7.6, 11.1, 9.5, 10.2, 15.7, 12.6, 9.2, 9.7, 12.1…
## $ Transit          <dbl> 0.1, 0.1, 0.3, 0.7, 0.1, 0.3, 0.0, 0.2, 0.2, 0.4, 0.1…
## $ Walk             <dbl> 0.6, 0.8, 2.2, 0.3, 0.4, 6.2, 0.9, 1.3, 0.6, 0.3, 0.6…
## $ OtherTransp      <dbl> 1.3, 1.1, 1.7, 1.7, 0.4, 1.7, 0.9, 1.1, 0.5, 0.3, 1.8…
## $ WorkAtHome       <dbl> 2.5, 5.6, 1.3, 1.5, 2.1, 3.0, 2.0, 3.2, 2.0, 2.0, 1.7…
## $ MeanCommute      <dbl> 25.8, 27.0, 23.4, 30.0, 35.0, 29.8, 23.2, 24.8, 23.6,…
## $ Employed         <dbl> 24112, 89527, 8878, 8171, 21380, 4290, 7727, 47392, 1…
## $ PrivateWork      <dbl> 74.1, 80.7, 74.1, 76.0, 83.9, 81.4, 79.1, 74.9, 84.5,…
## $ PublicWork       <dbl> 20.2, 12.9, 19.1, 17.4, 11.9, 13.6, 15.3, 19.9, 11.8,…
## $ SelfEmployed     <dbl> 5.6, 6.3, 6.5, 6.3, 4.0, 5.0, 5.3, 5.1, 3.7, 8.1, 4.5…
## $ FamilyWork       <dbl> 0.1, 0.1, 0.3, 0.3, 0.1, 0.0, 0.3, 0.1, 0.0, 0.0, 0.4…
## $ Unemployment     <dbl> 5.2, 5.5, 12.4, 8.2, 4.9, 12.1, 7.6, 10.1, 6.4, 5.3, …

1.1.2 Selecting calumns

Select the following four columns from the counties variable: * state * county * population * poverty You don’t need to save the result to a variable.

counties %>%
select("state", "county", "population", "poverty")

1.2 The filter and arrange verbs

1.2.1 Arranging observations

Here you see the counties_selected dataset with a few interesting variables selected. These variables: private_work, public_work, self_employed describe whether people work for the government, for private companies, or for themselves.

In these exercises, you’ll sort these observations to find the most interesting cases.

counties_selected <- counties %>%
  select(State, County, TotalPop, PrivateWork, PublicWork, SelfEmployed, Walk)

# Add a verb to sort in descending order of public_work
counties_selected %>%
  arrange(desc(PublicWork))
## # A tibble: 3,220 × 7
##    State        County        TotalPop PrivateWork PublicWork SelfEmployed  Walk
##    <chr>        <chr>            <dbl>       <dbl>      <dbl>        <dbl> <dbl>
##  1 South Dakota Oglala Lakot…    14291        31.1       64.8          4.1  11.9
##  2 Wisconsin    Menominee Co…     4506        34.3       62.3          3.2   4.6
##  3 Hawaii       Kalawao Coun…       86        34.9       61.9          3.2  40.6
##  4 North Dakota Sioux County      4420        34.3       58.3          7.1   5.5
##  5 Alaska       Kusilvak Cen…     8129        41         57.8          1.1  45.5
##  6 South Dakota Buffalo Coun…     2048        38.9       57.7          2.2   7  
##  7 South Dakota Todd County      10016        34.4       57.4          6.8   5.7
##  8 Alaska       Yukon-Koyuku…     5453        38.8       56.5          4.7  39.8
##  9 Alaska       Lake and Pen…     1301        40.4       54.2          5    36.6
## 10 South Dakota Dewey County      5709        33.7       50.2         14.8   2.9
## # … with 3,210 more rows

1.2.2 Filtering for conditions

You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.

# Filter for counties with a population above 1000000
counties_selected %>%
  filter(TotalPop > 1000000)
## # A tibble: 44 × 7
##    State      County          TotalPop PrivateWork PublicWork SelfEmployed  Walk
##    <chr>      <chr>              <dbl>       <dbl>      <dbl>        <dbl> <dbl>
##  1 Arizona    Maricopa County  4155501        82.8       11.1          5.9   1.5
##  2 Arizona    Pima County      1007257        75.7       17.6          6.5   2.3
##  3 California Alameda County   1629615        79.3       13.2          7.3   3.7
##  4 California Contra Costa C…  1123678        78         13.4          8.4   1.7
##  5 California Los Angeles Co… 10105722        79.3       11.2          9.3   2.7
##  6 California Orange County    3155816        82         10.1          7.7   1.9
##  7 California Riverside Coun…  2355002        77.7       14.7          7.5   1.6
##  8 California Sacramento Cou…  1495400        71.5       21.3          6.9   1.9
##  9 California San Bernardino…  2121220        77.3       16.2          6.4   1.7
## 10 California San Diego Coun…  3283665        77.8       14.2          7.8   2.9
## # … with 34 more rows
# Filter for counties in the state of California that have a population above 1000000
counties_selected %>%
  filter(State == "California",
  TotalPop > 1000000)
## # A tibble: 9 × 7
##   State      County           TotalPop PrivateWork PublicWork SelfEmployed  Walk
##   <chr>      <chr>               <dbl>       <dbl>      <dbl>        <dbl> <dbl>
## 1 California Alameda County    1629615        79.3       13.2          7.3   3.7
## 2 California Contra Costa Co…  1123678        78         13.4          8.4   1.7
## 3 California Los Angeles Cou… 10105722        79.3       11.2          9.3   2.7
## 4 California Orange County     3155816        82         10.1          7.7   1.9
## 5 California Riverside County  2355002        77.7       14.7          7.5   1.6
## 6 California Sacramento Coun…  1495400        71.5       21.3          6.9   1.9
## 7 California San Bernardino …  2121220        77.3       16.2          6.4   1.7
## 8 California San Diego County  3283665        77.8       14.2          7.8   2.9
## 9 California Santa Clara Cou…  1911226        84.7        9.3          5.9   2.1

Now you know that there are 9 counties in the state of California with a population greater than one million. In the next exercise, you’ll practice filtering and then sorting a dataset to focus on specific observations!

1.2.3 Filtering and arranging

We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.

# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(State == "Texas", TotalPop > 10000) %>%
  arrange(desc(PrivateWork))
## # A tibble: 168 × 7
##    State County           TotalPop PrivateWork PublicWork SelfEmployed  Walk
##    <chr> <chr>               <dbl>       <dbl>      <dbl>        <dbl> <dbl>
##  1 Texas Andrews County      17577        85.3        8.7          5.7   0  
##  2 Texas Collin County      914075        84.3        9.7          5.9   0.8
##  3 Texas Dallas County     2552213        84.3        9.1          6.4   1.5
##  4 Texas Gregg County       123402        84.3       10.1          5.5   2  
##  5 Texas Calhoun County      21821        83.8       10.5          5.7   0.9
##  6 Texas Titus County        32664        83.5        9.9          6.5   0.6
##  7 Texas Harris County     4525519        83.4        9.8          6.6   1.5
##  8 Texas Jefferson County   254574        83         13            3.9   0.7
##  9 Texas Tarrant County    1983675        83         11.3          5.6   1.2
## 10 Texas Midland County     159883        82.9        9.7          7.3   0.8
## # … with 158 more rows

You’ve learned how to filter and sort a dataset to answer questions about the data. Notice that you only need to slightly modify your code if you are interested in sorting the observations by a different column.

1.3 Mutate

1.3.1 Calculating the number of government employees

Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.

# Add a new column public_workers with the number of people employed in public work
counties_selected %>%
  mutate(public_workers = PublicWork * TotalPop / 100) %>%
  arrange(desc(public_workers))
## # A tibble: 3,220 × 8
##    State      County          TotalPop PrivateWork PublicWork SelfEmployed  Walk
##    <chr>      <chr>              <dbl>       <dbl>      <dbl>        <dbl> <dbl>
##  1 California Los Angeles Co… 10105722        79.3       11.2          9.3   2.7
##  2 Illinois   Cook County      5238541        84.2       11.1          4.6   4.4
##  3 California San Diego Coun…  3283665        77.8       14.2          7.8   2.9
##  4 Arizona    Maricopa County  4155501        82.8       11.1          5.9   1.5
##  5 Texas      Harris County    4525519        83.4        9.8          6.6   1.5
##  6 New York   Kings County     2635121        79.4       13.9          6.6   8.6
##  7 California Riverside Coun…  2355002        77.7       14.7          7.5   1.6
##  8 California San Bernardino…  2121220        77.3       16.2          6.4   1.7
##  9 New York   Queens County    2339280        80.1       13.7          6.1   5.8
## 10 California Orange County    3155816        82         10.1          7.7   1.9
## # … with 3,210 more rows, and 1 more variable: public_workers <dbl>

It looks like Los Angeles is the county with the most government employees.

1.3.2 Calculating the percentage of women in a county

The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.

# Select the columns state, county, population, men, and women
counties_selected <- counties %>%
  select(State, County, TotalPop, Men, Women, Walk)

# Calculate proportion_women as the fraction of the population made up of women
counties_selected %>%
  mutate(proportion_women = Women/TotalPop)
## # A tibble: 3,220 × 7
##    State   County          TotalPop   Men  Women  Walk proportion_women
##    <chr>   <chr>              <dbl> <dbl>  <dbl> <dbl>            <dbl>
##  1 Alabama Autauga County     55036 26899  28137   0.6            0.511
##  2 Alabama Baldwin County    203360 99527 103833   0.8            0.511
##  3 Alabama Barbour County     26201 13976  12225   2.2            0.467
##  4 Alabama Bibb County        22580 12251  10329   0.3            0.457
##  5 Alabama Blount County      57667 28490  29177   0.4            0.506
##  6 Alabama Bullock County     10478  5616   4862   6.2            0.464
##  7 Alabama Butler County      20126  9416  10710   0.9            0.532
##  8 Alabama Calhoun County    115527 55593  59934   1.3            0.519
##  9 Alabama Chambers County    33895 16320  17575   0.6            0.519
## 10 Alabama Cherokee County    25855 12862  12993   0.3            0.503
## # … with 3,210 more rows

Notice that the proportion_women variable was added as a column to the counties_selected dataset, and the data now has 6 columns instead of 5.

1.3.3 Select, mutate, filter, and arrange

In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.

counties %>%
  # Select the five columns 
  select(State, County, TotalPop, Men, Women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = Men/TotalPop) %>%
  # Filter for population of at least 10,000
  filter(TotalPop >= 10000) %>%
  # Arrange proportion of men in descending order 
  arrange(desc(proportion_men))
## # A tibble: 2,509 × 6
##    State      County                TotalPop   Men Women proportion_men
##    <chr>      <chr>                    <dbl> <dbl> <dbl>          <dbl>
##  1 Georgia    Chattahoochee County     11096  7455  3641          0.672
##  2 Louisiana  West Feliciana Parish    15376 10210  5166          0.664
##  3 California Lassen County            31470 20786 10684          0.661
##  4 Virginia   Sussex County            11595  7552  4043          0.651
##  5 Florida    Union County             15300  9922  5378          0.648
##  6 Missouri   DeKalb County            12564  8007  4557          0.637
##  7 Texas      Jones County             19969 12658  7311          0.634
##  8 Virginia   Greensville County       11606  7238  4368          0.624
##  9 Arkansas   Lincoln County           13885  8643  5242          0.622
## 10 Texas      Madison County           13979  8649  5330          0.619
## # … with 2,499 more rows

Notice Sussex County in Virginia is more than two thirds male: this is because of two men’s prisons in the county.

2 Aggregating Data

Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.

2.1 The count verb

2.1.1 Counting by region

The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.

counties_selected <- counties %>%
select(county, region, state, population, citizens)
# Use count to find the number of counties in each state
counties_selected %>%
  count(State, sort = TRUE)
## # A tibble: 52 × 2
##    State              n
##    <chr>          <int>
##  1 Texas            254
##  2 Georgia          159
##  3 Virginia         133
##  4 Kentucky         120
##  5 Missouri         115
##  6 Kansas           105
##  7 Illinois         102
##  8 North Carolina   100
##  9 Iowa              99
## 10 Tennessee         95
## # … with 42 more rows

2.1.2 Counting citizens by state

You can weigh your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.

counties_selected <- counties %>%
  select(county, region, state, population, citizens)
# Find number of counties per state, weighted by population
counties_selected %>%
  count(State, wt=TotalPop, sort = TRUE)
## # A tibble: 52 × 2
##    State                 n
##    <chr>             <dbl>
##  1 California     38982847
##  2 Texas          27419612
##  3 Florida        20278447
##  4 New York       19798228
##  5 Illinois       12854526
##  6 Pennsylvania   12790505
##  7 Ohio           11609756
##  8 Georgia        10201635
##  9 North Carolina 10052564
## 10 Michigan        9925568
## # … with 42 more rows

2.1.3 Mutating and counting

You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”

You’ll use the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.

counties_selected <- counties %>%
  select(county, region, state, population, walk)
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = Walk * TotalPop/100) %>%
  # Count weighted by the new column
  count(State, wt = population_walk, sort = TRUE)
## # A tibble: 52 × 2
##    State                n
##    <chr>            <dbl>
##  1 New York      1214914.
##  2 California    1008788.
##  3 Pennsylvania   492609.
##  4 Texas          427780.
##  5 Illinois       391265.
##  6 Massachusetts  323859.
##  7 Florida        291701.
##  8 New Jersey     266857.
##  9 Ohio           260880.
## 10 Washington     250556.
## # … with 42 more rows

We can see that while California had the largest total population, New York state has the largest number of people who walk to work.

2.2 The group by, summarize and ungroup verbs

2.2.1 Summarizing

The summarize() verb is very useful for collapsing a large dataset into a single observation.

counties_selected <- counties %>%
  select(county, population, income, unemployment)
# Summarize to find minimum population, maximum unemployment, and average income
counties %>%
  summarize(min_population = min(TotalPop), max_unemployment = max(Unemployment), average_income = mean(Income))
## # A tibble: 1 × 3
##   min_population max_unemployment average_income
##            <dbl>            <dbl>          <dbl>
## 1             74             40.9         48995.

If we wanted to take this a step further, we could use filter() to determine the specific counties that returned the value for min_population and max_unemployment.

2.2.2 Summarizing by state

We can see for example the percentage of men per state.

counties_selected %>%
  group_by(State) %>%
  summarize(total_men = sum(Men),
            total_population = sum(TotalPop)) %>%
  mutate(percentage_men = total_men / total_population) %>%
  arrange(desc(percentage_men))
## # A tibble: 52 × 4
##    State        total_men total_population percentage_men
##    <chr>            <dbl>            <dbl>          <dbl>
##  1 Alaska          386319           738565          0.523
##  2 North Dakota    382121           745475          0.513
##  3 Wyoming         298301           583200          0.511
##  4 South Dakota    430587           855444          0.503
##  5 Utah           1506614          2993941          0.503
##  6 Montana         517860          1029862          0.503
##  7 Colorado       2731315          5436519          0.502
##  8 Hawaii          713981          1421658          0.502
##  9 Nevada         1450091          2887725          0.502
## 10 Idaho           830627          1657375          0.501
## # … with 42 more rows

2.2.3 Summarizing by state and region

You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

counties_selected <- counties %>%
  select(region, state, county, population)

It looks like the South has the highest average_pop of 7370486, while the North Central region has the highest median_pop of 5580644.

2.3 The top_n verb

Selecting a county from each region Previously, you used the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count to find the total number of people who walk to work in each county.

Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.

counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)
  
counties_selected %>% 
  group_by(State) %>% 
  # Find the greatest number of citizens who walk to work
  top_n(1, TotalPop)
## # A tibble: 52 × 6
## # Groups:   State [52]
##    State                County                 TotalPop     Men   Women  Walk
##    <chr>                <chr>                     <dbl>   <dbl>   <dbl> <dbl>
##  1 Alabama              Jefferson County         659460  312333  347127   1.4
##  2 Alaska               Anchorage Municipality   298225  152311  145914   3.4
##  3 Arizona              Maricopa County         4155501 2055464 2100037   1.5
##  4 Arkansas             Pulaski County           392848  188310  204538   1.3
##  5 California           Los Angeles County     10105722 4979641 5126081   2.7
##  6 Colorado             Denver County            678467  339400  339067   4.4
##  7 Connecticut          Fairfield County         947328  461380  485948   2.6
##  8 Delaware             New Castle County        555036  268818  286218   2.4
##  9 District of Columbia District of Columbia     672391  319046  353345  13.2
## 10 Florida              Miami-Dade County       2702602 1311997 1390605   2.1
## # … with 42 more rows
counties_selected <- counties %>%
  select(State, County, Unemployment)

counties_selected %>% 
  group_by(State) %>% 
  top_n(3, Unemployment)
## # A tibble: 161 × 3
## # Groups:   State [52]
##    State    County                    Unemployment
##    <chr>    <chr>                            <dbl>
##  1 Alabama  Clarke County                     15.2
##  2 Alabama  Monroe County                     21.6
##  3 Alabama  Wilcox County                     17.2
##  4 Alaska   Kusilvak Census Area              28.8
##  5 Alaska   Northwest Arctic Borough          20.1
##  6 Alaska   Yukon-Koyukuk Census Area         19.7
##  7 Arizona  Apache County                     14.1
##  8 Arizona  Graham County                     12.2
##  9 Arizona  Navajo County                     16.3
## 10 Arkansas Desha County                      12.2
## # … with 151 more rows

2.3.1 Finding the highest-income state in each region

You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.

When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.

counties_selected <- counties %>%
  select(region, state, county, population, income)
  
counties_selected <- counties %>%
  select(State, County, TotalPop, Income)

counties_selected %>%
  group_by(County, State) %>%
   # Calculate average income
  summarize(average_income = mean(Income) ) %>%
  # Find the highest income state in each region
  top_n(1, average_income)
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
## # A tibble: 1,955 × 3
## # Groups:   County [1,955]
##    County              State          average_income
##    <chr>               <chr>                   <dbl>
##  1 Abbeville County    South Carolina          35254
##  2 Acadia Parish       Louisiana               40492
##  3 Accomack County     Virginia                42260
##  4 Ada County          Idaho                   60151
##  5 Adair County        Iowa                    49477
##  6 Adams County        Colorado                64087
##  7 Addison County      Vermont                 61875
##  8 Adjuntas Municipio  Puerto Rico             11680
##  9 Aguada Municipio    Puerto Rico             16199
## 10 Aguadilla Municipio Puerto Rico             16821
## # … with 1,945 more rows

From our results, we can see that the New Jersey in the Northeast is the state with the highest average_income of 73014.

2.3.2 Using summarize, top_n, and count together

Five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). We’ll use all of them to answer a question: In how many states women population is greater than men?

counties_selected <- counties %>%
  select(State, County, Men, Women)
# Extract the most populated row for each state
counties_selected %>%
  group_by(State, County) %>%
  summarize(women_total = sum(Women),
            men_total = sum(Men),
            women_more_men = women_total > men_total) %>%
  top_n(1, women_total) %>% 
  ungroup() %>% 
  count(women_more_men)
## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 2
##   women_more_men     n
##   <lgl>          <int>
## 1 FALSE              8
## 2 TRUE              44

Notice that 44 states have more women than men.

3 Selecting and Transforming Data

This chapter focuses on advanced methods of selecting and transforming columns. There are a ton of variables in the counties dataset, and often you only want to work with a subset of them.

3.1 Selecting

3.1.1 Selecting columns

Using the select() verb, we can answer interesting questions about our dataset by focusing in on related groups of verbs. The colon (:) is useful for getting many columns at a time.

counties %>%
  # Select state, county, population, and industry-related columns
  select(State, County, TotalPop, Professional:Production) %>% 
  # Arrange service in descending order 
  arrange(desc(Service))
## # A tibble: 3,220 × 8
##    State     County TotalPop Professional Service Office Construction Production
##    <chr>     <chr>     <dbl>        <dbl>   <dbl>  <dbl>        <dbl>      <dbl>
##  1 Utah      Dagge…      702         19.8    46.4    4.8         13         16  
##  2 Texas     Kinne…     3631         24.2    38.6   10.7         21.6        4.9
##  3 Texas     Hudsp…     3702         17.8    38.4   14.1         17.5       12.2
##  4 Puerto R… Viequ…     8931         20.9    38.4   16.4         16.9        7.3
##  5 Hawaii    Kalaw…       86         22.2    38.1   20.6          0         19  
##  6 Wisconsin Menom…     4506         22      37.4   19.9          7.8       12.9
##  7 Texas     Culbe…     2257         19.3    36     17.9         19.7        7.1
##  8 Alaska    Denal…     2303         23.1    35.2   20.9         10.1       10.8
##  9 Kentucky  Wolfe…     7251         18.7    35     16.5         15.3       14.5
## 10 New York  Bronx…  1455846         25      33.4   23.5          7.1       11.1
## # … with 3,210 more rows

3.1.2 Select helpers

You learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.

Below a list of useful helpers: contains() starts_with() ends_with() last_col()

You can see all typing ?select_helpers.

counties %>%
  # Select the state, county, population, and those ending with "work"
  select(State, County, TotalPop, ends_with("Work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(PublicWork >= 50)
## # A tibble: 10 × 6
##    State        County                TotalPop PrivateWork PublicWork FamilyWork
##    <chr>        <chr>                    <dbl>       <dbl>      <dbl>      <dbl>
##  1 Alaska       Kusilvak Census Area      8129        41         57.8        0  
##  2 Alaska       Lake and Peninsula B…     1301        40.4       54.2        0.3
##  3 Alaska       Yukon-Koyukuk Census…     5453        38.8       56.5        0  
##  4 Hawaii       Kalawao County              86        34.9       61.9        0  
##  5 North Dakota Sioux County              4420        34.3       58.3        0.3
##  6 South Dakota Buffalo County            2048        38.9       57.7        1.2
##  7 South Dakota Dewey County              5709        33.7       50.2        1.4
##  8 South Dakota Oglala Lakota County     14291        31.1       64.8        0  
##  9 South Dakota Todd County              10016        34.4       57.4        1.4
## 10 Wisconsin    Menominee County          4506        34.3       62.3        0.3

3.2 The rename verb

3.2.1 Renaming a column after count

The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.

# Rename the n column to num_counties
counties %>%
  count(State) %>%
  rename(num_counties = n)
## # A tibble: 52 × 2
##    State                num_counties
##    <chr>                       <int>
##  1 Alabama                        67
##  2 Alaska                         29
##  3 Arizona                        15
##  4 Arkansas                       75
##  5 California                     58
##  6 Colorado                       64
##  7 Connecticut                     8
##  8 Delaware                        3
##  9 District of Columbia            1
## 10 Florida                        67
## # … with 42 more rows

Notice the difference between column names in the output from the first step to the second step. Don’t forget, using rename() isn’t the only way to choose a new name for a column!

3.2.2 Renaming a column as part of a select

rename() isn’t the only way you can choose a new name for a column; you can also choose a name as part of a select().

# Select state, county, and poverty as poverty_rate
counties %>%
  select(State, County, poverty_rate = Poverty)
## # A tibble: 3,220 × 3
##    State   County          poverty_rate
##    <chr>   <chr>                  <dbl>
##  1 Alabama Autauga County          13.7
##  2 Alabama Baldwin County          11.8
##  3 Alabama Barbour County          27.2
##  4 Alabama Bibb County             15.2
##  5 Alabama Blount County           15.6
##  6 Alabama Bullock County          28.5
##  7 Alabama Butler County           24.4
##  8 Alabama Calhoun County          18.6
##  9 Alabama Chambers County         18.8
## 10 Alabama Cherokee County         16.1
## # … with 3,210 more rows

As you can see, we were able to select the four columns of interest from our dataset, and rename one of those columns, using only the select() verb!

3.3 The transmute verb

Recall, you can think of transmute() as a combination of select() and mutate(), since you are getting back a subset of columns, but you are transforming and changing them at the same time.

3.3.1 Using transmute

As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.

counties %>%
  # Keep the state, county, and populations columns, and add a active column
  transmute(State, County, TotalPop, employment_rate = Employed/TotalPop) %>%
  # Filter for counties with a population greater than one million 
  filter(TotalPop > 1000000) %>%
  # Sort active in ascending order 
  arrange(employment_rate)
## # A tibble: 44 × 4
##    State        County                TotalPop employment_rate
##    <chr>        <chr>                    <dbl>           <dbl>
##  1 New York     Bronx County           1455846           0.408
##  2 Michigan     Wayne County           1763822           0.409
##  3 California   San Bernardino County  2121220           0.410
##  4 California   Riverside County       2355002           0.416
##  5 Pennsylvania Philadelphia County    1569657           0.424
##  6 Arizona      Pima County            1007257           0.430
##  7 California   Sacramento County      1495400           0.446
##  8 Texas        Bexar County           1892004           0.459
##  9 Florida      Palm Beach County      1426772           0.459
## 10 New York     Kings County           2635121           0.462
## # … with 34 more rows

3.3.2 Choosing among the four verbs

In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.

Examples using different verbs:

# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = Unemployment)
## # A tibble: 3,220 × 37
##    CountyId State County TotalPop   Men  Women Hispanic White Black Native Asian
##       <dbl> <chr> <chr>     <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1     1001 Alab… Autau…    55036 26899  28137      2.7  75.4  18.9    0.3   0.9
##  2     1003 Alab… Baldw…   203360 99527 103833      4.4  83.1   9.5    0.8   0.7
##  3     1005 Alab… Barbo…    26201 13976  12225      4.2  45.7  47.8    0.2   0.6
##  4     1007 Alab… Bibb …    22580 12251  10329      2.4  74.6  22      0.4   0  
##  5     1009 Alab… Bloun…    57667 28490  29177      9    87.4   1.5    0.3   0.1
##  6     1011 Alab… Bullo…    10478  5616   4862      0.3  21.6  75.6    1     0.7
##  7     1013 Alab… Butle…    20126  9416  10710      0.3  52.2  44.7    0.1   1.1
##  8     1015 Alab… Calho…   115527 55593  59934      3.6  72.7  20.4    0.2   1  
##  9     1017 Alab… Chamb…    33895 16320  17575      2.2  56.2  39.3    0.3   1  
## 10     1019 Alab… Chero…    25855 12862  12993      1.6  91.8   5      0.5   0.1
## # … with 3,210 more rows, and 26 more variables: Pacific <dbl>,
## #   VotingAgeCitizen <dbl>, Income <dbl>, IncomeErr <dbl>, IncomePerCap <dbl>,
## #   IncomePerCapErr <dbl>, Poverty <dbl>, ChildPoverty <dbl>,
## #   Professional <dbl>, Service <dbl>, Office <dbl>, Construction <dbl>,
## #   Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>, Walk <dbl>,
## #   OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## #   PrivateWork <dbl>, PublicWork <dbl>, SelfEmployed <dbl>, …
# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(State, County, contains("Poverty"))
## # A tibble: 3,220 × 4
##    State   County          Poverty ChildPoverty
##    <chr>   <chr>             <dbl>        <dbl>
##  1 Alabama Autauga County     13.7         20.1
##  2 Alabama Baldwin County     11.8         16.1
##  3 Alabama Barbour County     27.2         44.9
##  4 Alabama Bibb County        15.2         26.6
##  5 Alabama Blount County      15.6         25.4
##  6 Alabama Bullock County     28.5         50.4
##  7 Alabama Butler County      24.4         34.8
##  8 Alabama Calhoun County     18.6         26.6
##  9 Alabama Chambers County    18.8         29.1
## 10 Alabama Cherokee County    16.1         20  
## # … with 3,210 more rows
# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = Women / TotalPop)
## # A tibble: 3,220 × 38
##    CountyId State County TotalPop   Men  Women Hispanic White Black Native Asian
##       <dbl> <chr> <chr>     <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1     1001 Alab… Autau…    55036 26899  28137      2.7  75.4  18.9    0.3   0.9
##  2     1003 Alab… Baldw…   203360 99527 103833      4.4  83.1   9.5    0.8   0.7
##  3     1005 Alab… Barbo…    26201 13976  12225      4.2  45.7  47.8    0.2   0.6
##  4     1007 Alab… Bibb …    22580 12251  10329      2.4  74.6  22      0.4   0  
##  5     1009 Alab… Bloun…    57667 28490  29177      9    87.4   1.5    0.3   0.1
##  6     1011 Alab… Bullo…    10478  5616   4862      0.3  21.6  75.6    1     0.7
##  7     1013 Alab… Butle…    20126  9416  10710      0.3  52.2  44.7    0.1   1.1
##  8     1015 Alab… Calho…   115527 55593  59934      3.6  72.7  20.4    0.2   1  
##  9     1017 Alab… Chamb…    33895 16320  17575      2.2  56.2  39.3    0.3   1  
## 10     1019 Alab… Chero…    25855 12862  12993      1.6  91.8   5      0.5   0.1
## # … with 3,210 more rows, and 27 more variables: Pacific <dbl>,
## #   VotingAgeCitizen <dbl>, Income <dbl>, IncomeErr <dbl>, IncomePerCap <dbl>,
## #   IncomePerCapErr <dbl>, Poverty <dbl>, ChildPoverty <dbl>,
## #   Professional <dbl>, Service <dbl>, Office <dbl>, Construction <dbl>,
## #   Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>, Walk <dbl>,
## #   OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## #   PrivateWork <dbl>, PublicWork <dbl>, SelfEmployed <dbl>, …
# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(State, County, employment_rate = Employed / TotalPop)
## # A tibble: 3,220 × 3
##    State   County          employment_rate
##    <chr>   <chr>                     <dbl>
##  1 Alabama Autauga County            0.438
##  2 Alabama Baldwin County            0.440
##  3 Alabama Barbour County            0.339
##  4 Alabama Bibb County               0.362
##  5 Alabama Blount County             0.371
##  6 Alabama Bullock County            0.409
##  7 Alabama Butler County             0.384
##  8 Alabama Calhoun County            0.410
##  9 Alabama Chambers County           0.429
## 10 Alabama Cherokee County           0.382
## # … with 3,210 more rows

4 Case Study: The babynames Dataset

Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.

4.1 The babynames data

babynames <- readRDS("babynames.rds")
babynames
## # A tibble: 332,595 × 3
##     year name    number
##    <dbl> <chr>    <int>
##  1  1880 Aaron      102
##  2  1880 Ab           5
##  3  1880 Abbie       71
##  4  1880 Abbott       5
##  5  1880 Abby         6
##  6  1880 Abe         50
##  7  1880 Abel         9
##  8  1880 Abigail     12
##  9  1880 Abner       27
## 10  1880 Abraham     81
## # … with 332,585 more rows

4.1.1 Filtering and arranging for one year

The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.

babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order 
  arrange(desc(number))
## # A tibble: 21,223 × 3
##     year name        number
##    <dbl> <chr>        <int>
##  1  1990 Michael      65560
##  2  1990 Christopher  52520
##  3  1990 Jessica      46615
##  4  1990 Ashley       45797
##  5  1990 Matthew      44925
##  6  1990 Joshua       43382
##  7  1990 Brittany     36650
##  8  1990 Amanda       34504
##  9  1990 Daniel       33963
## 10  1990 David        33862
## # … with 21,213 more rows

4.1.2 Using top_n with babynames

You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by() and top_n() to find the most common name in every year.

babynames %>%
  # Find the most common name in each year
  group_by(year) %>%
  top_n(1, number)
## # A tibble: 28 × 3
## # Groups:   year [28]
##     year name  number
##    <dbl> <chr>  <int>
##  1  1880 John    9701
##  2  1885 Mary    9166
##  3  1890 Mary   12113
##  4  1895 Mary   13493
##  5  1900 Mary   16781
##  6  1905 Mary   16135
##  7  1910 Mary   22947
##  8  1915 Mary   58346
##  9  1920 Mary   71175
## 10  1925 Mary   70857
## # … with 18 more rows

4.1.3 Visualizing names with ggplot2

The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2

selected_names <- babynames %>%
  # Filter for the names Steven, Thomas, and Matthew 
  filter(name %in% c("Steven", "Thomas", "Matthew"))
selected_names <- babynames %>%
  # Filter for the names Steven, Thomas, and Matthew 
  filter(name %in% c("Steven", "Thomas", "Matthew"))

# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

4.2 Grouped mutates

4.2.1 Finding the year each name is most common

In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.

To do this, you’ll be combining the grouped mutate approach with a top_n.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  # Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
## # A tibble: 48,040 × 5
## # Groups:   name [48,040]
##     year name      number year_total  fraction
##    <dbl> <chr>      <int>      <int>     <dbl>
##  1  1880 Abbott         5     201478 0.0000248
##  2  1880 Abe           50     201478 0.000248 
##  3  1880 Abner         27     201478 0.000134 
##  4  1880 Adelbert      28     201478 0.000139 
##  5  1880 Adella        26     201478 0.000129 
##  6  1880 Adolf          6     201478 0.0000298
##  7  1880 Adolph        93     201478 0.000462 
##  8  1880 Agustus        5     201478 0.0000248
##  9  1880 Albert      1493     201478 0.00741  
## 10  1880 Albertina      7     201478 0.0000347
## # … with 48,030 more rows

Notice that the results are grouped by year, then name, so the first few entries are names that were most popular in the 1880’s that start with the letter A.

4.2.2 Adding the total and maximum for each name

In the video, you learned how you could group by the year and use mutate() to add a total for that year.

In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.

Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutates slower.

babynames %>%
  # Add columns name_total and name_max for each name
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table 
  ungroup() %>%
  # Add the fraction_max column containing the number by the name maximum 
  mutate(fraction_max = number/name_max)
## # A tibble: 332,595 × 6
##     year name    number name_total name_max fraction_max
##    <dbl> <chr>    <int>      <int>    <int>        <dbl>
##  1  1880 Aaron      102     114739    14635     0.00697 
##  2  1880 Ab           5         77       31     0.161   
##  3  1880 Abbie       71       4330      445     0.160   
##  4  1880 Abbott       5        217       51     0.0980  
##  5  1880 Abby         6      11272     1753     0.00342 
##  6  1880 Abe         50       1832      271     0.185   
##  7  1880 Abel         9      10565     3245     0.00277 
##  8  1880 Abigail     12      72600    15762     0.000761
##  9  1880 Abner       27       1552      199     0.136   
## 10  1880 Abraham     81      17882     2449     0.0331  
## # … with 332,585 more rows

This tells you, for example, that the name Abe was at 18.5% of its peak in the year 1880.

4.2.3 Visualizing the normalized change in popularity

You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.

In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.

names_normalized <- babynames %>%
                     group_by(name) %>%
                     mutate(name_total = sum(number),
                            name_max = max(number)) %>%
                     ungroup() %>%
                     mutate(fraction_max = number / name_max)

names_filtered <- names_normalized %>%
  # Filter for the names Steven, Thomas, and Matthew
  filter(name %in% c( "Steven", "Thomas", "Matthew")) 

# Visualize these names over time
ggplot(names_filtered, aes(x= year, y= fraction_max, color = name)) + geom_line()

> As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.

4.3 Window functions

4.3.1 Using ratios to describe the frequency of a name

You learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?

babynames_fraction <- babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  # Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
babynames_fraction %>%
  # Arrange the data in order of name, then year 
  arrange(name,year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio of fraction between each year 
  mutate(ratio = fraction/lag(fraction))
## # A tibble: 48,040 × 6
## # Groups:   name [48,040]
##     year name     number year_total   fraction ratio
##    <dbl> <chr>     <int>      <int>      <dbl> <dbl>
##  1  2015 Aaban        15    3648781 0.00000411    NA
##  2  2015 Aadam        22    3648781 0.00000603    NA
##  3  2010 Aadan        11    3672066 0.00000300    NA
##  4  2015 Aadarsh      15    3648781 0.00000411    NA
##  5  2010 Aaden       450    3672066 0.000123      NA
##  6  2015 Aadhav       31    3648781 0.00000850    NA
##  7  2015 Aadhavan      5    3648781 0.00000137    NA
##  8  2015 Aadhya      265    3648781 0.0000726     NA
##  9  2010 Aadi         54    3672066 0.0000147     NA
## 10  2005 Aadil        20    3828460 0.00000522    NA
## # … with 48,030 more rows

Notice that the first observation for each name is missing a ratio, since there is no previous year.

4.3.2 Biggest jumps in a name

Previously, you added a ratio column to describe the ratio of the frequency of a baby name between consecutive years to describe the changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.

babynames_fraction <- babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  # Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
babynames_fraction %>%
  # Arrange the data in order of name, then year 
  arrange(name,year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio of fraction between each year 
  mutate(ratio = fraction/lag(fraction))
## # A tibble: 48,040 × 6
## # Groups:   name [48,040]
##     year name     number year_total   fraction ratio
##    <dbl> <chr>     <int>      <int>      <dbl> <dbl>
##  1  2015 Aaban        15    3648781 0.00000411    NA
##  2  2015 Aadam        22    3648781 0.00000603    NA
##  3  2010 Aadan        11    3672066 0.00000300    NA
##  4  2015 Aadarsh      15    3648781 0.00000411    NA
##  5  2010 Aaden       450    3672066 0.000123      NA
##  6  2015 Aadhav       31    3648781 0.00000850    NA
##  7  2015 Aadhavan      5    3648781 0.00000137    NA
##  8  2015 Aadhya      265    3648781 0.0000726     NA
##  9  2010 Aadi         54    3672066 0.0000147     NA
## 10  2005 Aadil        20    3828460 0.00000522    NA
## # … with 48,030 more rows
babynames_ratios_filtered <- babynames_fraction %>%
                     arrange(name, year) %>%
                     group_by(name) %>%
                     mutate(ratio = fraction / lag(fraction)) %>%
                     filter(fraction >= 0.00001)

babynames_ratios_filtered %>%
  # Extract the largest ratio from each name 
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order 
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)
## # A tibble: 0 × 6
## # Groups:   name [0]
## # … with 6 variables: year <dbl>, name <chr>, number <int>, year_total <int>,
## #   fraction <dbl>, ratio <dbl>

Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.

5 Conclusions

5.1 Summary

select()

filter()

mutate()

arrange()

count()

group_by()

summarize()

5.2 Other DataCamp courses

  • Exploratory Data Analysis in R: Case Study
  • Working with Data in the Tidyverse
  • Machine Learning in the Tidyverse
  • Categorical Data in the Tidyverse