We are going to practise data manipulation with the dplyr package, using the babynames dataset that is included in the babynames package from CRAN.
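The code in this notebook relies on dplyr, babynames, and ggplot2 being attached. A minimal setup sketch, assuming the three packages are already installed from CRAN:

```r
# Attach the packages used throughout the notebook.
# If any are missing, install them first, e.g.
# install.packages(c("dplyr", "babynames", "ggplot2"))
library(dplyr)
library(babynames)
library(ggplot2)

# Check what kind of object the dataset is
class(babynames)
```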
Explore the babynames data
If we check the class of the babynames object, we can see that it is already a data frame. We keep four of its columns and rename n to number:
babynames <- babynames %>%
  select(year, sex, name, number = n)
Now we can explore the first and last rows to see what the data looks like:
head(babynames)
tail(babynames)
We can see that the data is from 1880 to 2017.
glimpse(babynames)
Rows: 1,924,665
Columns: 4
$ year   <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1…
$ sex    <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name   <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margare…
$ number <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1…
We have 4 columns:
- year
- sex
- name
- number
Filtering and arranging for one year
The dplyr verbs are useful for exploring data. For instance, you could find out the most common names in a particular year.
babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order
  arrange(desc(number))
It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.
Using top_n with babynames
We could also use group_by and top_n to find the most common name in every year.
# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
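In dplyr 1.0.0 and later, top_n() is documented as superseded by slice_max(). The sketch below shows the same "top row per group" idea with slice_max(); the toy tibble and its values are made up for illustration and are not taken from babynames.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  year   = c(2000, 2000, 2000, 2001, 2001, 2001),
  name   = c("Ann", "Bob", "Cy", "Ann", "Bob", "Cy"),
  number = c(10, 30, 20, 25, 5, 15)
)

# Keep the row with the largest number within each year
top_per_year <- toy %>%
  group_by(year) %>%
  slice_max(number, n = 1) %>%
  ungroup()

top_per_year
```

Unlike top_n(), slice_max() takes the column first and the number of rows as the named argument n. By default it keeps ties, so a group can return more than one row.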
Visualizing names with ggplot2
The dplyr package is very useful for exploring data, and it’s especially useful when combined with other tidyverse packages such as ggplot2.
# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"), sex == "M")
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

It looks like Steven and Thomas were common in the 1950s, whereas Matthew became common more recently.
Grouped mutates
Finding the year each name is most common
We’re going to find the year in which each name was most common.
To do this, we’ll combine the grouped mutate approach with top_n.
# Find the year each name is most common
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  group_by(name) %>%
  top_n(1, fraction)
Notice that the result is grouped by name, but the rows keep their original order, which starts with 1880. The first few entries are therefore names that were most common in the 1880s.
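The same pipeline can be traced by hand on a tiny table. This is a sketch with made-up numbers (the tibble toy is hypothetical), showing both the per-year fraction and the fact that top_n() filters rows without reordering them.

```r
library(dplyr)

# Toy data (hypothetical values): two names over two years
toy <- tibble(
  year   = c(2000, 2000, 2001, 2001),
  name   = c("Ann", "Bob", "Ann", "Bob"),
  number = c(10, 30, 30, 10)
)

peak <- toy %>%
  # Total births per year (40 in both years here)
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  # Share of each name within its year
  mutate(fraction = number / year_total) %>%
  # Keep, for each name, the row where its share is highest
  group_by(name) %>%
  top_n(1, fraction) %>%
  ungroup()

peak
```

Ann peaks in 2001 and Bob in 2000, each with a fraction of 0.75. Bob's row comes first in the output because it comes first in the input.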
Adding the total and maximum for each name
We’ll normalize by a different, but also interesting, metric: we’ll divide each name’s yearly count by the maximum count for that name. This means that every name will peak at 1.
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table
  ungroup() %>%
  # Add the fraction_max column: the number divided by the name maximum
  mutate(fraction_max = number / name_max)
This tells you, for example, that the name Mary was at 9.5% of its peak in the year 1880.
Visualizing the normalized change in popularity
We picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where we’re focused on the relative change within each name rather than the overall popularity of the name.
We’ll visualize the normalized popularity of each name.
names_normalized <- babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)
# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"), sex == "M")
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()

As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
Window functions
A window function takes a vector and returns another vector of the same length.
v <- c(1, 3, 6, 14)
v
[1] 1 3 6 14
We can lag the vector with dplyr’s lag(), which shifts every value one position to the right and fills the first position with NA:
lag(v)
[1] NA 1 3 6
We can compare consecutive steps and calculate the changes:
v - lag(v)
[1] NA 2 3 8
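A few variations on the same vector may help. This sketch assumes dplyr is attached so that lag() and lead() refer to the dplyr functions rather than stats::lag(), which works on time series and behaves differently.

```r
library(dplyr)

v <- c(1, 3, 6, 14)

# Shift by two positions instead of one
shifted_two <- lag(v, n = 2)

# Fill the gap with a chosen value instead of NA
filled <- lag(v, default = 0)

# lead() shifts in the opposite direction
next_value <- lead(v)

# Ratio of each value to the previous one
ratios <- v / lag(v)

shifted_two
filled
next_value
ratios
```

Each result has the same length as v, which is what makes these functions convenient inside mutate().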
Using ratios to describe the frequency of a name
What if instead of finding the difference, you wanted to find the ratio?
babynames_fraction <- babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
babynames_fraction %>%
  # Arrange the data in order of name, then year
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column: each row's fraction divided by the previous row's
  mutate(ratio = fraction / lag(fraction))
Notice that the first observation for each name is missing a ratio, since there is no previous year.
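The effect of grouping on lag() is easier to see on a small table. The sketch below uses a made-up tibble (toy and its fraction values are hypothetical) with one row per name and year.

```r
library(dplyr)

# Toy data (hypothetical values): one row per name and year
toy <- tibble(
  name     = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  year     = c(2000, 2001, 2002, 2000, 2001),
  fraction = c(0.1, 0.2, 0.1, 0.4, 0.2)
)

ratios <- toy %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # lag() restarts within each name, so the first row of each name is NA
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

ratios
```

Ann's ratios are NA, 2, and 0.5, and Bob's are NA and 0.5. In the full babynames data a name can appear once per sex in the same year, so grouping by name alone places those rows next to each other. Grouping by both name and sex would be one way to keep the comparison strictly between years, although that is a suggestion and not part of the original analysis.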
Biggest jumps in a name
Now we’ll take a subset of that data, called babynames_ratios_filtered, which drops rows where the fraction is below 0.00001, and use it to examine the names that experienced the biggest jumps in popularity between consecutive years.
babynames_ratios_filtered <- babynames_fraction %>%
  arrange(name, year) %>%
  group_by(name) %>%
  mutate(ratio = fraction / lag(fraction)) %>%
  filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
  # Extract the largest ratio from each name
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)