We are going to practise using the dplyr package by manipulating the babynames dataset, which ships with the babynames package on CRAN.

# Load packages
library(dplyr)
library(babynames)
library(ggplot2)

Explore the babynames data

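If we check the class of the babynames object, we can see that it is already a data frame (a tibble), so it works with dplyr directly. We keep the year, sex, name and n columns, renaming n to number: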

babynames <- babynames %>% 
  select(year, sex, name, number = n)
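The class check mentioned above is not shown in the code, so here is a minimal sketch of it. It assumes the babynames package is installed; the variable name is_df is only illustrative.

```r
library(babynames)

# Inspect the class of the dataset as shipped with the package
class(babynames::babynames)

# TRUE when the object can be used anywhere a data frame is expected
is_df <- inherits(babynames::babynames, "data.frame")
is_df
```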

So, we can explore the first and last rows to see what it looks like:

head(babynames)
tail(babynames)

We can see that the data cover the years 1880 to 2017.
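Rather than reading the year range off head() and tail(), it can also be computed directly. This is a small sketch, assuming the babynames package is installed; the last year returned depends on the installed version of the package, and year_span is an illustrative name.

```r
library(dplyr)
library(babynames)

# Summarise the first and last year present in the data
year_span <- babynames::babynames %>%
  summarise(first_year = min(year), last_year = max(year))

year_span
```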

glimpse(babynames)
Rows: 1,924,665
Columns: 4
$ year   <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1…
$ sex    <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name   <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margare…
$ number <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1…

We have 4 columns:
 - year
 - sex
 - name
 - number

Filtering and arranging for one year

The dplyr verbs are useful for exploring data. For instance, you could find out the most common names in a particular year.

babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order 
  arrange(desc(number))

It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.

Using top_n with babynames

We could also use group_by and top_n to find the most common name in every year.

# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)

It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
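Since dplyr 1.0.0, top_n() is superseded by slice_max(), so the same idea can be sketched with the newer verb. The example below uses a made-up toy table (the counts are not real babynames values) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data in the same shape as babynames (values are made up)
toy <- tibble(
  year   = c(1880, 1880, 1880, 1881, 1881),
  name   = c("John", "Mary", "Anna", "John", "Mary"),
  number = c(90, 70, 30, 60, 80)
)

# Keep the row with the largest number within each year
top_per_year <- toy %>%
  group_by(year) %>%
  slice_max(number, n = 1) %>%
  ungroup()

top_per_year
```

Unlike top_n(), slice_max() also sorts the selected rows in descending order within each group, and by default it keeps ties.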

Visualizing names with ggplot2

The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.

# Filter for the names Steven, Thomas, and Matthew 
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"), sex == "M")

# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

It looks like Steven and Thomas were common in the 1950s, while Matthew became common more recently.

Grouped mutates

Finding the year each name is most common

We’re going to explore in which year each name was most common.

To do this, we’ll combine the grouped mutate approach with top_n.

# Find the year each name is most common 
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  group_by(name) %>%
  top_n(1, fraction)

Notice that the result is still grouped by name and that the rows keep their original order, which is sorted by year, so the first few entries are names whose share peaked in the 1880s.
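To see what the grouped mutate does step by step, here is a minimal sketch on a made-up toy table with two names and two years. The table toy and the result name peak_years are illustrative, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: two names observed in two years
toy <- tibble(
  year   = c(2000, 2000, 2001, 2001),
  name   = c("Ann", "Bob", "Ann", "Bob"),
  number = c(30, 70, 60, 40)
)

peak_years <- toy %>%
  group_by(year) %>%
  # Total births per year (100 in both toy years)
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  # Share of each name within its year
  mutate(fraction = number / year_total) %>%
  group_by(name) %>%
  # Keep, for each name, the row where its share is largest
  top_n(1, fraction) %>%
  ungroup()

peak_years
```

In this toy table Ann's share is highest in 2001 and Bob's in 2000, so one row per name is returned.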

Adding the total and maximum for each name

We’ll now normalize by a different but also interesting metric: we’ll divide each name’s count by the maximum count for that name. This means that every name will peak at 1.

babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table 
  ungroup() %>%
  # Add the fraction_max column containing the number divided by the name maximum
  mutate(fraction_max = number / name_max)

This tells you, for example, that the name Mary was at 9.5% of its peak in the year 1880.

Visualizing the normalized change in popularity

We picked a few names and calculated each of them as a fraction of its peak. This is a way of “normalizing” a name, where we’re focused on the relative change within each name rather than the overall popularity of the name.

We’ll visualize the normalized popularity of each name.

names_normalized <- babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)

# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"), sex == "M")

# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()

As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
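The claim that every name peaks at 1 can also be checked numerically instead of visually. The sketch below does this on a made-up toy table; toy and peaks are illustrative names, and the counts are not real data.

```r
library(dplyr)

# Hypothetical toy data: two names over three years
toy <- tibble(
  name   = c("Ann", "Ann", "Ann", "Bob", "Bob", "Bob"),
  year   = c(2000, 2001, 2002, 2000, 2001, 2002),
  number = c(30, 60, 15, 70, 40, 35)
)

peaks <- toy %>%
  group_by(name) %>%
  # Normalize each count by the maximum count of the same name
  mutate(fraction_max = number / max(number)) %>%
  # Report the peak value and the year in which it occurs
  summarise(peak = max(fraction_max),
            peak_year = year[which.max(fraction_max)])

peaks
```

Each name's peak is 1 by construction, while peak_year differs between names.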

Window functions

A window function takes a vector and returns another vector of the same length.

v <- c(1, 3, 6, 14)
v
[1]  1  3  6 14

We can lag the vector with dplyr’s lag(), which shifts every value one position to the right:

lag(v)
[1] NA  1  3  6

We can compare consecutive steps and calculate the changes:

v - lag(v)
[1] NA  2  3  8
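A few related variations on the same vector may help, shown here as a small sketch. It calls the functions with an explicit dplyr:: prefix because base R also has stats::lag(), which is meant for time series objects and does not shift a plain vector this way.

```r
library(dplyr)

v <- c(1, 3, 6, 14)

# lead() is the mirror image of lag(): it shifts values to the left
led <- dplyr::lead(v)

# n controls how many positions to shift
lagged_two <- dplyr::lag(v, n = 2)

# default replaces the NA that is introduced at the start
lagged_zero <- dplyr::lag(v, default = 0)

# Change over two steps instead of one
change_two <- v - lagged_two
```

All four results have the same length as v, which is what makes them usable inside mutate().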

Using ratios to describe the frequency of a name

What if instead of finding the difference, you wanted to find the ratio?

babynames_fraction <- babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)

babynames_fraction %>%
  # Arrange the data in order of name, then sex, then year
  arrange(name, sex, year) %>%
  # Group by name and sex so that lag() never compares the F and M rows of the same name
  group_by(name, sex) %>%
  # Add a ratio column that contains the ratio between consecutive observations
  mutate(ratio = fraction / lag(fraction))

Notice that the first observation for each name and sex is missing a ratio, since there is no previous year.
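The effect of grouping before calling lag() is easiest to see on a tiny example. This sketch uses a made-up toy table with one row per name and year; toy and ratios are illustrative names.

```r
library(dplyr)

# Hypothetical toy data: yearly shares for two names
toy <- tibble(
  name     = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  year     = c(2000, 2001, 2002, 2000, 2001),
  fraction = c(0.10, 0.20, 0.05, 0.30, 0.15)
)

ratios <- toy %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # lag() restarts within each group, so Bob's first row is not
  # compared with Ann's last row
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

ratios
```

The first row of each name gets NA, and the remaining rows hold the share relative to the previous row of the same name.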

Biggest jumps in a name

Now we’ll build a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.

babynames_ratios_filtered <- babynames_fraction %>%
  arrange(name, sex, year) %>%
  # Group by name and sex so that each ratio compares the same name and sex
  group_by(name, sex) %>%
  mutate(ratio = fraction / lag(fraction)) %>%
  filter(fraction >= 0.00001)

babynames_ratios_filtered %>%
  # Extract the largest ratio from each group
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)
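One caveat: lag() compares consecutive rows, which are consecutive years only if a name appears in every year. As a sketch of one possible guard (an assumption on our part, not part of the original analysis), the ratio can be set to NA whenever the previous row is not exactly one year earlier. The toy table and the name guarded are illustrative.

```r
library(dplyr)

# Hypothetical toy data with a gap between 2001 and 2005
toy <- tibble(
  name     = c("Ann", "Ann", "Ann"),
  year     = c(2000, 2001, 2005),
  fraction = c(0.1, 0.2, 0.8)
)

guarded <- toy %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # Only keep the ratio when the previous row is the previous year
  mutate(ratio = if_else(year - lag(year) == 1,
                         fraction / lag(fraction),
                         NA_real_)) %>%
  ungroup()

guarded
```

Here the jump from 2001 to 2005 is not counted as a year-over-year change, so its ratio is NA.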