Baby names

This is a Quarto document. It allows you to run your R analyses and then generate and publish them. You can type text in this white space. The slightly darker space below is called a chunk and is for your R code. You can switch back and forth between Source and Visual on the upper left. I personally prefer Source, but you may prefer Visual.

Often one of the first chunks you’ll see in a notebook has the packages you’ll be using. Using packages in R is a two-step process. First, you’ll need to download (what R calls ‘install’) the package. Do that in the Packages pane on the right: Click Install and download the packages you want to use. Next, you use the library() function to load the packages.

(Note: when you have library commands in your code that use packages that you have not yet installed, you may get a message at the top of your window asking you if you want to install them. Go ahead and do that if you want.)
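
If you prefer typing to clicking, the same check can be done in code. This is only a sketch: the object names needed and missing_pkgs are made up for this example, and the install line is commented out so nothing is downloaded unless you choose to run it.

```r
# Packages this notebook loads with library()
needed <- c("tidyverse", "wordcloud2", "babynames")

# Which of them are not installed yet? requireNamespace() returns FALSE for those.
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
missing_pkgs

# Uncomment the next line to install whatever is missing (needs an internet connection)
# install.packages(missing_pkgs)
```

An empty result means all three packages are already installed and the library() lines should run without complaint.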

In the dark area below the two lines library(tidyverse) and library(wordcloud2), type the following line: library(babynames)

Then press the little green arrow in the top right of the chunk to run everything.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(wordcloud2)
library(babynames)

Viewing the data

The package ‘babynames’ has data on children’s names from the Social Security Administration.

First let’s look at the data by typing the name of the data set, babynames (the same as the name of the package), in the chunk below. Then press the little green arrow at the top right of the chunk.

babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Another way to view it is with View(). Create a code chunk below by clicking the Code menu, and then Insert Chunk. Then type View(babynames). Make sure you pay attention to uppercase and lowercase!

There are variables for year, sex, name, n (the total number of people of that sex given that name that year), and prop (the proportion of people of that sex given that name that year).

Notice that there are almost 2 million rows in the data set!

One final way to look at a data set that you should know is with glimpse(). It will show the basic outline of the data: How many observations (rows), how many variables (columns), and the first several values of each variable.

Create another chunk below, this time using the little +C icon in the upper right (it does exactly the same thing as Code->Insert Chunk), and then use glimpse().
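
To give a sense of what glimpse() prints without giving away the exercise, here is a small sketch using mtcars, a data set that comes built into R and is not otherwise used in this notebook. Swap in babynames for your own chunk.

```r
library(dplyr)   # glimpse() is available once dplyr (or the tidyverse) is loaded

# One line per variable: its name, its type, and the first several values
glimpse(mtcars)
```

Each variable gets its own line, so wide data sets are easier to scan this way than in the usual table printout.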

Filtering the data and using the pipe

If you want to see the names for just one year, use filter(). The filter() command will find rows based on some condition, like a year or the sex of a name. Run the chunk below:

filter(babynames, year == 1900)
# A tibble: 3,730 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1900 F     Mary      16706 0.0526
 2  1900 F     Helen      6343 0.0200
 3  1900 F     Anna       6114 0.0192
 4  1900 F     Margaret   5304 0.0167
 5  1900 F     Ruth       4765 0.0150
 6  1900 F     Elizabeth  4096 0.0129
 7  1900 F     Florence   3920 0.0123
 8  1900 F     Ethel      3896 0.0123
 9  1900 F     Marie      3856 0.0121
10  1900 F     Lillian    3414 0.0107
# ℹ 3,720 more rows

A better way to do the above command uses a symbol from the tidyverse called the ‘pipe,’ which looks like this: %>% It’s the percent symbol followed by the greater than symbol followed by another percent. An easy way to type it in RStudio is with Ctrl-Shift-M (or Command-Shift-M on a Mac).

When you see the pipe, think “and then.” So the command in the chunk below says: take the babynames data, AND THEN filter it to the year 1900. It takes the first part, babynames, and sends it to the next part, filter(). Taking babynames out of the parentheses in filter() can make the command a little easier to read, especially when we start using longer sequences of commands.

babynames %>% 
  filter(year == 1900)
# A tibble: 3,730 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1900 F     Mary      16706 0.0526
 2  1900 F     Helen      6343 0.0200
 3  1900 F     Anna       6114 0.0192
 4  1900 F     Margaret   5304 0.0167
 5  1900 F     Ruth       4765 0.0150
 6  1900 F     Elizabeth  4096 0.0129
 7  1900 F     Florence   3920 0.0123
 8  1900 F     Ethel      3896 0.0123
 9  1900 F     Marie      3856 0.0121
10  1900 F     Lillian    3414 0.0107
# ℹ 3,720 more rows

The chunk above does exactly the same thing as the previous.

Look at the names: A few of the names are still popular today, but many are not.

To see the top “slice” of names, use slice_max(). slice_max(prop, n = 10) will give you the 10 rows with the largest values of prop.

babynames %>%
  filter(year == 1900) %>% 
  slice_max(prop, n = 10)
# A tibble: 10 × 5
    year sex   name        n   prop
   <dbl> <chr> <chr>   <int>  <dbl>
 1  1900 M     John     9829 0.0606
 2  1900 M     William  8579 0.0529
 3  1900 F     Mary    16706 0.0526
 4  1900 M     James    7245 0.0447
 5  1900 M     George   5403 0.0333
 6  1900 M     Charles  4099 0.0253
 7  1900 M     Robert   3821 0.0236
 8  1900 M     Joseph   3714 0.0229
 9  1900 M     Frank    3477 0.0214
10  1900 F     Helen    6343 0.0200

This table mixes male and female names in one list, which is hard to interpret because prop is calculated separately for each sex (the proportion of boys, or of girls, given that name). Let’s filter for female names only. Note that we have to put letters like “F” in quotes, but numbers should not have quotes.

babynames %>%
  filter(year == 1900, sex == "F") %>% 
  slice_max(prop, n = 10)
# A tibble: 10 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1900 F     Mary      16706 0.0526
 2  1900 F     Helen      6343 0.0200
 3  1900 F     Anna       6114 0.0192
 4  1900 F     Margaret   5304 0.0167
 5  1900 F     Ruth       4765 0.0150
 6  1900 F     Elizabeth  4096 0.0129
 7  1900 F     Florence   3920 0.0123
 8  1900 F     Ethel      3896 0.0123
 9  1900 F     Marie      3856 0.0121
10  1900 F     Lillian    3414 0.0107

Create a code chunk below and find the top 10 names for your sex and the year you were born.

Creating new variables with mutate()

An important command is mutate(). It creates a new variable. The prop variable is a little hard to read because of the decimal point, so converting it to a percent will make it more readable.

babynames %>%
  filter(year == 1966, sex == "M") %>% 
  mutate(percent = prop * 100) %>% 
  slice_max(prop, n = 10)
# A tibble: 10 × 6
    year sex   name        n   prop percent
   <dbl> <chr> <chr>   <int>  <dbl>   <dbl>
 1  1966 M     Michael 79992 0.0440    4.40
 2  1966 M     David   66419 0.0365    3.65
 3  1966 M     James   65180 0.0359    3.59
 4  1966 M     John    65038 0.0358    3.58
 5  1966 M     Robert  59333 0.0326    3.26
 6  1966 M     William 38260 0.0210    2.10
 7  1966 M     Mark    34805 0.0191    1.91
 8  1966 M     Richard 34467 0.0190    1.90
 9  1966 M     Jeffrey 30198 0.0166    1.66
10  1966 M     Thomas  29016 0.0160    1.60

Two more changes to simplify the way it looks: Using round() will get rid of some of the extra digits. round(x, 1) specifies 1 digit to the right of the decimal place. In addition, since we have percent, we don’t need to keep prop in our table. The ‘select’ function shows only the columns we want to show:

babynames %>%
  filter(year == 1966, sex == "M") %>% 
  mutate(percent = round(prop * 100, 1)) %>% 
  slice_max(prop, n = 10) %>% 
  select(year, sex, name, percent)
# A tibble: 10 × 4
    year sex   name    percent
   <dbl> <chr> <chr>     <dbl>
 1  1966 M     Michael     4.4
 2  1966 M     David       3.7
 3  1966 M     James       3.6
 4  1966 M     John        3.6
 5  1966 M     Robert      3.3
 6  1966 M     William     2.1
 7  1966 M     Mark        1.9
 8  1966 M     Richard     1.9
 9  1966 M     Jeffrey     1.7
10  1966 M     Thomas      1.6

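Another function is row_number(), which is useful here because the rows for each year and sex appear to be sorted from the most common name to the least common, so the row number corresponds to the popularity rank of the name.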

The following uses mutate() to create a new variable called ‘rank’ which is set to be equal to the row number of the name. Notice also that the # sign (pound sign or hashtag symbol) can be used to create comments in the code:

babynames %>%                                   # start with the data set
  filter(year == 1966, sex == "M") %>%          # choose only the year you want
  mutate(percent = round(prop * 100, 1)) %>%    # convert prop to percent
  mutate(rank = row_number()) %>%              # mutate() creates a new variable and calls it rank
  select(year, sex, name, percent, rank)       # show only certain variables
# A tibble: 4,536 × 5
    year sex   name    percent  rank
   <dbl> <chr> <chr>     <dbl> <int>
 1  1966 M     Michael     4.4     1
 2  1966 M     David       3.7     2
 3  1966 M     James       3.6     3
 4  1966 M     John        3.6     4
 5  1966 M     Robert      3.3     5
 6  1966 M     William     2.1     6
 7  1966 M     Mark        1.9     7
 8  1966 M     Richard     1.9     8
 9  1966 M     Jeffrey     1.7     9
10  1966 M     Thomas      1.6    10
# ℹ 4,526 more rows
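
The ranking above works only because the rows already come in order of popularity. If you would rather not depend on that, one option is to sort explicitly before numbering the rows. This is a sketch, not the only way to do it; the name ranked_1966 is made up for the example.

```r
library(dplyr)
library(babynames)

ranked_1966 <- babynames %>%
  filter(year == 1966, sex == "M") %>%   # choose the year and sex
  arrange(desc(n)) %>%                   # sort explicitly: most common name first
  mutate(rank = row_number()) %>%        # rank 1 is now the most common name by construction
  select(year, sex, name, n, rank)

head(ranked_1966, 3)
```

Names with exactly the same n still get different ranks this way, because row_number() simply counts rows.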

We can use this rank variable as a measure of the popularity of a particular name. If I want to see how popular Matthew was in 1966, the year I was born, I add another filter() command for my name:

babynames %>%                             
  filter(year == 1966, sex == "M") %>%    
  mutate(rank = row_number()) %>%         
  mutate(percent = round(prop * 100, 1)) %>% 
  filter(name == "Matthew")               
# A tibble: 1 × 7
   year sex   name        n    prop  rank percent
  <dbl> <chr> <chr>   <int>   <dbl> <int>   <dbl>
1  1966 M     Matthew 10807 0.00594    34     0.6

This shows that Matthew was the 34th most popular boy name in 1966, with 10,807 babies named Matthew, or less than 1% of baby boys that year.

Try this with a few other names, years, and sexes. Just copy-paste the above chunk, and then change the name, year, and sex.

Word clouds

A cute graphic for seeing the popularity of words is called a word cloud. It shows words (names in this case) sized by how often they occur.

babynames %>%
  filter(year == 2015) %>%     # use only one year
  filter(sex == "F") %>%       # use only one sex
  slice_max(prop, n=100) %>%   # use the top 100 names
  select(name, n) %>%          # select the two relevant variables: the name and how often it occurs
  wordcloud2()                 # generate the word cloud

There are supposed to be 100 names in a semi-circle, but I see fewer than that and no circle. The problem is that the font is too big to show the whole thing. This makes the font size smaller with ‘size = .5’ in wordcloud2().

babynames %>%
  filter(year == 2015) %>%     # use only one year
  filter(sex == "F") %>%       # use only one sex
  slice_max(prop, n=100) %>%   # use the top 100 names
  select(name, n) %>%          # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5)                 # generate the word cloud

That looks better. Hover over a name and it will show the name and how many babies were given that name that year. Click run again and it will generate a slightly different picture each time.

Some other parameters you can change include shape and color. Set shape = “star”, “triangle”, “pentagon”, or “diamond”. Set color = “pink” or “blue” etc.

babynames %>%
  filter(year == 2015) %>%     # use only one year
  filter(sex == "F") %>%       # use only one sex
  slice_max(prop, n=100) %>%   # use the top 100 names
  select(name, n) %>%          # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5, shape = "pentagon", color = "pink")                 # generate the word cloud

Copy and paste the chunk above to make one of your own with a different year and/or sex, and try a different shape and color (that last one was pretty ugly).

wordcloud2 also comes with a few different themes, or styles. This one uses WCtheme(2):

babynames %>%
  filter(year == 2015) %>%     # use only one year
  filter(sex == "F") %>%       # use only one sex
  slice_max(prop, n=100) %>%   # use the top 100 names
  select(name, n) %>%          # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5) + WCtheme(2)                 # generate the word cloud

Create a word cloud with a year and sex of your choice, size = .5, red colored words, a square shape, and theme 1.

Graphing changes in popularity over time

We can also look at a specific name’s popularity over time. ggplot2 is the most common graphing package in R, and it is also part of the tidyverse. Here’s one way to use it. Start with the data, then filter for just one name, then create the plot with ggplot() and aes (“aesthetics”) set so the x-axis is year and the y is the proportion of names. Then use the + to add the line “geometry” with geom_line().

babynames %>%                             # start with the babynames data
  filter(name == "Matthew") %>%           # look only at the name Matthew
  ggplot(aes(x = year, y = prop)) +       # create a graph with x and y aesthetics 
  geom_line()                             # make it a line graph

It’s weird that the line zig-zags back and forth most years from a number and then back down to nearly 0. That’s because very few girls are named Matthew, so we’re getting a sizable number for the boys but a value close to 0 for the girls each year. If you add sex == “M” to the filter it looks normal:

babynames %>%                                    # start with the data
  filter(name == "Matthew", sex == "M") %>%      # choose the name and sex
  ggplot(aes(x = year, y = prop)) +              # put year on the x-axis and prop (proportion) on y
  geom_line()                                    # make it a line graph 
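
One way to check what is behind the zig-zag is to count how many rows the name has for each sex. This is a sketch; the column names years and babies are made up for the example.

```r
library(dplyr)
library(babynames)

matthew_by_sex <- babynames %>%
  filter(name == "Matthew") %>%
  group_by(sex) %>%
  summarise(years = n(),        # number of rows (one per year) for this sex
            babies = sum(n))    # total babies given the name across those years

matthew_by_sex
```

If both sexes show up in the result, the unfiltered graph was drawing a single line through two different series, which is what produces the up-and-down pattern.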

This will make two final changes: 1. Mutate the prop variable into percent as above, and then use that as the y-axis, and 2. color the line blue.

babynames %>%                                    # start with the data
  filter(name == "Matthew", sex == "M") %>%      # choose the name and sex
  mutate(percent = round(prop * 100, 1)) %>%     # create a new variable called percent
  ggplot(aes(x = year, y = percent)) +           # put year on the x-axis and percent on y
  geom_line(color = "blue")                      # make it a line graph and give the line a color

I was born in 1966, and my name had increased in popularity about a decade before I was born, and then began to drop off in popularity about 15 years later. At its peak year, over 2% of baby boys were named Matthew.

Just out of curiosity I wanted to see if there were any girls named Matthew. I changed the filter to sex == “F” and I also changed the y-axis to y = n (the number of babies with that name), to see the absolute number of girls named Matthew rather than the proportion or percentage.

babynames %>%
  filter(name == "Matthew", sex == "F") %>% 
  ggplot(aes(x = year, y = n)) +
  geom_line()

During the 1970s and 1980s, a couple hundred girls each year were named Matthew! I wonder how many were really girls and how many of those were just errors in data coding. Did that many parents really name their girls Matthew?

My wife’s name is Michele (with one ‘l’), and her name peaked right around the time she was born:

babynames %>%
  filter(name == "Michele", sex == "F") %>% 
  mutate(percent = round(prop * 100, 1)) %>%
  ggplot(aes(x = year, y = percent)) +
  geom_line()

Let’s look at both of my daughters’ names in one graph. In filter(), I put both of their names separated by the vertical line |, which is a symbol for OR. Then I set color = name, so the two names will have different color lines.

babynames %>%
  filter(name == "Emma" | name == "Julia", sex == "F") %>%  
  mutate(percent = round(prop * 100, 1)) %>%  
  ggplot(aes(x = year, y = percent, color = name)) +
  geom_line()

Their names both had a peak around the time they were born (1999 and 2003), but they were both also popular over 100 years ago. In fact, my daughter Emma was named after my wife’s great-grandmother.
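
When the list of names gets longer, chaining | becomes tedious. A common alternative is %in%, which keeps rows whose name matches any entry in a vector. The sketch below compares the two approaches; with_or and with_in are names made up for the example.

```r
library(dplyr)
library(babynames)

# The OR version used above
with_or <- babynames %>%
  filter(name == "Emma" | name == "Julia", sex == "F")

# The same filter written with %in% and a vector of names
with_in <- babynames %>%
  filter(name %in% c("Emma", "Julia"), sex == "F")

# Check that both versions keep the same number of rows
nrow(with_or) == nrow(with_in)
```

To add a third or fourth name, you only need to extend the vector inside c().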

Is it possible to guess the year someone was born by their name? One way to do this is to get the peak year for that name:

babynames %>%                                     # Start with the dataset
  filter(name == "Michele", sex == "F") %>%       # only look at the name you want
  slice_max(prop, n = 1)                          # get the year with the highest proportion for that name
# A tibble: 1 × 5
   year sex   name        n    prop
  <dbl> <chr> <chr>   <int>   <dbl>
1  1968 F     Michele 11217 0.00656

My wife was born in 1968, so that’s a pretty good guess!
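
If you want to try this guess for many names, the steps can be wrapped in a small function. peak_year() is a hypothetical helper written for this example, not part of any package, and with_ties = FALSE is my own addition so that it always returns a single year.

```r
library(dplyr)
library(babynames)

# Hypothetical helper: the year in which a name had its highest proportion
peak_year <- function(the_name, the_sex) {
  babynames %>%
    filter(name == the_name, sex == the_sex) %>%     # only the name and sex you want
    slice_max(prop, n = 1, with_ties = FALSE) %>%    # keep exactly one row: the peak
    pull(year)                                       # return just the year
}

peak_year("Michele", "F")
```

For Michele this should agree with the year in the table above. The arguments are called the_name and the_sex so they do not clash with the name and sex columns inside filter().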

Let’s get the top 10 years for Matthew. slice_max() returns them already sorted in descending order of prop, so no separate arrange() step is needed.

babynames %>%                                  # Start with the dataset
  filter(name == "Matthew", sex == "M") %>%    # only look at the name and sex you want
  slice_max(prop, n = 10)                      # get the top 10 years
# A tibble: 10 × 5
    year sex   name        n   prop
   <dbl> <chr> <chr>   <int>  <dbl>
 1  1983 M     Matthew 50214 0.0269
 2  1984 M     Matthew 49775 0.0265
 3  1985 M     Matthew 47073 0.0245
 4  1986 M     Matthew 46923 0.0244
 5  1982 M     Matthew 46060 0.0244
 6  1987 M     Matthew 46481 0.0238
 7  1981 M     Matthew 43330 0.0233
 8  1988 M     Matthew 45868 0.0229
 9  1989 M     Matthew 45371 0.0217
10  1990 M     Matthew 44800 0.0208

The peak “Matthew” year was 1983, followed by years mostly right around that same time.

Copy and paste the chunk above and try a few different names and sexes. Could you have guessed when people you know were born?
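
If you would rather see a name’s top years in chronological order instead of by popularity, arrange() can re-sort them. This is a sketch; top_years is a name made up for the example.

```r
library(dplyr)
library(babynames)

top_years <- babynames %>%
  filter(name == "Matthew", sex == "M") %>%   # only the name and sex you want
  slice_max(prop, n = 10) %>%                 # the 10 years with the highest proportion
  arrange(year)                               # re-sort those years from earliest to latest

top_years
```

Use arrange(desc(year)) instead if you want the most recent year first.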

Famous names

Some names become popular when there is a famous person with that name. See how Barack was non-existent prior to 2007, peaked the year he became president, and then dropped off again.

babynames %>%
  filter(name == "Barack") %>% 
  ggplot(aes(x = year, y = n)) +
  geom_line()

The Disney movie The Little Mermaid came out in 1989 with Ariel as its star character, Aladdin with Princess Jasmine came out in 1992, and Frozen with Elsa came out in 2013.

babynames %>%
  filter(name == "Ariel" | name == "Elsa" | name == "Jasmine") %>% 
  filter(sex == "F") %>% 
  ggplot(aes(x = year, y = n, color = name)) +
  geom_line()

To get a better look at this, filter for years only after 1980, with filter(year > 1980).

Copy and paste the chunk above, but add filter(year > 1980) %>% on a new line right after filter(sex == “F”) %>% :

It looks like Ariel peaked in popularity a few years after Little Mermaid came out in 1989, and Jasmine was very popular around the time Aladdin came out in 1992. Elsa never really became popular, but the movie came out in 2013 and the dataset ends in 2015. Maybe we’re about to have a bumper crop of Elsas.

Assignment

Create a new R Notebook and create a report about your name. You may use someone else’s name if you prefer. To do this, click on the File -> New File -> R Notebook menu. You should keep the tab with this notebook open so you can see it too. Feel free to copy and paste between this notebook and your new project.

In the white space around the chunks of code, write brief descriptions of the analysis and what the results show. Write it so that a friend or family member could read and understand it.

  1. Determine your name’s rank in the year you were born.
  2. Create a word cloud of the names of your sex and the year you were born.
  3. Graph its popularity over time.
  4. Create a table showing which years it was most popular.
  5. Graph its popularity in comparison to another name or two (e.g., a friend, family member, etc.). To keep it simple, use other names of the same sex.
  6. Publish it to RPubs.