Babynames: Looking at the Last Letters

Author

Katelyn Litvan

Introduction and Hypothesis:

For my project, I wanted to look at the most and least popular last letters in names. I think this will be a unique look at how names have changed over time. Based on my own life experience, my hypothesis is that the most popular last letters will be “y,” “a,” and “n,” but let’s find out!

To begin, I loaded all of the packages I would be using:

library(babynames)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(directlabels)

Extracting the Last Letter:

First, I have to extract the last letter from each name. I did this by combining a few functions. To make sure I was using them properly, I used the help function to check each function's arguments and confirm I was passing in the right information.

I knew I had to use mutate because I was creating a new variable:

?mutate()

I also knew I had to use substr, as I needed to extract the last character from each string in a character vector (name):

?substr()

Finally, after doing some internet research, I figured out I needed the nchar() function, which returns the number of characters in each name and therefore the position of its last character:

?nchar()

So, putting it all together, I figured out (with some trial and error) that I needed nchar(name) as both the start and stop arguments of substr(), since I was pulling only the last character from each name. I then created a new data frame, babynames2, with my new variable, last_letter:

babynames2 <- babynames |>
  mutate(last_letter = substr(name, nchar(name), nchar(name)))

It worked! Now to answer some of my questions, and begin to explore my initial hypothesis.
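As a quick sanity check of the extraction step, here is a minimal sketch in base R on a few made-up names (toy_names and its values are hypothetical and are not taken from the babynames data). It shows that substr() with nchar() in both positions returns exactly one character per name:

```r
# Toy names (hypothetical examples, not taken from the babynames data)
toy_names <- c("Mary", "John", "Emma", "Jo")

# nchar() gives each name's length, which is also the position of its last character
name_lengths <- nchar(toy_names)

# substr(x, start, stop) with start equal to stop returns a single character
last_letters <- substr(toy_names, name_lengths, name_lengths)

print(last_letters)
```

If stringr is already loaded through the tidyverse, str_sub(name, -1) should give the same result, because negative positions count from the end of the string.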

Visualization #1:

First, I wanted to create a bar chart with all 26 letters to see where each last letter stands in terms of all-time popularity (or at least since 1880). To begin, I made a new data frame that tallies the usage of each last letter:

last_letter_counts <- babynames2 |> 
  group_by(last_letter) |> 
  summarize(count = n()) |> 
  arrange(desc(count))

last_letter_counts contains 26 rows, one per letter. Because summarize(count = n()) counts rows, each count is the number of records in the package (one per name, sex, and year) whose name ends in that letter, not the number of babies. Looking at my new data frame off the bat, my hypothesis was almost right! a, e, n, and y are the most popular last letters. Opening this data frame is a neat and easy way to get familiar with the data, but now we will make a visualization that conveys the same information:
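To illustrate the difference between counting records and counting babies, here is a small sketch on made-up data. The toy data frame, its values, and the column names rows and babies are assumptions for illustration only; the n column plays the same role as the n column in babynames.

```r
library(dplyr)

# Hypothetical toy rows in the same shape as babynames2 (values are made up)
toy <- data.frame(
  name = c("Mary", "Amy", "Emma", "John"),
  n = c(100, 5, 20, 80),
  last_letter = c("y", "y", "a", "n")
)

toy_counts <- toy |>
  group_by(last_letter) |>
  summarize(
    rows = n(),      # number of records ending in this letter
    babies = sum(n)  # number of babies those records represent
  ) |>
  arrange(desc(babies))

print(toy_counts)
```

The two columns can rank letters differently, so which one to use depends on whether "popular" should mean many distinct name records or many babies.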

options(scipen = 10000000)

ggplot(last_letter_counts, aes(x = reorder(last_letter, count), y = count)) +
  geom_col() +
  ggtitle("The Popularity of Last Letters in US Baby Names") +
  xlab("Last Letter") +
  ylab("Count")

Notice that I used reorder() to sort last_letter by count, so the letters appear from least to most popular. I also used the options() function to prevent scientific notation.
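For anyone curious what reorder() does on its own, this is a minimal sketch with three made-up letters and counts (both vectors are hypothetical). It returns a factor whose levels are sorted by the second argument, which is what ggplot uses to order the bars:

```r
# Hypothetical letters and counts, for illustration only
letters_toy <- c("a", "b", "c")
counts_toy <- c(3, 1, 2)

# reorder() returns a factor whose levels are sorted by increasing count
ordered_letters <- reorder(letters_toy, counts_toy)

print(levels(ordered_letters))
```

Wrapping the second argument in desc() or negating it would flip the order to most-to-least popular.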

I think this visualization is a great snapshot of the data. You can see all 26 letters in order from least to most popular, and it is very easy to read. Viewers can easily find where their own name’s last letter ranks overall and make their own observations about where each letter falls. Each letter is clearly included, and I think this is a great first visualization to get familiar with the data.

Visualization #2:

Now, we’re going to get a little more complex and look at each letter’s popularity over time. I think this will be interesting because it will allow us to better understand the trends associated with each last letter. The sorted_year data frame adds up prop for every name ending in a given letter in a given year. In other words, it holds up to 26 totals (one per letter) for each year since 1880. Because prop in babynames is calculated within each sex, these totals combine the male and female proportions rather than giving a share of all babies.

sorted_year <- babynames2 |> 
  group_by(last_letter, year) |> 
  summarize(total_prop = sum(prop))
`summarise()` has grouped output by 'last_letter'. You can override using the
`.groups` argument.
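The message above is informational rather than an error. As a minimal sketch on made-up data (the toy data frame and its values are assumptions, not part of babynames), passing .groups = "drop" to summarize() returns an ungrouped result and should silence the message:

```r
library(dplyr)

# Hypothetical toy rows in the same shape as babynames2 (values are made up)
toy <- data.frame(
  last_letter = c("a", "a", "n", "n"),
  year = c(1880, 1880, 1880, 1881),
  prop = c(0.02, 0.01, 0.03, 0.04)
)

toy_sorted <- toy |>
  group_by(last_letter, year) |>
  # .groups = "drop" removes all grouping from the result
  summarize(total_prop = sum(prop), .groups = "drop")

print(toy_sorted)
```

Whether to drop the grouping is a matter of preference here, since the plot works with either a grouped or an ungrouped data frame.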

I did get a message (not an error) from summarize() here. It just says that the output is still grouped by last_letter, and the result was exactly what I wanted. Now let’s visualize it using a line graph. I added direct labels so that each letter’s path can be followed over time.

library(directlabels)
ggplot(sorted_year, aes(year, total_prop, color = last_letter, group = last_letter)) +
  geom_line() +
  ggtitle("Total Popularity of Last Letters Over Time") +
  geom_dl(aes(label = last_letter), method = list("last.points"))

This visualization definitely packs in a lot of information, but I think it shows a lot of interesting trends. My eyes first went to the rise of “n” and the fall of “e,” and the line graph gives a viewer a better understanding of how these last letters have shifted over time. The sorted_year data frame is the star of the show in this visualization, as it captures how popular each last letter was in each year.

Visualization #3:

Now, let’s take a closer look at my original hypothesized top letters by pulling the top 5 last letters from the data and looking at their individual patterns.

I used slice_head to grab the top 5 rows from my last_letter_counts data frame (this is the one with 26 observations, if you recall), which is already sorted from most to least common. I then researched the pull() function to grab the last_letter column and save just that as a character vector, top_last. This basically just confirms the top 5 last letters (a, e, n, y, l) and saves them under top_last.

top_last <- last_letter_counts |> 
  slice_head(n = 5) |> 
  pull(last_letter)

Then, this part got a little tricky. I needed to filter my babynames2 data frame for the top letters (just grabbing the data for a, e, n, y, l).

top_data <- babynames2 |> 
  filter(last_letter %in% top_last)

Now let’s create a faceted plot with five panels to look at each last letter’s popularity over time:

ggplot(top_data, aes(x = year, group = last_letter, color = last_letter)) +
  geom_line(stat = "sum", aes(y = prop)) +
  ggtitle("Popularity of Top Five Last Letters in US Baby Names Over Time") +
  xlab("Year") +
  ylab("Proportion") +
  facet_wrap(~last_letter, scales = "free_y")

For that last facet_wrap line, I wanted all the y axes to be independent of each other so I could really examine the trends. That is how I came across scales = “free_y”.

Looking at this graph, my first observation is that none of these five last letters has grown in popularity. This suggests that these last letters no longer “dominate” names, and that less popular last letters are being used more. In the earliest decades of the data, names were less unique, which aligns with my finding that these five letters appeared more frequently. Now names are more unique, which means other, less common last letters are getting their time to shine! Interesting. My only problem with the graph is that n keeps appearing, even though I did not include it in my ggplot call. It most likely comes from stat = “sum”, which computes a count named n for each point and adds it to the plot as an extra aesthetic.
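One caveat about the plot code: in ggplot2, stat = "sum" counts how many observations fall on each point; it does not add up the y values. That means the lines above trace individual name proportions rather than yearly totals per letter. A hedged alternative, sketched here on made-up data (toy_top, toy_totals, and all values are assumptions for illustration), is to add up prop first and then draw an ordinary line:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows in the shape of top_data (values are made up)
toy_top <- data.frame(
  year = c(1880, 1880, 1881, 1881, 1880, 1881),
  last_letter = c("a", "a", "a", "a", "n", "n"),
  prop = c(0.02, 0.01, 0.03, 0.01, 0.05, 0.04)
)

# Add up prop within each letter and year before plotting
toy_totals <- toy_top |>
  group_by(last_letter, year) |>
  summarize(total_prop = sum(prop), .groups = "drop")

# One panel per letter, each with its own y axis
p <- ggplot(toy_totals, aes(x = year, y = total_prop, color = last_letter)) +
  geom_line() +
  xlab("Year") +
  ylab("Proportion") +
  facet_wrap(~last_letter, scales = "free_y")

print(p)
```

With the real data, the same idea could reuse the totals already computed, for example sorted_year |> filter(last_letter %in% top_last). The trends in such a plot may differ from the ones described above, since it shows totals per letter instead of individual names.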

Visualization #4:

Now we’re just reversing the process and grabbing the least popular letters (tail instead of head). I wanted to see if my new hypothesis was correct and the less popular last letters were starting to gain traction.

least_popular_last <- last_letter_counts |> 
  tail(5) |> 
  pull(last_letter)

Again, I repeated the process from above, this time filtering for the least popular last letters (w, p, v, j, q).

least_popular_data <- babynames2 |> 
  filter(last_letter %in% least_popular_last)

Then I made the same kind of faceted graph as above:

ggplot(least_popular_data, aes(x = year, y = prop, color = last_letter)) +
  geom_line(stat = "sum") +
  ggtitle("Popularity of Least Popular Five Last Letters in US Baby Names Over Time") +
  xlab("Year") +
  ylab("Proportion") +
  facet_wrap(~last_letter, scales = "free_y")

My earlier hypothesis was (sort of) right. These least popular last letters are definitely seeing more action than they did in the past. This could mean that people are creating and using more unique names, and ending a name in a letter like “j” or “q” is not as unheard of as it may once have been.

Overall, my project was a great way to demonstrate how unique names are becoming. The last letters of names may not come to mind at first as an interesting topic, but I think my four visualizations show that looking at how names have ended over time reveals a lot. My original hypothesis about the most popular letters was for the most part correct, but over time names have become more and more diverse, and as seen in the last visualization, less popular last letters (like j, v, or w) are now getting their time to shine.