By: Alex Pirsos, 2/21/2020

Question

With short baby names such as Liam and Emma increasing in popularity over the past few years, I am curious to know whether the average length of baby names has changed over time. This project aims to analyze whether baby names are increasing or decreasing in length, and to pose some possible explanations for the trend.

Hypothesis

I predict that over time we will see a decrease in the length of baby names. Hundreds of years ago, long, biblically inspired names seemed to be the most prevalent. Over time, however, names appear to have become shorter as naming trends shifted and shorter names proved more convenient. I expect this to show up as a lower average name length in 2017 than in 1880.

Required R Packages

For this project I used R and RStudio, along with the following packages: babynames, tidyverse, and ggthemes.

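# Point R at a CRAN mirror so install.packages() can run non-interactively
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
# Install the package before loading it
install.packages("babynames")
library(babynames)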
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
install.packages("ggthemes")
library(ggthemes)
View(babynames)

Method

First, I wanted to plot the changes in the average length of names from 1880 to 2017. To do so I created a new data frame for the baby names data. I then added a new column to the data set (called new) to count the number of characters in each name. Next, I created a new data frame (dfnew) which aggregated the average length of names for each year.

df <- babynames
df$new <- nchar(df$name)
dfnew <- aggregate(new~year,data=df,mean)

I then plotted this data, which showed that my original hypothesis was incorrect. The average length of baby names rose from 5.7 characters in 1880 to a peak of 6.34 characters in 1992, an increase of approximately 11.23%.

ggplot(data = dfnew, mapping = aes(year,new)) + 
  geom_line() + theme_economist() +
  xlab("Year") +
  ylab("Average Number of Letters in Names") +
  ggtitle("Changes in Average Length of Names, 1880 to 2017")
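
The peak year and the percent increase quoted above can be checked directly from dfnew. The sketch below rebuilds dfnew as in the Method section; peak, base, pct_change, and pct_quoted are illustrative names that do not appear in the original code.

```r
library(babynames)

# Rebuild the yearly (unweighted) average name length from the Method section
df <- babynames
df$new <- nchar(df$name)
dfnew <- aggregate(new ~ year, data = df, FUN = mean)

# Row for the year with the highest average length, and the 1880 baseline
peak <- dfnew[which.max(dfnew$new), ]
base <- dfnew[dfnew$year == 1880, ]

# Percent change from 1880 to the peak year, using the unrounded averages
pct_change <- (peak$new - base$new) / base$new * 100

# The same calculation on the rounded figures quoted in the text
pct_quoted <- (6.34 - 5.7) / 5.7 * 100
round(pct_quoted, 2)  # 11.23
```

Because pct_change uses unrounded averages, it may differ slightly from the figure computed from the rounded values.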

Given that my original hypothesis was proven incorrect, I wanted to dive deeper into the top names of the years 1880, 1992, and 2017: the first year, the maximum average length year, and the latest year. For each year I created a new data frame filtering it to only include the names from that given year. I then ordered the names from most to least popular, and plotted the top 10.

df1880 <- babynames
df1880 <- df1880[df1880$year==1880,]
df1880 <- df1880[order(-df1880$n),]
df1880 <- head(df1880, 10)
df %>% 
  filter(year == 1880) %>% 
  group_by(name, sex) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -total), total)) + geom_bar(stat="identity") -> plot1880
plot1880 + theme_economist() + 
  xlab("Name") + 
  ylab("Number of People") +
  ggtitle("Top 10 Names, 1880")

In the 1880 graph we notice that half of the top 10 names from that year are 5 letters or fewer.

df1992 <- babynames
df1992 <- df1992[df1992$year==1992,]
df1992 <- df1992[order(-df1992$n),]
df1992 <- head(df1992, 10)
df %>% 
  filter(year == 1992) %>% 
  group_by(name, sex) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -total), total)) + geom_bar(stat="identity") -> plot1992
plot1992 + theme_economist() + 
  xlab("Name") + 
  ylab("Number of People") +
  ggtitle("Top 10 Names, 1992")

By comparison, in 1992 only one of the top 10 names has 5 letters, and the rest have more than 5.

df2017 <- babynames
df2017 <- df2017[df2017$year==2017,]
df2017 <- df2017[order(-df2017$n),]
df2017 <- head(df2017, 10)
df %>% 
  filter(year == 2017) %>% 
  group_by(name, sex) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -total), total)) + geom_bar(stat="identity") -> plot2017
plot2017 + theme_economist() + 
  xlab("Name") + 
  ylab("Number of People") +
  ggtitle("Top 10 Names, 2017")

Finally, looking at 2017, we see the reason for the downward slope at the end of the Changes in Average Length of Names graph: 6 of the top 10 names are 5 letters or fewer.
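
The letter counts for the three top 10 lists can be tallied in code rather than read off the charts. This is a minimal sketch, not part of the original analysis: top_names is a hypothetical helper, n_letters is an illustrative column name, and the cutoff of 5 letters is passed as a value so it can be changed to match whichever threshold is intended.

```r
library(babynames)
library(dplyr)

# Hypothetical helper: the top `k` name/sex rows for one year,
# with the number of letters in each name attached
top_names <- function(data, yr, k = 10) {
  data %>%
    filter(year == yr) %>%
    arrange(desc(n)) %>%
    head(k) %>%
    mutate(n_letters = nchar(name)) %>%
    select(year, sex, name, n, n_letters)
}

# Count how many of the top 10 names fall at or below the cutoff in each year
cutoff <- 5
years <- c(1880, 1992, 2017)
short_counts <- sapply(years, function(yr) {
  sum(top_names(babynames, yr)$n_letters <= cutoff)
})
names(short_counts) <- years
short_counts
```

The helper also replaces the three near-identical filtering blocks above with a single function call per year.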

Next, I wanted to test this new observation with a case study. Hypothetically, I could pick any long name and see its popularity increase over time. Additionally, I could compare it with shorter versions of the same name and see those decrease in popularity over time.

My first test was with the name William, given that it appeared on both the 1880 and 2017 top 10 names lists. I compared that with the common nicknames Bill and Will.

babynames %>%
  filter(name %in% c("William", "Will", "Bill") & sex== "M") %>%
  ggplot(aes(year, n, colour=name)) + geom_line() + 
  theme_economist() +
  xlab("Year") + 
  ylab("Number of People") +
  ggtitle("Changes in Popularity of the Name William")

As you can see, this theory did not hold for the name William. While it increased in popularity up to about the 1950s, it has decreased continuously since then. The shorter versions of the name had so few occurrences that no significant decrease is apparent in the visualization. That said, William, as one of the most popular names at both the start and the end of the data set, may be an outlier: it was already so common that it had little room to grow in popularity. To test the theory again, I chose my own name, Alexandra (and the nickname Alex), given that neither appeared on any of the top 10 lists and both may be more representative of an average name.

babynames %>%
  filter(name %in% c("Alexandra", "Alex") & sex== "F") %>%
  ggplot(aes(year, n, colour=name)) + geom_line() + 
  theme_economist() +
  xlab("Year") + 
  ylab("Number of People") +
  ggtitle("Changes in Popularity of the Name Alexandra")

As seen in the graph above, the name Alexandra closely follows the trend in the first visualization. Alexandra increased in popularity, peaked shortly before 2000, and is now declining.
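
Raw counts depend on how many babies were born each year, so a name can gain or lose babies without changing its share of births. A minimal sketch of the same comparison using the prop column that ships with the babynames data (the share of same-sex births in that year) is below. It is offered as a cross-check rather than as part of the original analysis; alexandra and peak_share are illustrative names.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Same filter as the case study above
alexandra <- babynames %>%
  filter(name %in% c("Alexandra", "Alex"), sex == "F")

# Plot each name's share of female births instead of the raw count
ggplot(alexandra, aes(year, prop, colour = name)) +
  geom_line() +
  xlab("Year") +
  ylab("Share of Female Births") +
  ggtitle("Share of Births Named Alexandra or Alex")

# Year in which each name reached its highest share
peak_share <- alexandra %>%
  group_by(name) %>%
  slice(which.max(prop)) %>%
  select(name, year, prop)
peak_share
```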

Assumptions

This project relies on a few assumptions. First, the data set only reports names that were used at least 5 times in a given year. Many very short or very long names may be in use, but if no single one of them is given often enough, none of them registers in the data. Second, the data only runs through 2017, and naming trends from the past 3 years could change the picture. Finally, the data only covers US names. Other countries, especially ones with different alphabets, likely show different results for the average length of baby names over time.
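
One further assumption is implicit in the aggregate() call in the Method section: the mean is taken over rows of the data, so each name/sex combination counts once per year regardless of how many babies received it. A minimal sketch of a birth-weighted alternative, which is not part of the original analysis, is below. The names avg_by_year, len, unweighted, and weighted are illustrative.

```r
library(babynames)
library(dplyr)

avg_by_year <- babynames %>%
  mutate(len = nchar(name)) %>%
  group_by(year) %>%
  summarise(
    unweighted = mean(len),                 # each name/sex row counts once
    weighted   = weighted.mean(len, w = n)  # each baby counts once
  )

# Compare the two measures for the first and last years in the data
avg_by_year %>%
  filter(year %in% range(babynames$year))
```

The unweighted measure describes the pool of names in use, while the weighted measure describes the name a typical baby receives. The two can move differently when many rare names enter the data.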

Explanation

To further understand why the length of names has increased over time, I also wanted to plot the number of unique names over time.

babynames %>%
  group_by(year) %>%
  summarise(name_count = n_distinct(name)) %>%
  ggplot(aes(year, name_count)) + geom_point() +
  theme_economist() +
  xlab("Year") +
  ylab("Number of Unique Names") +
  ggtitle("Changes in Number of Unique Names")

As seen here, the number of distinct names people give their children has grown almost continuously, increasing the variety of names altogether. This likely affects the length of names: as more long names are added to the total name bank, the average rises. It is also interesting that the number of unique names dips in roughly the same time frame as the length of names. This leads me to assume that the two are likely correlated.
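
The suspected correlation can be measured rather than assumed. The sketch below computes both yearly series and their Pearson correlation. It is a rough check under the assumption that a simple linear correlation is an adequate summary; yearly, avg_length, r_levels, and r_changes are illustrative names.

```r
library(babynames)
library(dplyr)

# One row per year: number of distinct names and unweighted average length
yearly <- babynames %>%
  group_by(year) %>%
  summarise(
    name_count = n_distinct(name),
    avg_length = mean(nchar(name))
  )

# Pearson correlation between the two series across all years
r_levels <- cor(yearly$name_count, yearly$avg_length)

# Two series that both trend over time can correlate for that reason alone,
# so the correlation of year-over-year changes is a stricter check
r_changes <- cor(diff(yearly$name_count), diff(yearly$avg_length))

c(levels = r_levels, changes = r_changes)
```

A high value for levels with a much lower value for changes would suggest that the shared long-run trend, rather than year-to-year co-movement, drives the apparent relationship.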

Conclusion

While my original hypothesis was proven incorrect, this analysis shows that despite the recent trend toward short baby names, baby names are longer on average than they were in 1880, and more unique names are being used.