Using data from Kaggle on names given to babies in the United States between 1880 and 2014, this report examines how trends in female baby naming changed over that period. To provide demographic context, I first examine the number of male and female babies recorded in each year of the data set. I then walk through my thought process in analyzing how the frequency of the most common female baby names has changed over time.
library(tidyverse)
library(dplyr)
library(readr)
library(knitr)
natbabynames <- read_csv("/Users/lorigerstenfeld/Downloads/national_baby_names.csv")
sum(is.na(natbabynames))
## [1] 0
The result is 0, so there are no NAs in this dataset that need to be removed.
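A single total is sufficient here because it is zero. Had it been nonzero, a per-column breakdown would show where the gaps are. The following is a minimal sketch on a small made-up data frame (`toy` and all of its values are illustrative assumptions, not rows from the Kaggle file):

```r
# Hypothetical toy data with the same column names as the report's data set
toy <- data.frame(
  Name   = c("Mary", "Anna", NA, "John"),
  Year   = c(1880, 1880, 1881, 1881),
  Gender = c("F", "F", "F", "M"),
  Count  = c(100, NA, 90, 120),
  stringsAsFactors = FALSE
)

# Total number of missing values, as computed in the report
total_na <- sum(is.na(toy))

# Missing values broken down by column
na_by_column <- colSums(is.na(toy))

print(total_na)
print(na_by_column)
```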
In this section, I will attempt to answer the question: how has the trend in the number of male and female babies born each year changed between 1880 and 2014?
natbabynames2 <- natbabynames %>%
  dplyr::select(Year, Count, Gender) %>%
  dplyr::group_by(Gender, Year) %>%
  # Total babies per gender and year, expressed in thousands
  dplyr::summarize(total_babies = sum(Count) / 1000) %>%
  dplyr::ungroup()
ggplot(natbabynames2, aes(Year, total_babies)) +
  geom_line(aes(color = Gender)) +
  theme_classic() +
  labs(
    title = "Number of Male and Female Babies Born 1880-2014",
    x = "Year",
    y = "Total Babies Born (Thousands)") +
  scale_color_discrete(breaks = c("F", "M"),
                       labels = c("Female", "Male"))
Figure 1: Number of male and female babies born each year 1880-2014
Figure 1 shows that the number of babies born increased rapidly between the early 1900s and the early 1960s, aside from a dip during the 1920s and 1930s (possibly associated with the Great Depression and the onset of World War II). Since the 1960s, the annual number of births has fluctuated but remained relatively stable.
In this and the next section, I attempt to answer the question: how has the frequency of common female baby names changed between 1880 and 2014?
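natbabynames3 <- natbabynames %>%
  dplyr::filter(Gender == "F") %>%
  dplyr::group_by(Year) %>%
  # Select the highest-count name explicitly instead of relying on row order
  dplyr::slice_max(Count, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()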
ggplot(natbabynames3, aes(Year, Count)) +
  geom_point(aes(color = Name)) +
  theme_bw() +
  labs(
    title = "Number of Female Babies Born 1880-2014 with \n Most Frequently Used Baby Names of the Year",
    x = "Year",
    y = "Number of Babies with Most Frequently Used Female Name",
    color = "Most Frequently\nUsed Baby Name")
Figure 2: Number of US-born female babies with most common name of year
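Figure 2 treats one row per year as that year's most common name, which is only correct if the row kept really has the highest count. As a minimal sketch on made-up, deliberately unsorted data (the `toy` data frame is an assumption for illustration), selecting by maximum count returns the intended row regardless of row order:

```r
library(dplyr)

# Hypothetical, deliberately unsorted toy data
toy <- data.frame(
  Year  = c(1900, 1900, 1901, 1901),
  Name  = c("Anna", "Mary", "Mary", "Helen"),
  Count = c(50, 80, 90, 60),
  stringsAsFactors = FALSE
)

# First row per year: depends entirely on the order of the rows
first_row <- toy %>%
  group_by(Year) %>%
  slice(1) %>%
  ungroup()

# Highest count per year: independent of row order
top_name <- toy %>%
  group_by(Year) %>%
  slice_max(Count, n = 1, with_ties = FALSE) %>%
  ungroup()

print(first_row$Name)
print(top_name$Name)
```

If the file is already sorted by descending count within each year and gender, the two approaches agree.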
While Figure 2 shows how the number of female babies given the most common name has changed over time, it does not account for the total number of female babies born each year. A further graph of year against the proportion of female babies given the most frequently used name in each year is needed to represent this trend accurately.
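To see why raw counts can mislead, consider a minimal sketch with made-up numbers (all values below are illustrative assumptions): the top name's count rises between two years, yet its share of female births falls because total births grow faster.

```r
# Hypothetical counts for the top name and total female births in two years
toy <- data.frame(
  Year         = c(1900, 1950),
  top_count    = c(10000, 40000),
  total_female = c(200000, 1600000)
)

# Share of female babies given the top name in each year
toy$proportion <- toy$top_count / toy$total_female

print(toy)
```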
In this section, the analysis from the previous section is extended to include the total number of female babies born in each year. This makes it possible to analyze how frequently the most common baby name is used over time.
natbabynames2_onlytotal <- natbabynames2 %>%
  dplyr::filter(Gender == "F") %>%
  dplyr::select(-Gender)
# total_babies is stored in thousands, so convert back before dividing
natbabynames_joined <- left_join(natbabynames3, natbabynames2_onlytotal, by = "Year") %>%
  dplyr::mutate(proportion = Count / (total_babies * 1000))
ggplot(natbabynames_joined, aes(Year, proportion)) +
  geom_point(aes(color = Name)) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(
    title = "Proportion of Female Babies Born 1880-2014 \nwith Most Frequently Used Baby Names of the Year",
    x = "Year",
    y = "Proportion of Babies with Most Frequently Used Female Name",
    color = "Most Frequently\nUsed Baby Name")
Figure 3: Proportion of US-born female babies each year with most common female name
Figure 3 demonstrates that the proportion of female babies given the most common baby name has steadily decreased over time. This trend is much more clearly visible once the data are standardized by the number of female babies born each year.
model <- lm(proportion ~ Year, data = natbabynames_joined)
statsummary <- coef(summary(model))
# The predictor in the model is Year, so label the slope row accordingly
row.names(statsummary) <- c("Intercept", "Year")
knitr::kable(statsummary, align = "ccccc",
  caption = "Table 1. Linear Model for Figure 3 (year vs. proportion of female babies with most common name)",
  col.names = c("Estimate", "Std. Error", "t value", "p value"))
| | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| Intercept | 0.8958616 | 0.0261097 | 34.31149 | 0 |
| Year | -0.0004383 | 0.0000134 | -32.68914 | 0 |
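The large magnitude of the t value and the very small p value for the slope coefficient indicate that the decrease over time in the proportion of babies given the most common name is statistically distinguishable from zero. These statistics describe the slope rather than the overall quality of fit, so they support the presence of a downward linear trend without by themselves showing how closely the points follow a straight line.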
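A p value displayed as 0 most likely reflects rounding in the table, and a measure such as R-squared is needed to judge how well a line fits. As a minimal sketch, using simulated data in place of `natbabynames_joined` (the simulated values and the object names `sim` and `fit` are assumptions for illustration), both quantities can be read from the model summary:

```r
set.seed(1)

# Simulated stand-in for the joined data: a declining proportion plus noise
sim <- data.frame(Year = 1880:2014)
sim$proportion <- 0.9 - 0.00044 * sim$Year + rnorm(nrow(sim), sd = 0.005)

fit <- lm(proportion ~ Year, data = sim)
fit_summary <- summary(fit)

# p value for the slope, formatted so very small values are not shown as 0
slope_p <- coef(fit_summary)["Year", "Pr(>|t|)"]
print(format.pval(slope_p, digits = 3))

# R-squared: share of the variance in proportion explained by the linear trend
print(fit_summary$r.squared)
```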
As shown by Figure 3, the relative popularity of the most common female name has decreased approximately linearly between 1880 and 2014. It was important to understand the trends in the number of babies born (Figure 1) and the number of times the most frequent baby name was used (Figure 2) in order to reach this third graph. By dividing the number of female babies with the most common name by the total number of female babies born each year, a conclusion could be reached about the relative popularity of the most common female baby names over time.
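As a rough check on the size of the decline, the fitted line can be evaluated at the first and last years using the coefficients reported in Table 1. This is a sketch that assumes the rounded table values are adequate for the purpose; the variable names are illustrative.

```r
# Coefficients as reported in Table 1
intercept <- 0.8958616
slope     <- -0.0004383

# Fitted proportion at the first and last years in the data
years        <- c(1880, 2014)
fitted_share <- intercept + slope * years

print(round(fitted_share, 3))
```

Under these coefficients, the fitted share of female babies given the most common name falls from roughly 7% in 1880 to a little over 1% in 2014.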
This result and the figures produced throughout the report demonstrate that female baby naming has changed substantially over time. The most popular name has changed several times, often after remaining the most common name for several years, and the share of female babies given the most common name has decreased markedly between 1880 and 2014.