First Hypothesis

The original hypothesis was that names from the 1900 top 20 female and top 20 male lists have been regaining popularity over time. A name was counted as following this pattern if it met all three of the following prop requirements:

1900: prop >= 0.01

1960: prop <= 0.0025

2017: prop >= 0.002
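The three thresholds can be applied in a single pipeline. The following is a minimal sketch of one possible reading of the requirements, not necessarily the exact procedure used for this report. The object names `top_1900`, `props_wide`, and `revived` are illustrative assumptions, and ties at rank 20 are dropped so that each sex contributes exactly 20 names.

```r
library(dplyr)
library(tidyr)
library(babynames)

# Top 20 names per sex in 1900, ranked by prop
# (with_ties = FALSE keeps exactly 20 rows per sex)
top_1900 <- babynames %>%
  filter(year == 1900) %>%
  group_by(sex) %>%
  slice_max(prop, n = 20, with_ties = FALSE) %>%
  ungroup() %>%
  select(sex, name)

# prop for those 40 names in the three reference years, one column per year
props_wide <- babynames %>%
  semi_join(top_1900, by = c("sex", "name")) %>%
  filter(year %in% c(1900, 1960, 2017)) %>%
  select(sex, name, year, prop) %>%
  pivot_wider(names_from = year, values_from = prop, names_prefix = "prop_")

# Keep names that were popular in 1900, rare in 1960, and recovering in 2017
# (a name missing from any of the three years yields NA and is dropped)
revived <- props_wide %>%
  filter(prop_1900 >= 0.01, prop_1960 <= 0.0025, prop_2017 >= 0.002)

revived
```

Reshaping to one column per year keeps each requirement visible as its own condition in the final `filter()`, which makes the thresholds easy to adjust.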

After filtering the names by these requirements, four female names (Alice, Anna, Lillian, and Grace) and one male name (Henry) matched the criteria. Below are charts of each name's prop value over the years. Each name shows an uptick in prop at some point after the 1960s, when its popularity was near its lowest.

library(tidyverse)
library(babynames)
library(scales)

# Alice (female): prop by year
babynames %>%
  filter(name == "Alice", sex == "F", year > 1880) %>%
  ggplot(aes(year, prop)) + geom_line() -> alice
alice + ggtitle("Alice") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Year")

# Anna (female): prop by year
babynames %>%
  filter(name == "Anna", sex == "F", year > 1880) %>%
  ggplot(aes(year, prop)) + geom_line() -> anna
anna + ggtitle("Anna") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Year")

# Lillian (female): prop by year
babynames %>%
  filter(name == "Lillian", sex == "F", year > 1880) %>%
  ggplot(aes(year, prop)) + geom_line() -> lillian
lillian + ggtitle("Lillian") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Year")

# Grace (female): prop by year
babynames %>%
  filter(name == "Grace", sex == "F", year > 1880) %>%
  ggplot(aes(year, prop)) + geom_line() -> grace
grace + ggtitle("Grace") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Year")

# Henry (male): prop by year
babynames %>%
  filter(name == "Henry", sex == "M", year > 1880) %>%
  ggplot(aes(year, prop)) + geom_line() -> henry
henry + ggtitle("Henry") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Year")
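The uptick visible in the charts can also be checked numerically. The sketch below finds, for each of the five names, the year from 1960 onward in which prop was lowest and compares that value with the most recent year in the data. The names `revival_names` and `troughs` are illustrative assumptions, not objects from the original analysis.

```r
library(dplyr)
library(babynames)

# The five names reported above, paired with the sex they matched under
revival_names <- tibble(
  name = c("Alice", "Anna", "Lillian", "Grace", "Henry"),
  sex  = c("F", "F", "F", "F", "M")
)

# For each name: the lowest prop since 1960 and the latest available prop
troughs <- babynames %>%
  inner_join(revival_names, by = c("name", "sex")) %>%
  filter(year >= 1960) %>%
  group_by(name) %>%
  arrange(year, .by_group = TRUE) %>%
  summarise(
    trough_year = year[which.min(prop)],
    trough_prop = min(prop),
    latest_year = last(year),
    latest_prop = last(prop)
  )

troughs
```

A `latest_prop` well above `trough_prop` corresponds to the rebound seen in each chart.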

First Conclusion and Second Hypothesis

Because so few names met these criteria, I concluded that the hypothesis was not supported by the data. Only five of the forty names initially examined showed even a possibility of this trend, so I looked further for possible causes. That search showed that the prop values of the top twenty male names and top twenty female names have decreased greatly over the years. The graphs below provide evidence of this. This led to a new hypothesis: prop values decreased because of a large increase in the United States population and, with it, in name diversity.

*For readability's sake, I only included the top 10 female names and top 10 male names in 1900, 1960, and 2017, as that was sufficient to show the highest prop values.

The top 10 female and male names in 1900. Note that all the female prop values are below 0.055, and all the male prop values are below 0.061.

# Top 10 female names in 1900, ordered by prop
babynames %>%
  filter(year == 1900, sex == "F") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_f_1900
top_f_1900 + ggtitle("Top 10 Female Babynames in 1900") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")

# Top 10 male names in 1900, ordered by prop
babynames %>%
  filter(year == 1900, sex == "M") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_m_1900
top_m_1900 + ggtitle("Top 10 Male Babynames in 1900") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")

The top 10 female and male names in 1960. Note that all the female prop values are below 0.025, and all the male prop values are below 0.04.

# Top 10 female names in 1960, ordered by prop
babynames %>%
  filter(year == 1960, sex == "F") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_f_1960
top_f_1960 + ggtitle("Top 10 Female Babynames in 1960") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")

# Top 10 male names in 1960, ordered by prop
babynames %>%
  filter(year == 1960, sex == "M") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_m_1960
top_m_1960 + ggtitle("Top 10 Male Babynames in 1960") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")

The top 10 female and male names in 2017. Note that all the female prop values are below 0.015, and all the male prop values are below 0.01.

# Top 10 female names in 2017, ordered by prop
babynames %>%
  filter(year == 2017, sex == "F") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_f_2017
top_f_2017 + ggtitle("Top 10 Female Babynames in 2017") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")

# Top 10 male names in 2017, ordered by prop
babynames %>%
  filter(year == 2017, sex == "M") %>%
  arrange(desc(prop)) %>%
  head(10) %>%
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col() -> top_m_2017
top_m_2017 + ggtitle("Top 10 Male Babynames in 2017") + theme(plot.title = element_text(hjust = 0.5)) + labs(y = "Prop value", x = "Name")
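The six charts can be condensed into one table for easier comparison across years. This sketch reports, for each year and sex, the highest single prop value and the combined prop of the top 10 names. The name `top10_summary` and its column names are illustrative assumptions.

```r
library(dplyr)
library(babynames)

# Highest prop and combined top-10 prop for each reference year and sex
top10_summary <- babynames %>%
  filter(year %in% c(1900, 1960, 2017)) %>%
  group_by(year, sex) %>%
  slice_max(prop, n = 10, with_ties = FALSE) %>%
  summarise(
    top_prop    = max(prop),   # prop of the single most popular name
    top10_share = sum(prop),   # combined prop of the ten most popular names
    .groups = "drop"
  )

top10_summary
```

Reading down the `top10_share` column for one sex shows how much of each year's births the ten most popular names covered.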

The graphs above show interesting results, but more data was needed to evaluate the new hypothesis and reach a conclusion. The tables below show the number of unique female and male names given in 1900, 1960, and 2017.

Unique male and female baby names in 1900.

babynames %>%  
  filter(year == 1900) %>% 
  group_by(sex) %>% 
  count()
## # A tibble: 2 x 2
## # Groups:   sex [2]
##   sex       n
##   <chr> <int>
## 1 F      2224
## 2 M      1506

Unique male and female baby names in 1960.

babynames %>%  
  filter(year == 1960) %>% 
  group_by(sex) %>% 
  count()
## # A tibble: 2 x 2
## # Groups:   sex [2]
##   sex       n
##   <chr> <int>
## 1 F      7331
## 2 M      4590

Unique male and female baby names in 2017.

babynames %>%  
  filter(year == 2017) %>% 
  group_by(sex) %>% 
  count()
## # A tibble: 2 x 2
## # Groups:   sex [2]
##   sex       n
##   <chr> <int>
## 1 F     18309
## 2 M     14160
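The three counts above can also be produced in one step, alongside the number of recorded births, which bears on the population part of the hypothesis. This is a sketch under one stated assumption: `births` is the sum of the `n` column, so it covers only babies whose names appear in the data set and will undercount total births. The names `name_counts`, `unique_names`, and `births` are illustrative.

```r
library(dplyr)
library(babynames)

# Unique names and recorded births per sex for the three reference years
name_counts <- babynames %>%
  filter(year %in% c(1900, 1960, 2017)) %>%
  group_by(year, sex) %>%
  summarise(
    unique_names = n_distinct(name),  # distinct names recorded that year
    births       = sum(n),            # babies given any recorded name
    .groups = "drop"
  )

name_counts
```

Having both columns in one table makes it possible to see whether the number of unique names grew faster or slower than the number of recorded births.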

Conclusion

In conclusion, there is no definitive evidence that names that were popular in 1900, with a prop value of 0.01 or higher, are returning to that level of popularity in 2017. Of the forty names examined, only five showed the U-shaped (inverted bell) curve that the criteria were looking for, and even for those the recovery was modest: the 2017 threshold of 0.002 is only a fifth of the 1900 threshold of 0.01.

There is, however, evidence suggesting that as the population of girls and boys in the United States has increased, the diversity of names has as well. The last three tables show that the number of unique female names recorded grew by over 5,000 between 1900 and 1960 (2,224 to 7,331) and by nearly 11,000 between 1960 and 2017 (7,331 to 18,309). Unique male names grew by about 3,100 (1,506 to 4,590) and about 9,600 (4,590 to 14,160) over the same periods. Taken together with the prop values of the top ten boy and girl names from the three years, this supports the conclusion that as more distinct names are recorded, the most popular names account for a smaller share of births. This is clear when comparing the highest prop values for boys and girls in 1900, 0.061 and 0.053 respectively, with the highest in 2017, 0.0093 and 0.0106. The large decrease in those values suggests that as more children were born and more distinct names were given, the most popular names of a given year covered a smaller proportion of all babies.
