The babynames dataset provides names for US babies from 1880 to 2017. Using this resource, I want to determine how often season-related baby names are used.
Thesis statement: I hypothesize that seasonal names become more popular with each generation closer to the present day. Let’s start by loading our packages in R.
library(tidyverse)
library(babynames)
library(ggthemes)
The first step is to determine which names I will include in my analysis of ‘seasonal names.’
I will create a filter for the names Winter, Spring, Summer, and Autumn. Since I will be focusing on each gender separately first, I will filter females with these names into seasons_f, then filter males into seasons_m.
babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn"), sex == "F") -> seasons_f
babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn"), sex == "M") -> seasons_m
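Since the same list of names is typed twice above, here is a minimal sketch of one way to keep it in a single place. The names season_names and seasons_all are my own hypothetical additions, not part of the original analysis; the resulting seasons_f and seasons_m should hold the same rows as before.

```r
library(dplyr)
library(babynames)

# Hypothetical helper: the four names live in one vector so they are typed once.
season_names <- c("Winter", "Spring", "Summer", "Autumn")

# Filter the full data set once (seasons_all is an illustrative name).
seasons_all <- babynames %>%
  filter(name %in% season_names)

# Split the filtered rows by sex.
seasons_f <- seasons_all %>% filter(sex == "F")
seasons_m <- seasons_all %>% filter(sex == "M")
```

If I later wanted to add another name, I would only need to change season_names.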
Now that I have created these two variables, I will use them to represent the names I will be analyzing: Winter, Spring, Summer, and Autumn.
Next, let’s arrange these names in descending order, starting with the highest prop, the proportion of children born with that name in each year.
seasons_f %>%
arrange(desc(prop))
## # A tibble: 289 × 5
##     year sex   name       n    prop
##    <dbl> <chr> <chr>  <int>   <dbl>
##  1  1998 F     Autumn  4208 0.00217
##  2  1999 F     Autumn  4127 0.00212
##  3  2001 F     Autumn  4191 0.00212
##  4  2015 F     Autumn  4112 0.00211
##  5  2016 F     Autumn  4022 0.00209
##  6  2014 F     Autumn  4062 0.00208
##  7  2002 F     Autumn  4103 0.00208
##  8  2013 F     Autumn  3950 0.00205
##  9  2003 F     Autumn  4055 0.00202
## 10  2000 F     Autumn  4027 0.00202
## # … with 279 more rows
From this table we can see that females named Autumn take all of the top 10 spots for the highest prop in the data set. Now let’s do this with males.
seasons_m %>%
arrange(desc(prop))
## # A tibble: 99 × 5
##     year sex   name       n      prop
##    <dbl> <chr> <chr>  <int>     <dbl>
##  1  2016 M     Winter    46 0.0000228
##  2  2017 M     Winter    42 0.0000214
##  3  2014 M     Winter    35 0.0000171
##  4  2012 M     Winter    34 0.0000168
##  5  2015 M     Winter    33 0.0000162
##  6  2004 M     Autumn    28 0.0000133
##  7  2004 M     Summer    28 0.0000133
##  8  1977 M     Summer    20 0.0000117
##  9  2002 M     Winter    24 0.0000116
## 10  2005 M     Winter    23 0.0000108
## # … with 89 more rows
From this table we can see that Winter has the highest prop of any seasonal name for males: its 2016, 2017, 2014, 2012, and 2015 values take the top 5 spots.
Now I will plot the names using a different color for each name. I am creating plots for both n and prop to see if there is a major difference.
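Because one name can fill most of a top 10 list, it may also help to look at the single best year for each name. The sketch below is one possible way to do that with slice_max(); season_names and peak_years are hypothetical names I am introducing for illustration.

```r
library(dplyr)
library(babynames)

# Hypothetical helper vector holding the four seasonal names.
season_names <- c("Winter", "Spring", "Summer", "Autumn")

# For each sex and name, keep only the year with the highest prop.
# with_ties = FALSE keeps a single row even if two years tie.
peak_years <- babynames %>%
  filter(name %in% season_names) %>%
  group_by(sex, name) %>%
  slice_max(prop, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(sex, desc(prop))

peak_years
```

This gives at most one row per sex and name, so the less common names are not crowded out by the most common one.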
ggplot(seasons_f, aes(x=year, y=n, color=name)) +
geom_line()
ggplot(seasons_f, aes(x=year, y=prop, color=name)) +
geom_line() -> prop_time_female
We can see that n and prop graphs are very comparable.
Through this graph we see that Autumn is the most popular name for females, and has consistently been so since approximately 1975.
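To compare n and prop more directly, one option is to stack both measures in a single faceted figure. This is only a sketch: seasons_f_long, measure, and value are hypothetical names, and pivot_longer() comes from tidyr, which is loaded with the tidyverse.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(babynames)

# Female rows for the four seasonal names.
seasons_f <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn"), sex == "F")

# Reshape so n and prop become two rows per year and name,
# labelled by a new "measure" column.
seasons_f_long <- seasons_f %>%
  pivot_longer(cols = c(n, prop), names_to = "measure", values_to = "value")

# One panel per measure; free_y lets each panel use its own scale.
ggplot(seasons_f_long, aes(x = year, y = value, color = name)) +
  geom_line() +
  facet_wrap(~ measure, scales = "free_y", ncol = 1)
```

With the panels stacked, the years line up vertically, which makes it easier to spot where the count and the proportion move differently.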
Now I am going to do the same thing, except for males.
ggplot(seasons_m, aes(x=year, y=n, color=name))+
geom_line()
ggplot(seasons_m, aes(x=year, y=prop, color=name))+
geom_line()
We see that the graphs again are very similar in nature.
It is evident that the male names have more variation than the female names did; however, Winter has been consistently the most popular since the 2000s.
Next I will plot the totals for each female name, summed over all years, in a bar graph.
ggplot(seasons_f, aes(x=name, y=n)) +
geom_col()
ggplot(seasons_f, aes(x=name, y=prop)) +
geom_col()
Again, we see that there is not much difference between the n and prop graphs.
This bar graph shows that, across all years, Autumn is the most popular female name, followed by Summer, Winter, then Spring.
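The bars above are built by geom_col() stacking every yearly row for a name, so the prop bar is a sum of yearly proportions rather than one overall proportion. As a rough check on the ranking, the totals can also be computed as a table. This is a sketch, and totals_f and total are hypothetical names.

```r
library(dplyr)
library(babynames)

# Female rows for the four seasonal names.
seasons_f <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn"), sex == "F")

# Sum the yearly counts for each name; wt = n weights each row by its count,
# and sort = TRUE puts the largest total first.
totals_f <- seasons_f %>%
  count(name, wt = n, name = "total", sort = TRUE)

totals_f
```

The order of the rows should match the order of the bar heights in the n graph.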
Now I will do the same for males. This will represent all the names together over time for males.
ggplot(seasons_m, aes(x=name, y=n)) +
geom_col()
ggplot(seasons_m, aes(x=name, y=prop)) +
geom_col()
We see here that Spring has never been used enough as a male name to show up in our data set and bar graph. We see that Winter is the most popular male name, followed by Summer and Autumn, which are not too far apart.
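As I understand it, babynames only records a name in a given year when it was used at least five times for that sex, which would explain a missing bar. A quick sketch to check which seasonal names have no male rows at all is below; season_names, male_years, and years_recorded are hypothetical names.

```r
library(dplyr)
library(babynames)

# Hypothetical helper vector holding the four seasonal names.
season_names <- c("Winter", "Spring", "Summer", "Autumn")

# Count how many yearly rows each seasonal name has for males.
male_years <- babynames %>%
  filter(name %in% season_names, sex == "M") %>%
  count(name, name = "years_recorded")

male_years

# Names from the list that never appear for males.
setdiff(season_names, male_years$name)
```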
Now I will be creating variables to divide the four seasonal names by gender and by generation. This will help me with my analysis.
seasons_f %>%
filter( year >= 1901 & year <=1927) -> greatest_gen_f
seasons_m %>%
filter( year >= 1901 & year <=1927) -> greatest_gen_m
seasons_f %>%
filter( year >= 1928 & year <=1945) -> silent_gen_f
seasons_m %>%
filter( year >= 1928 & year <=1945) -> silent_gen_m
seasons_f %>%
filter( year >= 1946 & year <=1964) -> boomer_gen_f
seasons_m %>%
filter( year >= 1946 & year <=1964) -> boomer_gen_m
seasons_f %>%
filter( year >= 1965 & year <=1980) -> x_gen_f
seasons_m %>%
filter( year >= 1965 & year <=1980) -> x_gen_m
seasons_f %>%
filter( year >= 1981 & year <=1995) -> millenials_gen_f
seasons_m %>%
filter( year >= 1981 & year <=1995) -> millenials_gen_m
seasons_f %>%
filter( year >= 1996 & year <=2010) -> z_gen_f
seasons_m %>%
filter( year >= 1996 & year <=2010) -> z_gen_m
seasons_f %>%
filter( year >= 2011 & year <=2025) -> alpha_gen_f
seasons_m %>%
filter( year >= 2011 & year <=2025) -> alpha_gen_m
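The fourteen filters above could also be expressed as a single generation column. The sketch below uses cut() with the same year boundaries; generation_breaks, generation_labels, seasons_gen, and gen_totals are hypothetical names, and years before 1901 get NA because no generation is defined for them here.

```r
library(dplyr)
library(babynames)

# Upper bound of each generation; cut() uses intervals of the form (lower, upper],
# so (1900, 1927] covers 1901-1927, (1927, 1945] covers 1928-1945, and so on.
generation_breaks <- c(1900, 1927, 1945, 1964, 1980, 1995, 2010, 2025)
generation_labels <- c("Greatest", "Silent", "Boomer", "Gen X",
                       "Millennial", "Gen Z", "Alpha")

# Tag every seasonal-name row with its generation.
seasons_gen <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn")) %>%
  mutate(generation = cut(year, breaks = generation_breaks,
                          labels = generation_labels))

# Total babies with a seasonal name per generation and sex.
gen_totals <- seasons_gen %>%
  filter(!is.na(generation)) %>%
  group_by(generation, sex) %>%
  summarise(total = sum(n), .groups = "drop")

gen_totals
```

Keeping the generation as a column means one grouped summary can replace the separate per-generation variables. Note that the Alpha totals only cover 2011 to 2017, since the data ends in 2017.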
Now I am taking my line graph of female seasonal names and adding annotations so that it is clearly divided by generation. I also used the theme ‘fivethirtyeight’ here. These annotations add shaded areas for every other generation so that people can see the lines during each generation. I also labeled the relevant generations, the ones that show data, i.e. the boomer generation and beyond.
prop_time_female +
annotate("rect", xmin=2011, xmax=2018, ymin= 0, ymax = 0.003, alpha=.2) +
annotate("rect", xmin=1981, xmax=1995, ymin= 0, ymax = 0.003, alpha=.2)+
annotate("rect", xmin=1946, xmax=1964, ymin= 0, ymax = 0.003, alpha=.2) +
annotate("text", x=1956, y=.0031, label= "Boomer") +
annotate("text", x=1974, y=.0031, label= "Gen X") +
annotate("text", x=1988, y=.0031, label= "Millennials") +
annotate("text", x=2004, y=.0031, label= "Gen Z") +
annotate("text", x=2015, y=.0031, label= "Alpha") +
theme_fivethirtyeight()
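The same shading and labels can also be driven by a small table of generations instead of one annotate() call per item. This is a sketch under my own assumptions: the generations tibble and its columns (label, start, end, shaded) are hypothetical, and the label positions are the midpoints of each range rather than the hand-picked x values used above.

```r
library(dplyr)
library(ggplot2)
library(ggthemes)
library(babynames)

# Female rows for the four seasonal names.
seasons_f <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn"), sex == "F")

# Hypothetical lookup table: one row per labelled generation.
generations <- tibble(
  label  = c("Boomer", "Gen X", "Millennials", "Gen Z", "Alpha"),
  start  = c(1946, 1965, 1981, 1996, 2011),
  end    = c(1964, 1980, 1995, 2010, 2018),
  shaded = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

ggplot(seasons_f, aes(x = year, y = prop)) +
  # Shade every other generation, drawn first so the lines stay on top.
  geom_rect(data = filter(generations, shaded),
            aes(xmin = start, xmax = end),
            ymin = 0, ymax = 0.003, alpha = 0.2, inherit.aes = FALSE) +
  geom_line(aes(color = name)) +
  # Put each label above the middle of its generation.
  geom_text(data = generations,
            aes(x = (start + end) / 2, label = label),
            y = 0.0031, inherit.aes = FALSE) +
  theme_fivethirtyeight()
```

With this layout, changing a boundary or adding a generation means editing one row of the table.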
Now I will create a variable to show the overall popularity of these names in total by gender. This will help us see if M or F used seasonal names more.
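babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn")) %>%
  group_by(year, sex) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the grouping message
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(year) -> seasons_mf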
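Another way to compare the two sexes is to put them side by side and compute the female share of seasonal names in each year. This is a sketch only: seasons_share, n_F, n_M, and female_share are hypothetical names, and years with no male rows are filled with 0.

```r
library(dplyr)
library(tidyr)
library(babynames)

# Yearly totals of seasonal names by sex.
seasons_mf <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn")) %>%
  group_by(year, sex) %>%
  summarise(n = sum(n), .groups = "drop")

# One row per year with a column per sex, then the female share of the total.
seasons_share <- seasons_mf %>%
  pivot_wider(names_from = sex, values_from = n,
              names_prefix = "n_", values_fill = 0L) %>%
  mutate(female_share = n_F / (n_F + n_M))

seasons_share
```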
Now I will graph it using a line plot!
seasons_mf %>%
ggplot(aes(x=year, y=n, color=sex )) +
geom_line() +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() -> compare_gender
Now I will add the same annotations, including shaded areas to differentiate the generations and labels to name them.
compare_gender +
annotate("rect", xmin=2011, xmax=2018, ymin= 0, ymax =7000, alpha=.2) +
annotate("rect", xmin=1981, xmax=1995, ymin= 0, ymax = 7000, alpha=.2)+
annotate("rect", xmin=1946, xmax=1964, ymin= 0, ymax = 7000, alpha=.2) +
annotate("text", x=1956, y=7100, label= "Boomer") +
annotate("text", x=1974, y=7100, label= "Gen X") +
annotate("text", x=1988, y=7100, label= "Millennials") +
annotate("text", x=2004, y=7100, label= "Gen Z") +
annotate("text", x=2015, y=7100, label= "Alpha") +
theme_fivethirtyeight()
Here we see that far more females than males were given seasonal names. The gap is so large that the male line sits almost flat against the axis and it looks as if there are no males at all, but we know this is false since we graphed the male names earlier.
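One possible way to make the male line visible next to the female line is a log-scaled y axis. The sketch below rebuilds seasons_mf so it runs on its own; the use of scale_y_log10() is my suggestion, not part of the original graphs.

```r
library(dplyr)
library(ggplot2)
library(ggthemes)
library(babynames)

# Yearly totals of seasonal names by sex.
seasons_mf <- babynames %>%
  filter(name %in% c("Winter", "Spring", "Summer", "Autumn")) %>%
  group_by(year, sex) %>%
  summarise(n = sum(n), .groups = "drop")

# On a log10 axis, equal vertical distances mean equal ratios,
# so counts in the tens and in the thousands can share one plot.
ggplot(seasons_mf, aes(x = year, y = n, color = sex)) +
  geom_line() +
  scale_y_log10() +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight()
```

The trade-off is that a log axis hides the absolute size of the gap, so I would read it alongside the original graph rather than instead of it.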
To me, the spike around the 1970s is extremely interesting, as is the one around the 90s. It seems that seasonal names were at their highest right before the 2000s and have been decreasing ever since.