Tidy Tuesday 03-22-2022: Baby Names

The Data

This week’s data comes from US babynames and NZ babynames. The data describes the name, year, popularity in that year, count, proportion, as well as the sex associated with the name. The data begins in 1880 and goes until 2017.

babynames <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-22/babynames.csv')
nzbabynames <-readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-22/nz_names.csv')

Data Exploration

Like usual, I started with skimming the data frames to see if there was any interesting information. I didn’t end up using much of the info from this skim chart but it was still informative to see the variable distributions etc.

babynames %>%
  skim()

nzbabynames %>%
  skim()

Sara versus Sarah

My name is Sara, but most people I meet spell their name Sarah rather than Sara. Just out of curiosity, I wanted to see how different the popularity of Sarah and Sara really are. As the plot below shows, the name Sarah is more popular than the name Sara – though it is interesting that they have an uptick at about the same time (roughly between 1980-1990s).

Sara <- babynames %>%
  filter(name == "Sara" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Sarah <- babynames %>%
  filter(name == "Sarah" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

colors <- c("Sara" = "red", "Sarah" = "blue")

Sara %>%
  ggplot(aes(x = year, y = prop)) +
  geom_line(aes(color = "Sara")) +
  geom_area(alpha = .7, aes(fill = "Sara")) +
  geom_line(data = Sarah, aes(color = "Sarah")) +
  geom_area(data = Sarah, alpha = .1, aes(fill = "Sarah"))+
  theme_bw()+
  labs(color = "Legend")+
  guides(fill=FALSE)+
  ylab("Proportion")+
  xlab("Year")+
  ggtitle("Popularity of Sarah versus Sara in the United States (1880-2017)")+
  xlim(c(1880,2017))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Sara in NZ versus US

As aforementioned, my name is Sara, and because of this, I looked at the popularity of my name in NZ versus US. Interestingly, the name Sara is way more popular in NZ than the US.

nz_prop<- nzbabynames %>%
  group_by(Year, Sex) %>%
  mutate(n = n()) %>% 
  ungroup() %>%
  mutate(prop = Count/n) %>%
  rename(year = Year)

Sara_1 <- nz_prop %>%
  filter(Name == "Sara" & Sex == "Female")

colors <- c("New Zealand" = "blue", "United States" = "red")

Sara %>%
ggplot(aes(x = year, y = prop)) +
  geom_area(alpha = .7, aes(fill = "United States"))+
         geom_line(aes(color = "United States")) +
  geom_line(data = Sara_1, aes(color = "New Zealand")) +
  geom_area(data = Sara_1, aes(fill="New Zealand"), alpha = .1)+
    labs(color = "Legend")+
    guides(fill=FALSE)+
  theme_bw()+
   ylab("Proportion")+
  xlab("Year")+
  ggtitle("Popularity of Sara in the United States versus New Zealand (1880-2017)")+
  xlim(c(1880,2017))

Plotting Interesting 2000s Names

I next wanted to try graphing interesting female baby names from the 2000s like Bella (from Twilight) and Alexa (the Amazon device) to see if them gaining pop culture significance was associated with a change in popularity. Obviously, this list of names is not exhaustive, meaning there are many other interesting babynames that can and should be tried out! The code for this plot is very very messy and I think that if I were to do this again, I would want to create a new data frame with just these names so that I would not have to call a new data set in every line of code. Particularly, the colors were a huge hassle because I was calling a new data frame each time. I opted to make the final plot look good, but because of that – I sacrificed code elegance etc.

Siri <- babynames %>%
  filter(name == "Siri" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Alexa <- babynames %>%
  filter(name == "Alexa" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Katrina <- babynames %>%
  filter(name == "Katrina" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Elsa <- babynames %>%
  filter(name == "Elsa" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Isis <- babynames %>%
  filter(name == "Isis" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

Bella <- babynames %>%
  filter(name == "Bella" & sex == "F") %>%
  mutate(prop_factor = as.factor(prop))

colors <- c("Siri" = "deepskyblue2", "Alexa" = "red", "Katrina" = "aquamarine3", "Elsa" = "darkslategray", "Isis" = "darkorchid1", "Bella" = "blue")


Siri %>%
  ggplot(aes(x = year, y = prop)) +
         geom_line(aes(color = "Siri"), size=0.7)+
  geom_line(data = Alexa, aes(color = "Alexa"), size=0.7)+
  geom_line(data = Katrina, aes(color = "Katrina"), size=0.7)+
  geom_line(data = Elsa, aes(color = "Elsa"), size=0.7)+
  geom_line(data = Isis, aes(color = "Isis"), size=0.7)+
  geom_line(data = Bella, aes(color = "Bella"), size=0.7)+
  
  
  geom_vline(xintercept = 2014+.75, color = "red1", alpha = .5)+
  geom_vline(xintercept = 2005, color = "royalblue2", alpha = .5)+
  geom_vline(xintercept = 2013+.75, color = "seagreen3", alpha = .6)+
  geom_vline(xintercept = 2011 +.75, color = "deeppink1", alpha =.5)+
  geom_vline(xintercept = 2013.9 +.75, color =  "turquoise3", alpha = .5)+
   geom_vline(xintercept = 2005.75, color =  "lightgoldenrod4", alpha = .5)+
  
  
  annotate(geom="text", x=2002.75, y=0.0015, label="Hurricane Katrina",
              color="royalblue2")+
  annotate(geom="text", x=2010, y=0.001, label="Siri comes out",
              color="deeppink1")+
   annotate(geom="text", x=2011.5, y=0.003, label="Frozen Premieres",
              color="seagreen3")+
   annotate(geom="text", x=2012.5, y=0.0019, label="Alexa comes out",
              color="red")+
  annotate(geom="text", x=2011.8, y=0.0015, label="Isis Gains Prominence",
              color="turquoise3")+
  annotate(geom="text", x=2003.5, y=0.001, label="Twilight Released",
              color="lightgoldenrod4")+
  
     labs(color = "Legend")+
  

  scale_x_continuous(limits = c(2000, 2017))+
   scale_y_continuous(limits = c(0,.0032))+
  ylab("Proportion")+
  xlab("Year")+
  ggtitle("Popularity of Select Baby Names (2000-2017)")+

                                 
                                
  
  theme_bw()