Babynames Popularity in the US

library(tidyverse)
library(lubridate)
library(babynames)
library(gganimate)  # needed for transition_reveal() in Animation II below
life_df <- babynames::lifetables
birth_df <- babynames::births
babynames_df <- babynames::babynames
?babynames
babynames

Introduction

For the second part, I will be exploring the babynames data set that contains data on names attributed to newborns since 1880, including the number of newborns and proportion of newborns who received a particular name. With the current data set, I will be creating an animated plot of the top 10 given names per year in the United States between 1880 and 1929.

Babynames is a dataset collected by the Social Security Administration, initially in 1998 under Actuarial Note #139, Name Distributions in the Social Security Area, August 1997. The dataset provides the distribution of given names of Social Security number holders who were born in the United States of America (50 states and District of Columbia) after 1879.

One limitation is that the data do not include the names of many people born before 1937, because those people did not apply for a Social Security card. In addition, births in U.S. territories are not included in the national data.

The dataset contains a total of 5 variables and 1000+ names, from the most popular to the least popular. For this study, I will be using all of the variables:

  • year, the year in which a name was given to a newborn

  • sex, the sex of the baby to which a name was given

  • name, the name given to a newborn

  • n, the number of newborns given the name

  • prop, the proportion of newborn babies to which a name was given

Initial Exploration

Initial Explorations (10 points). Provide a few static graphics of what your animation is for a few different points in time. You likely made these before adding the code for the animation. Comment on any trends that you see.

Graph 1: Name popularity - 1880

Graph 2: Name popularity - 1890

Graph 3: Name popularity - 1900

Graph 4: Name popularity - 1920

Observable Trends: As time passed, Mary became the most used name, given to more than 60,000 babies.

  1. John loses popularity between 1890 and 1900, and gains momentum again in the 1920s.

  2. William is the second most popular male name throughout the whole time period.

  3. The top 10 names all sound very English, which may suggest that the immigrants living in the US at the time came mainly from the UK or its surroundings, since Spanish, German and French names do not appear.

  4. Helen was overall the second most used female name during the whole period.
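The observations above are read off the static plots. As a rough check, a sketch like the following could follow one name's rank within its sex from year to year. The data here is a small made-up table with the same columns as babynames_df, and rank_within_sex is a hypothetical helper name, not part of any package.

```r
library(dplyr)

# Toy data with the same columns as babynames_df (the counts are made up)
toy <- tibble::tribble(
  ~year, ~sex, ~name,     ~n,
  1880,  "M",  "John",    900,
  1880,  "M",  "William", 850,
  1880,  "M",  "James",   500,
  1880,  "F",  "Mary",    700,
  1880,  "F",  "Anna",    260,
  1890,  "M",  "John",    800,
  1890,  "M",  "William", 780,
  1890,  "M",  "James",   790,
  1890,  "F",  "Mary",    1200,
  1890,  "F",  "Anna",    500
)

# Hypothetical helper: rank every name within its year and sex (1 = most common)
rank_within_sex <- function(df) {
  df %>%
    group_by(year, sex) %>%
    mutate(rank = min_rank(desc(n))) %>%
    ungroup()
}

# Follow one name over time
william_rank <- rank_within_sex(toy) %>%
  filter(name == "William", sex == "M") %>%
  select(year, name, n, rank)

print(william_rank)
```

Passing babynames_df instead of toy would give the real ranks, which could then be compared with statements such as observation 2.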

Results I

  1. Animation I: Name Popularity

Discussion I

The animation shows the top 10 most popular names in the US from 1880 to 1929. It showcases the main names given to baby girls and baby boys, and it gives us an idea of which names were given most at the beginning of the period and then stopped being used frequently, and vice versa. There were three things that I was able to observe in the trend. Firstly, Mary and John for the most part seem to be among the top 10 names throughout the whole period between 1880 and 1929, which makes sense considering that biblical names would be widely used because of the prevalence of religious beliefs in all aspects of life.

Secondly, Robert became an increasingly used name from around 1914 until 1929. Lastly, Anna and William were popular at the end of the 1800s, but their popularity then decreased steadily.
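The claim that some names stay in the top 10 throughout the period could be checked with a sketch along these lines. It assumes dplyr 1.0.0 or later for slice_max(), uses made-up toy data with the same columns as babynames_df, and always_in_top is a hypothetical helper name. Like the animation, it ranks both sexes together within each year.

```r
library(dplyr)

# Toy data with the same columns as babynames_df (the counts are made up)
toy <- tibble::tribble(
  ~year, ~sex, ~name,     ~n,
  1880,  "F",  "Mary",    950,
  1880,  "M",  "John",    900,
  1880,  "M",  "William", 850,
  1880,  "F",  "Anna",    260,
  1890,  "F",  "Mary",    1200,
  1890,  "M",  "John",    800,
  1890,  "M",  "William", 780,
  1890,  "F",  "Anna",    500,
  1900,  "F",  "Mary",    1600,
  1900,  "M",  "Robert",  1000,
  1900,  "M",  "John",    950,
  1900,  "M",  "William", 900
)

# Hypothetical helper: names that appear in the top k of every year in df
always_in_top <- function(df, k = 10) {
  n_years <- n_distinct(df$year)
  df %>%
    group_by(year) %>%
    slice_max(n, n = k, with_ties = FALSE) %>%   # top k rows per year
    group_by(sex, name) %>%
    summarise(years_in_top = n_distinct(year), .groups = "drop") %>%
    filter(years_in_top == n_years)
}

stable_names <- always_in_top(toy, k = 3)
print(stable_names)
```

With the real data, the same call on babynames_df filtered to 1880-1929 with k = 10 would list the names that never leave the top 10.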

Results II

  1. Animation II: Birth by gender
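# Count the rows for each sex in each year. Each row of babynames_df is one
# year/sex/name combination, so this is the number of distinct names recorded.
babynames_plot2 <- babynames_df %>%
  mutate(female = if_else(sex == "F", 1, 0), male = if_else(sex == "M", 1, 0)) %>%
  group_by(year) %>%
  summarise(Male = sum(male), Female = sum(female)) %>%
  arrange(year)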
babynames_plot2 <- babynames_plot2 %>%
  pivot_longer(cols = c(Male, Female), names_to = "Sex", values_to = "Count")

staticplot2 = ggplot(babynames_plot2, aes(year, Count, group = Sex, color = Sex )) + 
  geom_line() +
  scale_colour_brewer(palette = "Set2") +
  labs(x= "Year", y= "Count of Names") +
  theme(legend.position = "top")

staticplot2 + transition_reveal(year)

Discussion II

The animation above shows the count of names by sex over time. Because the code counts rows (one per year, sex and name) rather than summing n, the y-axis is the number of distinct names recorded, not the number of births. Overall, more distinct female names than male names were recorded, which suggests a greater variety of names given to girls; it does not by itself show that more girls than boys were born.
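The plot's y-axis is labelled "Count of Names", and a count of names can differ from a count of babies. A minimal sketch of the difference, on made-up toy data with the same columns as babynames_df, is to compute both quantities side by side:

```r
library(dplyr)

# Toy data with the same columns as babynames_df (the counts are made up)
toy <- tibble::tribble(
  ~year, ~sex, ~name,     ~n,
  1880,  "F",  "Mary",    700,
  1880,  "F",  "Anna",    260,
  1880,  "F",  "Emma",    200,
  1880,  "M",  "John",    900,
  1880,  "M",  "William", 850
)

by_sex <- toy %>%
  group_by(year, sex) %>%
  summarise(
    distinct_names = dplyr::n(),  # number of rows, i.e. distinct names
    babies = sum(n),              # number of babies carrying those names
    .groups = "drop"
  )

print(by_sex)
```

In this toy example there are more distinct female names (3 against 2) but more male babies (1750 against 1160), so the two measures can point in opposite directions.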

Wrap-Up

If I were to work on this further, I would look deeper into the times in which unisex names started becoming more common, and study when giving traditionally male names to girls (and vice versa) started happening, just out of curiosity.
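One possible starting point for that follow-up, sketched on made-up toy data, is to compute the share of babies with each name who were recorded as female and keep the names whose share is far from both 0 and 1. The 0.2 and 0.8 cut-offs and the helper name unisex_share are assumptions chosen only for illustration.

```r
library(dplyr)

# Toy data with the same columns as babynames_df (the counts are made up)
toy <- tibble::tribble(
  ~year, ~sex, ~name,    ~n,
  1920,  "F",  "Mary",   7000,
  1920,  "M",  "Mary",   30,
  1920,  "F",  "Leslie", 400,
  1920,  "M",  "Leslie", 600,
  1920,  "M",  "John",   5600
)

# Hypothetical helper: names whose female share lies between lower and upper
unisex_share <- function(df, lower = 0.2, upper = 0.8) {
  df %>%
    group_by(year, name) %>%
    summarise(
      female = sum(n[sex == "F"]),  # 0 when the name has no female rows
      total = sum(n),
      .groups = "drop"
    ) %>%
    mutate(share_female = female / total) %>%
    filter(share_female >= lower, share_female <= upper)
}

unisex_1920 <- unisex_share(toy)
print(unisex_1920)
```

Counting the rows of this result per year on the real data would give a first impression of when such names became more common.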

Appendix:

R code (10 points) Include an Appendix with all of your R code (you can include code from Part 1 or leave it out). The 10 points here are allocated to how “readable” your code is and if the code is correct.

##R code: PART 2 ONLY 
library(tidyverse)
library(lubridate)
library(babynames)
life_df <- babynames::lifetables
birth_df <- babynames::births
babynames_df <- babynames::babynames
?babynames
babynames

library(gifski)
library(gganimate)

## 1880 
# Top 10 names in 1880 (a name used for both sexes is counted by its larger n)
babynames_1880 <- babynames_df %>%
  filter(year == 1880) %>%
  group_by(year, name) %>%
  summarise(count = max(n)) %>%
  arrange(desc(count)) %>%
  slice(1:10)

# Order the names by count so that the bars are sorted
babynames_1880 <- babynames_1880 %>%
  mutate(Name_Ordered = fct_reorder(name, count))

ggplot(data = babynames_1880, aes(x = count, y = Name_Ordered)) +
  geom_col(color = "green", fill = "turquoise3") +
  labs(title = "Most Popular Names - 1880",
       subtitle = "Top 10 Names") +
  theme_classic()

## 1890

# Top 10 names in 1890
babynames_1890 <- babynames_df %>%
  filter(year == 1890) %>%
  group_by(year, name) %>%
  summarise(count = max(n)) %>%
  arrange(desc(count)) %>%
  slice(1:10)

babynames_1890 <- babynames_1890 %>%
  mutate(Name_Ordered = fct_reorder(name, count))

ggplot(data = babynames_1890, aes(x = count, y = Name_Ordered)) +
  geom_col(color = "green", fill = "turquoise3") +
  labs(title = "Most Popular Names - 1890",
       subtitle = "Top 10 Names") +
  theme_classic()

## 1900

# Top 10 names in 1900
babynames_1900 <- babynames_df %>%
  filter(year == 1900) %>%
  group_by(year, name) %>%
  summarise(count = max(n)) %>%
  arrange(desc(count)) %>%
  slice(1:10)

babynames_1900 <- babynames_1900 %>%
  mutate(Name_Ordered = fct_reorder(name, count))

ggplot(data = babynames_1900, aes(x = count, y = Name_Ordered)) +
  geom_col(color = "green", fill = "turquoise3") +
  labs(title = "Most Popular Names - 1900",
       subtitle = "Top 10 Names") +
  theme_classic()

## 1920

# Top 10 names in 1920
babynames_1920 <- babynames_df %>%
  filter(year == 1920) %>%
  group_by(year, name) %>%
  summarise(count = max(n)) %>%
  arrange(desc(count)) %>%
  slice(1:10)

babynames_1920 <- babynames_1920 %>%
  mutate(Name_Ordered = fct_reorder(name, count))

ggplot(data = babynames_1920, aes(x = count, y = Name_Ordered)) +
  geom_col(color = "green", fill = "turquoise3") +
  labs(title = "Most Popular Names - 1920",
       subtitle = "Top 10 Names") +
  theme_classic()

## Static Plot
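babynames_plot <- babynames_df %>%
  # Keep the 1880-1929 window described in the report
  filter(year <= 1929) %>%
  group_by(year) %>%
  mutate(rank = rank(-n),
         n_rel = n / n[rank == 1],
         # Label each bar with its count (n / 1e9 would round every label to 0)
         n_lbl = paste0(" ", scales::comma(n))) %>%
  filter(rank <= 10) %>%
  ungroup()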

staticplot = ggplot(babynames_plot, aes(rank, group = name, 
                fill = as.factor(name), color = as.factor(name)))  +
  geom_tile(aes(y = n/2,
                height = n,
                width = 0.9), alpha = 0.8, color = NA) +
  geom_text(aes(y = 0, label = paste(name, " ")), vjust = 0.5, hjust = 1) +
  geom_text(aes(y=n ,label = n_lbl, hjust=0)) +
  coord_flip(clip = "off", expand = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_reverse() +
  guides(color = "none", fill = "none") +
  theme(axis.line=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        legend.position="none",
        panel.background=element_blank(),
        panel.border=element_blank(),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major.x = element_line( size=.1, color="grey" ),
        panel.grid.minor.x = element_line( size=.1, color="grey" ),
        plot.title=element_text(size=25, hjust=0.5, face="bold", colour="grey", vjust=-1),
        plot.subtitle=element_text(size=18, hjust=0.5, face="italic", color="grey"),
        plot.caption =element_text(size=8, hjust=0.5, face="italic", color="grey"),
        plot.background=element_blank(),
        plot.margin = margin(2,2, 2, 4, "cm"))

anim = staticplot + transition_states(year, transition_length = 4, state_length = 1) +
  view_follow(fixed_x = TRUE)  +
  labs(title = 'Name popularity per year : {closest_state}',  
       subtitle  =  "Top 10 Names",
       caption  = "count in discrete number | Data Source: Social Security Administration")

animate(anim, 200, fps = 20,  width = 1200, height = 1000, duration = 5, 
        renderer = gifski_renderer("gganim.gif"))

##Animation II

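# Count the rows for each sex in each year. Each row of babynames_df is one
# year/sex/name combination, so this is the number of distinct names recorded.
babynames_plot2 <- babynames_df %>%
  mutate(female = if_else(sex == "F", 1, 0), male = if_else(sex == "M", 1, 0)) %>%
  group_by(year) %>%
  summarise(Male = sum(male), Female = sum(female)) %>%
  arrange(year)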

babynames_plot2 <- babynames_plot2 %>%
  pivot_longer(cols = c(Male, Female), names_to = "Sex", values_to = "Count")

staticplot2 = ggplot(babynames_plot2, aes(year, Count, group = Sex, color = Sex )) + 
  geom_line() +
  scale_colour_brewer(palette = "Set2") +
  labs(x = "Year", y = "Count of Names") +
  theme(legend.position = "top")

staticplot2 + transition_reveal(year)

Appendix:

Data Wrangling Explanation:

To prepare the data, I had to filter the data by year, create a rank variable (since I was trying to illustrate the top 10 most given names at a particular time), and arrange everything from largest to smallest values. Working with the babynames data set was fairly simple, considering that it does not have many variables and is clean enough to work with from the beginning.
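The filter, rank and arrange steps described above are repeated once per year in the static plots. As a sketch only, they could be wrapped in one helper. The name top_names is hypothetical, the toy data is made up, and slice_max() assumes dplyr 1.0.0 or later. As in the appendix code, a name used for both sexes is counted by its larger n.

```r
library(dplyr)

# Toy data with the same columns as babynames_df (the counts are made up)
toy <- tibble::tribble(
  ~year, ~sex, ~name,     ~n,
  1880,  "F",  "Mary",    700,
  1880,  "M",  "Mary",    20,
  1880,  "M",  "John",    900,
  1880,  "M",  "William", 850,
  1890,  "F",  "Mary",    1200,
  1890,  "M",  "John",    800
)

# Hypothetical helper: the k most given names in one year, largest first
top_names <- function(df, yr, k = 10) {
  df %>%
    filter(year == yr) %>%
    group_by(name) %>%
    summarise(count = max(n), .groups = "drop") %>%
    slice_max(count, n = k, with_ties = FALSE)
}

top_1880 <- top_names(toy, 1880, k = 2)
print(top_1880)
```

Calling top_names(babynames_df, 1880), top_names(babynames_df, 1890) and so on would then produce the tables behind Graphs 1 to 4.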