Synopsis

I took the baby-names data set that Hadley Wickham publishes on his GitHub account and ran some exploratory data analysis on it, purely for practice. After drawing several basic plots, I focused on gender-neutral names. My conclusion: although the overall number of distinct names given to babies is rising, the number of gender-neutral names is on the decline, and so is the share of babies who receive them.

Data Acquisition

The dataset is downloaded from the GitHub account of Hadley Wickham, who took care of some of the initial pre-processing and tidying. The one problem is that the documentation is missing.

url <- "https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv"
babynames <- read.csv(url, header = TRUE)

Analysis

  1. What are the most popular girl/boy names in 2008? Starting light, I wanted to visualize the most popular boy and girl names given in the last year of the dataset, 2008:
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
top08 <- babynames %>% filter(year == 2008)  %>% select(name, percent, sex) %>% arrange(desc(percent)) %>% group_by(sex) %>% top_n(n =10, wt = percent)
library(ggplot2)
ggplot(top08, aes(x = reorder(name, percent), y = percent, fill = sex)) + geom_bar(stat = "identity") + coord_flip() + xlab("") + ylab("") + ggtitle(label = "Most popular baby names in 2008") + scale_y_continuous(labels = scales::percent) + facet_wrap(c("sex"), scales = "free_y") + theme_minimal() + theme(legend.position = "none") + scale_fill_manual(values = c("blue", "red"))

  2. List all names that have been assigned to both girls and boys in a given year. This helper function returns all unique names assigned to both genders in a given year. I will refer to those names as gender-neutral or bigender names. The definition is as loose as possible: as soon as a name is recorded for both girls and boys in a given year, I classify it as gender-neutral.
binamesinyear <- function(myyear){
      require(dplyr)
      
      boynames <- babynames %>% filter(sex=="boy" & year == myyear) %>% distinct(name)
      boynames <- sapply(boynames[,"name"], as.character)

      girlnames <- babynames %>% filter(sex=="girl" & year == myyear) %>% distinct(name)
      girlnames <- sapply(girlnames[,"name"], as.character)

      bigender <- intersect(girlnames, boynames)

      return(bigender)
      
}

binames08 <- binamesinyear(2008)
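The same loose definition can also be applied to every year at once instead of calling the helper year by year. What follows is only a minimal sketch, not the code used for this report: toy is a made-up stand-in for babynames with the same columns, and bigender_all and bicount_all are hypothetical names.

```r
library(dplyr)

# Made-up stand-in for babynames (same columns, invented values).
toy <- data.frame(
  year    = c(2007, 2007, 2007, 2008, 2008, 2008, 2008),
  name    = c("Riley", "Riley", "Emma", "Riley", "Jacob", "Quinn", "Quinn"),
  percent = c(0.002, 0.001, 0.009, 0.002, 0.010, 0.001, 0.001),
  sex     = c("boy", "girl", "girl", "girl", "boy", "boy", "girl"),
  stringsAsFactors = FALSE
)

# A name counts as gender-neutral in a year if it occurs with more than one sex.
bigender_all <- toy %>%
  group_by(year, name) %>%
  summarise(n_sex = n_distinct(sex), .groups = "drop") %>%
  filter(n_sex > 1)

# Number of gender-neutral names per year.
bicount_all <- bigender_all %>% count(year, name = "value")
```

On the real data, replacing toy with babynames should give one row per year and bigender name, assuming the sex column only takes the values "boy" and "girl".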

Let’s take a closer look at the data. These are the names given to both genders in 2008, in random order.

sample(binamesinyear(2008))
##  [1] "Taylor"    "Alexis"    "Emery"     "Justice"   "Skyler"   
##  [6] "Sage"      "London"    "Armani"    "Parker"    "Harley"   
## [11] "Rylee"     "Jaden"     "Ariel"     "Cameron"   "Teagan"   
## [16] "Ali"       "Hayden"    "Kayden"    "Camryn"    "Jaidyn"   
## [21] "Kendall"   "Rory"      "Payton"    "Casey"     "Zion"     
## [26] "Peyton"    "Bailey"    "Skylar"    "Jessie"    "Logan"    
## [31] "Dylan"     "Jayden"    "Avery"     "Jaiden"    "Eden"     
## [36] "Emerson"   "Reese"     "Ryan"      "Charlie"   "Angel"    
## [41] "Jaylin"    "Kasey"     "Quinn"     "Lyric"     "Jaylen"   
## [46] "Kamari"    "Harper"    "Reagan"    "Finley"    "Riley"    
## [51] "Jadyn"     "Jordan"    "Amari"     "Morgan"    "Micah"    
## [56] "Marley"    "Dominique" "Jordyn"    "Addison"   "Rowan"    
## [61] "Phoenix"   "Sidney"    "Jamie"     "Dakota"

One interesting bigender name: Cortney

cortney <- babynames %>% filter(name =="Cortney") %>% select(year, sex, percent)

colorPallete <- c("blue", "red")

ggplot(cortney, aes(x = year, y = percent, colour = sex, group = sex)) + geom_line() + scale_y_continuous(labels = scales::percent) + ggtitle("Babies named Cortney in the US over time") + theme_minimal() + theme(legend.position = "right") + scale_colour_manual(values = colorPallete)

Another interesting bigender name: Angel

angel <- babynames %>% filter(name == "Angel") %>% select(year, sex, percent)

colorPallete <- c("blue", "red")

ggplot(angel, aes(x = year, y = percent, colour = sex, group = sex)) + geom_line() + scale_y_continuous(labels = scales::percent) + ggtitle("Babies named Angel in the US over time") + theme_minimal() + theme(legend.position = "right") + scale_colour_manual(values = colorPallete)

  3. Examine the trend in the number of gender-neutral names over time. The number of bigender names given to babies appears to be on the decline. Perhaps we are just using fewer names in general?

library(reshape2)
bicount <- c()
for (year in 1880:2008) {bicount <- c(bicount,length(binamesinyear(year)))}
names(bicount) <- 1880:2008
bicount <- as.data.frame(bicount)
bicount$year <- rownames(bicount)
bicount <- melt(bicount)
## Using year as id variables
ggplot(bicount, aes(x = year,y = value, group="variable")) + geom_line(alpha = 0.4) + theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle(label=" Number of gender neutral names in a given year") + geom_smooth(method="loess")
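Growing a vector inside a for loop and reshaping it with melt works, but it leaves year as a character column. A possible alternative, sketched here on made-up data, is to compute one count per year with vapply and keep year numeric. The names toy, count_bigender and bicount_df are hypothetical and not part of the original analysis.

```r
# Made-up stand-in for babynames (invented values).
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008, 2008),
  name = c("Riley", "Riley", "Emma", "Riley", "Jacob", "Quinn", "Quinn"),
  sex  = c("boy", "girl", "girl", "girl", "boy", "boy", "girl"),
  stringsAsFactors = FALSE
)

# Count the names that appear for both sexes in one year.
count_bigender <- function(data, myyear) {
  in_year <- data[data$year == myyear, ]
  boys  <- unique(in_year$name[in_year$sex == "boy"])
  girls <- unique(in_year$name[in_year$sex == "girl"])
  length(intersect(boys, girls))
}

# One row per year; vapply checks that every result is a single integer.
years <- sort(unique(toy$year))
bicount_df <- data.frame(
  year  = years,
  value = vapply(years, function(y) count_bigender(toy, y), integer(1))
)
```

With a numeric year, ggplot(bicount_df, aes(x = year, y = value)) + geom_line() draws a continuous axis, so neither the group aesthetic nor the rotated axis labels are needed.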

This little function calculates the overall number of unique names given to babies in a given year. We need it to examine how the total number of names changes from year to year.

namesinyear <- function(myyear){
      require(dplyr)
      yearnames <- babynames %>% filter(year == myyear) %>% distinct(name)
      yearnames <- sapply(yearnames[,"name"], as.character)
      return(length(yearnames))
}

It looks like the number of names in a year has actually been growing in the last few decades.

namescount <- c()
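for (year in 1880:2008){namescount <- c(namescount,namesinyear(year))}
# Label each count with its year; otherwise the row names are just 1, 2, 3, ...
names(namescount) <- 1880:2008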

namescount <- as.data.frame(namescount)
namescount$year <- rownames(namescount)
namescount <- melt(namescount)
## Using year as id variables
ggplot(namescount, aes(x = year,y = value, group="variable")) + geom_line(alpha = 0.4) + theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle(label="Number of names in a given year") + geom_smooth(method="loess")

So perhaps we need to look not at the number of names, but at the percentage of kids who are given those names.

biperct <- function(myyear){
      require(dplyr)
      
      boynames <- babynames %>% filter(sex=="boy" & year == myyear) %>% distinct(name)
      boynames <- sapply(boynames[,"name"], as.character)

      girlnames <- babynames %>% filter(sex=="girl" & year == myyear) %>% distinct(name)
      girlnames <- sapply(girlnames[,"name"], as.character)

      bigender <- intersect(girlnames, boynames)
      
      bikids <- babynames %>% filter(year == myyear & name %in% bigender)

      return(sum(bikids$percent))
      
}

And no… even when we work with the percentages, the trend is still clearly decreasing.

namesbiperct <- c()
for (year in 1880:2008){namesbiperct <- c(namesbiperct,biperct(year))}
names(namesbiperct) <- 1880:2008

namesbiperct <- as.data.frame(namesbiperct)
namesbiperct$year <- rownames(namesbiperct)
namesbiperct <- melt(namesbiperct)
## Using year as id variables
ggplot(namesbiperct, aes(x = year,y = value, group="variable")) + geom_line(alpha = 0.4) + theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle(label="Share of babies with gender-neutral names") + geom_smooth(method="loess")  + scale_y_continuous(labels = scales::percent)

Conclusion and further steps

This analysis has been surprising to me. Perhaps the number of names is rising but the new names are so plentiful that they are not being captured by the statistics. One possible reason for the rise would be genderization of names (Angel -> Angela). A possible avenue for further analysis would be to devise and calculate an overall “variance measure” for the names given in each year and see how it develops over time. Another possibility would be to download and work with the full government dataset (available on Kaggle), but since this analysis was motivated by nothing other than my wish to practice, I am happy with this result.
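As a rough idea of what such a “variance measure” could look like, one option is the Shannon entropy of the name shares within a year: it is low when a few names dominate and high when babies are spread evenly over many names. This is an assumption on my part and has not been computed on the real data. The sketch below uses a made-up data frame toy, and name_entropy is a hypothetical helper.

```r
# Made-up stand-in for babynames (invented values).
toy <- data.frame(
  year    = c(1900, 1900, 2000, 2000, 2000, 2000),
  name    = c("Mary", "John", "Emma", "Jacob", "Riley", "Quinn"),
  percent = c(0.06, 0.02, 0.01, 0.01, 0.01, 0.01),
  stringsAsFactors = FALSE
)

# Shannon entropy (in bits) of a vector of shares.
# The shares are rescaled to sum to 1 first, because the percent
# column does not sum to 100% within a year.
name_entropy <- function(p) {
  p <- p[p > 0]
  p <- p / sum(p)
  -sum(p * log2(p))
}

# One entropy value per year.
entropy_by_year <- tapply(toy$percent, toy$year, name_entropy)
```

In the toy data, 2000 has four equally common names and therefore a higher entropy than 1900, where one name dominates.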

Furthermore, one problem to point out is that I don’t know what the percentage in this dataset measures. It does not sum to 100%, neither across a year nor across a combination of year and gender.
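The check behind this observation can be sketched as follows. The data frame toy is a made-up stand-in with the same columns as babynames, and by_year and by_year_sex are hypothetical names; on the real data, replace toy with babynames.

```r
# Made-up stand-in for babynames (invented values).
toy <- data.frame(
  year    = c(2008, 2008, 2008, 2008),
  name    = c("Jacob", "Riley", "Emma", "Riley"),
  percent = c(0.010, 0.002, 0.009, 0.001),
  sex     = c("boy", "boy", "girl", "girl"),
  stringsAsFactors = FALSE
)

# Sum of the percent column per year ...
by_year <- aggregate(percent ~ year, data = toy, FUN = sum)

# ... and per combination of year and sex.
by_year_sex <- aggregate(percent ~ year + sex, data = toy, FUN = sum)
```

If the percentages covered all babies, one of these sums would equal 1 for every group.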