US Baby Names Analysis

Let’s first load all the files from http://www.ssa.gov/oact/babynames/state/namesbystate.zip and unpacked into a local folder called “babies”

temp <- tempfile()
download.file("http://www.ssa.gov/oact/babynames/state/namesbystate.zip",temp)
fname = unzip(temp, list=TRUE)
unzip(temp, files=fname$Name, exdir="babies", overwrite=TRUE)

Let’s filter out only files with *.TXT extension as those are relevant data per US State and load them all into one data frame as CSV:

flist = list.files(path="babies", pattern ="*.TXT", full.names=TRUE)

out <- do.call(rbind, lapply(flist, function(x) {
  df <- read.table(x, header=FALSE, sep=",")
}))

names(out)= c("State","Sex","Birth","Name","Fr")

str(out)

## 'data.frame':    5552452 obs. of  5 variables:
##  $ State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex  : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Birth: int  1910 1910 1910 1910 1910 1910 1910 1910 1911 1911 ...
##  $ Name : Factor w/ 29828 levels "Aaliyah","Aaron",..: 1062 95 90 1037 623 495 996 453 1062 1037 ...
##  $ Fr   : int  14 12 10 8 7 6 6 5 12 7 ...

As per description, the limitation of the data is that includes names with at least of frequency 5 per state to support privacy and Names are limited in 2-15 character symbols so it does not represent all names in fully, espeically very unique names per state. This might strongly bias the dataset, as it does not include 100% of names and the size of unique name count is unknown.
Attaching “dplyr” package and aggregating to find most popular name

library(dplyr);

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df= out[,4:5]
df$Name = as.character(df$Name)
dd =data.frame(summarize(group_by(df,Name), sum(Fr)))
dd = arrange(dd, -dd$sum.Fr)

So, the most popular name of all time (of either gender) is:

dd$Name[1]

## [1] "James"

Since we have to find most gender ambigious name for 2 years (2013 and 1945), I will write a small function which takes year as an argument and provides output. To understand how gender ambiguous the name is, I will create a gender ambiguity ratio. This ratio will calculate the absolute difference between % Male and % Female [ Ratio = ABS (%M -%F) ]. Since %F can be expressed as %F = 1-%M, the formula can be updated [ Ratio = ABS (%M -1 +%M) OR Ratio = ABS (2*%M -1) ]. In this last Ratio formula, it does not really matter if we include %M or %F, the result will be identical. We’ll assume sending full data frame filtered on a number of years or a year we would like to see results on and, of course when Ratio = 0 (that’s when % Female = % Male = 50% for a given name )

GendAmb = function (arg_df) {
  
df=arg_df  
df$Name = as.character(df$Name)
df = df[,c(2,4:5)]
dd= data.frame(summarize(group_by(df,Name, Sex), sum(Fr)))
uu= data.frame(summarize(group_by(df,Name), sum(Fr)))

dd =dd[!duplicated(dd["Name"]),]

uu$Ratio = abs((2*dd$sum.Fr./uu$sum.Fr.)-1)
uu = arrange(uu,Ratio)
return(subset(uu, Ratio==0))
}

The most gender ambiguous name/s for 2013:

GendAmb(subset(out, Birth==2013))

##     Name sum.Fr. Ratio
## 1   Aris      30     0
## 2  Arlin      10     0
## 3   Cree      22     0
## 4 Devine      20     0
## 5  Eliah      10     0
## 6 Nikita      94     0
## 7  Sonam      10     0
## 8    Tru      22     0

The most gender ambiguous name/s for 1945:

GendAmb(subset(out, Birth==1945))

##    Name sum.Fr. Ratio
## 1 Maxie      38     0

Now let’s calculate largest percentage increase in name usage since 1980 until 2013.

# Subset data from 1980, summarize frequencies and calculate percentage name poularity for 1980 and then sort in "uu" data frame
df = subset(out, Birth==1980)
df$Name = as.character(df$Name)
uu= data.frame(summarize(group_by(df,Name), sum(Fr)))
total = sum(uu$sum.Fr.)
uu$Perc = uu$sum.Fr./total*100
uu = arrange(uu,-Perc)

# Subset data from 2013, summarize frequencies and calculate percentage name poularity for 1980 and then sort in "uu2" data frame
df = subset(out, Birth==2013)
df$Name = as.character(df$Name)
uu2= data.frame(summarize(group_by(df,Name), sum(Fr)))
total = sum(uu2$sum.Fr.)
uu2$Perc = uu2$sum.Fr./total*100
uu2 = arrange(uu2,-Perc)

# Making sure to compare names which existed in 1980 and still exist in 2013 and sort
kk = subset(uu2, Name %in% uu$Name)
kk = arrange(kk, Name)

# Reverse correlate those back to 1980 so that we have same sets of names aka Inner Join and sort again
cc = subset(uu, Name %in% kk$Name)
cc = arrange(cc, Name)

# Calculate percentage difference as 2013 minus 1980
kk$Diff = kk$Perc -cc$Perc

The highest percentage increase in popularity since 1980 to 2013:

kk = arrange(kk, -Diff)
kk[1,]

##     Name sum.Fr.      Perc      Diff
## 1 Sophia   21075 0.6910051 0.6723459

The highest percentage decrease in popularity since 1980 to 2013:

kk = arrange(kk, Diff)
kk[1,]

##       Name sum.Fr.       Perc      Diff
## 1 Jennifer    1666 0.05462465 -1.815549

The above does not take into consideration names which are new in 2013 and did not exist in 1980 or were below 5 frequency threshold as incrase from 0% and, in reverse, names which existed in 1980 and do not exist in 2013 list for maximum decrease in popularity to 0%
Insight. Name diversity

While working with the data, I have noticed that there seems to be many more female names compared to males, so I decided to explore this and plot diversity of female vs. male names on the timeline (where diversity is measured by the total number of unique names per gender per year)

out$Name = as.character(out$Name)
years = unique(out$Birth)
female = lapply(years, function (x) { length(unique(subset(out, Birth==x | Sex =="F")$Name))  })
male = lapply(years, function (x) { length(unique(subset(out, Birth==x | Sex =="M")$Name))  })

This cannot be explained by the gender ratio at birth in the US as it fluctuates 5% only and usually in favor for men based on US Census:

logo

Also these lines cannot be easily correlated with population growth in the US, which from 1915 to 2013 could be graphed as approximation of 45 degree line.

It seems that male names are much more homogenous compared to female names. May be culturally girls are perceived to be more unique than newborn boys in the US. It feels normal to have 3 John’s in the office but not 3 Jennifer’s.However, lack in male name diversity reduces with time. Provided this trend continues, we’ll be able to witness a true equality in “name diversity” between 2 genders.

Also, it is clear to notice that there was a peaks in name diversity in 1920s, around 1990 and 2010. Let’s compare it to the US immigration history with the same dates.

logo

It seems that peaks in immigration coincide with peaks in new unique name introduction both for male and female babies born, which later subsides. Sharpness of peaks in immigration is not the same as in the unique names, as number of uniqe names is limited.