Let’s first load all the files from http://www.ssa.gov/oact/babynames/state/namesbystate.zip and unpacked into a local folder called “babies”
temp <- tempfile()
download.file("http://www.ssa.gov/oact/babynames/state/namesbystate.zip",temp)
fname = unzip(temp, list=TRUE)
unzip(temp, files=fname$Name, exdir="babies", overwrite=TRUE)
Let’s filter out only files with *.TXT extension as those are relevant data per US State and load them all into one data frame as CSV:
flist = list.files(path="babies", pattern ="*.TXT", full.names=TRUE)
out <- do.call(rbind, lapply(flist, function(x) {
df <- read.table(x, header=FALSE, sep=",")
}))
names(out)= c("State","Sex","Birth","Name","Fr")
str(out)
## 'data.frame': 5552452 obs. of 5 variables:
## $ State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Birth: int 1910 1910 1910 1910 1910 1910 1910 1910 1911 1911 ...
## $ Name : Factor w/ 29828 levels "Aaliyah","Aaron",..: 1062 95 90 1037 623 495 996 453 1062 1037 ...
## $ Fr : int 14 12 10 8 7 6 6 5 12 7 ...
As per description, the limitation of the data is that includes names with at least of frequency 5 per state to support privacy and Names are limited in 2-15 character symbols so it does not represent all names in fully, espeically very unique names per state. This might strongly bias the dataset, as it does not include 100% of names and the size of unique name count is unknown.
Attaching “dplyr” package and aggregating to find most popular name
library(dplyr);
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df= out[,4:5]
df$Name = as.character(df$Name)
dd =data.frame(summarize(group_by(df,Name), sum(Fr)))
dd = arrange(dd, -dd$sum.Fr)
So, the most popular name of all time (of either gender) is:
dd$Name[1]
## [1] "James"
GendAmb = function (arg_df) {
df=arg_df
df$Name = as.character(df$Name)
df = df[,c(2,4:5)]
dd= data.frame(summarize(group_by(df,Name, Sex), sum(Fr)))
uu= data.frame(summarize(group_by(df,Name), sum(Fr)))
dd =dd[!duplicated(dd["Name"]),]
uu$Ratio = abs((2*dd$sum.Fr./uu$sum.Fr.)-1)
uu = arrange(uu,Ratio)
return(subset(uu, Ratio==0))
}
The most gender ambiguous name/s for 2013:
GendAmb(subset(out, Birth==2013))
## Name sum.Fr. Ratio
## 1 Aris 30 0
## 2 Arlin 10 0
## 3 Cree 22 0
## 4 Devine 20 0
## 5 Eliah 10 0
## 6 Nikita 94 0
## 7 Sonam 10 0
## 8 Tru 22 0
The most gender ambiguous name/s for 1945:
GendAmb(subset(out, Birth==1945))
## Name sum.Fr. Ratio
## 1 Maxie 38 0
# Subset data from 1980, summarize frequencies and calculate percentage name poularity for 1980 and then sort in "uu" data frame
df = subset(out, Birth==1980)
df$Name = as.character(df$Name)
uu= data.frame(summarize(group_by(df,Name), sum(Fr)))
total = sum(uu$sum.Fr.)
uu$Perc = uu$sum.Fr./total*100
uu = arrange(uu,-Perc)
# Subset data from 2013, summarize frequencies and calculate percentage name poularity for 1980 and then sort in "uu2" data frame
df = subset(out, Birth==2013)
df$Name = as.character(df$Name)
uu2= data.frame(summarize(group_by(df,Name), sum(Fr)))
total = sum(uu2$sum.Fr.)
uu2$Perc = uu2$sum.Fr./total*100
uu2 = arrange(uu2,-Perc)
# Making sure to compare names which existed in 1980 and still exist in 2013 and sort
kk = subset(uu2, Name %in% uu$Name)
kk = arrange(kk, Name)
# Reverse correlate those back to 1980 so that we have same sets of names aka Inner Join and sort again
cc = subset(uu, Name %in% kk$Name)
cc = arrange(cc, Name)
# Calculate percentage difference as 2013 minus 1980
kk$Diff = kk$Perc -cc$Perc
The highest percentage increase in popularity since 1980 to 2013:
kk = arrange(kk, -Diff)
kk[1,]
## Name sum.Fr. Perc Diff
## 1 Sophia 21075 0.6910051 0.6723459
The highest percentage decrease in popularity since 1980 to 2013:
kk = arrange(kk, Diff)
kk[1,]
## Name sum.Fr. Perc Diff
## 1 Jennifer 1666 0.05462465 -1.815549
The above does not take into consideration names which are new in 2013 and did not exist in 1980 or were below 5 frequency threshold as incrase from 0% and, in reverse, names which existed in 1980 and do not exist in 2013 list for maximum decrease in popularity to 0%
Insight. Name diversity
While working with the data, I have noticed that there seems to be many more female names compared to males, so I decided to explore this and plot diversity of female vs. male names on the timeline (where diversity is measured by the total number of unique names per gender per year)
out$Name = as.character(out$Name)
years = unique(out$Birth)
female = lapply(years, function (x) { length(unique(subset(out, Birth==x | Sex =="F")$Name)) })
male = lapply(years, function (x) { length(unique(subset(out, Birth==x | Sex =="M")$Name)) })
This cannot be explained by the gender ratio at birth in the US as it fluctuates 5% only and usually in favor for men based on US Census:
Also these lines cannot be easily correlated with population growth in the US, which from 1915 to 2013 could be graphed as approximation of 45 degree line.
It seems that male names are much more homogenous compared to female names. May be culturally girls are perceived to be more unique than newborn boys in the US. It feels normal to have 3 John’s in the office but not 3 Jennifer’s.However, lack in male name diversity reduces with time. Provided this trend continues, we’ll be able to witness a true equality in “name diversity” between 2 genders.
Also, it is clear to notice that there was a peaks in name diversity in 1920s, around 1990 and 2010. Let’s compare it to the US immigration history with the same dates.
It seems that peaks in immigration coincide with peaks in new unique name introduction both for male and female babies born, which later subsides. Sharpness of peaks in immigration is not the same as in the unique names, as number of uniqe names is limited.