What's the difference between the names in the US census (which are the names of adults or children) and the names in the Social Security Administration data (which are overwhelmingly legal names)?

The version of the gender package being used here is from my GitHub fork.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
set1 = gender::ssa_national %>% group_by(year) %>% mutate(ssa = (female + male)/sum(female + 
    male), fName = female > male) %>% select(name, year, ssa, fName)
set2 = gender::ipums_usa %>% group_by(year) %>% mutate(ipums = (female + male)/sum(female + 
    male), fName = female > male) %>% select(name, year, ipums, fName)

First I join them together, and replace the missing values (names that appear in only one set for a given year) with the smallest observed share in that set, so that division is possible.

s1 = set1 %>% filter(year < 1931)
s2 = set2 %>% filter(year >= 1880)

joint = merge(s1, s2, all = T) %>% as.tbl

joint = joint %>% mutate(isNA = is.na(ssa + ipums))
joint$ssa[is.na(joint$ssa)] = min(joint$ssa, na.rm = T)
joint$ipums[is.na(joint$ipums)] = min(joint$ipums, na.rm = T)
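To make the join-and-fill step concrete, here is a minimal sketch in base R on made-up data. The names, years, and shares below are invented for illustration; only the pattern (full outer join, flag the gaps, fill with the column minimum) mirrors the code above.

```r
# Toy shares per name-year in each source (made-up numbers, not real data).
ssa_toy   <- data.frame(name = c("mary", "john"),   year = 1900, ssa   = c(0.04, 0.03))
ipums_toy <- data.frame(name = c("mary", "tennie"), year = 1900, ipums = c(0.05, 0.001))

# Full outer join on the shared columns (name, year):
# a name missing from one source gets NA in that source's column.
toy <- merge(ssa_toy, ipums_toy, all = TRUE)

# Flag rows that were missing from either source *before* filling them in.
toy$isNA <- is.na(toy$ssa + toy$ipums)

# Replace each NA with the smallest observed share so ratios stay finite.
toy$ssa[is.na(toy$ssa)]     <- min(toy$ssa, na.rm = TRUE)
toy$ipums[is.na(toy$ipums)] <- min(toy$ipums, na.rm = TRUE)

toy
```

Keeping the `isNA` flag matters because the filled-in values are placeholders, not measurements; anything computed from them should be filtered out or treated with suspicion.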

Now let's compare Social Security to IPUMS names. We'll do this by dividing each name's SSA share by its IPUMS share and taking the log: the log scale makes the measure symmetric, with positive values being SSA-friendly names and negative values being IPUMS-friendly names.

I'll take the mean of that difference across all years as the metric: simply adding may be odd because of different year-to-year sizes in the sets. To keep that sane, I'll limit to names that appear at a rate above 0.5% in both sets in more than 10 years.
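A small sketch of why the log scale is convenient here, using invented shares for a single hypothetical name. The ratio is taken as `ssa/ipums`, matching the `topNames` code below.

```r
# Made-up shares for one hypothetical name over three years.
ssa_share   <- c(0.020, 0.010, 0.005)
ipums_share <- c(0.010, 0.010, 0.010)

# Log ratio: above 0 means over-represented in SSA, below 0 in IPUMS.
difference <- log(ssa_share / ipums_share)
difference

# Being twice as common and half as common are equal and opposite on the
# log scale, so they cancel in the mean (up to floating-point error).
mean(difference)

# On the raw ratio scale the same years would average above 1,
# which would wrongly suggest an SSA tilt.
mean(ssa_share / ipums_share)
```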

set.seed(1)
joint %>% sample_n(10)

## Source: local data frame [10 x 6]
## 
##           name year fName       ssa     ipums  isNA
## 192838 durmond 1898  TRUE 2.099e-06 5.266e-06  TRUE
## 270272 fritzie 1925 FALSE 2.099e-06 7.912e-06  TRUE
## 416060   libra 1913  TRUE 2.099e-06 4.582e-06  TRUE
## 659625  thonie 1922  TRUE 2.099e-06 7.668e-06  TRUE
## 146480  collin 1916 FALSE 9.305e-06 7.336e-07  TRUE
## 652492  tennie 1887  TRUE 1.132e-04 7.327e-05 FALSE
## 686108 vernita 1908  TRUE 2.099e-06 9.303e-06  TRUE
## 479930   mauer 1888 FALSE 2.099e-06 2.771e-05  TRUE
## 456918   mancy 1908  TRUE 2.099e-06 3.290e-05  TRUE
## 44875  aontion 1930 FALSE 2.099e-06 9.151e-06  TRUE

topNames = joint %>% group_by(name, fName) %>% filter(!isNA, ssa > 0.005, ipums > 
    0.005) %>% mutate(count = n()) %>% filter(count > 10) %>% mutate(difference = log(ssa/ipums)) %>% 
    summarize(difference = mean(difference), aname = name[1]) %>% arrange(-difference) %>% 
    mutate(gender = ifelse(fName, "Female", "Male"))

topNames %>% ggplot(aes(x = reorder(name, difference), y = difference)) + geom_point(aes(color = gender)) + 
    coord_flip()

[Figure: mean log(ssa/ipums) difference for each name, ordered by difference and colored by gender]

The first thing that jumps out here is that high differences are all female names, and low ones are all male names. That means that there are relatively more women in the SSA sample than in the census.

Why?

Well, keep in mind that these aren't childbirth records like the earlier censuses: these are registration records of people who signed up for Social Security after the program began in the late 1930s.

So some possible reasons that women are over-represented in the Social Security set:

1. Men die younger than women, so more women are still around at the point that they're registering for Social Security benefits.
2. Men just didn't sign up for Social Security at all.
3. Men immigrate at higher rates than women, and something like this is messing it up.

Some of these we can test quite easily: what percentage are male and female, by year, in each set?

genderCounts = joint %>% group_by(year, fName) %>% do(data.frame(group = c("ssa", 
    "ipums"), percent = c(sum(.$ssa), sum(.$ipums))))

ggplot(genderCounts %>% filter(fName == T)) + geom_line(aes(x = year, y = percent, 
    color = group))

[Figure: total share of names classed as female, by year, for the ssa and ipums sets]


gender::ssa_national %>% group_by(year) %>% summarize(male = sum(male), female = sum(female)) %>% 
    mutate(ratio = female/(female + male))

## Source: local data frame [133 x 4]
## 
##    year   male female  ratio
## 1  1880 110491  90993 0.4516
## 2  1881 100746  91955 0.4772
## 3  1882 113687 107850 0.4868
## 4  1883 104630 112322 0.5177
## 5  1884 114445 129022 0.5299
## 6  1885 107801 133055 0.5524
## 7  1886 110786 144534 0.5661
## 8  1887 101414 145982 0.5901
## 9  1888 120854 178628 0.5965
## 10 1889 110587 178365 0.6173
## ..  ...    ...    ...    ...
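As a sanity check, the ratio column can be recomputed from the ten rows printed above. This sketch only retypes those printed counts; it does not touch the full 133-row table.

```r
# Counts copied from the first ten rows of the printed output above.
ssa_head <- data.frame(
  year   = 1880:1889,
  male   = c(110491, 100746, 113687, 104630, 114445,
             107801, 110786, 101414, 120854, 110587),
  female = c( 90993,  91955, 107850, 112322, 129022,
             133055, 144534, 145982, 178628, 178365)
)

# Female share of all registered names in each year.
ssa_head$ratio <- ssa_head$female / (ssa_head$female + ssa_head$male)
round(ssa_head$ratio, 4)

# First year in these rows where women are the majority of records.
first_majority <- min(ssa_head$year[ssa_head$ratio > 0.5])
first_majority
```

In these rows the female share rises in every year, which is the front edge of the gap discussed next.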

That plot reveals that it's almost certainly number 2: people born between 1890 and about 1912 are dramatically more likely to be female in the Social Security Administration set, but those born before 1885 (retirement age as the legislation phased in, and therefore eligible for benefits) or after 1912 (who would have started their working careers after the legislation was passed) are not. I don't know exactly why the gap is so strong in the middle, because I don't remember the history of the program's implementation that well, but it's about twice as strong.

So, it turns out that strong assumptions about gender are baked into this set. That's a cautionary tale: historical notions about sex are imprinted on data.

Race

So, now we know that something's funny about this data, particularly between 1890 and 1910.

And if you know anything about America between 1890 and 1910, it's that race was even more screwed up than gender. So is the racial distribution of names affecting how they show up?

First I'll pull the racial data for every name from the census samples.


source("/raid/census/functions.R")
codebook = read.table("/raid/census/usa_00012.codebook", sep = "\t", stringsAsFactors = F, 
    comment.char = "", quote = "")
names(codebook) = c("field", "fieldname", "code", "value")

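# Read the database password from the environment instead of hard-coding it
# (the variable name IPUMS_DB_PASSWORD is an assumption, not the original setup).
tables = src_mysql("IPUMS", user = "bschmidt", password = Sys.getenv("IPUMS_DB_PASSWORD"))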

## Loading required package: RMySQL
## Loading required package: DBI

persons = tbl(src = tables, "usa_00012")

allnames = persons %>% filter(BIRTHYR != 0, NAMEFRST != "", NAMEFRST != "!") %>% 
    select(NAMEFRST, BIRTHYR, YEAR, PERWT, SEX, RACE) %>% collect()
allnames = allnames %>% correctAbbreviations()
allnames = allnames %>% encode("RACE")

racePercents = allnames %>% filter(BIRTHYR > 1890, BIRTHYR < 1912) %>% group_by(NAMEFRST) %>% 
    summarize(percent_white = sum(PERWT[RACE == "White"])/sum(PERWT), total = sum(PERWT))

racePercents = racePercents %>% mutate(name = tolower(NAMEFRST))
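The `percent_white` column is a weighted share: each sampled person counts for `PERWT` people, so the weights are summed rather than the rows. A minimal base-R sketch on invented records shows the difference; the names, weights, and the "Black" label are made up for illustration (only "White" appears in the code above).

```r
# Made-up person records; PERWT is the person weight from the sample.
toy <- data.frame(
  NAMEFRST = c("MARY", "MARY", "MARY", "JOHN", "JOHN"),
  RACE     = c("White", "White", "Black", "White", "Black"),
  PERWT    = c(100, 100, 200, 300, 100)
)

# Weighted share: total weight of white records over total weight, per name.
white_wt      <- tapply(toy$PERWT * (toy$RACE == "White"), toy$NAMEFRST, sum)
total_wt      <- tapply(toy$PERWT, toy$NAMEFRST, sum)
percent_white <- white_wt / total_wt
percent_white

# Unweighted share of rows, for comparison: it differs whenever weights vary.
tapply(toy$RACE == "White", toy$NAMEFRST, mean)
```

Despite the column name, the value is a fraction between 0 and 1 rather than a percentage.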

Then plot that against the distance metric calculated earlier…

withRace = topNames %>% inner_join(racePercents)

## Joining by: "name"




ggplot(withRace, aes(y = percent_white, x = difference, color = gender, label = name)) + 
    geom_text() + geom_smooth(method = "lm") + scale_x_continuous(label = exp)

[Figure: percent_white against difference for each name, labeled by name and colored by gender, with a linear fit per gender]
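A note on reading the x axis: passing `exp` as the label function turns each log difference back into a plain SSA-to-IPUMS ratio. A tiny sketch with arbitrary tick positions (not values taken from the plot):

```r
# Arbitrary positions on the log-difference axis.
difference <- c(-0.5, 0, 0.5)

# exp() undoes the log, giving the SSA/IPUMS ratio shown as the axis label.
ratio <- exp(difference)
round(ratio, 2)
```

A label of 1 therefore means the name is equally common in both sets, and labels at equal distances either side of it are reciprocals of each other.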