What's the difference between the names in the US census (which are the names of adults or children) and the names in the Social Security Administration data (which are overwhelmingly legal names)?
The version of the gender package used here is from my GitHub fork.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
set1 = gender::ssa_national %>% group_by(year) %>% mutate(ssa = (female + male)/sum(female +
male), fName = female > male) %>% select(name, year, ssa, fName)
set2 = gender::ipums_usa %>% group_by(year) %>% mutate(ipums = (female + male)/sum(female +
male), fName = female > male) %>% select(name, year, ipums, fName)
First I join them together. A name missing from one set (effectively a zero) gets that set's minimum observed share instead, to make division possible.
s1 = set1 %>% filter(year < 1931)
s2 = set2 %>% filter(year >= 1880)
joint = merge(s1, s2, all = T) %>% as.tbl
joint = joint %>% mutate(isNA = is.na(ssa + ipums))
joint$ssa[is.na(joint$ssa)] = min(joint$ssa, na.rm = T)
joint$ipums[is.na(joint$ipums)] = min(joint$ipums, na.rm = T)
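As a toy illustration of the fill step above, here is a minimal base-R sketch. The two small data frames and their values are invented for the example; only the merge-then-fill pattern mirrors the real code.

```r
# Hypothetical toy shares, not real SSA or IPUMS values.
toy_ssa <- data.frame(name = c("mary", "john"), year = 1900, ssa = c(0.04, 0.03))
toy_ipums <- data.frame(name = c("mary", "mamie"), year = 1900, ipums = c(0.05, 0.01))

# Full outer join: a name missing from one source gets NA in that column.
toy_joint <- merge(toy_ssa, toy_ipums, all = TRUE)

# Flag rows that were missing from either source before filling them in.
toy_joint$isNA <- is.na(toy_joint$ssa + toy_joint$ipums)

# Replace each NA with the smallest share observed in that column,
# so that a ratio of the two columns is always defined.
toy_joint$ssa[is.na(toy_joint$ssa)] <- min(toy_joint$ssa, na.rm = TRUE)
toy_joint$ipums[is.na(toy_joint$ipums)] <- min(toy_joint$ipums, na.rm = TRUE)

print(toy_joint)
```

Keeping the isNA flag matters because the filled-in values are placeholders, not observations; the later filter on !isNA drops them again.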
Now let's compare Social Security to IPUMS names. We'll do this by dividing each name's SSA share by its IPUMS share and taking the log: the log scale makes the measure symmetric, with positive values for SSA-friendly names and negative values for IPUMS-friendly names.
I'll take the mean of that difference across all years as the metric: simply adding may be odd because of different year-to-year sizes in the sets. To keep that sane, I'll limit to names that appear at a rate greater than .5% in both sets in more than 10 years.
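To make the metric concrete, here is a small sketch on invented shares (the names and numbers are hypothetical). It shows why the log is useful: a name twice as common in SSA and a name twice as common in IPUMS get scores of equal size and opposite sign, so averaging across years treats both directions evenly.

```r
# Hypothetical yearly shares for two invented names.
toy <- data.frame(
  name  = c("a", "a", "b", "b"),
  year  = c(1900, 1901, 1900, 1901),
  ssa   = c(0.02, 0.02, 0.01, 0.01),
  ipums = c(0.01, 0.01, 0.02, 0.02)
)

# Per-year log ratio: positive when the SSA share is larger.
toy$difference <- log(toy$ssa / toy$ipums)

# Mean log ratio per name across years.
metric <- tapply(toy$difference, toy$name, mean)
print(metric)

# exp() turns a score back into a plain ratio of shares.
print(exp(metric))
```

Without the log, the two names would score 2 and 0.5, and a plain mean would weight over-representation in SSA more heavily than the reverse.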
set.seed(1)
joint %>% sample_n(10)
## Source: local data frame [10 x 6]
##
## name year fName ssa ipums isNA
## 192838 durmond 1898 TRUE 2.099e-06 5.266e-06 TRUE
## 270272 fritzie 1925 FALSE 2.099e-06 7.912e-06 TRUE
## 416060 libra 1913 TRUE 2.099e-06 4.582e-06 TRUE
## 659625 thonie 1922 TRUE 2.099e-06 7.668e-06 TRUE
## 146480 collin 1916 FALSE 9.305e-06 7.336e-07 TRUE
## 652492 tennie 1887 TRUE 1.132e-04 7.327e-05 FALSE
## 686108 vernita 1908 TRUE 2.099e-06 9.303e-06 TRUE
## 479930 mauer 1888 FALSE 2.099e-06 2.771e-05 TRUE
## 456918 mancy 1908 TRUE 2.099e-06 3.290e-05 TRUE
## 44875 aontion 1930 FALSE 2.099e-06 9.151e-06 TRUE
topNames = joint %>% group_by(name, fName) %>% filter(!isNA, ssa > 0.005, ipums >
0.005) %>% mutate(count = n()) %>% filter(count > 10) %>% mutate(difference = log(ssa/ipums)) %>%
summarize(difference = mean(difference), aname = name[1]) %>% arrange(-difference) %>%
mutate(gender = ifelse(fName, "Female", "Male"))
topNames %>% ggplot(aes(x = reorder(name, difference), y = difference)) + geom_point(aes(color = gender)) +
coord_flip()
The first thing that jumps out here is that high differences are all female names, and low ones are all male names. That means that there are relatively more women in the SSA sample than in the census.
Why?
Well, keep in mind that these aren't childbirth records like the earlier censuses: these are registration records of people who signed up for Social Security after the program began in the late 1930s.
So some possible reasons that women are over-represented in the Social Security set:
1. Men die younger than women, so more women are still around at the point that they're registering for Social Security benefits.
2. Men just didn't sign up for Social Security at all.
3. Men immigrate at higher rates than women, and something like this is messing it up.
Some of these we can test quite easily: what percentage are male and female, by year, in each set?
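A minimal sketch of that kind of calculation on invented counts for a single year (all values are hypothetical). It also shows one caveat: summing name shares by a name's majority gender only approximates the true female share, because ambiguous names are assigned wholesale to one side.

```r
# Hypothetical counts for a single year; not real data.
toy <- data.frame(
  name   = c("mary", "john", "leslie"),
  year   = 1900,
  female = c(90, 1, 12),
  male   = c(0, 80, 17)
)

# Each name's share of all people born that year.
toy$share <- (toy$female + toy$male) / sum(toy$female + toy$male)

# A name counts as female when most of its bearers are female.
toy$fName <- toy$female > toy$male

# Share of the year held by majority-female names, next to the true female share.
by_name_gender <- sum(toy$share[toy$fName])
true_female <- sum(toy$female) / sum(toy$female + toy$male)
print(c(by_name_gender = by_name_gender, true_female = true_female))
```

The two numbers agree only to the extent that names are unambiguous, which is why it helps to check the name-based series against raw male and female totals as well.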
genderCounts = joint %>% group_by(year, fName) %>% do(data.frame(group = c("ssa",
"ipums"), percent = c(sum(.$ssa), sum(.$ipums))))
ggplot(genderCounts %>% filter(fName == T)) + geom_line(aes(x = year, y = percent,
color = group))
gender::ssa_national %>% group_by(year) %>% summarize(male = sum(male), female = sum(female)) %>%
mutate(ratio = female/(female + male))
## Source: local data frame [133 x 4]
##
## year male female ratio
## 1 1880 110491 90993 0.4516
## 2 1881 100746 91955 0.4772
## 3 1882 113687 107850 0.4868
## 4 1883 104630 112322 0.5177
## 5 1884 114445 129022 0.5299
## 6 1885 107801 133055 0.5524
## 7 1886 110786 144534 0.5661
## 8 1887 101414 145982 0.5901
## 9 1888 120854 178628 0.5965
## 10 1889 110587 178365 0.6173
## .. ... ... ... ...
That plot reveals that it's almost certainly number 2: people born between 1890 and about 1912 are dramatically more likely to be female in the Social Security Administration set, but those born before 1885 (retirement age as the legislation phased in, and therefore eligible for benefits) or after 1912 (who would have started their working careers after the legislation was passed) are not. I don't know exactly why the gap is so strong in the middle, because I don't remember my implementation history that well, but it's about twice as strong.
So, it turns out that strong assumptions about gender are baked into this set. That's a cautionary tale: historical notions about sex are imprinted on data.
So, now we know that something's funny about this data, particularly between 1890 and 1910.
And if you know anything about America between 1890 and 1910, it's that race was even more screwed up than gender. So is the racial distribution of names affecting how they show up?
First I'll pull the racial data for every name from the census samples.
source("/raid/census/functions.R")
codebook = read.table("/raid/census/usa_00012.codebook", sep = "\t", stringsAsFactors = F,
comment.char = "", quote = "")
names(codebook) = c("field", "fieldname", "code", "value")
tables = src_mysql("IPUMS", user = "bschmidt", password = "Try my raisin Brahms")
## Loading required package: RMySQL
## Loading required package: DBI
persons = tbl(src = tables, "usa_00012")
allnames = persons %>% filter(BIRTHYR != 0, NAMEFRST != "", NAMEFRST != "!") %>%
select(NAMEFRST, BIRTHYR, YEAR, PERWT, SEX, RACE) %>% collect()
allnames = allnames %>% correctAbbreviations()
allnames = allnames %>% encode("RACE")
racePercents = allnames %>% filter(BIRTHYR > 1890, BIRTHYR < 1912) %>% group_by(NAMEFRST) %>%
summarize(percent_white = sum(PERWT[RACE == "White"])/sum(PERWT), total = sum(PERWT))
racePercents = racePercents %>% mutate(name = tolower(NAMEFRST))
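The weighted share computed above can be sketched on a few invented person records. The rows are hypothetical; PERWT is the IPUMS person weight, so each record stands in for that many people rather than counting once.

```r
# Hypothetical person records; PERWT is the IPUMS person weight.
toy_people <- data.frame(
  NAMEFRST = c("ANNA", "ANNA", "ANNA", "OTIS", "OTIS"),
  RACE     = c("White", "White", "Black", "White", "Black"),
  PERWT    = c(100, 100, 50, 100, 300),
  stringsAsFactors = FALSE
)

# Weighted share of each name's bearers recorded as white.
weighted_white <- function(d) sum(d$PERWT[d$RACE == "White"]) / sum(d$PERWT)
percent_white <- sapply(split(toy_people, toy_people$NAMEFRST), weighted_white)
print(percent_white)
```

An unweighted count would put OTIS at one half white; with the weights it comes out at one quarter, which is the reason for summing PERWT rather than counting rows.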
Then plot that against the distance metric calculated earlier…
withRace = topNames %>% inner_join(racePercents)
## Joining by: "name"
ggplot(withRace, aes(y = percent_white, x = difference, color = gender, label = name)) +
geom_text() + geom_smooth(method = "lm") + scale_x_continuous(labels = exp)