The code to generate the additional names for the R gender package:
I'm using a number of different census samples from IPUMS to cobble together the data. It's stored in a database, and so is slightly opaque here:
source("/raid/census/functions.R")
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# nrow=10000
tables = src_mysql("IPUMS", user = "bschmidt", password = "Try my raisin Brahms")
## Loading required package: RMySQL
## Loading required package: DBI
persons = tbl(src = tables, "usa_00012")
allnames = persons %>% filter(BIRTHYR != 0, NAMEFRST != "", NAMEFRST != "!") %>%
select(NAMEFRST, BIRTHYR, YEAR, PERWT, SEX) %>% collect()
Unlike the social security data, this stuff is a sample. So I'm going to create the new numbers by reweighting from the person weights in each sample.
This means some persons will be double-counted (when they show up in multiple census samples), so it's not as good as the SSA data.
This will have some substantial problems compared to the social security data. On the other hand, it will have more names used by people as opposed to just aged.
I'm only taking people under 62 y.o. (to avoid too strong a prejudice towards names associated with high life expectancy), with the number selected so that we start in 1789, as good a year as any.
usefulnames = allnames %>% filter(YEAR - BIRTHYR < 62)
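To make the cutoff concrete, here is a minimal sketch on made-up rows (the values below are hypothetical, not IPUMS records). It treats `YEAR - BIRTHYR` as age at enumeration, as the filter above does, and shows that someone aged 61 is kept while someone aged 62 is dropped.

```r
library(dplyr)

# Toy rows with hypothetical values: census year and reported birth year.
toy <- data.frame(
  NAMEFRST = c("MARY", "JOHN", "SARAH"),
  BIRTHYR  = c(1789, 1788, 1840),
  YEAR     = c(1850, 1850, 1880),
  stringsAsFactors = FALSE
)

# Ages are 61, 62 and 40; only rows with age under 62 survive the filter.
kept <- toy %>% filter(YEAR - BIRTHYR < 62)
kept$NAMEFRST
```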
Also, I'm going to strip all “Norma Jean” type names down to just “Norma,” and replace some of the most common unambiguous abbreviations. (“WM” to “WILLIAM,” and so forth.)
normed = correctAbbreviations(usefulnames)
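`correctAbbreviations` lives in the sourced functions file and isn't shown here. Purely as an illustration of the kind of cleanup described, a sketch might look like the following. The function name `normalize_first_name` and the lookup table are hypothetical; only the WM to WILLIAM mapping comes from the text above, and the other entries are assumptions.

```r
# Hypothetical abbreviation table; the real list is not shown in this post.
abbreviations <- c(WM = "WILLIAM", JAS = "JAMES", THOS = "THOMAS")

# Hypothetical helper: keep only the first token of a name, then expand
# any abbreviation found in the lookup table.
normalize_first_name <- function(x) {
  x <- toupper(trimws(x))
  first <- sub("[ .].*$", "", x)          # drop everything from the first space or period
  hit <- first %in% names(abbreviations)  # which names are known abbreviations
  first[hit] <- abbreviations[first[hit]]
  unname(first)
}

cleaned <- normalize_first_name(c("Norma Jean", "WM", "Wm.", "Mary"))
cleaned
```

A lookup table keeps the replacements explicit, which matters here because only unambiguous abbreviations should be expanded.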
Now to get the actual counts. The numbers I'll be using are the weighted counts, divided by the mean person weight for that birth year and sex. This is sort of a funky way to do it, but if anyone is using the numbers as probabilities, they may want to know that there are really 2 male Libbys and 1 female, rather than thinking it's 212 and 96, which is what the weights would suggest.
There may be differential death rates by name during the Civil War; there certainly are by gender; but I think those differences are small enough that they don't really matter.
counts = normed %>% group_by(BIRTHYR, SEX) %>% mutate(meanWeight = mean(PERWT)) %>%
group_by(NAMEFRST, add = T) %>% summarize(n = n(), count = sum(PERWT), ncensuses = length(unique(YEAR)),
est = count/meanWeight[1])
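Here is the same calculation on a tiny made-up table, as a sketch of what the rescaling does. The rows and weights are hypothetical, and the grouping is written out in full rather than with `add`, which is just a choice for this example.

```r
library(dplyr)

# Four hypothetical records for men born in 1850, drawn from two samples.
toy <- data.frame(
  BIRTHYR  = 1850,
  SEX      = c(1, 1, 1, 1),
  NAMEFRST = c("JOHN", "JOHN", "JAMES", "JOHN"),
  YEAR     = c(1880, 1900, 1880, 1900),
  PERWT    = c(100, 50, 100, 150),
  stringsAsFactors = FALSE
)

toy_counts <- toy %>%
  group_by(BIRTHYR, SEX) %>%
  mutate(meanWeight = mean(PERWT)) %>%        # mean weight within birth year and sex: 100
  group_by(BIRTHYR, SEX, NAMEFRST) %>%
  summarise(
    n = n(),                                  # raw records
    count = sum(PERWT),                       # weighted people
    ncensuses = length(unique(YEAR)),         # samples the name appears in
    est = count / meanWeight[1]               # weighted count rescaled to record units
  ) %>%
  ungroup()

toy_counts
```

With these weights, JOHN has three records and a weighted count of 300, so the estimate comes back to 3; JAMES has one record and an estimate of 1. When weights vary more unevenly across names, `est` and `n` drift apart.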
Sanity check for birth year 1803. We're looking here at the number of records actually in the censuses (n), the number of estimated people for each name by gender (count), and the number of records we would have seen if all people were equally weighted (est), which is the number we'll be using. Note that to get a sense of the national population with that name, you'd instead want to do something like calculate count/ncensuses, which will give a very different sort of number.
counts %>% filter(BIRTHYR == 1803) %>% arrange(-count)
## Source: local data frame [614 x 7]
## Groups: BIRTHYR, SEX
##
## BIRTHYR SEX NAMEFRST n count ncensuses est
## 1 1803 2 MARY 121 12283 2 122.48
## 2 1803 1 JOHN 116 11644 2 115.87
## 3 1803 1 WILLIAM 85 8585 2 85.43
## 4 1803 2 SARAH 71 7098 2 70.78
## 5 1803 1 JAMES 63 6266 2 62.35
## 6 1803 2 ELIZABETH 63 6199 2 61.81
## 7 1803 2 NANCY 41 4021 2 40.10
## 8 1803 1 SAMUEL 35 3472 2 34.55
## 9 1803 2 ELIZA 30 3088 2 30.79
## 10 1803 1 THOMAS 31 2979 2 29.64
## .. ... ... ... ... ... ... ...
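For the rough national-population reading mentioned above, a sketch using the top two rows of this output looks like the following. The data frame name `top_1803` is just for illustration.

```r
# Top two rows copied from the 1803 output above.
top_1803 <- data.frame(
  NAMEFRST  = c("MARY", "JOHN"),
  count     = c(12283, 11644),
  ncensuses = c(2, 2),
  stringsAsFactors = FALSE
)

# Average weighted count per census sample the name appears in.
top_1803$per_census <- top_1803$count / top_1803$ncensuses
top_1803
```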
Now reformat this into the right shape for the gender library. I'm ignoring the state data because I believe it to be too sparse.
byYear = counts %>% group_by(BIRTHYR, NAMEFRST) %>% summarize(name = tolower(NAMEFRST[1]),
year = BIRTHYR[1], female = sum(est[SEX == 2]), male = sum(est[SEX == 1])) %>%
ungroup() %>% select(name, year, female, male)
byYear %>% sample_n(10)
## Source: local data frame [10 x 4]
##
## name year female male
## 132521 harit 1837 0.3184 0.0000
## 795334 sa?? 1890 0.6000 0.0000
## 193706 yip 1844 0.0000 0.1771
## 653474 melchor 1878 0.6144 0.3004
## 422422 berry 1865 0.0000 38.3634
## 735542 stephie 1884 0.0000 0.5638
## 1075180 wilkin 1912 0.0000 1.1665
## 356452 vindy 1859 0.2132 0.0000
## 123058 ulrich 1835 0.0000 0.3999
## 1151230 alph 1918 0.0000 0.6386
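Rows in this shape can be turned into a rough share of female usage for a name and year. The sketch below uses two rows copied from the sample above; the column name `prop_female` is my own, and this is not necessarily how the gender package itself combines the numbers.

```r
# Two rows copied from the sample output above.
by_year_sample <- data.frame(
  name   = c("melchor", "berry"),
  year   = c(1878, 1865),
  female = c(0.6144, 0.0000),
  male   = c(0.3004, 38.3634),
  stringsAsFactors = FALSE
)

# Share of the estimated records for each name and year that are female.
by_year_sample$prop_female <- with(by_year_sample, female / (female + male))
by_year_sample
```

Names with estimates well below 1, like melchor here, rest on very few records, so the proportion for them is only weakly informative.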
ipums_usa = byYear
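# save() stores objects under the names passed to it, so name ipums_usa directly;
# piping it in would store the data under the name "." instead.
save(ipums_usa, file = "~/gender/data/IPUMS_USA.rda")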