Olden Names

The code to generate the additional names for the R gender package:

I'm using a number of different census samples from IPUMS to cobble together the data. It's stored in a database, so the loading step is slightly opaque here:


source("/raid/census/functions.R")

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# nrow=10000
tables = src_mysql("IPUMS", user = "bschmidt", password = "Try my raisin Brahms")

## Loading required package: RMySQL
## Loading required package: DBI

persons = tbl(src = tables, "usa_00012")

allnames = persons %>% filter(BIRTHYR != 0, NAMEFRST != "", NAMEFRST != "!") %>% 
    select(NAMEFRST, BIRTHYR, YEAR, PERWT, SEX) %>% collect()

Unlike the Social Security data, this stuff is a sample. So I'm going to create the new numbers by reweighting from the person weights in each sample.

This means some people will be double-counted when they show up in multiple census samples, so it's not as good as the SSA data.

This will have some substantial problems compared to the Social Security data. On the other hand, it will have more of the names that people actually used, as opposed to just the names of those who lived to old age.

I'm only taking people under 62 years old (to avoid too strong a bias towards names associated with high life expectancy), with the cutoff selected so that we start in 1789, as good a year as any.


usefulnames = allnames %>% filter(YEAR - BIRTHYR < 62)
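
The cutoff arithmetic can be checked on a few made-up rows. This is only a sketch: the data frame `toy` is hypothetical, and it assumes the earliest sample with names is the 1850 census, which is consistent with the 1789 start but not stated outright here.

```r
# Hypothetical toy rows; column names mirror the IPUMS extract used above.
toy <- data.frame(
  NAMEFRST = c("JOHN", "MARY", "SARAH", "JAMES"),
  BIRTHYR  = c(1789, 1788, 1803, 1799),
  YEAR     = c(1850, 1850, 1860, 1870),
  stringsAsFactors = FALSE
)

# Age at enumeration, approximated as census year minus birth year.
toy$age <- toy$YEAR - toy$BIRTHYR

# Same condition as the filter above: keep people younger than 62.
kept <- toy[toy$age < 62, ]

# Earliest birth year that can pass the filter in an 1850 sample
# (1850 as the earliest sample year is an assumption).
earliest_birthyr <- 1850 - 61
```

Under that assumption, someone born in 1788 is already 62 in 1850 and drops out, while someone born in 1789 is 61 and stays in.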

Also, I'm going to strip all “Norma Jean” type names down to just “Norma,” and replace some of the most common unambiguous abbreviations. (“WM” to “WILLIAM,” and so forth.)

normed = correctAbbreviations(usefulnames)
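
The body of `correctAbbreviations` lives in `functions.R` and isn't shown here, so the following is only a guess at the kind of cleanup described. The function name `normalize_first_name` and the lookup table are hypothetical; only the "WM" to "WILLIAM" mapping comes from the text above.

```r
# Hypothetical lookup table; the real list of abbreviations is not shown in this document.
abbreviations <- c(WM = "WILLIAM", JNO = "JOHN", CHAS = "CHARLES")

# Hypothetical helper, not the actual correctAbbreviations().
normalize_first_name <- function(x) {
  x <- toupper(trimws(x))
  # Keep only the first whitespace-delimited token ("NORMA JEAN" -> "NORMA").
  first <- sub("[[:space:]].*$", "", x)
  # Drop a trailing period so "WM." matches "WM".
  first <- sub("\\.$", "", first)
  # Swap in the expansion where the token is a known abbreviation.
  hit <- first %in% names(abbreviations)
  first[hit] <- unname(abbreviations[first[hit]])
  first
}

cleaned <- normalize_first_name(c("Norma Jean", "WM", "Wm. H", "MARY"))
```

Only abbreviations that map to a single name belong in a table like this; anything ambiguous is better left alone.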

Now to get the actual counts. The numbers I'll be using are the weighted counts, divided by the mean person weight for that birth year and sex. This is sort of a funky way to do it, but if anyone is using the numbers as probabilities, they may want to know that there are really 2 male Libbys and 1 female, rather than thinking it's 212 and 96, which is what the weights would suggest.

There may be differential death rates by name during the Civil War; there certainly are by gender; but I think those differences are small enough that they don't really matter.



# meanWeight: mean person weight per birth year and sex; then per-name totals within those groups.
counts = normed %>% group_by(BIRTHYR, SEX) %>% mutate(meanWeight = mean(PERWT)) %>% 
    group_by(NAMEFRST, add = TRUE) %>% 
    summarize(n = n(), count = sum(PERWT), ncensuses = length(unique(YEAR)), 
        est = count/meanWeight[1])
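
To make the `est` calculation concrete, here is a minimal base R sketch on invented rows. The data frame `toy` and its weights are made up, chosen so that the weighted Libby totals come out to the 212 and 96 mentioned above. It mirrors the dplyr pipeline rather than replacing it.

```r
# Invented records for a single birth year; SEX follows the coding used above (1 = male, 2 = female).
toy <- data.frame(
  NAMEFRST = c("LIBBY", "LIBBY", "JOHN", "JOHN", "LIBBY", "MARY", "MARY"),
  SEX      = c(1, 1, 1, 1, 2, 2, 2),
  BIRTHYR  = 1803,
  PERWT    = c(106, 106, 100, 100, 96, 100, 104),
  stringsAsFactors = FALSE
)
toy$n <- 1

# Mean person weight within each birth year and sex (the meanWeight column above).
toy$meanWeight <- ave(toy$PERWT, toy$BIRTHYR, toy$SEX, FUN = mean)

# Per-name record counts and weighted totals within each birth year and sex.
est_table <- aggregate(
  toy[c("n", "PERWT")],
  by = toy[c("BIRTHYR", "SEX", "NAMEFRST", "meanWeight")],
  FUN = sum
)
names(est_table)[names(est_table) == "PERWT"] <- "count"

# Weighted total rescaled back to "equally weighted records".
est_table$est <- est_table$count / est_table$meanWeight
```

In this toy case the male Libbys get an `est` of about 2.06 and the female Libby 0.96, instead of 212 and 96. A side effect of dividing by the group mean is that the `est` values within each birth year and sex add up to the number of records in that group.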

Sanity check for birth year 1803. We're looking here at the number of records actually in the censuses (n), the number of estimated people for each name by gender (count), and the number of records we would have seen if all people were equally weighted (est), which is the number we'll be using. Note that to get a sense of the national population with that name, you'd instead want to do something like calculate count/ncensuses, which will give a very different sort of number.

counts %>% filter(BIRTHYR == 1803) %>% arrange(-count)

## Source: local data frame [614 x 7]
## Groups: BIRTHYR, SEX
## 
##    BIRTHYR SEX  NAMEFRST   n count ncensuses    est
## 1     1803   2      MARY 121 12283         2 122.48
## 2     1803   1      JOHN 116 11644         2 115.87
## 3     1803   1   WILLIAM  85  8585         2  85.43
## 4     1803   2     SARAH  71  7098         2  70.78
## 5     1803   1     JAMES  63  6266         2  62.35
## 6     1803   2 ELIZABETH  63  6199         2  61.81
## 7     1803   2     NANCY  41  4021         2  40.10
## 8     1803   1    SAMUEL  35  3472         2  34.55
## 9     1803   2     ELIZA  30  3088         2  30.79
## 10    1803   1    THOMAS  31  2979         2  29.64
## ..     ... ...       ... ...   ...       ...    ...
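
As a quick check on how the columns relate, the first three rows printed above can be plugged back in by hand. This is a sketch only: the data frame `top1803` just copies those values, and the column names `per_census` and `implied_weight` are made up for illustration.

```r
# Values copied from the first three rows of the 1803 table above.
top1803 <- data.frame(
  NAMEFRST  = c("MARY", "JOHN", "WILLIAM"),
  n         = c(121, 116, 85),
  count     = c(12283, 11644, 8585),
  ncensuses = c(2, 2, 2),
  est       = c(122.48, 115.87, 85.43),
  stringsAsFactors = FALSE
)

# Rough national headcount per census: weighted total averaged over the samples it came from.
top1803$per_census <- top1803$count / top1803$ncensuses

# Mean weight implied by each row (count / est); the two male rows should roughly agree.
top1803$implied_weight <- top1803$count / top1803$est
```

The implied weights land a little above 100 for all three rows, which is why `est` tracks `n` so closely here while `count / ncensuses` is on a completely different scale.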

Now reformat this into the right shape for the gender library. I'm ignoring the state data because I believe it to be too sparse.

byYear = counts %>% group_by(BIRTHYR, NAMEFRST) %>% summarize(name = tolower(NAMEFRST[1]), 
    year = BIRTHYR[1], female = sum(est[SEX == 2]), male = sum(est[SEX == 1])) %>% 
    ungroup() %>% select(name, year, female, male)
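
The reshaping step is easier to see on a handful of invented rows. This base R sketch uses a made-up `counts_toy` table and is only meant to mirror the long-to-wide logic above, not to stand in for it.

```r
# Invented long-format rows: one row per birth year, sex, and name.
counts_toy <- data.frame(
  BIRTHYR  = c(1803, 1803, 1803, 1804),
  SEX      = c(1, 2, 2, 1),
  NAMEFRST = c("LIBBY", "LIBBY", "MARY", "JOHN"),
  est      = c(2.06, 0.96, 2.04, 1.50),
  stringsAsFactors = FALSE
)

# Split the estimate into one column per sex; rows of the other sex contribute 0.
counts_toy$female <- ifelse(counts_toy$SEX == 2, counts_toy$est, 0)
counts_toy$male   <- ifelse(counts_toy$SEX == 1, counts_toy$est, 0)

# One row per lower-cased name and birth year, summing each sex column.
wide <- aggregate(
  counts_toy[c("female", "male")],
  by = list(name = tolower(counts_toy$NAMEFRST), year = counts_toy$BIRTHYR),
  FUN = sum
)
```

A name that appears for only one sex in a given year ends up with a 0 in the other column, which matches the zeros in the sampled output below.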

byYear %>% sample_n(10)

## Source: local data frame [10 x 4]
## 
##            name year female    male
## 132521    harit 1837 0.3184  0.0000
## 795334     sa?? 1890 0.6000  0.0000
## 193706      yip 1844 0.0000  0.1771
## 653474  melchor 1878 0.6144  0.3004
## 422422    berry 1865 0.0000 38.3634
## 735542  stephie 1884 0.0000  0.5638
## 1075180  wilkin 1912 0.0000  1.1665
## 356452    vindy 1859 0.2132  0.0000
## 123058   ulrich 1835 0.0000  0.3999
## 1151230    alph 1918 0.0000  0.6386


ipums_usa = byYear

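# Pass the object by name so it is stored in the .rda as `ipums_usa` (piping into save() loses the name).
save(ipums_usa, file = "~/gender/data/IPUMS_USA.rda")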