Who died in the Civil War?

I was curious about differential death rates by name during the Civil War, so I thought I'd see how

I'm limiting to white men because slaves were not counted by name before the Civil War, so black men can't be taken as a continuous set. I'll leave in white women as a control.


codebook = read.table("usa_00012.codebook", sep = "\t", stringsAsFactors = F, 
    comment.char = "", quote = "")
names(codebook) = c("field", "fieldname", "code", "value")

# Helper code; sourcing this file also attaches dplyr.
source("functions.R")
# nrow=10000
tables = src_mysql("IPUMS", user = "bschmidt", password = "Try my raisin Brahms")
persons = tbl(src = tables, "usa_00012")

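# Keep white respondents (RACE == 1) with a usable first name who were born
# 1821-1849 and appear in the 1860 or 1870 census.
allnames = persons %>% filter((YEAR == 1860 | YEAR == 1870), (BIRTHYR > 1820 & 
    BIRTHYR < 1850), BIRTHYR != 0, NAMEFRST != "", NAMEFRST != "!", RACE == 
    1) %>% select(NAMEFRST, BIRTHYR, YEAR, PERWT, SEX, BPL) %>% collect()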

allnames = allnames %>% mutate(nativeborn = BPL < 100, YEAR = factor(YEAR))

Finally, some data cleaning to change “WM” to “William” and the like:

allnames = correctAbbreviations(allnames)
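
The body of correctAbbreviations lives in functions.R and isn't shown here, so the following is only a guess at the general shape of such a step, not the actual function. The name expandAbbreviations, the lookup table, and its entries are all made up for illustration:

```r
# Hypothetical lookup table: census abbreviation -> full first name.
# These pairs are illustrative; the real list in functions.R is not shown.
abbreviations <- c(WM = "William", JNO = "John", CHAS = "Charles",
                   THOS = "Thomas", GEO = "George", JAS = "James")

# Hypothetical helper: replace known abbreviations, leave other names alone.
expandAbbreviations <- function(df, lookup = abbreviations) {
  # Normalize the key: trim, upper-case, drop a trailing period ("Chas." -> "CHAS")
  key <- sub("\\.$", "", toupper(trimws(df$NAMEFRST)))
  hit <- key %in% names(lookup)
  df$NAMEFRST[hit] <- unname(lookup[key[hit]])
  df
}

toy <- data.frame(NAMEFRST = c("WM", "Chas.", "Mary", "JNO"),
                  stringsAsFactors = FALSE)
cleaned <- expandAbbreviations(toy)
print(cleaned$NAMEFRST)
```

A named character vector keeps the mapping in one place, and any name that isn't in the table passes through untouched.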

As a sanity test, we'll plot the most common names in each year for native-born Americans (don't want those Irish immigrants mucking things up).


topnames = allnames %>% filter(SEX == 1, nativeborn) %>% group_by(YEAR, NAMEFRST) %>% 
    summarize(count = sum(PERWT)) %>% group_by(YEAR) %>% mutate(rank = rank(-count)) %>% 
    group_by(NAMEFRST, add = F) %>% filter(min(rank) <= 10) %>% mutate(displayCount = sum(count))

ggplot(topnames) + geom_bar(aes(x = reorder(NAMEFRST, displayCount), y = count, 
    fill = YEAR), stat = "identity", position = "dodge") + coord_flip()

[Figure: dodged bar chart of weighted counts for the top names among native-born men, 1860 vs. 1870]

And this looks good: every name decreases in use from 1860 to 1870 for our age cohort. But there also seem to be differential rates of decrease, such as “Charles” going down much less than “Henry,” say.

Let's check the same chart with women included:


topnames = allnames %>% filter(nativeborn) %>% group_by(YEAR, NAMEFRST) %>% 
    summarize(count = sum(PERWT)) %>% group_by(YEAR) %>% mutate(rank = rank(-count)) %>% 
    group_by(NAMEFRST, add = F) %>% filter(min(rank) <= 10) %>% mutate(displayCount = sum(count))

ggplot(topnames) + geom_bar(aes(x = reorder(NAMEFRST, displayCount), y = count, 
    fill = YEAR), stat = "identity", position = "dodge") + coord_flip()

[Figure: the same bar chart with native-born women included]

Hrrm. Not looking so good for the project: women's names are declining at about the same rate as men's.

Let's formalize that for some of the most common names. This plot shows, for every individual name, the change from 1860 to 1870 (the 1870 count divided by the 1860 count) across a whole bunch of birth years. So the boxplots show not the overall change but a distribution: how much the 1845-born population with that name changed, how much the 1846-born population with that name changed, and so forth.


topchanges = allnames %>% filter(nativeborn) %>% group_by(BIRTHYR, NAMEFRST, 
    SEX) %>% filter(n() > 20) %>% summarize(y1 = sum(PERWT[YEAR == 1870]), y2 = sum(PERWT[YEAR == 
    1860]), change = y1/y2)

ggplot(topchanges %>% group_by(NAMEFRST) %>% filter(n() > 10) %>% mutate(pos = median(change))) + 
    geom_boxplot(aes(x = reorder(NAMEFRST, change), y = change, fill = factor(SEX))) + 
    scale_y_log10(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1, 1.2, 1.5, 
        2, 3, 5, 7, 10, 20)) + coord_flip()

[Figure: boxplots of the 1870/1860 ratio by first name, filled by sex, on a log scale]


What this seems to be showing is that there is some reason to think that women are doing better than men (the top 7 names are all female), but they are interspersed enough that it's hard to be sure.
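
To make the quantity in those boxplots concrete, here is a toy version of the per-cohort ratio on made-up numbers. It uses base R (xtabs) rather than the dplyr pipeline so it runs without the database; the data frame and its weights are invented, and only the column names come from the real extract:

```r
# Invented data: two names, two birth years, one weighted row per census year.
toy <- data.frame(
  NAMEFRST = rep(c("JOHN", "CHARLES"), each = 4),
  BIRTHYR  = rep(c(1840, 1840, 1841, 1841), times = 2),
  YEAR     = rep(c(1860, 1870), times = 4),
  PERWT    = c(200, 100, 100, 80, 100, 100, 50, 45),
  stringsAsFactors = FALSE
)

# Sum the person weights for every name x birth year x census year cell.
counts <- xtabs(PERWT ~ NAMEFRST + BIRTHYR + YEAR, data = toy)

# Ratio of 1870 to 1860 counts: one value per name and birth cohort.
# Values below 1 mean the cohort shrank between the two censuses.
change <- counts[, , "1870"] / counts[, , "1860"]

# Each boxplot summarizes one row of this matrix; the median is its center line.
medians <- apply(change, 1, median)
print(change)
print(medians)
```

A cell with no 1860 observations would divide by zero here, which is one reason to drop small groups before taking the ratio.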

I'm not even clear that Civil War casualties are visible here. Looking at the overall counts, it does appear that more men than women are missing after 10 years, at least:


totals = allnames %>% filter(nativeborn) %>% group_by(YEAR, SEX) %>% summarize(count = sum(PERWT))

ggplot(totals) + geom_bar(aes(x = factor(SEX, labels = c("Male", "Female")), 
    y = count, fill = factor(YEAR)), stat = "identity", position = "dodge")

[Figure: bar chart of total weighted counts by sex, 1860 vs. 1870]

Still, that's not enough to really hang one's hat on.

On the other hand, there's enough to be vaguely suggestive: the top-surviving men's names are “Benjamin,” “Robert,” “Samuel,” “Albert,” and “Charles.” When I looked at names disproportionately represented in library catalogues, which is something that should correlate well with social class, those were mostly over-represented names.

The most-missing names are “Joseph,” “Thomas,” “Peter,” “James,” and “John”: those are mostly among the names that don't write library books.

So there's something going on, surely: but I haven't yet started to try to disentangle regional origin, which matters enormously, and this doesn't seem likely to be very fruitful if the changes aren't really marked. So I'm going to write this off as a failure and move on.