Gender imbalance in SSA data

The raw SSA data is gender imbalanced. (Ben Schmidt pointed this out to me.) Here is the problem:

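library(dplyr)
library(ggplot2)

# Proportion of SSA records that are female, by year of birth
ratios <- gender::ssa_national %>%
  group_by(year) %>%
  summarize(ratio_female = sum(female) / sum(female + male))

ratios %>%
  ggplot(aes(x = year, y = ratio_female)) +
  geom_line() +
  # Reference lines: an even split, and the year of the Social Security Act
  geom_hline(yintercept = 0.5, color = "darkgray") +
  geom_vline(xintercept = 1935, color = "darkgray") +
  # annotate() draws the label once rather than once per row of data
  annotate("text", label = "SSA of 1935", x = 1950, y = 0.51, size = 6) +
  ggtitle("Proportion of SSA data that is female")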

[Figure: Proportion of SSA data that is female, by year, with reference lines at 0.5 and at 1935.]

I haven’t figured out why the data should be skewed. The most important observation is that the names before the Social Security Act of 1935—that is, the names of people who were born before Social Security existed and thus who were registered for it after the fact—were skewed heavily female, while the names since 1935 have been mostly male. Perhaps the reason that the names are skewed is that more women than men born around the start of the twentieth century were still alive. (Though note that the biggest bubble in names peaks in 1900-1910, and people born in those years were too young to collect Social Security immediately.) Another possibility is that since Social Security was written to exclude farm workers, perhaps more men were excluded. A more minor puzzle is why there are more men than women registered after 1935. I need to do some research into the implementation of the Social Security Act during the New Deal. (Shane Landrum’s dissertation on birth registration has some leads here.)

Whatever the reason for the skewed data, the problem remains of how best to fix it. The fix is to calculate a correction factor for each year. If for a given year the proportion of female names in the SSA data is \(r\) and the actual proportion of female births in that year is \(b\), then the correction factor \(c\) satisfies \(rc = b\), so \(c = \frac{b}{r}\).
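As a quick check of the arithmetic, here is a minimal sketch using made-up counts for a single year. The numbers are hypothetical, not SSA data, and the variable names are only for illustration.

```r
# Hypothetical counts for a single year (not real SSA data)
female <- 600
male <- 400

r <- female / (female + male)  # observed proportion female: 0.6
b <- 0.5                       # assumed actual proportion of female births
c_factor <- b / r              # correction factor, from r * c = b

# Multiplying the observed ratio by the factor recovers the assumed ratio
corrected <- r * c_factor
print(c_factor)
print(corrected)
```

A factor below 1 means the year is over-represented by female names and has to be scaled down; a factor above 1 means the opposite.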

The question then is what we should assume for \(b\), the actual ratio of births. (This is referred to as the secondary sex ratio, to distinguish it from the sex ratio at conception.) If we had good data, we might be able to calculate the actual secondary sex ratio for each year. The trouble is that birth registration, as Shane Landrum demonstrates, was spotty and not necessarily reliable. Trying to calculate the secondary sex ratio from U.S. Census data in IPUMS is probably not accurate enough. For what it’s worth, Eldridge and Siegel observed that in 1940 “male births were more completely registered than female births” but also that the secondary sex ratio did not seem to fluctuate meaningfully. The simplest thing to do is to assume that the secondary sex ratio, expressed as the proportion of births that are female, is 0.5 for every year: if that is off, it is not off by much.
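To get a rough sense of how much the choice of \(b\) matters, the sketch below compares 0.5 with one alternative. The alternative assumes about 105 male births per 100 female births, a commonly cited figure that is an assumption for illustration and not taken from the sources cited here. The observed ratio `r` is also a made-up value.

```r
# Assumption for illustration: roughly 105 male births per 100 female births
b_even <- 0.5
b_alt <- 100 / (100 + 105)   # about 0.488

# Hypothetical observed proportion female in the SSA data for some year
r <- 0.6

c_even <- b_even / r
c_alt <- b_alt / r

# Relative difference between the two correction factors; it does not
# depend on r, because r cancels out of the ratio
relative_difference <- (c_even - c_alt) / c_even
print(round(relative_difference, 3))
```

Under that assumption the two correction factors differ by about 2.4 percent, which is small next to the skew in the early SSA data.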

Making that assumption, we can calculate a table of correction factors by year.

correction_factors <- 
  ratios %>%
  mutate(correction_factor = 0.5 / ratio_female) %>%
  select(-ratio_female)
print(correction_factors)

## Source: local data frame [133 x 2]
## 
##    year correction_factor
## 1  1880            1.1071
## 2  1881            1.0478
## 3  1882            1.0271
## 4  1883            0.9658
## 5  1884            0.9435
## 6  1885            0.9051
## 7  1886            0.8833
## 8  1887            0.8474
## 9  1888            0.8383
## 10 1889            0.8100
## ..  ...               ...

But since the package allows users to select an arbitrary range of years, what we really need is a function that takes a range of years and calculates the correction factor for that range. That function will determine the SSA sex ratio for the range of years, then figure out the correction factor for that range using the same formula as above. We will provide tests and fuller documentation for this function in the package itself.

calculate_correction_factor <- function(years) {
  # `years` is a length-two vector: c(first_year, last_year), inclusive
  selection <- dplyr::filter(gender::ssa_national,
                             year >= years[1], year <= years[2])

  # Proportion of SSA records in the range that are female
  ratio_female <- sum(selection$female) / sum(selection$female + selection$male)

  # c = b / r, assuming b = 0.5 as above
  0.5 / ratio_female
}

Let’s test this.

calculate_correction_factor(c(1890,1900))

## [1] 0.756

calculate_correction_factor(c(1920,1950))

## [1] 0.993

calculate_correction_factor(c(1935,2000))

## [1] 1.029

How long does it take to calculate a correction factor?

system.time(calculate_correction_factor(c(1880, 2000)))

##    user  system elapsed 
##   0.077   0.009   0.085

system.time(calculate_correction_factor(c(1900, 1900)))

##    user  system elapsed 
##   0.048   0.002   0.050

That amount of elapsed time is not insignificant. For calculations involving a single year, we’re better off using a cached version of the calculated correction factor.
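One way to cache single-year factors is to compute them once and look them up by year. The sketch below uses a small made-up data frame standing in for `gender::ssa_national`; the counts and the names `toy_ssa`, `cached_factors`, and `lookup_correction_factor` are hypothetical, and this is not necessarily how the package will do it.

```r
# Hypothetical stand-in for gender::ssa_national, one row per name and year
toy_ssa <- data.frame(
  year   = c(1900, 1900, 1901, 1901),
  female = c(300, 300, 450, 0),
  male   = c(200, 200, 0, 450)
)

# Compute each year's totals once, then the factor c = 0.5 / ratio_female
totals <- aggregate(cbind(female, male) ~ year, data = toy_ssa, FUN = sum)
cached_factors <- setNames(
  0.5 / (totals$female / (totals$female + totals$male)),
  totals$year
)

# Looking up a cached factor is a simple index by year
lookup_correction_factor <- function(year) {
  unname(cached_factors[as.character(year)])
}

print(lookup_correction_factor(1900))
print(lookup_correction_factor(1901))
```

The per-year table is small (one number per year), so it could be stored alongside the data and reused for every single-year query.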

This is good enough to be implemented in the gender package now, leaving room for improvement as needs be.

References

Eldridge, Hope Tisdale, and Jacob S. Siegel. “The Changing Sex Ratio in the United States.” American Journal of Sociology 52, no. 3 (November 1, 1946): 224–34. http://www.jstor.org/stable/2771067.

Landrum, Shane. “The State’s Big Family Bible: Birth Certificates, Personal Identity, and Citizenship in the United States, 1840–1950.” PhD dissertation, Brandeis University, 2014.


Lincoln Mullen

07/16/2014
