So: a little more about the spikes in the SSA dataset.
When do spikes happen?
I found another dataset that includes birth estimates in the US since 1910.
So we can plot what percentage of births have an SSN.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
library(scales)
load(file = "births.rda")
# Total SSNs (female + male) per birth year
set3 = gender::ssa_national %>%
  group_by(year) %>%
  summarize(ssa = sum(female + male))
# Share of estimated births in each year that have an SSN
ggplot(set3 %>% inner_join(births)) +
  geom_line(aes(x = year, y = ssa/count)) +
  scale_y_continuous("Percentage of births with SSN", labels = percent) +
  labs(title = "Percentage of births with an associated SSN, 1910-2010ish")
## Joining by: "year"
There are two different spikes: one rapid rise between 1905 and 1920, and one less rapid between 1925 and 1950.
Both groups show a rapid rise, but women may be higher before 1920 and men are higher after 1920. That suggests that the post-1925 rise is related to workforce issues, and the pre-1925 one to something else.
Those born before 1920 will have already entered the workforce when social security began; they will have already been retiring by 1980, which seems less interesting.
Shane Landrum suggests it might be Great Society policies. I like this idea, particularly because it squares with my failure to find black under-representation in the data, but I can't see anything that would dramatically affect people who were 45 in 1965 but not people who were 55. Which isn't to say there isn't anything.
# Reshape to one row per name/year/gender, then total SSNs per year and gender
gendered = gender::ssa_national %>%
  melt(id.vars = c("name", "year")) %>%
  group_by(year, variable) %>%
  summarize(ssa = sum(value))
ggplot(gendered %>% inner_join(births)) +
  geom_line(aes(x = year, y = ssa/count, color = variable)) +
  scale_y_continuous("Percentage of births with SSN", labels = percent)
## Joining by: "year"
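To put a number on the gap between the two lines, one option is the ratio of women's to men's SSN counts in each year. This is only a minimal sketch on invented toy values: the `toy` data frame and the `f_to_m` column are names I made up for illustration, not anything from the gender package. Swapping in `gendered` should work, assuming it has the same `year`, `variable`, and `ssa` columns.

```r
library(dplyr)

# Toy data with invented values, NOT the real SSA counts.
toy <- data.frame(
  year     = c(1915, 1915, 1930, 1930),
  variable = c("female", "male", "female", "male"),
  ssa      = c(600, 500, 900, 1000)
)

# One row per year: women's SSN count divided by men's.
ratio <- toy %>%
  group_by(year) %>%
  summarize(f_to_m = ssa[variable == "female"] / ssa[variable == "male"])

print(ratio)
```

A value above 1 means more SSNs for women than for men born that year; plotting `f_to_m` against `year` would show where the lines cross.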
Here are the raw numbers of SSNs assigned.
One interesting feature is the spike in 1900, which suggests that some sort of fudging is going on for people for whom no birthdate is known.
One interesting question that might be answerable from these statistics is: were there particular cutoffs that led to overstating of birth numbers at particular times? Did a lot of people claim to be 65 who were only 62 in 1936, for instance? One would have to really know the legislative history to answer this.
set3 = gender::ssa_national %>%
  group_by(year) %>%
  summarize(ssa = sum(female + male))
ggplot(set3) + geom_line(aes(x = year, y = ssa))
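One rough way to hunt for cutoff years like the 1900 spike is to compare each year's count with the average of the years on either side. The sketch below is a crude heuristic under my own assumptions, not an established method: `flag_spikes` and its `threshold` argument are hypothetical names, and the toy numbers are invented. It should accept `set3` directly, since it only needs `year` and `ssa` columns.

```r
library(dplyr)

# Toy data with invented values: a smooth trend plus an artificial bump in 1900.
toy <- data.frame(
  year = 1896:1904,
  ssa  = c(100, 104, 108, 112, 150, 120, 124, 128, 132)
)

# Hypothetical helper: flag years whose count exceeds the mean of the two
# neighboring years by more than `threshold` (1.15 = 15% higher).
flag_spikes <- function(df, threshold = 1.15) {
  df %>%
    arrange(year) %>%
    mutate(
      neighbors = (lag(ssa) + lead(ssa)) / 2,  # NA for first and last year
      ratio     = ssa / neighbors,
      spike     = !is.na(ratio) & ratio > threshold
    )
}

spikes <- flag_spikes(toy) %>% filter(spike)
print(spikes)
```

Any year this flags would still need checking against the legislative history before reading it as misreported birthdates.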