Option 3
Conduct an age-period-cohort analysis. Make plots of the different dimensions. Develop a parsimonious model of what you think is going on. Explain your results.
Exploratory Data Analysis
library(knitr)
library(tidyverse)
library(skimr) #https://ropensci.org/blog/2017/07/11/skimr/
library(ggplot2)
library(ggthemes)
library(plyr) # attached after tidyverse, so plyr::mutate() and plyr::summarise() mask the dplyr versions
library(jcolors) #https://github.com/jaredhuling/jcolors/
library(Epi)
library(rms)
For this lab I looked at the variable “ABANY”, which codes responses to the question “Please tell me whether or not you think it should be possible for a pregnant woman [sic] to obtain a legal abortion if [t]he woman [sic] wants it for any reason?” where 1 = “yes” and 2 = “no”.
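Because ABANY is coded 1 = yes and 2 = no, its mean is not a proportion directly. A minimal sketch, using a made-up response vector (`abany_toy` is hypothetical, not the GSS data), shows how the mean maps onto the share answering yes:

```r
# Toy responses, coded as in the survey item: 1 = "yes", 2 = "no" (hypothetical values)
abany_toy <- c(1, 2, 2, 1, 2, 1, 1, 2, 2, 2)

# Share of respondents answering "yes"
share_yes <- mean(abany_toy == 1)

# With 1/2 coding, mean(ABANY) equals 2 minus the share answering "yes",
# so a lower mean corresponds to more support
mean_abany <- mean(abany_toy)
all.equal(mean_abany, 2 - share_yes)
```

This is only a reading aid for the tables and coefficients; the analysis itself keeps the original 1/2 coding.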
Here we can see respondents’ ages range from 18 to 89, their birth years range from 1888 to 1998, and the years surveyed range from 1977 to 2016.
ggplot(data = data, aes(COHORT)) + geom_histogram(bins = 20)
As expected, the birth cohorts are roughly normally distributed, with a mean birth year in the 1950s.
ggplot(data = data, aes(AGE)) + geom_histogram(bins = 20)
The age of respondents falls along a right-skewed distribution, with the highest number of respondents in their 20s to 40s.
data_cut <- data %>%
  mutate(year_cut = cut(YEAR, breaks = 10, labels = FALSE, right = FALSE),
         age_cut = cut(AGE, breaks = 10, labels = FALSE, right = FALSE),
         cohort_cut = cut(COHORT, breaks = 10, labels = FALSE, right = FALSE))
tabulate_age_cohort <- stat.table(index = list("AGE" = age_cut, "COHORT" = cohort_cut),
                                  contents = mean(ABANY),
                                  margins = TRUE,
                                  data = data_cut)
print(tabulate_age_cohort, digits = 3)
Breaking the age and cohort variables into ten categories and tabulating ABANY by age and cohort, we can see that support for abortion slowly increases with each successive cohort, and cohorts 6 to 10 increased their support as they aged, with later cohorts increasing their support more quickly year-to-year.
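One thing to keep in mind when reading this table is that many age-by-cohort cells are necessarily empty, because a respondent's birth year plus age must land inside the survey window. A rough sketch, assuming every age from 18 to 89 could be observed in every year from 1977 to 2016 (a simplification; the GSS was not fielded every year), counts which cells are reachable:

```r
# Hypothetical grid of every age/year combination in the survey window
grid <- expand.grid(age = 18:89, year = 1977:2016)
grid$cohort <- grid$year - grid$age

# Same binning as in the analysis: ten equal-width, left-closed bins
grid$age_cut    <- cut(grid$age,    breaks = 10, labels = FALSE, right = FALSE)
grid$cohort_cut <- cut(grid$cohort, breaks = 10, labels = FALSE, right = FALSE)

# Number of reachable age/year combinations in each age-by-cohort cell;
# a zero means that cell can never contain a respondent
cell_counts <- table(age_cut = grid$age_cut, cohort_cut = grid$cohort_cut)
cell_counts
```

Cells with a count of zero correspond to the blank corners of the tabulation, so each cohort is only observed over part of the age range.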
ddply(data_cut, "year_cut", summarise, min = min(YEAR), max = max(YEAR))
ddply(data_cut, "cohort_cut", summarise, min = min(COHORT), max = max(COHORT))
ddply(data_cut, "age_cut", summarise, min = min(AGE), max = max(AGE))
The above tables tell us which cohorts, ages, and years correspond to each category.
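To see how `cut()` assigns the bins, here is a small sketch on the full range of ages reported above (18 to 89) rather than on the survey data. With `breaks = 10` the range is split into ten equal-width intervals, and `right = FALSE` makes each interval closed on the left. The vector `ages` and the data frame `bin_ranges` are stand-ins introduced for illustration.

```r
# Every integer age in the observed range (a stand-in for the AGE column)
ages <- 18:89

# Same call as in the analysis: ten equal-width bins, integer codes, left-closed
age_bin <- cut(ages, breaks = 10, labels = FALSE, right = FALSE)

# Youngest and oldest age falling in each bin
bin_ranges <- data.frame(
  bin = sort(unique(age_bin)),
  min = as.vector(tapply(ages, age_bin, min)),
  max = as.vector(tapply(ages, age_bin, max))
)
print(bin_ranges)
```

On this stand-in vector the bins come out as 18-25, 26-32, 33-39 and so on up to 82-89. In real data the observed minimum and maximum within a bin can be narrower if some ages or years are missing.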
data_cut_age_cohort <- ddply(data_cut, c("age_cut", "cohort_cut"), summarise, ABANY = mean(ABANY))
ggplot(data_cut_age_cohort, aes(x = age_cut, y = ABANY, group = cohort_cut, color = factor(cohort_cut))) +
  geom_point(size = 3) +
  geom_line() +
  scale_x_continuous(breaks = 1:10, labels = c("18-25", "26-32", "33-39", "40-46", "47-53", "54-60", "61-67", "68-74", "75-81", "82-89")) +
  theme_fivethirtyeight() +
  theme(panel.grid.major.x = element_blank(),
        legend.position = "top") +
  scale_color_jacksonlab() +
  labs(title = "Do you support abortion for any reason?",
       subtitle = "By cohort and age, responses 1 = Yes, 2 = No",
       color = "Cohort #1 born 1888-1898\nCohort #10 born 1987-1998",
       caption = "\nSource: ICPSR cumulative General Social Survey, 1977-2016")
Graphing each cohort by age shows a generational split between cohorts 5 and 6. The first five cohorts (born 1888 to 1942) generally decreased their support over time. Cohorts 4 and 5 started in the same range as cohorts 6-8 but ended up in the same range as cohorts 1-3. Cohorts 6-8 show a similar age effect in the same age cuts, but their support never drops below their initial response average in age cut 1. Cohorts 6-10 uniformly increased support as they moved from age cut 1 to age cut 4, which is likely to be an age effect, but each subsequent cohort begins at a higher level of support.
Models
summary(lm(ABANY ~ COHORT + AGE + YEAR, data = data))
Running a naïve model, we see perfect collinearity among the three terms: cohort is highly significant, age is not significant, and the year effect cannot be estimated at all.
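The year coefficient cannot be estimated because survey year equals birth year plus age, so the three terms are linearly dependent. A minimal sketch on simulated data (all variable names and values here are made up, not the GSS data) reproduces the behaviour:

```r
set.seed(42)

# Simulated respondents: age and survey year drawn at random (hypothetical data)
n      <- 500
age    <- sample(18:89, n, replace = TRUE)
year   <- sample(1977:2016, n, replace = TRUE)
cohort <- year - age          # birth year is fully determined by the other two

# An outcome unrelated to any of the three, just to have something to fit
y <- rnorm(n)

fit <- lm(y ~ cohort + age + year)

# lm() drops the last linearly dependent term, so its coefficient is NA
coef(fit)

# alias() reports year as an exact combination of cohort and age
alias(fit)$Complete
```

Which term is dropped depends only on the order in the formula, which is why the binned and indicator models that follow are needed to separate the three effects.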
APC_factor_all <- lm(ABANY ~ factor(age_cut) + factor(year_cut) + factor(cohort_cut), data = data_cut)
summary(APC_factor_all)
Running the model with each cut as a dummy variable, we see that year cuts 4-5 (1989-1996) and 9-10 (2010-2016) are statistically significant, and age cuts 2-3 (26-39) and 9-10 (75-89) are marginally significant.
data_cut <- data_cut %>%
  mutate(elderly = case_when(AGE >= 70 ~ 1,
                             TRUE ~ 0)) %>%
  mutate(young = case_when(AGE <= 40 ~ 1,
                           TRUE ~ 0))
To isolate the end-range age effects, I created indicator variables for being elderly (70 and up) and young (40 and under).
summary(lm(ABANY ~ young + elderly + factor(year_cut) + factor(cohort_cut), data = data_cut))
Adding in these age indicators, only the elderly category is statistically significant. The year-cut effects remain, although with less significance in the most recent year cuts. However, cohorts 6-10 (born 1943 and onward) are all highly statistically significant.
data_cut <- data_cut %>%
  mutate(prewar_cohort = case_when(COHORT <= 1940 ~ 1,
                                   TRUE ~ 0))
To isolate the new cohort effect, I created an indicator variable for one’s cohort being pre-war, that is, born before 1941.
summary(lm(ABANY ~ young + elderly + factor(year_cut) + prewar_cohort, data = data_cut))
Adding in this cohort indicator, we see it is quite statistically significant, and the elderly indicator remains significant. The year cut effects remain the same, which was puzzling to me.
data_cut <- data_cut %>%
  # Survey years 1981 through 1991, written as a range instead of one line per year
  mutate(Reagan_era = case_when(YEAR >= 1981 & YEAR <= 1991 ~ 1,
                                TRUE ~ 0)) %>%
  mutate(Obama_era = case_when(YEAR >= 2009 ~ 1,
                               TRUE ~ 0))
Considering the years these cuts represent (1989-1996 and 2010-2016), I realized that they align roughly with the Reagan Era and the Obama Era, respectively. I then created two indicator variables to isolate these year effects, coding the Reagan Era as survey years 1981 through 1991 and the Obama Era as 2009 onward.
APC_factor_specific <- lm(ABANY ~ young + elderly + Reagan_era + Obama_era + prewar_cohort, data = data_cut)
summary(APC_factor_specific)
In the final model, almost every indicator variable is statistically significant. Being over 69 decreases support by 0.044, being born before 1941 decreases support by 0.099, the Reagan Era decreased support by 0.015, and the Obama Era increased support by 0.016.
# Compare adjusted R^2 (spelled out in full rather than relying on partial matching of $adj)
c(Model1 = summary(APC_factor_all)$adj.r.squared, Model2 = summary(APC_factor_specific)$adj.r.squared)
Comparing the second model, which uses a dummy variable for every cut, with the final model, which uses only the targeted indicator variables, the adjusted R-squared decreases slightly, but the final model makes explicit the age and cohort effects that the second model obscures.
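For reference, adjusted R-squared penalizes plain R-squared for the number of predictors, which makes it the fairer yardstick when one model has many more terms than the other. A short sketch on simulated data (the variables `x1`, `x2`, and `noise_f` and the helper `adj_r2()` are hypothetical) computes it by hand and checks it against `summary()`:

```r
set.seed(1)

# Simulated data: two informative indicators and an uninformative 10-level factor
n       <- 400
x1      <- rbinom(n, 1, 0.3)
x2      <- rbinom(n, 1, 0.5)
noise_f <- factor(sample(1:10, n, replace = TRUE))
y       <- 1.5 + 0.1 * x1 - 0.05 * x2 + rnorm(n, sd = 0.5)

small <- lm(y ~ x1 + x2)             # parsimonious model
big   <- lm(y ~ x1 + x2 + noise_f)   # adds nine extra dummy coefficients

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of coefficients excluding the intercept
adj_r2 <- function(fit) {
  r2    <- summary(fit)$r.squared
  n_obs <- nobs(fit)
  p     <- length(coef(fit)) - 1
  1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
}

# Hand-computed values, followed by the values reported by summary()
c(small = adj_r2(small), big = adj_r2(big))
c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
```

Plain R-squared can never fall when terms are added, so a small drop in the adjusted figure for a much smaller model is a modest price for interpretability.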
Conclusion
The age-cohort visualization of support for abortion shows evidence of both age and cohort effects similar to the significant indicators in the final model. However, it was not possible to see the two distinct year effects of the Reagan-era and Obama-era political shifts. Concerning the model itself, it is possible that shifting from dummy variables to indicator variables in a different order (isolating the year effects first, for instance) might yield different significance for the final indicators, or even suggest different break points for each of the age, cohort, and year effects. If I were to continue investigating this topic, I would try modeling the cohort effects of coming of age under the period effects of the Reagan Era, and possibly even the Obama Era, though properly analyzing the latter would likely need to wait a further decade.
