The New York Times published an article “Do Millenial Men want Stay-at-Home Wives”, with disturbing graph showing that about 50% of millenial men agreed or strongly agreed with “It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.”

It’s not surprising that more men than women agree with the statement: self-interest would explain that. It’s surprising that there’s such a clear worsening trend. So, I wanted to look at the confidence intervals around the curves, because the General Social Survey isn’t that big, especially after it’s broken down into small age groups (the usual definition of `millennials’ would stretch maybe a decade higher than the 18-25 used in the article)

The General Social Survey is a wonderful public resource. You can do simple analyses online, for free, or if you’re a card-carrying survey nerd you can download all the data. That’s what I did.

The data are in Stata format, so we’ll load them into R with the haven package

library(haven)
GSS<-read_dta("~/Downloads/GSS_Stata/GSS7216_R1a.dta")

The New York Times didn’t give the variable name, but that’s why we have Google, and once I found it was called fefam it was easy to get official documentation

First, how many observations of this variable do we have for people 25 and under:

## How much data?
with(subset(GSS,age>=18 & age <=25 & !is.na(fefam)), table(sex,year))
##    year
## sex 1977 1985 1986 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004
##   1  105   92   74   65   64   52   60   46   92  129   85   92   62   52
##   2  117  103  101   73   69   58   58   69  105  170   96  110   61   57
##    year
## sex 2006 2008 2010 2012 2014 2016
##   1   87   65   76   67   66   85
##   2  120   77   91   72   75  113

As I suspected, not a whole lot. We should really use the weights that come with the dataset: the GSS takes an equal-probability sample of households, but non-response and variation in household size both have an effect. Also, the variable wasn’t measured every year; it will be convenient to just drop the years when we don’t have data.

## survey analysis
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
options(survey.lonely.psu="adjust")
GSS<-subset(GSS, year>1974)
gss<-svydesign(id=~vpsu, strata=~interaction(year,vstrat), weights=~wtssall, data=GSS, nest=TRUE)
gssyng<-subset(gss,age>=18 & age <=25 & !is.na(fefam))
gssyng<-subset(gssyng, year %in% c(1977, 1985:1986, 1988:2016))

So, we’ve got a survey object with all the years of this question in people under 25. The variable is on a 4-point scale, so we need to recode it to binary:

gssyng<-update(gssyng, losers=(fefam<=2))

Now the analysis: the proportion with losers==TRUE for each year and sex

means<-svyby(~losers,~year+sex,FUN=svymean,na.rm=TRUE,design=gssyng)
round(means[,1:3],2)
##        year sex losersFALSE
## 1977.1 1977   1        0.45
## 1985.1 1985   1        0.71
## 1986.1 1986   1        0.70
## 1988.1 1988   1        0.82
## 1989.1 1989   1        0.64
## 1990.1 1990   1        0.62
## 1991.1 1991   1        0.67
## 1993.1 1993   1        0.79
## 1994.1 1994   1        0.83
## 1996.1 1996   1        0.72
## 1998.1 1998   1        0.75
## 2000.1 2000   1        0.68
## 2002.1 2002   1        0.74
## 2004.1 2004   1        0.71
## 2006.1 2006   1        0.70
## 2008.1 2008   1        0.62
## 2010.1 2010   1        0.62
## 2012.1 2012   1        0.60
## 2014.1 2014   1        0.52
## 2016.1 2016   1        0.89
## 1977.2 1977   2        0.58
## 1985.2 1985   2        0.73
## 1986.2 1986   2        0.70
## 1988.2 1988   2        0.81
## 1989.2 1989   2        0.75
## 1990.2 1990   2        0.85
## 1991.2 1991   2        0.75
## 1993.2 1993   2        0.87
## 1994.2 1994   2        0.87
## 1996.2 1996   2        0.75
## 1998.2 1998   2        0.83
## 2000.2 2000   2        0.79
## 2002.2 2002   2        0.69
## 2004.2 2004   2        0.71
## 2006.2 2006   2        0.89
## 2008.2 2008   2        0.76
## 2010.2 2010   2        0.75
## 2012.2 2012   2        0.86
## 2014.2 2014   2        0.74
## 2016.2 2016   2        0.74
with(means, 
plot(year[means$sex==1],losersFALSE[means$sex==1],col="darkred",pch=19,type="b",xlab="year",ylab="% disagree",ylim=c(0,1))
)
with(means, 
points(year[means$sex==2],losersFALSE[means$sex==2],col="grey",pch=19,type="b")
)

The graph agrees with the NY Times graph, except it has more data. The last data point looks really strange. Is that what the raw data really say?

with(subset(GSS, year==2016 & age<=25),table(fefam, sex))
##      sex
## fefam  1  2
##     1  1  6
##     2  9 23
##     3 43 50
##     4 32 34

Yes, it is. The percentages for 18–34 and all ages also agree with the online explorer, so it’s not just me.

Now, a graph with uncertainty intervals

props<-svyby(~losers,~year+sex,FUN=svyciprop, method="xlogit",na.rm=TRUE,design=gssyng,vartype="ci")

with(props, 
plot(year[props$sex==1],1-losers[props$sex==1],col="darkred",pch=19,type="b",xlab="year",ylab="% disagree",ylim=c(0,1))
)

with(props, 
points(year[props$sex==2],1-losers[means$sex==2],col="darkgrey",pch=19,type="b")
)

with(props, 
polygon(c(year[props$sex==2],rev(year[props$sex==2])),c( 1-ci_l[props$sex==2], rev(1-ci_u[props$sex==2])),col="#aaaaaa50",border=NA)
)


with(props, 
polygon(c(year[props$sex==1],rev(year[props$sex==1])),c( 1-ci_l[props$sex==1], rev(1-ci_u[props$sex==1])),col="#F0000030",border=NA)
)

So, what should we believe about this trend? Well, the GSS isn’t the only data set showing it, but in the GSS it’s less robust than it looks. Perhaps things are a little more complicated.

Also, it’s impressive what can be done quickly and easily with a commodity laptop and internet connection, free software, and open data.