This assignment analyzes having children and briefly discusses parental age at birth.
Run a multiple variable survival analysis. You can perform the survival analysis using either discrete-time methods (i.e., event history analysis) or Cox proportional hazards methods; either one is fine.
library(knitr)
library(tidyverse)
library(skimr) #https://ropensci.org/blog/2017/07/11/skimr/
library(panelr) #https://panelr.jacob-long.com/index.html
# Codebook: https://www.thearda.com/Archive/Files/Codebooks/GSSPANEL_CB.asp
data <- read_csv("panel-for-R.csv")
data_2 <- panel_data(data, id = idnum, wave = panelwave)
State what your “failure” variable is and how you expect your independent variables to affect it.
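# Keep the analysis variables and build binary indicators
data_3 <- data_2 %>%
  select(childs, age, sex, race, educ, marital, wrkstat, coninc, polviews, attend, babies) %>%
  # child_cat: 0 = no children, 1 = at least one child
  mutate(child_cat = case_when(childs == 0 ~ 0,
                               TRUE ~ 1)) %>%
  # marital_cat: 0 = never married (marital == 5), 1 = married at least once
  mutate(marital_cat = case_when(marital == 5 ~ 0,
                                 TRUE ~ 1)) %>%
  # work_cat: 1 = employed full-time (wrkstat == 1), 0 = otherwise
  mutate(work_cat = case_when(wrkstat == 1 ~ 1,
                              TRUE ~ 0))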
For this analysis my “failure” variable is whether a respondent has any children, using the binary variable “child_cat” created from “childs” (number of children). For independent variables I selected age, sex (1 = male, 2 = female), race (1 = white, 2 = black, 3 = other), “educ” (years of education), “marital_cat” created from “marital” (0 = never married, 1 = married at least once), “work_cat” created from “wrkstat” (1 = employed full-time, 0 = not employed full-time), “coninc” (family income), “polviews” (political leaning, 1 = extremely liberal, 7 = extremely conservative), “attend” (religious attendance, 0 = never, 7 = once a week), and “babies” (number of household members under 6).
I expect being female, having more years of education, having been married at least once, being employed full-time, having a higher income, being politically conservative, reporting high religious attendance, and having young children in the household all to increase the likelihood that a person goes from not having children to having children.
Explain how you determined the “risk window” (due to right truncation and left-censoring) and who is eligible for failure over the time you are studying.
#https://community.rstudio.com/t/filtering-multiple-condition-within-a-column/8549
data_4 <- data_3 %>%
  filter(any(child_cat == 0),
         any(panelwave == 1)) %>%
  filter(any(age <= 49),
         any(panelwave == 1)) %>%
  drop_na()
# Making list of remaining respondents over 49 in wave 1
data_4_check_1 <- data_4 %>%
filter(age >= 50,
panelwave == 1) %>%
select(idnum) %>%
flatten() %>%
unlist()
# Making list of remaining respondents with children in wave 1
data_4_check_2 <- data_4 %>%
filter(child_cat == 1,
panelwave == 1) %>%
select(idnum) %>%
flatten() %>%
unlist()
# Removing selected respondents
data_5 <- data_4 %>%
filter(!as.integer(idnum) %in% data_4_check_1) %>%
filter(!as.integer(idnum) %in% data_4_check_2)
# Making list of respondents who had children in wave 2
data_5_sub <- data_5 %>%
filter(child_cat == 1,
panelwave == 2) %>%
select(idnum) %>%
flatten() %>%
unlist()
# Splitting panel into two groups
# Removing respondents who had children in wave 2
data_6_sub_1 <- data_5 %>%
filter(!as.integer(idnum) %in% data_5_sub)
# Subsetting to respondents who had children in wave 2
data_6_sub_2 <- data_5 %>%
filter(as.integer(idnum) %in% data_5_sub)
# Splitting subset 2 by panel wave
# Removing panel wave 3
data_6_sub_2_waves_1_2 <- data_6_sub_2 %>%
filter(panelwave != 3)
# Subsetting to panel wave 3 and changing "child_cat" values to NA
data_6_sub_2_wave_3 <- data_6_sub_2 %>%
filter(panelwave == 3) %>%
mutate(child_cat = case_when(TRUE ~ NA))
# Merging all panel waves of subset 2
data_6_sub_2_merged <- data_6_sub_2_waves_1_2 %>%
full_join(data_6_sub_2_wave_3) %>%
arrange(idnum, panelwave)
# Merging subsets 1 and 2
data_7 <- data_6_sub_1 %>%
full_join(data_6_sub_2_merged) %>%
arrange(idnum, panelwave)
To avoid left-censoring, I started with only those respondents who reported no children in wave 1. Respondents who reported a child in wave 2 have their wave 3 outcome set to missing, so they leave the risk set after their first birth. To the extent that right truncation might be an issue, I tried to limit the effect of non-random attrition by pre-emptively dropping all observations with missing values.
The issue of an appropriate risk window is complicated due to the different distributions of paternal vs. maternal age at birth. However, since most people have children with someone of a similar age, I decided to limit the risk window to those under 50 at the start of the panel.
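The risk-set logic described above can be sketched on a small person-period table. The values below are hypothetical toy data, not the GSS panel, and this sketch drops post-event rows outright instead of setting child_cat to NA as the code above does. The models should see the same rows either way, since glm() discards observations with a missing outcome by default.

```r
# Toy person-period data (hypothetical values, sorted by id and wave)
toy <- data.frame(
  idnum     = rep(1:4, each = 3),
  panelwave = rep(1:3, times = 4),
  child_cat = c(0, 0, 0,   # never has a child: at risk in every wave
                0, 1, 1,   # first child by wave 2: leaves the risk set afterwards
                0, 0, 1,   # first child by wave 3
                1, 1, 1)   # already a parent in wave 1: never at risk
)

# Number of events recorded before the current wave, within person
toy$prior_events <- ave(toy$child_cat, toy$idnum,
                        FUN = function(x) cumsum(x) - x)

# Keep a person-wave only while the person has not yet had the event
risk_set <- toy[toy$prior_events == 0, ]

# Drop anyone who was already a parent in wave 1
parents_w1 <- toy$idnum[toy$panelwave == 1 & toy$child_cat == 1]
risk_set <- risk_set[!risk_set$idnum %in% parents_w1, ]

risk_set[, c("idnum", "panelwave", "child_cat")]
```

Respondent 2 contributes waves 1 and 2 only, and respondent 4 contributes nothing, which mirrors the eligibility rules stated above.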
Explain whether the results were consistent with your expectations, and do that by interpreting the coefficients from the models, model fit, and so on.
summary(glm(child_cat ~ as.factor(panelwave), data_7, family = "binomial", subset = data_7$panelwave>1))
Call:
glm(formula = child_cat ~ as.factor(panelwave), family = "binomial",
data = data_7, subset = data_7$panelwave > 1)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5270 -0.5270 -0.5215 -0.5215 2.0310
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.92642 0.17597 -10.948 <2e-16 ***
as.factor(panelwave)3 0.02218 0.26832 0.083 0.934
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 388.32 on 506 degrees of freedom
Residual deviance: 388.32 on 505 degrees of freedom
(29 observations deleted due to missingness)
AIC: 392.32
Number of Fisher Scoring iterations: 4
Using only time as an explanatory variable, we see that wave 3 increases the logit of having a child by 0.02, relative to wave 2, but this is not statistically significant.
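As a rough illustration, the time-only coefficients can be converted from logits into per-wave hazards (the probability of a first birth among those still at risk). The numbers are copied from the output above; the variable names are only for this sketch.

```r
# Coefficients copied from the time-only model output above
b0 <- -1.92642   # intercept: logit of the hazard in wave 2
b3 <-  0.02218   # contrast for wave 3 relative to wave 2

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
hazard_wave2 <- plogis(b0)
hazard_wave3 <- plogis(b0 + b3)

round(c(wave2 = hazard_wave2, wave3 = hazard_wave3), 3)
```

Both hazards come out at roughly 13 percent, which is another way of seeing that the wave 3 contrast is negligible.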
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1))
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2374 -0.3606 -0.1960 -0.1137 3.0188
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.632e-01 1.503e+00 -0.508 0.6116
as.factor(panelwave)3 4.445e-01 3.699e-01 1.202 0.2295
age 6.649e-03 2.249e-02 0.296 0.7675
sex -5.270e-01 4.057e-01 -1.299 0.1939
as.factor(race)2 1.138e+00 5.490e-01 2.073 0.0382 *
as.factor(race)3 -5.276e-01 6.930e-01 -0.761 0.4465
educ -1.623e-01 8.509e-02 -1.908 0.0564 .
marital_cat 2.282e+00 4.987e-01 4.576 4.74e-06 ***
work_cat -4.914e-01 4.137e-01 -1.188 0.2350
coninc -5.539e-07 5.264e-06 -0.105 0.9162
polviews -2.506e-01 1.504e-01 -1.666 0.0957 .
attend 7.343e-02 7.676e-02 0.957 0.3388
babies 3.566e+00 4.288e-01 8.315 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 388.32 on 506 degrees of freedom
Residual deviance: 216.43 on 494 degrees of freedom
(29 observations deleted due to missingness)
AIC: 242.43
Number of Fisher Scoring iterations: 6
Adding in the independent variables, we see that, unsurprisingly, being married increases the logit of having a child by 2.28 and already having young children in the household increases the logit by 3.57, net of other factors, both significant at p<0.001. Being Black, compared to being white, also increases the logit of having a child by 1.14, net of other factors, but with lower significance at p<0.05. Lastly, each additional year of education decreases the logit of having a child by 0.16, and each additional point towards conservatism decreases the logit by 0.25, net of other factors, with only marginal significance at p<0.1. Being of some other race decreased the logit by 0.53, but this was not statistically significant. In terms of fit, the AIC of the full model is 242 compared to the time-only model at 392, so the full model fits considerably better even after the penalty for its additional parameters.
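To make the logit coefficients easier to read, they can be exponentiated into odds ratios, and the fit comparison can be checked with a likelihood-ratio test on the printed deviances. This is a minimal sketch using numbers copied from the two outputs above; it assumes both models were fit to the same observations, which the matching null deviance and degrees of freedom suggest.

```r
# Logit coefficients copied from the full-model output above
logit_coefs <- c(race_black  =  1.138,
                 educ        = -0.1623,
                 marital_cat =  2.282,
                 polviews    = -0.2506,
                 babies      =  3.566)

# Odds ratio = exp(logit coefficient)
odds_ratios <- exp(logit_coefs)
round(odds_ratios, 2)

# Likelihood-ratio test of the full model against the time-only model,
# using the residual deviances and degrees of freedom printed above
lr_stat <- 388.32 - 216.43   # drop in residual deviance
lr_df   <- 505 - 494         # number of added parameters
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

For example, the marital_cat odds ratio is a little under 10, meaning the odds of a first birth are roughly ten times higher for ever-married respondents, net of the other covariates.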
The effects of marital status and young children already in the household were as I expected, but the effects of education and conservatism were the opposite of what I hypothesized. The model indicates that Black respondents were more likely to have children than white respondents, even controlling for age, education, and income. Importantly, this model does not control for the region of the country in which a respondent lives, which may be an important factor given the geographical distribution of the African-American population across the US. It is also likely that being Black interacts with these variables differently than being white or another race.
As I expected, higher religious attendance increases the likelihood of having a child, though this coefficient is not statistically significant. On the other hand, being female, having more years of education, being employed full-time, and having a higher income all decrease the likelihood of having a child, though of these only education comes close to statistical significance (p<0.1). Some of the low significance might stem from sex, education, and income having a more complicated relationship to age and to childbearing than a linear specification allows. For example, we might expect people to be most likely to have children in the middle of the education distribution, between 10 and 20 years, but since the model fits a single linear term across all levels of education, this changing relationship is not captured.
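One way such a non-linear education effect could be explored is by adding a squared term. The sketch below uses simulated data with a hypothetical inverted-U relationship; the sample size, peak at 14 years, and curvature are all assumptions chosen for illustration and say nothing about the GSS panel.

```r
set.seed(1)

# Simulated (hypothetical) data: log-odds of a birth peak near 14 years of education
n    <- 500
educ <- sample(8:20, n, replace = TRUE)
eta  <- -1 - 0.05 * (educ - 14)^2
y    <- rbinom(n, 1, plogis(eta))

# Linear-only specification versus one with a quadratic term
fit_lin  <- glm(y ~ educ, family = binomial)
fit_quad <- glm(y ~ educ + I(educ^2), family = binomial)

# A lower AIC for fit_quad would suggest the curvature is worth modelling
AIC(fit_lin, fit_quad)
coef(fit_quad)
```

In the real analysis the same idea would amount to adding I(educ^2) to the model formula, at the cost of one more parameter in an already small sample.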
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1 & data_7$age<35))
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1 & data_7$age < 35)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8563 -0.3120 -0.1666 -0.1040 2.9681
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.120e+00 2.438e+00 -0.459 0.646118
as.factor(panelwave)3 6.677e-01 5.345e-01 1.249 0.211568
age -2.289e-02 8.181e-02 -0.280 0.779667
sex -3.282e-01 5.463e-01 -0.601 0.548043
as.factor(race)2 1.225e+00 8.067e-01 1.519 0.128739
as.factor(race)3 -2.820e-01 8.558e-01 -0.330 0.741768
educ -1.191e-01 1.211e-01 -0.984 0.325212
marital_cat 2.460e+00 6.622e-01 3.715 0.000203 ***
work_cat -6.248e-01 5.746e-01 -1.087 0.276926
coninc -5.789e-06 7.843e-06 -0.738 0.460462
polviews -2.055e-01 2.188e-01 -0.939 0.347556
attend 1.524e-01 1.139e-01 1.338 0.180743
babies 3.162e+00 5.026e-01 6.291 3.15e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 223.82 on 284 degrees of freedom
Residual deviance: 111.90 on 272 degrees of freedom
(15 observations deleted due to missingness)
AIC: 137.9
Number of Fisher Scoring iterations: 7
Running another model with the same parameters but only for respondents under 35, marital status and young children already in the household become the only statistically significant variables. The signs for the other coefficients remain the same except for age, which now decreases the logit of having children.
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1 & data_7$age<36))
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1 & data_7$age < 36)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8963 -0.3039 -0.1666 -0.0948 3.0331
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.921e-01 2.230e+00 -0.445 0.656
as.factor(panelwave)3 7.733e-01 5.161e-01 1.498 0.134
age 9.280e-03 6.730e-02 0.138 0.890
sex -2.539e-01 5.265e-01 -0.482 0.630
as.factor(race)2 1.138e+00 7.798e-01 1.459 0.145
as.factor(race)3 -3.972e-01 8.577e-01 -0.463 0.643
educ -1.727e-01 1.133e-01 -1.524 0.127
marital_cat 2.548e+00 6.491e-01 3.926 8.65e-05 ***
work_cat -6.648e-01 5.585e-01 -1.190 0.234
coninc -5.451e-06 7.726e-06 -0.706 0.480
polviews -3.031e-01 2.064e-01 -1.469 0.142
attend 1.333e-01 1.109e-01 1.203 0.229
babies 3.242e+00 4.940e-01 6.563 5.27e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 242.98 on 299 degrees of freedom
Residual deviance: 120.19 on 287 degrees of freedom
(17 observations deleted due to missingness)
AIC: 146.19
Number of Fisher Scoring iterations: 7
Strangely, modifying the risk window even slightly, here to include respondents aged 35 and under, flips the coefficient for age to positive, at 0.009, though it remains far from statistically significant. To better observe the age-group patterns, I ran the model separately on respondents under 30, those 30-39, and those 40 and up.
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1 & data_7$age<30))
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1 & data_7$age < 30)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3773 -0.3640 -0.1917 -0.1134 2.8611
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.845e+00 3.557e+00 -0.519 0.60394
as.factor(panelwave)3 3.685e-01 6.014e-01 0.613 0.54011
age 9.426e-02 1.432e-01 0.658 0.51040
sex -4.582e-01 6.391e-01 -0.717 0.47345
as.factor(race)2 1.727e+00 8.469e-01 2.039 0.04143 *
as.factor(race)3 1.217e-01 1.094e+00 0.111 0.91139
educ -1.763e-01 1.285e-01 -1.372 0.17017
marital_cat 2.211e+00 7.112e-01 3.109 0.00188 **
work_cat -9.307e-01 6.573e-01 -1.416 0.15674
coninc -8.230e-06 9.366e-06 -0.879 0.37955
polviews -2.666e-01 2.332e-01 -1.143 0.25304
attend 6.943e-02 1.263e-01 0.550 0.58252
babies 2.659e+00 5.472e-01 4.860 1.18e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 161.683 on 198 degrees of freedom
Residual deviance: 88.198 on 186 degrees of freedom
(11 observations deleted due to missingness)
AIC: 114.2
Number of Fisher Scoring iterations: 6
Looking only at respondents under 30, the age coefficient is positive at 0.094, though it is not statistically significant. As in the all-ages model, being Black, being married, and living with young children are all statistically significant with a positive effect on the logit.
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1 & data_7$age>=30 & data_7$age<40))
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1 & data_7$age >= 30 & data_7$age <
40)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.94369 -0.15188 -0.03509 -0.00835 2.93512
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.079e+00 6.140e+00 -0.501 0.6161
as.factor(panelwave)3 8.411e-01 8.352e-01 1.007 0.3139
age 8.781e-02 1.535e-01 0.572 0.5672
sex -1.096e-02 1.067e+00 -0.010 0.9918
as.factor(race)2 1.524e+00 1.736e+00 0.878 0.3801
as.factor(race)3 -2.338e+00 1.609e+00 -1.453 0.1462
educ -4.935e-01 2.458e-01 -2.008 0.0447 *
marital_cat 5.348e+00 2.124e+00 2.518 0.0118 *
work_cat 8.328e-01 1.107e+00 0.752 0.4518
coninc 1.483e-06 1.587e-05 0.093 0.9255
polviews -5.267e-01 4.232e-01 -1.245 0.2133
attend -3.610e-02 2.205e-01 -0.164 0.8700
babies 6.583e+00 1.668e+00 3.947 7.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 136.250 on 152 degrees of freedom
Residual deviance: 41.619 on 140 degrees of freedom
(10 observations deleted due to missingness)
AIC: 67.619
Number of Fisher Scoring iterations: 8
Looking at respondents from ages 30-39, the age coefficient drops only slightly to 0.088, again having no statistical significance. For this age group, race is no longer a statistically significant factor, but education is. Each additional year of education decreases the logit of having a child by 0.49, holding all other factors equal, with p<0.05. Marital status and living with young children remain statistically significant, although the coefficients are 2-3 times larger than with the under-30 group.
summary(glm(child_cat ~ as.factor(panelwave) + age + sex + as.factor(race) + educ + marital_cat + work_cat + coninc + polviews + attend + babies,
data_7, family = "binomial", subset = data_7$panelwave>1 & data_7$age>=40))
glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = child_cat ~ as.factor(panelwave) + age + sex +
as.factor(race) + educ + marital_cat + work_cat + coninc +
polviews + attend + babies, family = "binomial", data = data_7,
subset = data_7$panelwave > 1 & data_7$age >= 40)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7658 -0.4407 -0.2842 -0.1682 2.6446
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.941e+00 6.069e+00 -0.320 0.7491
as.factor(panelwave)3 -3.011e-02 6.968e-01 -0.043 0.9655
age 2.428e-02 9.828e-02 0.247 0.8049
sex -1.899e+00 1.015e+00 -1.872 0.0613 .
as.factor(race)2 1.034e+00 9.944e-01 1.040 0.2982
as.factor(race)3 -1.588e+01 2.207e+03 -0.007 0.9943
educ -1.073e-02 1.827e-01 -0.059 0.9532
marital_cat 1.578e+00 9.434e-01 1.672 0.0945 .
work_cat -6.210e-01 8.253e-01 -0.752 0.4518
coninc 2.830e-06 9.945e-06 0.285 0.7760
polviews -1.246e-01 2.705e-01 -0.461 0.6450
attend 1.110e-01 1.382e-01 0.803 0.4217
babies 1.953e+01 2.954e+03 0.007 0.9947
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.450 on 154 degrees of freedom
Residual deviance: 65.139 on 142 degrees of freedom
(8 observations deleted due to missingness)
AIC: 91.139
Number of Fisher Scoring iterations: 17
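Finally, for respondents 40 and older, age remains positive though much smaller at 0.024, still with no statistical significance. For this age group only sex and marital status were even close to statistical significance, with marriage increasing the logit and being a woman decreasing the logit of having children. The glm.fit warning, together with the enormous standard errors on the other-race and babies coefficients, suggests that these two predictors perfectly or almost perfectly separate the outcome in this small subsample, so those two estimates should not be interpreted.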
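The very large standard errors in the 40-and-older output (for example, 2.954e+03 on babies) are consistent with quasi-complete separation, where one value of a predictor always goes with the same outcome. The toy data below are hypothetical and only illustrate the mechanism: every case with babies = 1 has child = 1, so the estimate drifts toward infinity and its standard error balloons.

```r
# Hypothetical toy data: babies = 1 always coincides with child = 1
babies <- c(0, 0, 0, 0, 0, 0, 1, 1, 1)
child  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1)

# The cross-tab has an empty cell (babies = 1, child = 0)
tab <- table(babies = babies, child = child)
tab

fit <- glm(child ~ babies, family = binomial)

# The slope estimate is very large and its standard error is larger still
babies_est <- coef(summary(fit))["babies", "Estimate"]
babies_se  <- coef(summary(fit))["babies", "Std. Error"]
c(estimate = babies_est, std_error = babies_se)
```

A quick cross-tab of each categorical predictor against child_cat within the subsample is a simple way to spot empty cells like this before fitting.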
In conclusion, the different model results show the importance of choosing appropriate risk windows, questioning the salience of results from aggregated data, and knowing the population being studied. Why age should have decreased the logit of having children only among those under 35, and so differently from those under 36, remains unclear, though neither estimate is statistically distinguishable from zero.
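One way to put the sign change in perspective is to compare approximate 95% Wald intervals for the age coefficient in the under-35 and under-36 models. The estimates and standard errors below are copied from those two outputs; the interval formula (estimate plus or minus about 1.96 standard errors) is the usual normal approximation.

```r
# Age estimates and standard errors copied from the two model outputs above
age_est <- c(under_35 = -0.02289, under_36 = 0.00928)
age_se  <- c(under_35 =  0.08181, under_36 = 0.06730)

z  <- qnorm(0.975)  # about 1.96 for a 95% interval
ci <- cbind(lower = age_est - z * age_se,
            upper = age_est + z * age_se)
round(ci, 3)

# TRUE where the interval contains zero
ci[, "lower"] < 0 & ci[, "upper"] > 0
```

Both intervals contain zero and overlap almost entirely, so the flip in sign may reflect sampling noise in a small subsample more than a real change in the age effect.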