For HW 5 I continued to use the National Longitudinal Survey of Youth (1997 cohort). For this analysis I used just one time period rather than the longitudinal methods previously explored. The year that will be observed will be the year of 2004; the first initial wave of the data. The wave for 2019 was initially analyzed but continued running into errors for this portion of the code
” Plot the hazard function on the probability scale} haz<-1/(1+exp(-coef(fit.0))) time<-seq(1,30,1) “
with the time portion of the code I was running into errors. When I ran the data with the year 2004 it was able to run. Other errors occured thorughout the process of fitting the data; usually missing observations as the errors for different variables. I was able to correct these different errors. The only code I was not able to run was
It continued to give me the error of not being able to find my variable BAYR in the dat$sq line. I’ve checked the chunk and everything looked correct and was able to run.
The variables in the models are earning a Bachelors in 2004 (Bachelors04), if the person is Hispanic (his1), respondents sex (sex1), if Bachelors is attained by the father (DADBA), if a Bachelors is attained by the mother (MOMBA), and the age of the respondent when they earned their Bachelors (BAYR).
When observing fit.0 (General Model) when observing all of the variables in the model, the respondents sex has a marginal effect being that women are still at a slightly higher risk of earning a Bachelors when compared to men when observed with all other variables in the model.
This effect is lost with no other variables being statistically significant (outside of BAYR; time constant variables), for fit.l (Linear Model) and fit.s (Square Model). Fit.c (Cubic Model) brought about a much higher statistical significance for sex showing the same results as fit.0 but also DADBA is statistically significant showing that the respondents risk of earning a Bachelors lessens if their father does not have a Bachelors degree.
When observing fit.q (Quartic Model) the statistical significance with sex was lost now showing only a marginal effect. It shows a reverse trend with men tending to be at slightly higher risk than women when earning a Bachelors. With the Splines method it did not show any statistical significance with the variable BAYR.
When observing the AIC’s of all of the models, the Quartic Model exhibited the lowest AIC (797.02). The key variable of being Hispanic was not found to be statistically significant in any model. The only time it was found to be significant was during a Cox Regression but then was controlled away with sex. For future analysis possibly stratifying the Hispanic variable itself with the current data can be able to observe more specified differences as opposed to comparing Hispanics to Non Hispanic whites; which as been found many times through many models to not be significant. Sex has been consistently found to be significant so that could possibly be a key variable to continue to have especially when stratifying the data looking within the Hispanic variable. Also fathers educational attainment should continued to be observed being that it was also found to be statistically significant in one of the models that was conducted in this assignment.
Loading required package: grid
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
library(ggsurvfit)library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Rows: 8984 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (15): ID, SEX, BDATEM, BDATEY, SAMPLETYPE, ETHNICITY, HDEGREE04, HDEGREE...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat97$Bachelors04 <-Recode(dat97$HDEGREE04, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0
#tabyl(Bachelors04)
dat97$Bachelors10 <-Recode(dat97$HDEGREE2010, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0dat97 %>%tabyl(Bachelors10) ## Bachelors degree or higher in 2010
Bachelors10 n percent
0 213 0.1418109
1 1289 0.8581891
dat97$Bachelors19 <-Recode(dat97$HDEGREE2019, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0dat97 %>%tabyl(Bachelors19) ## Bachelors degree or higher in 2019
Bachelors19 n percent
1 1502 1
dat97$Hispanic<-Recode(dat97$ETHNICITY, recodes ="2 = 0; 4 = 1; else=NA", as.factor=T)## Hispanics are coded as 0 & Non Hipanic whites are coded as 1, all other ethnicities are excludeddat97$his1<-as.factor(ifelse(dat97$Hispanic==1, "Hispanic", "Non Hispanic"))dat97 %>%tabyl(his1) ## Hispanics and Non Hispanic whites coded
his1 n percent valid_percent
Hispanic 1031 0.6864181 0.8321227
Non Hispanic 208 0.1384820 0.1678773
<NA> 263 0.1750999 NA
summary(dat97)
ID HDEGREE04 HDEGREE2010 HDEGREE2019 ETHNICITY
Min. : 9 Min. :0.000 Min. :0.000 Min. :4.000 Min. :1.000
1st Qu.:1894 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:2.000
Median :3644 Median :2.000 Median :4.000 Median :4.000 Median :4.000
Mean :3962 Mean :2.559 Mean :3.939 Mean :4.438 Mean :3.218
3rd Qu.:5873 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
Max. :9016 Max. :5.000 Max. :7.000 Max. :7.000 Max. :4.000
SEX BIOFTHIGD BIOMTHIGD BDATEY VSTRAT
Min. :1.00 Min. : 2.00 Min. : 1.00 Min. :1980 Min. : 1.00
1st Qu.:1.00 1st Qu.:12.00 1st Qu.:12.00 1st Qu.:1981 1st Qu.: 19.00
Median :2.00 Median :14.00 Median :14.00 Median :1982 Median : 39.00
Mean :1.57 Mean :14.24 Mean :14.06 Mean :1982 Mean : 41.83
3rd Qu.:2.00 3rd Qu.:16.00 3rd Qu.:16.00 3rd Qu.:1983 3rd Qu.: 59.00
Max. :2.00 Max. :20.00 Max. :20.00 Max. :1984 Max. :116.00
VPSU samplingweight DATEBA Bachelors04 Bachelors10
Min. :1.000 Min. : 0 Min. :257.0 0:1113 0: 213
1st Qu.:1.000 1st Qu.:180962 1st Qu.:296.0 1: 389 1:1289
Median :1.000 Median :530019 Median :317.0
Mean :1.476 Mean :411945 Mean :326.9
3rd Qu.:2.000 3rd Qu.:595821 3rd Qu.:341.0
Max. :2.000 Max. :936477 Max. :481.0
Bachelors19 Hispanic his1
1:1502 0 : 208 Hispanic :1031
1 :1031 Non Hispanic: 208
NA's: 263 NA's : 263
summary(dat97$DATEBA)
Min. 1st Qu. Median Mean 3rd Qu. Max.
257.0 296.0 317.0 326.9 341.0 481.0
dat97<- dat97 %>%filter(DATEBA>0)dat97$BAYR<-ifelse(dat97$HDEGREE04==2, (2004-dat97$BDATEY),ifelse(dat97$HDEGREE04==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yetsummary(dat97$BAYR)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
20.00 21.00 22.00 22.04 23.42 36.08 95
## For the wave of 2004
dat97$BAYR1<-ifelse(dat97$HDEGREE2010==2, (2010-dat97$BDATEY),ifelse(dat97$HDEGREE2010==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yetsummary(dat97$BAYR1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
21.42 25.00 26.42 26.39 27.50 36.08 289
## For the wave of 2010
dat97$BAYR2<-ifelse(dat97$HDEGREE2019==2, (2019-dat97$BDATEY),ifelse(dat97$HDEGREE2019==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yetsummary(dat97$BAYR2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
21.42 25.33 26.58 27.80 29.00 40.08 489
## For the wave of 2019
dat97$sex1<-Recode(dat97$SEX, recodes ="1 = 0; 2 = 1; else=NA", as.factor=T)## Women are coded as 0 & Men are coded as 1dat97$sex11<-as.factor(ifelse(dat97$sex1==1, "Women", "Men"))
dat97$DADBA <-Recode(dat97$BIOFTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Dads Bachelors degree or higher = 1 & all lesser educations are labled 0## Dads Bachelors degree or higher
dat97$MOMBA <-Recode(dat97$BIOMTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Moms Bachelors degree or higher = 1 & all lesser educations are labled 0## Moms Bachelors degree or higher
#Plot the hazard function on the probability scalehaz<-1/(1+exp(-coef(fit.0)))time<-seq(1,30,1)plot(haz~time, type="l", ylab="h(t)")title(main="Discrete Time Hazard Function for Bachelors Attainment 2004")
St<-NAtime<-1:length(haz)St[1]<-1-haz[1]for(i in2:length(haz)){St[i]<-St[i-1]* (1-haz[i])}St<-c(1, St)time<-c(0, time)plot(y=St,x=time, type="l",main="Survival function for earning a Bachelors 2004")
#Linear term for timefit.0<-svyglm(Bachelors04~his1++sex1+DADBA+MOMBA+1,design=des ,family=binomial(link="cloglog"))
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.0)
Call:
svyglm(formula = Bachelors04 ~ his1 + +sex1 + DADBA + MOMBA +
1, design = des, family = binomial(link = "cloglog"))
Survey design:
survey::svydesign(ids = ~VPSU, strata = ~VSTRAT, data = pp, weight = ~samplingweight,
nest = TRUE)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.79273 0.12608 -14.219 <2e-16 ***
his1Non Hispanic -0.27747 0.20718 -1.339 0.1848
sex11 0.21216 0.11250 1.886 0.0635 .
DADBA1 0.09959 0.11663 0.854 0.3961
MOMBA1 -0.03069 0.11259 -0.273 0.7859
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1.092358)
Number of Fisher Scoring iterations: 5
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
Warning: glm.fit: algorithm did not converge
summary(fit.sp)
Call:
svyglm(formula = Bachelors04 ~ ns(BAYR, df = 3), design = des,
family = binomial(link = "cloglog"))
Survey design:
survey::svydesign(ids = ~VPSU, strata = ~VSTRAT, data = pp, weight = ~samplingweight,
nest = TRUE)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.8762 28.2200 -0.102 0.919
ns(BAYR, df = 3)1 11.3137 13.9656 0.810 0.420
ns(BAYR, df = 3)2 0.5812 55.4097 0.010 0.992
(Dispersion parameter for binomial family taken to be 0.4329859)
Number of Fisher Scoring iterations: 25
#AIC tableaic<-round(c(fit.l$deviance+2*length(fit.l$coefficients),fit.s$deviance+2*length(fit.s$coefficients),fit.c$deviance+2*length(fit.c$coefficients),fit.q$deviance+2*length(fit.q$coefficients),fit.sp$deviance+2*length(fit.sp$coefficients),fit.0$deviance+2*length(fit.0$coefficients)),2)#compare all aics to the one from the general modeldif.aic<-round(aic-aic[6],2)data.frame(model =c( "linear","square", "cubic", "quartic","spline", "general"),aic=aic,aic_dif=dif.aic)
model aic aic_dif
1 linear 970.14 -1119.10
2 square 1099.51 -989.73
3 cubic 17960.80 15871.56
4 quartic 797.02 -1292.22
5 spline 913.84 -1175.40
6 general 2089.24 0.00