For HW4, I continue to observe potential relationships between Hispanics and Non Hispanic whites in terms of following a cohort over time and the risk of obtaining a Bachelors degree.
The key control variable added to this model will be sex but also the effects of parental education of earning a bachelors has as well. The reason for observing sex as a control would be due to differing outcomes that may be explained by sex differences which is to be found in arguably both populations of observance. Then a second level of observations will be added of potential parental effects if a statistical significance is to still be found after the inclusion of the sex variable.
I am observing the possible differences in educational attainment following a span from 2004 to 2019 using the National Longitudinal Survey of Youth observing the 1997 cohort.
I hypothesize that:
H1) Hispanics over time will have lower rates of earning a Bachelors degree more so than Non Hispanic Whites.
H2) Hispanic men will tend to have lower rates of earning a Bachelors degree more so than Hispanic women.
H3) Hispanics overall will tend to still have lower rates of earning a Bachelors degree over time compared to Non Hispanic Whites regardless of whether or not parents earn a Bachelors degree.
When observing the initial Cox regression model it shows that there is a statistically significant relationship between Non Hispanic whites and Hispanics at the .001 level. This shows that Hispanics are 27% more at risk of not receiving a Bachelors degree when compared to Non Hispanic whites. When sex is introduced into the model it negates this statistical relationship; showing no statistical relationship when observing the Hispanic and the sex variable collectively.
The tests that were able to be ran with the current code/data would be the Grambsch and Therneau (1994). From the results of this test it shows that the variables Hispanic and sex are not statistically significant. Observing that they are both implying proportionalityity of effect. Also, the actual Schoenfeld residuals were successfully plotted although the actual summary output model had an error as its output which is included below. Each plotted residual as well as the Chisquare test offered by Gramsch and Therneau; it is to be shown that these variables show proportionality.
When conducting the Schoenfeld test, the fitl2 observing the stratification factors, and the fitl3 observing ANOVA like tests for the models errors occurred in each test. Images of the errors are provided below with reference to the code used. From the errors being presented it seems to possibly be a data issues which may also have effected the initial Cox regression model.
From whats observed from the Cox model, H1 is supported but the overall observation including those for the other hypotheses are nulled due to the loss of statistical significance once the sex variable is added between the interaction of those two variables regarding risk of obtaining a Bachelors degree over time.
Continued evaluation of the data, variables, and transformation of the data into longitudinal form will be evaluated for any potential type one error issues that may exist.
Loading required package: grid
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
library(ggsurvfit)library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Rows: 8984 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (15): ID, SEX, BDATEM, BDATEY, SAMPLETYPE, ETHNICITY, HDEGREE04, HDEGREE...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat97$Bachelors_1 <-Recode(dat97$HDEGREE04, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=F)## Bachelors degree or higher = 1 & all lesser educations are labled 0
dat97$Bachelors_2 <-Recode(dat97$HDEGREE2010, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=F)## Bachelors degree or higher = 1 & all lesser educations are labled 0## Bachelors degree or higher in 2010
dat97$Bachelors_3 <-Recode(dat97$HDEGREE2019, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=F)## Bachelors degree or higher = 1 & all lesser educations are labled 0## Bachelors degree or higher in 2019
dat97$Hispanic<-Recode(dat97$ETHNICITY, recodes ="2 = 0; 4 = 1; else=NA", as.factor=T)## Hispanics are coded as 0 & Non Hipanic whites are coded as 1, all other ethnicities are excludeddat97$his1<-as.factor(ifelse(dat97$Hispanic==1, "Hispanic", "Non Hispanic"))
dat97$sex1<-Recode(dat97$SEX, recodes ="1 = 0; 2 = 1; else=NA", as.factor=T)## Women are coded as 0 & Men are coded as 1dat97$sex11<-as.factor(ifelse(dat97$sex1==1, "Women", "Men"))
dat97$DADBA <-Recode(dat97$BIOFTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Dads Bachelors degree or higher = 1 & all lesser educations are labled 0## Dads Bachelors degree or higher
dat97$MOMBA <-Recode(dat97$BIOMTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Moms Bachelors degree or higher = 1 & all lesser educations are labled 0## Moms Bachelors degree or higher
Censoring
#dat97<- dat97%>%filter(DATEBA>0)dat97$BAYR_1<-ifelse(dat97$HDEGREE04==4, (2004-dat97$BDATEY),ifelse(dat97$HDEGREE04>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2004
Time varying variables
dat97$BAYR_2<-ifelse(dat97$HDEGREE2010==2, (2010-dat97$BDATEY),ifelse(dat97$HDEGREE2010>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2010
dat97$BAYR_3<-ifelse(dat97$HDEGREE2019==2, (2019-dat97$BDATEY),ifelse(dat97$HDEGREE2019>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2019
e.long1 <- dat97 %>%#renamerename(wt = samplingweight,strata= VSTRAT, psu = VPSU)%>%select(ID, psu,wt,strata,his1,sex11,DADBA,MOMBA,DATEBA, #time constant BAYR_1, BAYR_2, BAYR_3, #t-varying variables Bachelors_1, Bachelors_2, Bachelors_3)%>%pivot_longer(cols =c(-ID,-psu, -wt, -strata, -his1, -sex11, -MOMBA, -DADBA, -DATEBA), #error herenames_to =c(".value", "wave"), #make wave variable and put t-v vars into columnsnames_sep ="_")%>%#all t-v variables have _ between name and time, like age_1, age_2group_by(ID)%>%mutate(age_enter = DATEBA, age_exit =lead(DATEBA, 1, order_by=ID))%>%mutate(nexBA = dplyr::lead(Bachelors,n=1, order_by = ID))%>%mutate(BAtran =ifelse(nexBA ==1& Bachelors ==0, 1, 0))%>%filter(is.na(age_exit)==F)%>%ungroup()%>%filter(complete.cases(age_enter, age_exit, BAtran, his1, MOMBA, DADBA, sex11, psu))
summary(e.long1)
ID psu wt strata
Min. : 9 Min. :1.000 Min. : 0 Min. : 1.00
1st Qu.:1811 1st Qu.:1.000 1st Qu.:242343 1st Qu.: 19.00
Median :3232 Median :1.000 Median :544051 Median : 36.00
Mean :3653 Mean :1.467 Mean :449250 Mean : 38.58
3rd Qu.:5271 3rd Qu.:2.000 3rd Qu.:607052 3rd Qu.: 54.00
Max. :8991 Max. :2.000 Max. :936477 Max. :116.00
his1 sex11 DADBA MOMBA DATEBA
Hispanic :2062 Men :1102 0:1390 0:1470 Min. :257.0
Non Hispanic: 416 Women:1376 1:1088 1:1008 1st Qu.:294.0
Median :317.0
Mean :325.1
3rd Qu.:336.0
Max. :481.0
wave BAYR Bachelors age_enter
Length:2478 Min. :21.00 Min. :0.0000 Min. :257.0
Class :character 1st Qu.:24.00 1st Qu.:0.0000 1st Qu.:294.0
Mode :character Median :25.42 Median :1.0000 Median :317.0
Mean :25.49 Mean :0.5682 Mean :325.1
3rd Qu.:27.00 3rd Qu.:1.0000 3rd Qu.:336.0
Max. :36.08 Max. :1.0000 Max. :481.0
NA's :972
age_exit nexBA BAtran
Min. :257.0 Min. :0.0000 Min. :0.000
1st Qu.:294.0 1st Qu.:1.0000 1st Qu.:0.000
Median :317.0 Median :1.0000 Median :0.000
Mean :325.1 Mean :0.9362 Mean :0.368
3rd Qu.:336.0 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :481.0 Max. :1.0000 Max. :1.000