Homework 4 EHA

Author

Brandon Flores

For HW4, I continue to observe potential relationships between Hispanics and Non Hispanic whites in terms of following a cohort over time and the risk of obtaining a Bachelors degree.

The key control variable added to this model will be sex but also the effects of parental education of earning a bachelors has as well. The reason for observing sex as a control would be due to differing outcomes that may be explained by sex differences which is to be found in arguably both populations of observance. Then a second level of observations will be added of potential parental effects if a statistical significance is to still be found after the inclusion of the sex variable.

I am observing the possible differences in educational attainment following a span from 2004 to 2019 using the National Longitudinal Survey of Youth observing the 1997 cohort.

I hypothesize that:

H1) Hispanics over time will have lower rates of earning a Bachelors degree more so than Non Hispanic Whites.

H2) Hispanic men will tend to have lower rates of earning a Bachelors degree more so than Hispanic women.

H3) Hispanics overall will tend to still have lower rates of earning a Bachelors degree over time compared to Non Hispanic Whites regardless of whether or not parents earn a Bachelors degree.

When observing the initial Cox regression model it shows that there is a statistically significant relationship between Non Hispanic whites and Hispanics at the .001 level. This shows that Hispanics are 27% more at risk of not receiving a Bachelors degree when compared to Non Hispanic whites. When sex is introduced into the model it negates this statistical relationship; showing no statistical relationship when observing the Hispanic and the sex variable collectively.

The tests that were able to be ran with the current code/data would be the Grambsch and Therneau (1994). From the results of this test it shows that the variables Hispanic and sex are not statistically significant. Observing that they are both implying proportionalityity of effect. Also, the actual Schoenfeld residuals were successfully plotted although the actual summary output model had an error as its output which is included below. Each plotted residual as well as the Chisquare test offered by Gramsch and Therneau; it is to be shown that these variables show proportionality.

When conducting the Schoenfeld test, the fitl2 observing the stratification factors, and the fitl3 observing ANOVA like tests for the models errors occurred in each test. Images of the errors are provided below with reference to the code used. From the errors being presented it seems to possibly be a data issues which may also have effected the initial Cox regression model.

From whats observed from the Cox model, H1 is supported but the overall observation including those for the other hypotheses are nulled due to the loss of statistical significance once the sex variable is added between the interaction of those two variables regarding risk of obtaining a Bachelors degree over time.

Continued evaluation of the data, variables, and transformation of the data into longitudinal form will be evaluated for any potential type one error issues that may exist.

library(foreign)
library(car)

Loading required package: carData

library(haven)
library(survival)
library(ggplot2)
library(tidyverse)

── Attaching packages
───────────────────────────────────────
tidyverse 1.3.2 ──

✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
✔ purrr   0.3.4      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some()   masks car::some()

library(survey)

Loading required package: grid
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack


Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

library(ggsurvfit)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(muhaz)
library(eha)

options(survey.lonely.psu = "adjust")

dat97<-read_csv("C:\\Users\\BTP\\Desktop\\97cohortNLSYRnum.csv")

Rows: 8984 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (15): ID, SEX, BDATEM, BDATEY, SAMPLETYPE, ETHNICITY, HDEGREE04, HDEGREE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

myvars<-c( "ID","HDEGREE04", "HDEGREE2010", "HDEGREE2019","ETHNICITY",
           "SEX", "BIOFTHIGD", "BIOMTHIGD", "BDATEY","VSTRAT", "VPSU",
           "samplingweight","DATEBA")

dat97<-dat97[,myvars]

dat97<- dat97 %>%
  filter(HDEGREE04 >=0, HDEGREE2010>=0, HDEGREE2019>=0, DATEBA >=0, ETHNICITY>=0, SEX>=0, VSTRAT>=0) #filter missing data codes

Time Constant Variables

dat97$Bachelors_1 <-Recode(dat97$HDEGREE04, recodes = "0:3 = 0; 4:7 = 1; else=NA", as.factor=F)

## Bachelors degree or higher = 1 & all lesser educations are labled 0

dat97$Bachelors_2 <-Recode(dat97$HDEGREE2010, recodes = "0:3 = 0; 4:7 = 1; else=NA", as.factor=F)

## Bachelors degree or higher = 1 & all lesser educations are labled 0

## Bachelors degree or higher in 2010

dat97$Bachelors_3 <-Recode(dat97$HDEGREE2019, recodes = "0:3 = 0; 4:7 = 1; else=NA", as.factor=F)

## Bachelors degree or higher = 1 & all lesser educations are labled 0

## Bachelors degree or higher in 2019

dat97$Hispanic<-Recode(dat97$ETHNICITY, recodes = "2 = 0; 4 = 1; else=NA", as.factor=T)

## Hispanics are coded as 0 & Non Hipanic whites are coded as 1, all other ethnicities are excluded

dat97$his1<-as.factor(ifelse(dat97$Hispanic==1, "Hispanic", "Non Hispanic"))

dat97$sex1<-Recode(dat97$SEX, recodes = "1 = 0; 2 = 1; else=NA", as.factor=T)

## Women are coded as 0 & Men  are coded as 1

dat97$sex11<-as.factor(ifelse(dat97$sex1==1, "Women", "Men"))

dat97$DADBA <-Recode(dat97$BIOFTHIGD, recodes = "0:15 = 0; 16:20 = 1; else=NA", as.factor=T)

## Dads Bachelors degree or higher = 1 & all lesser educations are labled 0

## Dads Bachelors degree or higher

dat97$MOMBA <-Recode(dat97$BIOMTHIGD, recodes = "0:15 = 0; 16:20 = 1; else=NA", as.factor=T)

## Moms Bachelors degree or higher = 1 & all lesser educations are labled 0

## Moms Bachelors degree or higher

Censoring

#dat97<- dat97%>%filter(DATEBA>0)

dat97$BAYR_1<-ifelse(dat97$HDEGREE04==4,
                   (2004-dat97$BDATEY),
                   ifelse(dat97$HDEGREE04>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet

                  ## For the wave of 2004

Time varying variables

dat97$BAYR_2<-ifelse(dat97$HDEGREE2010==2,
                   (2010-dat97$BDATEY),
                   ifelse(dat97$HDEGREE2010>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet

                  ## For the wave of 2010

dat97$BAYR_3<-ifelse(dat97$HDEGREE2019==2,
                   (2019-dat97$BDATEY),
                   ifelse(dat97$HDEGREE2019>=4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet

                  ## For the wave of 2019

Risk Set

dat97<-dat97%>% filter(is.na(Bachelors_1)==F &
                  is.na(Bachelors_2)==F &
                  is.na(Bachelors_3)==F & 
                  is.na(BAYR_1)==F &
                  is.na(BAYR_2)==F &
                  is.na(BAYR_3)==F &
                    Bachelors_1!=1)

Pivot

e.long1 <- dat97 %>%
  #rename
  rename(wt = samplingweight,strata= VSTRAT, psu = VPSU)%>%
  select(ID, psu,wt,strata,his1,sex11,DADBA,MOMBA,DATEBA,   #time constant
        BAYR_1, BAYR_2, BAYR_3, #t-varying variables
         Bachelors_1, Bachelors_2, Bachelors_3)%>%
   pivot_longer(cols = c(-ID,-psu, -wt, -strata, -his1, -sex11, -MOMBA, -DADBA, -DATEBA), #error here
               names_to  = c(".value", "wave"), #make wave variable and put t-v vars into columns
               names_sep = "_")%>% #all t-v variables have _ between name and time, like age_1, age_2
  group_by(ID)%>%
   mutate(age_enter = DATEBA, 
         age_exit = lead(DATEBA, 1, order_by=ID))%>%
  mutate(nexBA = dplyr::lead(Bachelors,n=1, order_by = ID))%>%
  mutate(BAtran = ifelse(nexBA == 1 & Bachelors == 0, 1, 0))%>%
  filter(is.na(age_exit)==F)%>%
  ungroup()%>%
  
    filter(complete.cases(age_enter, age_exit, BAtran,
                        his1, MOMBA, DADBA, sex11, psu))

summary(e.long1)

       ID            psu              wt             strata      
 Min.   :   9   Min.   :1.000   Min.   :     0   Min.   :  1.00  
 1st Qu.:1811   1st Qu.:1.000   1st Qu.:242343   1st Qu.: 19.00  
 Median :3232   Median :1.000   Median :544051   Median : 36.00  
 Mean   :3653   Mean   :1.467   Mean   :449250   Mean   : 38.58  
 3rd Qu.:5271   3rd Qu.:2.000   3rd Qu.:607052   3rd Qu.: 54.00  
 Max.   :8991   Max.   :2.000   Max.   :936477   Max.   :116.00  
                                                                 
           his1        sex11      DADBA    MOMBA        DATEBA     
 Hispanic    :2062   Men  :1102   0:1390   0:1470   Min.   :257.0  
 Non Hispanic: 416   Women:1376   1:1088   1:1008   1st Qu.:294.0  
                                                    Median :317.0  
                                                    Mean   :325.1  
                                                    3rd Qu.:336.0  
                                                    Max.   :481.0  
                                                                   
     wave                BAYR         Bachelors        age_enter    
 Length:2478        Min.   :21.00   Min.   :0.0000   Min.   :257.0  
 Class :character   1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:294.0  
 Mode  :character   Median :25.42   Median :1.0000   Median :317.0  
                    Mean   :25.49   Mean   :0.5682   Mean   :325.1  
                    3rd Qu.:27.00   3rd Qu.:1.0000   3rd Qu.:336.0  
                    Max.   :36.08   Max.   :1.0000   Max.   :481.0  
                    NA's   :972                                     
    age_exit         nexBA            BAtran     
 Min.   :257.0   Min.   :0.0000   Min.   :0.000  
 1st Qu.:294.0   1st Qu.:1.0000   1st Qu.:0.000  
 Median :317.0   Median :1.0000   Median :0.000  
 Mean   :325.1   Mean   :0.9362   Mean   :0.368  
 3rd Qu.:336.0   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :481.0   Max.   :1.0000   Max.   :1.000

options(survey.lonely.psu = "adjust")

des2<-svydesign(ids = ~psu, #error here
                strata = ~strata,
                weights =~wt, 
                data=e.long1,
                nest=T)

Cox Regression

f1 <- survfit(Surv(time = age_enter, event = BAtran)~his1+sex11, data=e.long1)

f1%>%
  ggsurvfit()

#Fit the model
fitl1<-svycoxph(Surv(time = age_enter, event = BAtran)~his1*sex11, design=des2)
summary(fitl1)

Stratified 1 - level Cluster Sampling design (with replacement)
With (172) clusters.
svydesign(ids = ~psu, strata = ~strata, weights = ~wt, data = e.long1, 
    nest = T)

Call:
svycoxph(formula = Surv(time = age_enter, event = BAtran) ~ his1 * 
    sex11, design = des2)

  n= 2016, number of events= 735 

                                coef exp(coef) se(coef) robust se      z
his1Non Hispanic            -0.30570   0.73661  0.16404   0.10759 -2.841
sex11Women                  -0.02651   0.97384  0.07178   0.09074 -0.292
his1Non Hispanic:sex11Women  0.01593   1.01605  0.23589   0.27344  0.058
                            Pr(>|z|)   
his1Non Hispanic             0.00449 **
sex11Women                   0.77019   
his1Non Hispanic:sex11Women  0.95355   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                            exp(coef) exp(-coef) lower .95 upper .95
his1Non Hispanic               0.7366     1.3576    0.5966    0.9095
sex11Women                     0.9738     1.0269    0.8152    1.1634
his1Non Hispanic:sex11Women    1.0161     0.9842    0.5945    1.7365

Concordance= 0.51  (se = 0.011 )
Likelihood ratio test= NA  on 3 df,   p=NA
Wald test            = 9.84  on 3 df,   p=0.02
Score (logrank) test = NA  on 3 df,   p=NA

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

schoenresid<-resid(fitl1, type=“schoenfeld”)

fit.sr<-lm(schoenresid~des2\(variables\)age_enter[des2\(variables\)BAtran==1])

fit.sr%>% broom::tidy()%>% filter(term != “(Intercept)”)%>% select(response, estimate, statistic, p.value)

Grambsch and Therneau (1994) Test

fit.test<-cox.zph(fitl1)
fit.test

              chisq df    p
his1       1.85e-04  1 0.99
sex11      1.42e-05  1 1.00
his1:sex11 3.76e-03  1 0.95
GLOBAL     3.89e-03  3 1.00

plot(fit.test, df=2)

#extract Martingale residuals res.mar<-resid(fitl1, type=“martingale”)

#plot vs maternal education scatter.smooth(des2\(variables\)his1, res.mar,degree = 2, span = 1, ylab=“Martingale Residual”, col=1, cex=.5, lpars=list(col = “red”, lwd = 3)) title(main=“Martingale residuals for Mother’ < High School’s Education”)

fit3<-svycoxph(Surv(time = age_enter, time2=age_exit, event = BAtran)~his1+sex11, design=des2) summary(fit3)

fitl2<-svycoxph(Surv(time = age_enter, time2 = age_exit, event = BAtran)~his1+strata(sex11), design=des2) summary(fitl2)

Error codes for stratification and ANOVA type of test codes