I am using a Proportional Hazards model because for the data that I am evaluating it conceptually makes sense to observe the amount of hazard between Hispanics and Non Hispanic Whites regarding earning a Bachelors Degree. Observing the amount of time, the respondents age of, earning a Bachelors between the two groups.
The Parametric Distribution I am using is the Piecewise Exponential Model because you are able to have set intervals of observations to obtain a more specified outcomes. As opposed to the other models that assume more of a continuation throughout without much variation.
Analysis
For this analysis added predictors will be added to the model along with the variable Hispanics. The parental education for both the mother and father are added to the models. For parental education it is observed whether or not they earned a Bachelors degree. The sex of the respondent is also added to the model.
When observing the AIC, the model that was the best fit for the data was the Piecewise model with the lowest shown AIC. When observing the regression summary of the Cox model it shows that when controlling for parental education of earning a Bachelors and the respondents sex there is no statistically significant relationship between Hispanics and Non Hispanic Whites. A Cox model was also conducted without those other control variables and a statistical significant result was still not found. This is different with typical results showing a statistically significant result between Hispanics and Non Hispanic Whites concerning educational attainment.
Loading required package: grid
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
library(ggsurvfit)library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Rows: 8984 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (15): ID, SEX, BDATEM, BDATEY, SAMPLETYPE, ETHNICITY, HDEGREE04, HDEGREE...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat97$Bachelors_1 <-Recode(dat97$HDEGREE04, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0
dat97$Bachelors_2 <-Recode(dat97$HDEGREE2010, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0## Bachelors degree or higher in 2010
dat97$Bachelors_3 <-Recode(dat97$HDEGREE2019, recodes ="0:3 = 0; 4:7 = 1; else=NA", as.factor=T)## Bachelors degree or higher = 1 & all lesser educations are labled 0## Bachelors degree or higher in 2019
dat97$Hispanic<-Recode(dat97$ETHNICITY, recodes ="2 = 0; 4 = 1; else=NA", as.factor=T)## Hispanics are coded as 0 & Non Hipanic whites are coded as 1, all other ethnicities are excludeddat97$his1<-as.factor(ifelse(dat97$Hispanic==1, "Hispanic", "Non Hispanic"))
dat97$sex1<-Recode(dat97$SEX, recodes ="1 = 0; 2 = 1; else=NA", as.factor=T)## Women are coded as 0 & Men are coded as 1dat97$sex11<-as.factor(ifelse(dat97$sex1==1, "Women", "Men"))
dat97$DADBA <-Recode(dat97$BIOFTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Dads Bachelors degree or higher = 1 & all lesser educations are labled 0## Dads Bachelors degree or higher
dat97$MOMBA <-Recode(dat97$BIOMTHIGD, recodes ="0:15 = 0; 16:20 = 1; else=NA", as.factor=T)## Moms Bachelors degree or higher = 1 & all lesser educations are labled 0## Moms Bachelors degree or higher
Censoring
dat97<- dat97%>%filter(DATEBA>0)dat97$BAYR_1<-ifelse(dat97$HDEGREE04==2, (2004-dat97$BDATEY),ifelse(dat97$HDEGREE04==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2004
Time varying variables
dat97$BAYR_2<-ifelse(dat97$HDEGREE2010==2, (2010-dat97$BDATEY),ifelse(dat97$HDEGREE2010==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2010
dat97$BAYR_3<-ifelse(dat97$HDEGREE2019==2, (2019-dat97$BDATEY),ifelse(dat97$HDEGREE2019==4,dat97$DATEBA/12,NA)) ## For Censored because they dont have a bachelors degree yet## For the wave of 2019
e.long1 <- dat97 %>%#renamerename(wt = samplingweight,strata= VSTRAT, psu = VPSU)%>%select(ID,his1,sex11,DADBA,MOMBA,DATEBA, #time constant BAYR_1, BAYR_2, BAYR_3, #t-varying variables Bachelors_1, Bachelors_2, Bachelors_3)%>%pivot_longer(cols =c(-ID, -his1, -sex11, -MOMBA, -DADBA, -DATEBA), #time constant variables go herenames_to =c(".value", "wave"), #make wave variable and put t-v vars into columnsnames_sep ="_") #all t-v variables have _ between name and time, like age_1, age_2
Transition variable
e.long1 <- e.long1%>%group_by(ID)%>%mutate(nexBA = dplyr::lead(Bachelors,n=1, order_by = ID))%>%mutate(BAtran =ifelse(nexBA ==1& Bachelors ==0, 1, 0))%>%ungroup()#find which people failed in the first time period and remove them from the second risk period risk setfailed1<-which(is.na(e.long1$BAtran)==T)#e.long1<-e.long1[-failed1,] # this makes you have 0 casesprint(e.long1, n =100)
# A tibble: 2,406 × 11
ID his1 sex11 DADBA MOMBA DATEBA wave BAYR Bache…¹ nexBA BAtran
<dbl> <fct> <fct> <fct> <fct> <dbl> <chr> <dbl> <fct> <fct> <dbl>
1 11 Non Hispanic Women 0 0 432 1 22 0 0 0
2 11 Non Hispanic Women 0 0 432 2 28 0 1 1
3 11 Non Hispanic Women 0 0 432 3 36 1 <NA> 0
4 28 <NA> Women 0 0 317 1 21 0 1 1
5 28 <NA> Women 0 0 317 2 26.4 1 1 0
6 28 <NA> Women 0 0 317 3 26.4 1 <NA> 0
7 34 Hispanic Women 0 1 329 1 20 0 1 1
8 34 Hispanic Women 0 1 329 2 27.4 1 1 0
9 34 Hispanic Women 0 1 329 3 27.4 1 <NA> 0
10 35 Hispanic Men <NA> <NA> 321 1 21 0 1 1
11 35 Hispanic Men <NA> <NA> 321 2 26.8 1 1 0
12 35 Hispanic Men <NA> <NA> 321 3 26.8 1 <NA> 0
13 41 Hispanic Women 0 0 317 1 21 0 1 1
14 41 Hispanic Women 0 0 317 2 26.4 1 1 0
15 41 Hispanic Women 0 0 317 3 26.4 1 <NA> 0
16 48 <NA> Women <NA> 0 325 1 21 0 1 1
17 48 <NA> Women <NA> 0 325 2 27.1 1 1 0
18 48 <NA> Women <NA> 0 325 3 27.1 1 <NA> 0
19 55 Non Hispanic Women 0 0 336 1 23 0 1 1
20 55 Non Hispanic Women 0 0 336 2 28 1 1 0
21 55 Non Hispanic Women 0 0 336 3 28 1 <NA> 0
22 63 <NA> Women 0 0 380 1 20 0 0 0
23 63 <NA> Women 0 0 380 2 26 0 1 1
24 63 <NA> Women 0 0 380 3 31.7 1 <NA> 0
25 70 <NA> Men 0 1 324 1 22 0 1 1
26 70 <NA> Men 0 1 324 2 27 1 1 0
27 70 <NA> Men 0 1 324 3 27 1 <NA> 0
28 91 Non Hispanic Women 0 1 349 1 22 0 1 1
29 91 Non Hispanic Women 0 1 349 2 29.1 1 1 0
30 91 Non Hispanic Women 0 1 349 3 29.1 1 <NA> 0
31 108 Hispanic Men 1 1 317 1 20 0 1 1
32 108 Hispanic Men 1 1 317 2 26.4 1 1 0
33 108 Hispanic Men 1 1 317 3 26.4 1 <NA> 0
34 114 Hispanic Women 0 0 368 1 22 0 1 1
35 114 Hispanic Women 0 0 368 2 30.7 1 1 0
36 114 Hispanic Women 0 0 368 3 30.7 1 <NA> 0
37 128 Hispanic Women 1 1 300 1 22 0 1 1
38 128 Hispanic Women 1 1 300 2 25 1 1 0
39 128 Hispanic Women 1 1 300 3 25 1 <NA> 0
40 157 Non Hispanic Women 0 0 317 1 22 0 1 1
41 157 Non Hispanic Women 0 0 317 2 26.4 1 1 0
42 157 Non Hispanic Women 0 0 317 3 26.4 1 <NA> 0
43 170 Hispanic Women 1 0 401 1 20 0 0 0
44 170 Hispanic Women 1 0 401 2 26 0 1 1
45 170 Hispanic Women 1 0 401 3 33.4 1 <NA> 0
46 182 Hispanic Women 1 1 316 1 20 0 1 1
47 182 Hispanic Women 1 1 316 2 26.3 1 1 0
48 182 Hispanic Women 1 1 316 3 26.3 1 <NA> 0
49 183 Hispanic Women 1 1 324 1 24 0 1 1
50 183 Hispanic Women 1 1 324 2 27 1 1 0
51 183 Hispanic Women 1 1 324 3 27 1 <NA> 0
52 187 Non Hispanic Men 0 0 317 1 21 0 1 1
53 187 Non Hispanic Men 0 0 317 2 26.4 1 1 0
54 187 Non Hispanic Men 0 0 317 3 26.4 1 <NA> 0
55 199 Hispanic Men 1 1 346 1 22 0 1 1
56 199 Hispanic Men 1 1 346 2 28.8 1 1 0
57 199 Hispanic Men 1 1 346 3 28.8 1 <NA> 0
58 203 Hispanic Women 0 0 324 1 20 0 1 1
59 203 Hispanic Women 0 0 324 2 27 1 1 0
60 203 Hispanic Women 0 0 324 3 27 1 <NA> 0
61 215 Non Hispanic Men 0 0 456 1 23 0 0 0
62 215 Non Hispanic Men 0 0 456 2 29 0 1 1
63 215 Non Hispanic Men 0 0 456 3 38 1 <NA> 0
64 229 Hispanic Women <NA> 1 317 1 20 0 1 1
65 229 Hispanic Women <NA> 1 317 2 26.4 1 1 0
66 229 Hispanic Women <NA> 1 317 3 26.4 1 <NA> 0
67 230 Non Hispanic Men 0 0 329 1 20 0 1 1
68 230 Non Hispanic Men 0 0 329 2 27.4 1 1 0
69 230 Non Hispanic Men 0 0 329 3 27.4 1 <NA> 0
70 237 Hispanic Men 0 0 329 1 21 0 1 1
71 237 Hispanic Men 0 0 329 2 27.4 1 1 0
72 237 Hispanic Men 0 0 329 3 27.4 1 <NA> 0
73 238 Hispanic Women <NA> <NA> 313 1 23 0 1 1
74 238 Hispanic Women <NA> <NA> 313 2 26.1 1 1 0
75 238 Hispanic Women <NA> <NA> 313 3 26.1 1 <NA> 0
76 240 Hispanic Women 1 1 305 1 21 0 1 1
77 240 Hispanic Women 1 1 305 2 25.4 1 1 0
78 240 Hispanic Women 1 1 305 3 25.4 1 <NA> 0
79 248 Hispanic Women 0 0 323 1 21 0 1 1
80 248 Hispanic Women 0 0 323 2 26.9 1 1 0
81 248 Hispanic Women 0 0 323 3 26.9 1 <NA> 0
82 251 Non Hispanic Men 0 0 305 1 22 0 1 1
83 251 Non Hispanic Men 0 0 305 2 25.4 1 1 0
84 251 Non Hispanic Men 0 0 305 3 25.4 1 <NA> 0
85 286 Hispanic Women 0 0 325 1 20 0 1 1
86 286 Hispanic Women 0 0 325 2 27.1 1 1 0
87 286 Hispanic Women 0 0 325 3 27.1 1 <NA> 0
88 293 Non Hispanic Women 0 0 341 1 21 0 1 1
89 293 Non Hispanic Women 0 0 341 2 28.4 1 1 0
90 293 Non Hispanic Women 0 0 341 3 28.4 1 <NA> 0
91 296 Non Hispanic Women 0 0 317 1 23 0 1 1
92 296 Non Hispanic Women 0 0 317 2 26.4 1 1 0
93 296 Non Hispanic Women 0 0 317 3 26.4 1 <NA> 0
94 300 Hispanic Women 0 0 301 1 20 0 1 1
95 300 Hispanic Women 0 0 301 2 25.1 1 1 0
96 300 Hispanic Women 0 0 301 3 25.1 1 <NA> 0
97 311 Hispanic Men <NA> <NA> 324 1 22 0 1 1
98 311 Hispanic Men <NA> <NA> 324 2 27 1 1 0
99 311 Hispanic Men <NA> <NA> 324 3 27 1 <NA> 0
100 329 Hispanic Men 0 0 317 1 20 0 1 1
# … with 2,306 more rows, and abbreviated variable name ¹Bachelors
Covariate Mean Coef Rel.Risk S.E. LR p
his1 0.418
Hispanic 0.783 0 1 (reference)
Non Hispanic 0.217 -0.086 0.918 0.107
DADBA 0.759
0 0.609 0 1 (reference)
1 0.391 0.030 1.030 0.097
MOMBA 0.948
0 0.617 0 1 (reference)
1 0.383 0.006 1.006 0.097
sex11 0.775
Men 0.477 0 1 (reference)
Women 0.523 0.025 1.025 0.086
Events 557
Total time at risk 42851
Max. log. likelihood -3930.1
LR test statistic 0.99
Degrees of freedom 4
Overall p-value 0.911008
summary(fitle2)
Covariate Mean Coef Rel.Risk S.E. LR p
his1 0.268
Hispanic 0.765 0 1 (reference)
Non Hispanic 0.235 -0.103 0.902 0.094
Events 646
Total time at risk 49868
Max. log. likelihood -4643.8
LR test statistic 1.22
Degrees of freedom 1
Overall p-value 0.2684
AIC(fitle)
[1] 7868.295
check.dist(fitle, fitl2, main ="Weibull")
check.dist(fitle, fitl3, main ="Piecewise Exponential")