Dem 7223 Homework #2

Due November 12th, 2019 30 points

A) Choose a data set and an event, with an associated duration; define as many predictors as you need to do your analysis.

The data set I’ve chosen is the PSID from the University of Michigan with 5 waves (waves are two years apart) from 2005 through 2013. The event I’ve focused on is the transition from good to poor health, with the main discriminant being the process of moving and whether or not the move was voluntary or if it was involuntary (i.e. being displaced).

The main predictor is:
move.y: a categorical variable that can take one of three outcomes: 1) not move 2) not displaced 3) displaced
Demographic characteristics:
race: categorical: nhw, nhb, nha, nhother and hispanic
age: age squared for curvilinear effect off age
sex: dichotomous 1 is female and 0 is male
educ: dichotomous either less than high school (lths) or high school + (hs+)

B) Define the key variables: episodes, events, duration, censoring, predictors

The predictors were defined above in A).
The episode is whether or not someone moved in the two year period between the waves, and it has one of three possible outcomes: 1) not move 2) move but not displaced 3) move and displaced. The event is transitiong from good to poor health. The duration is two years per wave for five waves from 2005 through 2013. the censored events are those who never make the transition from good to poor health.
## C) Carry out the following analysis
### a. Fit the Cox model the data.
### i. Include all main effects in the model

#I did not include the 300 lines of code for the cleaning and conditioning of the data.  Datcomb is the main data that contains all of five waves
datcomb <- datcomb %>% 
    mutate(
      race = fct_relevel(race, 'nhw','nhb','hisp','asn','nh-other'),
    move.y = as.character(ifelse(move == 'not move', 'not move',
                    ifelse(move.why == 'not displaced', 'not displaced', 'displaced'))), 
    educ = as.character(ifelse(edu <=11, 'lths',
                           ifelse(edu>=12, 'hs+',NA))),
    time_start = ifelse(year==2005,0,
                        ifelse(year==2007,1,
                               ifelse(year==2009, 2,
                                      ifelse(year==2011, 3,4)))),
    time = time_start +1,
    healthd = ifelse(health == 'good', 0,1),
  )%>% 
    arrange(id)
#str(datcomb)
datcomb <- datcomb %>% 
  mutate(
    educ  = fct_relevel(educ, 'hs+','lths'),
    move.y = fct_relevel(move.y, 'not move','not displaced','displaced'))
datcomb <- datcomb[complete.cases(datcomb),]
 #table(is.na(datcomb) )
 #str(datcomb)
 options(survey.lonely.psu = "adjust")
des = svydesign(ids = ~1,
                strata = datcomb$strat,
                weights = datcomb$wt,
                data = datcomb)
#length(datcomb$strat)
#length(datcomb$wt)
#must get rid of zero weights
datcomb$wt=datcomb$wt+0.000001
datcomb$age2 <- datcomb$age^2

#I first used sex and age, but had to remove them for the discrete tie=me models as I could not get them to run with thos variables in there
fit1 <- coxreg(Surv(time = time_start, time2 = time, event = healtran)~ scale(age) + scale(age2) + move.y + educ + race + sex, data = datcomb, weights = wt)
summary(fit1)
## Call:
## coxreg(formula = Surv(time = time_start, time2 = time, event = healtran) ~ 
##     scale(age) + scale(age2) + move.y + educ + race + sex, data = datcomb, 
##     weights = wt)
## 
## Covariate           Mean       Coef     Rel.Risk   S.E.    Wald p
## scale(age)         -0.000    -0.675     0.509     0.015     0.000 
## scale(age2)         0.000     0.804     2.235     0.014     0.000 
## move.y 
##         not move    0.677     0         1 (reference)
##    not displaced    0.276     0.013     1.013     0.010     0.181 
##        displaced    0.047     0.565     1.759     0.014     0.000 
## educ 
##              hs+    0.854     0         1 (reference)
##             lths    0.146     0.726     2.067     0.010     0.000 
## race 
##              nhw    0.497     0         1 (reference)
##              nhb    0.273     0.514     1.672     0.011     0.000 
##             hisp    0.084     0.564     1.758     0.012     0.000 
##              asn    0.106     0.044     1.045     0.020     0.029 
##         nh-other    0.039     0.145     1.156     0.023     0.000 
## sex                 0.222     0.299     1.349     0.009     0.000 
## 
## Events                    3993 
## Total time at risk         60255 
## Max. log. likelihood      -868877 
## LR test statistic         24327.32 
## Degrees of freedom        10 
## Overall p-value           0
#table(datcomb$move.y)

For the cox model we can see that all predictors are significant with the exception of moving and not being displaced. From this model it appears that all predictors are harmful, to one degree or another, to our reference group: nhw, male, high school grad. Also of not, the category of moving but not being displaced was not significant. As we shall see, this model is not necessarily the best fit for this data. ### ii. Test for an interaction between at least two of the predictors

fit1.int <- coxreg(Surv(time = time_start, time2 = time, event = healtran)~ scale(age) + scale(age2) + move.y + educ + race + sex + move.y*race + move.y*educ, data = datcomb, weights = wt)
summary(fit1.int)
## Call:
## coxreg(formula = Surv(time = time_start, time2 = time, event = healtran) ~ 
##     scale(age) + scale(age2) + move.y + educ + race + sex + move.y * 
##         race + move.y * educ, data = datcomb, weights = wt)
## 
## Covariate           Mean       Coef     Rel.Risk   S.E.    Wald p
## scale(age)         -0.000    -0.677     0.508     0.015     0.000 
## scale(age2)         0.000     0.806     2.240     0.014     0.000 
## move.y 
##         not move    0.677     0         1 (reference)
##    not displaced    0.276    -0.108     0.897     0.015     0.000 
##        displaced    0.047     0.753     2.124     0.020     0.000 
## educ 
##              hs+    0.854     0         1 (reference)
##             lths    0.146     0.661     1.936     0.012     0.000 
## race 
##              nhw    0.497     0         1 (reference)
##              nhb    0.273     0.600     1.822     0.014     0.000 
##             hisp    0.084     0.657     1.928     0.013     0.000 
##              asn    0.106    -0.090     0.914     0.022     0.000 
##         nh-other    0.039     0.041     1.042     0.026     0.114 
## sex                 0.222     0.293     1.340     0.009     0.000 
## move.y:race      
##    not displaced:nhb         -0.013     0.987     0.026     0.604 
##    displaced:nhb             -0.822     0.439     0.041     0.000 
##    not displaced:hisp        -0.181     0.835     0.029     0.000 
##    displaced:hisp            -0.667     0.513     0.045     0.000 
##    not displaced:asn          0.426     1.532     0.030     0.000 
##    displaced:asn              0.520     1.683     0.042     0.000 
##    not displaced:nh-other     0.534     1.706     0.049     0.000 
##    displaced:nh-other         0.314     1.369     0.064     0.000 
## move.y:educ      
##    not displaced:lths         0.320     1.377     0.023     0.000 
##    displaced:lths            -0.015     0.985     0.032     0.648 
## 
## Events                    3993 
## Total time at risk         60255 
## Max. log. likelihood      -868151 
## LR test statistic         25779.72 
## Degrees of freedom        20 
## Overall p-value           0

Having added the interactions between moving and both race/ethnicity and education we now see that the process of moving but not being displaced becomes protective, while being displaced is still harmful, where one is more than 2.2 times more likely to trasnition from good to poor health.
In terms of significance, one’s race is only significant, and harmful, for non-Hispanic black and Hispanic individuals while sex shows that females are more likely to trastion from good to poor health.
The interactions showed several that were statistically significant. I will get into a detailed discusssion below.

Likelihood ratio test

lrt1 <- lrtest(fit1,fit1.int)
lrt1
## Likelihood ratio test
## 
## Model 1: Surv(time = time_start, time2 = time, event = healtran) ~ scale(age) + 
##     scale(age2) + move.y + educ + race + sex
## Model 2: Surv(time = time_start, time2 = time, event = healtran) ~ scale(age) + 
##     scale(age2) + move.y + educ + race + sex + move.y * race + 
##     move.y * educ
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  10 -868877                         
## 2  20 -868151 10 1452.4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we can see the second model, the one with the intercations is a significantly better fitting model

D) Assess proportionality of hazards for each covariate

#schoenfeld residuals
des2 <- svydesign(ids = ~psu, strata = ~strat, weights = datcomb$wt, data = datcomb, nest = T)
#fit2.base <- svycoxph(Surv(time = time, event = healtran)~ move.y, data = datcomb, design = des)
#fit2.race <- svycoxph(Surv(time = time, event = healtran)~ move.y + race, data = datcomb, design = des)
fit2 <- svycoxph(Surv(time = time, event = healtran)~ move.y + race + sex +educ, data = datcomb, design = des2)

#summary(fit2.base)
#summary(fit2.race)
summary(fit2)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (8833) clusters.
## svydesign(ids = ~psu, strata = ~strat, weights = datcomb$wt, 
##     data = datcomb, nest = T)
## Call:
## svycoxph(formula = Surv(time = time, event = healtran) ~ move.y + 
##     race + sex + educ, design = des2, data = datcomb)
## 
##   n= 60255, number of events= 3993 
## 
##                         coef exp(coef) se(coef)       z Pr(>|z|)    
## move.ynot displaced -0.02405   0.97624  0.06919  -0.348    0.728    
## move.ydisplaced      0.58855   1.80137  0.09825   5.990 2.09e-09 ***
## racenhb              0.13523   1.14480  0.09400   1.439    0.150    
## racehisp            -0.16369   0.84901  0.11084  -1.477    0.140    
## raceasn             -1.56089   0.20995  0.08812 -17.713  < 2e-16 ***
## racenh-other        -1.31145   0.26943  0.13371  -9.808  < 2e-16 ***
## sex                  0.42115   1.52371  0.06063   6.947 3.74e-12 ***
## educlths             0.89654   2.45111  0.06098  14.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## move.ynot displaced    0.9762     1.0243    0.8524    1.1180
## move.ydisplaced        1.8014     0.5551    1.4858    2.1839
## racenhb                1.1448     0.8735    0.9522    1.3764
## racehisp               0.8490     1.1778    0.6832    1.0550
## raceasn                0.2099     4.7630    0.1766    0.2495
## racenh-other           0.2694     3.7116    0.2073    0.3502
## sex                    1.5237     0.6563    1.3530    1.7160
## educlths               2.4511     0.4080    2.1750    2.7623
## 
## Concordance= 0.706  (se = 0.006 )
## Likelihood ratio test= NA  on 8 df,   p=NA
## Wald test            = 952.7  on 8 df,   p=<2e-16
## Score (logrank) test = NA  on 8 df,   p=NA
schoenresid<-resid(fit2, type="schoenfeld")
summary(schoenresid)
##  move.ynot displaced move.ydisplaced       racenhb       
##  Min.   :-0.20707    Min.   :-0.09296   Min.   :-0.1657  
##  1st Qu.:-0.19827    1st Qu.:-0.09296   1st Qu.:-0.1607  
##  Median :-0.19485    Median :-0.08951   Median :-0.1527  
##  Mean   : 0.06708    Mean   : 0.01343   Mean   : 0.2048  
##  3rd Qu.: 0.79293    3rd Qu.:-0.06341   3rd Qu.: 0.8343  
##  Max.   : 0.82439    Max.   : 0.93659   Max.   : 0.8473  
##     racehisp           raceasn          racenh-other      
##  Min.   :-0.27433   Min.   :-0.32835   Min.   :-0.129875  
##  1st Qu.:-0.14850   1st Qu.:-0.08490   1st Qu.:-0.035144  
##  Median :-0.12697   Median :-0.04778   Median :-0.020601  
##  Mean   :-0.03155   Mean   :-0.00289   Mean   :-0.001424  
##  3rd Qu.:-0.11746   3rd Qu.:-0.03304   3rd Qu.:-0.014826  
##  Max.   : 0.88254   Max.   : 0.96696   Max.   : 0.985174  
##       sex               educlths       
##  Min.   :-0.316763   Min.   :-0.27112  
##  1st Qu.:-0.316084   1st Qu.:-0.25238  
##  Median :-0.315303   Median :-0.23172  
##  Mean   : 0.006677   Mean   : 0.04114  
##  3rd Qu.: 0.683237   3rd Qu.: 0.72888  
##  Max.   : 0.687223   Max.   : 0.77429
#des2$variables
#length(des2$variables$time[des2$variables$healtran==1])
#length(des2$variables$healtran==1)
#length(schoenresid)
#table(datcomb$healtran)
fit.sr<-lm(schoenresid~des2$variables$time[des2$variables$healtran==1])
summary(fit.sr)
## Response move.ynot displaced :
## 
## Call:
## lm(formula = `move.ynot displaced` ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2865 -0.2702 -0.2551  0.7135  0.7676 
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                       0.041685   0.022799
## des2$variables$time[des2$variables$healtran == 1] 0.007551   0.006455
##                                                   t value Pr(>|t|)  
## (Intercept)                                         1.828   0.0676 .
## des2$variables$time[des2$variables$healtran == 1]   1.170   0.2421  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4404 on 3991 degrees of freedom
## Multiple R-squared:  0.0003428,  Adjusted R-squared:  9.23e-05 
## F-statistic: 1.368 on 1 and 3991 DF,  p-value: 0.2421
## 
## 
## Response move.ydisplaced :
## 
## Call:
## lm(formula = move.ydisplaced ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10801 -0.10503 -0.10158 -0.07846  0.92452 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                       0.0100907  0.0153133
## des2$variables$time[des2$variables$healtran == 1] 0.0009919  0.0043356
##                                                   t value Pr(>|t|)
## (Intercept)                                         0.659    0.510
## des2$variables$time[des2$variables$healtran == 1]   0.229    0.819
## 
## Residual standard error: 0.2958 on 3991 degrees of freedom
## Multiple R-squared:  1.311e-05,  Adjusted R-squared:  -0.0002374 
## F-statistic: 0.05234 on 1 and 3991 DF,  p-value: 0.8191
## 
## 
## Response racenhb :
## 
## Call:
## lm(formula = racenhb ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.391 -0.371 -0.348  0.624  0.667 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                        0.255350   0.024856
## des2$variables$time[des2$variables$healtran == 1] -0.015021   0.007037
##                                                   t value Pr(>|t|)    
## (Intercept)                                        10.273   <2e-16 ***
## des2$variables$time[des2$variables$healtran == 1]  -2.134   0.0329 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4802 on 3991 degrees of freedom
## Multiple R-squared:  0.00114,    Adjusted R-squared:  0.0008899 
## F-statistic: 4.556 on 1 and 3991 DF,  p-value: 0.03287
## 
## 
## Response racehisp :
## 
## Call:
## lm(formula = racehisp ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25004 -0.11888 -0.09735 -0.08671  0.92279 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                       -0.013660   0.017416
## des2$variables$time[des2$variables$healtran == 1] -0.005319   0.004931
##                                                   t value Pr(>|t|)
## (Intercept)                                        -0.784    0.433
## des2$variables$time[des2$variables$healtran == 1]  -1.079    0.281
## 
## Residual standard error: 0.3364 on 3991 degrees of freedom
## Multiple R-squared:  0.0002914,  Adjusted R-squared:  4.092e-05 
## F-statistic: 1.163 on 1 and 3991 DF,  p-value: 0.2808
## 
## 
## Response raceasn :
## 
## Call:
## lm(formula = raceasn ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32702 -0.08128 -0.04415 -0.03057  0.97058 
## 
## Coefficients:
##                                                     Estimate Std. Error
## (Intercept)                                        0.0009784  0.0130322
## des2$variables$time[des2$variables$healtran == 1] -0.0011502  0.0036898
##                                                   t value Pr(>|t|)
## (Intercept)                                         0.075    0.940
## des2$variables$time[des2$variables$healtran == 1]  -0.312    0.755
## 
## Residual standard error: 0.2517 on 3991 degrees of freedom
## Multiple R-squared:  2.435e-05,  Adjusted R-squared:  -0.0002262 
## F-statistic: 0.09718 on 1 and 3991 DF,  p-value: 0.7553
## 
## 
## Response racenh-other :
## 
## Call:
## lm(formula = `racenh-other` ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13639 -0.03681 -0.02227 -0.01257  0.99321 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                       -0.017734   0.009904
## des2$variables$time[des2$variables$healtran == 1]  0.004850   0.002804
##                                                   t value Pr(>|t|)  
## (Intercept)                                        -1.791   0.0734 .
## des2$variables$time[des2$variables$healtran == 1]   1.730   0.0838 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1913 on 3991 degrees of freedom
## Multiple R-squared:  0.0007489,  Adjusted R-squared:  0.0004985 
## F-statistic: 2.991 on 1 and 3991 DF,  p-value: 0.0838
## 
## 
## Response sex :
## 
## Call:
## lm(formula = sex ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3285 -0.3239 -0.3193  0.6753  0.6847 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                       -0.003614   0.024194
## des2$variables$time[des2$variables$healtran == 1]  0.003060   0.006850
##                                                   t value Pr(>|t|)
## (Intercept)                                        -0.149    0.881
## des2$variables$time[des2$variables$healtran == 1]   0.447    0.655
## 
## Residual standard error: 0.4674 on 3991 degrees of freedom
## Multiple R-squared:  5e-05,  Adjusted R-squared:  -0.0002006 
## F-statistic: 0.1996 on 1 and 3991 DF,  p-value: 0.6551
## 
## 
## Response educlths :
## 
## Call:
## lm(formula = educlths ~ des2$variables$time[des2$variables$healtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3149 -0.2942 -0.2697  0.6889  0.7363 
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                        0.047543   0.023396
## des2$variables$time[des2$variables$healtran == 1] -0.001903   0.006624
##                                                   t value Pr(>|t|)  
## (Intercept)                                         2.032   0.0422 *
## des2$variables$time[des2$variables$healtran == 1]  -0.287   0.7739  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.452 on 3991 degrees of freedom
## Multiple R-squared:  2.068e-05,  Adjusted R-squared:  -0.0002299 
## F-statistic: 0.08252 on 1 and 3991 DF,  p-value: 0.7739
fit.test<-cox.zph(fit2)
fit.test
##                         rho   chisq         p
## move.ynot displaced  0.0375   21.33  3.87e-06
## move.ydisplaced      0.1079  167.05  3.26e-38
## racenhb             -0.1919 1023.88 1.16e-224
## racehisp            -0.1154  237.93  1.11e-53
## raceasn              0.0696   37.25  1.04e-09
## racenh-other         0.0205    4.31  3.80e-02
## sex                  0.0384   18.39  1.80e-05
## educlths            -0.0295   10.56  1.16e-03
## GLOBAL                   NA 1730.88  0.00e+00
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))

As discuused earlier, The schoenfeld residuals are all significant, and as discussed earlier, this is probably not a good model for this data set.

#extract Martingale residuals
fit2 <- svycoxph(Surv(time = time, event = healtran)~ move.y +race + educ + sex + age , data = datcomb, design = des2)
#fit2.move2 <- svycoxph(Surv(time = time, event = healtran)~ strata(move.y), data = datcomb, design = des2)


#fit2.race <- svycoxph(Surv(time = time, event = healtran)~ race, data = datcomb, design = des2)
#fit2.educ <- svycoxph(Surv(time = time, event = healtran)~ race, data = datcomb, design = des2)

fit2.full <- svycoxph(Surv(time = time, event = healtran)~ race + educ + sex + age 
                      + strata(move.y), data = datcomb, design = des2)
#fit2.quad <- svycoxph(Surv(time = time, event = healtran)~ race + educ + sex + age 
 #                     + strata(move.y) + move.y^2, data = datcomb, design = des2)


#res.mar.move2<-resid(fit2.move2, type="martingale")
#res.mar.race<-resid(fit2.race, type="martingale")
res.mar.full <- resid(fit2.full, type="martingale")
res.mar <- resid(fit2, type="martingale")
#res.mar.quad <- resid(fit2.quad, type="martingale")
#length(res.mar.race2)
#length(des$variables$race)


#scatter.smooth(des$variables$move.y, res.mar.move2,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 3))
#scatter.smooth(des$variables$race, res.mar.race,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 3))
#scatter.smooth(des$variables$move.y, res.mar.move,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 3))
#full model
scatter.smooth(des$variables$move.y, res.mar,degree = 2,span = 5, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 5))

#strata move.y
#scatter.smooth(des$variables$move.y, res.mar.full,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 3))
#summary(fit2.full)

#res.mar<-resid(fit2, type="martingale")

#plot vs maternal education
#scatter.smooth(des$variables$move.y, res.mar.quad,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.5, lpars=list(col = "red", lwd = 3))
#title(main="Martingale residuals for Mother' < High School's Education")




regTermTest(fit2, ~move.y, method="LRT")
## Working (Rao-Scott+F) LRT for move.y
##  in svycoxph(formula = Surv(time = time, event = healtran) ~ move.y + 
##     race + educ + sex + age, design = des2, data = datcomb)
## Working 2logLR =  40.57694 p= 1.725e-09 
## (scale factors:  1 0.98 );  denominator df= 8762
#regTermTest(fit2.full, ~move.y, method="LRT")

a. What method did you use?

I tried both the schoenfeld residuals and the martingale residuals. The schoenfeld residuals were significant for each of the predictors and give evidence that this model is probably not appropriate for this data set.

b. If you find non-proportional hazards, use an appropriate method to fit an

extended Cox model that addresses the non-proportionality

attempted and failed. This data is really not appropriate for this method

place for some descriptive statistics

#descriptive statistics
#datcomb <- datcomb %>% 
#  mutate(
#    age_enter = age,
#    age_exit = age+2
#  )
#datcomb %>% 
#  group_by(time) %>% 
#  summarise(prop_event = mean(healtran))

#datcomb %>% 
#  group_by(time, move.y) %>% 
#  summarise(prop_event = mean(healtran))
desc.full <- aggregate(healtran~race*educ*move.y, datcomb, mean, na.rm=T)
desc.race.move <- aggregate(healtran~race*move.y, datcomb, mean, na.rm=T)
desc.edu.move <- aggregate(healtran~educ*move.y, datcomb, mean, na.rm=T)
aggregate(healtran~move.y*time, datcomb, mean, na.rm=T)
      move.y time   healtran

1 not move 1 0.00000000 2 not displaced 1 0.00000000 3 displaced 1 0.00000000 4 not move 2 0.08351384 5 not displaced 2 0.08423142 6 displaced 2 0.14444444 7 not move 3 0.08928977 8 not displaced 3 0.07559332 9 displaced 3 0.13310962 10 not move 4 0.07196743 11 not displaced 4 0.07872596 12 displaced 4 0.14285714 13 not move 5 0.06504841 14 not displaced 5 0.08327325 15 displaced 5 0.17233560

stargazer(desc.full, type = "html",style = "demography", title = "Proportion by race, education and move status", summary = F)
Proportion by race, education and move status
race educ move.y healtran
1 nhw hs+ not move 0.042
2 nhb hs+ not move 0.071
3 hisp hs+ not move 0.086
4 asn hs+ not move 0.045
5 nh-other hs+ not move 0.054
6 nhw lths not move 0.095
7 nhb lths not move 0.141
8 hisp lths not move 0.138
9 asn lths not move 0.113
10 nh-other lths not move 0.145
11 nhw hs+ not displaced 0.035
12 nhb hs+ not displaced 0.068
13 hisp hs+ not displaced 0.067
14 asn hs+ not displaced 0.071
15 nh-other hs+ not displaced 0.084
16 nhw lths not displaced 0.109
17 nhb lths not displaced 0.124
18 hisp lths not displaced 0.132
19 asn lths not displaced 0.196
20 nh-other lths not displaced 0.093
21 nhw hs+ displaced 0.091
22 nhb hs+ displaced 0.142
23 hisp hs+ displaced 0.126
24 asn hs+ displaced 0.142
25 nh-other hs+ displaced 0.098
26 nhw lths displaced 0.240
27 nhb lths displaced 0.194
28 hisp lths displaced 0.231
29 asn lths displaced 0.333
30 nh-other lths displaced 0.312
stargazer(desc.race.move, type = "html",style = "demography", title = "Proportion by race, education and move status", summary = F)
Proportion by race, education and move status
race move.y healtran
1 nhw not move 0.047
2 nhb not move 0.086
3 hisp not move 0.103
4 asn not move 0.051
5 nh-other not move 0.066
6 nhw not displaced 0.042
7 nhb not displaced 0.080
8 hisp not displaced 0.082
9 asn not displaced 0.081
10 nh-other not displaced 0.085
11 nhw displaced 0.108
12 nhb displaced 0.156
13 hisp displaced 0.157
14 asn displaced 0.165
15 nh-other displaced 0.133
stargazer(desc.edu.move, type = "html",style = "demography", title = "Proportion by race, education and move status", summary = F)
Proportion by race, education and move status
educ move.y healtran
1 hs+ not move 0.053
2 lths not move 0.123
3 hs+ not displaced 0.053
4 lths not displaced 0.125
5 hs+ displaced 0.116
6 lths displaced 0.223

E) Fit the discrete time hazard model to your outcome

#discrete time model

fit3.dt<-svyglm(healtran~as.factor(time_start)+move.y+race-1+ educ-1, design=des2, family=binomial(link="cloglog"))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
#summary(fit3.dt)
#stargazer(fit3.dt, type = "html",style = "demography", title = "Discrete time model with interactions for move status and education & move status and race", summary = F)
fit3.dt.int <-svyglm(healtran~as.factor(time_start)+move.y+race-1+ educ-1 
                               +move.y*educ  + move.y*race, 
                               design=des2, family=binomial(link="cloglog"))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
#summary(fit3.dt.int)
stargazer(fit3.dt, fit3.dt.int, type = "html",style = "demography", title = "Discrete time model with interactions for move status and education & move status and race", summary = F, align = T)
Discrete time model with interactions for move status and education & move status and race
healtran
Model 1 Model 2
as.factor(time_start)0 -20.060*** -20.047***
(0.122) (0.124)
as.factor(time_start)1 -2.797*** -2.781***
(0.077) (0.078)
as.factor(time_start)2 -2.755*** -2.743***
(0.067) (0.070)
as.factor(time_start)3 -2.914*** -2.900***
(0.060) (0.062)
as.factor(time_start)4 -3.008*** -2.996***
(0.119) (0.121)
move.ynot displaced -0.088 -0.244*
(0.068) (0.098)
move.ydisplaced 0.592*** 0.781***
(0.101) (0.134)
racenhb 0.491*** 0.571***
(0.076) (0.093)
racehisp 0.448*** 0.527***
(0.088) (0.110)
raceasn 0.036 -0.093
(0.142) (0.161)
racenh-other 0.137 0.026
(0.161) (0.186)
educlths 0.837*** 0.756***
(0.060) (0.068)
move.ynot displaced:educlths 0.343*
(0.134)
move.ydisplaced:educlths 0.166
(0.207)
move.ynot displaced:racenhb 0.047
(0.164)
move.ydisplaced:racenhb -0.952**
(0.324)
move.ynot displaced:racehisp -0.086
(0.198)
move.ydisplaced:racehisp -0.721*
(0.290)
move.ynot displaced:raceasn 0.424*
(0.200)
move.ydisplaced:raceasn 0.382
(0.284)
move.ynot displaced:racenh-other 0.530
(0.354)
move.ydisplaced:racenh-other 0.324
(0.405)
N 60,255 60,255
Log Likelihood -11,655.700 -11,619.250
AIC 23,335.390 23,282.500
p < .05; p < .01; p < .001
sums<-data.frame(round(summary(fit3.dt)$coef, 3)); sums$HR<-round(exp(sums[,1]), 3)
sums
                   Estimate Std..Error  t.value Pr...t..    HR

as.factor(time_start)0 -20.060 0.122 -164.109 0.000 0.000 as.factor(time_start)1 -2.797 0.077 -36.213 0.000 0.061 as.factor(time_start)2 -2.755 0.067 -41.098 0.000 0.064 as.factor(time_start)3 -2.914 0.060 -48.713 0.000 0.054 as.factor(time_start)4 -3.008 0.119 -25.229 0.000 0.049 move.ynot displaced -0.088 0.068 -1.303 0.193 0.916 move.ydisplaced 0.592 0.101 5.870 0.000 1.808 racenhb 0.491 0.076 6.427 0.000 1.634 racehisp 0.448 0.088 5.063 0.000 1.565 raceasn 0.036 0.142 0.253 0.801 1.037 racenh-other 0.137 0.161 0.851 0.395 1.147 educlths 0.837 0.060 13.920 0.000 2.309

dat3.all<-expand.grid(time_start=as.factor(c(0,1,2,3,4)),healtran=c(0,1), race=levels(des2$variables$race), move.y=levels(des2$variables$move.y), educ = levels(des2$variables$educ))
head(dat3.all)

time_start healtran race move.y educ 1 0 0 nhw not move hs+ 2 1 0 nhw not move hs+ 3 2 0 nhw not move hs+ 4 3 0 nhw not move hs+ 5 4 0 nhw not move hs+ 6 0 1 nhw not move hs+

#Sys.setenv(R_MAX_VSIZE = 16e9)
dat3.all$fitted<-as.numeric(predict(fit3.dt.int, dat3.all, type="response"))
head(dat3.all)

time_start healtran race move.y educ fitted 1 0 0 nhw not move hs+ 1.967409e-09 2 1 0 nhw not move hs+ 6.008519e-02 3 2 0 nhw not move hs+ 6.231881e-02 4 3 0 nhw not move hs+ 5.352630e-02 5 4 0 nhw not move hs+ 4.877333e-02 6 0 1 nhw not move hs+ 1.967409e-09

#aggregate(healtran~race*educ*move.y, datcomb, mean, na.rm=T)
#aggregate(healtran~race*move.y, datcomb, mean, na.rm=T)
#aggregate(healtran~educ*move.y, datcomb, mean, na.rm=T)
#aggregate(healtran~move.y*time, datcomb, mean, na.rm=T)
#generating hazard plots
dat3.all%>%
  mutate(
    period = ifelse(time_start==1, "Period 1", 
                    ifelse(time_start==2,"Period 2",
                    ifelse(time_start==3, "Period 3","Period 4")))
    )%>%
  ggplot()+geom_bar(aes(y=fitted,x=race,fill=race, group=race),stat="identity", position="dodge")+facet_grid(~move.y+educ)+ggtitle(label="Hazard of Transitioning into Poor Health by Race and Displaced status by wave")+
  theme(axis.text.x = element_text(angle = 90))

dat3.all%>%
  mutate(
    period = ifelse(time_start==1, "Period 1", 
                    ifelse(time_start==2,"Period 2",
                    ifelse(time_start==3, "Period 3","Period 4")))
    )%>%
  ggplot()+geom_bar(aes(y=fitted,x=educ,fill=educ, group=educ),stat="identity", position="dodge")+facet_wrap(~move.y+period, nrow=2)+ggtitle(label="Hazard of Transitioning into Poor Health by Education and Displaced status by wave")+
  theme(axis.text.x = element_text(angle = 90))

#dat3.all%>%
  #mutate(
  #  period = ifelse(time_start==1, "Period 1", 
  #                  ifelse(time_start==2,"Period 2",
   #                 ifelse(time_start==3, "Period 3","Period 4")))
  #  )%>%
  #ggplot()+geom_bar(aes(y=fitted,x=sex,fill=sex, group=sex),stat="identity", position="dodge")+facet_grid(~move.y+period)+ggtitle(label="Hazard of Transitioning into Poor Health by Sex and Displaced status by wave")+
  #theme(axis.text.x = element_text(angle = 90))

i. You must form a person-period data set

ii. Consider both the general model and other time specifications

iii. Include all main effects in the model

iv. Test for an interaction between at least two of the predictors

v. Generate hazard plots for interesting cases highlighting the significant predictors in your analysis

F) Provide the following as a single word document:

a. Code and output

see above
### b. Technical write up
I’m making comments below the output shown above to provide my technical write up so as not to duplicate output more than necessary.

c. Descriptive write up consisting of

###i. Tables of descriptive statistics
###ii. Results of statistical tests
###iii. Interpretation of the results

The purpose of this analysis is to determine differrnces of transitioning from good to poor health amongst three groups: 1)those who have not moved 2) those who have moved voluntarirly and 3) those who moved involuntarily (displaced). The data was obtained from the five PSID waves from 2005 through 2013. The PSID examines nationally representative individuals and families throughout the United States.

The outcome being examined is the transition form good health to poor health. The predictors examined in the analysis are race/ethnicity, education, sex and move status. The analysis examines the transition from good to poor health from 5 waves.

First, examining the proportions of the population by education and move status it can be seen that individuals with less than a high school education are more at risk moving voluntarily and for being displaced:

Proportion by education and move status
educ move.y healtran
1 hs+ not move 0.053
2 lths not move 0.123
3 hs+ not displaced 0.053
4 lths not displaced 0.125
5 hs+ displaced 0.116
6 lths displaced 0.223

And next by race and move status, it can be seen that the differences by race are fairly minor, with non-Hispanic black, hispanic and asian being the most at risk:

Proportion by race, education and move status
race move.y healtran
1 nhw not move 0.047
2 nhb not move 0.086
3 hisp not move 0.103
4 asn not move 0.051
5 nh-other not move 0.066
6 nhw not displaced 0.042
7 nhb not displaced 0.080
8 hisp not displaced 0.082
9 asn not displaced 0.081
10 nh-other not displaced 0.085
11 nhw displaced 0.108
12 nhb displaced 0.156
13 hisp displaced 0.157
14 asn displaced 0.165
15 nh-other displaced 0.133

And finally by race, education and move status, ti can be seen that the combination of less than a high school education and a distinct race (such as asian or non-Hispanic other) are most at risk of being displaced.:

Proportion by race, education and move status
race educ move.y healtran
1 nhw hs+ not move 0.042
2 nhb hs+ not move 0.071
3 hisp hs+ not move 0.086
4 asn hs+ not move 0.045
5 nh-other hs+ not move 0.054
6 nhw lths not move 0.095
7 nhb lths not move 0.141
8 hisp lths not move 0.138
9 asn lths not move 0.113
10 nh-other lths not move 0.145
11 nhw hs+ not displaced 0.035
12 nhb hs+ not displaced 0.068
13 hisp hs+ not displaced 0.067
14 asn hs+ not displaced 0.071
15 nh-other hs+ not displaced 0.084
16 nhw lths not displaced 0.109
17 nhb lths not displaced 0.124
18 hisp lths not displaced 0.132
19 asn lths not displaced 0.196
20 nh-other lths not displaced 0.093
21 nhw hs+ displaced 0.091
22 nhb hs+ displaced 0.142
23 hisp hs+ displaced 0.126
24 asn hs+ displaced 0.142
25 nh-other hs+ displaced 0.098
26 nhw lths displaced 0.240
27 nhb lths displaced 0.194
28 hisp lths displaced 0.231
29 asn lths displaced 0.333
30 nh-other lths displaced 0.312

A Discrete Time model was first with the variables previously discussed (Model 1) then interatcions between both race and move status and education and ove status were included (Model 2):

Discrete time model with interactions for move status and education & move status and race
healtran
Model 1 Model 2
as.factor(time_start)0 -20.060*** -20.047***
(0.122) (0.124)
as.factor(time_start)1 -2.797*** -2.781***
(0.077) (0.078)
as.factor(time_start)2 -2.755*** -2.743***
(0.067) (0.070)
as.factor(time_start)3 -2.914*** -2.900***
(0.060) (0.062)
as.factor(time_start)4 -3.008*** -2.996***
(0.119) (0.121)
move.ynot displaced -0.088 -0.244*
(0.068) (0.098)
move.ydisplaced 0.592*** 0.781***
(0.101) (0.134)
racenhb 0.491*** 0.571***
(0.076) (0.093)
racehisp 0.448*** 0.527***
(0.088) (0.110)
raceasn 0.036 -0.093
(0.142) (0.161)
racenh-other 0.137 0.026
(0.161) (0.186)
educlths 0.837*** 0.756***
(0.060) (0.068)
move.ynot displaced:educlths 0.343*
(0.134)
move.ydisplaced:educlths 0.166
(0.207)
move.ynot displaced:racenhb 0.047
(0.164)
move.ydisplaced:racenhb -0.952**
(0.324)
move.ynot displaced:racehisp -0.086
(0.198)
move.ydisplaced:racehisp -0.721*
(0.290)
move.ynot displaced:raceasn 0.424*
(0.200)
move.ydisplaced:raceasn 0.382
(0.284)
move.ynot displaced:racenh-other 0.530
(0.354)
move.ydisplaced:racenh-other 0.324
(0.405)
N 60,255 60,255
Log Likelihood -11,655.700 -11,619.250
AIC 23,335.390 23,282.500
p < .05; p < .01; p < .001

From these two models, model two (the model with the interactions included) appears to fit the data better as it has a lower AIC. The only difference in terms of statistical significance with the covariates they share are the protective factor of moving but not being displaced over those who do not move. Being displaced means one has approximately %78 greater chance of transition from good health to poor health.
Examining are other covariates and it can be seen that being non-Hispanic black, Hispanic and havng less than a high school education all place an individual at greater risk of transitioning from good to poor health.
The most interesting interaction is that of non-Hispanic black with being displaced in that it is protective. One explanation for this may be that an already disadvantaged group may be more resililient to a negative life circumstance such as being displaced better than an advantaged group like non-Hispanic whites. likewise, being hispanic and being displaced also is protective versus non-hispanic white. The significant effect of the interactions is moving but not being displaced and being asian. These individuals are 36% more likely to transition from good to poor health.This is particularly strange, as this effect on the population at large is protective, so maybe non-Hispanic asians are much more bothered by moving that non-Hispanic whites. Next we shall examine a plot to see if these phenomenon can be observed:

Both of these phenomenon are readily apparent on the above plot. Both the Hispanic and non-Hispanic black fair far better than the non-Hispanic white population, while the asian population fare far worse.

Ultimately, there is sufficient evidence to support the claim that being displaced from one’s home increases the risk of transitioning from good to poor health, while there is some evidence to support the claim that moving voluntarily is protective of transitioning from good to poor health.