hw2.utf8

DEM 7223 Homework 2 DEM 7223 - Event History Analysis - Parametric Hazard Models Ricardo Lowe

library(car)

## Loading required package: carData

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(muhaz)
library(eha)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ stringr 1.4.0
## ✓ tidyr   1.1.1     ✓ forcats 0.5.0
## ✓ readr   1.3.1

## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()
## x tidyr::unpack() masks Matrix::unpack()

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")

## Loading required package: ipumsr

usa_00117 <- read_ipums_ddi("usa_00117.xml")
usa_00117_ <- read_ipums_micro(usa_00117)

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

usa_00118 <- read_ipums_ddi("usa_00118.xml")
usa_00118_ <- read_ipums_micro(usa_00118)

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

usa_00119 <- read_ipums_ddi("usa_00119.xml")
usa_00119_ <- read_ipums_micro(usa_00119)

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

usa_00120 <- read_ipums_ddi("usa_00120.xml")
usa_00120_ <- read_ipums_micro(usa_00120)

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

Carry out the following analysis: Define your outcome as in HW 1. Also consider what covariates are hypothesized to affect the outcome variable. Define these and construct a parametric model for your outcome.

The outcome for this analysis is the duration between the respondent’s year of immigration and the year they were actually naturalized. I censor records whose naturalization occurred during the same year of their immigration. Thus, if a respondent did not become naturalized until after their first year of immigration, I compute the duration between the survey date (2018) and their respective year of naturalization. The alternative is the difference between the year of naturalization and their year of immigration.

library(dplyr)
ipums_view(usa_00117_) 

n2 <- usa_00117_%>%
  filter(YRNATUR < 9999 & YRIMMIG < 9999)


n2$YRIMMIG2 <- as.numeric(ifelse(n2$YRNATUR-n2$YRIMMIG <= 0,"",n2$YRIMMIG))


n2$secbi<-ifelse(is.na(n2$YRIMMIG2)==T,
                   ((n2$YEAR))-((n2$YRNATUR)), 
                   (n2$YRNATUR-n2$YRIMMIG2))

n2$b2event<-ifelse(is.na(n2$YRIMMIG2)==T,0,1) 
fit<-survfit(Surv(secbi, b2event)~1, n2)
fit

## Call: survfit(formula = Surv(secbi, b2event) ~ 1, data = n2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##   88485   79491      11      11      11

79,491 (unweighted) cases experienced the event. That is, there are nearly 10,000 instances where naturalization did not occur within the time window I alotted.

plot(fit, conf.int=T, ylab="S(t)", xlab="Years")
title(main="Survival Function for Naturalization Interval, ACS 2018 Data")

fit.haz.km<-kphaz.fit(n2$secbi[n2$secbi>0],
                      n2$b2event[n2$secbi>0] , 
                      method = "product-limit")
fit.haz.sm<-muhaz(n2$secbi[n2$secbi>0], n2$b2event[n2$secbi>0] )

#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km, main="plot of the hazard of obtaining naturalization")
#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)
legend("topright", legend = c("KM hazard", "Smoothed Hazard"),
       col=c(1,2), lty=c(1,1))

n2$BPLD2 <- ifelse(n2$BPLD>= 25000 & n2$BPLD<= 26070, 0,
            ifelse(n2$BPLD>= 60000 & n2$BPLD<= 60099, 1,NA))

                                    

n2$SEX <- ifelse(n2$SEX>= 2, 1,
            ifelse(n2$SEX<= 1,0,NA))

Fit the parametric model of your choosing to the data.

Did you choose an AFT or PH model and why?

I initially opted for an AFT because I am particularly interested in interpreting the actual duration of the time it takes to be naturalized as oppose to the “hazard” of being naturalized in itself. However, I conduct both models to examine their relative outputs.

Justify what parametric distribution you choose

I have not seen much research pertaining to time-varying naturalization processes. I would presume that naturalization will be concentrated in the early years following immigration, and will wane over time. I don’t anticipate much fluctuation.This presumption, which is sort of plotted above, might empathize with the Weibull distribution. I will run and conduct different diagnostics to examine this further.

Carry out model fit diagnostics for the model:

Exponential Model

#exponential distribution for hazard, here we hard code it to be
#a weibull dist with shape ==1 
fit.1<-phreg(Surv(secbi, b2event)~SEX+AGE^2+BPLD2,
             data=n2[n2$secbi>0,], dist="weibull", shape = 1)
summary(fit.1)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE^2 + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## SEX                 0.551    -0.003     0.997     0.008     0.667 
## AGE                53.255    -0.012     0.988     0.000     0.000 
## BPLD2               0.311     0.226     1.254     0.008     0.000 
## 
## log(scale)                    2.177               0.014     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -251191 
## LR test statistic         4241.91 
## Degrees of freedom        3 
## Overall p-value           0

plot(fit.1)

AFT model specification

fit.1.aft<-survreg(Surv(secbi, b2event)~SEX+AGE+BIRTHYR+BPLD2,
                   data=n2[n2$secbi>0,],dist = "exponential")

summary(fit.1.aft)

## 
## Call:
## survreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BIRTHYR + 
##     BPLD2, data = n2[n2$secbi > 0, ], dist = "exponential")
##                Value Std. Error      z      p
## (Intercept) 10.84664    5.46237   1.99 0.0471
## SEX          0.00341    0.00770   0.44 0.6576
## AGE          0.00728    0.00271   2.68 0.0073
## BIRTHYR     -0.00430    0.00271  -1.59 0.1125
## BPLD2       -0.22548    0.00818 -27.55 <2e-16
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -251190   Loglik(intercept only)= -253312.2
##  Chisq= 4244.43 on 4 degrees of freedom, p= 0 
## Number of Newton-Raphson Iterations: 4 
## n=76135 (12340 observations deleted due to missingness)

Weibull Model

#weibull distribution for hazard
  fit.2<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
             data=n2[n2$secbi>0,], dist="weibull")

summary(fit.2)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "weibull")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## SEX                 0.551    -0.002     0.998     0.008     0.793 
## AGE                53.255    -0.020     0.980     0.000     0.000 
## BPLD2               0.311     0.336     1.399     0.008     0.000 
## 
## log(scale)                    2.163               0.009     0.000 
## log(shape)                    0.429               0.003     0.000 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -242711 
## LR test statistic         10178.32 
## Degrees of freedom        3 
## Overall p-value           0

plot(fit.2)

plot(fit.2, fn="haz")
lines(fit.haz.sm, col=2)

Log-Normal Model

fit.3<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
             data=n2[n2$secbi>0,],dist="lognormal", center=T)
summary(fit.3)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "lognormal", center = T)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## (Intercept)                   2.692               0.064     0.000 
## SEX                 0.551    -0.004     0.996     0.008     0.647 
## AGE                53.255    -0.018     0.983     0.000     0.000 
## BPLD2               0.311     0.321     1.379     0.008     0.000 
## 
## log(scale)                    3.964               0.058     0.000 
## log(shape)                   -0.165               0.012     0.000 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -240300 
## LR test statistic         8139.68 
## Degrees of freedom        3 
## Overall p-value           0

plot(fit.3)

#plot the hazard from the log normal vs the empirical hazard
plot(fit.3, fn="haz")
lines(fit.haz.sm, col=2)

#log-normal distribution for hazard
fit.4<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
             data=n2[n2$secbi>0,], dist="loglogistic", center=T)
summary(fit.4)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "loglogistic", center = T)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## (Intercept)                   1.056               0.026     0.000 
## SEX                 0.551    -0.004     0.996     0.008     0.623 
## AGE                53.255    -0.017     0.983     0.000     0.000 
## BPLD2               0.311     0.322     1.380     0.008     0.000 
## 
## log(scale)                    2.612               0.014     0.000 
## log(shape)                    0.772               0.006     0.000 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -240139 
## LR test statistic         7635.14 
## Degrees of freedom        3 
## Overall p-value           0

plot(fit.4)

#plot the hazard from the log normal vs the empirical hazard
plot(fit.4, fn="haz")
lines(fit.haz.sm, col=2)

fit.5<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
             data=n2[n2$secbi>0,], dist="pch",
             cuts=c(15, 35, 50,70))
summary(fit.5)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "pch", cuts = c(15, 35, 50, 70))
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## SEX                 0.551    -0.003     0.997     0.008     0.733 
## AGE                53.255    -0.013     0.987     0.000     0.000 
## BPLD2               0.311     0.254     1.289     0.008     0.000 
## 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -250372 
## LR test statistic         5130.27 
## Degrees of freedom        3 
## Overall p-value           0

plot(fit.5)

plot(fit.5, fn="haz", ylim=c(0, .15))
lines(fit.haz.sm, col=2)

None of these look any good. But the best fit is between the Log-logistic Model and the Piecewise constant exponential model.

Graphical checks on the model fit

emp<-coxreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
            data=n2[n2$secbi>0,])

check.dist(sp=emp,pp=fit.1, main = "Empirical vs. Exponential")

check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")

check.dist(sp=emp,pp=fit.3, main = "Empirical vs. Log-Normal")

check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")

check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")

PCH is definitley the best. Now I can use this model to interpret results.

Include all main effects in the model
Test for an interaction between at least two of the predictors

Coming back to Fit 5 (Piecewise), I interact BPLD2 (Region of Orgin, where 1 = African Immigrants and 0 = Caribbean Immigrants) and SEX.

fit.6<-phreg(Surv(secbi, b2event)~SEX*AGE+BPLD2,
             data=n2[n2$secbi>0,], dist="pch",
             cuts=c(15, 35, 50,70))
summary(fit.6)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX * AGE + BPLD2, data = n2[n2$secbi > 
##     0, ], dist = "pch", cuts = c(15, 35, 50, 70))
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## SEX                 0.551    -0.001     0.999     0.024     0.970 
## AGE                53.255    -0.013     0.987     0.000     0.000 
## BPLD2               0.311     0.254     1.289     0.008     0.000 
## SEX:AGE          
##        -0.000     1.000     0.000     0.938 
## 
## 
## Events                    68558 
## Total time at risk        1014894 
## Max. log. likelihood      -250372 
## LR test statistic         5130.28 
## Degrees of freedom        4 
## Overall p-value           0

plot(fit.6)

f. Interpret your results and write them up

After running diagnostics, I decided to keep with Proportional Hazards. The Piecewise constant exponential model in particular best fits my data and can be broken down by 15, 35, 50, 70 hazard-year points.

In the original Piecewise model, AGE and BIRTHPLACE had positive effects on the outcome. In terms of BIRTHPLACE, we see that African immigrants (1) tend to have higher “risks” of naturalization compared to their Caribbean counterparts (0). We also note that the outcome is dependent on AGE. A cursory examination of naturalization “risk” shows that an immigrant aged 50 will have 2 times more of a chance to be naturalized compared to a 25 year old exp(.987)50/exp(.987)25. SEX however has a negative effect. That is, SEX has no significant effect in determining risk of naturalization. When I interacted SEX with BIRTHPLACE, the term remained non-significant. This seems odd to me, given that most research indicates that women tend to immigrate by themselves. Perhaps the effect became null after censoring? This will require further interrogation.

Note: I chose this dataset as a way try something new, but it may be worth interrogating this subject using additional covariates like Marital status, income and family size, which all have a important impact on naturalization.