DEM 7223 Homework 2 DEM 7223 - Event History Analysis - Parametric Hazard Models Ricardo Lowe
library(car)
## Loading required package: carData
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(muhaz)
library(eha)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ stringr 1.4.0
## ✓ tidyr 1.1.1 ✓ forcats 0.5.0
## ✓ readr 1.3.1
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
## x tidyr::unpack() masks Matrix::unpack()
if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
## Loading required package: ipumsr
usa_00117 <- read_ipums_ddi("usa_00117.xml")
usa_00117_ <- read_ipums_micro(usa_00117)
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
usa_00118 <- read_ipums_ddi("usa_00118.xml")
usa_00118_ <- read_ipums_micro(usa_00118)
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
usa_00119 <- read_ipums_ddi("usa_00119.xml")
usa_00119_ <- read_ipums_micro(usa_00119)
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
usa_00120 <- read_ipums_ddi("usa_00120.xml")
usa_00120_ <- read_ipums_micro(usa_00120)
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
The outcome for this analysis is the duration between the respondent’s year of immigration and the year they were actually naturalized. I censor records whose naturalization occurred during the same year of their immigration. Thus, if a respondent did not become naturalized until after their first year of immigration, I compute the duration between the survey date (2018) and their respective year of naturalization. The alternative is the difference between the year of naturalization and their year of immigration.
library(dplyr)
ipums_view(usa_00117_)
n2 <- usa_00117_%>%
filter(YRNATUR < 9999 & YRIMMIG < 9999)
n2$YRIMMIG2 <- as.numeric(ifelse(n2$YRNATUR-n2$YRIMMIG <= 0,"",n2$YRIMMIG))
n2$secbi<-ifelse(is.na(n2$YRIMMIG2)==T,
((n2$YEAR))-((n2$YRNATUR)),
(n2$YRNATUR-n2$YRIMMIG2))
n2$b2event<-ifelse(is.na(n2$YRIMMIG2)==T,0,1)
fit<-survfit(Surv(secbi, b2event)~1, n2)
fit
## Call: survfit(formula = Surv(secbi, b2event) ~ 1, data = n2)
##
## n events median 0.95LCL 0.95UCL
## 88485 79491 11 11 11
79,491 (unweighted) cases experienced the event. That is, there are nearly 10,000 instances where naturalization did not occur within the time window I alotted.
plot(fit, conf.int=T, ylab="S(t)", xlab="Years")
title(main="Survival Function for Naturalization Interval, ACS 2018 Data")
fit.haz.km<-kphaz.fit(n2$secbi[n2$secbi>0],
n2$b2event[n2$secbi>0] ,
method = "product-limit")
fit.haz.sm<-muhaz(n2$secbi[n2$secbi>0], n2$b2event[n2$secbi>0] )
#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km, main="plot of the hazard of obtaining naturalization")
#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)
legend("topright", legend = c("KM hazard", "Smoothed Hazard"),
col=c(1,2), lty=c(1,1))
n2$BPLD2 <- ifelse(n2$BPLD>= 25000 & n2$BPLD<= 26070, 0,
ifelse(n2$BPLD>= 60000 & n2$BPLD<= 60099, 1,NA))
n2$SEX <- ifelse(n2$SEX>= 2, 1,
ifelse(n2$SEX<= 1,0,NA))
I initially opted for an AFT because I am particularly interested in interpreting the actual duration of the time it takes to be naturalized as oppose to the “hazard” of being naturalized in itself. However, I conduct both models to examine their relative outputs.
I have not seen much research pertaining to time-varying naturalization processes. I would presume that naturalization will be concentrated in the early years following immigration, and will wane over time. I don’t anticipate much fluctuation.This presumption, which is sort of plotted above, might empathize with the Weibull distribution. I will run and conduct different diagnostics to examine this further.
Exponential Model
#exponential distribution for hazard, here we hard code it to be
#a weibull dist with shape ==1
fit.1<-phreg(Surv(secbi, b2event)~SEX+AGE^2+BPLD2,
data=n2[n2$secbi>0,], dist="weibull", shape = 1)
summary(fit.1)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE^2 + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "weibull", shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## SEX 0.551 -0.003 0.997 0.008 0.667
## AGE 53.255 -0.012 0.988 0.000 0.000
## BPLD2 0.311 0.226 1.254 0.008 0.000
##
## log(scale) 2.177 0.014 0.000
##
## Shape is fixed at 1
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -251191
## LR test statistic 4241.91
## Degrees of freedom 3
## Overall p-value 0
plot(fit.1)
AFT model specification
fit.1.aft<-survreg(Surv(secbi, b2event)~SEX+AGE+BIRTHYR+BPLD2,
data=n2[n2$secbi>0,],dist = "exponential")
summary(fit.1.aft)
##
## Call:
## survreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BIRTHYR +
## BPLD2, data = n2[n2$secbi > 0, ], dist = "exponential")
## Value Std. Error z p
## (Intercept) 10.84664 5.46237 1.99 0.0471
## SEX 0.00341 0.00770 0.44 0.6576
## AGE 0.00728 0.00271 2.68 0.0073
## BIRTHYR -0.00430 0.00271 -1.59 0.1125
## BPLD2 -0.22548 0.00818 -27.55 <2e-16
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -251190 Loglik(intercept only)= -253312.2
## Chisq= 4244.43 on 4 degrees of freedom, p= 0
## Number of Newton-Raphson Iterations: 4
## n=76135 (12340 observations deleted due to missingness)
Weibull Model
#weibull distribution for hazard
fit.2<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
data=n2[n2$secbi>0,], dist="weibull")
summary(fit.2)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## SEX 0.551 -0.002 0.998 0.008 0.793
## AGE 53.255 -0.020 0.980 0.000 0.000
## BPLD2 0.311 0.336 1.399 0.008 0.000
##
## log(scale) 2.163 0.009 0.000
## log(shape) 0.429 0.003 0.000
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -242711
## LR test statistic 10178.32
## Degrees of freedom 3
## Overall p-value 0
plot(fit.2)
plot(fit.2, fn="haz")
lines(fit.haz.sm, col=2)
Log-Normal Model
fit.3<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
data=n2[n2$secbi>0,],dist="lognormal", center=T)
summary(fit.3)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "lognormal", center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) 2.692 0.064 0.000
## SEX 0.551 -0.004 0.996 0.008 0.647
## AGE 53.255 -0.018 0.983 0.000 0.000
## BPLD2 0.311 0.321 1.379 0.008 0.000
##
## log(scale) 3.964 0.058 0.000
## log(shape) -0.165 0.012 0.000
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -240300
## LR test statistic 8139.68
## Degrees of freedom 3
## Overall p-value 0
plot(fit.3)
#plot the hazard from the log normal vs the empirical hazard
plot(fit.3, fn="haz")
lines(fit.haz.sm, col=2)
#log-normal distribution for hazard
fit.4<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
data=n2[n2$secbi>0,], dist="loglogistic", center=T)
summary(fit.4)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "loglogistic", center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) 1.056 0.026 0.000
## SEX 0.551 -0.004 0.996 0.008 0.623
## AGE 53.255 -0.017 0.983 0.000 0.000
## BPLD2 0.311 0.322 1.380 0.008 0.000
##
## log(scale) 2.612 0.014 0.000
## log(shape) 0.772 0.006 0.000
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -240139
## LR test statistic 7635.14
## Degrees of freedom 3
## Overall p-value 0
plot(fit.4)
#plot the hazard from the log normal vs the empirical hazard
plot(fit.4, fn="haz")
lines(fit.haz.sm, col=2)
fit.5<-phreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
data=n2[n2$secbi>0,], dist="pch",
cuts=c(15, 35, 50,70))
summary(fit.5)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX + AGE + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "pch", cuts = c(15, 35, 50, 70))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## SEX 0.551 -0.003 0.997 0.008 0.733
## AGE 53.255 -0.013 0.987 0.000 0.000
## BPLD2 0.311 0.254 1.289 0.008 0.000
##
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -250372
## LR test statistic 5130.27
## Degrees of freedom 3
## Overall p-value 0
plot(fit.5)
plot(fit.5, fn="haz", ylim=c(0, .15))
lines(fit.haz.sm, col=2)
None of these look any good. But the best fit is between the Log-logistic Model and the Piecewise constant exponential model.
Graphical checks on the model fit
emp<-coxreg(Surv(secbi, b2event)~SEX+AGE+BPLD2,
data=n2[n2$secbi>0,])
check.dist(sp=emp,pp=fit.1, main = "Empirical vs. Exponential")
check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")
check.dist(sp=emp,pp=fit.3, main = "Empirical vs. Log-Normal")
check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")
check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")
PCH is definitley the best. Now I can use this model to interpret results.
Include all main effects in the model
Test for an interaction between at least two of the predictors
Coming back to Fit 5 (Piecewise), I interact BPLD2 (Region of Orgin, where 1 = African Immigrants and 0 = Caribbean Immigrants) and SEX.
fit.6<-phreg(Surv(secbi, b2event)~SEX*AGE+BPLD2,
data=n2[n2$secbi>0,], dist="pch",
cuts=c(15, 35, 50,70))
summary(fit.6)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ SEX * AGE + BPLD2, data = n2[n2$secbi >
## 0, ], dist = "pch", cuts = c(15, 35, 50, 70))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## SEX 0.551 -0.001 0.999 0.024 0.970
## AGE 53.255 -0.013 0.987 0.000 0.000
## BPLD2 0.311 0.254 1.289 0.008 0.000
## SEX:AGE
## -0.000 1.000 0.000 0.938
##
##
## Events 68558
## Total time at risk 1014894
## Max. log. likelihood -250372
## LR test statistic 5130.28
## Degrees of freedom 4
## Overall p-value 0
plot(fit.6)
f. Interpret your results and write them up
After running diagnostics, I decided to keep with Proportional Hazards. The Piecewise constant exponential model in particular best fits my data and can be broken down by 15, 35, 50, 70 hazard-year points.
In the original Piecewise model, AGE and BIRTHPLACE had positive effects on the outcome. In terms of BIRTHPLACE, we see that African immigrants (1) tend to have higher “risks” of naturalization compared to their Caribbean counterparts (0). We also note that the outcome is dependent on AGE. A cursory examination of naturalization “risk” shows that an immigrant aged 50 will have 2 times more of a chance to be naturalized compared to a 25 year old exp(.987)50/exp(.987)25. SEX however has a negative effect. That is, SEX has no significant effect in determining risk of naturalization. When I interacted SEX with BIRTHPLACE, the term remained non-significant. This seems odd to me, given that most research indicates that women tend to immigrate by themselves. Perhaps the effect became null after censoring? This will require further interrogation.
Note: I chose this dataset as a way try something new, but it may be worth interrogating this subject using additional covariates like Marital status, income and family size, which all have a important impact on naturalization.