Ricardo Lowe DEM 7223 – Event History Analysis Homework 5 (October 26th)
This analysis fits multiple discrete time hazard models to person-period data. In this week’s homework, I will use the event of naturalization among Panamanian immigrants coming into the U.S. This dataset is exclusively Panamanian immigrants, irrespective of race and ethnicity, surveyed in 2017 and 2018.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(survival)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(ggplot2)
if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
## Loading required package: ipumsr
usa_00129 <- read_ipums_ddi("usa_00129.xml")
usa_00129 <- read_ipums_micro(usa_00129)
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
ipums_view(usa_00129)
In the ACS, information pertaining to year of naturalization among immigrants is recorded in the survey. The variable can be interpreted as retrospective because it catalogs year of citizenship dating back to the early 20th century.
Since our outcome is time between immigration and naturalization, we must select as our risk set, only immigrants. I did this through the “select cases” option on ACS.
To get the interval for immigrants who actually naturalized at time of survey, that is the difference between the year of naturalization and year of immigration. For those who did not acquire citizenship by the time of the interview, the censored time is between year of immigration and the date of the interview.
We have 6624 observations who are at experienced naturalization.
usa_00122_ <- usa_00129
usa_00122_$WESTINDI <- ifelse(usa_00122_$ANCESTR1 >= 300 & usa_00122_$ANCESTR1 <= 335, 1, 0)
usa_00122_$Female <- ifelse(usa_00122_$SEX == 1, 0, 1)
usa_00122_$MAR <- ifelse(usa_00122_$MARST <= 2, 0, 1)
usa_00122_$HighEd <- ifelse(usa_00122_$EDUC <= 6, 0, 1)
usa_00122_$ImmAge <- usa_00122_$YRIMMIG-usa_00122_$BIRTHYR
usa_00122_$AGEC<-cut(usa_00122_$AGE, breaks=c(0,24,39,59,79,99))
usa_00122_$AGEW<-ifelse(usa_00122_$AGE >=15 & usa_00122_$AGE <=24, "15_24",
ifelse(usa_00122_$AGE >=25 & usa_00122_$AGE <=54, "25_54",
ifelse(usa_00122_$AGE >=55 & usa_00122_$AGE <=64, "55_64","Other")))
n2 <- usa_00122_%>%
filter(YRIMMIG < 9000 &
BIRTHYR < 9000 &
YRSUSA1 > 5)
#n2$YRIMMIG3 <- as.numeric(ifelse(n2$YRIMMIG >= 1990,"",n2$YRIMMIG))
n2$YRNATUR3 <- as.numeric(ifelse(n2$YRNATUR >= 9997,NA,n2$YRNATUR))
n2$YRIMMIG3 <- as.numeric((n2$YEAR - n2$YRIMMIG))
n2$YEAR <- as.factor(n2$YEAR)
n2$secbi<-ifelse(is.na(n2$YRNATUR3)==T,
n2$YRIMMIG3,
(n2$YRNATUR3 - n2$YRIMMIG))
n2$b2event<-ifelse(is.na(n2$YRNATUR3)==T,0,1)
n2$b2event<-as.numeric(n2$b2event)
n2$secbi<-as.numeric(n2$secbi)
fit<-survfit(Surv(secbi, b2event)~1, n2)
fit
## Call: survfit(formula = Surv(secbi, b2event) ~ 1, data = n2)
##
## n events median 0.95LCL 0.95UCL
## 13259 6624 32 31 34
Here, I treat time discretely by transforming the data into a person-period format. I split the continuous duration per event into discrete periods. The ACS is a household survey, meaning serial are the same for those in living in the same house. Still, each respondent is givin their own periods dependent on their specified charateristics.
#make person period file
pp<-survSplit(Surv(secbi, b2event)~. , data = n2[n2$secbi>0,],
cut=c(0,12, 24, 36, 48, 60, 72), episode="year_nat")
pp$year <- pp$year_nat-1
pp<-pp[order(pp$SERIAL, pp$year_nat),]
head(pp[, c("SERIAL", "secbi", "STRATA", "b2event", "year", "AGE")], n=50)
## SERIAL secbi STRATA b2event year AGE
## 1 45 9 250001 1 1 33
## 16846 1002 12 250001 0 1 39
## 16847 1002 21 250001 0 2 39
## 16848 1690 12 10001 0 1 43
## 16849 1690 24 10001 0 2 43
## 2 2613 12 30101 0 1 28
## 3 2613 24 30101 0 2 28
## 4 2613 29 30101 0 3 28
## 5 2681 12 90001 0 1 56
## 6 2681 24 90001 0 2 56
## 7 2681 36 90001 0 3 56
## 8 2681 48 90001 0 4 56
## 9 2681 60 90001 0 5 56
## 10 2681 61 90001 0 6 56
## 11 3234 12 130401 0 1 33
## 12 3234 24 130401 0 2 33
## 13 3234 36 130401 0 3 33
## 14 3234 37 130401 0 4 33
## 16850 3978 12 30201 0 1 59
## 16851 3978 24 30201 0 2 59
## 16852 3978 35 30201 0 3 59
## 15 4455 12 130301 0 1 27
## 16 4455 24 130301 0 2 27
## 17 4455 26 130301 0 3 27
## 18 4895 12 250001 0 1 53
## 19 4895 24 250001 0 2 53
## 20 4895 36 250001 0 3 53
## 21 4895 38 250001 0 4 53
## 22 6073 6 210001 1 1 45
## 23 6494 12 260001 0 1 56
## 24 6494 24 260001 0 2 56
## 25 6494 36 260001 0 3 56
## 26 6494 48 260001 0 4 56
## 27 6494 58 260001 0 5 56
## 28 8486 12 270201 0 1 19
## 29 8486 21 270201 0 2 19
## 30 9434 12 240001 0 1 44
## 32 9434 12 240001 0 1 54
## 31 9434 21 240001 0 2 44
## 33 9434 15 240001 0 2 54
## 34 10088 12 250001 0 1 93
## 35 10088 24 250001 0 2 93
## 36 10088 32 250001 1 3 93
## 37 11416 12 250001 0 1 49
## 41 11416 12 250001 0 1 48
## 38 11416 24 250001 0 2 49
## 42 11416 24 250001 0 2 48
## 39 11416 36 250001 0 3 49
## 43 11416 36 250001 0 3 48
## 40 11416 38 250001 0 4 49
It is hard to tell here given the small number of actionable events, but each Panamanian immigrant appears in the data until they either experience the event or are censored.
I now practice using different logit models. First I fit this to the binomial distribution.
#generate survey design
options(survey.lonely.psu="adjust")
des<-svydesign(ids=~SERIAL, strata = ~STRATA, weights=~PERWT, nest=TRUE, data=pp)
#https://r-survey.r-forge.r-project.org/survey/exmample-lonely.html
fit.0<-svyglm(b2event~as.factor(year)-1, design=des, family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.0)
##
## Call:
## svyglm(formula = b2event ~ as.factor(year) - 1, design = des,
## family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## as.factor(year)1 -1.09956 0.02450 -44.87 <2e-16 ***
## as.factor(year)2 -1.29340 0.03115 -41.52 <2e-16 ***
## as.factor(year)3 -2.02556 0.04889 -41.43 <2e-16 ***
## as.factor(year)4 -2.82766 0.08705 -32.48 <2e-16 ***
## as.factor(year)5 -3.52256 0.15659 -22.50 <2e-16 ***
## as.factor(year)6 -6.31136 0.55275 -11.42 <2e-16 ***
## as.factor(year)7 -14.55473 0.07742 -187.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.995109)
##
## Number of Fisher Scoring iterations: 13
#Plot the hazard function on the probability scale
haz<-1/(1+exp(-coef(fit.0)))
time<-seq(1,7,1)
plot(haz~time, type="l", ylab="h(t)")
title(main="Discrete Time Hazard Function for Naturalization")
It appears that the hazare peaks three years of immigration.
St<-NA
time<-1:length(haz)
St[1]<-1-haz[1]
for(i in 2:length(haz)){
St[i]<-St[i-1]* (1-haz[i]) }
plot(y=St,x=time, type="l", main="Survival Function for Naturalization Interval")
I opt to employ in another mode.I want to see how this looks like if were to assume that the baseline hazard fits a linear or curvilinear function of time.
#Linear term for time
fit.l<-svyglm(b2event~year,design=des , family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.l)
##
## Call:
## svyglm(formula = b2event ~ year, design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.46842 0.03213 -14.58 <2e-16 ***
## year -0.52869 0.01555 -34.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9652099)
##
## Number of Fisher Scoring iterations: 5
This says that the hazard decreases over time. Or, in other words, that the act of naturalization declines overtime.
This seems sensible to me. It makes sense to acquire citizenship almost immediately after eligibility. I need to examine the characteristics of these immigrants and compare to those who are censored.
But first, I want to test this model on a quadratic function and incorporate some splines as well.
fit.s<-svyglm(b2event~year+I(year^2),design=des , family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.s)
##
## Call:
## svyglm(formula = b2event ~ year + I(year^2), design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.12255 0.07837 -14.324 <2e-16 ***
## year 0.18754 0.07931 2.365 0.0181 *
## I(year^2) -0.15098 0.01684 -8.964 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9989647)
##
## Number of Fisher Scoring iterations: 6
fit.c<-svyglm(b2event~year+I(year^2)+I(year^3 ),design=des , family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.c)
##
## Call:
## svyglm(formula = b2event ~ year + I(year^2) + I(year^3), design = des,
## family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.426660 0.150173 -9.500 < 2e-16 ***
## year 0.659529 0.207499 3.178 0.00149 **
## I(year^2) -0.351988 0.078688 -4.473 7.8e-06 ***
## I(year^3) 0.024491 0.008542 2.867 0.00415 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9932461)
##
## Number of Fisher Scoring iterations: 7
fit.q<-svyglm(b2event~year+I(year^2)+I(year^3 )+I(year^4),design=des , family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.q)
##
## Call:
## svyglm(formula = b2event ~ year + I(year^2) + I(year^3) + I(year^4),
## design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.918288 0.473710 -6.160 7.56e-10 ***
## year 3.508919 0.874603 4.012 6.07e-05 ***
## I(year^2) -2.103154 0.520637 -4.040 5.40e-05 ***
## I(year^3) 0.446959 0.122240 3.656 0.000257 ***
## I(year^4) -0.034505 0.009779 -3.528 0.000420 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9873573)
##
## Number of Fisher Scoring iterations: 8
#Spline
library(splines)
fit.sp<-svyglm(b2event~ns(year, df = 2),design=des , family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.sp)
##
## Call:
## svyglm(formula = b2event ~ ns(year, df = 2), design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT,
## nest = TRUE, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.08776 0.02384 -45.62 <2e-16 ***
## ns(year, df = 2)1 -3.47898 0.12633 -27.54 <2e-16 ***
## ns(year, df = 2)2 -5.36197 0.31233 -17.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9972881)
##
## Number of Fisher Scoring iterations: 6
dat<-expand.grid(year=seq(1,7,1))
dat$genmod<-predict(fit.0, newdata=data.frame(year=as.factor(1:7 )), type="response")
dat$lin<-predict(fit.l, newdata=dat, type="response")
dat$sq<-predict(fit.s, newdata=dat, type="response")
dat$cub<-predict(fit.c, newdata=dat, type="response")
dat$quart<-predict(fit.q, newdata=dat, type="response")
dat$spline<-predict(fit.sp, newdata=expand.grid(year=seq(1,7,1)), type="response")
dat
## year genmod lin sq cub quart
## 1 1 2.498215e-01 0.26951043 0.2523739724 0.250747682 2.497272e-01
## 2 2 2.152773e-01 0.17860997 0.2056429136 0.210880302 2.159429e-01
## 3 3 1.165449e-01 0.11360055 0.1280006160 0.124040636 1.143655e-01
## 4 4 5.584778e-02 0.07022978 0.0579719248 0.054529947 5.947338e-02
## 5 5 2.867708e-02 0.04262106 0.0187182681 0.020485362 2.537882e-02
## 6 6 1.812270e-03 0.02556742 0.0043526690 0.007759807 3.162156e-03
## 7 7 4.774864e-07 0.01522878 0.0007402301 0.003480926 1.756846e-05
## spline
## 1 0.252040104
## 2 0.250262389
## 3 0.212579251
## 4 0.127055620
## 5 0.043696863
## 6 0.009471842
## 7 0.001631189
plot(genmod~year, dat, type="l", ylab="h(t)", xlab="Time")
title(main="Hazard function from different time parameterizations")
lines(lin~year, dat, col=2, lwd=2)
lines(sq~year, dat, col=3, lwd=2)
lines(cub~year, dat, col=4, lwd=2)
lines(quart~year, dat, col=5, lwd=2)
lines(spline~year, dat, col=6, lwd=2)
legend("topleft",
legend=c("General Model", "Linear","Square", "Cubic", "Quartic", "Natural spline"), col=1:6, lwd=1.5)
#AIC table
aic<-round(c(
fit.l$deviance+2*length(fit.l$coefficients),
fit.s$deviance+2*length(fit.s$coefficients),
fit.c$deviance+2*length(fit.c$coefficients),
fit.q$deviance+2*length(fit.q$coefficients),
fit.sp$deviance+2*length(fit.sp$coefficients),
fit.0$deviance+2*length(fit.0$coefficients)),2)
#compare all aics to the one from the general model
dif.aic<-round(aic-aic[6],2)
data.frame(model =c( "linear","square", "cubic", "quartic","spline", "general"), aic=aic, aic_dif=dif.aic)
## model aic aic_dif
## 1 linear 30245.26 157.09
## 2 square 30103.32 15.15
## 3 cubic 30098.95 10.78
## 4 quartic 30086.66 -1.51
## 5 spline 30102.45 14.28
## 6 general 30088.17 0.00
The AIC points look generally good and equally fit across all models except for the quartic. The difference is intense.