HW4.utf8

Ricardo Lowe DEM 7223 – Event History Analysis Homework 5 (October 26th)

This analysis fits multiple discrete time hazard models to person-period data. In this week’s homework, I will use the event of naturalization among Panamanian immigrants coming into the U.S. This dataset is exclusively Panamanian immigrants, irrespective of race and ethnicity, surveyed in 2017 and 2018.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(survival)
library(survey)

## Loading required package: grid

## Loading required package: Matrix

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(ggplot2)




if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")

## Loading required package: ipumsr

usa_00129 <- read_ipums_ddi("usa_00129.xml")
usa_00129 <- read_ipums_micro(usa_00129)

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

ipums_view(usa_00129)

In the ACS, information pertaining to year of naturalization among immigrants is recorded in the survey. The variable can be interpreted as retrospective because it catalogs year of citizenship dating back to the early 20th century.

Since our outcome is time between immigration and naturalization, we must select as our risk set, only immigrants. I did this through the “select cases” option on ACS.

To get the interval for immigrants who actually naturalized at time of survey, that is the difference between the year of naturalization and year of immigration. For those who did not acquire citizenship by the time of the interview, the censored time is between year of immigration and the date of the interview.

We have 6624 observations who are at experienced naturalization.

usa_00122_ <- usa_00129

usa_00122_$WESTINDI <- ifelse(usa_00122_$ANCESTR1 >= 300 & usa_00122_$ANCESTR1 <= 335, 1, 0)


usa_00122_$Female <-   ifelse(usa_00122_$SEX == 1, 0, 1)
usa_00122_$MAR <-   ifelse(usa_00122_$MARST <= 2, 0, 1)

usa_00122_$HighEd <-   ifelse(usa_00122_$EDUC <= 6, 0, 1)



usa_00122_$ImmAge <-  usa_00122_$YRIMMIG-usa_00122_$BIRTHYR

usa_00122_$AGEC<-cut(usa_00122_$AGE, breaks=c(0,24,39,59,79,99))

usa_00122_$AGEW<-ifelse(usa_00122_$AGE >=15 & usa_00122_$AGE <=24, "15_24",
                 ifelse(usa_00122_$AGE >=25 & usa_00122_$AGE <=54, "25_54",
                 ifelse(usa_00122_$AGE >=55 & usa_00122_$AGE <=64, "55_64","Other")))
                        


n2 <- usa_00122_%>%
  filter(YRIMMIG < 9000 & 
         BIRTHYR < 9000 &
         YRSUSA1 > 5)
         


#n2$YRIMMIG3 <- as.numeric(ifelse(n2$YRIMMIG >= 1990,"",n2$YRIMMIG))
n2$YRNATUR3 <- as.numeric(ifelse(n2$YRNATUR >= 9997,NA,n2$YRNATUR))
n2$YRIMMIG3 <- as.numeric((n2$YEAR - n2$YRIMMIG))


n2$YEAR <- as.factor(n2$YEAR)

n2$secbi<-ifelse(is.na(n2$YRNATUR3)==T, 
                 n2$YRIMMIG3, 
                (n2$YRNATUR3 - n2$YRIMMIG))
                 


n2$b2event<-ifelse(is.na(n2$YRNATUR3)==T,0,1) 

n2$b2event<-as.numeric(n2$b2event)
n2$secbi<-as.numeric(n2$secbi)


fit<-survfit(Surv(secbi, b2event)~1, n2)
fit

## Call: survfit(formula = Surv(secbi, b2event) ~ 1, data = n2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##   13259    6624      32      31      34

Here, I treat time discretely by transforming the data into a person-period format. I split the continuous duration per event into discrete periods. The ACS is a household survey, meaning serial are the same for those in living in the same house. Still, each respondent is givin their own periods dependent on their specified charateristics.

#make person period file
pp<-survSplit(Surv(secbi, b2event)~. , data = n2[n2$secbi>0,], 
              cut=c(0,12, 24, 36, 48, 60, 72), episode="year_nat")

pp$year <- pp$year_nat-1
pp<-pp[order(pp$SERIAL, pp$year_nat),]
head(pp[, c("SERIAL", "secbi", "STRATA", "b2event", "year", "AGE")], n=50)

##       SERIAL secbi STRATA b2event year AGE
## 1         45     9 250001       1    1  33
## 16846   1002    12 250001       0    1  39
## 16847   1002    21 250001       0    2  39
## 16848   1690    12  10001       0    1  43
## 16849   1690    24  10001       0    2  43
## 2       2613    12  30101       0    1  28
## 3       2613    24  30101       0    2  28
## 4       2613    29  30101       0    3  28
## 5       2681    12  90001       0    1  56
## 6       2681    24  90001       0    2  56
## 7       2681    36  90001       0    3  56
## 8       2681    48  90001       0    4  56
## 9       2681    60  90001       0    5  56
## 10      2681    61  90001       0    6  56
## 11      3234    12 130401       0    1  33
## 12      3234    24 130401       0    2  33
## 13      3234    36 130401       0    3  33
## 14      3234    37 130401       0    4  33
## 16850   3978    12  30201       0    1  59
## 16851   3978    24  30201       0    2  59
## 16852   3978    35  30201       0    3  59
## 15      4455    12 130301       0    1  27
## 16      4455    24 130301       0    2  27
## 17      4455    26 130301       0    3  27
## 18      4895    12 250001       0    1  53
## 19      4895    24 250001       0    2  53
## 20      4895    36 250001       0    3  53
## 21      4895    38 250001       0    4  53
## 22      6073     6 210001       1    1  45
## 23      6494    12 260001       0    1  56
## 24      6494    24 260001       0    2  56
## 25      6494    36 260001       0    3  56
## 26      6494    48 260001       0    4  56
## 27      6494    58 260001       0    5  56
## 28      8486    12 270201       0    1  19
## 29      8486    21 270201       0    2  19
## 30      9434    12 240001       0    1  44
## 32      9434    12 240001       0    1  54
## 31      9434    21 240001       0    2  44
## 33      9434    15 240001       0    2  54
## 34     10088    12 250001       0    1  93
## 35     10088    24 250001       0    2  93
## 36     10088    32 250001       1    3  93
## 37     11416    12 250001       0    1  49
## 41     11416    12 250001       0    1  48
## 38     11416    24 250001       0    2  49
## 42     11416    24 250001       0    2  48
## 39     11416    36 250001       0    3  49
## 43     11416    36 250001       0    3  48
## 40     11416    38 250001       0    4  49

It is hard to tell here given the small number of actionable events, but each Panamanian immigrant appears in the data until they either experience the event or are censored.

I now practice using different logit models. First I fit this to the binomial distribution.

#generate survey design
options(survey.lonely.psu="adjust")
des<-svydesign(ids=~SERIAL, strata = ~STRATA, weights=~PERWT, nest=TRUE, data=pp)

#https://r-survey.r-forge.r-project.org/survey/exmample-lonely.html
fit.0<-svyglm(b2event~as.factor(year)-1, design=des, family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.0)

## 
## Call:
## svyglm(formula = b2event ~ as.factor(year) - 1, design = des, 
##     family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## as.factor(year)1  -1.09956    0.02450  -44.87   <2e-16 ***
## as.factor(year)2  -1.29340    0.03115  -41.52   <2e-16 ***
## as.factor(year)3  -2.02556    0.04889  -41.43   <2e-16 ***
## as.factor(year)4  -2.82766    0.08705  -32.48   <2e-16 ***
## as.factor(year)5  -3.52256    0.15659  -22.50   <2e-16 ***
## as.factor(year)6  -6.31136    0.55275  -11.42   <2e-16 ***
## as.factor(year)7 -14.55473    0.07742 -187.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.995109)
## 
## Number of Fisher Scoring iterations: 13

#Plot the hazard function on the probability scale
haz<-1/(1+exp(-coef(fit.0)))
time<-seq(1,7,1)
plot(haz~time, type="l", ylab="h(t)")
title(main="Discrete Time Hazard Function for Naturalization")

It appears that the hazare peaks three years of immigration.

St<-NA 
time<-1:length(haz) 
St[1]<-1-haz[1]
for(i in 2:length(haz)){
St[i]<-St[i-1]* (1-haz[i]) }
plot(y=St,x=time, type="l", main="Survival Function for Naturalization Interval")

I opt to employ in another mode.I want to see how this looks like if were to assume that the baseline hazard fits a linear or curvilinear function of time.

 #Linear term for time
fit.l<-svyglm(b2event~year,design=des , family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.l)

## 
## Call:
## svyglm(formula = b2event ~ year, design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.46842    0.03213  -14.58   <2e-16 ***
## year        -0.52869    0.01555  -34.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9652099)
## 
## Number of Fisher Scoring iterations: 5

This says that the hazard decreases over time. Or, in other words, that the act of naturalization declines overtime.

This seems sensible to me. It makes sense to acquire citizenship almost immediately after eligibility. I need to examine the characteristics of these immigrants and compare to those who are censored.

But first, I want to test this model on a quadratic function and incorporate some splines as well.

fit.s<-svyglm(b2event~year+I(year^2),design=des , family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.s)

## 
## Call:
## svyglm(formula = b2event ~ year + I(year^2), design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.12255    0.07837 -14.324   <2e-16 ***
## year         0.18754    0.07931   2.365   0.0181 *  
## I(year^2)   -0.15098    0.01684  -8.964   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9989647)
## 
## Number of Fisher Scoring iterations: 6

fit.c<-svyglm(b2event~year+I(year^2)+I(year^3 ),design=des , family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.c)

## 
## Call:
## svyglm(formula = b2event ~ year + I(year^2) + I(year^3), design = des, 
##     family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.426660   0.150173  -9.500  < 2e-16 ***
## year         0.659529   0.207499   3.178  0.00149 ** 
## I(year^2)   -0.351988   0.078688  -4.473  7.8e-06 ***
## I(year^3)    0.024491   0.008542   2.867  0.00415 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9932461)
## 
## Number of Fisher Scoring iterations: 7

fit.q<-svyglm(b2event~year+I(year^2)+I(year^3 )+I(year^4),design=des , family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.q)

## 
## Call:
## svyglm(formula = b2event ~ year + I(year^2) + I(year^3) + I(year^4), 
##     design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.918288   0.473710  -6.160 7.56e-10 ***
## year         3.508919   0.874603   4.012 6.07e-05 ***
## I(year^2)   -2.103154   0.520637  -4.040 5.40e-05 ***
## I(year^3)    0.446959   0.122240   3.656 0.000257 ***
## I(year^4)   -0.034505   0.009779  -3.528 0.000420 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9873573)
## 
## Number of Fisher Scoring iterations: 8

 #Spline
library(splines)

fit.sp<-svyglm(b2event~ns(year, df = 2),design=des , family="binomial")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit.sp)

## 
## Call:
## svyglm(formula = b2event ~ ns(year, df = 2), design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~SERIAL, strata = ~STRATA, weights = ~PERWT, 
##     nest = TRUE, data = pp)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.08776    0.02384  -45.62   <2e-16 ***
## ns(year, df = 2)1 -3.47898    0.12633  -27.54   <2e-16 ***
## ns(year, df = 2)2 -5.36197    0.31233  -17.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9972881)
## 
## Number of Fisher Scoring iterations: 6

dat<-expand.grid(year=seq(1,7,1))
dat$genmod<-predict(fit.0, newdata=data.frame(year=as.factor(1:7 )), type="response") 
dat$lin<-predict(fit.l, newdata=dat, type="response")
dat$sq<-predict(fit.s, newdata=dat, type="response")
dat$cub<-predict(fit.c, newdata=dat, type="response")
dat$quart<-predict(fit.q, newdata=dat, type="response")
dat$spline<-predict(fit.sp, newdata=expand.grid(year=seq(1,7,1)), type="response")
dat

##   year       genmod        lin           sq         cub        quart
## 1    1 2.498215e-01 0.26951043 0.2523739724 0.250747682 2.497272e-01
## 2    2 2.152773e-01 0.17860997 0.2056429136 0.210880302 2.159429e-01
## 3    3 1.165449e-01 0.11360055 0.1280006160 0.124040636 1.143655e-01
## 4    4 5.584778e-02 0.07022978 0.0579719248 0.054529947 5.947338e-02
## 5    5 2.867708e-02 0.04262106 0.0187182681 0.020485362 2.537882e-02
## 6    6 1.812270e-03 0.02556742 0.0043526690 0.007759807 3.162156e-03
## 7    7 4.774864e-07 0.01522878 0.0007402301 0.003480926 1.756846e-05
##        spline
## 1 0.252040104
## 2 0.250262389
## 3 0.212579251
## 4 0.127055620
## 5 0.043696863
## 6 0.009471842
## 7 0.001631189

plot(genmod~year, dat, type="l", ylab="h(t)", xlab="Time") 
title(main="Hazard function from different time parameterizations") 
lines(lin~year, dat, col=2, lwd=2)
lines(sq~year, dat, col=3, lwd=2)
lines(cub~year, dat, col=4, lwd=2) 
lines(quart~year, dat, col=5, lwd=2) 
lines(spline~year, dat, col=6, lwd=2)
legend("topleft",
legend=c("General Model", "Linear","Square", "Cubic", "Quartic", "Natural spline"), col=1:6, lwd=1.5)

#AIC table
aic<-round(c( 
  fit.l$deviance+2*length(fit.l$coefficients), 
  fit.s$deviance+2*length(fit.s$coefficients), 
  fit.c$deviance+2*length(fit.c$coefficients), 
  fit.q$deviance+2*length(fit.q$coefficients), 
  fit.sp$deviance+2*length(fit.sp$coefficients),
  fit.0$deviance+2*length(fit.0$coefficients)),2)
#compare all aics to the one from the general model
dif.aic<-round(aic-aic[6],2)
data.frame(model =c( "linear","square", "cubic", "quartic","spline", "general"), aic=aic, aic_dif=dif.aic)

##     model      aic aic_dif
## 1  linear 30245.26  157.09
## 2  square 30103.32   15.15
## 3   cubic 30098.95   10.78
## 4 quartic 30086.66   -1.51
## 5  spline 30102.45   14.28
## 6 general 30088.17    0.00

The AIC points look generally good and equally fit across all models except for the quartic. The difference is intense.