Jacob M. Souch

Code
library(readr)
getwd()
[1] "C:/Users/jacob/University of Texas at San Antonio/TEAM - Cossman Crew - General"
Code
setwd("C:/Users/jacob/Downloads")
mhas0121dummies<-read.csv("MHAS0121 NO LABELS NO NEGATIVE TIMES (2).csv")

mhas0121noNA<-mhas0121dummies

#update.packages()
#Split data by person-month
library(survival)
library(dplyr)

mhas0121noNA<-mhas0121noNA %>%
  mutate(
        startmo = round(ageint*12),
        endmo = round(agecensor*12),
        dead = died0121
       )
head(mhas0121noNA, n= 10)
   id locsize01 perwght01 died18 died21 mobirth yrbirth female moint yrint
1   1         1       300     NA     NA      10    1941      0     8  2001
2   2         1       253      0      0       6    1931      0     7  2001
3   3         1       454      0      0       6    1950      0     7  2001
4   4         1       337     NA     NA      10    1936      0     7  2001
5   5         1       295      0      0       3    1944      1     7  2001
6   6         1       253      0      0       5    1948      1     7  2001
7   7         1       490     NA     NA       8    1927      0     8  2001
8   8         1       537      0      0       2    1945      0     7  2001
9   9         1       370      1     NA       9    1946      1     7  2001
10 10         1       237      0     NA       1    1948      0     7  2001
   schooling died0103 refused0103 lost0103 followup0103 died0312 refused0312
1          0        0           0        0            1        0           0
2          0        0           0        0            1        0           0
3          3        0           0        0            1        0           0
4          0        0           0        0            1        1           0
5          0        0           0        0            1        0           0
6          0        0           0        0            1        0           0
7          6        0           0        0            1        1           0
8          3        0           0        0            1        0           0
9          1        0           0        0            1        0           0
10         3        0           0        0            1        0           0
   lost0312 followup0312 died1215 refused1215 lost1215 followup1215 died1518
1         0            1        1           0        0            0       NA
2         0            1        0           0        0            1        0
3         0            1        0           0        0            1        0
4         0            0       NA           0       NA           NA       NA
5         0            1        0           0        0            1        0
6         0            1        0           0        0            1        0
7         0            0       NA           0       NA           NA       NA
8         0            1        0           0        0            1        0
9         0            1        0           0        0            1        1
10        0            1        0           0       NA           NA        0
   refused1518 lost1518 followup1518 died1821 refused1821 lost1821 followup1821
1            0       NA           NA       NA           0       NA           NA
2            0        0            1        0           0        0            1
3            0        0            1        0           0        0            1
4            0       NA           NA       NA           0       NA           NA
5            0        0            1        0           0        0            1
6            0        0            1        0           0        0            1
7            0       NA           NA       NA           0       NA           NA
8            0        0            1        0           0        0            1
9            0        0            0       NA           0       NA           NA
10           0       NA           NA       NA           0       NA           NA
   died0121 mocensor yrcensor   ageint agecensor ageintexact exposure_months
1         1       11     2015 59.83333  74.08334    59.83333             171
2         0       12     2021 70.08334  90.50000    70.08334             245
3         0       12     2021 51.08333  71.50000    51.08333             245
4         1        6     2011 64.75000  74.66666    64.75000             119
5         0       12     2021 57.33333  77.75000    57.33333             245
6         0       12     2021 53.16667  73.58334    53.16667             245
7         1        5     2012 74.00000  84.75000    74.00000             129
8         0        1     2022 56.41667  76.91666    56.41667             246
9         1        2     2015 54.83333  68.41666    54.83333             163
10       NA       11     2012 53.50000  64.83334    53.50000             136
   educlevel ageintmo agecensormo educ1 educ2 educ3 educ4 locsize011 locsize012
1          0      718         889     1     0     0     0          1          0
2          0      841        1086     1     0     0     0          1          0
3         15      613         858     0     1     0     0          1          0
4          0      777         896     1     0     0     0          1          0
5          0      688         933     1     0     0     0          1          0
6          0      638         883     1     0     0     0          1          0
7         68      888        1017     0     0     1     0          1          0
8         15      677         923     0     1     0     0          1          0
9         15      658         821     0     1     0     0          1          0
10        15      642         778     0     1     0     0          1          0
   locsize013 locsize014 startmo endmo dead
1           0          0     718   889    1
2           0          0     841  1086    0
3           0          0     613   858    0
4           0          0     777   896    1
5           0          0     688   933    0
6           0          0     638   883    0
7           0          0     888  1017    1
8           0          0     677   923    0
9           0          0     658   821    1
10          0          0     642   778   NA
Code
max(mhas0121noNA$exposure_months)
[1] NA
Code
#Possible max time is 296 months, but using an even higher threshold in case things change

#rm(mhas_personmonth)
# Code adapted from Mills (2011): section 3.5
# Expands data into person-months, up to maximum number age in months observed in data
mhas_personmonth<-survSplit(mhas0121noNA, cut=c(1:2000), start="startmo", end="endmo", event="dead")

# Sorting data by ID
mhas_personmonth<-mhas_personmonth[order (mhas_personmonth$id, mhas_personmonth$ageintmo),]

#Creating time-varying age variables for different model specifications0
mhas_personmonth<-mhas_personmonth %>%
  mutate(
  age=startmo/12, #could truncate if one wanted to, but I don't 
  agesq=age^2,
  agectr=age-50,
  ageln=logb(age),
  .after = "died0121"
  )

library(dplyr)
library(ggplot2)
#install.packages("AMR")
library(AMR)
mhas_personmonth <- mhas_personmonth %>% 
  mutate(
  # Create age groups
  age5 = dplyr::case_when(
    age >= 50 & age < 55 ~ "50-54",
    age >= 55 & age < 60 ~ "55-59",
    age >= 60 & age < 65 ~ "60-64",
    age >= 65 & age < 70 ~ "65-69",
    age >= 70 & age < 75 ~ "70-74",
    age >= 75 & age < 80 ~ "75-79",
    age >= 80 & age < 85 ~ "80-84",
    age >= 85             ~ "85+"
  ),
  # Convert to factor
  age5 = factor(
    age5,
    level = c("50-54", "55-59","60-64","65-69","70-74","75-79","80-84","85+")
)
)

mhas_personmonth <- mhas_personmonth %>%
  mutate(age5b = case_when(age >= 50 & age < 55 ~ "50-54",
                          age >= 55 & age < 60 ~ "55-59",
                          age >= 60 & age < 65 ~ "60-64",
                          age >= 65 & age < 70 ~ "65-69",
                          age >= 70 & age < 75 ~ "70-74",
                          age >= 75 & age < 80 ~ "75-79",
                          age >= 80 & age < 85 ~ "80-84",
                          age >= 85 ~ "85+"),
         age5b = factor(age5, level = c("50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+")))

#write.csv(mhas_personmonth,"C:\\Users\\eay621\\OneDrive - University of Texas at San Antonio\\3. TEACHING\\EVENT HISTORY ANALYSIS\\EHA Fall 2024\\4. Problem sets\\MHAS0121_PERSONMONTHS.csv", row.names = TRUE)

# Export to CSV to explore more easily (at least for me)
#testPMdata <- mhas_personmonth[which(mhas_personmonth$id < 20, ]
#testPMdata <- subset(mhas_personmonth, id < 50, select=c(id,ageint,agecensor,startmo,endmo,age,age5,died0121,dead))

#write.csv(testPMdata,"C:\\Users\\eay621\\OneDrive - University of Texas at San Antonio\\3. TEACHING\\EVENT HISTORY ANALYSIS\\EHA Fall 2024\\4. Problem sets\\MHAS0121_PERSONMONTHS_TEST.csv", row.names = TRUE)


mhas_personmonth$locsize01 <- as.factor(mhas_personmonth$locsize01)
mhas_personmonth$female <- as.factor(mhas_personmonth$female)

A. Models

i. Not controlling for age.

Code
library(gtsummary)
model_i <- glm(dead ~ female + schooling + locsize01, family = binomial, data = mhas_personmonth)
summary(model_i)

Call:
glm(formula = dead ~ female + schooling + locsize01, family = binomial, 
    data = mhas_personmonth)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -5.390744   0.032289 -166.955  < 2e-16 ***
female1     -0.169665   0.029420   -5.767 8.06e-09 ***
schooling   -0.063834   0.003982  -16.031  < 2e-16 ***
locsize012  -0.026721   0.041502   -0.644 0.519674    
locsize013  -0.065008   0.050876   -1.278 0.201329    
locsize014  -0.142182   0.041300   -3.443 0.000576 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64671  on 1548755  degrees of freedom
Residual deviance: 64369  on 1548750  degrees of freedom
  (2761 observations deleted due to missingness)
AIC: 64381

Number of Fisher Scoring iterations: 9

ii. Controlling for age linearly.

Code
model_ii <- glm(dead ~ female + schooling + locsize01 + age, family = binomial, data = mhas_personmonth)
summary(model_ii)

Call:
glm(formula = dead ~ female + schooling + locsize01 + age, family = binomial, 
    data = mhas_personmonth)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -11.640750   0.122199 -95.261  < 2e-16 ***
female1      -0.249226   0.029513  -8.445  < 2e-16 ***
schooling    -0.016464   0.003931  -4.189 2.81e-05 ***
locsize012   -0.008555   0.041674  -0.205  0.83735    
locsize013   -0.086159   0.051174  -1.684  0.09225 .  
locsize014   -0.130262   0.041544  -3.135  0.00172 ** 
age           0.082496   0.001476  55.907  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64671  on 1548755  degrees of freedom
Residual deviance: 61321  on 1548749  degrees of freedom
  (2761 observations deleted due to missingness)
AIC: 61335

Number of Fisher Scoring iterations: 9

iii. Controlling for age and age-squared.

Code
mhas_personmonth <- mhas_personmonth %>%
  mutate(age_squared = age^2)

model_iii <- glm(dead ~ female + schooling + locsize01 + age + age_squared, family = binomial, data = mhas_personmonth)
summary(model_iii)

Call:
glm(formula = dead ~ female + schooling + locsize01 + age + age_squared, 
    family = binomial, data = mhas_personmonth)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.300e+01  6.703e-01 -19.389  < 2e-16 ***
female1     -2.510e-01  2.952e-02  -8.501  < 2e-16 ***
schooling   -1.643e-02  3.930e-03  -4.182 2.90e-05 ***
locsize012  -9.293e-03  4.168e-02  -0.223  0.82356    
locsize013  -8.644e-02  5.117e-02  -1.689  0.09117 .  
locsize014  -1.308e-01  4.154e-02  -3.149  0.00164 ** 
age          1.177e-01  1.716e-02   6.858 6.97e-12 ***
age_squared -2.244e-04  1.090e-04  -2.058  0.03956 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64671  on 1548755  degrees of freedom
Residual deviance: 61316  on 1548748  degrees of freedom
  (2761 observations deleted due to missingness)
AIC: 61332

Number of Fisher Scoring iterations: 9

iv. Controlling for 5-year age groups (top-coded at 85+)

Code
mhas_personmonth <- mhas_personmonth %>%
  mutate(age_group = cut(age, breaks = seq(0, 90, by = 5), right = FALSE))
         
model_iv <- glm(dead ~ female  + schooling + locsize01 + age + age_group, family = binomial, data = mhas_personmonth)
summary(model_iv)

Call:
glm(formula = dead ~ female + schooling + locsize01 + age + age_group, 
    family = binomial, data = mhas_personmonth)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -12.26394    0.60920 -20.131  < 2e-16 ***
female1           -0.27562    0.03153  -8.741  < 2e-16 ***
schooling         -0.01958    0.00413  -4.741 2.13e-06 ***
locsize012        -0.02698    0.04457  -0.605  0.54490    
locsize013        -0.10279    0.05516  -1.863  0.06240 .  
locsize014        -0.14470    0.04481  -3.229  0.00124 ** 
age                0.09607    0.01089   8.821  < 2e-16 ***
age_group[55,60)  -0.37843    0.19914  -1.900  0.05739 .  
age_group[60,65)  -0.17697    0.20416  -0.867  0.38605    
age_group[65,70)  -0.13606    0.23197  -0.587  0.55752    
age_group[70,75)  -0.38313    0.26943  -1.422  0.15503    
age_group[75,80)  -0.43826    0.31246  -1.403  0.16074    
age_group[80,85)  -0.48452    0.35846  -1.352  0.17649    
age_group[85,90)  -0.48818    0.40644  -1.201  0.22971    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57413  on 1512938  degrees of freedom
Residual deviance: 55138  on 1512925  degrees of freedom
  (38578 observations deleted due to missingness)
AIC: 55166

Number of Fisher Scoring iterations: 10

v. Any other(s) you may want to try, if any at all.

va. cubed

Code
mhas_personmonth <- mhas_personmonth %>%
  mutate(age_cubed= age^3)

model_va <- glm(dead ~ female + schooling + locsize01 + age + age_cubed, family = binomial, data = mhas_personmonth)
summary(model_va)

Call:
glm(formula = dead ~ female + schooling + locsize01 + age + age_cubed, 
    family = binomial, data = mhas_personmonth)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.263e+01  4.578e-01 -27.587  < 2e-16 ***
female1     -2.512e-01  2.952e-02  -8.510  < 2e-16 ***
schooling   -1.643e-02  3.930e-03  -4.181  2.9e-05 ***
locsize012  -9.444e-03  4.168e-02  -0.227  0.82073    
locsize013  -8.645e-02  5.117e-02  -1.690  0.09110 .  
locsize014  -1.310e-01  4.154e-02  -3.153  0.00162 ** 
age          1.018e-01  8.716e-03  11.675  < 2e-16 ***
age_cubed   -1.028e-06  4.590e-07  -2.240  0.02507 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64671  on 1548755  degrees of freedom
Residual deviance: 61316  on 1548748  degrees of freedom
  (2761 observations deleted due to missingness)
AIC: 61332

Number of Fisher Scoring iterations: 9

vb interaction.

Code
model_vb <- glm(dead ~ female * age + schooling + locsize01, 
               family = binomial, data = mhas_personmonth)
summary(model_vb)

Call:
glm(formula = dead ~ female * age + schooling + locsize01, family = binomial, 
    data = mhas_personmonth)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -11.317536   0.167163 -67.704  < 2e-16 ***
female1      -0.879035   0.225412  -3.900 9.63e-05 ***
age           0.078322   0.002094  37.410  < 2e-16 ***
schooling    -0.016933   0.003934  -4.304 1.68e-05 ***
locsize012   -0.009423   0.041674  -0.226  0.82112    
locsize013   -0.083961   0.051169  -1.641  0.10083    
locsize014   -0.128941   0.041542  -3.104  0.00191 ** 
female1:age   0.008082   0.002868   2.818  0.00483 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64671  on 1548755  degrees of freedom
Residual deviance: 61313  on 1548748  degrees of freedom
  (2761 observations deleted due to missingness)
AIC: 61329

Number of Fisher Scoring iterations: 9

B. Assumptions of Baseline Hazard

  1. Model I. No Age Control:

    • The baseline hazard is constant with respect to age, implying no change in mortality risk as people get older. This assumption suggests that the probability of dying is the same for all ages, regardless of actual age differences. This assumption is unrealistic and runs counter to the literature. Thus, it is important to control for age.
  2. Model II. Controlling for Age Linearly:

    • The baseline hazard changes linearly with age. This implies that each additional year of age has a fixed, additive effect on the risk of death. The model assumes that mortality risk increases (or decreases) at a constant rate with each year, which may not capture more complex age-related risk patterns which vary over the life course.
  3. Model III. Controlling for Age and Age-Squared:

    • In this model, the baseline hazard has a quadratic relationship with age, allowing for non-linear effects. This can model a U-shaped or inverted U-shaped relationship, where mortality risk changes at different rates across ages. For example, risk might accelerate as people age, or it could rise more slowly at older ages. This approach is more flexible than the linear term alone and can better capture complex curivlinear patterns.
  4. Model IV. Controlling for 5-Year Age Groups (Top-Coded at 85+):

    • The baseline hazard changes non-linearly, with distinct levels of risk for each age group. This approach allows for discrete changes in risk between age groups without imposing a continuous relationship. It’s flexible enough to capture sudden increases or stabilizes in risk within specific age brackets and can be especially useful when there’s a marked difference in mortality risk at certain ages.
  5. Models Va and Vb. Other Specifications (e.g., Splines, Interactions):

    • Including interactions (such as age-gender) means that the baseline hazard is not only age-dependent but also varies by other covariates, like sex. For example, it could capture differences in how the risk of death changes with age between men and women.

C. Model Fit Comparison

Code
model_comparison <- data.frame(
  Model = c("Model I", "Model II", "Model III", "Model IV", "Model V_a", "Model V_b"),
  AIC = c(AIC(model_i), AIC(model_ii), AIC(model_iii), AIC(model_iv), AIC(model_va), AIC(model_vb)),
  BIC = c(BIC(model_i), BIC(model_ii), BIC(model_iii), BIC(model_iv), BIC(model_va), BIC(model_vb))
  # Use the BIC vector defined above
)

model_comparison
      Model      AIC      BIC
1   Model I 64381.18 64454.70
2  Model II 61334.68 61420.45
3 Model III 61332.34 61430.37
4  Model IV 55166.14 55337.36
5 Model V_a 61331.52 61429.54
6 Model V_b 61328.73 61426.75

Code
library(broom)
library(gtsummary)
library(tidyverse)
# Create a list of models
models <- list(model_i, model_ii,model_iii,model_iv, model_va, model_vb)

# Extract coefficients, standard errors, and odds ratios from each model
model_summaries <- lapply(seq_along(models), function(i) {
  model <- models[[i]]
  tidy(model) %>%
    mutate(
      Odds_Ratio = exp(estimate),
      Model = paste("Model", i)
    )
})

# Combine summaries into a single table
comparison_table <- bind_rows(model_summaries) %>%
  select(Model, term, estimate, std.error, Odds_Ratio, p.value)


# Pivot the data wider
df_wide <- comparison_table %>%
    pivot_wider(
        names_from = term,
        values_from = c(estimate, std.error, Odds_Ratio, p.value),
        names_glue = "{term}_{.value}"
    )

df_wide
# A tibble: 6 × 69
  Model   `(Intercept)_estimate` female1_estimate schooling_estimate
  <chr>                    <dbl>            <dbl>              <dbl>
1 Model 1                  -5.39           -0.170            -0.0638
2 Model 2                 -11.6            -0.249            -0.0165
3 Model 3                 -13.0            -0.251            -0.0164
4 Model 4                 -12.3            -0.276            -0.0196
5 Model 5                 -12.6            -0.251            -0.0164
6 Model 6                 -11.3            -0.879            -0.0169
# ℹ 65 more variables: locsize012_estimate <dbl>, locsize013_estimate <dbl>,
#   locsize014_estimate <dbl>, age_estimate <dbl>, age_squared_estimate <dbl>,
#   `age_group[55,60)_estimate` <dbl>, `age_group[60,65)_estimate` <dbl>,
#   `age_group[65,70)_estimate` <dbl>, `age_group[70,75)_estimate` <dbl>,
#   `age_group[75,80)_estimate` <dbl>, `age_group[80,85)_estimate` <dbl>,
#   `age_group[85,90)_estimate` <dbl>, age_cubed_estimate <dbl>,
#   `female1:age_estimate` <dbl>, `(Intercept)_std.error` <dbl>, …

Most models have consistent estimates for the female variable, schooling, and locsize01. Model four with age groups had the best AIC and BIC, suggesting the best fit. Model IV has the lowest AIC (55162.42) and BIC (55309.18) values, suggesting it fits the data best among the models listed, with an appropriate balance of fit and complexity. Model III and Model V_b have similar AIC values (61329.07 and 61325.42, respectively) and slightly different BIC values, indicating comparable fits but with subtle differences in complexity or parameter penalties. Model I has the highest AIC and BIC values (64377.68 and 64426.69), indicating it is the least favorable model in terms of fit and complexity.In model 4, being female decreases the odds of the event, more education reduces the odds of the even and smaller or rural areas having reduced odds of the event. Risk of event increases the odds of the event with age. The effect of age is captured in different age categories, but most of the age group coefficients are not statistically significant, except for the group 55-60, which shows a marginally significant reduction in the odds of the event.

D. Logistic and Complementary Log-Log Models

Da. Logistic Model

Code
logit_model <- glm(dead ~ age_group, data = mhas_personmonth, family = binomial(link = "logit"), na.action = na.exclude)

summary(logit_model)

Call:
glm(formula = dead ~ age_group, family = binomial(link = "logit"), 
    data = mhas_personmonth, na.action = na.exclude)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -7.40036    0.16908 -43.768  < 2e-16 ***
age_group[55,60)  0.03536    0.19350   0.183    0.855    
age_group[60,65)  0.71102    0.17793   3.996 6.44e-05 ***
age_group[65,70)  1.22275    0.17364   7.042 1.90e-12 ***
age_group[70,75)  1.44536    0.17293   8.358  < 2e-16 ***
age_group[75,80)  1.86759    0.17262  10.819  < 2e-16 ***
age_group[80,85)  2.29911    0.17272  13.311  < 2e-16 ***
age_group[85,90)  2.77515    0.17355  15.991  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57551  on 1514413  degrees of freedom
Residual deviance: 55439  on 1514406  degrees of freedom
  (37103 observations deleted due to missingness)
AIC: 55455

Number of Fisher Scoring iterations: 10

Db. Clog-Log Model

Code
cloglog_model <- glm(dead ~ age_group, data = mhas_personmonth, family = binomial(link = "cloglog"), na.action = na.exclude)

summary(cloglog_model)

Call:
glm(formula = dead ~ age_group, family = binomial(link = "cloglog"), 
    data = mhas_personmonth, na.action = na.exclude)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -7.40066    0.16903 -43.783  < 2e-16 ***
age_group[55,60)  0.03535    0.19344   0.183    0.855    
age_group[60,65)  0.71070    0.17787   3.996 6.45e-05 ***
age_group[65,70)  1.22202    0.17358   7.040 1.92e-12 ***
age_group[70,75)  1.44437    0.17287   8.355  < 2e-16 ***
age_group[75,80)  1.86592    0.17256  10.813  < 2e-16 ***
age_group[80,85)  2.29638    0.17265  13.301  < 2e-16 ***
age_group[85,90)  2.77057    0.17345  15.973  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57551  on 1514413  degrees of freedom
Residual deviance: 55439  on 1514406  degrees of freedom
  (37103 observations deleted due to missingness)
AIC: 55455

Number of Fisher Scoring iterations: 10
Code
# Predicted probabilities for logistic regression
logit_preds <- predict(logit_model, type = "response")

# Predicted probabilities for complementary-log-log regression
cloglog_preds <- predict(cloglog_model, type = "response")
Code
  #Combine the predicted values with the age group variable for comparison
 comparison_df <- data.frame(age_group = mhas_personmonth$age_group, 
                             logit_preds = logit_preds, 
                             cloglog_preds = cloglog_preds)
 
 #View the comparison dataframe
 View(comparison_df)
 
 plot(comparison_df$logit_preds, comparison_df$cloglog_preds)

Problem Set 6

Code
# Code adapted from Mills (2011): section 3.5
# Expands data into person-months, up to maximum number age in months observed in data
mhas_splitpoisson<-survSplit(mhas0121noNA, cut=c(50,55,60,65,70,75,80,85), start="ageint", end="agecensor", event="dead")

# Sorting data by ID
mhas_splitpoisson<-mhas_splitpoisson[order (mhas_splitpoisson$id, mhas_splitpoisson$ageint),]

#Creating time-varying age variables for different model specifications0
mhas_splitpoisson<-mhas_splitpoisson %>%
  mutate(
  age=trunc(ageint),
  exposure=agecensor-ageint,
  .after = "died0121"
  )

library(dplyr)
library(ggplot2)
#install.packages("AMR")
library(AMR)
mhas_splitpoisson <- mhas_splitpoisson %>% 
  mutate(
  # Create age groups
  age5 = dplyr::case_when(
    age >= 50 & age < 55 ~ "50-54",
    age >= 55 & age < 60 ~ "55-59",
    age >= 60 & age < 65 ~ "60-64",
    age >= 65 & age < 70 ~ "65-69",
    age >= 70 & age < 75 ~ "70-74",
    age >= 75 & age < 80 ~ "75-79",
    age >= 80 & age < 85 ~ "80-84",
    age >= 85             ~ "85+"
  ),
  # Convert to factor
  age5 = factor(
    age5,
    level = c("50-54", "55-59","60-64","65-69","70-74","75-79","80-84","85+")
)
)
Code
fit.poisson<-glm(dead ~ offset(log(exposure)) + as.factor(age5) + female + as.factor(educlevel) + as.factor(locsize01), data=mhas_splitpoisson, family=poisson(link=log))

summary(fit.poisson)

Call:
glm(formula = dead ~ offset(log(exposure)) + as.factor(age5) + 
    female + as.factor(educlevel) + as.factor(locsize01), family = poisson(link = log), 
    data = mhas_splitpoisson)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -4.75489    0.18347 -25.917  < 2e-16 ***
as.factor(age5)55-59     0.07273    0.20391   0.357 0.721323    
as.factor(age5)60-64     0.78169    0.18804   4.157 3.22e-05 ***
as.factor(age5)65-69     1.30473    0.18392   7.094 1.30e-12 ***
as.factor(age5)70-74     1.51745    0.18334   8.277  < 2e-16 ***
as.factor(age5)75-79     1.93465    0.18314  10.564  < 2e-16 ***
as.factor(age5)80-84     2.37316    0.18334  12.944  < 2e-16 ***
as.factor(age5)85+       3.07802    0.18230  16.884  < 2e-16 ***
female                  -0.25105    0.02962  -8.477  < 2e-16 ***
as.factor(educlevel)15  -0.03017    0.03585  -0.841 0.400093    
as.factor(educlevel)68  -0.08432    0.04432  -1.902 0.057126 .  
as.factor(educlevel)919 -0.20891    0.05228  -3.996 6.45e-05 ***
as.factor(locsize01)2   -0.01968    0.04183  -0.470 0.638045    
as.factor(locsize01)3   -0.08353    0.05129  -1.629 0.103416    
as.factor(locsize01)4   -0.15223    0.04192  -3.632 0.000281 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 24070  on 32737  degrees of freedom
Residual deviance: 20851  on 32723  degrees of freedom
  (1390 observations deleted due to missingness)
AIC: 30279

Number of Fisher Scoring iterations: 6
Code
# Fit the baseline Poisson model
poisson_model <- glm(died0121 ~ female + schooling + locsize01 + offset(log(exposure)), 
                     family = poisson(link = "log"), data = mhas_splitpoisson)

# Add interaction terms with time_group
poisson_interaction <- glm(died0121 ~ female * age5 + schooling * age5 + locsize01 * age5 + 
                             offset(log(exposure)), 
                           family = poisson(link = "log"), data = mhas_splitpoisson)

# Compare models with and without interaction terms
anova(poisson_model, poisson_interaction, test = "Chisq")
Analysis of Deviance Table

Model 1: died0121 ~ female + schooling + locsize01 + offset(log(exposure))
Model 2: died0121 ~ female * age5 + schooling * age5 + locsize01 * age5 + 
    offset(log(exposure))
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1     31126      31087                          
2     31098      30503 28   583.53 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Inspect coefficients of interaction terms
summary(poisson_interaction)

Call:
glm(formula = died0121 ~ female * age5 + schooling * age5 + locsize01 * 
    age5 + offset(log(exposure)), family = poisson(link = "log"), 
    data = mhas_splitpoisson)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.5720937  0.1209081 -13.002  < 2e-16 ***
female              -0.1924310  0.0837088  -2.299 0.021515 *  
age555-59           -0.4164336  0.1464518  -2.843 0.004462 ** 
age560-64           -0.4777482  0.1380577  -3.460 0.000539 ***
age565-69           -0.3840767  0.1342852  -2.860 0.004234 ** 
age570-74           -0.2823493  0.1335776  -2.114 0.034537 *  
age575-79           -0.1341635  0.1345250  -0.997 0.318613    
age580-84           -0.0383802  0.1384203  -0.277 0.781570    
age585+             -0.2195884  0.1449940  -1.514 0.129908    
schooling           -0.0286637  0.0096422  -2.973 0.002951 ** 
locsize01           -0.0541953  0.0398995  -1.358 0.174371    
female:age555-59    -0.0423197  0.1021751  -0.414 0.678736    
female:age560-64     0.0125306  0.0963209   0.130 0.896493    
female:age565-69     0.0009859  0.0937487   0.011 0.991609    
female:age570-74     0.0212610  0.0933818   0.228 0.819896    
female:age575-79     0.0090579  0.0941534   0.096 0.923359    
female:age580-84     0.0453119  0.0969854   0.467 0.640354    
female:age585+       0.1248468  0.1023414   1.220 0.222501    
age555-59:schooling -0.0046282  0.0118901  -0.389 0.697092    
age560-64:schooling -0.0076138  0.0113003  -0.674 0.500457    
age565-69:schooling -0.0195330  0.0111102  -1.758 0.078728 .  
age570-74:schooling -0.0254528  0.0111905  -2.274 0.022936 *  
age575-79:schooling -0.0093922  0.0114318  -0.822 0.411317    
age580-84:schooling  0.0047896  0.0120441   0.398 0.690872    
age585+:schooling    0.0227079  0.0131055   1.733 0.083150 .  
age555-59:locsize01 -0.0172832  0.0483702  -0.357 0.720859    
age560-64:locsize01  0.0044766  0.0453956   0.099 0.921445    
age565-69:locsize01  0.0096621  0.0441248   0.219 0.826672    
age570-74:locsize01 -0.0024054  0.0438973  -0.055 0.956302    
age575-79:locsize01  0.0169606  0.0441632   0.384 0.700945    
age580-84:locsize01  0.0375485  0.0452355   0.830 0.406501    
age585+:locsize01    0.0262056  0.0470279   0.557 0.577367    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 31538  on 31129  degrees of freedom
Residual deviance: 30503  on 31098  degrees of freedom
  (2998 observations deleted due to missingness)
AIC: 57611

Number of Fisher Scoring iterations: 6
Code
library(gtsummary)

tbl_regression(poisson_interaction, exponentiate = TRUE)
Characteristic IRR1 95% CI1 p-value
female 0.82 0.70, 0.97 0.022
age5
    50-54
    55-59 0.66 0.50, 0.88 0.004
    60-64 0.62 0.47, 0.81 <0.001
    65-69 0.68 0.52, 0.89 0.004
    70-74 0.75 0.58, 0.98 0.035
    75-79 0.87 0.67, 1.14 0.3
    80-84 0.96 0.73, 1.26 0.8
    85+ 0.80 0.60, 1.07 0.13
schooling 0.97 0.95, 0.99 0.003
locsize01 0.95 0.88, 1.02 0.2
female * age5
    female * 55-59 0.96 0.78, 1.17 0.7
    female * 60-64 1.01 0.84, 1.22 0.9
    female * 65-69 1.00 0.83, 1.20 >0.9
    female * 70-74 1.02 0.85, 1.23 0.8
    female * 75-79 1.01 0.84, 1.21 >0.9
    female * 80-84 1.05 0.87, 1.27 0.6
    female * 85+ 1.13 0.93, 1.39 0.2
age5 * schooling
    55-59 * schooling 1.00 0.97, 1.02 0.7
    60-64 * schooling 0.99 0.97, 1.01 0.5
    65-69 * schooling 0.98 0.96, 1.00 0.079
    70-74 * schooling 0.97 0.95, 1.00 0.023
    75-79 * schooling 0.99 0.97, 1.01 0.4
    80-84 * schooling 1.00 0.98, 1.03 0.7
    85+ * schooling 1.02 1.00, 1.05 0.083
age5 * locsize01
    55-59 * locsize01 0.98 0.89, 1.08 0.7
    60-64 * locsize01 1.00 0.92, 1.10 >0.9
    65-69 * locsize01 1.01 0.93, 1.10 0.8
    70-74 * locsize01 1.00 0.92, 1.09 >0.9
    75-79 * locsize01 1.02 0.93, 1.11 0.7
    80-84 * locsize01 1.04 0.95, 1.14 0.4
    85+ * locsize01 1.03 0.94, 1.13 0.6
1 IRR = Incidence Rate Ratio, CI = Confidence Interval

1. Model Purpose

  • Poisson Model: Used for count data and assumes the mean equals the variance. In survival analysis, it approximates hazard rates for event counts over time intervals.

  • Other Models:

    • Exponential, Weibull, Gompertz: Parametric models for survival time, assuming specific distributions for time-to-event.

    • Log-logistic/Log-normal: Allow flexible distributions, accommodating non-monotonic hazards (hazards that increase, then decrease).

    • Cox Model: Semi-parametric, making no assumptions about the baseline hazard.

    • Logit/CLog-log: These model binary outcomes or hazard probabilities rather than continuous survival times.

2. Coefficient Comparison

Sex (REF = 1)

  • Poisson Model: 0.82 (lower compared to others).

  • Other Models: Values generally range between 1.31 (Exponential) and 0.764 (Log-normal), suggesting that the Poisson model estimates a smaller effect of Sex.

Locsize (REF = 1)

  • Locsize coefficients across models are similar in general magnitude. For the Poisson model:

    • zzz: 0.96

    • z2z^2z2: 0.36

    • z3z^3z3: 0.96

  • Comparatively, Gompertz and Weibull models yield slightly higher estimates (around 1.02–1.10). The Poisson model coefficients are relatively conservative.

3. Schooling

  • Poisson Model: 0.97

  • Other Models: Coefficients for schooling hover between 0.96–1.01, showing a consistent, minor negative effect.

The similarity indicates that Schooling has a stable effect regardless of the model specification.

4. Age Groups

  • In Poisson, age group coefficients are:

    • 55–60: 0.86

    • 60–65: 0.62

    • 70–75: 0.75

    • 85–90: 0.8

  • Other Models: Similar trends appear (e.g., age 85–90 has decreasing effects in the Log-normal and Log-logistic models).
    However, Gompertz and Weibull models tend to have slightly stronger effects compared to Poisson.

5. Interaction Terms (e.g., Female * Age)

  • The Poisson model interacts terms (like Female by Age) relatively conservatively compared to logit models or parametric survival models.

  • For example:

    • Poisson: Estimates for interactions like Female 55–60 and Female 65–70 are not as pronounced as in the Gompertz/Log-logistic models.

6. Model Interpretation

  • Poisson: Estimates event counts directly and assumes a constant hazard rate across intervals. Its simplicity may limit its flexibility in survival analysis.

  • Other Survival Models:

    • Weibull/Gompertz: Allow hazard rates to increase or decrease over time.

    • Log-logistic: Handles hazards with peaks (non-monotonic).

    • Cox Model: Provides flexibility without specifying a baseline hazard function.

Summary of Key Differences

  1. Magnitude of Effects:

    • Poisson estimates (e.g., Sex 0.82) tend to be smaller and more conservative compared to Exponential, Weibull, or Gompertz.
  2. Flexibility:

    • Poisson assumes a fixed mean-variance relationship. Survival models like Weibull, Log-logistic, and Cox provide more flexibility to fit hazards and survival curves.
  3. Suitability:

    • Use Poisson for counts and simple hazards.

    • Parametric models like Gompertz and Weibull are more appropriate for complex time-to-event distributions.