1 Homework 1

1.1 Info

  • Problem: Analyze the child obesity data with the model specified in the example and interpret the results.

  • Data: The data are from the Muscatine Coronary Risk Factor (MCRF) study, a longitudinal survey of school-age children in Muscatine, Iowa. The MCRF study had the goal of examining the development and persistence of risk factors for coronary disease in children. In the MCRF study, weight and height measurements of five cohorts of children, initially aged 5-7, 7-9, 9-11, 11-13, and 13-15 years, were obtained biennially from 1977 to 1981. Data were collected on 4856 boys and girls. On the basis of a comparison of their weight to age-gender specific norms, children were classified as obese or not obese. The goal of the analysis is to determine whether the risk of obesity increases with age and whether patterns of change in obesity are the same for boys and girls.

Column 1: ID
Column 2: Gender, 0 = Male, 1 = Female
Column 3: Baseline Age
Column 4: Current Age, Age denotes mid-point of age-cohort
Column 5: Occasions of measurements
Column 6: Obesity Status, 1 = Obese, 0 = Non-Obese, . = Missing

1.2 Data management

pacman::p_load(magrittr, dplyr, knitr, rmdformats, geepack, repolr, geesmv)

dta1 <- read.table("muscatine-data.txt", h=F, na.strings = '.')

names(dta1) <- c("ID","Gender","Age0","Age","Occasion","Obese")

head(dta1)
##   ID Gender Age0 Age Occasion Obese
## 1  1      0    6   6        1     1
## 2  1      0    6   8        2     1
## 3  1      0    6  10        3     1
## 4  2      0    6   6        1     1
## 5  2      0    6   8        2     1
## 6  2      0    6  10        3     1
str(dta1)
## 'data.frame':    14568 obs. of  6 variables:
##  $ ID      : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ Gender  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age0    : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Age     : int  6 8 10 6 8 10 6 8 10 6 ...
##  $ Occasion: int  1 2 3 1 2 3 1 2 3 1 ...
##  $ Obese   : int  1 1 1 1 1 1 1 1 1 1 ...

1.3 Descriptive info

dta1t <- dta1 %>% 
  mutate(Occasion = factor(Occasion, labels=c("1977", "1979", "1981"), ordered=T),
         Gender = factor(ifelse(Gender == 0, "M", "F")),
         Obese = factor(ifelse(Obese == 1, "Obese", "Non-Obese")))

ftable(dta1t, row.vars = c(2, 4), col.vars = c(5,6))
##            Occasion      1977            1979            1981      
##            Obese    Non-Obese Obese Non-Obese Obese Non-Obese Obese
## Gender Age                                                         
## F      6                  141    23         0     0         0     0
##        8                  294    58       265    55         0     0
##        10                 270    92       276    87       268    90
##        12                 291    91       279    99       278    92
##        14                 300    89       226    64       256    73
##        16                   0     0       250    87       226    56
##        18                   0     0         0     0       140    37
## M      6                  174    15         0     0         0     0
##        8                  289    67       296    54         0     0
##        10                 312    84       298    77       308    83
##        12                 281    90       299    88       290    90
##        14                 307    73       233    65       269    78
##        16                   0     0       251    67       224    54
##        18                   0     0         0     0       153    34

HARD to present in percentage for each occasion

par(mfrow=c(1,2))
barplot(prop.table(with(subset(dta1, Gender=="0"), 
                           ftable(Obese, Age)), m=2), 
        xlab="Age \n (6, 8, 10, 12, 14, 16, 18 y/o)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="Male", 
        beside=T)
legend('topleft', c("Non-Obese","Obese"), 
       col=c("black","gray"), 
       pch=15, bty='n', cex=.5)
barplot(prop.table(with(subset(dta1, Gender=="1"), 
                         ftable(Obese, Age)), m=2), 
        xlab="Age \n (6, 8, 10, 12, 14, 16, 18 y/o)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="Female", 
        beside=T)

The pattern is similar in male and female. However, male with higher non-obese proportion.

1.4 Marginal models: GEE

Model: Letting Yij = 1 if the ith child is classified as obese at the jth occasion, and Yij = 0 otherwise.
Assume that the marginal probability of obesity at each occasion follows the logistic model with age, gender, and gender × age effects, where age is defined by age at the jth occasion - 12 years.
It can be further assumed that the log odds of obesity changes curvilinearly with age (i.e., quadratic age trend), but the trend over time is allowed to be different for girls and boys.

1.4.1 m0

All

dta1_m <- dta1 %>% 
  mutate(Age = Age - 12,
         Age2 = Age^2)
    
summary(m0 <- geeglm(Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2, data=dta1_m, id = ID, family = binomial, corstr = "ar1"))
## 
## Call:
## geeglm(formula = Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2, 
##     family = binomial, data = dta1_m, id = ID, corstr = "ar1")
## 
##  Coefficients:
##              Estimate   Std.err    Wald Pr(>|W|)    
## (Intercept) -1.205697  0.050734 564.783  < 2e-16 ***
## Gender       0.105306  0.071357   2.178  0.14001    
## Age          0.042264  0.013433   9.899  0.00165 ** 
## Age2        -0.018092  0.003416  28.044 1.19e-07 ***
## Gender:Age   0.002900  0.018563   0.024  0.87585    
## Gender:Age2  0.003495  0.004715   0.549  0.45857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)   0.9943 0.02807
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha   0.6106 0.02036
## Number of clusters:   4856  Maximum cluster size: 3

1.4.2 m1

omit the gender × age and gender × age2

summary(m1 <- geeglm(Obese ~ Gender + Age + Age2, data=dta1_m, id = ID, family = binomial, corstr = "ar1"))
## 
## Call:
## geeglm(formula = Obese ~ Gender + Age + Age2, family = binomial, 
##     data = dta1_m, id = ID, corstr = "ar1")
## 
##  Coefficients:
##             Estimate  Std.err   Wald Pr(>|W|)    
## (Intercept) -1.21897  0.04788 648.17  < 2e-16 ***
## Gender       0.13145  0.06288   4.37    0.037 *  
## Age          0.04387  0.00926  22.43  2.2e-06 ***
## Age2        -0.01630  0.00235  48.05  4.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.994   0.028
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha     0.61  0.0203
## Number of clusters:   4856  Maximum cluster size: 3

1.4.3 Comparison

sjPlot::tab_model(m0, m1, show.obs=F, show.ngroups = F)
  Obese Obese
Predictors Odds Ratios CI p Odds Ratios CI p
(Intercept) 0.30 0.27 – 0.33 <0.001 0.30 0.27 – 0.32 <0.001
Gender 1.11 0.97 – 1.28 0.140 1.14 1.01 – 1.29 0.037
Age 1.04 1.02 – 1.07 0.002 1.04 1.03 – 1.06 <0.001
Age2 0.98 0.98 – 0.99 <0.001 0.98 0.98 – 0.99 <0.001
Gender * Age 1.00 0.97 – 1.04 0.876
Gender * Age2 1.00 0.99 – 1.01 0.459

1.5 Estimated Probability Plot

pacman::p_load(tidyr, ggplot2)

dta1_p <- dta1 %>% 
  mutate(Gender = factor(ifelse(Gender == 0, "Boys", "Girls")))

dta1p <- data.frame(dta1_p, id=row.names(dta1_p))

yhat_m0 <- data.frame(id=row.names(fitted(m0)), 
                      phat=fitted(m0))

dta1_m0 <- inner_join(dta1p, yhat_m0, by="id")

ggplot(dta1_m0, aes(Age, group = Gender)) +
  geom_line(data = dta1_m0, aes(y = phat, color = Gender)) + 
  labs(x="Age (years)", 
       y="Estimated Probability of Obesity") +
  scale_y_continuous(limits = c(0.1, 0.27)) +
  scale_x_continuous(limits = c(6, 18),
                        breaks = c(6, 8, 10, 12, 14, 16, 18),
                        minor_breaks = NULL) +
  theme_minimal()+
  theme(legend.position=c(.9, .2))

The estimated probability of obesity for boys at ages 6, 10, 14, and 18 is 0.12, 0.20, 0.23, and 0.18, respectively; for girls, the estimated probability of obesity at ages 6, 10, 14, and 18 is 0.13, 0.22, 0.26, and 0.20.

1.6 Reference

Source: Woolson, R.F., & Clarke, W.R. (1984). Analysis of categorical incompletel longitudinal data. Journal of the Royal Statistical Society, Series A, 147, 87-99.
Reported in Fitzmaurice, G.M., Laired, N.M., & Ware, J.H. (2004). Applied Longitudinal Data Analysis. pp. 306-312.
Reported in Fitzmaurice, G.M., Laired, N.M., & Ware, J.H. (2011). Applied Longitudinal Data Analysis. pp. 364-374.

2 Homework 2

2.1 Info

  • Problem: Replicate the results of multiple sources analysis reported in the Cantabrian example.

  • Data: The data are from a survey of primary care patients in Cantabrian, Spain, to estimate the prevalence of psychiatric illness. The prevalence for each of the sexes is estimated from data provided by the general practitioner and from completing the general health questionnaire.

Column 1: Subject ID
Column 2: Illness = 1, Otherwise = 0
Column 3: Gender ID
Column 4: General practitioner = GP, General health questionnaire = GPQ

2.2 Data management

dta2 <- read.table("cantabrian.txt", h=T)

names(dta2) <- c("ID","Health","Gender","Type")

head(dta2)
##   ID Health Gender Type
## 1  1      0      M   GP
## 2  1      0      M  GHQ
## 3  2      0      M   GP
## 4  2      0      M  GHQ
## 5  3      0      M   GP
## 6  3      0      M  GHQ
str(dta2)
## 'data.frame':    1646 obs. of  4 variables:
##  $ ID    : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ Health: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gender: chr  "M" "M" "M" "M" ...
##  $ Type  : chr  "GP" "GHQ" "GP" "GHQ" ...
dta2t <- dta2 %>% 
  mutate(Health = factor(Health, 
                         levels=0:1, 
                         labels=c("Illness","Otherwise")),
         Gender = factor(Gender, levels = c("M", "F")),
         Type = factor(Type, 
                       levels=c("GP","GHQ")))

2.3 Descriptive info

ftable(dta2t, row.vars = c('Gender','Health'), col.vars = c('Type'))
##                  Type  GP GHQ
## Gender Health                
## M      Illness        287 244
##        Otherwise       37  80
## F      Illness        420 306
##        Otherwise       79 193
prop.table(dta2p <- (ftable(dta2t, row.vars = c('Gender','Health'), col.vars = c('Type'))), 2)
##                  Type     GP    GHQ
## Gender Health                      
## M      Illness        0.3487 0.2965
##        Otherwise      0.0450 0.0972
## F      Illness        0.5103 0.3718
##        Otherwise      0.0960 0.2345

The prevalence in women is higher than in men.

par(mfrow=c(1,2))

barplot(prop.table(with(subset(dta2t, Type=="GP"), 
                           ftable(Gender, Health)), m=2), 
        xlab="Health \n (Illness, Otherwise)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="GP", 
        beside=T)

legend('topleft', c("Male","Female"), 
       col=c("black","gray"), 
       pch=15, bty='n', cex=.5)

barplot(prop.table(with(subset(dta2t, Type=="GHQ"), 
                           ftable(Gender, Health)), m=2), 
        xlab="Health \n (Illness, Otherwise)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="GHQ", 
        beside=T)

2.4 GEE

Multivariate binary responses

summary(m0 <- geeglm(Health ~ Gender + Type + Gender*Type, data=dta2, id = ID, corstr="exchangeable", family = binomial))
## 
## Call:
## geeglm(formula = Health ~ Gender + Type + Gender * Type, family = binomial, 
##     data = dta2, id = ID, corstr = "exchangeable")
## 
##  Coefficients:
##                Estimate Std.err  Wald Pr(>|W|)    
## (Intercept)     -0.4609  0.0919 25.14  5.3e-07 ***
## GenderM         -0.6542  0.1583 17.09  3.6e-05 ***
## TypeGP          -1.2099  0.1265 91.46  < 2e-16 ***
## GenderM:TypeGP   0.2765  0.2283  1.47     0.23    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0786
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.298  0.0474
## Number of clusters:   823  Maximum cluster size: 2
sjPlot::tab_model(m0, show.obs=F, show.ngroups = F)
  Health
Predictors Odds Ratios CI p
(Intercept) 0.63 0.53 – 0.76 <0.001
Gender [M] 0.52 0.38 – 0.71 <0.001
Type [GP] 0.30 0.23 – 0.38 <0.001
Gender [M] * Type [GP] 1.32 0.84 – 2.06 0.226

The prevalence in women is higher than in men, irrespective of the case finder being used.

The GP detects fewer cases than the GHQ, irrespective of the subject’s gender.

2.5 Reference

Source: Everitt, B.S., & Dunn, G. (2001). Applied Multivariate Data Analysis, 2nd Ed. p. 237