Fit a regression model for parents and another for teachers, separately, and compare the results with that from the multiple sources approach for the Connecticut child surveys example.
The data are obtained from two surveys of children’s mental health. The data set consists of 1,428 children who had both parent and teacher responses, and an additional 1,073 children with only a parent response. In the psychiatric literature, these sources of respondents are often referred to as “informants”.
A standardized measure of childhood psychopathology was used both by parents (Child Behavior Checklist, CBC) and teachers (Teacher Report Forms, TRF) to assess children in the study. The response variable is derived from the externalizing scale, which assesses delinquent and aggressive behavior. The scale has been dichotomized at the cutpoint for borderline/clinical psychopathology.
Source: Zahner, G.E.P., Jacobs, J.H., Freeman, D.H. & Trainor, K.F. (1993). Rural-urban child psychopathology in a northeastern U.S. State: 1986-1989. Journal of the American Academy of Child and Adolescent Psychiatry, 32, 378-387. Reported in Fitzmaurice, G.M., Laird, N.M., & Ware, J.H. (2004). Applied Longitudinal Analysis. p. 434-439.
Column 1: Child ID Column 2: Child’s physical health problems: Fair = Fair to bad , Good Column 3: Single parent status: Single, Otherwise Column 4: Externalizing behavior (Parent report) Column 5: Externalizing behavior (Teachert report)
# load packages
#
pacman::p_load(MASS, gee, tidyverse, geepack)
# input data
dta <- read.table('data/ccs-data.txt', na.string = ".")
#
names(dta) <- c("ID", "Problem", "Single", "Parent", "Teacher")
#
dta <- dta %>%
mutate(ID = factor(ID), Problem = factor(ifelse(Problem == 1, "Y", "N")),
Single = factor(ifelse(Single == 1, "Y", "N")))
#
head(dta)
ID Problem Single Parent Teacher
1 1 Y Y 0 NA
2 2 N N 1 NA
3 3 Y N 0 NA
4 4 Y Y 1 NA
5 5 Y N 0 NA
6 6 Y N 0 NA
#
str(dta)
'data.frame': 2501 obs. of 5 variables:
$ ID : Factor w/ 2501 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Problem: Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 1 1 ...
$ Single : Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 1 1 2 1 ...
$ Parent : int 0 1 0 1 0 0 0 0 0 0 ...
$ Teacher: int NA NA NA NA NA NA NA 0 NA 0 ...
with(dta, ftable(Teacher, Parent))
Parent 0 1
Teacher
0 1030 129
1 165 104
## correlation
cor(dta[,4:5], use="pair")
Parent Teacher
Parent 1.00000 0.29133
Teacher 0.29133 1.00000
# proportion of missing responses
sum(is.na(dta[,4]) | is.na(dta[,5]))/dim(dta)[1]
[1] 0.42903
# stack the parents and teachers report; call them informants
dtaL <- dta %>%
gather(Informant, Report, 4:5)
# sort according to child id
dtaL <- dtaL[order(dtaL$ID), ]
# get rid of annoying row names
rownames(dtaL) <- NULL
# have a look
head(dtaL)
ID Problem Single Informant Report
1 1 Y Y Parent 0
2 1 Y Y Teacher NA
3 2 N N Parent 1
4 2 N N Teacher NA
5 3 Y N Parent 0
6 3 Y N Teacher NA
summary(m1 <- gee(Report ~ Informant + Problem + Single + Informant:Problem,
data = dtaL, id = ID, corstr = "exchangeable", family = binomial))
(Intercept) InformantTeacher ProblemY
-2.15954 0.47123 0.59961
SingleY InformantTeacher:ProblemY
0.63324 -0.42471
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Exchangeable
Call:
gee(formula = Report ~ Informant + Problem + Single + Informant:Problem,
id = ID, data = dtaL, family = binomial, corstr = "exchangeable")
Summary of Residuals:
Min 1Q Median 3Q Max
-0.28412 -0.17650 -0.15708 -0.10396 0.89604
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -2.15394 0.090166 -23.8887 0.088889 -24.2318
InformantTeacher 0.47384 0.114981 4.1210 0.117988 4.0160
ProblemY 0.59957 0.112847 5.3131 0.112781 5.3162
SingleY 0.61372 0.105963 5.7919 0.107593 5.7041
InformantTeacher:ProblemY -0.45729 0.157015 -2.9124 0.157113 -2.9106
Estimated Scale Parameter: 1.001
Number of Iterations: 2
Working Correlation
[,1] [,2]
[1,] 1.0000 0.2682
[2,] 0.2682 1.0000
summary(m1a <- geeglm(Report ~ Informant + Problem + Single +
Informant:Problem + Single:Problem,
data = dtaL, id = ID, corstr = "exchangeable", family = binomial))
Call:
geeglm(formula = Report ~ Informant + Problem + Single + Informant:Problem +
Single:Problem, family = binomial, data = dtaL, id = ID,
corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -2.1662 0.0941 529.88 < 2e-16
InformantTeacher 0.4751 0.1182 16.16 5.8e-05
ProblemY 0.6218 0.1265 24.14 8.9e-07
SingleY 0.6579 0.1595 17.01 3.7e-05
InformantTeacher:ProblemY -0.4591 0.1572 8.53 0.0035
ProblemY:SingleY -0.0794 0.2157 0.14 0.7127
Correlation structure = exchangeable
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1 0.0642
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.267 0.0303
Number of clusters: 2501 Maximum cluster size: 2
sjPlot::tab_model(m1a, show.obs=F, show.ngroups = F)
| Report | |||
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.11 | 0.10 – 0.14 | <0.001 |
| Informant [Teacher] | 1.61 | 1.28 – 2.03 | <0.001 |
| Problem [Y] | 1.86 | 1.45 – 2.39 | <0.001 |
| Single [Y] | 1.93 | 1.41 – 2.64 | <0.001 |
|
Informant [Teacher] * Problem [Y] |
0.63 | 0.46 – 0.86 | 0.004 |
| Problem [Y] * Single [Y] | 0.92 | 0.61 – 1.41 | 0.713 |
plot(m1a, "pearson", which = 1)
grid()
# set theme
ot <- theme_set(theme_bw())
# fortify data with model fits
dta_m1 <- data.frame(na.omit(dtaL), phat = fitted(m1a))
# compare observed with fitted
ggplot(dta_m1, aes(Problem, Report, color = Informant)) +
geom_point(aes(Problem, phat),position = position_dodge(width=.2),
pch = 1, size = rel(2)) +
geom_hline(yintercept = c(mean(dta$Parent, na.rm = T),
mean(dta$Teacher, na.rm = T)), linetype = "dotted")+
stat_summary(fun.data = "mean_cl_boot", position = position_dodge(width=.2)) +
facet_wrap( ~ Single)+
labs(x = "Physical Problem", y = "Probability of report") +
theme(legend.position = c(.1, .88))
Replicate the results of analysis reported in the insomina example.
The data come from a randomized, double-blind clinical trial comparing active hypnotic drug with a placebo in patients who have insomnia problems. The response is the patient’s reported time (in minutes) to fall asleep after going to bed. Patients responded before and after a two-week treatment period. The patients receiving the two treatments were independent samples.
Source: Agresti, A. (2002). Categorical Data Analysis. 2nd Ed. p. 462
Column 1: Subject ID Column 2: Treatment, Active = 1, Placebo = 0 Column 3: Time, Initial = 0, Follow-up = 1 Column 4: Time to sleep, < 20 min. = 1, 20-30 = 2, 30-60 = 3, > 60 min. = 4
#
library(repolr)
#
dta <- read.table("./data/insomnia.txt", h=T)
#
head(dta)
case treat occasion resp
1 1 1 0 1
2 1 1 1 1
3 2 1 0 1
4 2 1 1 1
5 3 1 0 1
6 3 1 1 1
dta_repolr <- repolr(resp ~ treat + occasion + treat:occasion,
subjects="case", data=dta, times=c(1,2), categories=4,
corr.mod="ar1")
#
summary(dta_repolr)
repolr: 2016-02-26 version 3.4
Call:
repolr(formula = resp ~ treat + occasion + treat:occasion, subjects = "case",
data = dta, times = c(1, 2), categories = 4, corr.mod = "ar1")
Coefficients:
coeff se.robust z.robust p.value
cuts1|2 -2.2778 0.2115 -10.7697 0.0000
cuts2|3 -0.9476 0.1774 -5.3416 0.0000
cuts3|4 0.3634 0.1752 2.0742 0.0381
treat 0.0022 0.2386 0.0092 0.9927
occasion 1.0257 0.1564 6.5582 0.0000
treat:occasion 0.7511 0.2478 3.0311 0.0024
Correlation Structure: ar1
Estimated Correlation: 0.217
# Relabel
dta$treat <- factor(ifelse(dta$treat==1, "Active", "Placebo"))
#
dta$occasion <- factor(ifelse(dta$occasion==1, "Follow-up", "Initial"))
#
dta.tbl <- ftable(dta$treat, dta$occasion , dta$resp)
#
dta.tbl
1 2 3 4
Active Follow-up 40 49 19 11
Initial 12 20 40 47
Placebo Follow-up 31 29 35 25
Initial 14 20 35 51
obsp <- as.numeric(t(prop.table(dta.tbl, m=1)))
obsp
[1] 0.3361 0.4118 0.1597 0.0924 0.1008 0.1681 0.3361 0.3950 0.2583 0.2417
[11] 0.2917 0.2083 0.1167 0.1667 0.2917 0.4250
clp <- c(cumsum(obsp[1:4]), cumsum(obsp[5:8]), cumsum(obsp[9:12]),
cumsum(obsp[13:16]))
clp
[1] 0.336 0.748 0.908 1.000 0.101 0.269 0.605 1.000 0.258 0.500 0.792 1.000
[13] 0.117 0.283 0.575 1.000
#
yt <-rep(c(20,25,45,60),4)
#
y <- data.frame(time2sleep=yt, clprop=clp)
#
plot(y[,1], y[,2], type="n",
xlab="Time to Sleep (min.)", ylab="Cumulative proportion")
#
lines(y[1:4, 1], y[1:4,2], lty=1, col="steelblue")
lines(y[5:8, 1], y[5:8,2], lty=5, col="steelblue")
lines(y[9:12, 1], y[9:12,2], lty=2, col="indianred")
lines(y[13:16,1], y[13:16,2], lty=4, col="indianred")
#
text(22, .12, "T,I")
text(22, .47, "T,F")
text(22, .20, "P,I")
text(22, .33, "P,F")
#
grid()
The active drug group and the placebo groups initially fell asleep at about the same rate. Two weeks later, at the follow-up those with the active treatment tended to fall asleep more quickly.
Analyze the child obesity data with the model specified in the example and interpret the results.
The data are from the Muscatine Coronary Risk Factor (MCRF) study, a longitudinal survey of school-age children in Muscatine, Iowa. The MCRF study had the goal of examining the development and persistence of risk factors for coronary disease in children. In the MCRF study, weight and height measurements of five cohorts of children, initially aged 5-7, 7-9, 9-11, 11-13, and 13-15 years, were obtained biennially from 1977 to 1981. Data were collected on 4856 boys and girls. On the basis of a comparison of their weight to age-gender specific norms, children were classified as obese or not obese. The goal of the analysis is to determine whether the risk of obesity increases with age and whether patterns of change in obesity are the same for boys and girls.
Source: Woolson, R.F., & Clarke, W.R. (1984). Analysis of categorical incompletel longitudinal data. Journal of the Royal Statistical Society, Series A, 147, 87-99. Reported in Fitzmaurice, G.M., Laired, N.M., & Ware, J.H. (2004). Applied Longitudinal Data Analysis. pp. 306-312.
Column 1: ID Column 2: Gender, 0 = Male, 1 = Female Column 3: Baseline Age Column 4: Current Age, Age denotes mid-point of age-cohort Column 5: Occasions of measurements Column 6: Obesity Status, 1 = Obese, 0 = Non-Obese, . = Missing
fL <- "data/muscatine-data.txt"
dta <- read.table(fL, h=F, na.strings = '.')
dta <- na.omit(dta)
names(dta) <- c("ID","Gender","Age0","Age","Occasion","Obese")
dta$Gender <- ifelse(dta$Gender == 0, "M", "F")
head(dta)
ID Gender Age0 Age Occasion Obese
1 1 M 6 6 1 1
2 1 M 6 8 2 1
3 1 M 6 10 3 1
4 2 M 6 6 1 1
5 2 M 6 8 2 1
6 2 M 6 10 3 1
ftable(dta, row.vars = c(2, 4), col.vars = c(5,6))
Occasion 1 2 3
Obese 0 1 0 1 0 1
Gender Age
F 6 141 23 0 0 0 0
8 294 58 265 55 0 0
10 270 92 276 87 268 90
12 291 91 279 99 278 92
14 300 89 226 64 256 73
16 0 0 250 87 226 56
18 0 0 0 0 140 37
M 6 174 15 0 0 0 0
8 289 67 296 54 0 0
10 312 84 298 77 308 83
12 281 90 299 88 290 90
14 307 73 233 65 269 78
16 0 0 251 67 224 54
18 0 0 0 0 153 34
par(mfrow=c(1,2))
barplot(prop.table(with(subset(dta, Gender=="M"),
ftable(Obese, Age)), m=2),
xlab="Age \n (6, 8, 10, 12, 14, 16, 18 years)",
ylab="Proportion",
ylim=c(0,1),
main="Male",
beside=T)
legend('topleft', c("Non-Obese","Obese"),
col=c("black","gray"),
pch=15, bty='n', cex=.5)
barplot(prop.table(with(subset(dta, Gender=="F"),
ftable(Obese, Age)), m=2),
xlab="Age \n (6, 8, 10, 12, 14, 16, 18 years)",
ylab="Proportion",
ylim=c(0,1),
main="Female",
beside=T)
Observed by the panels, I found an age effect on obesity and did not find gender effect.
Model: Letting Yij = 1 if the ith child is classified as obese at the jth occasion, and Yij = 0 otherwise. Assume that the marginal probability of obesity at each occasion follows the logistic model with age, gender, and gender x age effects, where age is defined by age at the jth occasion - 12 years. It can be further assumed that the log odds of obesity changes curvilinearly with age (i.e., quadratic age trend), but the trend over time is allowed to be different for girls and boys.
dta_c <- dta %>%
mutate(Age = Age - 12,
Age2 = Age^2)
summary(m0 <- geeglm(Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2, data=dta_c, id = ID, family = binomial, corstr = "ar1"))
Call:
geeglm(formula = Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2,
family = binomial, data = dta_c, id = ID, corstr = "ar1")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.10039 0.05018 480.89 < 2e-16
GenderM -0.10531 0.07136 2.18 0.14001
Age 0.04516 0.01281 12.43 0.00042
Age2 -0.01460 0.00325 20.18 7e-06
GenderM:Age -0.00290 0.01856 0.02 0.87585
GenderM:Age2 -0.00349 0.00471 0.55 0.45857
Correlation structure = ar1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 0.994 0.0281
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.611 0.0204
Number of clusters: 4856 Maximum cluster size: 3
Occasion and quadratic Age had a significant effect on Obesity.
summary(m1 <- geeglm(Obese ~ Gender + Age + Age2, data=dta_c, id = ID, family = binomial, corstr = "ar1"))
Call:
geeglm(formula = Obese ~ Gender + Age + Age2, family = binomial,
data = dta_c, id = ID, corstr = "ar1")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.08751 0.04724 530.08 < 2e-16
GenderM -0.13145 0.06288 4.37 0.037
Age 0.04387 0.00926 22.43 2.2e-06
Age2 -0.01630 0.00235 48.05 4.2e-12
Correlation structure = ar1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 0.994 0.028
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.61 0.0203
Number of clusters: 4856 Maximum cluster size: 3
Here, we found a gender effect on obesity.
anova(m0,m1)
Analysis of 'Wald statistic' Table
Model 1 Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2
Model 2 Obese ~ Gender + Age + Age2
Df X2 P(>|Chi|)
1 2 0.582 0.75
sjPlot::tab_model(m1, show.obs=F, show.ngroups = F)
| Obese | |||
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.34 | 0.31 – 0.37 | <0.001 |
| Gender [M] | 0.88 | 0.78 – 0.99 | 0.037 |
| Age | 1.04 | 1.03 – 1.06 | <0.001 |
| Age2 | 0.98 | 0.98 – 0.99 | <0.001 |
yhat_m1 <- data.frame(ID=row.names(fitted(m1)),
phat=fitted(m1))
dta$ID <- factor(dta$ID)
dta_m1 <- inner_join(dta, yhat_m1, by="ID")
dta_sum <- dta_m1 %>%
group_by(Gender, Age) %>%
summarize(phat = mean(phat))
ggplot(dta_sum, aes(Age, phat, color = Gender, group = Gender)) +
geom_line() +
labs(x="Age (years)",
y="Estimated Probability of Obesity") +
theme_minimal()+
theme(legend.position=c(.9, .2))
Revealed by the results, we can observe that 1. The probability of obesity increased as age increases. 2. Female had larger probability of obesity after 12 years old than male.
Replicate the results of multiple sources analysis reported in the Cantabrian example.
The data are from a survey of primary care patients in Cantabrian, Spain, to estimate the prevalence of psychiatric illness. The prevalence for each of the sexes is estimated from data provided by the general practitioner and from completing the general health questionnaire.
Source: Everitt, B.S., & Dunn, G. (2001). Applied Multivariate Data Analysis, 2nd Ed. p. 237
Column 1: Subject ID Column 2: Illness = 1, Otherweise = 0 Column 3: Gender ID Column 4: General practitioner = GP, General health questionnaire = GPQ
dta <- read.table("data/cantabrian.txt", h=T)
names(dta) <- c("ID","Health","Gender","Type")
head(dta)
ID Health Gender Type
1 1 0 M GP
2 1 0 M GHQ
3 2 0 M GP
4 2 0 M GHQ
5 3 0 M GP
6 3 0 M GHQ
dta_c <- dta %>%
mutate(Health = factor(Health,
levels=0:1,
labels=c("Illness","Otherwise")),
Gender = factor(Gender, levels = c("M", "F")),
Type = factor(Type,
levels=c("GP","GHQ")))
ftable(dta_c, row.vars = c('Gender','Health'), col.vars = c('Type'))
Type GP GHQ
Gender Health
M Illness 287 244
Otherwise 37 80
F Illness 420 306
Otherwise 79 193
prop.table(dta_sum <- (ftable(dta_c, row.vars = c('Gender','Health'), col.vars = c('Type'))), 2)
Type GP GHQ
Gender Health
M Illness 0.3487 0.2965
Otherwise 0.0450 0.0972
F Illness 0.5103 0.3718
Otherwise 0.0960 0.2345
par(mfrow=c(1,2))
barplot(prop.table(with(subset(dta_c, Type=="GP"),
ftable(Gender, Health)), m=2),
xlab="Health \n (Illness, Otherwise)",
ylab="Proportion",
ylim=c(0,1),
main="GP",
beside=T)
legend('topleft', c("Male","Female"),
col=c("black","gray"),
pch=15, bty='n', cex=.5)
barplot(prop.table(with(subset(dta_c, Type=="GHQ"),
ftable(Gender, Health)), m=2),
xlab="Health \n (Illness, Otherwise)",
ylab="Proportion",
ylim=c(0,1),
main="GHQ",
beside=T)
summary(m0 <- geeglm(Health ~ Gender + Type + Gender:Type, data=dta, id = ID, corstr="exchangeable", family = binomial))
Call:
geeglm(formula = Health ~ Gender + Type + Gender:Type, family = binomial,
data = dta, id = ID, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -0.4609 0.0919 25.14 5.3e-07
GenderM -0.6542 0.1583 17.09 3.6e-05
TypeGP -1.2099 0.1265 91.46 < 2e-16
GenderM:TypeGP 0.2765 0.2283 1.47 0.23
Correlation structure = exchangeable
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1 0.0786
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.298 0.0474
Number of clusters: 823 Maximum cluster size: 2
summary(m1 <- geeglm(Health ~ Gender + Type, data=dta, id = ID, corstr="exchangeable", family = binomial))
Call:
geeglm(formula = Health ~ Gender + Type, family = binomial, data = dta,
id = ID, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -0.4888 0.0902 29.4 6.0e-08
GenderM -0.5760 0.1433 16.2 5.8e-05
TypeGP -1.1209 0.1061 111.6 < 2e-16
Correlation structure = exchangeable
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1.01 0.0885
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.295 0.0494
Number of clusters: 823 Maximum cluster size: 2
anova(m0,m1)
Analysis of 'Wald statistic' Table
Model 1 Health ~ Gender + Type + Gender:Type
Model 2 Health ~ Gender + Type
Df X2 P(>|Chi|)
1 1 1.47 0.23
sjPlot::tab_model(m1, show.obs=F, show.ngroups = F)
| Health | |||
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.61 | 0.51 – 0.73 | <0.001 |
| Gender [M] | 0.56 | 0.42 – 0.74 | <0.001 |
| Type [GP] | 0.33 | 0.26 – 0.40 | <0.001 |
The results showed that both Gender and Type were significant preditors for estimating the prevalence of psychiatric illness.
ot <- theme_set(theme_bw())
# fortify data with model fits
dta_m1 <- data.frame(na.omit(dta_c), phat = fitted(m1))
# compare observed with fitted
ggplot(dta_m1, aes(Type, phat, color = Gender)) +
stat_summary(fun.data = "mean_cl_boot", position = position_dodge(width=.2)) +
labs(x = "Physical Problem", y = "Probability of report") +
theme(legend.position = c(.1, .88))