Problem: Replicate the results of analysis reported in the adolescent smoking example of repeated binary outcomes.
Data: A sample of 1,760 Australian adolescents whose smoking patterns were recorded every 6 months for 3 years. Parental smoking and gender variables were also recorded. The goal was to examine if the probability of smoking depends on sex, parental smoking and the wave of the study and how the probability of smoking varies from one individual adolescent to another.
Column 1: Subject ID
Column 2: Gender ID, 1 = Female, 0 = Male
Column 3: Indicator variable for parental smoking 1 = Yes, 0 = No
Column 4: Wave of the study
Column 5: Smoking regularly, 1 = Yes, 0 = No
pacman::p_load(magrittr, vcrpart, ggplot2, dplyr, tidyr, ltm, eRm, lme4)
dta1 <- read.table("adolescentSmoking.txt", h=T)
names(dta1) <- c("ID","Gender","Parent", "Wave", "Smoking")
head(dta1)
## ID Gender Parent Wave Smoking
## 1 1 1 0 1 0
## 2 1 1 0 2 0
## 3 1 1 0 4 0
## 4 1 1 0 5 0
## 5 1 1 0 6 0
## 6 2 0 0 1 0
dta1b <- dta1 %>%
mutate(ID = factor(ID),
Gender = factor(Gender,
levels=0:1,
labels=c("Male","Female")),
Parent = factor(Parent,
levels=0:1,
labels=c("No","Yes")),
Wave = factor(Wave,
levels=1:6, labels=c(1:6)),
Smoking = factor(Smoking,
levels=0:1,
labels=c("No","Yes")))
str(dta1b)
## 'data.frame': 8730 obs. of 5 variables:
## $ ID : Factor w/ 1760 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ Gender : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 1 1 1 1 1 ...
## $ Parent : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wave : Factor w/ 6 levels "1","2","3","4",..: 1 2 4 5 6 1 2 3 4 5 ...
## $ Smoking: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
dta1 <- dta1 %>%
mutate(ID = factor(ID),
Smoking = factor(Smoking))
m1 <- olmm(Smoking ~ Parent + Gender*Wave + re(1 + Wave | ID), data = dta1,
family=cumulative(link="probit"))
summary(m1)
## Linear Mixed Model fit by Marginal Maximum
## Likelihood with Gauss-Hermite Quadrature
##
## Family: cumulative probit
## Formula: Smoking ~ Parent + Gender * Wave + re(1 + Wave | ID)
## Data: dta1
##
## Goodness of fit:
## AIC BIC logLik
## 4082.974 4139.57 -2033.487
##
## Random effects:
## Subject: ID
## Variance StdDev Corr
## (Intercept) 8.68824460 2.94758284 (Intercept)
## Wave 0.09864368 0.31407591 -0.14919258
## Number of obs: 8730, subjects: 1760
##
## Global fixed effects:
## Estimate Std. Error z value
## Parent -1.582880 0.219146 -7.2229
## Gender 0.154144 0.276349 0.5578
## Wave -0.149764 0.077432 -1.9341
## Gender:Wave -0.094836 0.052303 -1.8132
##
## Category-specific fixed effects:
## Estimate Std. Error z value
## 0|1:(Intercept) 4.96294 0.41886 11.848
## [1] 4.962937
## Estimate Std. Error z value
## Parent -1.58288038 0.2191463 -7.2229391
## Gender 0.15414440 0.2763490 0.5577888
## Wave -0.14976383 0.0774324 -1.9341234
## Gender:Wave -0.09483556 0.0523035 -1.8131781
## Subject: ID
## Variance StdDev Corr
## (Intercept) 8.68824460 2.94758284 (Intercept)
## Wave 0.09864368 0.31407591 -0.14919258
dta1w <- spread(dta1b, key = "Wave", value = "Smoking")
names(dta1w) <- c("ID","Gender","Parent","Wave1","Wave2", "Wave3", "Wave4",
"Wave5","Wave6")
w1 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave1), 1))
w2 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave2), 1))
w3 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave3), 1))
w4 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave4), 1))
w5 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave5), 1))
w6 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave6), 1))
dta1p <- rbind(w1,w2,w3,w4,w5,w6)
dta1p <- t(dta1p)
clp <- c((dta1p[3, 1:6]),(dta1p[4, 1:6]))
yt <- rep(c(1,2,3,4,5,6),6)
y <- data.frame(wave=yt, clprop=clp)
plot(y[,1], y[,2], type="n",
xlab="Wave of study", ylab="Proportion of smoker")
lines(dta1p[3, 1:6], lty=1, col="steelblue")
lines(dta1p[4, 1:6], lty=1, col="indianred")
text(6, .12, "Boy", col="steelblue")
text(5, .18, "Girl", col="indianred")
grid()
Source: Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. pp. 241-243.
Problem: Use an appropriate model to analyze the data in PIRLS example of clustered categorical outcomes.
Data: The Progress in International Reading Literacy Study (PIRLS) is an large scale international assessment of reading literacy of young students. For this problem, the data set is the PIRLS 2006 data from only the United States. The question of interest is to predict “enjoy reading” categorical responses from the other co-variates.
Column 1: School ID
Column 2: Gender ID, 1 = Female, 0 = Male
Column 3: US born, 1 = Yes, 0 = No
Column 4: Watch TV, 1 = no time, 2 = up to 1 hour, 3 = 1-3 hours, 4 = 3-5 hours, 5 = 5 hours or more
Column 5: Economically disadvantaged, > 50%, 26 - 50%, 11 - 25%, < 10%
Column 6: Reading score
Column 7: The response to a statement “I enjoy reading”, 1 = disagree a lot, 2 = disagree a little, 3 = agree a little, 4 = agree a lot
dta2 <- read.table("pirls_us.txt", h=T)
names(dta2) <- c("School", "Gender", "US", "TVtime", "Economic", "Reading", "Enjoy")
dta2 <- dta2 %>%
mutate(School = factor(School),
Gender = factor(Gender,
levels=0:1,
labels=c("Male","Female")),
US = factor(US),
TVtime = factor(TVtime,
levels=1:5,
labels=c("no time",
"up to 1 hour",
"1-3 hours",
"3-5 hours",
"5 hours or more")),
Economic = factor(Economic,
levels = c(">50%",
"26-50",
"11-25",
"<10%")),
Enjoy = factor(Enjoy, levels = 1:4))
head(dta2)
## School Gender US TVtime Economic Reading Enjoy
## 1 1 Male 1 1-3 hours <10% 547.0412 1
## 2 1 Male 1 3-5 hours <10% 545.9297 2
## 3 1 Male 1 3-5 hours <10% 587.8460 2
## 4 1 Female 1 1-3 hours <10% 450.8517 2
## 5 1 Male 1 no time <10% 525.2201 3
## 6 1 Male 1 no time <10% 555.4918 3
## Enjoy 1 2 3 4
## US Gender
## 0 Male 41 19 58 84
## Female 9 13 31 103
## 1 Male 332 230 589 943
## Female 155 186 487 1334
## Enjoy 1 2 3 4
## US Economic
## 0 >50% 30 13 50 112
## 26-50 10 8 22 30
## 11-25 5 4 3 15
## <10% 5 7 14 30
## 1 >50% 214 188 439 891
## 26-50 114 95 246 541
## 11-25 60 36 125 293
## <10% 99 97 266 552
## Enjoy 1 2 3 4
## US Economic TVtime
## 0 >50% no time 18 7 17 34
## up to 1 hour 1 0 10 11
## 1-3 hours 5 2 14 28
## 3-5 hours 5 4 8 32
## 5 hours or more 1 0 1 7
## 26-50 no time 8 3 6 8
## up to 1 hour 0 2 2 7
## 1-3 hours 2 1 10 10
## 3-5 hours 0 2 4 3
## 5 hours or more 0 0 0 2
## 11-25 no time 4 1 0 4
## up to 1 hour 1 0 1 2
## 1-3 hours 0 2 2 4
## 3-5 hours 0 1 0 4
## 5 hours or more 0 0 0 1
## <10% no time 2 2 3 5
## up to 1 hour 0 1 4 3
## 1-3 hours 2 3 3 12
## 3-5 hours 1 1 4 9
## 5 hours or more 0 0 0 1
## 1 >50% no time 119 73 140 268
## up to 1 hour 26 39 85 126
## 1-3 hours 31 45 135 239
## 3-5 hours 30 28 65 224
## 5 hours or more 8 3 14 34
## 26-50 no time 54 29 71 125
## up to 1 hour 19 24 39 83
## 1-3 hours 24 23 83 162
## 3-5 hours 15 18 51 145
## 5 hours or more 2 1 2 26
## 11-25 no time 24 10 27 55
## up to 1 hour 11 7 27 41
## 1-3 hours 13 10 47 130
## 3-5 hours 8 8 21 59
## 5 hours or more 4 1 3 8
## <10% no time 44 23 61 73
## up to 1 hour 15 21 59 81
## 1-3 hours 20 30 80 221
## 3-5 hours 14 20 61 147
## 5 hours or more 6 3 5 30
barplot(with(dta2, prop.table(table(Enjoy, US), m=2)), beside=T, legend=T, args.legend=list(x="topleft", cex=.5))
m0 <- ordinal::clmm(Enjoy ~ Gender + US + Reading + (1 | School) + (1 | TVtime) + (1 | Economic), data = dta2)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
## Warning: (2) Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## In addition: Absolute and relative convergence criteria were met
## Cumulative Link Mixed Model fitted with the Laplace approximation
##
## formula: Enjoy ~ Gender + US + Reading + (1 | School) + (1 | TVtime) +
## (1 | Economic)
## data: dta2
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 4614 -5095.06 10208.13 599(1850) 2.82e-01 4.9e+06
##
## Random effects:
## Groups Name Variance Std.Dev.
## School (Intercept) 0.166974 0.40863
## TVtime (Intercept) 0.093257 0.30538
## Economic (Intercept) 0.003375 0.05809
## Number of groups: School 180, TVtime 5, Economic 4
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## GenderFemale 0.6984740 0.0593244 11.774 <2e-16 ***
## US1 -0.1629514 0.1117947 -1.458 0.145
## Reading 0.0058490 0.0004666 12.535 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 1.0564 0.3011 3.509
## 2|3 1.8456 0.3015 6.122
## 3|4 3.1448 0.3041 10.343
Source: PIRLS 2006
Problem: Reproduce the results of analysis of clustered ordinal data reported in the geometry: A-Level example.
Data: The data set contains the results of examination on A-level geometry for 33,276 students from 2,317 institutions in England in 1997. A mean centred average General Certificate of Secondary Education (GCSE) score is derived from all GCSE subjects of the student. This is used as a prior attainment covariate. Also recorded are gender and age of the student. The cohort is aged between 18 and 19 years with a mean of 18.5 years. Institutions were grouped into 11 categories according to their admission policy and type of funding. Eight examination boards were involved in the study although two boards were combined into one.
Column 1: Score on the A-Level Geometry, (A=10, B=8, C=6, D=4, E=2, with F indicating unclassified or fail at 0)
Column 2: The examination board ID, 1=Associate, 2=Cambridge, 3=London, 5=Oxforld, 6=Joint Matriculation, 7=Oxford-Cambridge, 8=WJB
Column 3: Gender ID, 0=Male, 1=Female
Column 4: Age in years, mean-centered
Column 5: Institution average GCSE score, mean-centered
Column 6: Institution type, 1=LEA Maintained Comprehensive, 2=Maintained Selective, 3=Maintained Modern, 4=Grammar Comprehensive, 5=Grammar Selective, 6=Grammar Modern, 7=Independent selective, 8=Independent non-selective, 9=Sixth Form College, 10=Further Education College, 11=Others
Column 7: Institution ID
## score board gender age mgcse itype iid
## 1 8 3 1 1 0.856 7 1001
## 2 8 3 1 -6 0.856 7 1001
## 3 8 3 1 5 0.856 7 1001
## 4 10 3 1 -1 0.856 7 1001
## 5 8 3 1 -5 0.856 7 1001
## 6 10 3 1 3 0.856 7 1001
#
dta3$gender <- factor(ifelse(dta3$gender==1,"Female","Male"))
#
dta3$board <- factor(dta3$board)
#
dta3$iid <- factor(dta3$iid)
#
dta3$itype <- factor(dta3$itype)
#
str(dta3)
## 'data.frame': 33276 obs. of 7 variables:
## $ score : int 8 8 8 10 8 10 10 8 6 4 ...
## $ board : Factor w/ 7 levels "1","2","3","5",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 2 ...
## $ age : int 1 -6 5 -1 -5 3 3 4 -2 1 ...
## $ mgcse : num 0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.657 ...
## $ itype : Factor w/ 11 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ iid : Factor w/ 2317 levels "1001","1002",..: 1 1 1 1 1 1 1 1 1 2 ...
#
ggplot(data=dta3, aes(x=score)) +
geom_histogram(binwidth=1) +
aes(y = ..density..) +
facet_grid(. ~ itype) +
labs(x="Geometry score", y="Density") +
ggtitle("Institution type")
#
ggplot(data=dta3, aes(x=score)) +
geom_histogram(binwidth=1) +
aes(y = ..density..) +
facet_grid(itype ~ board) +
labs(x="Geometry score", y="Density")+
ggtitle("Institution Type by Examination Board")
#
ggplot(dta3, aes(x=score, fill=gender)) +
geom_bar(binwidth=1, position="dodge") +
labs(x="Geometry score", y="Count")
## Warning: Ignoring unknown parameters: binwidth
#
ggplot(dta3, aes(x=mgcse, y=score)) +
geom_point(alpha=I(0.3), cex=0.1) +
stat_smooth(method="lm") +
facet_grid(itype ~ gender) +
labs(y="Geometry score", x="GCSE score")
## `geom_smooth()` using formula 'y ~ x'
#
pacman::p_load(ordinal)
#
summary(dta3.clmm <- clmm2(score ~ gender + age + mgcse + board + itype,
random=iid, data=dta3, Hess = TRUE))
## Warning: clmm2 may not have converged:
## optimizer 'ucminf' terminated with max|gradient|: 0.0160943494036054
## Cumulative Link Mixed Model fitted with the Laplace approximation
##
## Call:
## clmm2(location = score ~ gender + age + mgcse + board + itype,
## random = iid, data = dta3, Hess = TRUE)
##
## Random effects:
## Var Std.Dev
## iid 0.2347193 0.4844784
##
## Location coefficients:
## Estimate Std. Error z value Pr(>|z|)
## genderMale -0.2862 0.0218 -13.1255 < 2.22e-16
## age -0.0020 0.0029 -0.6811 0.49582285
## mgcse 1.5431 0.0407 37.9416 < 2.22e-16
## board2 -0.2158 0.0552 -3.9073 9.3326e-05
## board3 0.0574 0.0452 1.2705 0.20391396
## board5 -0.9067 0.1027 -8.8307 < 2.22e-16
## board6 -0.0038 0.0559 -0.0684 0.94547377
## board7 0.1275 0.0734 1.7370 0.08239029
## board8 0.5117 0.3426 1.4935 0.13530562
## itype2 0.0292 0.0891 0.3276 0.74319435
## itype3 -0.1547 0.1897 -0.8156 0.41470086
## itype4 0.0507 0.0496 1.0222 0.30670678
## itype5 0.0788 0.0736 1.0703 0.28450407
## itype6 -0.9874 0.2539 -3.8892 0.00010056
## itype7 0.2200 0.0532 4.1347 3.5539e-05
## itype8 -0.0442 0.1525 -0.2900 0.77181787
## itype9 0.0064 0.0612 0.1039 0.91725637
## itype10 -0.3173 0.0654 -4.8526 1.2183e-06
## itype11 -0.2345 0.1685 -1.3915 0.16408490
##
## No scale coefficients
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0|2 -2.6390 0.0481 -54.8113
## 2|4 -1.5242 0.0460 -33.1390
## 4|6 -0.4985 0.0452 -11.0218
## 6|8 0.6294 0.0452 13.9192
## 8|10 2.1112 0.0469 44.9725
##
## log-likelihood: -54914.43
## AIC: 109878.85
## Condition number of Hessian: 14033.00
Source: Fielding, A., Yang, M., & Goldstein, H. (2003). Multilevel ordinal models for examination grades. Statistical Modelling, 3, 127-153.