1 Homework 1

1.1 Info

  • Problem: Replicate the results of analysis reported in the adolescent smoking example of repeated binary outcomes.

  • Data: A sample of 1,760 Australian adolescents whose smoking patterns were recorded every 6 months for 3 years. Parental smoking and gender variables were also recorded. The goal was to examine if the probability of smoking depends on sex, parental smoking and the wave of the study and how the probability of smoking varies from one individual adolescent to another.

Column 1: Subject ID
Column 2: Gender ID, 1 = Female, 0 = Male
Column 3: Indicator variable for parental smoking 1 = Yes, 0 = No
Column 4: Wave of the study
Column 5: Smoking regularly, 1 = Yes, 0 = No

1.2 Data management

pacman::p_load(magrittr, vcrpart, ggplot2, dplyr, tidyr, ltm, eRm, lme4)

dta1 <- read.table("adolescentSmoking.txt", h=T)

names(dta1) <- c("ID","Gender","Parent", "Wave", "Smoking")

head(dta1)
##   ID Gender Parent Wave Smoking
## 1  1      1      0    1       0
## 2  1      1      0    2       0
## 3  1      1      0    4       0
## 4  1      1      0    5       0
## 5  1      1      0    6       0
## 6  2      0      0    1       0
dta1b <- dta1 %>% 
  mutate(ID = factor(ID),
         Gender = factor(Gender,
                         levels=0:1, 
                         labels=c("Male","Female")),
         Parent = factor(Parent,
                         levels=0:1, 
                         labels=c("No","Yes")),
         Wave = factor(Wave,
                       levels=1:6, labels=c(1:6)),
         Smoking = factor(Smoking,
                          levels=0:1, 
                          labels=c("No","Yes")))
str(dta1b)
## 'data.frame':    8730 obs. of  5 variables:
##  $ ID     : Factor w/ 1760 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Gender : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Parent : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wave   : Factor w/ 6 levels "1","2","3","4",..: 1 2 4 5 6 1 2 3 4 5 ...
##  $ Smoking: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

1.3 GLMM

dta1 <- dta1 %>% 
  mutate(ID = factor(ID),
         Smoking = factor(Smoking))

m1 <- olmm(Smoking ~ Parent + Gender*Wave + re(1 + Wave | ID), data = dta1,
           family=cumulative(link="probit"))

summary(m1)
## Linear Mixed Model fit by Marginal Maximum
## Likelihood with Gauss-Hermite Quadrature 
## 
##  Family: cumulative probit 
## Formula: Smoking ~ Parent + Gender * Wave + re(1 + Wave | ID) 
##    Data: dta1 
## 
## Goodness of fit:
##       AIC     BIC    logLik
##  4082.974 4139.57 -2033.487
## 
## Random effects:
## Subject: ID 
##             Variance    StdDev      Corr       
## (Intercept)  8.68824460  2.94758284 (Intercept)
## Wave         0.09864368  0.31407591 -0.14919258
## Number of obs: 8730, subjects: 1760
## 
## Global fixed effects:
##              Estimate Std. Error z value
## Parent      -1.582880   0.219146 -7.2229
## Gender       0.154144   0.276349  0.5578
## Wave        -0.149764   0.077432 -1.9341
## Gender:Wave -0.094836   0.052303 -1.8132
## 
## Category-specific fixed effects:
##                 Estimate Std. Error z value
## 0|1:(Intercept)  4.96294    0.41886  11.848
summary(m1)$feMatCe[,1]
## [1] 4.962937
summary(m1)$feMatGe
##                Estimate Std. Error    z value
## Parent      -1.58288038  0.2191463 -7.2229391
## Gender       0.15414440  0.2763490  0.5577888
## Wave        -0.14976383  0.0774324 -1.9341234
## Gender:Wave -0.09483556  0.0523035 -1.8131781
summary(m1)$REmat
## Subject: ID 
##             Variance    StdDev      Corr       
## (Intercept)  8.68824460  2.94758284 (Intercept)
## Wave         0.09864368  0.31407591 -0.14919258

1.4 Plot

dta1w <- spread(dta1b, key = "Wave", value = "Smoking")

names(dta1w) <- c("ID","Gender","Parent","Wave1","Wave2", "Wave3", "Wave4", 
                  "Wave5","Wave6")

w1 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave1), 1))
w2 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave2), 1))
w3 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave3), 1))
w4 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave4), 1))
w5 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave5), 1))
w6 <- as.numeric(prop.table(ftable(dta1w$Gender, dta1w$Wave6), 1))
dta1p <- rbind(w1,w2,w3,w4,w5,w6)
dta1p <- t(dta1p)

clp <- c((dta1p[3, 1:6]),(dta1p[4, 1:6]))
yt <- rep(c(1,2,3,4,5,6),6)
y <- data.frame(wave=yt, clprop=clp)

plot(y[,1], y[,2], type="n",
     xlab="Wave of study", ylab="Proportion of smoker")

lines(dta1p[3, 1:6], lty=1, col="steelblue")
lines(dta1p[4, 1:6], lty=1, col="indianred")
text(6, .12, "Boy", col="steelblue")
text(5, .18, "Girl", col="indianred")
grid()

1.5 Reference

Source: Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. pp. 241-243.

2 Homework 2

2.1 Info

  • Problem: Use an appropriate model to analyze the data in PIRLS example of clustered categorical outcomes.

  • Data: The Progress in International Reading Literacy Study (PIRLS) is an large scale international assessment of reading literacy of young students. For this problem, the data set is the PIRLS 2006 data from only the United States. The question of interest is to predict “enjoy reading” categorical responses from the other co-variates.

Column 1: School ID
Column 2: Gender ID, 1 = Female, 0 = Male
Column 3: US born, 1 = Yes, 0 = No
Column 4: Watch TV, 1 = no time, 2 = up to 1 hour, 3 = 1-3 hours, 4 = 3-5 hours, 5 = 5 hours or more
Column 5: Economically disadvantaged, > 50%, 26 - 50%, 11 - 25%, < 10%
Column 6: Reading score
Column 7: The response to a statement “I enjoy reading”, 1 = disagree a lot, 2 = disagree a little, 3 = agree a little, 4 = agree a lot

2.2 Data management

dta2 <- read.table("pirls_us.txt", h=T)

names(dta2) <- c("School", "Gender", "US", "TVtime", "Economic", "Reading", "Enjoy")

dta2 <- dta2 %>% 
  mutate(School = factor(School),
         Gender = factor(Gender, 
                         levels=0:1, 
                         labels=c("Male","Female")),
         US = factor(US),
         TVtime = factor(TVtime,
                         levels=1:5,
                         labels=c("no time",
                                  "up to 1 hour", 
                                  "1-3 hours",
                                  "3-5 hours",
                                  "5 hours or more")), 
         Economic = factor(Economic,
                           levels = c(">50%", 
                                      "26-50", 
                                      "11-25", 
                                      "<10%")),
         Enjoy = factor(Enjoy, levels = 1:4))

head(dta2)
##   School Gender US    TVtime Economic  Reading Enjoy
## 1      1   Male  1 1-3 hours     <10% 547.0412     1
## 2      1   Male  1 3-5 hours     <10% 545.9297     2
## 3      1   Male  1 3-5 hours     <10% 587.8460     2
## 4      1 Female  1 1-3 hours     <10% 450.8517     2
## 5      1   Male  1   no time     <10% 525.2201     3
## 6      1   Male  1   no time     <10% 555.4918     3
with(dta2, ftable(US, Gender, Enjoy))
##           Enjoy    1    2    3    4
## US Gender                          
## 0  Male           41   19   58   84
##    Female          9   13   31  103
## 1  Male          332  230  589  943
##    Female        155  186  487 1334
with(dta2, ftable(US, Economic, Enjoy))
##             Enjoy   1   2   3   4
## US Economic                      
## 0  >50%            30  13  50 112
##    26-50           10   8  22  30
##    11-25            5   4   3  15
##    <10%             5   7  14  30
## 1  >50%           214 188 439 891
##    26-50          114  95 246 541
##    11-25           60  36 125 293
##    <10%            99  97 266 552
with(dta2, ftable(US, Economic, TVtime, Enjoy))
##                             Enjoy   1   2   3   4
## US Economic TVtime                               
## 0  >50%     no time                18   7  17  34
##             up to 1 hour            1   0  10  11
##             1-3 hours               5   2  14  28
##             3-5 hours               5   4   8  32
##             5 hours or more         1   0   1   7
##    26-50    no time                 8   3   6   8
##             up to 1 hour            0   2   2   7
##             1-3 hours               2   1  10  10
##             3-5 hours               0   2   4   3
##             5 hours or more         0   0   0   2
##    11-25    no time                 4   1   0   4
##             up to 1 hour            1   0   1   2
##             1-3 hours               0   2   2   4
##             3-5 hours               0   1   0   4
##             5 hours or more         0   0   0   1
##    <10%     no time                 2   2   3   5
##             up to 1 hour            0   1   4   3
##             1-3 hours               2   3   3  12
##             3-5 hours               1   1   4   9
##             5 hours or more         0   0   0   1
## 1  >50%     no time               119  73 140 268
##             up to 1 hour           26  39  85 126
##             1-3 hours              31  45 135 239
##             3-5 hours              30  28  65 224
##             5 hours or more         8   3  14  34
##    26-50    no time                54  29  71 125
##             up to 1 hour           19  24  39  83
##             1-3 hours              24  23  83 162
##             3-5 hours              15  18  51 145
##             5 hours or more         2   1   2  26
##    11-25    no time                24  10  27  55
##             up to 1 hour           11   7  27  41
##             1-3 hours              13  10  47 130
##             3-5 hours               8   8  21  59
##             5 hours or more         4   1   3   8
##    <10%     no time                44  23  61  73
##             up to 1 hour           15  21  59  81
##             1-3 hours              20  30  80 221
##             3-5 hours              14  20  61 147
##             5 hours or more         6   3   5  30
barplot(with(dta2, prop.table(table(Enjoy, US), m=2)), beside=T, legend=T, args.legend=list(x="topleft", cex=.5))

2.3 GLMM

m0 <- ordinal::clmm(Enjoy ~ Gender + US + Reading + (1 | School) + (1 | TVtime) + (1 | Economic), data = dta2)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.
## Warning: (2) Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met
summary(m0)
## Cumulative Link Mixed Model fitted with the Laplace approximation
## 
## formula: Enjoy ~ Gender + US + Reading + (1 | School) + (1 | TVtime) +  
##     (1 | Economic)
## data:    dta2
## 
##  link  threshold nobs logLik   AIC      niter     max.grad cond.H 
##  logit flexible  4614 -5095.06 10208.13 599(1850) 2.82e-01 4.9e+06
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  School   (Intercept) 0.166974 0.40863 
##  TVtime   (Intercept) 0.093257 0.30538 
##  Economic (Intercept) 0.003375 0.05809 
## Number of groups:  School 180,  TVtime 5,  Economic 4 
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## GenderFemale  0.6984740  0.0593244  11.774   <2e-16 ***
## US1          -0.1629514  0.1117947  -1.458    0.145    
## Reading       0.0058490  0.0004666  12.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2   1.0564     0.3011   3.509
## 2|3   1.8456     0.3015   6.122
## 3|4   3.1448     0.3041  10.343

2.4 Reference

Source: PIRLS 2006

3 Homework 3

3.1 Info

  • Problem: Reproduce the results of analysis of clustered ordinal data reported in the geometry: A-Level example.

  • Data: The data set contains the results of examination on A-level geometry for 33,276 students from 2,317 institutions in England in 1997. A mean centred average General Certificate of Secondary Education (GCSE) score is derived from all GCSE subjects of the student. This is used as a prior attainment covariate. Also recorded are gender and age of the student. The cohort is aged between 18 and 19 years with a mean of 18.5 years. Institutions were grouped into 11 categories according to their admission policy and type of funding. Eight examination boards were involved in the study although two boards were combined into one.

Column 1: Score on the A-Level Geometry, (A=10, B=8, C=6, D=4, E=2, with F indicating unclassified or fail at 0)
Column 2: The examination board ID, 1=Associate, 2=Cambridge, 3=London, 5=Oxforld, 6=Joint Matriculation, 7=Oxford-Cambridge, 8=WJB
Column 3: Gender ID, 0=Male, 1=Female
Column 4: Age in years, mean-centered
Column 5: Institution average GCSE score, mean-centered
Column 6: Institution type, 1=LEA Maintained Comprehensive, 2=Maintained Selective, 3=Maintained Modern, 4=Grammar Comprehensive, 5=Grammar Selective, 6=Grammar Modern, 7=Independent selective, 8=Independent non-selective, 9=Sixth Form College, 10=Further Education College, 11=Others
Column 7: Institution ID

3.2 Data management

dta3 <- read.table("geometryAL.txt", h=T)

#
head(dta3)
##   score board gender age mgcse itype  iid
## 1     8     3      1   1 0.856     7 1001
## 2     8     3      1  -6 0.856     7 1001
## 3     8     3      1   5 0.856     7 1001
## 4    10     3      1  -1 0.856     7 1001
## 5     8     3      1  -5 0.856     7 1001
## 6    10     3      1   3 0.856     7 1001
#
dta3$gender <- factor(ifelse(dta3$gender==1,"Female","Male"))

#
dta3$board <- factor(dta3$board)

#
dta3$iid <- factor(dta3$iid)

#
dta3$itype <- factor(dta3$itype)

#
str(dta3)
## 'data.frame':    33276 obs. of  7 variables:
##  $ score : int  8 8 8 10 8 10 10 8 6 4 ...
##  $ board : Factor w/ 7 levels "1","2","3","5",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 2 ...
##  $ age   : int  1 -6 5 -1 -5 3 3 4 -2 1 ...
##  $ mgcse : num  0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.856 0.657 ...
##  $ itype : Factor w/ 11 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ iid   : Factor w/ 2317 levels "1001","1002",..: 1 1 1 1 1 1 1 1 1 2 ...
#
ggplot(data=dta3, aes(x=score)) +
 geom_histogram(binwidth=1) +
 aes(y = ..density..) +
 facet_grid(. ~ itype) +
 labs(x="Geometry score", y="Density") +
 ggtitle("Institution type")

#
ggplot(data=dta3, aes(x=score)) +
 geom_histogram(binwidth=1) +
 aes(y = ..density..) +
 facet_grid(itype ~ board) +
 labs(x="Geometry score", y="Density")+
 ggtitle("Institution Type by Examination Board")

#
ggplot(dta3, aes(x=score, fill=gender)) +
 geom_bar(binwidth=1, position="dodge") + 
 labs(x="Geometry score", y="Count")
## Warning: Ignoring unknown parameters: binwidth

#
ggplot(dta3, aes(x=mgcse, y=score)) +
 geom_point(alpha=I(0.3), cex=0.1) +
 stat_smooth(method="lm") +
 facet_grid(itype ~ gender) +
 labs(y="Geometry score", x="GCSE score")
## `geom_smooth()` using formula 'y ~ x'

#
dta3$score <- factor(dta3$score)
#
pacman::p_load(ordinal)

#
summary(dta3.clmm <- clmm2(score ~ gender + age + mgcse + board + itype, 
                          random=iid, data=dta3, Hess = TRUE))
## Warning: clmm2 may not have converged:
##   optimizer 'ucminf' terminated with max|gradient|: 0.0160943494036054
## Cumulative Link Mixed Model fitted with the Laplace approximation
## 
## Call:
## clmm2(location = score ~ gender + age + mgcse + board + itype, 
##     random = iid, data = dta3, Hess = TRUE)
## 
## Random effects:
##           Var   Std.Dev
## iid 0.2347193 0.4844784
## 
## Location coefficients:
##            Estimate Std. Error z value  Pr(>|z|)  
## genderMale  -0.2862   0.0218   -13.1255 < 2.22e-16
## age         -0.0020   0.0029    -0.6811 0.49582285
## mgcse        1.5431   0.0407    37.9416 < 2.22e-16
## board2      -0.2158   0.0552    -3.9073 9.3326e-05
## board3       0.0574   0.0452     1.2705 0.20391396
## board5      -0.9067   0.1027    -8.8307 < 2.22e-16
## board6      -0.0038   0.0559    -0.0684 0.94547377
## board7       0.1275   0.0734     1.7370 0.08239029
## board8       0.5117   0.3426     1.4935 0.13530562
## itype2       0.0292   0.0891     0.3276 0.74319435
## itype3      -0.1547   0.1897    -0.8156 0.41470086
## itype4       0.0507   0.0496     1.0222 0.30670678
## itype5       0.0788   0.0736     1.0703 0.28450407
## itype6      -0.9874   0.2539    -3.8892 0.00010056
## itype7       0.2200   0.0532     4.1347 3.5539e-05
## itype8      -0.0442   0.1525    -0.2900 0.77181787
## itype9       0.0064   0.0612     0.1039 0.91725637
## itype10     -0.3173   0.0654    -4.8526 1.2183e-06
## itype11     -0.2345   0.1685    -1.3915 0.16408490
## 
## No scale coefficients
## 
## Threshold coefficients:
##      Estimate Std. Error z value 
## 0|2   -2.6390   0.0481   -54.8113
## 2|4   -1.5242   0.0460   -33.1390
## 4|6   -0.4985   0.0452   -11.0218
## 6|8    0.6294   0.0452    13.9192
## 8|10   2.1112   0.0469    44.9725
## 
## log-likelihood: -54914.43 
## AIC: 109878.85 
## Condition number of Hessian: 14033.00

3.3 Reference

Source: Fielding, A., Yang, M., & Goldstein, H. (2003). Multilevel ordinal models for examination grades. Statistical Modelling, 3, 127-153.