1 In-class exercises 1:

Fit a regression model for parents and another for teachers, separately, and compare the results with that from the multiple sources approach for the Connecticut child surveys example.

The data are obtained from two surveys of children’s mental health. The data set consists of 1,428 children who had both parent and teacher responses, and an additional 1,073 children with only a parent response. In the psychiatric literature, these sources of respondents are often referred to as “informants”.

A standardized measure of childhood psychopathology was used both by parents (Child Behavior Checklist, CBC) and teachers (Teacher Report Forms, TRF) to assess children in the study. The response variable is derived from the externalizing scale, which assesses delinquent and aggressive behavior. The scale has been dichotomized at the cutpoint for borderline/clinical psychopathology.

Source: Zahner, G.E.P., Jacobs, J.H., Freeman, D.H. & Trainor, K.F. (1993). Rural-urban child psychopathology in a northeastern U.S. State: 1986-1989. Journal of the American Academy of Child and Adolescent Psychiatry, 32, 378-387. Reported in Fitzmaurice, G.M., Laird, N.M., & Ware, J.H. (2004). Applied Longitudinal Analysis. p. 434-439.

Column 1: Child ID Column 2: Child’s physical health problems: Fair = Fair to bad , Good Column 3: Single parent status: Single, Otherwise Column 4: Externalizing behavior (Parent report) Column 5: Externalizing behavior (Teachert report)

1.1 Load data file

# load packages
#
pacman::p_load(MASS, gee, tidyverse, geepack)

# input data
dta <- read.table('data/ccs-data.txt', na.string = ".")

#
names(dta) <- c("ID", "Problem", "Single", "Parent", "Teacher")

#
dta <- dta %>%
       mutate(ID = factor(ID), Problem = factor(ifelse(Problem == 1, "Y", "N")),
                               Single = factor(ifelse(Single == 1, "Y", "N")))

#
head(dta)
  ID Problem Single Parent Teacher
1  1       Y      Y      0      NA
2  2       N      N      1      NA
3  3       Y      N      0      NA
4  4       Y      Y      1      NA
5  5       Y      N      0      NA
6  6       Y      N      0      NA
#
str(dta)
'data.frame':   2501 obs. of  5 variables:
 $ ID     : Factor w/ 2501 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Problem: Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 1 1 ...
 $ Single : Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 1 1 2 1 ...
 $ Parent : int  0 1 0 1 0 0 0 0 0 0 ...
 $ Teacher: int  NA NA NA NA NA NA NA 0 NA 0 ...

1.2 Descrptive Table

with(dta, ftable(Teacher, Parent))
        Parent    0    1
Teacher                 
0              1030  129
1               165  104
## correlation
cor(dta[,4:5], use="pair")
         Parent Teacher
Parent  1.00000 0.29133
Teacher 0.29133 1.00000
# proportion of missing responses
sum(is.na(dta[,4]) | is.na(dta[,5]))/dim(dta)[1]
[1] 0.42903
# stack the parents and teachers report; call them informants
dtaL <- dta %>%
        gather(Informant, Report, 4:5)
# sort according to child id
dtaL <- dtaL[order(dtaL$ID), ]

# get rid of annoying row names
rownames(dtaL) <- NULL

# have a look
head(dtaL)
  ID Problem Single Informant Report
1  1       Y      Y    Parent      0
2  1       Y      Y   Teacher     NA
3  2       N      N    Parent      1
4  2       N      N   Teacher     NA
5  3       Y      N    Parent      0
6  3       Y      N   Teacher     NA

1.3 Model

summary(m1 <- gee(Report ~ Informant + Problem + Single + Informant:Problem,
        data = dtaL, id = ID, corstr = "exchangeable", family = binomial))
              (Intercept)          InformantTeacher                  ProblemY 
                 -2.15954                   0.47123                   0.59961 
                  SingleY InformantTeacher:ProblemY 
                  0.63324                  -0.42471 

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = Report ~ Informant + Problem + Single + Informant:Problem, 
    id = ID, data = dtaL, family = binomial, corstr = "exchangeable")

Summary of Residuals:
     Min       1Q   Median       3Q      Max 
-0.28412 -0.17650 -0.15708 -0.10396  0.89604 


Coefficients:
                          Estimate Naive S.E.  Naive z Robust S.E. Robust z
(Intercept)               -2.15394   0.090166 -23.8887    0.088889 -24.2318
InformantTeacher           0.47384   0.114981   4.1210    0.117988   4.0160
ProblemY                   0.59957   0.112847   5.3131    0.112781   5.3162
SingleY                    0.61372   0.105963   5.7919    0.107593   5.7041
InformantTeacher:ProblemY -0.45729   0.157015  -2.9124    0.157113  -2.9106

Estimated Scale Parameter:  1.001
Number of Iterations:  2

Working Correlation
       [,1]   [,2]
[1,] 1.0000 0.2682
[2,] 0.2682 1.0000
summary(m1a <- geeglm(Report ~ Informant + Problem + Single +
                      Informant:Problem + Single:Problem,
        data = dtaL, id = ID, corstr = "exchangeable", family = binomial))

Call:
geeglm(formula = Report ~ Informant + Problem + Single + Informant:Problem + 
    Single:Problem, family = binomial, data = dtaL, id = ID, 
    corstr = "exchangeable")

 Coefficients:
                          Estimate Std.err   Wald Pr(>|W|)
(Intercept)                -2.1662  0.0941 529.88  < 2e-16
InformantTeacher            0.4751  0.1182  16.16  5.8e-05
ProblemY                    0.6218  0.1265  24.14  8.9e-07
SingleY                     0.6579  0.1595  17.01  3.7e-05
InformantTeacher:ProblemY  -0.4591  0.1572   8.53   0.0035
ProblemY:SingleY           -0.0794  0.2157   0.14   0.7127

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)        1  0.0642
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.267  0.0303
Number of clusters:   2501  Maximum cluster size: 2 

1.4 Parameter Estimates

sjPlot::tab_model(m1a, show.obs=F, show.ngroups = F)
  Report
Predictors Odds Ratios CI p
(Intercept) 0.11 0.10 – 0.14 <0.001
Informant [Teacher] 1.61 1.28 – 2.03 <0.001
Problem [Y] 1.86 1.45 – 2.39 <0.001
Single [Y] 1.93 1.41 – 2.64 <0.001
Informant [Teacher] *
Problem [Y]
0.63 0.46 – 0.86 0.004
Problem [Y] * Single [Y] 0.92 0.61 – 1.41 0.713

1.5 Residual plot - not informative

plot(m1a, "pearson", which = 1)
grid()

# set theme
ot <- theme_set(theme_bw())

# fortify data with model fits
dta_m1 <- data.frame(na.omit(dtaL), phat = fitted(m1a))

# compare observed with fitted
ggplot(dta_m1, aes(Problem, Report, color = Informant)) +
 geom_point(aes(Problem, phat),position = position_dodge(width=.2),
            pch = 1, size = rel(2)) +
 geom_hline(yintercept = c(mean(dta$Parent, na.rm = T),
                           mean(dta$Teacher, na.rm = T)), linetype = "dotted")+
 stat_summary(fun.data = "mean_cl_boot", position = position_dodge(width=.2)) +
 facet_wrap( ~ Single)+
 labs(x = "Physical Problem", y = "Probability of report") +
 theme(legend.position = c(.1, .88))

  1. For kids of single parent status, both informants were more likely to report behavior problems.
  2. The effect of children health problems on report of behavior problems is more apparent on teachers than on parents.

2 In-class Exercises 2:

Replicate the results of analysis reported in the insomina example.

The data come from a randomized, double-blind clinical trial comparing active hypnotic drug with a placebo in patients who have insomnia problems. The response is the patient’s reported time (in minutes) to fall asleep after going to bed. Patients responded before and after a two-week treatment period. The patients receiving the two treatments were independent samples.

Source: Agresti, A. (2002). Categorical Data Analysis. 2nd Ed. p. 462

Column 1: Subject ID Column 2: Treatment, Active = 1, Placebo = 0 Column 3: Time, Initial = 0, Follow-up = 1 Column 4: Time to sleep, < 20 min. = 1, 20-30 = 2, 30-60 = 3, > 60 min. = 4

2.1 Load data file

#
library(repolr)

#
dta <- read.table("./data/insomnia.txt", h=T)

#
head(dta)
  case treat occasion resp
1    1     1        0    1
2    1     1        1    1
3    2     1        0    1
4    2     1        1    1
5    3     1        0    1
6    3     1        1    1

2.2 Model

dta_repolr <- repolr(resp ~ treat + occasion + treat:occasion, 
                     subjects="case", data=dta, times=c(1,2), categories=4, 
                     corr.mod="ar1")
#
summary(dta_repolr)

repolr: 2016-02-26 version 3.4 

Call:
repolr(formula = resp ~ treat + occasion + treat:occasion, subjects = "case", 
    data = dta, times = c(1, 2), categories = 4, corr.mod = "ar1")

Coefficients: 
                coeff     se.robust  z.robust  p.value 
cuts1|2          -2.2778    0.2115   -10.7697    0.0000
cuts2|3          -0.9476    0.1774    -5.3416    0.0000
cuts3|4           0.3634    0.1752     2.0742    0.0381
treat             0.0022    0.2386     0.0092    0.9927
occasion          1.0257    0.1564     6.5582    0.0000
treat:occasion    0.7511    0.2478     3.0311    0.0024

Correlation Structure:  ar1 
Estimated Correlation:  0.217 

2.3 hand-made cumulative prop. against time to sleep

# Relabel
dta$treat <- factor(ifelse(dta$treat==1, "Active", "Placebo"))

#
dta$occasion <- factor(ifelse(dta$occasion==1, "Follow-up", "Initial"))

#
dta.tbl <- ftable(dta$treat, dta$occasion , dta$resp)

#
dta.tbl
                    1  2  3  4
                              
Active  Follow-up  40 49 19 11
        Initial    12 20 40 47
Placebo Follow-up  31 29 35 25
        Initial    14 20 35 51
obsp <- as.numeric(t(prop.table(dta.tbl, m=1)))
obsp
 [1] 0.3361 0.4118 0.1597 0.0924 0.1008 0.1681 0.3361 0.3950 0.2583 0.2417
[11] 0.2917 0.2083 0.1167 0.1667 0.2917 0.4250
clp <- c(cumsum(obsp[1:4]), cumsum(obsp[5:8]), cumsum(obsp[9:12]), 
         cumsum(obsp[13:16]))
clp
 [1] 0.336 0.748 0.908 1.000 0.101 0.269 0.605 1.000 0.258 0.500 0.792 1.000
[13] 0.117 0.283 0.575 1.000
#
yt <-rep(c(20,25,45,60),4)

#
y <- data.frame(time2sleep=yt, clprop=clp)

#
plot(y[,1], y[,2], type="n",
     xlab="Time to Sleep (min.)", ylab="Cumulative proportion")

#
lines(y[1:4,  1],  y[1:4,2], lty=1, col="steelblue")
lines(y[5:8,  1],  y[5:8,2], lty=5, col="steelblue")
lines(y[9:12, 1],  y[9:12,2], lty=2, col="indianred")
lines(y[13:16,1],  y[13:16,2], lty=4, col="indianred")

#
text(22, .12, "T,I")
text(22, .47, "T,F")
text(22, .20, "P,I")
text(22, .33, "P,F")
#
grid()

The active drug group and the placebo groups initially fell asleep at about the same rate. Two weeks later, at the follow-up those with the active treatment tended to fall asleep more quickly.

3 Homework 1:

Analyze the child obesity data with the model specified in the example and interpret the results.

The data are from the Muscatine Coronary Risk Factor (MCRF) study, a longitudinal survey of school-age children in Muscatine, Iowa. The MCRF study had the goal of examining the development and persistence of risk factors for coronary disease in children. In the MCRF study, weight and height measurements of five cohorts of children, initially aged 5-7, 7-9, 9-11, 11-13, and 13-15 years, were obtained biennially from 1977 to 1981. Data were collected on 4856 boys and girls. On the basis of a comparison of their weight to age-gender specific norms, children were classified as obese or not obese. The goal of the analysis is to determine whether the risk of obesity increases with age and whether patterns of change in obesity are the same for boys and girls.

Source: Woolson, R.F., & Clarke, W.R. (1984). Analysis of categorical incompletel longitudinal data. Journal of the Royal Statistical Society, Series A, 147, 87-99. Reported in Fitzmaurice, G.M., Laired, N.M., & Ware, J.H. (2004). Applied Longitudinal Data Analysis. pp. 306-312.

Column 1: ID Column 2: Gender, 0 = Male, 1 = Female Column 3: Baseline Age Column 4: Current Age, Age denotes mid-point of age-cohort Column 5: Occasions of measurements Column 6: Obesity Status, 1 = Obese, 0 = Non-Obese, . = Missing

3.1 Load data file

fL <- "data/muscatine-data.txt"
dta <- read.table(fL, h=F, na.strings = '.')
dta <- na.omit(dta)
names(dta) <- c("ID","Gender","Age0","Age","Occasion","Obese")
dta$Gender <- ifelse(dta$Gender == 0, "M", "F")
head(dta)
  ID Gender Age0 Age Occasion Obese
1  1      M    6   6        1     1
2  1      M    6   8        2     1
3  1      M    6  10        3     1
4  2      M    6   6        1     1
5  2      M    6   8        2     1
6  2      M    6  10        3     1

3.2 Summary Table

ftable(dta, row.vars = c(2, 4), col.vars = c(5,6))
           Occasion   1       2       3    
           Obese      0   1   0   1   0   1
Gender Age                                 
F      6            141  23   0   0   0   0
       8            294  58 265  55   0   0
       10           270  92 276  87 268  90
       12           291  91 279  99 278  92
       14           300  89 226  64 256  73
       16             0   0 250  87 226  56
       18             0   0   0   0 140  37
M      6            174  15   0   0   0   0
       8            289  67 296  54   0   0
       10           312  84 298  77 308  83
       12           281  90 299  88 290  90
       14           307  73 233  65 269  78
       16             0   0 251  67 224  54
       18             0   0   0   0 153  34

3.3 Visualization

par(mfrow=c(1,2))
barplot(prop.table(with(subset(dta, Gender=="M"), 
                           ftable(Obese, Age)), m=2), 
        xlab="Age \n (6, 8, 10, 12, 14, 16, 18 years)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="Male", 
        beside=T)
legend('topleft', c("Non-Obese","Obese"), 
       col=c("black","gray"), 
       pch=15, bty='n', cex=.5)
barplot(prop.table(with(subset(dta, Gender=="F"), 
                         ftable(Obese, Age)), m=2), 
        xlab="Age \n (6, 8, 10, 12, 14, 16, 18 years)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="Female", 
        beside=T)

Observed by the panels, I found an age effect on obesity and did not find gender effect.

3.4 Generalized Estimating Equations

Model: Letting Yij = 1 if the ith child is classified as obese at the jth occasion, and Yij = 0 otherwise. Assume that the marginal probability of obesity at each occasion follows the logistic model with age, gender, and gender x age effects, where age is defined by age at the jth occasion - 12 years. It can be further assumed that the log odds of obesity changes curvilinearly with age (i.e., quadratic age trend), but the trend over time is allowed to be different for girls and boys.

 dta_c <- dta %>% 
  mutate(Age = Age - 12,
         Age2 = Age^2)
    
summary(m0 <- geeglm(Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2, data=dta_c, id = ID, family = binomial, corstr = "ar1"))

Call:
geeglm(formula = Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2, 
    family = binomial, data = dta_c, id = ID, corstr = "ar1")

 Coefficients:
             Estimate  Std.err   Wald Pr(>|W|)
(Intercept)  -1.10039  0.05018 480.89  < 2e-16
GenderM      -0.10531  0.07136   2.18  0.14001
Age           0.04516  0.01281  12.43  0.00042
Age2         -0.01460  0.00325  20.18    7e-06
GenderM:Age  -0.00290  0.01856   0.02  0.87585
GenderM:Age2 -0.00349  0.00471   0.55  0.45857

Correlation structure = ar1 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    0.994  0.0281
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.611  0.0204
Number of clusters:   4856  Maximum cluster size: 3 

Occasion and quadratic Age had a significant effect on Obesity.

3.5 Model 2:

summary(m1 <- geeglm(Obese ~ Gender + Age + Age2, data=dta_c, id = ID, family = binomial, corstr = "ar1"))

Call:
geeglm(formula = Obese ~ Gender + Age + Age2, family = binomial, 
    data = dta_c, id = ID, corstr = "ar1")

 Coefficients:
            Estimate  Std.err   Wald Pr(>|W|)
(Intercept) -1.08751  0.04724 530.08  < 2e-16
GenderM     -0.13145  0.06288   4.37    0.037
Age          0.04387  0.00926  22.43  2.2e-06
Age2        -0.01630  0.00235  48.05  4.2e-12

Correlation structure = ar1 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    0.994   0.028
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha     0.61  0.0203
Number of clusters:   4856  Maximum cluster size: 3 

Here, we found a gender effect on obesity.

3.6 Model Comparison

anova(m0,m1)
Analysis of 'Wald statistic' Table

Model 1 Obese ~ Gender + Age + Age2 + Gender:Age + Gender:Age2 
Model 2 Obese ~ Gender + Age + Age2
  Df    X2 P(>|Chi|)
1  2 0.582      0.75
sjPlot::tab_model(m1, show.obs=F, show.ngroups = F)
  Obese
Predictors Odds Ratios CI p
(Intercept) 0.34 0.31 – 0.37 <0.001
Gender [M] 0.88 0.78 – 0.99 0.037
Age 1.04 1.03 – 1.06 <0.001
Age2 0.98 0.98 – 0.99 <0.001

3.7 Estimated Probability Plot

yhat_m1 <- data.frame(ID=row.names(fitted(m1)), 
                      phat=fitted(m1))
dta$ID <- factor(dta$ID)
dta_m1 <- inner_join(dta, yhat_m1, by="ID")
dta_sum <- dta_m1 %>%
  group_by(Gender, Age) %>%
  summarize(phat = mean(phat))

ggplot(dta_sum, aes(Age, phat, color = Gender, group = Gender)) +
  geom_line() +
  labs(x="Age (years)", 
       y="Estimated Probability of Obesity") +
  theme_minimal()+
  theme(legend.position=c(.9, .2))

Revealed by the results, we can observe that 1. The probability of obesity increased as age increases. 2. Female had larger probability of obesity after 12 years old than male.

4 Homework 2:

Replicate the results of multiple sources analysis reported in the Cantabrian example.

The data are from a survey of primary care patients in Cantabrian, Spain, to estimate the prevalence of psychiatric illness. The prevalence for each of the sexes is estimated from data provided by the general practitioner and from completing the general health questionnaire.

Source: Everitt, B.S., & Dunn, G. (2001). Applied Multivariate Data Analysis, 2nd Ed. p. 237

Column 1: Subject ID Column 2: Illness = 1, Otherweise = 0 Column 3: Gender ID Column 4: General practitioner = GP, General health questionnaire = GPQ

4.1 Load data file

dta <- read.table("data/cantabrian.txt", h=T)

names(dta) <- c("ID","Health","Gender","Type")

head(dta)
  ID Health Gender Type
1  1      0      M   GP
2  1      0      M  GHQ
3  2      0      M   GP
4  2      0      M  GHQ
5  3      0      M   GP
6  3      0      M  GHQ

4.2 Relabel

dta_c <- dta %>% 
  mutate(Health = factor(Health, 
                         levels=0:1, 
                         labels=c("Illness","Otherwise")),
         Gender = factor(Gender, levels = c("M", "F")),
         Type = factor(Type, 
                       levels=c("GP","GHQ")))

4.3 Summary Table

ftable(dta_c, row.vars = c('Gender','Health'), col.vars = c('Type'))
                 Type  GP GHQ
Gender Health                
M      Illness        287 244
       Otherwise       37  80
F      Illness        420 306
       Otherwise       79 193
prop.table(dta_sum <- (ftable(dta_c, row.vars = c('Gender','Health'), col.vars = c('Type'))), 2)
                 Type     GP    GHQ
Gender Health                      
M      Illness        0.3487 0.2965
       Otherwise      0.0450 0.0972
F      Illness        0.5103 0.3718
       Otherwise      0.0960 0.2345

4.4 Visualization

par(mfrow=c(1,2))

barplot(prop.table(with(subset(dta_c, Type=="GP"), 
                           ftable(Gender, Health)), m=2), 
        xlab="Health \n (Illness, Otherwise)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="GP", 
        beside=T)

legend('topleft', c("Male","Female"), 
       col=c("black","gray"), 
       pch=15, bty='n', cex=.5)

barplot(prop.table(with(subset(dta_c, Type=="GHQ"), 
                           ftable(Gender, Health)), m=2), 
        xlab="Health \n (Illness, Otherwise)",
        ylab="Proportion", 
        ylim=c(0,1),
        main="GHQ", 
        beside=T)

4.5 Generalized Estimating Equations

4.6 Model 0

summary(m0 <- geeglm(Health ~ Gender + Type + Gender:Type, data=dta, id = ID, corstr="exchangeable", family = binomial))

Call:
geeglm(formula = Health ~ Gender + Type + Gender:Type, family = binomial, 
    data = dta, id = ID, corstr = "exchangeable")

 Coefficients:
               Estimate Std.err  Wald Pr(>|W|)
(Intercept)     -0.4609  0.0919 25.14  5.3e-07
GenderM         -0.6542  0.1583 17.09  3.6e-05
TypeGP          -1.2099  0.1265 91.46  < 2e-16
GenderM:TypeGP   0.2765  0.2283  1.47     0.23

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)        1  0.0786
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.298  0.0474
Number of clusters:   823  Maximum cluster size: 2 

4.7 Model 1

summary(m1 <- geeglm(Health ~ Gender + Type, data=dta, id = ID, corstr="exchangeable", family = binomial))

Call:
geeglm(formula = Health ~ Gender + Type, family = binomial, data = dta, 
    id = ID, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err  Wald Pr(>|W|)
(Intercept)  -0.4888  0.0902  29.4  6.0e-08
GenderM      -0.5760  0.1433  16.2  5.8e-05
TypeGP       -1.1209  0.1061 111.6  < 2e-16

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     1.01  0.0885
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.295  0.0494
Number of clusters:   823  Maximum cluster size: 2 

4.8 Model Comparison

anova(m0,m1)
Analysis of 'Wald statistic' Table

Model 1 Health ~ Gender + Type + Gender:Type 
Model 2 Health ~ Gender + Type
  Df   X2 P(>|Chi|)
1  1 1.47      0.23
sjPlot::tab_model(m1, show.obs=F, show.ngroups = F)
  Health
Predictors Odds Ratios CI p
(Intercept) 0.61 0.51 – 0.73 <0.001
Gender [M] 0.56 0.42 – 0.74 <0.001
Type [GP] 0.33 0.26 – 0.40 <0.001

The results showed that both Gender and Type were significant preditors for estimating the prevalence of psychiatric illness.

4.9 Estimated Probability Plot

ot <- theme_set(theme_bw())

# fortify data with model fits
dta_m1 <- data.frame(na.omit(dta_c), phat = fitted(m1))

# compare observed with fitted
ggplot(dta_m1, aes(Type, phat, color = Gender)) +
 stat_summary(fun.data = "mean_cl_boot", position = position_dodge(width=.2)) +
 labs(x = "Physical Problem", y = "Probability of report") +
 theme(legend.position = c(.1, .88))