group means vs marginal means

library(data.table)
library(dplyr)
library(forcats)
library(ggplot2)
source("/Users/claire/Desktop/Desktop/gradstats2022/gradstats_funcs.R")
nursery<- fread("/Users/claire/Desktop/Desktop/gradstats2022/nursery.csv", header = T, data.table = F)

when we have equal Ns, group means will equal the mean of means, as we’ve learned before.

this is a similar idea to marginal means. marginal means refer to computing means across the margins of a mean table”

there are multiple ways to get at these different pieces of information:

with(nursery, tapply(PEABODY, list(COLLEGE,NURSERY), mean)) # gives the means by group
##          no      yes
## both 74.500 87.00000
## none 67.125 76.83333
## one  74.000 76.13333
aov.out = aov(PEABODY ~ COLLEGE * NURSERY, data=nursery)
model.tables(aov.out, type="means", se=T) ## this runs an ANOVA and spits out several mean tables
## Design is unbalanced - use se.contrast() for se's
## Tables of means
## Grand mean
##       
## 75.02 
## 
##  COLLEGE 
##     both  none   one
##       82 72.95 75.28
## rep    5 20.00 25.00
## 
##  NURSERY 
##       no  yes
##     71.3 77.5
## rep 20.0 30.0
## 
##  COLLEGE:NURSERY 
##        NURSERY
## COLLEGE no    yes  
##    both 74.50 87.00
##    rep   2.00  3.00
##    none 67.12 76.83
##    rep   8.00 12.00
##    one  74.00 76.13
##    rep  10.00 15.00

why are the above sets of means similar and different? the first set under the model.tables gives means only taking into account one factor at a time – mean peabody score for those who went to nursery for not (regardless of college educated parents) and mean peabody score for different levels of college educated parents – regardless of attending nursery.

the last set of means under model.tables is the same as the first function – because it calculates those marginal means fully taking into account both factors AND the unequal Ns. we see the unequal Ns given after “rep”

example 1: in the context of the ANOVA we’ve been running, how do peabody scores differ for those kids who attended nursery vs those who did not? (or– controlling for number of college educated parents, how does attending nursery school influence peabody scores)

this code, again, gives us the average peabody score for someone who attended nursery (77.5) vs not (71.3)

describeBy(PEABODY ~ NURSERY, data=nursery)
## 
##  Descriptive statistics by group 
## NURSERY: no
##         vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## PEABODY    1 20 71.3 9.19     68   70.12 6.67  60  94    34 0.95    -0.18 2.06
## ------------------------------------------------------------ 
## NURSERY: yes
##         vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
## PEABODY    1 30 77.5 9.09     78   78.17 6.67  51  96    45 -0.67     0.94 1.66

we might think that’s good enough but it’s not given the question! we want to know how to answer this in the context of our ANOVA or controlling for college-educated parents.

describeBy(PEABODY ~ NURSERY + COLLEGE, data=nursery)
## 
##  Descriptive statistics by group 
## NURSERY: no
## COLLEGE: both
##         vars n mean    sd median trimmed   mad min max range skew kurtosis  se
## PEABODY    1 2 74.5 13.44   74.5    74.5 14.08  65  84    19    0    -2.75 9.5
## ------------------------------------------------------------ 
## NURSERY: yes
## COLLEGE: both
##         vars n mean   sd median trimmed   mad min max range skew kurtosis   se
## PEABODY    1 3   87 8.54     86      87 10.38  79  96    17 0.12    -2.33 4.93
## ------------------------------------------------------------ 
## NURSERY: no
## COLLEGE: none
##         vars n  mean   sd median trimmed  mad min max range skew kurtosis   se
## PEABODY    1 8 67.12 6.15   65.5   67.12 2.22  63  82    19 1.69     1.32 2.17
## ------------------------------------------------------------ 
## NURSERY: yes
## COLLEGE: none
##         vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## PEABODY    1 12 76.83 7.58   77.5    77.4 5.19  60  88    28 -0.51    -0.22
##           se
## PEABODY 2.19
## ------------------------------------------------------------ 
## NURSERY: no
## COLLEGE: one
##         vars  n mean    sd median trimmed  mad min max range skew kurtosis   se
## PEABODY    1 10   74 10.15     73   73.25 5.19  60  94    34 0.57    -0.72 3.21
## ------------------------------------------------------------ 
## NURSERY: yes
## COLLEGE: one
##         vars  n  mean   sd median trimmed  mad min max range  skew kurtosis  se
## PEABODY    1 15 76.13 9.69     75      77 7.41  51  90    39 -0.92     0.65 2.5

now that we’eve edited the code, we can get these means broken down the same way as the ANOVA model would.

the mean peabody score for those who attended nursery and

  • had no parents attend college=76.83
  • had one parent attend college=76.13
  • had both parents attend college=87

so the total mean of means for those who attended nursery is \((76.83+76.13+87)/3=79.99\)

similarly:

the mean peabody score for those who did not attend nursery and

  • had no parents attend college=67.12
  • had one parent attend college=74
  • had both parents attend college=74.5

so the total mean of means for those who did not attend nursery is \((67.12+74+74.5)/3=71.87\)

we can see where the mean of means come into play

nursery$nurse<- ifelse(nursery$NURSERY=="yes", -1/2, ifelse(nursery$NURSERY=="no", 1/2, NA))
nursery$lin_coll<- ifelse(nursery$COLLEGE=="none", -1/2, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1/2, NA)))
nursery$quad_coll<- ifelse(nursery$COLLEGE=="none", -1/3, ifelse(nursery$COLLEGE=="one", 2/3, ifelse(nursery$COLLEGE=="both", -1/3, NA)))
nursery$lin_nursery<- nursery$lin_coll*nursery$nurse
nursery$quad_nursery<- nursery$quad_coll*nursery$nurse

mcSummary(lm(PEABODY~nurse+lin_coll+quad_coll+lin_nursery+quad_nursery, data=nursery))
## lm(formula = PEABODY ~ nurse + lin_coll + quad_coll + lin_nursery + 
##     quad_nursery, data = nursery)
## 
## Omnibus ANOVA
##                  SS df      MS EtaSq     F     p
## Model       998.205  5 199.641 0.224 2.535 0.042
## Error      3464.775 44  78.745                  
## Corr Total 4462.980 49  91.081                  
## 
##   RMSE AdjEtaSq
##  8.874    0.135
## 
## Coefficients
##                 Est StErr      t     SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept)  75.932 1.626 46.706 171776.910 0.980    NA  72.655  79.208 0.000
## nurse        -8.114 3.251 -2.495    490.359 0.124 0.621 -14.667  -1.561 0.016
## lin_coll      8.771 4.528  1.937    295.402 0.079 0.749  -0.356  17.897 0.059
## quad_coll    -1.298 2.900 -0.448     15.778 0.005 0.749  -7.142   4.546 0.657
## lin_nursery  -2.792 9.057 -0.308      7.482 0.002 0.619 -21.044  15.461 0.759
## quad_nursery  8.971 5.799  1.547    188.431 0.052 0.677  -2.717  20.658 0.129

the beta for the nursery vs not coefficient = 8.11 –> aka the difference in the mean of means for nursery vs not (\(79.99-71.87\)).

example 2: how do calculators predict test performance, controlling for type of test taken?

calctest<- fread("/Users/claire/Desktop/Desktop/gradstats2022/calctest.csv", header=T, data.table=F)
with(calctest, tapply(correct, list(calc,type), mean)) 
##     AB MC write
## no  15 18    13
## yes 20 22    12

vs

describeBy(correct ~ calc, data=calctest)
## 
##  Descriptive statistics by group 
## calc: no
##         vars n  mean  sd median trimmed  mad min max range  skew kurtosis   se
## correct    1 9 15.33 2.5     15   15.33 2.97  11  19     8 -0.17    -1.22 0.83
## ------------------------------------------------------------ 
## calc: yes
##         vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## correct    1 9   18 5.05     19      18 5.93  10  24    14 -0.28     -1.6 1.68
calctest$calc_yes_no<- ifelse(calctest$calc=="yes", -1/2, ifelse(calctest$calc=="no", 1/2, NA))
calctest$type_AB_MC<- ifelse(calctest$type=="AB", -1/2, ifelse(calctest$type=="write", 0, ifelse(calctest$type=="MC", 1/2, NA)))
calctest$type_AB_MC_vs_write<- ifelse(calctest$type=="AB", -1/3, ifelse(calctest$type=="write", 2/3, ifelse(calctest$type=="MC", -1/3, NA)))
calctest$calc_AB_MC<- calctest$calc_yes_no*calctest$type_AB_MC
calctest$calc_AB_MC_vs_write<- calctest$calc_yes_no*calctest$type_AB_MC_vs_write

mcSummary(lm(correct~calc_yes_no+type_AB_MC+type_AB_MC_vs_write+calc_AB_MC+calc_AB_MC_vs_write, data=calctest))
## lm(formula = correct ~ calc_yes_no + type_AB_MC + type_AB_MC_vs_write + 
##     calc_AB_MC + calc_AB_MC_vs_write, data = calctest)
## 
## Omnibus ANOVA
##             SS df     MS EtaSq    F p
## Model      238  5 47.600 0.832 11.9 0
## Error       48 12  4.000             
## Corr Total 286 17 16.824             
## 
##  RMSE AdjEtaSq
##     2    0.762
## 
## Coefficients
##                        Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept)         16.667 0.471 35.355 5000.00 0.990  NA 15.640  17.694 0.000
## calc_yes_no         -2.667 0.943 -2.828   32.00 0.400   1 -4.721  -0.612 0.015
## type_AB_MC           2.500 1.155  2.165   18.75 0.281   1 -0.016   5.016 0.051
## type_AB_MC_vs_write -6.250 1.000 -6.250  156.25 0.765   1 -8.429  -4.071 0.000
## calc_AB_MC           1.000 2.309  0.433    0.75 0.015   1 -4.032   6.032 0.673
## calc_AB_MC_vs_write  5.500 2.000  2.750   30.25 0.387   1  1.142   9.858 0.018

in this case, the difference in math performance between those who used a calculator or not, controlling for type of test, is 2.667. we can get that number from the first with() statement above:

-mean with calculator=18 -mean without calculator=15.33

\(18-15.33=2.67\)

however, we can also get it if we calculate the marginal means from the describeBy statement above:

the mean correct for those with a calculator and:

  • AB=20
  • MC=22
  • written exams=12

so the total mean of means for those who used a calculator is \((20+22+12)/3=18\)

and

the mean correct for those without a calculator and:

  • AB=15
  • MC=18
  • written exams=13

so the total mean of means for those who used a calculator is \((15+18+13)/3=15.33\)

in sum:

both of these examples show how ANOVA models take into account sample size! it’s very important to know the question you are asking, and then present the appropriate information.

with equal N, it does not matter where you derive the means. if we were to ask what the effect of calculators on performance is regardless of exam type, we would get the same answer:

mcSummary(lm(correct~calc_yes_no, data=calctest))
## lm(formula = correct ~ calc_yes_no, data = calctest)
## 
## Omnibus ANOVA
##             SS df     MS EtaSq     F     p
## Model       32  1 32.000 0.112 2.016 0.175
## Error      254 16 15.875                  
## Corr Total 286 17 16.824                  
## 
##   RMSE AdjEtaSq
##  3.984    0.056
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 16.667 0.939 17.747   5000 0.952  NA 14.676  18.658 0.000
## calc_yes_no -2.667 1.878 -1.420     32 0.112  NA -6.648   1.315 0.175

using a calculator improves performance by 2.67 points.

let’s revist the example from last week’s homework

we asked you to find the simple effect of nursery on someone who had no parents attend college

nursery<- fread("/Users/claire/Desktop/Desktop/gradstats2022/nursery.csv", header=T, data.table=F)
head(nursery)
##   NURSERY COLLEGE PEABODY KINDER
## 1      no    both      65     59
## 2      no    both      84     79
## 3      no     one      76     69
## 4      no     one      71     63
## 5      no     one      72     68
## 6      no     one      87     81
table(nursery$COLLEGE) ### both, none, one
## 
## both none  one 
##    5   20   25
table(nursery$NURSERY) ### no, yes
## 
##  no yes 
##  20  30

create contrast codes

nursery$nurse<- ifelse(nursery$NURSERY=="yes", 1/2, ifelse(nursery$NURSERY=="no", -1/2, NA))
nursery$lin_coll<- ifelse(nursery$COLLEGE=="none", -1/2, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1/2, NA)))
nursery$quad_coll<- ifelse(nursery$COLLEGE=="none", -1/3, ifelse(nursery$COLLEGE=="one", 2/3, ifelse(nursery$COLLEGE=="both", -1/3, NA)))
nursery$lin_nursery<- nursery$lin_coll*nursery$nurse
nursery$quad_nursery<- nursery$quad_coll*nursery$nurse

run the model

mcSummary(lm(PEABODY~nurse+lin_coll+quad_coll+lin_nursery+quad_nursery, data=nursery))
## lm(formula = PEABODY ~ nurse + lin_coll + quad_coll + lin_nursery + 
##     quad_nursery, data = nursery)
## 
## Omnibus ANOVA
##                  SS df      MS EtaSq     F     p
## Model       998.205  5 199.641 0.224 2.535 0.042
## Error      3464.775 44  78.745                  
## Corr Total 4462.980 49  91.081                  
## 
##   RMSE AdjEtaSq
##  8.874    0.135
## 
## Coefficients
##                 Est StErr      t     SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept)  75.932 1.626 46.706 171776.910 0.980    NA  72.655  79.208 0.000
## nurse         8.114 3.251  2.495    490.359 0.124 0.621   1.561  14.667 0.016
## lin_coll      8.771 4.528  1.937    295.402 0.079 0.749  -0.356  17.897 0.059
## quad_coll    -1.298 2.900 -0.448     15.778 0.005 0.749  -7.142   4.546 0.657
## lin_nursery   2.792 9.057  0.308      7.482 0.002 0.619 -15.461  21.044 0.759
## quad_nursery -8.971 5.799 -1.547    188.431 0.052 0.677 -20.658   2.717 0.129

test the simple effect of NURSERY on PEABODY, when COLLEGE = none.

\(peabody=75.9+8.11*nurse+8.771lin-1.298quad+2.79lin*nurse-8.971quad*nurse*\)

\(peabody=75.9+8.771lin-1.298quad+(8.11+2.79lin-8.971quad)*nurse\)

\(peabody=75.9+(8.77*(-.5))+(-1.298*(-.33))=71.9\) –> INTERCEPT = mean of group mean peabody scores when college= none across nurse groups

\(peabody=(8.11+2.79lin-8.971quad)*nurse\)

\(peabody=8.11+(2.79*(-.5)-(8.997*(-.33)))=9.7\) –> SLOPE = as you go from no nursery to nursery, this is the change in peabody scores for those who had no parents that attended college.

we can confirm this by dummy coding. since there are 3 groups within the college factor, we still need two codes. that means to dummy code for none, it needs to be zero on both codes!

nursery$nurse<- ifelse(nursery$NURSERY=="yes", 1/2, ifelse(nursery$NURSERY=="no", -1/2, NA))
nursery$dummy_coll_none_both<- ifelse(nursery$COLLEGE=="none", 0, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1, NA)))
nursery$dummy_coll_none_one<- ifelse(nursery$COLLEGE=="none", 0, ifelse(nursery$COLLEGE=="one", 1, ifelse(nursery$COLLEGE=="both", 0, NA)))
nursery$dummmy1_nurse<- nursery$dummy_coll_none_both*nursery$nurse
nursery$dummy2_nurse<- nursery$dummy_coll_none_one*nursery$nurse
mcSummary(lm(PEABODY~nurse*dummy_coll_none_both+nurse*+dummy_coll_none_one, data=nursery))
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## lm(formula = PEABODY ~ nurse * dummy_coll_none_both + nurse * 
##     +dummy_coll_none_one, data = nursery)
## 
## Omnibus ANOVA
##                  SS df      MS EtaSq     F     p
## Model       998.205  5 199.641 0.224 2.535 0.042
## Error      3464.775 44  78.745                  
## Corr Total 4462.980 49  91.081                  
## 
##   RMSE AdjEtaSq
##  8.874    0.135
## 
## Coefficients
##                               Est StErr      t    SSR(3) EtaSq   tol  CI_2.5
## (Intercept)                71.979 2.025 35.542 99475.208 0.966    NA  67.898
## nurse                       9.708 4.050  2.397   452.408 0.115 0.400   1.545
## dummy_coll_none_both        8.771 4.528  1.937   295.402 0.079 0.853  -0.356
## dummy_coll_none_one         3.088 2.717  1.136   101.682 0.029 0.853  -2.388
## nurse:dummy_coll_none_both  2.792 9.057  0.308     7.482 0.002 0.771 -15.461
## nurse:dummy_coll_none_one  -7.575 5.434 -1.394   153.015 0.042 0.435 -18.527
##                            CI_97.5     p
## (Intercept)                 76.061 0.000
## nurse                       17.871 0.021
## dummy_coll_none_both        17.897 0.059
## dummy_coll_none_one          8.563 0.262
## nurse:dummy_coll_none_both  21.044 0.759
## nurse:dummy_coll_none_one    3.377 0.170

in the above model output, we can see that the slope for nursery is 9.71. this is the simple effect of nursery on peabody scores for someone who had NO parents attend college. we just calculated that above.

the intercept is a bit more confusing. it is the mean of group mean peabody scores for those who had no parents attend college, controlling for nursery. because nursery is in the model, the intercept is when nursery=0 and when college=0. when college=0 on both codes, we are talking about someone in the none group. when nursery=0, we are averaging across no and yes nursery groups because we contrast coded it.

therefore: \((67.125+76.83333)/2= 71.9\) is the intercept, or the marginal mean for those in the none college group.

finally, confirm with means and plot

aov.out = aov(PEABODY ~ COLLEGE * NURSERY, data=nursery)
model.tables(aov.out, type="means", se=T)
## Design is unbalanced - use se.contrast() for se's
## Tables of means
## Grand mean
##       
## 75.02 
## 
##  COLLEGE 
##     both  none   one
##       82 72.95 75.28
## rep    5 20.00 25.00
## 
##  NURSERY 
##       no  yes
##     71.3 77.5
## rep 20.0 30.0
## 
##  COLLEGE:NURSERY 
##        NURSERY
## COLLEGE no    yes  
##    both 74.50 87.00
##    rep   2.00  3.00
##    none 67.12 76.83
##    rep   8.00 12.00
##    one  74.00 76.13
##    rep  10.00 15.00
with(nursery, tapply(PEABODY, list(COLLEGE,NURSERY), mean))
##          no      yes
## both 74.500 87.00000
## none 67.125 76.83333
## one  74.000 76.13333
means<- nursery %>% group_by(COLLEGE,NURSERY) %>% 
  summarise(mean=mean(PEABODY),
            .groups = 'drop') %>%
  na.omit() %>%
  mutate(college = fct_relevel(COLLEGE, "none", "one", "both")) %>%
  as.data.frame() 


ggplot(data=means, aes(x=college, y=mean, group=NURSERY)) +
  geom_line(aes(linetype=NURSERY)) +
  geom_point(aes(shape=NURSERY))+
  ggtitle("Peabody score by nursery or not") +
  labs(y = "Peabody score", x = "College educated parents")