library(data.table)
library(dplyr)
library(forcats)
library(ggplot2)
source("/Users/claire/Desktop/Desktop/gradstats2022/gradstats_funcs.R")
nursery<- fread("/Users/claire/Desktop/Desktop/gradstats2022/nursery.csv", header = T, data.table = F)
when we have equal Ns, group means will equal the mean of means, as we’ve learned before.
this is a similar idea to marginal means. marginal means refer to computing means across the margins of a mean table”
there are multiple ways to get at these different pieces of information:
with(nursery, tapply(PEABODY, list(COLLEGE,NURSERY), mean)) # gives the means by group
## no yes
## both 74.500 87.00000
## none 67.125 76.83333
## one 74.000 76.13333
aov.out = aov(PEABODY ~ COLLEGE * NURSERY, data=nursery)
model.tables(aov.out, type="means", se=T) ## this runs an ANOVA and spits out several mean tables
## Design is unbalanced - use se.contrast() for se's
## Tables of means
## Grand mean
##
## 75.02
##
## COLLEGE
## both none one
## 82 72.95 75.28
## rep 5 20.00 25.00
##
## NURSERY
## no yes
## 71.3 77.5
## rep 20.0 30.0
##
## COLLEGE:NURSERY
## NURSERY
## COLLEGE no yes
## both 74.50 87.00
## rep 2.00 3.00
## none 67.12 76.83
## rep 8.00 12.00
## one 74.00 76.13
## rep 10.00 15.00
why are the above sets of means similar and different? the first set under the model.tables gives means only taking into account one factor at a time – mean peabody score for those who went to nursery for not (regardless of college educated parents) and mean peabody score for different levels of college educated parents – regardless of attending nursery.
the last set of means under model.tables is the same as the first function – because it calculates those marginal means fully taking into account both factors AND the unequal Ns. we see the unequal Ns given after “rep”
this code, again, gives us the average peabody score for someone who attended nursery (77.5) vs not (71.3)
describeBy(PEABODY ~ NURSERY, data=nursery)
##
## Descriptive statistics by group
## NURSERY: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 20 71.3 9.19 68 70.12 6.67 60 94 34 0.95 -0.18 2.06
## ------------------------------------------------------------
## NURSERY: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 30 77.5 9.09 78 78.17 6.67 51 96 45 -0.67 0.94 1.66
we might think that’s good enough but it’s not given the question! we want to know how to answer this in the context of our ANOVA or controlling for college-educated parents.
describeBy(PEABODY ~ NURSERY + COLLEGE, data=nursery)
##
## Descriptive statistics by group
## NURSERY: no
## COLLEGE: both
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 2 74.5 13.44 74.5 74.5 14.08 65 84 19 0 -2.75 9.5
## ------------------------------------------------------------
## NURSERY: yes
## COLLEGE: both
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 3 87 8.54 86 87 10.38 79 96 17 0.12 -2.33 4.93
## ------------------------------------------------------------
## NURSERY: no
## COLLEGE: none
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 8 67.12 6.15 65.5 67.12 2.22 63 82 19 1.69 1.32 2.17
## ------------------------------------------------------------
## NURSERY: yes
## COLLEGE: none
## vars n mean sd median trimmed mad min max range skew kurtosis
## PEABODY 1 12 76.83 7.58 77.5 77.4 5.19 60 88 28 -0.51 -0.22
## se
## PEABODY 2.19
## ------------------------------------------------------------
## NURSERY: no
## COLLEGE: one
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 10 74 10.15 73 73.25 5.19 60 94 34 0.57 -0.72 3.21
## ------------------------------------------------------------
## NURSERY: yes
## COLLEGE: one
## vars n mean sd median trimmed mad min max range skew kurtosis se
## PEABODY 1 15 76.13 9.69 75 77 7.41 51 90 39 -0.92 0.65 2.5
now that we’eve edited the code, we can get these means broken down the same way as the ANOVA model would.
the mean peabody score for those who attended nursery and
so the total mean of means for those who attended nursery is \((76.83+76.13+87)/3=79.99\)
similarly:
the mean peabody score for those who did not attend nursery and
so the total mean of means for those who did not attend nursery is \((67.12+74+74.5)/3=71.87\)
nursery$nurse<- ifelse(nursery$NURSERY=="yes", -1/2, ifelse(nursery$NURSERY=="no", 1/2, NA))
nursery$lin_coll<- ifelse(nursery$COLLEGE=="none", -1/2, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1/2, NA)))
nursery$quad_coll<- ifelse(nursery$COLLEGE=="none", -1/3, ifelse(nursery$COLLEGE=="one", 2/3, ifelse(nursery$COLLEGE=="both", -1/3, NA)))
nursery$lin_nursery<- nursery$lin_coll*nursery$nurse
nursery$quad_nursery<- nursery$quad_coll*nursery$nurse
mcSummary(lm(PEABODY~nurse+lin_coll+quad_coll+lin_nursery+quad_nursery, data=nursery))
## lm(formula = PEABODY ~ nurse + lin_coll + quad_coll + lin_nursery +
## quad_nursery, data = nursery)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 998.205 5 199.641 0.224 2.535 0.042
## Error 3464.775 44 78.745
## Corr Total 4462.980 49 91.081
##
## RMSE AdjEtaSq
## 8.874 0.135
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 75.932 1.626 46.706 171776.910 0.980 NA 72.655 79.208 0.000
## nurse -8.114 3.251 -2.495 490.359 0.124 0.621 -14.667 -1.561 0.016
## lin_coll 8.771 4.528 1.937 295.402 0.079 0.749 -0.356 17.897 0.059
## quad_coll -1.298 2.900 -0.448 15.778 0.005 0.749 -7.142 4.546 0.657
## lin_nursery -2.792 9.057 -0.308 7.482 0.002 0.619 -21.044 15.461 0.759
## quad_nursery 8.971 5.799 1.547 188.431 0.052 0.677 -2.717 20.658 0.129
the beta for the nursery vs not coefficient = 8.11 –> aka the difference in the mean of means for nursery vs not (\(79.99-71.87\)).
calctest<- fread("/Users/claire/Desktop/Desktop/gradstats2022/calctest.csv", header=T, data.table=F)
with(calctest, tapply(correct, list(calc,type), mean))
## AB MC write
## no 15 18 13
## yes 20 22 12
vs
describeBy(correct ~ calc, data=calctest)
##
## Descriptive statistics by group
## calc: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## correct 1 9 15.33 2.5 15 15.33 2.97 11 19 8 -0.17 -1.22 0.83
## ------------------------------------------------------------
## calc: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## correct 1 9 18 5.05 19 18 5.93 10 24 14 -0.28 -1.6 1.68
calctest$calc_yes_no<- ifelse(calctest$calc=="yes", -1/2, ifelse(calctest$calc=="no", 1/2, NA))
calctest$type_AB_MC<- ifelse(calctest$type=="AB", -1/2, ifelse(calctest$type=="write", 0, ifelse(calctest$type=="MC", 1/2, NA)))
calctest$type_AB_MC_vs_write<- ifelse(calctest$type=="AB", -1/3, ifelse(calctest$type=="write", 2/3, ifelse(calctest$type=="MC", -1/3, NA)))
calctest$calc_AB_MC<- calctest$calc_yes_no*calctest$type_AB_MC
calctest$calc_AB_MC_vs_write<- calctest$calc_yes_no*calctest$type_AB_MC_vs_write
mcSummary(lm(correct~calc_yes_no+type_AB_MC+type_AB_MC_vs_write+calc_AB_MC+calc_AB_MC_vs_write, data=calctest))
## lm(formula = correct ~ calc_yes_no + type_AB_MC + type_AB_MC_vs_write +
## calc_AB_MC + calc_AB_MC_vs_write, data = calctest)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 238 5 47.600 0.832 11.9 0
## Error 48 12 4.000
## Corr Total 286 17 16.824
##
## RMSE AdjEtaSq
## 2 0.762
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 16.667 0.471 35.355 5000.00 0.990 NA 15.640 17.694 0.000
## calc_yes_no -2.667 0.943 -2.828 32.00 0.400 1 -4.721 -0.612 0.015
## type_AB_MC 2.500 1.155 2.165 18.75 0.281 1 -0.016 5.016 0.051
## type_AB_MC_vs_write -6.250 1.000 -6.250 156.25 0.765 1 -8.429 -4.071 0.000
## calc_AB_MC 1.000 2.309 0.433 0.75 0.015 1 -4.032 6.032 0.673
## calc_AB_MC_vs_write 5.500 2.000 2.750 30.25 0.387 1 1.142 9.858 0.018
in this case, the difference in math performance between those who used a calculator or not, controlling for type of test, is 2.667. we can get that number from the first with() statement above:
-mean with calculator=18 -mean without calculator=15.33
\(18-15.33=2.67\)
however, we can also get it if we calculate the marginal means from the describeBy statement above:
the mean correct for those with a calculator and:
so the total mean of means for those who used a calculator is \((20+22+12)/3=18\)
and
the mean correct for those without a calculator and:
so the total mean of means for those who used a calculator is \((15+18+13)/3=15.33\)
both of these examples show how ANOVA models take into account sample size! it’s very important to know the question you are asking, and then present the appropriate information.
with equal N, it does not matter where you derive the means. if we were to ask what the effect of calculators on performance is regardless of exam type, we would get the same answer:
mcSummary(lm(correct~calc_yes_no, data=calctest))
## lm(formula = correct ~ calc_yes_no, data = calctest)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 32 1 32.000 0.112 2.016 0.175
## Error 254 16 15.875
## Corr Total 286 17 16.824
##
## RMSE AdjEtaSq
## 3.984 0.056
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 16.667 0.939 17.747 5000 0.952 NA 14.676 18.658 0.000
## calc_yes_no -2.667 1.878 -1.420 32 0.112 NA -6.648 1.315 0.175
using a calculator improves performance by 2.67 points.
nursery<- fread("/Users/claire/Desktop/Desktop/gradstats2022/nursery.csv", header=T, data.table=F)
head(nursery)
## NURSERY COLLEGE PEABODY KINDER
## 1 no both 65 59
## 2 no both 84 79
## 3 no one 76 69
## 4 no one 71 63
## 5 no one 72 68
## 6 no one 87 81
table(nursery$COLLEGE) ### both, none, one
##
## both none one
## 5 20 25
table(nursery$NURSERY) ### no, yes
##
## no yes
## 20 30
create contrast codes
nursery$nurse<- ifelse(nursery$NURSERY=="yes", 1/2, ifelse(nursery$NURSERY=="no", -1/2, NA))
nursery$lin_coll<- ifelse(nursery$COLLEGE=="none", -1/2, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1/2, NA)))
nursery$quad_coll<- ifelse(nursery$COLLEGE=="none", -1/3, ifelse(nursery$COLLEGE=="one", 2/3, ifelse(nursery$COLLEGE=="both", -1/3, NA)))
nursery$lin_nursery<- nursery$lin_coll*nursery$nurse
nursery$quad_nursery<- nursery$quad_coll*nursery$nurse
run the model
mcSummary(lm(PEABODY~nurse+lin_coll+quad_coll+lin_nursery+quad_nursery, data=nursery))
## lm(formula = PEABODY ~ nurse + lin_coll + quad_coll + lin_nursery +
## quad_nursery, data = nursery)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 998.205 5 199.641 0.224 2.535 0.042
## Error 3464.775 44 78.745
## Corr Total 4462.980 49 91.081
##
## RMSE AdjEtaSq
## 8.874 0.135
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 75.932 1.626 46.706 171776.910 0.980 NA 72.655 79.208 0.000
## nurse 8.114 3.251 2.495 490.359 0.124 0.621 1.561 14.667 0.016
## lin_coll 8.771 4.528 1.937 295.402 0.079 0.749 -0.356 17.897 0.059
## quad_coll -1.298 2.900 -0.448 15.778 0.005 0.749 -7.142 4.546 0.657
## lin_nursery 2.792 9.057 0.308 7.482 0.002 0.619 -15.461 21.044 0.759
## quad_nursery -8.971 5.799 -1.547 188.431 0.052 0.677 -20.658 2.717 0.129
\(peabody=75.9+8.11*nurse+8.771lin-1.298quad+2.79lin*nurse-8.971quad*nurse*\)
\(peabody=75.9+8.771lin-1.298quad+(8.11+2.79lin-8.971quad)*nurse\)
\(peabody=75.9+(8.77*(-.5))+(-1.298*(-.33))=71.9\) –> INTERCEPT = mean of group mean peabody scores when college= none across nurse groups
\(peabody=(8.11+2.79lin-8.971quad)*nurse\)
\(peabody=8.11+(2.79*(-.5)-(8.997*(-.33)))=9.7\) –> SLOPE = as you go from no nursery to nursery, this is the change in peabody scores for those who had no parents that attended college.
we can confirm this by dummy coding. since there are 3 groups within the college factor, we still need two codes. that means to dummy code for none, it needs to be zero on both codes!
nursery$nurse<- ifelse(nursery$NURSERY=="yes", 1/2, ifelse(nursery$NURSERY=="no", -1/2, NA))
nursery$dummy_coll_none_both<- ifelse(nursery$COLLEGE=="none", 0, ifelse(nursery$COLLEGE=="one", 0, ifelse(nursery$COLLEGE=="both", 1, NA)))
nursery$dummy_coll_none_one<- ifelse(nursery$COLLEGE=="none", 0, ifelse(nursery$COLLEGE=="one", 1, ifelse(nursery$COLLEGE=="both", 0, NA)))
nursery$dummmy1_nurse<- nursery$dummy_coll_none_both*nursery$nurse
nursery$dummy2_nurse<- nursery$dummy_coll_none_one*nursery$nurse
mcSummary(lm(PEABODY~nurse*dummy_coll_none_both+nurse*+dummy_coll_none_one, data=nursery))
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## lm(formula = PEABODY ~ nurse * dummy_coll_none_both + nurse *
## +dummy_coll_none_one, data = nursery)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 998.205 5 199.641 0.224 2.535 0.042
## Error 3464.775 44 78.745
## Corr Total 4462.980 49 91.081
##
## RMSE AdjEtaSq
## 8.874 0.135
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5
## (Intercept) 71.979 2.025 35.542 99475.208 0.966 NA 67.898
## nurse 9.708 4.050 2.397 452.408 0.115 0.400 1.545
## dummy_coll_none_both 8.771 4.528 1.937 295.402 0.079 0.853 -0.356
## dummy_coll_none_one 3.088 2.717 1.136 101.682 0.029 0.853 -2.388
## nurse:dummy_coll_none_both 2.792 9.057 0.308 7.482 0.002 0.771 -15.461
## nurse:dummy_coll_none_one -7.575 5.434 -1.394 153.015 0.042 0.435 -18.527
## CI_97.5 p
## (Intercept) 76.061 0.000
## nurse 17.871 0.021
## dummy_coll_none_both 17.897 0.059
## dummy_coll_none_one 8.563 0.262
## nurse:dummy_coll_none_both 21.044 0.759
## nurse:dummy_coll_none_one 3.377 0.170
in the above model output, we can see that the slope for nursery is 9.71. this is the simple effect of nursery on peabody scores for someone who had NO parents attend college. we just calculated that above.
the intercept is a bit more confusing. it is the mean of group mean peabody scores for those who had no parents attend college, controlling for nursery. because nursery is in the model, the intercept is when nursery=0 and when college=0. when college=0 on both codes, we are talking about someone in the none group. when nursery=0, we are averaging across no and yes nursery groups because we contrast coded it.
therefore: \((67.125+76.83333)/2= 71.9\) is the intercept, or the marginal mean for those in the none college group.
finally, confirm with means and plot
aov.out = aov(PEABODY ~ COLLEGE * NURSERY, data=nursery)
model.tables(aov.out, type="means", se=T)
## Design is unbalanced - use se.contrast() for se's
## Tables of means
## Grand mean
##
## 75.02
##
## COLLEGE
## both none one
## 82 72.95 75.28
## rep 5 20.00 25.00
##
## NURSERY
## no yes
## 71.3 77.5
## rep 20.0 30.0
##
## COLLEGE:NURSERY
## NURSERY
## COLLEGE no yes
## both 74.50 87.00
## rep 2.00 3.00
## none 67.12 76.83
## rep 8.00 12.00
## one 74.00 76.13
## rep 10.00 15.00
with(nursery, tapply(PEABODY, list(COLLEGE,NURSERY), mean))
## no yes
## both 74.500 87.00000
## none 67.125 76.83333
## one 74.000 76.13333
means<- nursery %>% group_by(COLLEGE,NURSERY) %>%
summarise(mean=mean(PEABODY),
.groups = 'drop') %>%
na.omit() %>%
mutate(college = fct_relevel(COLLEGE, "none", "one", "both")) %>%
as.data.frame()
ggplot(data=means, aes(x=college, y=mean, group=NURSERY)) +
geom_line(aes(linetype=NURSERY)) +
geom_point(aes(shape=NURSERY))+
ggtitle("Peabody score by nursery or not") +
labs(y = "Peabody score", x = "College educated parents")