Introduction

In this guide, we’ll explore Factorial ANOVA using R, a method used to examine the effect of two or more independent variables (also known as factors) on a dependent variable, particularly when these factors are categorical. It’s especially valuable when you’re interested not just in the individual effects of these factors, but also in how they interact with each other.

Create alcohol example data

cones <- c(0,1,0,2,1,2,3,0,3,1,4,6,3,2,3,0,0,3,2,0,4,5,2,3,2,7,9,6,4,6)
speed <- factor(c(rep("slow",15),rep("fast",15)))
alcohol <- factor(c(rep("1",5),rep("3",5),rep("5",5),rep("1",5),rep("3",5),rep("5",5)))
alcdata <- data.frame(cones,speed,alcohol)

In aov - and in most analyses in R - you can put in variables and all of their interactions by connecting them with a *. By default, aov uses Type I SS, so the order you enter predictors matters. Take a look at these two aov functions and their outputs.

output<-aov(cones ~ speed*alcohol,data=alcdata)
summary(output)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## speed          1  16.13   16.13   8.274  0.00831 ** 
## alcohol        2  85.40   42.70  21.897 3.87e-06 ***
## speed:alcohol  2   8.47    4.23   2.171  0.13595    
## Residuals     24  46.80    1.95                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
output<-aov(cones ~ alcohol*speed,data=alcdata)
summary(output)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## alcohol        2  85.40   42.70  21.897 3.87e-06 ***
## speed          1  16.13   16.13   8.274  0.00831 ** 
## alcohol:speed  2   8.47    4.23   2.171  0.13595    
## Residuals     24  46.80    1.95                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results are exactly the same. Wait, what’s going on here? The results are exactly the same even though this is Type I SS, so the order matters? Why? Think about it for a minute. If you come up with an answer, go to the end of this document to read my answer. If you don’t come up with an answer, go through the rest of the script to see the results for the writing score example before jumping to the end for the answer.

No matter what, the interaction is considered after the main effects, so the results for the interaction are accounting for the (main) effects of alcohol and speed.

You can change the results to Type III with the Anova function from the car package. This might make more sense, depending on the particular analysis and scientific context. Certainly Type III is the most common form.

Note: capital A Anova. The lowercase anova function is different. There might even be an ANOVA function; I don’t know. Don’t forget - R is case sensitive.

car::Anova(output,type="III")
## Anova Table (Type III tests)
## 
## Response: cones
##               Sum Sq Df F value    Pr(>F)    
## (Intercept)    5.000  1  2.5641    0.1224    
## alcohol       73.733  2 18.9060 1.174e-05 ***
## speed          0.100  1  0.0513    0.8228    
## alcohol:speed  8.467  2  2.1709    0.1360    
## Residuals     46.800 24                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now with Type III SS, speed is no longer significant. That is, after accounting for the interaction, speed is no longer significant. But neither is the interaction. Maybe you choose to drop the interaction? There isn’t an obviously correct answer here. Make your choices, state why you made them, and, if appropriate, state how the conclusions are impacted by your choices. You can also switch to Type II with the Anova function.

car::Anova(output,type="II")
## Anova Table (Type II tests)
## 
## Response: cones
##               Sum Sq Df F value    Pr(>F)    
## alcohol       85.400  2 21.8974 3.874e-06 ***
## speed         16.133  1  8.2735  0.008308 ** 
## alcohol:speed  8.467  2  2.1709  0.135954    
## Residuals     46.800 24                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results here are the same as for Type I SS, for the same reason as why the Type I SS results are the same regardless of order.

Factorial ANOVA effect sizes

The effect sizes for factorial ANOVA are the same as with one-way ANOVA

library(sjstats)
library(knitr)
library(kableExtra)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
kable(anova_stats(output)) %>%
  kable_styling()
etasq partial.etasq omegasq partial.omegasq epsilonsq cohens.f term sumsq df meansq statistic p.value power
alcohol 0.545 0.646 0.513 0.582 0.520 1.351 alcohol 85.400 2 42.700 21.897 0.000 1.000
speed 0.103 0.256 0.089 0.195 0.090 0.587 speed 16.133 1 16.133 8.274 0.008 0.819
alcohol:speed 0.054 0.153 0.029 0.072 0.029 0.425 alcohol:speed 8.467 2 4.233 2.171 0.136 0.443
1 NA NA NA NA NA NA Residuals 46.800 24 1.950 NA NA NA

Looking at \(\omega^2\), alcohol has a very large effect. Speed has a medium effect. The interaction has a small effect. This offers further justification for dropping the interaction.

Post-hoc tests

You will get a warning message with post-hoc tests because they are of questionable validity when there are interactions.

library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
ph<-glht(output,linfct=mcp(alcohol="Tukey"))
## Warning in mcp2matrix(model, linfct = linfct): covariate interactions found --
## default contrast might be inappropriate
summary(ph)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = cones ~ alcohol * speed, data = alcdata)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)    
## 3 - 1 == 0   2.2000     0.8832   2.491  0.05076 .  
## 5 - 1 == 0   5.4000     0.8832   6.114  < 0.001 ***
## 5 - 3 == 0   3.2000     0.8832   3.623  0.00361 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Let’s try dropping the interaction since it was not significant and the effect size was small.

output<-aov(cones ~ alcohol + speed,data=alcdata)
summary(output)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## alcohol      2  85.40   42.70   20.09 5.32e-06 ***
## speed        1  16.13   16.13    7.59   0.0106 *  
## Residuals   26  55.27    2.13                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But by default, aov is uses Type I SS. That doesn’t make sense here; the predictors have no natural order.

car::Anova(output,type="3")
## Anova Table (Type III tests)
## 
## Response: cones
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 20.008  1  9.4128  0.004986 ** 
## alcohol     85.400  2 20.0881 5.315e-06 ***
## speed       16.133  1  7.5899  0.010576 *  
## Residuals   55.267 26                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Without the interaction, post-hoc tests make more sense.

ph<-glht(output,linfct=mcp(alcohol="Tukey"))
summary(ph)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = cones ~ alcohol + speed, data = alcdata)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)    
## 3 - 1 == 0    1.600      0.652   2.454  0.05352 .  
## 5 - 1 == 0    4.100      0.652   6.288  < 0.001 ***
## 5 - 3 == 0    2.500      0.652   3.834  0.00197 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Why are we only getting post-hoc tests for alcohol? Because speed only has two levels, so the F test for speed is a pairwise comparison. Post-hoc tests don’t have anything to add.

Reading & writing score example

# Load the dataset
dta_rw <- read.csv("readwrite2.csv")

# Add one more IV: gender
dta_rw$gender <- sample(c("female", "male"), nrow(dta_rw), replace = TRUE)

library(tidyverse)
library(dplyr)

# Convert all IVs to factor variables
dta_rw <- dta_rw %>%
  mutate(gender = as.factor(gender),
         level = as.factor(level))

Descriptives by level, by gender, and by both together

library(dplyr)

# by proficiency levels
dta_rw %>%
  group_by(level) %>%
  summarize(count = n(),
            mean = mean(writing),
            sd = sd(writing))
## # A tibble: 5 × 4
##   level count  mean    sd
##   <fct> <int> <dbl> <dbl>
## 1 1       226  10.3 2.13 
## 2 2       342  13.3 2.19 
## 3 3       524  16.5 2.28 
## 4 4       853  20.3 2.32 
## 5 5        52  23.6 0.721
# by gender
dta_rw %>%
  group_by(gender) %>%
  summarize(count = n(),
            mean = mean(writing),
            sd = sd(writing))
## # A tibble: 2 × 4
##   gender count  mean    sd
##   <fct>  <int> <dbl> <dbl>
## 1 female   982  17.1  4.22
## 2 male    1015  17.0  4.31
# by proficiency level & gender
dta_rw %>%
  group_by(level, gender) %>%
  summarize(count = n(),
            mean = mean(writing),
            sd = sd(writing))
## `summarise()` has grouped output by 'level'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 5
## # Groups:   level [5]
##    level gender count  mean    sd
##    <fct> <fct>  <int> <dbl> <dbl>
##  1 1     female   100  10.2 2.20 
##  2 1     male     126  10.4 2.08 
##  3 2     female   169  13.3 2.13 
##  4 2     male     173  13.2 2.24 
##  5 3     female   262  16.5 2.29 
##  6 3     male     262  16.4 2.27 
##  7 4     female   419  20.2 2.33 
##  8 4     male     434  20.4 2.32 
##  9 5     female    32  23.6 0.712
## 10 5     male      20  23.6 0.754

A couple preliminary analyses before the factorial ANOVA. First, let’s get the sample sizes in each cell.

table(dta_rw$level,dta_rw$gender)
##    
##     female male
##   1    100  126
##   2    169  173
##   3    262  262
##   4    419  434
##   5     32   20

Let’s check if there is any relation between the two categorical predictors. This uses a chi-square test of independence. We will learn more about this analysis method later this semester.

chisq.test(table(dta_rw$level,dta_rw$gender))
## 
##  Pearson's Chi-squared test
## 
## data:  table(dta_rw$level, dta_rw$gender)
## X-squared = 5.5271, df = 4, p-value = 0.2374

The chi-square value (labeled X-squared) is not zero, so there is an observed relation between the predictors. However, it is not significant, which means there is no evidence that these variables are related in the population. This makes it easier to interpret the ANOVA results, because we can interpret the effect of proficiency level without worrying about gender and vice versa.

A friendly reminder, The CrossTable function in the gmodels package gives more detailed information about categorical variables.

library(gmodels)
CrossTable(dta_rw$level,dta_rw$gender,fisher=TRUE,chisq = TRUE,expected=TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1997 
## 
##  
##              | dta_rw$gender 
## dta_rw$level |    female |      male | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |       100 |       126 |       226 | 
##              |   111.133 |   114.867 |           | 
##              |     1.115 |     1.079 |           | 
##              |     0.442 |     0.558 |     0.113 | 
##              |     0.102 |     0.124 |           | 
##              |     0.050 |     0.063 |           | 
## -------------|-----------|-----------|-----------|
##            2 |       169 |       173 |       342 | 
##              |   168.174 |   173.826 |           | 
##              |     0.004 |     0.004 |           | 
##              |     0.494 |     0.506 |     0.171 | 
##              |     0.172 |     0.170 |           | 
##              |     0.085 |     0.087 |           | 
## -------------|-----------|-----------|-----------|
##            3 |       262 |       262 |       524 | 
##              |   257.671 |   266.329 |           | 
##              |     0.073 |     0.070 |           | 
##              |     0.500 |     0.500 |     0.262 | 
##              |     0.267 |     0.258 |           | 
##              |     0.131 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
##            4 |       419 |       434 |       853 | 
##              |   419.452 |   433.548 |           | 
##              |     0.000 |     0.000 |           | 
##              |     0.491 |     0.509 |     0.427 | 
##              |     0.427 |     0.428 |           | 
##              |     0.210 |     0.217 |           | 
## -------------|-----------|-----------|-----------|
##            5 |        32 |        20 |        52 | 
##              |    25.570 |    26.430 |           | 
##              |     1.617 |     1.564 |           | 
##              |     0.615 |     0.385 |     0.026 | 
##              |     0.033 |     0.020 |           | 
##              |     0.016 |     0.010 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       982 |      1015 |      1997 | 
##              |     0.492 |     0.508 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  5.527131     d.f. =  4     p =  0.2373549 
## 
## 
##  
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Alternative hypothesis: two.sided
## p =  0.2389056 
## 
## 

More fun with SS

aov uses Type I SS by default, so order matters.

output <- aov(writing ~ level*gender, data = dta_rw)
summary(output)
##                Df Sum Sq Mean Sq  F value Pr(>F)    
## level           4  26340    6585 1309.427 <2e-16 ***
## gender          1      1       1    0.192  0.661    
## level:gender    4     10       3    0.512  0.727    
## Residuals    1987   9993       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
output <- aov(writing ~ gender*level, data = dta_rw)
summary(output)
##                Df Sum Sq Mean Sq  F value Pr(>F)    
## gender          1     15      15    3.065 0.0802 .  
## level           4  26326    6581 1308.709 <2e-16 ***
## gender:level    4     10       3    0.512 0.7272    
## Residuals    1987   9993       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the order impacts the results. Again, see the point at the end of the document to see why order matters here but not with the alcohol example.

Type I SS are not appropriate here because there is no scientifically meaningful ordering of the predictors. There are plausible scientific reasons why a particular order might be chosen- maybe we’re controlling for gender before considering proficiency levels - but here I don’t have any order.

Type II would therefore be more appropriate: simultaneous for main effects, sequential for interaction. That is, the effect of the interaction is considered only after main effects are accounted for.

car::Anova(output,type="2")
## Anova Table (Type II tests)
## 
## Response: writing
##               Sum Sq   Df   F value Pr(>F)    
## gender           1.0    1    0.1918 0.6615    
## level        26326.0    4 1308.7092 <2e-16 ***
## gender:level    10.3    4    0.5116 0.7272    
## Residuals     9992.6 1987                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now there is a significant difference in writing scores across proficiency levels, but no significant differences in writing scores across gender There is no evidence of an interaction, so the main effects are directly interpretable.

We might also go with the most common SS, Type III.

car::Anova(output,type="3")
## Anova Table (Type III tests)
## 
## Response: writing
##               Sum Sq   Df   F value Pr(>F)    
## (Intercept)  10465.3    1 2080.9898 <2e-16 ***
## gender           2.4    1    0.4728 0.4918    
## level        12582.4    4  625.4949 <2e-16 ***
## gender:level    10.3    4    0.5116 0.7272    
## Residuals     9992.6 1987                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now only the proficiency level effect is significant. What’s the best answer? There isn’t a clear right or wrong, and you can justify Type II or Type III.

kable(anova_stats(output)) %>%
  kable_styling()
etasq partial.etasq omegasq partial.omegasq epsilonsq cohens.f term sumsq df meansq statistic p.value power
gender 0.000 0.002 0.000 0.001 0.000 0.039 gender 15.414 1 15.414 3.065 0.080 0.417
level 0.724 0.725 0.724 0.724 0.724 1.623 level 26325.975 4 6581.494 1308.709 0.000 1.000
gender:level 0.000 0.001 0.000 -0.001 0.000 0.032 gender:level 10.292 4 2.573 0.512 0.727 0.175
1 NA NA NA NA NA NA Residuals 9992.616 1987 5.029 NA NA NA

The gender and interaction effect sizes are negligible. Proficiency level is large.

Given all of this evidence, I would follow this up by dropping the interaction, since it is non-significant and of negligible magnitude.

output <- aov(writing ~ level + gender, data = dta_rw)
car::Anova(output,type="3")
## Anova Table (Type III tests)
## 
## Response: writing
##             Sum Sq   Df  F value Pr(>F)    
## (Intercept)  21095    1 4198.801 <2e-16 ***
## level        26326    4 1309.995 <2e-16 ***
## gender           1    1    0.192 0.6613    
## Residuals    10003 1991                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OK, now comes decision time. Which result provides the best scientific information? I would go with the main effects only model (i.e., with no interaction) using Type III SS. I don’t have a strong reason for this, except that it offers the most parsimonious description of what’s going on.

Testing for homogeneity of variance

We should have tested for homogeneity of variance before doing the analyses. Let’s do it now. If we do the obvious test- e.g., bartlett.test(writing~level*gender,data = dta_rw) - we’ll get an error. You can’t look at homogeneity of variance with more than 1 group predictor. We’re going to need to do some manipulation of the categories. A nice package to do this is forcats. The fct_cross function creates a new variable that combines two categorical variables. Then, we can look at homogeneity of variance across all 10 gender X level combinations.

library(forcats)

# Note that all IVs need to be a factor variable for the fct_cross() to work properly
dta_rw$levelgen <- fct_cross(dta_rw$level,
                             dta_rw$gender, 
                             sep="-")
unique(dta_rw$levelgen)
##  [1] 3-male   3-female 2-female 4-female 2-male   4-male   1-female 5-female
##  [9] 1-male   5-male  
## 10 Levels: 1-female 2-female 3-female 4-female 5-female 1-male ... 5-male

Now we have 10 groups in a single factor variable (5 level * 2 gender) instead of two factor variables. This allows for testing homogeneity of variance.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
bartlett.test(writing~levelgen,data = dta_rw)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  writing by levelgen
## Bartlett's K-squared = 72.209, df = 9, p-value = 5.608e-12
leveneTest(writing~levelgen,data = dta_rw)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    9  4.9026 1.591e-06 ***
##       1987                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fligner.test(writing~levelgen,data = dta_rw)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  writing by levelgen
## Fligner-Killeen:med chi-squared = 39.779, df = 9, p-value = 8.331e-06

All of the tests are significant. This means that the assumption of homogeneity of variance does not hold, and all our results so far cannot be trusted.

In such a case, the WRS2 package offers a two-way ANOVA that is robust to heterogeneity of variance, but isn’t well-established in the statistical literature. In other words, we should not be 100% confident in the results from this package, but if there are major concerns with heterogeneity of variance, then this should be considered more trustworthy than the results with the assumption of homogeneity of variance.

library(WRS2)
t2way(writing ~ level*gender, data=dta_rw)
## Call:
## t2way(formula = writing ~ level * gender, data = dta_rw)
## 
##                  value p.value
## level        7380.2014   0.001
## gender          0.5752   0.449
## level:gender    3.7356   0.451

Another option is to do this as a regression using heteroskedasticity robust SEs.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(sandwich)
output<-lm(writing ~ level*gender, data=dta_rw)
summary(output)
## 
## Call:
## lm(formula = writing ~ level * gender, data = dta_rw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.364 -1.364  0.400  1.603  6.603 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.23000    0.22425  45.618   <2e-16 ***
## level2             3.09544    0.28293  10.941   <2e-16 ***
## level3             6.30053    0.26360  23.902   <2e-16 ***
## level4             9.96809    0.24958  39.939   <2e-16 ***
## level5            13.36375    0.45546  29.341   <2e-16 ***
## gendermale         0.20651    0.30034   0.688    0.492    
## level2:gendermale -0.29496    0.38604  -0.764    0.445    
## level3:gendermale -0.34010    0.35860  -0.948    0.343    
## level4:gendermale -0.04054    0.33733  -0.120    0.904    
## level5:gendermale -0.20026    0.70626  -0.284    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.243 on 1987 degrees of freedom
## Multiple R-squared:  0.7251, Adjusted R-squared:  0.7238 
## F-statistic: 582.2 on 9 and 1987 DF,  p-value: < 2.2e-16
coeftest(output, vcov = vcovHC(output, type = "HC0"))
## 
## t test of coefficients:
## 
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.230000   0.218566 46.8051   <2e-16 ***
## level2             3.095444   0.273055 11.3363   <2e-16 ***
## level3             6.300534   0.260283 24.2065   <2e-16 ***
## level4             9.968091   0.246290 40.4730   <2e-16 ***
## level5            13.363750   0.251240 53.1913   <2e-16 ***
## gendermale         0.206508   0.286309  0.7213   0.4708    
## level2:gendermale -0.294958   0.370952 -0.7951   0.4266    
## level3:gendermale -0.340096   0.348750 -0.9752   0.3296    
## level4:gendermale -0.040543   0.327481 -0.1238   0.9015    
## level5:gendermale -0.200258   0.352595 -0.5680   0.5701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This approach works very differently than the t2way function approach. I have no idea which is better. It may be that one approach works better in some circumstances, and the other works better in other circumstances.

The end

You’ve reached the end. Let’s answer the question of why the order of predictors with Type I SS didn’t matter for the alcohol example but did (albeit not strongly) for the writing scores example.

The reason for the alcohol example is that the sample sizes were the same for all cells. That means the alcohol and speed factors are independent. That is, if you know which level of alcohol a participant is in, it offers no information about which level of speed the participant is in.

This has a small interpretational benefit: the effect of alcohol on cones hit is interpretable without considering the level of speed. Normally, we interpret an effect of a variable, whether in an ANOVA or a regression, as “the effect of X on Y, controlling for all other predictors” which is the same as saying “the effect of X on Y, holding all other predictors constant”, although we usually abbreviate this interpretation by just saying “the effect of X on Y” with controlling for all other predictors implied.

But here, the other predictors (speed) don’t matter for X (alcohol), so the interpretation doesn’t involve ‘holding all other predictors constant’.

In my opinion, this is a very minor benefit - the interpretation of any predictor even when it is correlated with other predictors is no big deal.

Since alcohol and speed are independent, the SS for alcohol is the same whether we control for speed or not - although it’s important to note that the F-ratio will be different for alcohol if we include speed in the ANOVA because the SSwithin and dfwithin will be different.

output<-aov(cones ~ speed + alcohol,data=alcdata)
summary(output)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## speed        1  16.13   16.13    7.59   0.0106 *  
## alcohol      2  85.40   42.70   20.09 5.32e-06 ***
## Residuals   26  55.27    2.13                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
output<-aov(cones ~ alcohol,data=alcdata)
summary(output)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## alcohol      2   85.4   42.70   16.15 2.44e-05 ***
## Residuals   27   71.4    2.64                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the writing score example, the order mattered.

This is because a student’s proficiency level should be randomly determined (unlike the experimental assignment). Within the general population, there’s no inherent link between gender and proficiency levels. However, it’s important to remember that our data is a sample, not the entire population. In sampling, random allocation can sometimes lead to unexpected patterns. In this case, our sample displayed a slight, though not statistically significant (as indicated by the chi-square test of independence), relationship between gender and proficiency level. This relationship, it should be noted, is likely a product of random chance rather than any underlying dependency between these variables.