Analysis of Variance (ANOVA) and F-tests

These commands help visualize the F distribution, make F probability calculations, e.g., to calculate p-values, and run F tests in the context of analysis of variance (ANOVA). ANOVA can be used to analyze how the mean of a quantitative response variable varies with the (>2) levels of a categorical predictor variable. Some commands require mosaic (noted in comments) so you should install that package.

library(mosaic)

The \(F\) distribution and probability calculations

Visualizing the F distribution using plotDist():

# plotDist() is in the mosaic package
plotDist(dist='f', df1=5, df2=79, col="blue")

plotDist(dist='f', df1=3, df2=79, col="red", add=TRUE)

Making F cummulative probability calculations using pf():

# Note that we are usually use positive F values and
# are interested in upper tail probabilities.
# so use lower.tail=FALSE
F.prob <- pf(q=2.82, df1=3, df2=46, lower.tail=FALSE)
F.prob

## [1] 0.04925044

Making and visualizing \(F\) cummulative probability calculations using xpf():

# xpf() is in the mosaic package
xpf(q=2.82, df1=3, df2=46, lower.tail=FALSE)

## [1] 0.04925044

Finding critical \(F^∗\) values using qt():

# since F tests are always upper, we want 0.5 in the upper tail
F.crit <- qf(p=0.05, df1=3, df2=46, lower.tail=FALSE)
F.crit

## [1] 2.806845

Finding and visualizing critical \(F^∗\) values using xqt():

# xpf() is in the mosaic package
# since F tests are always upper, we want 0.5 in the upper tail
# same as 0.95 in the low end
xqf(p=0.95, df1=3, df2=46)

## [1] 2.806845

Constructing an ANOVA model using `lm()`

# We will use the dataset "chickwts" from the 'datasets' package included with R
# Chick weights reared on six feed types
data(chickwts)
head(chickwts) #extra

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean

Run an ANOVA to test the null hypothesis that \(\mu_1 = \mu_2 = ... = \mu_6\) using lm():

# lm() constructs a linear model of a quantitative variable
# if predictor is a categorical factor, this is ANOVA
# if predictor is quantitative, this is regression
chick.mod <- lm(weight~feed, data=chickwts)
# lm() just makes the model, to look at an ANOVA table we can use anova()
anova(chick.mod)

## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## feed       5 231129   46226  15.365 5.936e-10 ***
## Residuals 65 195556    3009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summarize the model and examine \(t\)-tests for the individual feed types compared to the baseline factor level using summary():

summary(chick.mod)

## 
## Call:
## lm(formula = weight ~ feed, data = chickwts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.909  -34.413    1.571   38.170  103.091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    323.583     15.834  20.436  < 2e-16 ***
## feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
## feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
## feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
## feedsoybean    -77.155     21.578  -3.576 0.000665 ***
## feedsunflower    5.333     22.393   0.238 0.812495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.85 on 65 degrees of freedom
## Multiple R-squared:  0.5417, Adjusted R-squared:  0.5064 
## F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

Perform multiple comparisons for all pairwise levels of the predictor variable using TukeyHSD():

TukeyHSD(chick.mod)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = x)
## 
## $feed
##                            diff         lwr       upr     p adj
## horsebean-casein    -163.383333 -232.346876 -94.41979 0.0000000
## linseed-casein      -104.833333 -170.587491 -39.07918 0.0002100
## meatmeal-casein      -46.674242 -113.906207  20.55772 0.3324584
## soybean-casein       -77.154762 -140.517054 -13.79247 0.0083653
## sunflower-casein       5.333333  -60.420825  71.08749 0.9998902
## linseed-horsebean     58.550000  -10.413543 127.51354 0.1413329
## meatmeal-horsebean   116.709091   46.335105 187.08308 0.0001062
## soybean-horsebean     86.228571   19.541684 152.91546 0.0042167
## sunflower-horsebean  168.716667   99.753124 237.68021 0.0000000
## meatmeal-linseed      58.159091   -9.072873 125.39106 0.1276965
## soybean-linseed       27.678571  -35.683721  91.04086 0.7932853
## sunflower-linseed    110.166667   44.412509 175.92082 0.0000884
## soybean-meatmeal     -30.480519  -95.375109  34.41407 0.7391356
## sunflower-meatmeal    52.007576  -15.224388 119.23954 0.2206962
## sunflower-soybean     82.488095   19.125803 145.85039 0.0038845

Visualizing relationship between a categorical predictor and a quantitative response

Make a figure of weight as a function of feed type.

# can include 95% CI for means if we include the "Hmisc" package
library(Hmisc)
ggplot(chickwts, aes(x=feed, y=weight)) +
  geom_jitter(width=0.1) +
  stat_summary(fun.data="mean_cl_normal", col="blue")

Reorder the feed typed based on the mean response and add groups from the Tukey HSD multiple comparison test.

ggplot(chickwts, aes(x=reorder(feed,weight,mean), y=weight)) +
  geom_jitter(width=0.1) +
  stat_summary(fun.data="mean_cl_normal", col="blue") +
  annotate("text", x=1:6, y=420, col="black", label=c("A","AB", "B", "BC", "C", "C")) +
  labs(x="feed type", y="chick weight")

Note that these confidence intervals are not adjusted for the multiple comparisons!

Analysis of Variance (ANOVA) and \(F\)-tests

The \(F\) distribution and probability calculations

Constructing an ANOVA model using lm()

Visualizing relationship between a categorical predictor and a quantitative response

Constructing an ANOVA model using `lm()`