Credits

See this datascience+ https://datascienceplus.com/one-way-anova-in-r/ for the original work that inspired my efforts. Credit to Bidyut Ghosh (Assistant Professor at Symbiosis International University, India) for the original posting.

Special thanks also to Dani Navarro, The University of New South Wales (Sydney) for the book Learning Statistics with R http://dj-navarro.appspot.com/lsr/lsr-0.5.1.pdf (hereafter simply LSR) and the lsr packages available through CRAN. I highly recommend it.

Let’s load the required R packages. See https://github.com/cran/lsr for more on lsr.

library(ggplot2)
library(lsr)
library(psych)
library(car)
library(tidyverse)
library(dunn.test)
library(BayesFactor)
library(scales)
library(knitr)
library(kableExtra)
options(width = 130)
options(knitr.table.format = "html") 

The Oneway Analysis of Variance (ANOVA)

The Oneway ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student’s t test. It is also true that ANOVA is a special case of the GLM or regression models so as the number of levels increase it might make more sense to try one of those approaches. ANOVA also allows for comparisons of mean differences across multiple factors (Factorial or Nway ANOVA) which we won’t address here.

Our scenario and data

Professor Ghosh’s original scenario can be summarized this way. Imagine that you are interested in understanding whether knowing the brand of car tire can help you predict whether you will get more or less mileage before you need to replace them. We’ll draw what is hopefully a random sample of 60 tires from four different manufacturers and use the mean mileage by brand to help inform our thinking. While we expect variation across our sample we’re interested in whether the differences between the tire brands (the groups) is significantly different than what we would expect in random variation within the groups.

Our research or testable hypothesis is then described \[\mu_{Apollo} \ne \mu_{Bridgestone} \ne \mu_{CEAT} \ne \mu_{Falken}\] as at least one of the tire brand populations is different than the other three. Our null is basically “nope, brand doesn’t matter in predicting tire mileage – all brands are the same”.

He provides the following data set with 60 observations. I’ve chosen to download directly but you could of course park the file locally and work from there.

Column Contains Type
Brands What brand tyre factor
Mileage Tyre life in thousands num
tyre<-read.csv("https://datascienceplus.com/wp-content/uploads/2017/08/tyre.csv")
# tyre<-read.csv("tyre.csv") # if you have it on your local hard drive
str(tyre)
## 'data.frame':    60 obs. of  2 variables:
##  $ Brands : Factor w/ 4 levels "Apollo","Bridgestone",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Mileage: num  33 36.4 32.8 37.6 36.3 ...
summary(tyre)
##          Brands      Mileage     
##  Apollo     :15   Min.   :27.88  
##  Bridgestone:15   1st Qu.:32.69  
##  CEAT       :15   Median :34.84  
##  Falken     :15   Mean   :34.74  
##                   3rd Qu.:36.77  
##                   Max.   :41.05
head(tyre)
##   Brands Mileage
## 1 Apollo  32.998
## 2 Apollo  36.435
## 3 Apollo  32.777
## 4 Apollo  37.637
## 5 Apollo  36.304
## 6 Apollo  35.915
# View(tyre) # if you use RStudio this is a nice way to see the data in spreadsheet format

The data set contains what we expected. The dependent variable Mileage is numeric and the independent variable Brand is of type factor. R is usually adept at coercing a chr string or an integer as the independent variable but I find it best to explicitly make it a factor when you’re working on ANOVAs.

Let’s graph and describe the basics. First a simple boxplot of all 60 data points along with a summary using the describe command from the package psych. Then in reverse order lets describe describeBy and boxplot breaking it down by group (in our case tire brand).

boxplot(tyre$Mileage, 
        horizontal = TRUE, 
        main="Mileage distribution across all brands",
        col = "blue")

describe(tyre) # the * behind Brands reminds us it's a factor and some of the numbers are nonsensical
##         vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## Brands*    1 60  2.50 1.13   2.50    2.50 1.48  1.00  4.00  3.00  0.00    -1.41 0.15
## Mileage    2 60 34.74 2.98  34.84   34.76 3.09 27.88 41.05 13.17 -0.11    -0.44 0.38
describeBy(tyre$Mileage,group = tyre$Brand, mat = TRUE, digits = 2)
##     item      group1 vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## X11    1      Apollo    1 15 34.80 2.22  34.84   34.85 2.37 30.62 38.33  7.71 -0.18    -1.24 0.57
## X12    2 Bridgestone    1 15 31.78 2.20  32.00   31.83 1.65 27.88 35.01  7.13 -0.29    -1.05 0.57
## X13    3        CEAT    1 15 34.76 2.53  34.78   34.61 2.03 30.43 41.05 10.62  0.64     0.33 0.65
## X14    4      Falken    1 15 37.62 1.70  37.38   37.65 1.18 34.31 40.66  6.35  0.13    -0.69 0.44
boxplot(tyre$Mileage~tyre$Brands, 
        main="Boxplot comparing Mileage of Four Brands of Tyre", 
        col= rainbow(4), 
        horizontal = TRUE)

Let’s format the table describeby generates to make it a little nicer using the kable package. Luckily describeby generates a dataframe with mat=TRUE and after that we can select which columns to publish (dropping some of the less used) as well as changing the column labels as desired.

describeBy(tyre$Mileage,group = tyre$Brand, mat = TRUE) %>% #create dataframe
  select(Brand=group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max, 
                Skew=skew, Kurtosis=kurtosis, SEM=se) %>% 
  kable(align=c("lrrrrrrrr"), digits=2, row.names = FALSE,
        caption="Tire Mileage Brand Descriptive Statistics") %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
Tire Mileage Brand Descriptive Statistics
Brand N Mean SD Median Min Max Skew Kurtosis SEM
Apollo 15 34.80 2.22 34.84 30.62 38.33 -0.18 -1.24 0.57
Bridgestone 15 31.78 2.20 32.00 27.88 35.01 -0.29 -1.05 0.57
CEAT 15 34.76 2.53 34.78 30.43 41.05 0.64 0.33 0.65
Falken 15 37.62 1.70 37.38 34.31 40.66 0.13 -0.69 0.44

Certainly much nicer looking and I only scratched the surface of the options available. We can certainly look at the numbers and learn a lot. But let’s see if we can also improve our plotting to be more informative.

The more I use ggplot2 the more I love the ability to use it to customize the presentation of the data to optimize understanding! The next plot might be accused of being a little “busy” but essentially answers our Oneway ANOVA question in one picture (note that I have stayed with the original decision to set \(\alpha\) = 0.01 significance level (99% confidence intervals)).

ggplot(tyre, aes(reorder(Brands,Mileage),Mileage,fill=Brands))+
# ggplot(tyre, aes(Brands,Mileage,fill=Brands))+ # if you want to leave them alphabetic
  geom_jitter(colour = "dark gray",width=.1) +
  stat_boxplot(geom ='errorbar',width = 0.4) +
  geom_boxplot()+
  labs(title="Boxplot, dotplot and SEM plot of mileage for four brands of tyres", 
       x = "Brands (sorted)",
       y = "Mileage (in thousands)",
       subtitle ="Gray dots=sample data points, Black dot=outlier, Blue dot=mean, Red=99% confidence interval",
       caption = "Data from https://datascienceplus.com/one-way-anova-in-r/") +
  guides(fill=FALSE) +
  stat_summary(fun.data = "mean_cl_normal", colour = "red", size = 1.5, fun.args = list(conf.int=.99)) +
  stat_summary(geom="point", fun.y=mean, color="blue") +
  theme_bw()