Case study 1: One-way ANOVA

Foreword: The Elementary Statistics for Medical Students (ESMS) project

This project aims to strengthen the statistical skills for Medical students in Vietnam. We provide the problem-based tutorials, each one will show the student how to resolve a study question using the appropriate methods. We focus on the basic statistical methods and common tasks All tutorials will be based on R, not only because this is a great tool for data visualization and analysis, but also for encouraging the students to use a free software instead of using the pirated packages like SPSS or Medcalc. Through these tutorials you will realize that R is much more powerful than any commercial package you ever know.

Introduction

The one-way analysis of variance (ANOVA) is an omnibus statistical method for verifying whether statistically significant differences exist between the means of independent groups. It worth to mention that the ANOVA is based on a linear model (LM), so our true intention is to evaluate the “effect” of a multi-levels factor on the outcome. Further analysis such as pairwise comparison or planned contrast would also be required for verifying our hypothesis, because stand alone ANOVA cannot locate which specific groups were statistically significantly different from each other.

This tutorial will show you how to carry out an One-way ANOVA in R. We also suggest some popular techniques for pairwise comparison, including Tukey post-hoc test, Bonferroni adjustment and planned contrast analysis. Last but not least, we also introduce some bootstraping procedures for ANOVA and post-hoc test, as well as how to interpret and report the results.

Context

In this parrallel group clinical trial, 606 patients were randomized into 4 treatment groups (A,B,C and D). Each group was treated with a different dose of drug X (A=lowest, D=highest). All patients underwent a liver function test before and after the treatment. The outcomes include Alkaline phosphatase, Alanine aminotransferase, Aspartate aminotransferase and Total bilirubin levels. For this tutorial we will only focus on the Total bilirubin level (TBL). Our study question is whether the drug doses have an effect on post-treatment TBL level?

** Method**

Data analysis was performed using R programming language. The drug dose effect on post treatment TBL was evaluated using One-way ANOVA. The TBL levels were compared among 4 dose groups by Bonferroni post-hoc test and contrast analysis.

Data preparation

The dataset could be downloaded from the famous vincentarebundock database as follow:

library(tidyverse)

data=read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/texmex/liver.csv")%>%as_tibble()

Step 1: Exploring data and Testing of Assumptions

Before making any move, we should make sure that our data can actually be handled by ANOVA. Unfortunately, most of the published studies never consider checking of assumption on their data, whilst the real world data are far from perfect.

The result obtained from One-way ANOVA is only reliable if our data satisfy all the required assumptions as follows:

1. The outcome should be a continuous variable
1. The factor should consist of two or more independent level (groups)
1. The observations should be independent (i.e there is no relationship between the observations), if not you might need another approach, such as repeated measures ANOVA or mixed models.
1. Beware the outliers, they reduce the validity and power of our result. This problem could be resolved by Bootstraping and/or Transformation
1. The outcome variable should be approximately normally distributed for each level in the factor. If not, we might need to transform our data or use a non-parametric test.
1. A homogeneity of variances is required. This could be verified using Levene’s test. If not, we might need to adopt some special post-hoc tests as well as the Welch ANOVA instead of conventional one.

In fact, checking for these six assumptions is easy, but dealing with the violations is more difficult. First, we will explore our data and checking for any violation of those assumptions:

Table 1: Post treatment TBL levels in four Dose groups

d=psych::describeBy(data$TBL.M,data$dose)

t1=rbind(d$A[,c(2:13)], d$B[,c(2:13)], d$C[,c(2:13)], d$D[,c(2:13)])%>%as.data.frame()
row.names(t1)=c("A","B","C","D")
knitr::kable(t1)

	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
A	152	9.570375	3.557731	8.4645	9.162516	2.788771	3.249	23.085	19.836	1.271402	2.066067	0.2885701
B	148	10.554628	4.166143	9.4905	10.131750	3.422582	3.420	25.992	22.572	1.162064	1.648718	0.3424549
C	148	10.747581	5.235447	9.4050	10.023450	3.422582	4.275	42.750	38.475	2.895294	12.872133	0.4303511
D	158	11.845538	5.271410	10.6020	11.223211	4.309918	3.933	39.843	35.910	2.014411	6.771180	0.4193708

Checking for Normal distribution

The normal distribution of data can be verified by QQ-plot and/or statistical tests.

qqnorm(data$TBL.M)
qqline(data$TBL.M, col = "red")

or by using the D’Agostino test:

library(fBasics)
dagoTest(data$TBL.M)

## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 320.5378
##     Z3  | Skewness: 14.4439
##     Z4  | Kurtosis: 10.5788
##   P VALUE:
##     Omnibus  Test: < 2.2e-16 
##     Skewness Test: < 2.2e-16 
##     Kurtosis Test: < 2.2e-16 
## 
## Description:
##  Thu Feb 23 01:20:06 2017 by user: Admin

or by looking directly at the distribution density curve:

data%>%ggplot(aes(x=TBL.M,fill=dose))+geom_density(alpha=0.5)+scale_y_continuous("Total bilirubin after treatment")+theme_bw()+scale_fill_manual(values=c("white","grey80","grey30","grey5"))

Those results indicate that our data is NOT normaly distributed. Such violation could be resolved by data transformation. Though many techniques are availabe for data transformation, we introduce here a very flexible method which is the BOX-COX transformation.

The Box-Cox transformation could be applied directly using the CARET package:

BCT=data[,c(2:9)]%>%apply(.,2,as.numeric)%>%preProcess(.,method="BoxCox")
data[,c(2:9)]=data[,c(2:9)]%>%as.matrix()%>%predict(BCT,.)

After being transformed, the dataset is now normally distributed with less ourlier

dagoTest(data$TBL.M)

## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 2.2161
##     Z3  | Skewness: -0.472
##     Z4  | Kurtosis: 1.4119
##   P VALUE:
##     Omnibus  Test: 0.3302 
##     Skewness Test: 0.637 
##     Kurtosis Test: 0.158 
## 
## Description:
##  Thu Feb 23 01:20:06 2017 by user: Admin

qqnorm(data$TBL.M)
qqline(data$TBL.M, col = "red")

data%>%ggplot(aes(x=dose,y=TBL.M,fill=dose))+geom_boxplot(alpha=0.7)+scale_y_continuous("Total bilirubin after treatment (BoxCox transformed)")+coord_flip()+theme_bw()+scale_fill_manual(values=c("white","grey60","grey40","grey20"))

Step 2: Performing One-way ANOVA

This step consists of performing an ANOVA or F-test, based on a linear model of “Outcome ~ Factor + error”. It could be easily done using the aov () function.

aov(TBL.M ~ dose, data=data)%>%summary()

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## dose          3  0.705 0.23501   6.609 0.000213 ***
## Residuals   602 21.408 0.03556                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows us the output of the ANOVA analysis. First, We can see that the significance value is 0.000319 (i.e., p=0.0003), which is below 0.05, indicating that there is statistically significant difference in mean TBL level between at least two drug doses. Unfortunately, the F-test cannot tell us which of the specific drug doses differed from other.

The Levene test for variance homogeneity confirms that our ANOVA result satisfies the Assumption N°6

aov(TBL.M ~ dose, data=data)%>%car::leveneTest()

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.1754 0.9131
##       602

Effect size of ANOVA

The effect size for ANOVA is measured by two metrics: Omega squared and Eta squared. They could be manually calculated as follow:

effectsize=function(output){
 out=output%>%summary()
 df=out[[1]]$Df[1]
 SS=out[[1]]$`Sum Sq`[1]
 MS=out[[1]]$`Mean Sq`[1]
 SSR=out[[1]]$`Sum Sq`[2]   
 MSR=out[[1]]$`Mean Sq`[2]
 eta2=SS/(SS+SSR)
 omega2<- df*(MS-MSR) / (SS + SSR + MSR)
 cbind(Effectsizes=c("Eta_squared","Omega_squared"),Estimated=c(eta2,omega2))
}

output=aov(TBL.M ~ dose, data=data)

effectsize(output)

##      Effectsizes     Estimated           
## [1,] "Eta_squared"   "0.0318832046673342"
## [2,] "Omega_squared" "0.0270152573079852"

Step 3: Multiple Comparisons Table

Our thrid step consists of performing some pariwise comparisons. As mentioned above, this could be done via a pairwise post-hoc test likes Bonferroni, or by analysing the planned contrast matrix:

Bonferroni Post-hoc test

pairwise.t.test(data$TBL.M, data$dose,p.adjust.method="bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$TBL.M and data$dose 
## 
##   A       B     C    
## B 0.285   -     -    
## C 0.206   1.000 -    
## D 6.3e-05 0.098 0.141
## 
## P value adjustment method: bonferroni

The Bonferroni post-hoc test’s result indicates that the difference in mean TBL level was only significative between the Highest (D) and Lowest (A) doses. No significant difference was found for the intermediate doses (B,C).

An alternative method is Tukey posthoc test:

aov(TBL.M ~ dose, data=data)%>%TukeyHSD()

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = TBL.M ~ dose, data = data)
## 
## $dose
##            diff          lwr        upr     p adj
## B-A 0.043239481 -0.012862272 0.09934123 0.1945352
## C-A 0.046193976 -0.009907777 0.10229573 0.1474914
## D-A 0.095203892  0.040008990 0.15039879 0.0000621
## C-B 0.002954495 -0.053520031 0.05942902 0.9991256
## D-B 0.051964411 -0.003609348 0.10753817 0.0764662
## D-C 0.049009915 -0.006563843 0.10458367 0.1057793

aov(TBL.M ~ dose, data=data)%>%TukeyHSD()%>%plot()

First, the TukeyHSD function provides mean, as well as upper and lower limits of confidence interval for difference in TBL level between pairs of drug doses. Only the difference between Highest (D) and Lowest (A) doses could be considered significative, at a significance level of 0.0001, as the 95%CI does not contain zero (both lwr and upr limits were positive).

Planned contrast analysis

Compared to the conventional post-hoc tests, a planned contrast analysis is much more flexible, as it allows us to test every possible hypothesis. For example, the following contrast matrix supports not only simple contrasts (between two drug doses) but also the complexe hypothesis, likes: whether the mean TBL in Highest dose (D) differs significantly to the mean TBL of 3 lower doses (A, B and C). In a similar way, we can test our hypothesis about the difference in TBL between Upper dose groups (C+D) and Lower dose groups (A+B). All we need to do is setting the weight for contrasts.

cntrMat <- rbind("D vs A"=c( -1,   0,   0,  1),
                 "D vs B"=c( 0,   -1,   0,  1),
                 "D vs C"=c( 0,   0,   -1,  1),
                 "(BCD) vs A"=c( -1,   1/3,   1/3,  1/3),
                 "((CD) vs (AB))"   =c( -1/2,   -1/2,   1/2,  1/2),
                 "D vs (ABC)"    =c( -1/3,   -1/3,   -1/3,  1),
                 "(CD) vs A"    =c( -1,  0,   1/2,  1/2),
                 "(CD) vs B"    =c( 0,  -1,   1/2,  1/2)
                 )

library(multcomp)                 
out=aov(TBL.M ~ dose, data=data)

summary(glht(out,linfct=mcp(dose=cntrMat),alternative="two.sided"),test=adjusted("bonferroni"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = TBL.M ~ dose, data = data)
## 
## Linear Hypotheses:
##                     Estimate Std. Error t value Pr(>|t|)    
## D vs A == 0          0.09520    0.02142   4.444 8.42e-05 ***
## D vs B == 0          0.05196    0.02157   2.409  0.13040    
## D vs C == 0          0.04901    0.02157   2.272  0.18754    
## (BCD) vs A == 0      0.06155    0.01767   3.482  0.00426 ** 
## ((CD) vs (AB)) == 0  0.04908    0.01533   3.202  0.01148 *  
## D vs (ABC) == 0      0.06539    0.01745   3.748  0.00157 ** 
## (CD) vs A == 0       0.07070    0.01872   3.777  0.00139 ** 
## (CD) vs B == 0       0.02746    0.01888   1.454  1.00000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)

Based on this result, we could say that:

The Highest Dose (D) results in higher TBL level, compared to all lower doses (A,B and C)
The Upper doses (C and D) also represent a higher TBL level than that in Lower doses (A and B)
The mean TBL in pooled group of upper doses (B,C and D) was significantly higher than that in lowest dose (A)

Step 4: Bootstraping ANOVA analysis

This is an advance technique that implies skills in function writing and extracting information from the output and therefore not mandatory for a routine analysis. However, Bootstraping is very helpful if you have a limited sample size and/or outliers.

First, we should write down a statistical function called anovaboot, it will provide the F-test, etasquared and Omegasquared for our resampled data.

The we call the boot package and perform a bootstrap with our function:

anovaboot=function(formula,data,i){
  drs= data[i,]
  out=summary(aov(lm(formula=formula,data=drs)))
  df=out[[1]]$Df[1]
  SS=out[[1]]$`Sum Sq`[1]
  MS=out[[1]]$`Mean Sq`[1]
  SSR=out[[1]]$`Sum Sq`[2]   
  MSR=out[[1]]$`Mean Sq`[2]
  Fstat=out[[1]]$`F value`[1]
  p=out[[1]]$`Pr(>F)`[1]
  eta2=SS/(SS+SSR)
  omega2<- df*(MS-MSR) / (SS + SSR + MSR)
  return=c(SS,MS,Fstat,eta2,omega2,p)
}


#Bootstrap

library(boot)

set.seed(123)
res=boot(statistic=anovaboot,formula="TBL.M~dose",data,R=1000)%>%.$t%>%as_tibble()

names(res)=c("SSq","MSq","F_stats","Eta_sq","Omega_sq","P_value")

Hmisc::describe(res)

## res 
## 
##  6  Variables      1000  Observations
## ---------------------------------------------------------------------------
## SSq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.7959   0.3606   0.3315   0.4091 
##      .25      .50      .75      .90      .95 
##   0.5712   0.7594   0.9889   1.2308   1.3731 
## 
## lowest : 0.1377494 0.1459124 0.1749832 0.1761131 0.1785651
## highest: 1.8777922 1.9456574 1.9649548 1.9666145 2.1809989
## ---------------------------------------------------------------------------
## MSq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1   0.2653   0.1202   0.1105   0.1364 
##      .25      .50      .75      .90      .95 
##   0.1904   0.2531   0.3296   0.4103   0.4577 
## 
## lowest : 0.04591646 0.04863747 0.05832773 0.05870438 0.05952171
## highest: 0.62593073 0.64855248 0.65498492 0.65553816 0.72699964
## ---------------------------------------------------------------------------
## F_stats 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1    7.536    3.459    3.146    3.842 
##      .25      .50      .75      .90      .95 
##    5.344    7.194    9.347   11.689   13.088 
## 
## lowest :  1.245522  1.405630  1.610905  1.640770  1.720563
## highest: 17.134346 17.226800 17.590400 19.705958 20.257045
## ---------------------------------------------------------------------------
## Eta_sq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  0.03598  0.01594  0.01544  0.01878 
##      .25      .50      .75      .90      .95 
##  0.02594  0.03461  0.04451  0.05504  0.06123 
## 
## lowest : 0.006168634 0.006956073 0.007963836 0.008110281 0.008501343
## highest: 0.078669724 0.079060654 0.080594869 0.089421079 0.091692491
## ---------------------------------------------------------------------------
## Omega_sq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1  0.03113    0.016  0.01051  0.01387 
##      .25      .50      .75      .90      .95 
##  0.02105  0.02975  0.03968  0.05026  0.05646 
## 
## lowest : 0.001213982 0.002004043 0.003015165 0.003162099 0.003554466
## highest: 0.073965177 0.074357504 0.075897202 0.084755111 0.087034722
## ---------------------------------------------------------------------------
## P_value 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##      1000         0      1000         1   0.00546  0.009922 2.736e-08 
##       .10       .25       .50       .75       .90       .95 
## 1.876e-07 4.782e-06 9.470e-05 1.226e-03 9.634e-03 2.474e-02 
## 
## lowest : 1.616239e-12 3.386410e-12 5.874875e-11 9.614896e-11 1.089908e-10
## highest: 1.615548e-01 1.787836e-01 1.856657e-01 2.400944e-01 2.923597e-01
## ---------------------------------------------------------------------------

res%>%gather(SSq:P_value,key="Metric",value="Value")%>%ggplot(aes(x=Value,fill=Metric))+geom_histogram(alpha=0.7,color="black",show.legend = F)+facet_wrap(~Metric,ncol=2,scales="free")+theme_bw()

etap=res%>%ggplot(aes(x=Eta_sq))+geom_histogram(alpha=0.8,fill="gold",color="black")+theme_bw()+geom_vline(xintercept = mean(res$Eta_sq),color="blue",size=0.8,linetype="dashed")
pp=res%>%ggplot(aes(x=P_value))+geom_histogram(alpha=0.8,fill="gold",color="black")+theme_bw()+geom_vline(xintercept = 0.02,color="red",size=0.8,linetype="dashed")

etap=res%>%ggplot(aes(x=Eta_sq))+geom_histogram(alpha=0.8,fill="gold",color="black")+theme_bw()+geom_vline(xintercept = mean(res$Eta_sq),color="blue",size=0.8,linetype="dashed")
pp=res%>%ggplot(aes(x=P_value))+geom_histogram(alpha=0.8,fill="gold",color="black")+theme_bw()+geom_vline(xintercept = 0.02,color="red",size=0.8,linetype="dashed")

library(gridExtra)

grid.arrange(etap,pp)

res%>%mutate(iteration=c(1:nrow(.)))%>%gather(SSq:P_value,key="Metric",value="Value")%>%ggplot(aes(x=iteration,y=Value,color=Metric))+geom_path(alpha=0.8,size=0.6,show.legend = F)+facet_wrap(~Metric,ncol=2,scales="free")+theme_bw()

The bootstrap result could be explored by either descriptive statistics or graphical tools

Those results, based on 1000 iterations confirm the significance level of ANOVA F-test. However, the Bootstrap 95%CI show a relatively low effect-size (Etasq = 0.02 to 0.06, which is lower than 0.12, the medium level based on Cohen’s rule).

The bootstrap analysis could also be performed on a Tukey post hoc test, as follows:

tukeyboot=function(formula,data,i){
  drs= data[i,]
  out=aov(lm(formula=formula,data=drs))
  tk=TukeyHSD(out)
  return=cbind(d1=tk$dose[1,1],
               d2=tk$dose[2,1],
               d3=tk$dose[3,1],
               d4=tk$dose[4,1],
               d5=tk$dose[5,1],
               d6=tk$dose[6,1],
               p1=tk$dose[1,4],
               p2=tk$dose[2,4],
               p3=tk$dose[3,4],
               p4=tk$dose[4,4],
               p5=tk$dose[5,4],
               p6=tk$dose[6,4])
}

restk=boot(statistic=tukeyboot,formula="TBL.M~dose",data,R=1000)%>%.$t%>%as_tibble()

restkdif=restk[,c(1:6)]
names(restkdif)=c("BA","CA","DA","CB","DB","DC")
restkdif=restkdif%>%gather(BA:DC,key="Pair",value="Value")%>%mutate(Para="Meandif")

restkp=restk[,c(7:12)]
names(restkp)=c("BA","CA","DA","CB","DB","DC")
restkp=restkp%>%gather(BA:DC,key="Pair",value="Value")%>%mutate(Para="P_value")

restklong=rbind(restkdif,restkp)

restklong%>%ggplot(aes(x=Value,fill=Pair))+geom_histogram(color="black",show.legend = F)+facet_grid(Pair~Para,scales="free")+theme_bw()

We just rebuilt a pairwise comparison table using Tukey posthoc test, based on 1000 resampled datasets (i.e 606,000 observations). Interestingly, those results indicate that the mean TBL is not only significantly different between Highest and Lowest doses, but also between the D vs B and D vs C dose groups. This is the magic of bootstraping.

Step 5: Reporting the ANOVA results

Based on these tables and figures, we could report the results of our study as follows:

“The higher drug doses had a significantly effect on the post-treatment TBL level, as determined by one-way ANOVA (F(3,606) = 6.318,p=0.0003, Eta-squared =0.03). The contrast analysis indicates a significant contrast not only between the highest dose and lowest dose groups (D vs A, p=0.0001), but also the Upper dose and Lower dose groups (C+D vs A+B, p=0.009). There was no statistically significant difference between the intermediate and low dose groups.”

Exercise

Selecting any outcome among the remaining parameters, including Alkaline phosphatase, Alanine aminotransferase, Aspartate aminotransferase, then perform your own ANOVA study on this parameter.
Evaluate the change in liver function indices from Baseline to Post-treatment points, then perfom an ANOVA on those changes.

Thank you

ESMS - Case study 1: One-way ANOVA

Lê Đông Nhật Nam

22 Feb 2017

Case study 1: One-way ANOVA