I came across the article below some time ago. It makes the valid point that bar graphs without raw data can be very misleading. The paper provides templates to reproduce their alternatives, which include visualization of the raw data, in Excel and Prism.
http://dx.doi.org/10.1371/journal.pbio.1002128
https://www.ctspedia.org/do/view/CTSpedia/TemplateTesting
I personally prefer plotting in R, so I have written some code for making similar graphs with the ggplot2 package, which I wanted to share here. I am also curious to hear suggestions for improvement or other approaches that others are using.
require(ggplot2)
## Loading required package: ggplot2
R.Version()$version.string
## [1] "R version 3.2.2 (2015-08-14)"
packageVersion("ggplot2")
## [1] '1.0.1'
Let’s first generate some data. For this example I set up a situation in which we study the effect of a treatment on a measure in the field and in the lab, so this is a 2x2 factorial design. Simpler graphs can be generated by simplifying the code below (mainly by removing the ‘fill = factor(treatment)’ argument).
measure <- c(rnorm(10, mean = 9, sd = 1), rnorm(20, mean = 6, sd = 2),
             rnorm(10, mean = 9, sd = 1), rnorm(20, mean = 7, sd = 2))
location <- rep(c("Field", "Lab"), each = 30)
treatment <- rep(c(rep("Treatment 1", 20), rep("Treatment 2", 10)), 2)
df <- data.frame(location, treatment, measure)
str(df)
## 'data.frame': 60 obs. of 3 variables:
## $ location : Factor w/ 2 levels "Field","Lab": 1 1 1 1 1 1 1 1 1 1 ...
## $ treatment: Factor w/ 2 levels "Treatment 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ measure : num 8.25 8.08 9.68 8.82 8.73 ...
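Before plotting, it can help to confirm how many observations fall in each cell of the 2x2 design. Below is a minimal base R sketch of such a check. The fixed seed and the object names cell_counts and cell_means are assumptions added for illustration and are not part of the code above.

```r
# Hypothetical re-creation of the simulated data with a fixed seed,
# so that the check is reproducible (the seed is an added assumption).
set.seed(1)
measure <- c(rnorm(10, mean = 9, sd = 1), rnorm(20, mean = 6, sd = 2),
             rnorm(10, mean = 9, sd = 1), rnorm(20, mean = 7, sd = 2))
location <- rep(c("Field", "Lab"), each = 30)
treatment <- rep(c(rep("Treatment 1", 20), rep("Treatment 2", 10)), 2)
df <- data.frame(location, treatment, measure)

# Number of observations in each location x treatment cell
cell_counts <- table(df$location, df$treatment)
cell_counts

# Mean of the measure in each cell
cell_means <- aggregate(measure ~ location + treatment, data = df, FUN = mean)
cell_means
```

The counts show that the design is unbalanced: 20 observations per location for Treatment 1 and 10 for Treatment 2.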
The general plot command:
plot<-ggplot(df,aes(x= factor(location),y=measure,fill = factor(treatment)))
A series of variations of the graph building upon this general plot:
plot1<-plot+
theme_classic(base_size = 20)+
theme(legend.position="top")+
geom_boxplot(outlier.colour = NA,position=position_dodge(0.9))+
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.8),binwidth = diff(range(df$measure)/150),dotsize=4)+
labs(x = "Location", y="Measure",fill = "Treatment",title="1: box plot with raw data")+
scale_fill_manual(values=c("darkgray", "White"))
plot1
The grouping/binning of scores along the y-axis can be modified with the binwidth argument of geom_dotplot. Because binwidth is set to the range of the data divided by a constant, larger values in the denominator result in smaller data points and a higher resolution of the raw data. The position_dodge() calls can be used to adjust the spacing and position of the raw data and the boxplots.
The ‘outlier.colour=NA’ argument in geom_boxplot suppresses plotting of boxplot outliers, since these are already shown as raw data points.
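To make the effect of the denominator concrete, here is a small sketch that computes the bin width for two denominators. The helper name binwidth_for and the example values are made up for illustration.

```r
# Hypothetical helper: bin width as a fraction of the data range,
# mirroring diff(range(df$measure)/150) in the plot code.
binwidth_for <- function(x, denominator) {
  diff(range(x)) / denominator
}

x <- c(2, 4, 7, 12)   # made-up values with a range of 10
binwidth_for(x, 100)  # 0.1
binwidth_for(x, 150)  # narrower bins than with a denominator of 100
```

In geom_dotplot the dot diameter is expressed relative to binwidth (via dotsize), so a narrower bin also gives smaller dots unless dotsize is increased to compensate.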
plot2<-plot+
theme_classic(base_size = 20)+
theme(legend.position="top")+
geom_violin(position=position_dodge(0.9),trim=FALSE)+
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.75),binwidth = diff(range(df$measure)/100),dotsize=4)+
stat_summary(fun.y=mean, geom="point", size = 4, colour="red",position=position_dodge(0.9))+
labs(x = "Location", y="Measure",fill = "Treatment",title="2: violin plot with mean")+
scale_fill_manual(values=c("darkgray", "White"))
plot2
## ymax not defined: adjusting position using y instead
# Mean with error bars of one standard deviation on either side
mean_sd <- function(x) {
  m <- mean(x)
  ymin <- m - sd(x)
  ymax <- m + sd(x)
  return(c(y = m, ymin = ymin, ymax = ymax))
}
plot3<-plot+
theme_classic(base_size = 20)+
theme(legend.position="top")+
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.75),binwidth = diff(range(df$measure)/100),dotsize=4)+
stat_summary(fun.data=mean_sd, shape=21, size = 1,colour="red",position=position_dodge(0.9))+
labs(x = "Location", y="Measure",fill = "Treatment",title="3: mean +/- sd")+
scale_fill_manual(values=c("darkgray", "White"))
plot3
# Mean with error bars of one standard error on either side
mean_se <- function(x) {
  m <- mean(x)
  ymin <- m - sqrt(var(x) / length(x))
  ymax <- m + sqrt(var(x) / length(x))
  return(c(y = m, ymin = ymin, ymax = ymax))
}
plot4<-plot+
theme_classic(base_size = 20)+
theme(legend.position="top")+
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.75),binwidth = diff(range(df$measure)/100),dotsize=4)+
stat_summary(fun.data=mean_se, shape=21, size = 1,colour="red",position=position_dodge(0.9))+
labs(x = "Location", y="Measure",title="4: mean +/- se")+
scale_fill_manual(name="Treatment",values=c("darkgray", "White"))
plot4
# Mean with an approximate 95% confidence interval (normal quantile 1.96)
mean_ci <- function(x) {
  m <- mean(x)
  ymin <- m - 1.96 * sqrt(var(x) / length(x))
  ymax <- m + 1.96 * sqrt(var(x) / length(x))
  return(c(y = m, ymin = ymin, ymax = ymax))
}
plot5<-plot+
theme_classic(base_size = 20)+
theme(legend.position="top")+
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.75),binwidth = diff(range(df$measure)/100),dotsize=4)+
stat_summary(fun.data=mean_ci, shape=21, colour="red",size = 1,position=position_dodge(0.9))+
labs(x = "Location", y="Measure",title="5: mean +/- 95%CI")+
scale_fill_manual(name="Treatment",values=c("darkgray","white"))
plot5
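The mean_ci function relies on the normal quantile 1.96, which is an approximation when groups are small. As a possible alternative, the sketch below uses the quantile of the t distribution instead. The name mean_ci_t and the level argument are assumptions introduced for illustration.

```r
# Sketch of a t-based alternative to mean_ci (mean_ci_t is a hypothetical name)
mean_ci_t <- function(x, level = 0.95) {
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)
  # Two-sided critical value from the t distribution with n - 1 degrees of freedom
  crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(y = m, ymin = m - crit * se, ymax = m + crit * se)
}

x <- c(1, 2, 3, 4, 5)  # made-up values
mean_ci_t(x)
# The limits should agree with the interval reported by t.test()
t.test(x)$conf.int
```

It should be usable in the same way as mean_ci, that is, passed to stat_summary through fun.data. For a group of 10 observations the critical value is about 2.26 rather than 1.96, so the intervals come out somewhat wider.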
Room for improvement: 1) work out separate settings for the marker size of the means and the error bars in the last three graphs; at the moment the means are a bit too small. 2) Work out separate legends for the raw data and the summary statistics.
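For the first point, one approach that might work is to draw the error bars and the means as two separate stat_summary layers, so that each gets its own size settings. This is an untested sketch rather than a finished solution. The data frame d, the helper mean_sd_df and the chosen sizes are assumptions made for illustration, and the raw data layer is left out to keep it short.

```r
library(ggplot2)

# Hypothetical example data with the same structure as df above
set.seed(1)
d <- data.frame(
  location  = rep(c("Field", "Lab"), each = 30),
  treatment = rep(c(rep("Treatment 1", 20), rep("Treatment 2", 10)), 2),
  measure   = rnorm(60, mean = 8, sd = 2)
)

# Same summary as mean_sd, returned as a one-row data frame
mean_sd_df <- function(x) {
  m <- mean(x)
  data.frame(y = m, ymin = m - sd(x), ymax = m + sd(x))
}

p <- ggplot(d, aes(x = factor(location), y = measure, fill = factor(treatment))) +
  theme_classic(base_size = 20) +
  theme(legend.position = "top") +
  # Error bars drawn on their own layer
  stat_summary(fun.data = mean_sd_df, geom = "errorbar", width = 0.2,
               colour = "red", position = position_dodge(0.9)) +
  # Means drawn on a second layer, with the marker size set independently
  stat_summary(fun.data = mean_sd_df, geom = "point", shape = 21, size = 4,
               colour = "red", position = position_dodge(0.9)) +
  labs(x = "Location", y = "Measure", fill = "Treatment") +
  scale_fill_manual(values = c("darkgray", "white"))
p
```

Using fun.data for both layers avoids depending on the argument name for single-value summary functions, which differs between ggplot2 versions.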