Data Visualization Demo

This is a quick demo of some useful plotting tips in R to make visualization of data more transparent and more publication-friendly. One of my favorite plotting tricks is to combine raw data with summary statistics on one plot. This way, as an author, we can’t hide how messy data under tidy standard error bars.

We’ll mostly be using the package ggplot2. I will cover some of the basics of plotting with ggplot2 including changes axis labels, axis minimum and maximum values, background color, etc., but I assume you’ve used ggplot2 before. I will also show you how to jitter your data and add transparency to it if you have lots of overlapping data.

Make fake data

First, let’s generate some fake data that comes from four treatment groups.

# create four treatments with different means 
treatmentA <- rnorm(50, mean=20, sd=2)
treatmentB <- rnorm(75, mean=19.5, sd=2)
treatmentC <- rnorm(75, mean=12, sd=2)
treatmentD <- rnorm(50, mean=5, sd=1)

# make a blank data frame 
raw_data <- data.frame(treatment=character(250),
                       value=numeric(250)) 

# populate it 
raw_data$treatment <- c(rep("A",50),rep("B",75),rep("C",75),rep("D",50))
raw_data$value <- c(treatmentA,treatmentB,treatmentC,treatmentD)

Summary statistics

Let’s calculate a new data frame with the summary statistics of your raw data. We’ll use the package plyr. We’ll calculate the number of observations (N), the mean, standard deviation, and standard error for each treatment group.

library(plyr)

# specify the summary stats you want 
summary_data <- ddply(raw_data,
                      c("treatment"),
                      summarise,
                      N = length(value),
                      avg_value = mean(value),
                      sd = sd(value),
                      se = sd/sqrt(N)
                      )

# check out the summary stats 
summary_data

##   treatment  N avg_value        sd        se
## 1         A 50 19.800152 1.8567385 0.2625825
## 2         B 75 19.355315 1.8081966 0.2087926
## 3         C 75 11.985342 1.9616185 0.2265082
## 4         D 50  4.910724 0.9801721 0.1386173

Plotting raw data

First, we’ll just plot just the raw data. We’ll use the package ggplot2.

library(ggplot2)

raw_plot <- ggplot(raw_data, aes(x=treatment,y=value)) # raw_data is the name of the data you're using 
raw_plot <- raw_plot + geom_point() # you are plotting points 
raw_plot <- raw_plot + xlab("Treatment") # label the axes 
raw_plot <- raw_plot + ylab("Value")
raw_plot

Most journals don’t want a grey background, so let’s change the background to white using the command theme_classic(). We’ll also make all of the labelling (axes, etc.) a little larger. base_size tells R how big to make the labels.

raw_plot <- ggplot(raw_data, aes(x=treatment,y=value))
raw_plot <- raw_plot + geom_point()
raw_plot <- raw_plot + xlab("Treatment")
raw_plot <- raw_plot + ylab("Value")
raw_plot <- raw_plot + theme_classic(base_size=18)
raw_plot

If you want to adjust y-axis you can do that with ylim.

raw_plot <- ggplot(raw_data, aes(x=treatment,y=value))
raw_plot <- raw_plot + geom_point()
raw_plot <- raw_plot + xlab("Treatment")
raw_plot <- raw_plot + ylab("Value")
raw_plot <- raw_plot + ylim(0,30)
raw_plot <- raw_plot + theme_classic(base_size=18)
raw_plot

You might also want to re-arrange the ordering of treatments on the x-axis, since ggplot will automatically alphabetize the x-axis labels.

# re-order the treatments so the order is how you would like 
raw_data$treatment <- factor(raw_data$treatment, levels=c("D", "C", "B", "A"))

# we'll also do this for the summary data and use it later on 
summary_data$treatment <- factor(summary_data$treatment, levels=c("D", "C", "B", "A"))

# then re-plot things 
raw_plot <- ggplot(raw_data, aes(x=treatment,y=value))
raw_plot <- raw_plot + geom_point()
raw_plot <- raw_plot + xlab("Treatment")
raw_plot <- raw_plot + ylab("Value")
raw_plot <- raw_plot + ylim(0,30)
raw_plot <- raw_plot + theme_classic(base_size=18)
raw_plot

There’s a lot of overlappying points, so let’s spread them out a little bit horizontally. This is called “jittering.” The w parameter specifies how wide you want to jitter the points left and right. Use h to jitter up and down.

raw_plot <- ggplot(raw_data, aes(x=treatment,y=value))
raw_plot <- raw_plot + geom_point(position = position_jitter(w=0.2))
raw_plot <- raw_plot + xlab("Treatment")
raw_plot <- raw_plot + ylab("Value")
raw_plot <- raw_plot + ylim(0,30)
raw_plot <- raw_plot + theme_classic(base_size=18)
raw_plot

We can also add a bit of transparency to the points so we can see overlapping points. We’ll use the parameter alpha in geom_point. We’ll also use colour to make the points even lighter.

raw_plot <- ggplot(raw_data, aes(x=treatment,y=value))
raw_plot <- raw_plot + geom_point(position = position_jitter(w=0.2), alpha=0.4, colour="grey58")
raw_plot <- raw_plot + xlab("Treatment")
raw_plot <- raw_plot + ylab("Value")
raw_plot <- raw_plot + ylim(0,30)
raw_plot <- raw_plot + theme_classic(base_size=18)
raw_plot

You can see that overlapping points appear darker than single points.

Adding summary statistics to raw data

Now, we’re going to add the summary statistics (mean and standard deviation) over these data. We’re going to make a new plot called combine_plot, but it will start with the raw_plot that we’ve been working with.

combined_plot <- raw_plot + geom_point(data=summary_data, aes(x=treatment,y=avg_value), colour="black", size=3)
combined_plot <- combined_plot + geom_errorbar(data = summary_data,
                                               aes(x = treatment, y = avg_value,
                                                   ymax = avg_value + sd,ymin = avg_value - sd), 
                                               width=0.1, colour="black")
combined_plot

You can add the sample sizes to the plot if you want. Use the command annotate and specify the positions with x and y values.

combined_plot <- raw_plot + geom_point(data=summary_data, aes(x=treatment,y=avg_value), colour="black", size=3)
combined_plot <- combined_plot + geom_errorbar(data = summary_data,
                                               aes(x = treatment, y = avg_value,
                                                   ymax = avg_value + sd,ymin = avg_value - sd), 
                                               width=0.1, colour="black")

combined_plot <- combined_plot + annotate("text",x="D",y=0,label="n=50")
combined_plot <- combined_plot + annotate("text",x="C",y=0,label="n=75")
combined_plot <- combined_plot + annotate("text",x="B",y=0,label="n=75")
combined_plot <- combined_plot + annotate("text",x="A",y=0,label="n=50")
combined_plot

Statistics and post-hoc labels

Now, let’s say you do an anova and a post-hoc Tukey’s test to see which treatments differ:

anova <- aov(value ~ treatment, data=raw_data)
summary(anova)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## treatment     3   8239    2746   909.6 <2e-16 ***
## Residuals   246    743       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = value ~ treatment, data = raw_data)
## 
## $treatment
##           diff        lwr       upr    p adj
## C-D  7.0746181  6.2540387  7.895197 0.000000
## B-D 14.4445912 13.6240119 15.265171 0.000000
## A-D 14.8894284 13.9905288 15.788328 0.000000
## B-C  7.3699731  6.6360247  8.103922 0.000000
## A-C  7.8148103  6.9942310  8.635390 0.000000
## A-B  0.4448372 -0.3757422  1.265417 0.499137

There is a significant effect of treatment on the response value. Treatments A and B are not significantly different from each other (p = 0.629). All other pairs of treatments are significantly different from each other, so let’s add the post-hoc labels to the plot. Treatments A and B will share a symbol and B and C will each have unique symbols. In this example, I added se+3 to each avg_value to buffer the posthoc labels from the data points. The exact values will depend on how variable your data are.

The finished product

# Add the labels to the summary_data data frame 
summary_data$labels <- c("a","a","b","c")

combined_plot <- raw_plot + geom_point(data=summary_data, aes(x=treatment,y=avg_value), colour="black", size=3)
combined_plot <- combined_plot + geom_errorbar(data = summary_data,
                                               aes(x = treatment, y = avg_value,
                                                   ymax = avg_value + sd,ymin = avg_value - sd), 
                                               width=0.1, colour="black")

combined_plot <- combined_plot + annotate("text",x="D",y=0,label="n=50")
combined_plot <- combined_plot + annotate("text",x="C",y=0,label="n=75")
combined_plot <- combined_plot + annotate("text",x="B",y=0,label="n=75")
combined_plot <- combined_plot + annotate("text",x="A",y=0,label="n=50")
combined_plot <- combined_plot + geom_text(data=summary_data,aes(x=treatment,y=avg_value+se+3,label=labels))
combined_plot

Strengths of this plot:

No unnessary ticks, grids, marks, or shading
Raw data are visible to the viewer
Sample sizes reported
Results of statistical analysis reported

How else would you change this plot?