Session of Nov 19, 2013: ggplot2, stat_summary

placing text on a figure

How to place asteriks from a t-test above each data point in a scatter plot. A better options than multiple t-test is given as well.

# load the libraries
library(plyr)
library(ggplot2)

Using some measured data from a glasshouse experiment.

df <- read.csv("t_test.csv", header = TRUE, na.strings = "NA")

# changing some names
names(df) <- c("day", "species", "treatment", "measured")

#getting rid of the day before 
df <- df[df$day != 0, ]

t-tests

# quick t-test using the data frame
# just to show the result of a t-test
t.test(df$measured[df$treatment == "control"], 
       df$measured[df$treatment == "stress"])

## 
##  Welch Two Sample t-test
## 
## data:  df$measured[df$treatment == "control"] and df$measured[df$treatment == "stress"]
## t = 15.86, df = 386.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   8.644 11.090
## sample estimates:
## mean of x mean of y 
##    17.329     7.462


# to do a t-test for each species and each day, we create a fucntion that aggregates some values from the t-test.
my.t <- function(x, y) {
  t <- t.test(x, y)
  p <- t$p.value
  x.mean <- t$estimate[[1]]
  y.mean <- t$estimate[[2]]
  # assembling the output as vector with named elements
  out <- c(p_value      = p, 
           control_mean = x.mean, 
           stress_mean  = y.mean)
  return(out)
}

The function can be applied to each combination of species and day, and the results can be aggregated in another data frame. We use the split-apply-combine approach from the package plyr that we used before. See http://www.jstatsoft.org/v40/i01/paper for details.

# here using dlply with your function and the original data frame
# "dlply" means the input is a data frame, and the output is a data frame
# plyr does the heavy lifting in the background that puts the output vector of our my.t function into a data frame with all the species and treatment information in place
# the function here is sill hardcoded to the specific data set
# try to make it more flexible!
p.table <- ddply(df,
             .(species, day),
             function(x) my.t(x$measured[x$treatment == "control"], 
                              x$measured[x$treatment == "stress"]))

# have a look at the resulting data frame "p.table"
class(p.table)

## [1] "data.frame"

head(p.table)

##     species day p_value control_mean stress_mean
## 1 A. aneura   1 0.06906        25.26       18.74
## 2 A. aneura   4 0.33561        24.58       20.68
## 3 A. aneura   7 0.34788        22.72       19.74
## 4 A. aneura   9 0.01609        25.68       18.42
## 5 A. aneura  11 0.01408        24.40       15.32
## 6 A. aneura  14 0.21478        22.56       18.16

str(p.table)

## 'data.frame':    42 obs. of  5 variables:
##  $ species     : Factor w/ 2 levels "A. aneura","A. melanoxylon": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day         : int  1 4 7 9 11 14 16 18 21 23 ...
##  $ p_value     : num  0.0691 0.3356 0.3479 0.0161 0.0141 ...
##  $ control_mean: num  25.3 24.6 22.7 25.7 24.4 ...
##  $ stress_mean : num  18.7 20.7 19.7 18.4 15.3 ...

Plot

Now that we have the t-tests, it is time to assemble a graph to see how the mean values hold up.

ggplot2 provides a geom stat_summary that calculates some statistics on the fly. It uses functions from the package Hmisc for that. See ?hmisc for details on these wrapper functions in ggplot2:

mean_cl_boot(x, ...)

mean_cl_normal(x, ...)

mean_sdl(x, ...)

median_hilow(x, ...)

From the helpfile for ?smean.sd:

smean.cl.normal computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution.

smean.sdl computes the mean plus or minus a constant times the standard deviation.

smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality. These functions all delete NAs automatically.

smedian.hilow computes the sample median and a selected pair of outer quantiles having equal tail areas.

# 
p <- ggplot(df, aes(x = day, y = measured))
  p <- p + stat_summary(aes(colour = treatment), 
                        fun.data = "mean_sdl", 
                        mult = 1)
  p <- p + facet_grid(species ~ . )
p

plot of chunk unnamed-chunk-5

However, for some reason, standard error is requested as error bars for this grpah. Standard error is not implemented in R, so we do it ourselves. See http://cran.r-project.org/doc/manuals/R-intro.html

stderr <- function(x) {
          sqrt(var(x[!is.na(x)]) / length(x[!is.na(x)]))
}

# Then create a wrapper to *stderr* to make it compatible with *stat_summary*

# see values shown in ?stat_summary
my.stderr <- function(x) {
             meany <- mean(x)
             ymin  <- mean(x) - stderr(x)
             ymax  <- mean(x) + stderr(x)
             # assemble the named output
             out <- c(y = meany, ymin = ymin, ymax = ymax)
             return(out)
}

# the graph:
p <- ggplot(df, aes(x = day, y = measured))
  p <- p + stat_summary(aes(colour = treatment),
                        fun.data = "my.stderr", 
                        geom = "errorbar", width = 0.5)
  # geom_errorbar does not add a point to indicate the mean value
  p <- p + stat_summary(aes(colour = treatment),
                        fun.y = "mean", 
                        geom = "point")
  p <- p + facet_grid(species ~. )
p

plot of chunk unnamed-chunk-6

Adding asteriks to the graph that indicate the result of a treatment between treatments.

First, read a recent discussion of p-values regarding their uselessness: http://theconversation.com/the-problem-with-p-values-how-significant-are-they-really-20029

p.table$stars <- NA # to indicate no effect, looks cleaner in the plot
p.table$stars[p.table$p_value <=  0.05]  <- "*"
p.table$stars[p.table$p_value <=  0.01]  <- "**"
p.table$stars[p.table$p_value <= 0.001]  <- "***"

The asteriks are text-elements that we place on top of the graph. There are two options for doing that. Either a fixed position, or dependent on the existing error bars. Just to make sure there is no overlap of text and lines or bars.

# here we add a new layer to the basic plot with geom_text
# the y-position of the test is fixed at the y-intercept of 32
# The data for this layer are from a different data frame.
# The source of the data is specified with the "data = " argument.
p + geom_text(aes(x = day, y = 32, label = stars),
                    data = p.table, size = rel(5))

## Warning: Removed 5 rows containing missing values (geom_text).
## Warning: Removed 3 rows containing missing values (geom_text).

plot of chunk unnamed-chunk-8

Better to place the asteriks closer to the data points. To do that, we calculate the position.

# calculate the position of the stars in the plots - slightly above the error bars.
# only the positions of the control data is needed, as they are above the stressed treatment
# using control and stressed plants will give the wrong mean and standard deviation for this case. We only need the control, as we want the asteriks above all points in the graph.
# We calculate the position based on sd here to give some offset to the actual error bars.
positions <- dlply(df[df$treatment == "control", ],
                   .(species, day),
                   function(x) {
                   my.position = mean(x$measured) + sd(x$measured)
                   })

# can't put a list in a data frame. Have to "unlist" them first, then put them in
p.table$position <- unlist(positions)

# Then, doing the graph again
p + geom_text(aes(x = day, y = position, label = stars),
                  data = p.table, size = rel(5))

## Warning: Removed 5 rows containing missing values (geom_text).
## Warning: Removed 3 rows containing missing values (geom_text).

plot of chunk unnamed-chunk-9

Multiple tests are not really helpful to present a message. The graph is cluttered with asteriks.

Confidence intervals can be used to present the same data without the neede to do lots of t-tests. Let the data speak for themselves. Not every statistical test is helpful. And confidence intervalls are implemented in ggplot2 right away. No need to write another function.

# we add the confidence intervalls on top of the existing plot.
p + stat_summary(aes(fill = treatment), 
                 fun.data = "mean_cl_normal", 
                 geom = "ribbon", alpha = 0.5)

plot of chunk unnamed-chunk-10

Modifying ggplot

As a full example how flexible ggplot2 is, here a complex graph.

p <- ggplot(df, aes(x = day, y = measured))
  p <- p + stat_summary(aes(fill = treatment), 
                        fun.data = "mean_cl_normal", 
                        geom = "ribbon", alpha = 0.5)

  p <- p + stat_summary(aes(colour = treatment), 
                        fun.y = "mean", mult = 1, geom = "line")

  p <- p + stat_summary(aes(shape = treatment, fill = treatment), 
                        fun.y = "mean", geom = "point")

  # customise shapes, colours and fills to match black and white
  p <- p + scale_shape_manual(values = c(21, 23))
  p <- p + scale_colour_manual(values = c("black", "black"))
  p <- p + scale_fill_manual(values = c("blue", "red"))

  # Add the line that indicates start of stress
  # moved the vertical line by half a day
  p <- p + geom_vline(aes(xintercept = 28.5), linetype = "dashed")

  # add the asteriks above each measurement day - they are not needed anymore when the confidence bands are shown. Just for illustration.
  p <- p + geom_text(aes(x = day, y = position + 1, label = stars),
                     data = p.table, size = rel(5))

  p <- p + facet_grid(species ~ .)

  # Beautification
  p <- p + labs(x = "Day",
                y = expression(A~"["~mu*mol~m^-2*s^-1~"]"))
  p <- p + theme_bw()
  p <- p + theme(panel.grid.major = element_blank(), 
                 panel.grid.minor = element_blank(),
                 strip.background = element_rect(fill = "white"),
                 strip.text.y = element_text(size = rel(0.8),
                                             face = "italic", angle = -90),
                 legend.title=element_blank(),
                 legend.position = c(0.1, 0.6),
                 legend.key = element_blank())
p

## Warning: Removed 5 rows containing missing values (geom_text).
## Warning: Removed 3 rows containing missing values (geom_text).

plot of chunk unnamed-chunk-11