Inspiration for this Publication

Credits

This RPubs document is based on Geoff Cumming’s 2014 article, “The New Statistics, Why and How,” in Psychological Science, 2014, Vol. 25(1) 7–29.

Have access to Penn’s library? Give it a read!

In this document we will create plots like the ones that appear in that article.

Caveats

I will not slavishly follow every graphical nuance, but rather try to show you tips and tricks along the way. Because I build up gradually, it makes sense to start at the beginning and skim your way through, even if you aren’t interested in the first couple of figures.

Do not be intimidated by the length of this document! It is intended to be a tutorial in the use of ggplot, and for that reason I intentionally make mistakes, move incrementally, etc., to show you how each line works. If you would rather skip to the end of any figure, just fast-forward to the ‘Final Code’ section to see the completed R code.

Needed Libraries

library(ggplot2)

Figure 1: Two-sample Estimates

In this figure, we will look at the differences in mean between two groups, create point and interval estimates, and show the significance level.

Skip ahead to Final Code

Skip ahead to Joy’s R Package

Data

Generating Random Normals

Let’s generate some fake data and draw some samples, shall we?

set.seed(42)
myControls <- data.frame(value=rnorm(2000, mean = 30, sd=20))
myData <- data.frame(value=rnorm(2000, mean = 40, sd=20))
cases <- NULL
controls <- NULL
for (i in 1:25) {
  cases[[i]] <- sample(myData$value, size=32)
  controls[[i]] <- sample(myControls$value, size=32)
}

Now we have 25 samples of n=32, each drawn out of 2 normally distributed datasets with mean difference of 10 and standard deviation of 20.

Point and Interval Estimates

How can we plot our sample point and 95% CI? Well, first let’s calculate them!
I’m assuming a T, not normal distribution for calculating our standard error, even though we happen to know the population standard deviation. This is so odd and unlike typical research that I’d rather just do a T test, since that’s what we do in real life.

Still, you could do the Z if you wanted to.

I’ll start by making an estimates data frame. I’ll set “upper” and “lower” to just have NA values, “meanDiff” will be the sample mean differences, “caseSSE” will be the sum of squared errors from the cases in each sample, “controlSSE” will be the sum of squared errors from the controls, and “sd”, blank for now, will be the pooled sample standard deviation.

estimates <- data.frame(lower=NA, 
               meanDiff=sapply(cases, mean)-sapply(controls, mean),
               caseSSE= sapply(cases, function(x) sum((x-mean(x))^2)),
               controlSSE = sapply(controls, function(x) sum((x-mean(x))^2)),
               sd=NA,
               upper=NA)

Now that I’ve got my SSE in place, I can come up with the pooled sd quite easily:

estimates$sd <- sqrt((estimates$caseSSE+estimates$controlSSE)/64)

Now I have to figure out the standard error for each sample.

se <- estimates$sd/sqrt(32)  # basic stats here

Here I choose my significance level to figure out how many se’s I want to go out in either direction.

# we have two tails, so we want the 97.5%, not the 
# 95%, quantile (that gives us 2.5% at the end) to get
# the number of standard deviations away (T-statistic).

tBound <- qt(0.975, df=31)

# if you wanted the Z-statistic, you could:

zBound <-qnorm(0.975)

estimates$lower <- estimates$meanDiff - se*tBound # or zBound
estimates$upper <- estimates$meanDiff + se*tBound # or zBound

Significance

Let’s also consider what our p-values are. There are a couple of ways to do this. Lets use t.test() and extract the p-values from the list (a bit messy, but worth it, because it handles the two-tailed thing easily, unlike some other R commands!). Keep in mind that our null hypothesis is that the mean is 0. Since our actual population mean is 10 (not that we know this in real life), we expect that we’ll find some evidence to discard our null hypothesis on at least some samples. Let’s see if that’s the case.

tTest <- mapply(t.test, x=controls, y=cases)
# flip it and make it a data frame
tTest <- as.data.frame(t(tTest))
estimates$p <- unlist(tTest$p.value)
estimates$p <- round(estimates$p, 4)

Let’s plot the point estimates first, then worry about the confidence intervals. To make things easier we’ll note which sample is which by adding a sampleNum column in our data frame.

estimates$sampleNum <- as.numeric(row.names(estimates))

Plotting

Point Estimates

Let’s plot the mean on the x axis and the sample number on the y:

plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
  geom_point()
plot2

Error Bars

How can we add error bars?

plot2 <- plot2 + geom_errorbarh(aes(xmin=lower, 
                                    xmax=upper) )
plot2

See how I can take my existing defined plot and just add a layer to it?

Colors

Cool, but those error bars are a bit heavy, don’t you think? Let’s try tweaking the colors a bit and combine it all:

plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
  geom_point(color="red") +
  geom_errorbarh(aes(xmin=lower, 
                 xmax=upper), color="grey" )
plot2

Layer Order

Oh, snap, the grey error bars are in front of the red dots. Let’s fix that. We’ll tweak the colors, too.

plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
    geom_errorbarh(aes(xmin=lower, 
                 xmax=upper), color="darkgrey" ) +
  geom_point(color="#AA3322")

plot2

Vertical Line

We should probably put in a line that shows the true means difference of the populations, so we can see how our confidence intervals look compared to reality:

plot2 <- plot2 + geom_vline(xintercept = 10, color="blue") 
plot2

Swapping Axis Order

But I kind of want to read from top to bottom, not from bottom to top, since my sample numbers aren’t really a value, they represent in some way the passage of time, and the first should be up top.

plot2 <- plot2 + scale_y_reverse()
plot2

Can I get rid of the y-axis altogether and plot the point estimate to the left of the graph, say, at x=6, and each label at the right height(i.e. y of my label is the same as SampleNum)?

Here I’ll try to do it. Instead of overwriting plot2, I’ll just display my attempt, so if I’m wrong I haven’t clobbered plot2.

plot2 + geom_text(aes(x=-10, y=sampleNum, label=round(meanDiff,2)))

Alignment

We’re so close! Let’s add our explanatory text with some alignment.

plot2 + geom_text(aes(x=-10, y=sampleNum, 
    label=paste("Mean Difference:", round(meanDiff,2))),
    size = 2.5, hjust="inward")

Axis Range

What if we also wanted to add our p values, though? We’re also running out of room, so let’s adjust everything so things aren’t covering each other up.

plot2 + geom_text(aes(x=-26, y=sampleNum, 
    label=paste("Mean:", round(meanDiff,2))),
    size = 2.5, hjust="inward") +
    geom_text(aes(x=-18, y=sampleNum, 
            label=paste("p: ",
                p, sep="")),
    size = 2.5, hjust="inward") +
    scale_x_continuous(limits=c(-26,30))

Color as Data

We’re doing great! But really, I want only the “bad” confidence intervals to show up in bright colors and the “good” ones to not. Can I do that?

Let’s reconstruct everything from the beginning so we remember where we’re at.

plot3 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
         geom_errorbarh(aes(xmin=lower, 
            xmax=upper), color="darkgrey" ) +
         geom_point(color="#AA3322") + 
         geom_vline(xintercept = 10, color="blue") +
         scale_y_reverse() + 
         geom_text(aes(x=-26, y=sampleNum, 
             label=paste("Mean:", round(meanDiff,2))),
                size = 2.5, hjust="inward") +
         geom_text(aes(x=-18, y=sampleNum, 
             label=paste("p: ",
                p, sep="")),
             size = 2.5, hjust="inward") +
         scale_x_continuous(limits=c(-26,30))

plot3

That’s the whole plot for what I already have. I can change the geom_point and geom_errorbarh parameters to use color as a data differentiator by putting the color syntax within the aes parameter.

I’ll first set a “problem” variable in my dataset which will be TRUE if the upper CI limit is below the true mean or the lower CI limit is above the true mean. Otherwise, it’ll be false.

estimates$problem = estimates$lower >10 | estimates$upper < 10

And now I can set my color (in the aes parameter!) to be equal to problem. That’ll give me two colors: one for problems and one for not.

plot3 <- 
ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
        geom_errorbarh(aes(xmin=lower, 
                     xmax=upper, color=problem)) +
         geom_point(aes(color=problem))  + 
         geom_vline(xintercept = 10, color="blue") +
         scale_y_reverse() + 
         geom_text(aes(x=-26, y=sampleNum, 
             label=paste("Mean:", round(meanDiff,2))),
                size = 2.5, hjust="inward") +
         geom_text(aes(x=-18, y=sampleNum, 
             label=paste("p: ",
                p, sep="")),
             size = 2.5, hjust="inward") +
         scale_x_continuous(limits=c(-26,30))

plot3

Hmmm. So close, but so far away! How can I tell ggplot which colors to use?

problemColors <- c("TRUE"="red", "FALSE"="darkgrey")
colorScale <- scale_colour_manual(name="problem", values=problemColors)

plot3 + colorScale

Theming

Gosh, we’re so close. Let’s get rid of that grey background, the axis labels, and the y axis numbering:

plot3 + colorScale +
  theme_light() +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank())

Oh, and the “problem” legend is a problem. Let’s remove it!

While we’re at it, we’ll add the standard asterisks to the p values, too!

estimates$significance <- ""
estimates$significance[estimates$p<.05] <- "*"
estimates$significance[estimates$p<.01] <- "**"
estimates$significance[estimates$p<.001] <- "***"

plot3 <- plot3 + colorScale +
  theme_light() +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank(),
        legend.position="none") +
  geom_text(aes(x=-13, y=sampleNum, 
        label=estimates$significance),
        size = 2.5, hjust="inward")
  
plot3

Probability Density

What about a curve showing the distribution of the mean differences? Let’s start by drawing the distribution of the differences in mean, to get an idea of our reality.

plot <- ggplot(data = estimates, mapping = aes(meanDiff)) +
        geom_density()
plot

Well, it’s an ugly bell, but it’s a bell, I suppose! But what is the “ideal” normal curve for this? We know that the mean is 10, but what’s the standard deviation? Recall that the variance of difference is the sum of variance, so the square root of the sum of the squared sd’s for each population (both are 20) should be what we get for the standard deviation of the difference of the entire populations. However, we have to remember that we’re dealing with a sampling distribution, so we’ll divide by the square root of 32, our sample size.

popDifferenceSE <- sqrt(20^2+20^2)/sqrt(32)
popDifferenceSE

## [1] 5

Let’s plot a “perfect” sampling distribution of what we’d expect if we sampled not 25, but infinitely many samples out of an infinite pair of populations, each with a sd of 20 and a difference of population means of 10.

In actuality, we’ll do a random collection of one million. That’s pretty darn close!

fakeData<-data.frame(value=rnorm(1000000, mean=10, sd=popDifferenceSE))
plot4 <- ggplot(data = fakeData, mapping = aes(value)) +
        geom_density() +
        theme_light()
plot4

Groovy, now let’s plot that idealized curve under our point and interval estimates, using grid():

Combining multiple plots

library(grid)
library(gridExtra)
grid.arrange(plot3, plot4, ncol = 1, heights = c(4, 1))

We are really close here. Let’s get rid of all our horizontal grid lines, in both graphs, make sure their x axis ranges match, and remove all the excess labels that make it obvious that these are two plots. That way we’ll have one cohesive graphic.

Final Code

# set the data and the most used elements
finalTop <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
  
  # add error bars, parameterized by other columns of 'estimates'
  geom_errorbarh(aes(xmin=lower,xmax=upper, color=problem)) +
  
  # add point estimate, colored according to the problem column of 'estimate'  
  geom_point(aes(color=problem))  + 
  
  # draw a vertical line at the true difference in mean.
  geom_vline(xintercept = 10, color="blue") +
  
  # flip the y axis
  scale_y_reverse() + 
  
  # Add the text for the mean value at the same x value (so a vertical
  # column), but with the y value matching the sample.  Round to 2 decimal pts.
  geom_text(aes(x=-24, y=sampleNum, 
                label=paste("Mean:", round(meanDiff,2))),
            size = 2.5, hjust="inward") +
  
  # Add the text for the p value at the same x value (so a vertical
  # column), but with the y value matching the sample.  Round to 2 decimal pts.
  geom_text(aes(x=-17, y=sampleNum, 
                label=paste("p: ",
                            p, sep="")),
            size = 2.5, hjust="inward") +
  
  # make sure the graph is wide enough to include the text we added  
  scale_x_continuous(limits=c(-26,30)) +
  
  # color the "problem" status according to a scale we set up separately
  colorScale +
  
  # Get rid of gridlines, axes
  theme_void() +
  
  # Some theme changes. Get rid of the legend.
  theme(legend.position = "none",
        
  # Position the plot title at 65% of the width, just over our true mean diff.
        plot.title = element_text(hjust = 0.65),
  
  # Also make sure there's space at the top, but none on bottom!
  plot.margin = unit(c(10, 0, -2, 0), "pt")) +
        
  # Add a title to the graph
  ggtitle("Mean, 95% CI") +
  
  # Add the significance asterisks, drawn from a column in 'estimates'
  geom_text(aes(x=-12, y=sampleNum, 
                label=estimates$significance),
            size = 2.5, hjust="inward") 

# set the data and the most used elements
finalBottom <- ggplot(data = fakeData, mapping = aes(value)) +
  
  # make sure the graph is wide enough to match the top graph 
  scale_x_continuous(limits=c(-26,30)) +
  
  # We want to draw a density curve
  geom_density() +
  
  # draw a vertical line at the true difference in mean.
  geom_vline(xintercept = 10, color="blue") +
  
  # start with a minimal theme.  We can't use theme_void bc it 
  # clobbers the axis numbers, which we want.
  theme_minimal() +
  
  # Some theme changes. Put a light grey line in for major vertical grid lines
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        
        # Remove all the axis text except the x values
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank(),
        axis.ticks.x=element_blank(),
        
        # Also make sure there's space at the bottom, but none on top!
        plot.margin = unit(c(-2, 0, 10, 0), "pt"))
  

finalComplete <- grid.arrange(finalTop, finalBottom, 
                              ncol = 1, heights = c(4, 1))

## Warning: Removed 34 rows containing non-finite values (stat_density).

Generalization

I know that the above code is hard to implement and remember, so I wrote a very simple R package that will minimally plot your point and interval estimates, for one or more samples and a given confidence level. You can optionally color points and/or error bars as well as flip the y axis. Let’s try it!

First, we’ll load my package, called ciPlotter.

library(devtools) # you must use devtools to install from a GitHub clone
install_github("paytonk/RPackages/ciPlotter", host="https://github.research.chop.edu/api/v3", quiet=TRUE)
library(ciPlotter)

There’s only one (public) function there, plotEstimates. Let’s see what it does. First, let’s plot the sample estimates for “cases”, with no other parameters specified, then using the additional parameters to tailor the plot’s appearance.

Note that in this case, we are passing plotEstimates a list that consists of numerical vectors.

plotEstimates(cases)

plotEstimates(cases, 
              confLevel = 0.80, 
              pointColor = "blue", 
              ciColor = "brown",
              yaxis = "reverse")

If you don’t have a list of numerical vectors, but just have one sample, you can just pass in a single numerical vector, as long as it has more than one number in it:

plotEstimates(c(10,9,8,12,12,19,10,15))

It’s not a full or perfect solution, but it can help you visualize your point and confidence interval estimates in a jiffy. If you like and use this package, let me know! And if you have a wishlist for its continued improvement, that would be great to pass along as well.

Figure 2: Violin Plots

This figure features “cat-eye” plots that highlight the difference in captured data for different confidence levels.

Skip ahead to Final Code

Data

Let’s use a large set of normally distributed data, since it’ll make a nice smooth violin curve that’s idealized because of its 100k data points!

fakeData2 <- data.frame(value=rnorm(100000, mean=12, sd=4))

Plotting

Now let’s try our hand at the most basic violin plot. We’ll do three in a row, just like in Cumming’s Figure 2.

Basic Violin

violin <- ggplot(fakeData2, aes(x=1, y=value)) + 
  geom_violin() +
  geom_violin(aes(x=2)) +
  geom_violin(aes(x=3))

violin

Let’s fill the violin plot according to the confidence levels, as Cumming does. The first will be a 99% CI, so the entire shape will be filled. The second will be filled at a 95% CI, so there will be a thick bar across the middle.
The last will be filled only to the SE level, so it’ll resemble a cat’s eye.

Let’s get what those values are. Since this is such big data, we’ll use the qnorm() instead of qt() command. Keep in mind that we’re pretending this distribution is a sampling distribution (which it isn’t), so instead of calculating the SE, we’ll just use one standard deviation.

Interval selection

cInt99 <- qnorm(c(0.005,0.995), mean=12, sd=4)
cInt95 <- qnorm(c(0.05,0.95), mean=12, sd=4)
seInt <- 12 + c(-4, 4)

Let’s set up our data frame to have the levels marked true or false. We’ll add columns for each of our three confidence levels.

violinData <- data.frame(value=fakeData2$value, 
  CI99=(fakeData2$value > cInt99[1] & fakeData2$value < cInt99[2]),
  CI95=(fakeData2$value > cInt95[1] & fakeData2$value < cInt95[2]),
  SE=(fakeData2$value > seInt[1] & fakeData2$value < seInt[2]))

It’s always good to have a sanity check and make sure our values look right!

table(violinData$CI99)

## 
## FALSE  TRUE 
##  1024 98976

table(violinData$CI95)

## 
## FALSE  TRUE 
##  9963 90037

table(violinData$SE)

## 
## FALSE  TRUE 
## 31588 68412

Filling by Color (Fail)

OK, so now we can use the fill command in the aes() parameter of ggplot:

violin <- ggplot(violinData, aes(x=1, y=value)) + 
  geom_violin(aes(fill=CI99))+
  geom_violin(aes(x=2, fill=CI95)) +
  geom_violin(aes(x=3, fill=SE))

violin

Oh, that’s horrible! It didn’t fill the interior of the violin, it created two weird violins for each level. That clearly isn’t what we wanted. To make a long research journey shorter, the geom_violin() doesn’t do what we want.
Luckily, geom_violin() is an extension/application of a different ggplot element, geom_polygon(), and we can hack what we want based on that, instead.

Hacking geom_violin()

Thanks are due here to Joran, who gave a consise explanation to this conundrum.

I have stolen liberally from fair Joran.

We start by letting geom_violin figure out the plotting of the violin shape. We’ll just do one to start.

p <- ggplot() + 
    geom_violin(data = violinData,aes(x = 0,y = value))
p_build <- ggplot2::ggplot_build(p)$data[[1]]

Then we have to draw the polygon:

#This comes directly from the source of geom_violin
p_build <- transform(p_build,
                     xminv = x - violinwidth * (x - xmin),
                     xmaxv = x + violinwidth * (xmax - x))

p_build <- rbind(plyr::arrange(transform(p_build, x = xminv), y),
                 plyr::arrange(transform(p_build, x = xmaxv), -y))

That gives us the violin (or manta ray, or spinning top) that we know and love:

p_fill <- ggplot() + geom_polygon(data = p_build, aes(x = x,y = y))
p_fill

We can add our colors like this:

#Add our fill variable
p_build$fill_group<- ifelse((p_build$y < cInt95[1] | p_build$y > cInt95[2]),
                                'Out','In')

p_fill <- ggplot() + 
    geom_polygon(data = p_build,
                 aes(x = x,y = y,fill = as.factor(fill_group)))
p_fill

That didn’t quite work as planned! It’s important to remember that in factors, they come alphabetized! So, since “In” comes before “Out”, it’s drawn first, and “Out” is drawn over it, and the “Out” is trying to be a single solid polygon, so it turns out to be a vertical stripe down the middle, not just the tips. I bet if we change one letter in our code, so that the alphabetical order changes, we’re good!

We could also change the factor levels, but this is a quick hack that’s easier to remember when you’re in a hurry.

#Add our fill variable
p_build$fill_group<- ifelse((p_build$y < cInt95[1] | p_build$y > cInt95[2]),
                                'AOut','In')

p_fill <- ggplot() + 
    geom_polygon(data = p_build,
                 aes(x = x,y = y,fill = as.factor(fill_group)))
p_fill

Hooray! We’re close! Let’s add just one more line to get the colors what we want them to be.

p_fill + scale_fill_manual(values = c("white", "black"))

Oh, and give it an edge that can be seen!

p_fill <- ggplot() + 
    geom_polygon(data = p_build,
                 aes(x = x,y = y,fill = as.factor(fill_group)),
                 color="black") +
  scale_fill_manual(values = c("white", "black")) 
p_fill

Groovy! Now, can we take this example, the 95% CI, and place it alongside other interval examples? Let’s see! Since I want to place my three violins at 1, 2, and 3 (or any other equidistant values), I can do the following. I’m essentially reproducing the code I already did, but starting with data that’s three times as long: I’m going to repeat my y values for each potential x value.

violinData2 <- data.frame(interval=rep(c(1,2,3),each=100000), 
                          value = rep(violinData$value,3))

Now, for the plotting. I’m essentially reproducing what was above, but where there are changes, I’ve commented them so that they stand out.

p2 <- ggplot() + 
    geom_violin(data = violinData2,aes(x = as.factor(interval),y = value))
# NB that above, the as.factor() is very important to help identify the three
# "groups" (separate violin plots).

p_build2 <- ggplot2::ggplot_build(p2)$data[[1]]

# I'm going to make my violins a bit skinnier (75% of original size) in order to 
# make room on either side of them, for the error bars we'll put in later.
p_build2 <- transform(p_build2,
                     xminv = x - violinwidth * (x - xmin) * .75,
                     xmaxv = x + violinwidth * (xmax - x) * .75)

p_build2 <- rbind(plyr::arrange(transform(p_build2, x = xminv), y),
                 plyr::arrange(transform(p_build2, x = xmaxv), -y))

# Here we'll set the In and Out variables based on their respective intervals.

# Set everyone to be "Out" by definition first
p_build2$fill_group<-'AOut'

# If you're in group 1, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==1 & 
                            p_build2$y > cInt99[1] &
                            p_build2$y < cInt99[2])] <- 'In'
 
# If you're in group 2, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==2 & 
                            p_build2$y > cInt95[1] &
                            p_build2$y < cInt95[2])] <- 'In'

# If you're in group 3, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==3 & 
                            p_build2$y > seInt[1] &
                            p_build2$y < seInt[2])] <- 'In'

# This is necessary to ensure that instead of trying to draw
# 1 violin, we're telling ggplot to draw by group AND fill_group.
p_build2$group1 <- with(p_build2,interaction(factor(group),factor(fill_group)))

p_fill2 <- ggplot() + 
    geom_polygon(data = p_build2,
                 aes(x = x,y = y,group = group1, fill = as.factor(fill_group)),
                 color="black") +
  scale_fill_manual(values = c("white", "black")) 
p_fill2

Oh man, we are so close. We’d like to add some error bars drawn and labeled to show what the heck we’re looking at. We also want to get rid of the legend.

Final Code

p_fill2 <- ggplot() + 
    geom_polygon(data = p_build2,
                 aes(x = x,y = y,group = group1, fill = as.factor(fill_group)),
                 color="black") +
  scale_fill_manual(values = c("white", "black")) +
  geom_errorbar(aes(x=0.5, ymin=cInt99[1], ymax = cInt99[2], width=0.1)) +
  geom_errorbar(aes(x=1.5, ymin=cInt95[1], ymax = cInt95[2], width=0.1)) +
  geom_errorbar(aes(x=2.5, ymin=seInt[1], ymax = seInt[2], width=0.1)) +
  geom_text() +
  annotate("text", label = "99% CI", x = 0.5, y = 25, size = 5, colour = "red") +
  annotate("text", label = "95% CI", x = 1.5, y = 22.5, size = 5, colour = "red") +
  annotate("text", label = "SE", x = 2.5, y = 20, size = 5, colour = "red") +
  theme(legend.position="none") +
  labs(title="Comparison of Various Interval Choices: Black = Capture Area")

p_fill2

Figure 3: Double Y-Axis

I thought that Figure 3 would be extra tricky, as ggplot was designed with a strong, intentional design decision to avoid double axes, which our Figure 3 has. However, the release of ggplot 2.2.0 this fall allows us to add a secondary axis rather easily! If you have an older version of ggplot, consider updating it with update.packages("ggplot2").

Skip ahead to Final Code

Data

This figure has two independent groups (n=40 in both). The challenge we are facing is how to plot both the point and interval estimates for the two groups AND the point and interval estimate for the difference. Generating the data itself won’t be that challenging:

set.seed(234)
Letters <- rnorm(n=40, mean=50, sd=80)
Letters[Letters<0] <- 0
Digits <- rnorm(n=40, mean=100, sd=100)
Digits[Digits<0] <- 0

Now let’s get our point estimate and interval estimate for each group, as well as determining the point and interval estimate for the difference. Note that we keep in mind pooled SD to determine the SE of the difference.

# get point estimates for each group, plus the difference.
# Note that we're putting together vectors here, so that "pointEst" has 
# three values: one for each group, and one for the difference.
pointEst <- c(mean(Letters), mean(Digits), mean(Digits)-mean(Letters))

# get standard error for each group
se <- c(sd(Letters)/sqrt(length(Letters)),
        sd(Digits)/sqrt(length(Digits)))
# And the standard error of the difference
seDifference <- sqrt(sd(Letters)^2/length(Letters) + sd(Digits)^2/length(Digits))

# get all the se's together:

se <- c(se, seDifference)

# Figure out the T-statistic we're looking for (per group)
lowerTStat <- qt(.025, df=length(Letters)-1)
# And the T-statistic for the difference: 
lowerTStatDiff <- qt(0.025, df=length(Letters)+length(Digits)-2)

# So, we'll use the lowerTStat twice, for the two groups, and the
# lowerTStatDiff once, for the difference of the two:
lowerT <- c(lowerTStat, lowerTStat, lowerTStatDiff)

# determine CI
lowerBound <- pointEst + lowerT * se
upperBound <- pointEst + lowerT * se * -1

# set up a data frame to hold our values
estimatesFig3 <- data.frame(group = c("Letters", "Digits", "Difference"),
                              pointEst = pointEst, 
                              lowerBound = lowerBound, 
                              upperBound=upperBound)

Plotting

Let’s begin by plotting the groups and the difference, with the error bars associated with a 95% CI:

plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
  geom_point() +
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.2))
plotFig3

Factor Orders

Argh, we’re back to factor variables (like “Digits”, “Letters”) being placed in arbitrary alphabetical order. Let’s fix that by reordering the factor. This is the better way to do it (even though I advocate the “add-an-A” hack earlier on in this document).

estimatesFig3$group <- factor(estimatesFig3$group, 
                               levels = c("Letters", "Digits", "Difference"))
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
  geom_point() +
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1))
plotFig3

Tweaking Values

The issue is that we want the third bar, the “difference” bar, to have its point estimate line up with the point estimate of “Digits”. Let’s tweak the values in our data frame a bit!

Note: I don’t consider this good practice. Double y-axes are a bad idea. But, for the sake of imitating this figure, let’s do it!

tweak <- estimatesFig3$pointEst[2] - estimatesFig3$pointEst[3]
estimatesFig3[3,2:4] <- estimatesFig3[3,2:4] + tweak

Let’s look at the plot again!

plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
  geom_point() +
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1))
plotFig3

Sigh. Yep, for better or for worse, that’s what we wanted.

Secondary Axis

Let’s see if we can get a second axis on the right side to give the ACTUAL values for that third bar (the one we tweaked upward).

plotFig3 <- plotFig3 +
  scale_y_continuous (sec.axis = sec_axis(~.-tweak))
plotFig3

Let’s connect the point estimates of our groups to the “difference” axis we created. Again, I’m not certain this is the best graphical grammar, but okay. We’ll plot it all the way over to 3.7, even though we only have 3 categories, just so it gets close to the axis.

plotFig3 <- plotFig3 +
  geom_segment(aes(x=1, xend=3.7, y = estimatesFig3$pointEst[1],
                   yend = estimatesFig3$pointEst[1]), linetype=3) + 
  geom_segment(aes(x=2, xend=3.7, y = estimatesFig3$pointEst[2],
                   yend = estimatesFig3$pointEst[2]), linetype=3)
plotFig3

We’re basically done. In the “Final Code” section I’ll do a bit of axis and title cleanup and make the plotting character a triangle for the “differences” point.

Final Code

# Set data source and most useful aesthetic pieces
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
  
  # add points for those.  Set the shape of the point to whether it's 
  # "Difference" or not.  Make points big.
  geom_point(aes(shape=(group=="Difference")), size=3) +
  
  # Add error bars.  Make sure the whiskers aren't too wide!
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1)) +
  
  # Add the secondary axis.
  scale_y_continuous(sec.axis = sec_axis(~.-tweak)) +
  
  # Add the lower dotted line
  geom_segment(aes(x=1, xend=3.7, y = estimatesFig3$pointEst[1],
                   yend = estimatesFig3$pointEst[1]), linetype=3) + 
  
  # Add the upper dotted line
  geom_segment(aes(x=2, xend=3.7, y = estimatesFig3$pointEst[2],
                   yend = estimatesFig3$pointEst[2]), linetype=3) +
  
  # Brighten the background
  theme_light() +
  
  # get rid of the legend added by having 2 point types
  theme(legend.position = "none") +
  
  # Add good labels
  labs(title="Figure 3", x= "Groups / Difference", y="Measure") 
  
  
plotFig3

Figure 4: Plotting Across Time

This is a more traditional figure, that has two groups measured over time.

Skip ahead to Final Code

Data

Once again, let’s do some faux data. This time I’ll just make up the point and interval estimates, because we’ve been through generating these from the actual sample data already a few times.

Let’s construct some data:

estimatesFig4 <- data.frame(
  group = rep(c("Control", "Treatment"),4),
  time = rep(c("Pretest","Posttest","Follow-Up 1","Follow-Up 2"), each=2),
  pointEst = c(84,81,80,44,77,46,85,39),
  lowerBound = c(84,81,80,44,77,46,85,39) - c(12,12,10,15,16,14,13,14),
  upperBound = c(84,81,80,44,77,46,85,39) + c(12,12,10,15,16,14,13,14))

estimatesFig4$time <- factor(estimatesFig4$time, 
               levels = c("Pretest","Posttest","Follow-Up 1","Follow-Up 2"))

Plotting

Basic Line Graph

ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
  geom_line()

Points, Axis Range

Getting closer! Let’s add the points and make the y axis range start at 0:

plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
  geom_point(aes(shape=group), size=3) +
  geom_line() +
  scale_y_continuous(limits=c(0,120))
plotFig4

Error Bars, Horizontal Line

And we can add the error bars and the horizontal line:

plotFig4 + geom_errorbar(aes(ymin=lowerBound, ymax=upperBound), 
                         width=0.1) +
  geom_hline(yintercept = 60)

Avoiding Overlap

We should move over one of the groups, so they don’t overlap like that!

plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
  geom_point(aes(shape=group), size=3, position=position_dodge(width=0.2)) +
  geom_line(position=position_dodge(width=0.2)) +
  scale_y_continuous(limits=c(0,120))  + 
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound), 
                         width=0.1, position=position_dodge(width=0.2)) +
  geom_hline(yintercept = 60)
plotFig4

OK, we’re really close! I’ll just do a bit of cleanup in the “Final Code.”

Final Code

# Select data frame and columns of primary interest.  Setting a group
# allows us to graph 2 different data sets at once.
plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
  
# Add the points, make them biggish, and make sure they don't overlap.
  geom_point(aes(shape=group), size=3, position=position_dodge(width=0.2)) +
  
  # Add the lines, making sure they line up with points
  geom_line(position=position_dodge(width=0.2)) +
  
  # Make the y axis scale what we want
  scale_y_continuous(limits=c(0,120))  +
  
  # Add the error bars.  Again, make sure they line up with points!
  geom_errorbar(aes(ymin=lowerBound, ymax=upperBound), 
                         width=0.1, position=position_dodge(width=0.2)) +
  
  # Add the horizontal line
  geom_hline(yintercept = 60) +
  
  # Add a horizontal line label
  geom_label(aes(x=0.5, y=60,label = "Clinical\nThreshold"), hjust="inward") +
  
  # Make labels better
  labs(x="", y="Anxiety Score", shape="") 

plotFig4

Figure 5: Redux of Figure 3

Figure 5 is very similar to Figure 3, and we won’t go through the machinations to re-do it here.

Figure 6: Meta-analyses

Figure 6 is intended to show a meta-analysis across several studies. Means and 95% CI’s are plotted for each study, with points at different sizes according to their relative weight in the meta-analysis. We won’t go thru the data manipulation needed to arrive at the suggested data.

Skip ahead to Final Code

Data

Let’s construct the data just as it appears in the figure:

metaAnalysis <- data.frame(StudyID = c(1:9),
                           StudyName=c("Aden (1993)", 
                                  "Buggs (1995)",
                                  "Crazed (1999)",
                                  "Dudley (2003)",
                                  "Evers (2005)",
                                  "Fox (2009)",
                                  "Random effects (previous)",
                                  "Mine (2011)",
                                  "Random effects (all)"),
                           StudyType=c(rep("single", 6), "meta", "single", "meta"),
                           Mean = c(454,317,430,525,479,387,432.299,531,441.425),
                           SD = c(142,158,137,260,144,165,NA,233,NA),
                           N = c(24,7,54,8,14,13,120,18,138))

Let’s get the 95% confidence interval:

metaAnalysis$se <- metaAnalysis$SD/sqrt(metaAnalysis$N)
metaAnalysis$t <- qt(.975, df=metaAnalysis$N-1)
metaAnalysis$lower <- metaAnalysis$Mean - (metaAnalysis$se*metaAnalysis$t)
metaAnalysis$upper <- metaAnalysis$Mean + (metaAnalysis$se*metaAnalysis$t)

Plotting

Initial plot

Let’s toss all of those points and error bars into a plot with the y axis reversed:

ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
  geom_point(aes(shape=StudyType)) +
  geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
  scale_y_reverse()

## Warning: Removed 2 rows containing missing values (geom_errorbarh).

Shapes and Sizes

Getting close! How can I set the shapes to what I want? I know from http://sape.inf.usi.ch/quick-reference/ggplot2/shape that plotting character 18 is a filled diamond, and 15 is a filled square, so let’s try this. Note I’m also tossing in size = N while I’m at it! This isn’t precisely the weighting value, but it’s good enough and shows how to alter size by some variable.

shapes <- c("single" = 15, "meta" = 18)
ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
  geom_point(aes(shape=StudyType, size = N)) +
  geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
  scale_y_reverse() +
  scale_shape_manual(values = shapes)

## Warning: Removed 2 rows containing missing values (geom_errorbarh).

Now let’s put it all together with a few last tweaks:

Final Code

# Set data source and most crucial columns
plotFig6 <- ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
  
  # Make points.  Their shape should depend on the study type, and
  # the point size should vary according to the size of the study.
  geom_point(aes(shape=StudyType, size = N)) +
  
  # make sure the size scale isn't too small!
  scale_size_continuous(range=c(3,10)) +
  
  # Draw error bars with no whisker height
  geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
  
  # Flip the y axis
  scale_y_reverse() +
  
  # Set the shapes I want for my points
  scale_shape_manual(values = shapes) +
 
  # give it a minimal theme
  theme_minimal() +
  
  # Remove the legends.
  theme(legend.position="none",
        
        # remove the y-axis stuff.
        axis.title.y=element_blank(),
        axis.text.y=element_blank()) +
  
  # Make the X axis on top and make it have more numbers
  scale_x_continuous(position = "top", breaks = seq(100,800,by=100)) +
  
  # fix labels
  labs(x="ms")

  
  
plotFig6

## Warning: Removed 2 rows containing missing values (geom_errorbarh).

Figure 7: Precision/Sample Size

Figure 7 displays precision-for-planning curves. It shows how large a sample is necessary to provide various sizes of margin of error (in terms of standard deviation units).

Skip ahead to Final Code

Data

Let’s set our desired MOE values.

moe <- seq(0.3,1.5, by=0.1)

I’m using the MBESS package, which has a lot of AIPE functions, to figure out the data. Here, width is confidence interval, so we can roughly approximate a 95% CI width as 2 margins of error (although this won’t be quite right with small groups, but bear with me…)

library(MBESS)
assurance50 <- sapply(moe, function(x) ss.aipe.smd(delta=0.5, conf.level=.95,
                                   width = 2*(x)))
assurance99 <- sapply(moe, function(x) ss.aipe.smd(delta=0.5, conf.level=.95,
                                   width = 2*(x), assurance = 0.99))

Let’s combine the data:

sampleSizes <- data.frame(moe, assurance50, assurance99)

Plotting

Lines and Points

This time, we have wide (not long) data, so we can’t use the “group” feature to separate lines. We’ll need to add two geoms. This also keeps us from making a legend very easily. But let’s say we just need a quick exploratory plot, so we won’t reshape the data.

plotFig7 <- ggplot(sampleSizes, aes(x=moe, y=assurance50)) +
  geom_point(shape = 19) + geom_line() +
  geom_point(aes(x=moe, y=assurance99), shape = 17) +
  geom_line(aes(x=moe, y=assurance99))
plotFig7

Annotate Points

Let’s label points with their value. We’ll use offset and size values to make them more attractive.

plotFig7 + geom_text(aes(label=assurance50),
                     hjust=1.5, vjust=1.5, size=3) + 
  geom_text(aes(y=assurance99, label=assurance99), hjust=-.5, 
            vjust=-.5, size=3)

Final Code

Alright, just a few more small edits!

# Set data source
plotFig7 <- ggplot(sampleSizes, aes(x=moe, y=assurance50)) +
  
  # Large dots for assurance at 50%
  geom_point(shape = 19) + 
  geom_line() +
  
  # Filled triangles for assurance at 99%
  geom_point(aes(x=moe, y=assurance99), shape = 17) +
  geom_line(aes(x=moe, y=assurance99)) +
  
  # Label the points
  geom_text(aes(label=assurance50),
                     hjust=1.5, vjust=1.5, size=3) + 
  geom_text(aes(y=assurance99, label=assurance99), hjust=-.5, 
            vjust=-.5, size=3) +

  # Very simple theme
  theme_bw() +
  
  # Change labels
  labs(x="Margin of Error", y="Sample Size")

plotFig7

The New Statistics: Visualizations

Joy Payton

12/27/2016

Inspiration for this Publication

Credits

Caveats

Needed Libraries

Figure 1: Two-sample Estimates

Data

Generating Random Normals

Point and Interval Estimates

Significance

Plotting

Point Estimates

Error Bars

Colors

Layer Order

Vertical Line

Swapping Axis Order

Text Size

Alignment

Axis Range

Color as Data

Theming

Probability Density

Combining multiple plots

Final Code

Generalization

Figure 2: Violin Plots

Data

Plotting

Basic Violin

Interval selection

Filling by Color (Fail)

Hacking geom_violin()

Final Code

Figure 3: Double Y-Axis

Data

Plotting

Factor Orders

Tweaking Values

Secondary Axis

Final Code

Figure 4: Plotting Across Time

Data

Plotting

Basic Line Graph

Points, Axis Range

Error Bars, Horizontal Line

Avoiding Overlap

Final Code

Figure 5: Redux of Figure 3

Figure 6: Meta-analyses

Data

Plotting

Initial plot

Shapes and Sizes

Final Code

Figure 7: Precision/Sample Size

Data

Plotting

Lines and Points

Annotate Points

Final Code