This RPubs document is based on Geoff Cumming’s 2014 article, “The New Statistics, Why and How,” in Psychological Science, 2014, Vol. 25(1) 7–29.
Have access to Penn’s library? Give it a read!
In this document we will create plots like the ones that appear in that article.
I will not slavishly follow every graphical nuance, but rather try to show you tips and tricks along the way. Because I build up gradually, it makes sense to start at the beginning and skim your way through, even if you aren’t interested in the first couple of figures.
Do not be intimidated by the length of this document! It is intended to be a tutorial in the use of ggplot, and for that reason I intentionally make mistakes, move incrementally, etc., to show you how each line works. If you would rather skip to the end of any figure, just fast-forward to the ‘Final Code’ section to see the completed R code.
library(ggplot2)
In this figure, we will look at the differences in mean between two groups, create point and interval estimates, and show the significance level.
Let’s generate some fake data and draw some samples, shall we?
set.seed(42)
myControls <- data.frame(value=rnorm(2000, mean = 30, sd=20))
myData <- data.frame(value=rnorm(2000, mean = 40, sd=20))
cases <- NULL
controls <- NULL
for (i in 1:25) {
cases[[i]] <- sample(myData$value, size=32)
controls[[i]] <- sample(myControls$value, size=32)
}
Now we have 25 samples of n=32, each drawn out of 2 normally distributed datasets with mean difference of 10 and standard deviation of 20.
How can we plot our sample point and 95% CI? Well, first let’s calculate them!
I’m assuming a T, not normal distribution for calculating our standard error, even though we happen to know the population standard deviation. This is so odd and unlike typical research that I’d rather just do a T test, since that’s what we do in real life.
Still, you could do the Z if you wanted to.
I’ll start by making an estimates data frame. I’ll set “upper” and “lower” to just have NA values, “meanDiff” will be the sample mean differences, “caseSSE” will be the sum of squared errors from the cases in each sample, “controlSSE” will be the sum of squared errors from the controls, and “sd”, blank for now, will be the pooled sample standard deviation.
estimates <- data.frame(lower=NA,
meanDiff=sapply(cases, mean)-sapply(controls, mean),
caseSSE= sapply(cases, function(x) sum((x-mean(x))^2)),
controlSSE = sapply(controls, function(x) sum((x-mean(x))^2)),
sd=NA,
upper=NA)
Now that I’ve got my SSE in place, I can come up with the pooled sd quite easily:
estimates$sd <- sqrt((estimates$caseSSE+estimates$controlSSE)/64)
Now I have to figure out the standard error for each sample.
se <- estimates$sd/sqrt(32) # basic stats here
Here I choose my significance level to figure out how many se’s I want to go out in either direction.
# we have two tails, so we want the 97.5%, not the
# 95%, quantile (that gives us 2.5% at the end) to get
# the number of standard deviations away (T-statistic).
tBound <- qt(0.975, df=31)
# if you wanted the Z-statistic, you could:
zBound <-qnorm(0.975)
estimates$lower <- estimates$meanDiff - se*tBound # or zBound
estimates$upper <- estimates$meanDiff + se*tBound # or zBound
Let’s also consider what our p-values are. There are a couple of ways to do this. Lets use t.test() and extract the p-values from the list (a bit messy, but worth it, because it handles the two-tailed thing easily, unlike some other R commands!). Keep in mind that our null hypothesis is that the mean is 0. Since our actual population mean is 10 (not that we know this in real life), we expect that we’ll find some evidence to discard our null hypothesis on at least some samples. Let’s see if that’s the case.
tTest <- mapply(t.test, x=controls, y=cases)
# flip it and make it a data frame
tTest <- as.data.frame(t(tTest))
estimates$p <- unlist(tTest$p.value)
estimates$p <- round(estimates$p, 4)
Let’s plot the point estimates first, then worry about the confidence intervals. To make things easier we’ll note which sample is which by adding a sampleNum column in our data frame.
estimates$sampleNum <- as.numeric(row.names(estimates))
Let’s plot the mean on the x axis and the sample number on the y:
plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
geom_point()
plot2
How can we add error bars?
plot2 <- plot2 + geom_errorbarh(aes(xmin=lower,
xmax=upper) )
plot2
See how I can take my existing defined plot and just add a layer to it?
Cool, but those error bars are a bit heavy, don’t you think? Let’s try tweaking the colors a bit and combine it all:
plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
geom_point(color="red") +
geom_errorbarh(aes(xmin=lower,
xmax=upper), color="grey" )
plot2
Oh, snap, the grey error bars are in front of the red dots. Let’s fix that. We’ll tweak the colors, too.
plot2 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
geom_errorbarh(aes(xmin=lower,
xmax=upper), color="darkgrey" ) +
geom_point(color="#AA3322")
plot2
We should probably put in a line that shows the true means difference of the populations, so we can see how our confidence intervals look compared to reality:
plot2 <- plot2 + geom_vline(xintercept = 10, color="blue")
plot2
But I kind of want to read from top to bottom, not from bottom to top, since my sample numbers aren’t really a value, they represent in some way the passage of time, and the first should be up top.
plot2 <- plot2 + scale_y_reverse()
plot2
Can I get rid of the y-axis altogether and plot the point estimate to the left of the graph, say, at x=6, and each label at the right height(i.e. y of my label is the same as SampleNum)?
Here I’ll try to do it. Instead of overwriting plot2, I’ll just display my attempt, so if I’m wrong I haven’t clobbered plot2.
plot2 + geom_text(aes(x=-10, y=sampleNum, label=round(meanDiff,2)))
Not bad, but the labels are awfully big! Let’s try using “size”. We’ll also add some text in there to explain what the number is.
plot2 + geom_text(aes(x=-10, y=sampleNum,
label=paste("Mean:", round(meanDiff,2))),
size=2.5)
We’re so close! Let’s add our explanatory text with some alignment.
plot2 + geom_text(aes(x=-10, y=sampleNum,
label=paste("Mean Difference:", round(meanDiff,2))),
size = 2.5, hjust="inward")
What if we also wanted to add our p values, though? We’re also running out of room, so let’s adjust everything so things aren’t covering each other up.
plot2 + geom_text(aes(x=-26, y=sampleNum,
label=paste("Mean:", round(meanDiff,2))),
size = 2.5, hjust="inward") +
geom_text(aes(x=-18, y=sampleNum,
label=paste("p: ",
p, sep="")),
size = 2.5, hjust="inward") +
scale_x_continuous(limits=c(-26,30))
We’re doing great! But really, I want only the “bad” confidence intervals to show up in bright colors and the “good” ones to not. Can I do that?
Let’s reconstruct everything from the beginning so we remember where we’re at.
plot3 <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
geom_errorbarh(aes(xmin=lower,
xmax=upper), color="darkgrey" ) +
geom_point(color="#AA3322") +
geom_vline(xintercept = 10, color="blue") +
scale_y_reverse() +
geom_text(aes(x=-26, y=sampleNum,
label=paste("Mean:", round(meanDiff,2))),
size = 2.5, hjust="inward") +
geom_text(aes(x=-18, y=sampleNum,
label=paste("p: ",
p, sep="")),
size = 2.5, hjust="inward") +
scale_x_continuous(limits=c(-26,30))
plot3
That’s the whole plot for what I already have. I can change the geom_point and geom_errorbarh parameters to use color as a data differentiator by putting the color syntax within the aes parameter.
I’ll first set a “problem” variable in my dataset which will be TRUE if the upper CI limit is below the true mean or the lower CI limit is above the true mean. Otherwise, it’ll be false.
estimates$problem = estimates$lower >10 | estimates$upper < 10
And now I can set my color (in the aes parameter!) to be equal to problem. That’ll give me two colors: one for problems and one for not.
plot3 <-
ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
geom_errorbarh(aes(xmin=lower,
xmax=upper, color=problem)) +
geom_point(aes(color=problem)) +
geom_vline(xintercept = 10, color="blue") +
scale_y_reverse() +
geom_text(aes(x=-26, y=sampleNum,
label=paste("Mean:", round(meanDiff,2))),
size = 2.5, hjust="inward") +
geom_text(aes(x=-18, y=sampleNum,
label=paste("p: ",
p, sep="")),
size = 2.5, hjust="inward") +
scale_x_continuous(limits=c(-26,30))
plot3
Hmmm. So close, but so far away! How can I tell ggplot which colors to use?
problemColors <- c("TRUE"="red", "FALSE"="darkgrey")
colorScale <- scale_colour_manual(name="problem", values=problemColors)
plot3 + colorScale
Gosh, we’re so close. Let’s get rid of that grey background, the axis labels, and the y axis numbering:
plot3 + colorScale +
theme_light() +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title.x=element_blank())
Oh, and the “problem” legend is a problem. Let’s remove it!
While we’re at it, we’ll add the standard asterisks to the p values, too!
estimates$significance <- ""
estimates$significance[estimates$p<.05] <- "*"
estimates$significance[estimates$p<.01] <- "**"
estimates$significance[estimates$p<.001] <- "***"
plot3 <- plot3 + colorScale +
theme_light() +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title.x=element_blank(),
legend.position="none") +
geom_text(aes(x=-13, y=sampleNum,
label=estimates$significance),
size = 2.5, hjust="inward")
plot3
What about a curve showing the distribution of the mean differences? Let’s start by drawing the distribution of the differences in mean, to get an idea of our reality.
plot <- ggplot(data = estimates, mapping = aes(meanDiff)) +
geom_density()
plot
Well, it’s an ugly bell, but it’s a bell, I suppose! But what is the “ideal” normal curve for this? We know that the mean is 10, but what’s the standard deviation? Recall that the variance of difference is the sum of variance, so the square root of the sum of the squared sd’s for each population (both are 20) should be what we get for the standard deviation of the difference of the entire populations. However, we have to remember that we’re dealing with a sampling distribution, so we’ll divide by the square root of 32, our sample size.
popDifferenceSE <- sqrt(20^2+20^2)/sqrt(32)
popDifferenceSE
## [1] 5
Let’s plot a “perfect” sampling distribution of what we’d expect if we sampled not 25, but infinitely many samples out of an infinite pair of populations, each with a sd of 20 and a difference of population means of 10.
In actuality, we’ll do a random collection of one million. That’s pretty darn close!
fakeData<-data.frame(value=rnorm(1000000, mean=10, sd=popDifferenceSE))
plot4 <- ggplot(data = fakeData, mapping = aes(value)) +
geom_density() +
theme_light()
plot4
Groovy, now let’s plot that idealized curve under our point and interval estimates, using grid():
library(grid)
library(gridExtra)
grid.arrange(plot3, plot4, ncol = 1, heights = c(4, 1))
We are really close here. Let’s get rid of all our horizontal grid lines, in both graphs, make sure their x axis ranges match, and remove all the excess labels that make it obvious that these are two plots. That way we’ll have one cohesive graphic.
# set the data and the most used elements
finalTop <- ggplot(data=estimates, aes(x=meanDiff, y=sampleNum)) +
# add error bars, parameterized by other columns of 'estimates'
geom_errorbarh(aes(xmin=lower,xmax=upper, color=problem)) +
# add point estimate, colored according to the problem column of 'estimate'
geom_point(aes(color=problem)) +
# draw a vertical line at the true difference in mean.
geom_vline(xintercept = 10, color="blue") +
# flip the y axis
scale_y_reverse() +
# Add the text for the mean value at the same x value (so a vertical
# column), but with the y value matching the sample. Round to 2 decimal pts.
geom_text(aes(x=-24, y=sampleNum,
label=paste("Mean:", round(meanDiff,2))),
size = 2.5, hjust="inward") +
# Add the text for the p value at the same x value (so a vertical
# column), but with the y value matching the sample. Round to 2 decimal pts.
geom_text(aes(x=-17, y=sampleNum,
label=paste("p: ",
p, sep="")),
size = 2.5, hjust="inward") +
# make sure the graph is wide enough to include the text we added
scale_x_continuous(limits=c(-26,30)) +
# color the "problem" status according to a scale we set up separately
colorScale +
# Get rid of gridlines, axes
theme_void() +
# Some theme changes. Get rid of the legend.
theme(legend.position = "none",
# Position the plot title at 65% of the width, just over our true mean diff.
plot.title = element_text(hjust = 0.65),
# Also make sure there's space at the top, but none on bottom!
plot.margin = unit(c(10, 0, -2, 0), "pt")) +
# Add a title to the graph
ggtitle("Mean, 95% CI") +
# Add the significance asterisks, drawn from a column in 'estimates'
geom_text(aes(x=-12, y=sampleNum,
label=estimates$significance),
size = 2.5, hjust="inward")
# set the data and the most used elements
finalBottom <- ggplot(data = fakeData, mapping = aes(value)) +
# make sure the graph is wide enough to match the top graph
scale_x_continuous(limits=c(-26,30)) +
# We want to draw a density curve
geom_density() +
# draw a vertical line at the true difference in mean.
geom_vline(xintercept = 10, color="blue") +
# start with a minimal theme. We can't use theme_void bc it
# clobbers the axis numbers, which we want.
theme_minimal() +
# Some theme changes. Put a light grey line in for major vertical grid lines
theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
# Remove all the axis text except the x values
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
# Also make sure there's space at the bottom, but none on top!
plot.margin = unit(c(-2, 0, 10, 0), "pt"))
finalComplete <- grid.arrange(finalTop, finalBottom,
ncol = 1, heights = c(4, 1))
## Warning: Removed 34 rows containing non-finite values (stat_density).
I know that the above code is hard to implement and remember, so I wrote a very simple R package that will minimally plot your point and interval estimates, for one or more samples and a given confidence level. You can optionally color points and/or error bars as well as flip the y axis. Let’s try it!
First, we’ll load my package, called ciPlotter.
library(devtools) # you must use devtools to install from a GitHub clone
install_github("paytonk/RPackages/ciPlotter", host="https://github.research.chop.edu/api/v3", quiet=TRUE)
library(ciPlotter)
There’s only one (public) function there, plotEstimates. Let’s see what it does. First, let’s plot the sample estimates for “cases”, with no other parameters specified, then using the additional parameters to tailor the plot’s appearance.
Note that in this case, we are passing plotEstimates a list that consists of numerical vectors.
plotEstimates(cases)
plotEstimates(cases,
confLevel = 0.80,
pointColor = "blue",
ciColor = "brown",
yaxis = "reverse")
If you don’t have a list of numerical vectors, but just have one sample, you can just pass in a single numerical vector, as long as it has more than one number in it:
plotEstimates(c(10,9,8,12,12,19,10,15))
It’s not a full or perfect solution, but it can help you visualize your point and confidence interval estimates in a jiffy. If you like and use this package, let me know! And if you have a wishlist for its continued improvement, that would be great to pass along as well.
This figure features “cat-eye” plots that highlight the difference in captured data for different confidence levels.
Let’s use a large set of normally distributed data, since it’ll make a nice smooth violin curve that’s idealized because of its 100k data points!
fakeData2 <- data.frame(value=rnorm(100000, mean=12, sd=4))
Now let’s try our hand at the most basic violin plot. We’ll do three in a row, just like in Cumming’s Figure 2.
violin <- ggplot(fakeData2, aes(x=1, y=value)) +
geom_violin() +
geom_violin(aes(x=2)) +
geom_violin(aes(x=3))
violin
Let’s fill the violin plot according to the confidence levels, as Cumming does. The first will be a 99% CI, so the entire shape will be filled. The second will be filled at a 95% CI, so there will be a thick bar across the middle.
The last will be filled only to the SE level, so it’ll resemble a cat’s eye.
Let’s get what those values are. Since this is such big data, we’ll use the qnorm() instead of qt() command. Keep in mind that we’re pretending this distribution is a sampling distribution (which it isn’t), so instead of calculating the SE, we’ll just use one standard deviation.
cInt99 <- qnorm(c(0.005,0.995), mean=12, sd=4)
cInt95 <- qnorm(c(0.05,0.95), mean=12, sd=4)
seInt <- 12 + c(-4, 4)
Let’s set up our data frame to have the levels marked true or false. We’ll add columns for each of our three confidence levels.
violinData <- data.frame(value=fakeData2$value,
CI99=(fakeData2$value > cInt99[1] & fakeData2$value < cInt99[2]),
CI95=(fakeData2$value > cInt95[1] & fakeData2$value < cInt95[2]),
SE=(fakeData2$value > seInt[1] & fakeData2$value < seInt[2]))
It’s always good to have a sanity check and make sure our values look right!
table(violinData$CI99)
##
## FALSE TRUE
## 1024 98976
table(violinData$CI95)
##
## FALSE TRUE
## 9963 90037
table(violinData$SE)
##
## FALSE TRUE
## 31588 68412
OK, so now we can use the fill command in the aes() parameter of ggplot:
violin <- ggplot(violinData, aes(x=1, y=value)) +
geom_violin(aes(fill=CI99))+
geom_violin(aes(x=2, fill=CI95)) +
geom_violin(aes(x=3, fill=SE))
violin
Oh, that’s horrible! It didn’t fill the interior of the violin, it created two weird violins for each level. That clearly isn’t what we wanted. To make a long research journey shorter, the geom_violin() doesn’t do what we want.
Luckily, geom_violin() is an extension/application of a different ggplot element, geom_polygon(), and we can hack what we want based on that, instead.
Thanks are due here to Joran, who gave a consise explanation to this conundrum.
I have stolen liberally from fair Joran.
We start by letting geom_violin figure out the plotting of the violin shape. We’ll just do one to start.
p <- ggplot() +
geom_violin(data = violinData,aes(x = 0,y = value))
p_build <- ggplot2::ggplot_build(p)$data[[1]]
Then we have to draw the polygon:
#This comes directly from the source of geom_violin
p_build <- transform(p_build,
xminv = x - violinwidth * (x - xmin),
xmaxv = x + violinwidth * (xmax - x))
p_build <- rbind(plyr::arrange(transform(p_build, x = xminv), y),
plyr::arrange(transform(p_build, x = xmaxv), -y))
That gives us the violin (or manta ray, or spinning top) that we know and love:
p_fill <- ggplot() + geom_polygon(data = p_build, aes(x = x,y = y))
p_fill
We can add our colors like this:
#Add our fill variable
p_build$fill_group<- ifelse((p_build$y < cInt95[1] | p_build$y > cInt95[2]),
'Out','In')
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,fill = as.factor(fill_group)))
p_fill
That didn’t quite work as planned! It’s important to remember that in factors, they come alphabetized! So, since “In” comes before “Out”, it’s drawn first, and “Out” is drawn over it, and the “Out” is trying to be a single solid polygon, so it turns out to be a vertical stripe down the middle, not just the tips. I bet if we change one letter in our code, so that the alphabetical order changes, we’re good!
We could also change the factor levels, but this is a quick hack that’s easier to remember when you’re in a hurry.
#Add our fill variable
p_build$fill_group<- ifelse((p_build$y < cInt95[1] | p_build$y > cInt95[2]),
'AOut','In')
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,fill = as.factor(fill_group)))
p_fill
Hooray! We’re close! Let’s add just one more line to get the colors what we want them to be.
p_fill + scale_fill_manual(values = c("white", "black"))
Oh, and give it an edge that can be seen!
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,fill = as.factor(fill_group)),
color="black") +
scale_fill_manual(values = c("white", "black"))
p_fill
Groovy! Now, can we take this example, the 95% CI, and place it alongside other interval examples? Let’s see! Since I want to place my three violins at 1, 2, and 3 (or any other equidistant values), I can do the following. I’m essentially reproducing the code I already did, but starting with data that’s three times as long: I’m going to repeat my y values for each potential x value.
violinData2 <- data.frame(interval=rep(c(1,2,3),each=100000),
value = rep(violinData$value,3))
Now, for the plotting. I’m essentially reproducing what was above, but where there are changes, I’ve commented them so that they stand out.
p2 <- ggplot() +
geom_violin(data = violinData2,aes(x = as.factor(interval),y = value))
# NB that above, the as.factor() is very important to help identify the three
# "groups" (separate violin plots).
p_build2 <- ggplot2::ggplot_build(p2)$data[[1]]
# I'm going to make my violins a bit skinnier (75% of original size) in order to
# make room on either side of them, for the error bars we'll put in later.
p_build2 <- transform(p_build2,
xminv = x - violinwidth * (x - xmin) * .75,
xmaxv = x + violinwidth * (xmax - x) * .75)
p_build2 <- rbind(plyr::arrange(transform(p_build2, x = xminv), y),
plyr::arrange(transform(p_build2, x = xmaxv), -y))
# Here we'll set the In and Out variables based on their respective intervals.
# Set everyone to be "Out" by definition first
p_build2$fill_group<-'AOut'
# If you're in group 1, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==1 &
p_build2$y > cInt99[1] &
p_build2$y < cInt99[2])] <- 'In'
# If you're in group 2, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==2 &
p_build2$y > cInt95[1] &
p_build2$y < cInt95[2])] <- 'In'
# If you're in group 3, we'll change you to 'In' if you are in.
p_build2$fill_group[which(p_build2$group==3 &
p_build2$y > seInt[1] &
p_build2$y < seInt[2])] <- 'In'
# This is necessary to ensure that instead of trying to draw
# 1 violin, we're telling ggplot to draw by group AND fill_group.
p_build2$group1 <- with(p_build2,interaction(factor(group),factor(fill_group)))
p_fill2 <- ggplot() +
geom_polygon(data = p_build2,
aes(x = x,y = y,group = group1, fill = as.factor(fill_group)),
color="black") +
scale_fill_manual(values = c("white", "black"))
p_fill2
Oh man, we are so close. We’d like to add some error bars drawn and labeled to show what the heck we’re looking at. We also want to get rid of the legend.
p_fill2 <- ggplot() +
geom_polygon(data = p_build2,
aes(x = x,y = y,group = group1, fill = as.factor(fill_group)),
color="black") +
scale_fill_manual(values = c("white", "black")) +
geom_errorbar(aes(x=0.5, ymin=cInt99[1], ymax = cInt99[2], width=0.1)) +
geom_errorbar(aes(x=1.5, ymin=cInt95[1], ymax = cInt95[2], width=0.1)) +
geom_errorbar(aes(x=2.5, ymin=seInt[1], ymax = seInt[2], width=0.1)) +
geom_text() +
annotate("text", label = "99% CI", x = 0.5, y = 25, size = 5, colour = "red") +
annotate("text", label = "95% CI", x = 1.5, y = 22.5, size = 5, colour = "red") +
annotate("text", label = "SE", x = 2.5, y = 20, size = 5, colour = "red") +
theme(legend.position="none") +
labs(title="Comparison of Various Interval Choices: Black = Capture Area")
p_fill2
I thought that Figure 3 would be extra tricky, as ggplot was designed with a strong, intentional design decision to avoid double axes, which our Figure 3 has. However, the release of ggplot 2.2.0 this fall allows us to add a secondary axis rather easily! If you have an older version of ggplot, consider updating it with update.packages("ggplot2")
.
This figure has two independent groups (n=40 in both). The challenge we are facing is how to plot both the point and interval estimates for the two groups AND the point and interval estimate for the difference. Generating the data itself won’t be that challenging:
set.seed(234)
Letters <- rnorm(n=40, mean=50, sd=80)
Letters[Letters<0] <- 0
Digits <- rnorm(n=40, mean=100, sd=100)
Digits[Digits<0] <- 0
Now let’s get our point estimate and interval estimate for each group, as well as determining the point and interval estimate for the difference. Note that we keep in mind pooled SD to determine the SE of the difference.
# get point estimates for each group, plus the difference.
# Note that we're putting together vectors here, so that "pointEst" has
# three values: one for each group, and one for the difference.
pointEst <- c(mean(Letters), mean(Digits), mean(Digits)-mean(Letters))
# get standard error for each group
se <- c(sd(Letters)/sqrt(length(Letters)),
sd(Digits)/sqrt(length(Digits)))
# And the standard error of the difference
seDifference <- sqrt(sd(Letters)^2/length(Letters) + sd(Digits)^2/length(Digits))
# get all the se's together:
se <- c(se, seDifference)
# Figure out the T-statistic we're looking for (per group)
lowerTStat <- qt(.025, df=length(Letters)-1)
# And the T-statistic for the difference:
lowerTStatDiff <- qt(0.025, df=length(Letters)+length(Digits)-2)
# So, we'll use the lowerTStat twice, for the two groups, and the
# lowerTStatDiff once, for the difference of the two:
lowerT <- c(lowerTStat, lowerTStat, lowerTStatDiff)
# determine CI
lowerBound <- pointEst + lowerT * se
upperBound <- pointEst + lowerT * se * -1
# set up a data frame to hold our values
estimatesFig3 <- data.frame(group = c("Letters", "Digits", "Difference"),
pointEst = pointEst,
lowerBound = lowerBound,
upperBound=upperBound)
Let’s begin by plotting the groups and the difference, with the error bars associated with a 95% CI:
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
geom_point() +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.2))
plotFig3
Argh, we’re back to factor variables (like “Digits”, “Letters”) being placed in arbitrary alphabetical order. Let’s fix that by reordering the factor. This is the better way to do it (even though I advocate the “add-an-A” hack earlier on in this document).
estimatesFig3$group <- factor(estimatesFig3$group,
levels = c("Letters", "Digits", "Difference"))
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
geom_point() +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1))
plotFig3
The issue is that we want the third bar, the “difference” bar, to have its point estimate line up with the point estimate of “Digits”. Let’s tweak the values in our data frame a bit!
Note: I don’t consider this good practice. Double y-axes are a bad idea. But, for the sake of imitating this figure, let’s do it!
tweak <- estimatesFig3$pointEst[2] - estimatesFig3$pointEst[3]
estimatesFig3[3,2:4] <- estimatesFig3[3,2:4] + tweak
Let’s look at the plot again!
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
geom_point() +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1))
plotFig3
Sigh. Yep, for better or for worse, that’s what we wanted.
Let’s see if we can get a second axis on the right side to give the ACTUAL values for that third bar (the one we tweaked upward).
plotFig3 <- plotFig3 +
scale_y_continuous (sec.axis = sec_axis(~.-tweak))
plotFig3
Let’s connect the point estimates of our groups to the “difference” axis we created. Again, I’m not certain this is the best graphical grammar, but okay. We’ll plot it all the way over to 3.7, even though we only have 3 categories, just so it gets close to the axis.
plotFig3 <- plotFig3 +
geom_segment(aes(x=1, xend=3.7, y = estimatesFig3$pointEst[1],
yend = estimatesFig3$pointEst[1]), linetype=3) +
geom_segment(aes(x=2, xend=3.7, y = estimatesFig3$pointEst[2],
yend = estimatesFig3$pointEst[2]), linetype=3)
plotFig3
We’re basically done. In the “Final Code” section I’ll do a bit of axis and title cleanup and make the plotting character a triangle for the “differences” point.
# Set data source and most useful aesthetic pieces
plotFig3 <- ggplot(data=estimatesFig3, aes(x=group, y=pointEst)) +
# add points for those. Set the shape of the point to whether it's
# "Difference" or not. Make points big.
geom_point(aes(shape=(group=="Difference")), size=3) +
# Add error bars. Make sure the whiskers aren't too wide!
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound, width=0.1)) +
# Add the secondary axis.
scale_y_continuous(sec.axis = sec_axis(~.-tweak)) +
# Add the lower dotted line
geom_segment(aes(x=1, xend=3.7, y = estimatesFig3$pointEst[1],
yend = estimatesFig3$pointEst[1]), linetype=3) +
# Add the upper dotted line
geom_segment(aes(x=2, xend=3.7, y = estimatesFig3$pointEst[2],
yend = estimatesFig3$pointEst[2]), linetype=3) +
# Brighten the background
theme_light() +
# get rid of the legend added by having 2 point types
theme(legend.position = "none") +
# Add good labels
labs(title="Figure 3", x= "Groups / Difference", y="Measure")
plotFig3
This is a more traditional figure, that has two groups measured over time.
Once again, let’s do some faux data. This time I’ll just make up the point and interval estimates, because we’ve been through generating these from the actual sample data already a few times.
Let’s construct some data:
estimatesFig4 <- data.frame(
group = rep(c("Control", "Treatment"),4),
time = rep(c("Pretest","Posttest","Follow-Up 1","Follow-Up 2"), each=2),
pointEst = c(84,81,80,44,77,46,85,39),
lowerBound = c(84,81,80,44,77,46,85,39) - c(12,12,10,15,16,14,13,14),
upperBound = c(84,81,80,44,77,46,85,39) + c(12,12,10,15,16,14,13,14))
estimatesFig4$time <- factor(estimatesFig4$time,
levels = c("Pretest","Posttest","Follow-Up 1","Follow-Up 2"))
ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
geom_line()
Getting closer! Let’s add the points and make the y axis range start at 0:
plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
geom_point(aes(shape=group), size=3) +
geom_line() +
scale_y_continuous(limits=c(0,120))
plotFig4
And we can add the error bars and the horizontal line:
plotFig4 + geom_errorbar(aes(ymin=lowerBound, ymax=upperBound),
width=0.1) +
geom_hline(yintercept = 60)
We should move over one of the groups, so they don’t overlap like that!
plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
geom_point(aes(shape=group), size=3, position=position_dodge(width=0.2)) +
geom_line(position=position_dodge(width=0.2)) +
scale_y_continuous(limits=c(0,120)) +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound),
width=0.1, position=position_dodge(width=0.2)) +
geom_hline(yintercept = 60)
plotFig4
OK, we’re really close! I’ll just do a bit of cleanup in the “Final Code.”
# Select data frame and columns of primary interest. Setting a group
# allows us to graph 2 different data sets at once.
plotFig4 <- ggplot(data=estimatesFig4, aes(x=time, y=pointEst, group=group)) +
# Add the points, make them biggish, and make sure they don't overlap.
geom_point(aes(shape=group), size=3, position=position_dodge(width=0.2)) +
# Add the lines, making sure they line up with points
geom_line(position=position_dodge(width=0.2)) +
# Make the y axis scale what we want
scale_y_continuous(limits=c(0,120)) +
# Add the error bars. Again, make sure they line up with points!
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound),
width=0.1, position=position_dodge(width=0.2)) +
# Add the horizontal line
geom_hline(yintercept = 60) +
# Add a horizontal line label
geom_label(aes(x=0.5, y=60,label = "Clinical\nThreshold"), hjust="inward") +
# Make labels better
labs(x="", y="Anxiety Score", shape="")
plotFig4
Figure 5 is very similar to Figure 3, and we won’t go through the machinations to re-do it here.
Figure 6 is intended to show a meta-analysis across several studies. Means and 95% CI’s are plotted for each study, with points at different sizes according to their relative weight in the meta-analysis. We won’t go thru the data manipulation needed to arrive at the suggested data.
Let’s construct the data just as it appears in the figure:
metaAnalysis <- data.frame(StudyID = c(1:9),
StudyName=c("Aden (1993)",
"Buggs (1995)",
"Crazed (1999)",
"Dudley (2003)",
"Evers (2005)",
"Fox (2009)",
"Random effects (previous)",
"Mine (2011)",
"Random effects (all)"),
StudyType=c(rep("single", 6), "meta", "single", "meta"),
Mean = c(454,317,430,525,479,387,432.299,531,441.425),
SD = c(142,158,137,260,144,165,NA,233,NA),
N = c(24,7,54,8,14,13,120,18,138))
Let’s get the 95% confidence interval:
metaAnalysis$se <- metaAnalysis$SD/sqrt(metaAnalysis$N)
metaAnalysis$t <- qt(.975, df=metaAnalysis$N-1)
metaAnalysis$lower <- metaAnalysis$Mean - (metaAnalysis$se*metaAnalysis$t)
metaAnalysis$upper <- metaAnalysis$Mean + (metaAnalysis$se*metaAnalysis$t)
Let’s toss all of those points and error bars into a plot with the y axis reversed:
ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
geom_point(aes(shape=StudyType)) +
geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
scale_y_reverse()
## Warning: Removed 2 rows containing missing values (geom_errorbarh).
Getting close! How can I set the shapes to what I want? I know from http://sape.inf.usi.ch/quick-reference/ggplot2/shape that plotting character 18 is a filled diamond, and 15 is a filled square, so let’s try this. Note I’m also tossing in size = N while I’m at it! This isn’t precisely the weighting value, but it’s good enough and shows how to alter size by some variable.
shapes <- c("single" = 15, "meta" = 18)
ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
geom_point(aes(shape=StudyType, size = N)) +
geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
scale_y_reverse() +
scale_shape_manual(values = shapes)
## Warning: Removed 2 rows containing missing values (geom_errorbarh).
Now let’s put it all together with a few last tweaks:
# Set data source and most crucial columns
plotFig6 <- ggplot(metaAnalysis, aes(x=Mean, y=StudyID)) +
# Make points. Their shape should depend on the study type, and
# the point size should vary according to the size of the study.
geom_point(aes(shape=StudyType, size = N)) +
# make sure the size scale isn't too small!
scale_size_continuous(range=c(3,10)) +
# Draw error bars with no whisker height
geom_errorbarh(aes(xmin=lower, xmax=upper), height=0) +
# Flip the y axis
scale_y_reverse() +
# Set the shapes I want for my points
scale_shape_manual(values = shapes) +
# give it a minimal theme
theme_minimal() +
# Remove the legends.
theme(legend.position="none",
# remove the y-axis stuff.
axis.title.y=element_blank(),
axis.text.y=element_blank()) +
# Make the X axis on top and make it have more numbers
scale_x_continuous(position = "top", breaks = seq(100,800,by=100)) +
# fix labels
labs(x="ms")
plotFig6
## Warning: Removed 2 rows containing missing values (geom_errorbarh).
Figure 7 displays precision-for-planning curves. It shows how large a sample is necessary to provide various sizes of margin of error (in terms of standard deviation units).
Let’s set our desired MOE values.
moe <- seq(0.3,1.5, by=0.1)
I’m using the MBESS package, which has a lot of AIPE functions, to figure out the data. Here, width is confidence interval, so we can roughly approximate a 95% CI width as 2 margins of error (although this won’t be quite right with small groups, but bear with me…)
library(MBESS)
assurance50 <- sapply(moe, function(x) ss.aipe.smd(delta=0.5, conf.level=.95,
width = 2*(x)))
assurance99 <- sapply(moe, function(x) ss.aipe.smd(delta=0.5, conf.level=.95,
width = 2*(x), assurance = 0.99))
Let’s combine the data:
sampleSizes <- data.frame(moe, assurance50, assurance99)
This time, we have wide (not long) data, so we can’t use the “group” feature to separate lines. We’ll need to add two geoms. This also keeps us from making a legend very easily. But let’s say we just need a quick exploratory plot, so we won’t reshape the data.
plotFig7 <- ggplot(sampleSizes, aes(x=moe, y=assurance50)) +
geom_point(shape = 19) + geom_line() +
geom_point(aes(x=moe, y=assurance99), shape = 17) +
geom_line(aes(x=moe, y=assurance99))
plotFig7
Let’s label points with their value. We’ll use offset and size values to make them more attractive.
plotFig7 + geom_text(aes(label=assurance50),
hjust=1.5, vjust=1.5, size=3) +
geom_text(aes(y=assurance99, label=assurance99), hjust=-.5,
vjust=-.5, size=3)
Alright, just a few more small edits!
# Set data source
plotFig7 <- ggplot(sampleSizes, aes(x=moe, y=assurance50)) +
# Large dots for assurance at 50%
geom_point(shape = 19) +
geom_line() +
# Filled triangles for assurance at 99%
geom_point(aes(x=moe, y=assurance99), shape = 17) +
geom_line(aes(x=moe, y=assurance99)) +
# Label the points
geom_text(aes(label=assurance50),
hjust=1.5, vjust=1.5, size=3) +
geom_text(aes(y=assurance99, label=assurance99), hjust=-.5,
vjust=-.5, size=3) +
# Very simple theme
theme_bw() +
# Change labels
labs(x="Margin of Error", y="Sample Size")
plotFig7