When I started teaching introductory courses in R, my main challenge was to help students overcome their fear of programming and show them how R can make data analysis truly fun and exciting. To meet this challenge, I designed and wrote the yarrr package. The yarrr package is a collection of datasets, functions, and tutorials that help students learn and appreciate R.
One of the main tools in the yarrr package is the pirateplot() function. The purpose of the pirateplot is to solve a common task: how can I easily understand the relationship between one or more categorical independent variables and a continuous dependent variable in a factorial design? For example, an experiment might compare four different experimental conditions, a, b, c, and d, on a dependent variable y. As factorial experiments are the prototype for experimental psychology, this is a problem that both my students and I constantly face.
The standard way to visualize a factorial design is a barplot like the one shown in Figure 1a. A barplot shows the mean of each distribution with error bars. Barplots are standard practice because they are simple and easy to create in any statistical software. They also provide a picture of the data that appears straightforward. Looking at our barplot, it seems that the conditions do not differ on the dependent variable y. Indeed, an ANOVA on these data will ‘confirm’ this conclusion with a p-value of 0.939.
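As a rough sketch of what the barplot and the ANOVA summarize, the base R code below simulates four conditions in the same way as the example data in this post, then computes the per-condition means, standard errors, and a one-way ANOVA. The object names (`sim`, `cond_means`, `cond_se`, `p_value`) are chosen here purely for illustration.

```r
# Simulate four conditions whose means are nearly equal by construction
set.seed(100)
sim <- data.frame(
  condition = rep(letters[1:4], each = 100),
  y = c(
    rnorm(n = 100, mean = 100, sd = 10),   # a: one group around 100
    rnorm(n = 50, mean = 80, sd = 5),      # b: first subgroup
    rnorm(n = 50, mean = 120, sd = 5),     # b: second subgroup
    rnorm(n = 70, mean = 80, sd = 5),      # c: larger subgroup
    rnorm(n = 30, mean = 150, sd = 5),     # c: smaller subgroup
    rnorm(n = 100, mean = 100, sd = 10)    # d: one group around 100
  )
)

# What a barplot displays: one mean and one error bar per condition
cond_means <- tapply(sim$y, sim$condition, mean)
cond_se    <- tapply(sim$y, sim$condition, function(v) sd(v) / sqrt(length(v)))
print(round(cbind(mean = cond_means, se = cond_se), 2))

# One-way ANOVA on the same data; extract the p-value for the condition effect
fit     <- aov(y ~ condition, data = sim)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)
```

Because every condition is built to have a mean near 100, the summary table looks unremarkable apart from the larger standard errors in conditions b and c, which is all a bar with an error bar can convey.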
But is this conclusion justified? No, it is not. The problem is that our data visualization tool, the barplot, obscured important patterns in the data. Statisticians have shown again and again that, because barplots omit raw data and distributional information, they conceal important patterns, from multiple modes to outliers. Yet, despite this overwhelming evidence that barplots are insufficient for conveying patterns in data (Lane and Sándor 2009; Weissgerber et al. 2015; Cleveland 1984), we are still routinely publishing barplots in our top journals (Cooper, Schriger, and Close 2002).
Why are we still using barplots to visualize data? I think the main reason is that people simply are not aware of the alternatives. While there are barplot alternatives such as violinplots (Hintze and Nelson 1998) and beanplots (Kampstra and others 2008) that show distributional information, most people simply don’t know what they are or how to create them. Or, if they do know about the alternatives, they simply are not motivated to change their habits.
In order to give both my students and myself a straightforward replacement for barplots, I created the pirateplot() function, an easy-to-use function that creates a plot I call a pirateplot. Unlike a barplot, which only shows descriptive statistics (and possibly some inferential statistics in the form of a confidence interval), a pirateplot simultaneously shows three key aspects of data: raw data (shown as individual points), descriptive statistics (shown as lines), and inferential statistics (95% Bayesian highest density intervals or frequentist confidence intervals, and smoothed densities). A pirateplot of our data is shown in Figure 1b. Here, we can clearly see patterns in the data that the barplot missed. For example, we see that conditions b and c each have two distinct subgroups, while conditions a and d appear to have very similar distributions.
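To make the three layers concrete, here is a minimal base R approximation for a single bimodal condition. It only mimics the idea of layering raw data, a mean, an interval, and a density; it is not how pirateplot() is implemented, and the names (`y_b`, `half_width`, `n_near_mean`) and layout constants are assumptions made for this sketch.

```r
# Base R approximation of the three layers for one bimodal condition.
# Assumption: this mimics the idea only; it is not pirateplot()'s implementation.
set.seed(100)
y_b <- c(rnorm(n = 50, mean = 80, sd = 5), rnorm(n = 50, mean = 120, sd = 5))

m    <- mean(y_b)              # descriptive statistic
ci   <- t.test(y_b)$conf.int   # frequentist 95% confidence interval for the mean
dens <- density(y_b)           # smoothed (kernel) density estimate

# Empty canvas centered on x = 1
plot(1, m, type = "n", xlim = c(0.5, 1.5), ylim = range(dens$x),
     xaxt = "n", xlab = "condition b", ylab = "y",
     main = "Raw data, mean, CI, and density")

# Smoothed density, mirrored around x = 1 and scaled to a half-width of 0.4
half_width <- 0.4 * dens$y / max(dens$y)
polygon(x = c(1 - half_width, rev(1 + half_width)),
        y = c(dens$x, rev(dens$x)),
        border = "gray40", col = NA)

# Raw data as jittered, semi-transparent points
points(x = jitter(rep(1, length(y_b)), amount = 0.1), y = y_b,
       pch = 16, col = gray(0.2, alpha = 0.4))

# 95% confidence interval as a shaded band, mean as a horizontal line
rect(xleft = 0.6, ybottom = ci[1], xright = 1.4, ytop = ci[2],
     col = gray(0.5, alpha = 0.3), border = NA)
segments(x0 = 0.6, y0 = m, x1 = 1.4, y1 = m, lwd = 3)

# How many raw observations fall within 10 units of the mean?
n_near_mean <- sum(abs(y_b - m) < 10)
print(n_near_mean)
```

In this simulated condition the mean sits near 100, between the two subgroups, where very few raw observations lie. A bar alone cannot show that; the points and the density make it visible.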
Importantly, I designed pirateplots to be even easier to create in R than a standard barplot. Using the yarrr package, you can create the pirateplot above by typing pirateplot(y ~ condition, data = data). Pirateplots are not only easy to create, they should also be fun to use. Using the theme and pal arguments, it’s easy to customize your own pirateplot with colors inspired by movies and TV shows, including my own childhood Saturday morning cartoon favorite, the X-Men, and with themes that change which elements of the plot stand out. In Figure 4, you can see four different versions of plots created from exactly the same data with pirateplot() by changing the theme and pal arguments.
The color palettes in the yarrr package are not restricted to pirateplots. All of the palettes are available through the piratepal() function and can easily be used in any plot you’d like, such as the scatterplot in Figure 3.
In my own courses, I found that students get much more excited about data when they see it presented in a colorful, informative pirateplot than when it is reduced to a dull barplot. Pirateplots are also catching on outside of the classroom. Plots created or inspired by the pirateplot are already being used in publications (Wagenmakers et al. 2016) and even in research departments at companies such as Pandora.
library(yarrr)
set.seed(100)
data <- data.frame(
  condition = rep(letters[1:4], each = 100),
  y = c(
    rnorm(n = 100, mean = 100, sd = 10),
    c(rnorm(n = 50, mean = 80, sd = 5), rnorm(n = 50, mean = 120, sd = 5)),
    c(rnorm(n = 70, mean = 80, sd = 5), rnorm(n = 30, mean = 150, sd = 5)),
    rnorm(n = 100, mean = 100, sd = 10)
  ),
  id = 1:100
)
par(mfrow = c(1, 2))
papaja::apa_barplot(data = data, factors = "condition", id = "id", dv = "y", main = "Barplot")
pirateplot(formula = y ~ condition, data = data, cap.beans = TRUE, main = "Pirateplot", bty = "n")
par(mfrow = c(1, 2))
yarrr::piratepal("xmen", plot.result = TRUE, trans = .3)
yarrr::piratepal("pony", plot.result = TRUE, trans = .3)
set.seed(100)
x <- rnorm(100, mean = 100, sd = 10)
y <- x + rnorm(100, mean = 0, sd = 10)
plot(x, y, col = piratepal("pony", trans = .3), pch = 16, main = "Scatterplot with the pony palette")
Figure 3: A scatterplot using colors from the pony palette contained in the yarrr package.
par(mfrow = c(2, 2))
pirateplot(formula = y ~ condition, data = data, cap.beans = TRUE, main = "theme = 1, pal = 'gray'", bty = "n", pal = "gray", theme = 1)
pirateplot(formula = y ~ condition, data = data, cap.beans = TRUE, main = "theme = 2, pal = 'gray'", bty = "n", pal = "gray", theme = 2)
pirateplot(formula = y ~ condition, data = data, cap.beans = TRUE, main = "theme = 3, pal = 'xmen'", bty = "n", pal = "xmen", theme = 3)
pirateplot(formula = y ~ condition, data = data, cap.beans = TRUE, main = "theme = 4, pal = 'gray'", bty = "n", pal = "gray", theme = 4)
Cleveland, William S. 1984. “Graphs in Scientific Publications.” The American Statistician 38 (4). Taylor & Francis Group: 261–69.
Cooper, Richelle J, David L Schriger, and Reb JH Close. 2002. “Graphical Literacy: The Quality of Graphs in a Large-Circulation Journal.” Annals of Emergency Medicine 40 (3). Elsevier: 317–22.
Hintze, Jerry L, and Ray D Nelson. 1998. “Violin Plots: A Box Plot-Density Trace Synergism.” The American Statistician 52 (2). Taylor & Francis Group: 181–84.
Kampstra, Peter, and others. 2008. “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions.” Journal of Statistical Software 28 (1): 1–9.
Lane, David M, and Anikó Sándor. 2009. “Designing Better Graphs by Including Distributional Information and Integrating Words, Numbers, and Images.” Psychological Methods 14 (3). American Psychological Association: 239.
Wagenmakers, E-J, Titia Beek, Laura Dijkhoff, Quentin F Gronau, A Acosta, RB Adams, DN Albohn, et al. 2016. “Registered Replication Report Strack, Martin, & Stepper (1988).” Perspectives on Psychological Science. SAGE Publications, 1745691616674458.
Weissgerber, Tracey L, Natasa M Milic, Stacey J Winham, and Vesna D Garovic. 2015. “Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm.” PLoS Biology 13 (4). Public Library of Science: e1002128.