Plotting with ggplot2

Ryan L. Irey, MA

University of Minnesota

This tutorial will guide you through the process of turning a table of data into an intuitive and aesthetically pleasing graphical display. To guide us through this process, we will use a table from an article on hearing aid research by Gustafson, Pittman, and Fanning (2013)*. The table shows that, when the length of tubing that connects a hearing aid to the patient's ear canal changes, the level of sound at specific frequencies tends to change as well.

*Gustafson, S., Pittman, A., & Fanning, R. (2013, June). Effects of Tubing Length and Coupling Method on Hearing Threshold and Real-Ear to Coupler Difference Measures. American Journal of Audiology, 22, 190-199.

Part 1: Creating a useable spreadsheet

Before we make any graph of the data in a table, we need to convert the table into a format that can be used in R. This can be done simply using Microsoft Excel or similar spreadsheet software. When creating a new spreadsheet, it is often useful to allocate each variable to its own column by labeling each column in the first row. However, before we can label our variables, let's look at the original table and decide which variables we will be including:

This table has data for two independent variables: the six test conditions (the column labeled “variable”) and the ten different frequencies along the top. This table also has two dependent variables: the mean signal level (the columns labeled “M”) and the standard deviation of the observed data (the columns labeled “SD”). We will want to include all of these variables in our spreadsheet. Thus, our spreadsheet will have four columns, one for each variable described so far.

Step 1. Label each column with an appropriate name - It is useful to pick variable names that are as short and descriptive as possible. Also, it is important to never use spaces in the names; if a space is needed, try using a period instead (e.g., instead of “foam tip”, try “foam.tip”). One way of naming the columns is given below:
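The columns used in the rest of this tutorial are named variable, frequency, mean, and stdev (as the str() output in Part 2 shows). As a quick illustration of the no-spaces rule, base R's make.names() shows how R itself would tidy a name containing a space. This is only a sketch; the second example name is invented for illustration.

```r
# Column names used later in this tutorial
col.names <- c("variable", "frequency", "mean", "stdev")

# make.names() converts arbitrary strings into syntactically valid R names;
# spaces are replaced with periods. ("ear canal level" is a made-up example.)
make.names(c("foam tip", "ear canal level"))

# Names that are already valid are returned unchanged.
identical(make.names(col.names), col.names)
```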

Step 2. Input the data - For each frequency, we have six different mean values, one for each of the six test conditions. We would also expect six different standard deviation values, but we only have five; this is due to a convention in the field of research from which this table was obtained. We will therefore omit the “Coupler” condition from further consideration, and only consider the remaining five conditions. First, we want to indicate our test condition and frequency information before entering the data for the dependent variables. Let's start by listing the ten different frequencies in the frequency column of the spreadsheet. Our first test condition is called “foam”, so we will list “foam” in the variable column alongside each frequency. You should have something like this:

Now, input the mean and standard deviation values that correspond to each frequency within the foam condition. You will now have something like this:

Now we're on our way! We will now do the exact same thing for the remaining four test conditions. Repeat the listing of the ten different frequencies, and be sure to repeat the new test condition in the variable column alongside each frequency. Again, enter the mean and standard deviation values from the table that correspond to each frequency within each test condition. Your final spreadsheet will look something like this:
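The layout described above, with one row per combination of test condition and frequency, can also be sketched directly in R. The block below only builds the empty skeleton; the mean and stdev cells would still be filled in from the published table. The label “6cm” is an assumption, since the str() output in Part 2 lists only the first three levels.

```r
# Test conditions and frequencies; "6cm" is an assumed label.
conditions  <- c("foam", "3cm", "4cm", "5cm", "6cm")
frequencies <- c(250, 500, 750, 1000, 1500, 2000, 3000, 4000, 6000, 8000)

# expand.grid() varies its first argument fastest, so each condition is
# paired with the full list of ten frequencies, one block after another.
layout <- expand.grid(frequency = frequencies, variable = conditions,
                      stringsAsFactors = FALSE)
layout <- layout[, c("variable", "frequency")]

# Empty columns, to be filled in from the published table
layout$mean  <- NA_real_
layout$stdev <- NA_real_

nrow(layout)   # 5 conditions x 10 frequencies = 50 rows
head(layout, 3)
```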

Finally, we will save this spreadsheet as a comma-separated values file (.csv). This step is very important for use with R.
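If the table is assembled in R rather than in a spreadsheet program, write.csv() produces the same kind of file. The sketch below writes two foam rows (values taken from the data shown later in this tutorial) to a temporary file and prints the raw text, to show what a .csv file actually contains. The temporary path is a stand-in, not the location used in the tutorial.

```r
# Two rows of the foam condition, for illustration only
demo <- data.frame(
  variable  = "foam",
  frequency = c(250, 500),
  mean      = c(81.0, 80.8),
  stdev     = c(2.1, 2.8)
)

# A temporary file stands in for the real spreadsheet location
path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra, unnamed column of row numbers
write.csv(demo, path, row.names = FALSE)

# Each line of the file is one spreadsheet row, with values separated by commas
readLines(path)
```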

Part 2: Using the spreadsheet in R

Before we import and begin to work with our spreadsheet, let's load the two libraries we will need for this task:

library(ggplot2)
library(RColorBrewer)

The ggplot2 library contains all the necessary tools we'll need to plot our data. The RColorBrewer library can be used to specify particular colors in our graph.

Step 1. Import (or “read-in”) the spreadsheet - Since our spreadsheet is a .csv file, we will use the read.csv() function to bring the file into R as an object. Here, we'll refer to the file in R as “signal”, and all you'll need to enter as an argument for this function is the file path of the .csv file on your computer.

signal <- read.csv("~/Google Drive/PhD/13-14 Coursework/Spring 2014/EPSY 8252/Lab 1/signal.csv")

The str() and head() functions are particularly useful for seeing how R classifies each variable (e.g., integer, factor, character, etc.) and previewing the data frame format, respectively.

str(signal)

## 'data.frame':    50 obs. of  4 variables:
##  $ variable : Factor w/ 5 levels "3cm","4cm","5cm",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ frequency: int  250 500 750 1000 1500 2000 3000 4000 6000 8000 ...
##  $ mean     : num  81 80.8 83 84.9 88.1 89.6 89.5 85.3 66.2 57.3 ...
##  $ stdev    : num  2.1 2.8 3.3 3.1 1.7 2.5 4.3 5.5 8.6 11 ...

head(signal)

##   variable frequency mean stdev
## 1     foam       250 81.0   2.1
## 2     foam       500 80.8   2.8
## 3     foam       750 83.0   3.3
## 4     foam      1000 84.9   3.1
## 5     foam      1500 88.1   1.7
## 6     foam      2000 89.6   2.5
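One caveat when reproducing the str() output above: since R 4.0.0, read.csv() no longer converts text columns to factors by default, so the variable column would be read as character. A minimal sketch of one way to get the factor back is to pass stringsAsFactors = TRUE. The example uses the text argument with three rows from the output above, so that it runs without a file on disk.

```r
# Three rows copied from the head() output above, supplied as text
# (a stand-in for the signal.csv file).
csv.text <- "variable,frequency,mean,stdev
foam,250,81.0,2.1
foam,500,80.8,2.8
foam,750,83.0,3.3"

# stringsAsFactors = TRUE converts text columns to factors
demo <- read.csv(text = csv.text, stringsAsFactors = TRUE)

str(demo)
class(demo$variable)
```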

For this example, we will want to turn our frequency variable into a factor, so that each frequency value is spaced evenly along the x-axis (otherwise, for example, we would see a cluster of data points squished between 250 and 2000 Hz, but only two points between 2000 and 4000 Hz).

signal$frequency <- as.factor(signal$frequency)
str(signal)

## 'data.frame':    50 obs. of  4 variables:
##  $ variable : Factor w/ 5 levels "3cm","4cm","5cm",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ frequency: Factor w/ 10 levels "250","500","750",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean     : num  81 80.8 83 84.9 88.1 89.6 89.5 85.3 66.2 57.3 ...
##  $ stdev    : num  2.1 2.8 3.3 3.1 1.7 2.5 4.3 5.5 8.6 11 ...
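Two details of this conversion are worth checking: the levels keep their numeric order, and converting a factor straight back to a number returns level positions rather than the original values. The sketch below illustrates both on a stand-alone vector of the ten frequencies.

```r
freq <- c(250L, 500L, 750L, 1000L, 1500L, 2000L, 3000L, 4000L, 6000L, 8000L)
freq.factor <- as.factor(freq)

# The values are sorted numerically before being stored as text labels,
# so "1000" comes after "750" rather than being sorted alphabetically.
levels(freq.factor)

# as.integer() on a factor returns the level positions (1 to 10) ...
as.integer(freq.factor)

# ... so go through as.character() to recover the original values.
as.integer(as.character(freq.factor))
```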

Step 2. Adding new data to an existing data frame - Thinking ahead to what we'll be plotting, our dependent variables are mean and standard deviation values. To better aid the interpretation of our graph, we may want to include two new columns of data: one column that gives the mean minus the standard deviation, and another that gives the mean plus the standard deviation. As we'll see, this will make adding error bars a breeze!

Let's call these two new columns of data “stdev.lwr” and “stdev.upr”. We can tell R to calculate these values using the following script:

signal$stdev.lwr = signal$mean - signal$stdev
signal$stdev.upr = signal$mean + signal$stdev

For the first line, the code to the left of the = sign says “create a new column in our ‘signal’ data frame, and call it ‘stdev.lwr’”. The code to the right of the = sign specifies that the data in this column should equal the mean value minus the standard deviation in each row. The second line is essentially saying the same thing, except instead of subtracting the standard deviation, the standard deviation is added to the mean. Let's see what results:

head(signal, 10)

##    variable frequency mean stdev stdev.lwr stdev.upr
## 1      foam       250 81.0   2.1      78.9      83.1
## 2      foam       500 80.8   2.8      78.0      83.6
## 3      foam       750 83.0   3.3      79.7      86.3
## 4      foam      1000 84.9   3.1      81.8      88.0
## 5      foam      1500 88.1   1.7      86.4      89.8
## 6      foam      2000 89.6   2.5      87.1      92.1
## 7      foam      3000 89.5   4.3      85.2      93.8
## 8      foam      4000 85.3   5.5      79.8      90.8
## 9      foam      6000 66.2   8.6      57.6      74.8
## 10     foam      8000 57.3  11.0      46.3      68.3
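The same two columns can be created in a single call with base R's transform(), which some readers may find easier to read. This is an alternative sketch, not the approach used in the rest of the tutorial; it uses three foam rows copied from the output above.

```r
# Three rows copied from the output above
demo <- data.frame(
  variable  = "foam",
  frequency = c(250, 500, 750),
  mean      = c(81.0, 80.8, 83.0),
  stdev     = c(2.1, 2.8, 3.3)
)

# transform() evaluates its arguments inside the data frame, so columns
# can be referred to by name without the demo$ prefix.
demo <- transform(demo,
                  stdev.lwr = mean - stdev,
                  stdev.upr = mean + stdev)
demo
```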

Part 3. Visualizing the data in ggplot2

Now that we have all of our data formatted, we can start to visualize the data in the form of a graph. In your own practice, here are some things to consider:

    - What information will go on the graph? 
    - What kind of graph will best represent the data?
    - Will the message my data conveys be easily interpreted on one graph, or multiple graphs?

In this example, we have our primary dependent variable, the mean, which will no doubt go on the y-axis (along with the range of values covered by one standard deviation). In deciding what to put on the x-axis, it makes the most sense here to plot the dependent variable as a function of frequency. Taking into account the five test conditions, this gives us five groups of data to convey.

Given that we only have one mean value per frequency, it is most logical to use a line graph to show how the mean value changes from one frequency to the next. Additionally, this connected line of mean values will have an error bar extending vertically in either direction; the bounds of this vertical line are specified in the stdev.lwr and stdev.upr columns.

We are still faced with the issue as to how to convey differences in test conditions. Here, we have some options - but let's get going on what we've decided so far:

Step 1. Initializing the plot - The script below shows how to specify the arguments for your plot. Note that the “color” argument (which ggplot2 also accepts as “col”) is a type of grouping variable that we will need to address down the line.

# ggplot(data = signal, aes(x = frequency, y = mean, color = variable))

First, we indicate the data frame we will be using. Next, we use an aesthetic function - aes() - to specify which variables will go on which axis. Finally, since we still have a lot of instructions for ggplot(), we will eventually put a “+” sign at the end of each line. This sign tells R to incorporate the next line of code before fully executing the script.
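Because each “+” simply adds another instruction to the plot, the plot can also be stored in an object and built up one step at a time. The sketch below shows this with three foam rows from Part 2; the object names p0 and p1 are illustrative only.

```r
library(ggplot2)

# Three foam rows copied from the output in Part 2
demo <- data.frame(
  variable  = "foam",
  frequency = factor(c(250, 500, 750)),
  mean      = c(81.0, 80.8, 83.0)
)

# The base plot holds the data and the axis mappings, but no layers yet
p0 <- ggplot(data = demo, aes(x = frequency, y = mean, color = variable))
length(p0$layers)

# Each "+" adds one more instruction to the stored plot
p1 <- p0 + geom_point(size = 2)
length(p1$layers)
```

Printing p1 (by typing its name at the console) draws the plot.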

Step 2. Plotting the data - Next, we will tell ggplot what kind of graph we want to make. To best visualize the data in this example, we will plot two different graph types on top of each other, and add a third type to generate the error bars. Graph types are typically called up using a function of the form geom_type() (for example, geom_line()), as below:

ggplot(data = signal, aes(x = frequency, y = mean, color = variable)) +
    geom_line(aes(group = variable)) +
    geom_point(aes(group = variable), size = 2) +
    geom_errorbar(aes(ymin = stdev.lwr, ymax = stdev.upr, group = variable), width = 0.3)

plot of chunk unnamed-chunk-8

The geom_line() function tells ggplot to plot a line connecting each mean value across the different frequencies; the group = variable mapping gives us a separate line for each of the different test conditions. Note that se = FALSE, which suppresses the confidence band that geom_smooth() draws around a fitted line, is an argument of geom_smooth() only. geom_line() draws no such band and does not use that argument (recent versions of ggplot2 warn that it is ignored).

The geom_point() function tells ggplot to put a nice, big (size = 2) circle at each frequency's mean value, to better clarify the data.

The geom_errorbar() function tells ggplot to draw a vertical line extending from the mean value down to the stdev.lwr value, and also a vertical line extending from the mean value up to the stdev.upr value (for each frequency/test condition). The width argument sets the width of the horizontal caps at either end.
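The group = variable mapping in geom_line() matters because frequency is now a factor. By default, ggplot2 forms groups from every combination of the discrete variables in the plot, so each frequency would become its own one-point group and no line could be drawn. The sketch below, using the ten foam rows from Part 2, inspects the groups that ggplot2 computes in each case via layer_data().

```r
library(ggplot2)

# Ten foam rows copied from the head(signal, 10) output in Part 2
foam <- data.frame(
  variable  = "foam",
  frequency = factor(c(250, 500, 750, 1000, 1500, 2000, 3000, 4000, 6000, 8000)),
  mean      = c(81.0, 80.8, 83.0, 84.9, 88.1, 89.6, 89.5, 85.3, 66.2, 57.3)
)

base <- ggplot(foam, aes(x = frequency, y = mean, color = variable))

# Without an explicit group, each frequency level becomes its own group
ungrouped <- layer_data(base + geom_line())
length(unique(ungrouped$group))

# With group = variable, all ten points of a condition share one group
grouped <- layer_data(base + geom_line(aes(group = variable)))
length(unique(grouped$group))
```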

Step 3. Adding some basic aesthetics - Now that we have our basic graph, we can add a few aesthetics to tie everything together:

ggplot(data = signal, aes(x = frequency, y = mean, color = variable)) +
    geom_line(aes(group = variable)) +
    geom_point(aes(group = variable), size = 2) +
    geom_errorbar(aes(ymin = stdev.lwr, ymax = stdev.upr, group = variable), width = 0.3) +
    xlab("Frequency (Hz)") +
    ylab("Ear Canal Level (dB SPL)") +
    ggtitle("Signal Level in the Ear Canal") +
    theme_bw()

plot of chunk unnamed-chunk-9

Now we can execute the entire ggplot script; here is what we get:

ggplot(data = signal, aes(x = frequency, y = mean, color = variable)) +
    geom_line(aes(group = variable)) +
    geom_point(aes(group = variable), size = 2) +
    geom_errorbar(aes(ymin = stdev.lwr, ymax = stdev.upr, group = variable), width = 0.3) +
    xlab("Frequency (Hz)") +
    ylab("Ear Canal Level (dB SPL)") +
    ggtitle("Signal Level in the Ear Canal") +
    theme_bw()

plot of chunk unnamed-chunk-10

This is a pretty good starting point, but we can do better, both aesthetically and in providing a clear picture of the data. Although the problem is pretty mild in this example, sometimes the lines are clustered so closely that the individual test conditions are difficult to discern. One thing we can try is to split the test conditions into their own panels, instead of relying purely on the color of the line to distinguish them. We can do this using the facet_wrap() function, in which we'll specify the variable we want to split on.

Additionally, if you're not too keen on the default ggplot2 colors for the different lines, we can use the scale_color_manual() function to specify some more aesthetically pleasing colors.

ggplot(data = signal, aes(x = frequency, y = mean, color = variable)) +
    geom_line(aes(group = variable)) +
    geom_point(aes(group = variable), size = 2) +
    geom_errorbar(aes(ymin = stdev.lwr, ymax = stdev.upr, group = variable), width = 0.3) +
    xlab("Frequency (Hz)") +
    ylab("Ear Canal Level (dB SPL)") +
    ggtitle("Signal Level in the Ear Canal") +
    theme_bw() +
    # values and labels follow the factor levels of variable, which are
    # stored alphabetically ("foam" is last); check with levels(signal$variable)
    scale_color_manual(name = "Variable Type",
        values = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E"),
        labels = c("3cm Tube", "4cm Tube", "5cm Tube", "6cm Tube", "ER3A Foam Tip")) +
    facet_wrap(~variable, nrow = 1)

plot of chunk unnamed-chunk-11
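The five hex codes passed to scale_color_manual() match the first five colors of the “Dark2” palette in RColorBrewer, so they can be generated rather than typed by hand. The sketch below also names each color by factor level, which ties a color to a test condition regardless of level order. The level name “6cm” is an assumption; check levels(signal$variable) for the actual names.

```r
library(RColorBrewer)

# The first five colors of the "Dark2" palette
dark2 <- brewer.pal(5, "Dark2")
dark2

# Naming each color by its factor level ("6cm" is an assumed level name)
condition.colors <- setNames(dark2, c("3cm", "4cm", "5cm", "6cm", "foam"))
condition.colors["foam"]
```

A named vector like condition.colors can be passed to the values argument of scale_color_manual(), which matches colors to levels by name.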

We're getting closer! Since the facet_wrap() function provides labeling information that is redundant with the legend, we can turn the legend off as well. Adding theme(legend.position = "none") will accomplish this.

ggplot(data = signal, aes(x = frequency, y = mean, color = variable)) +
    geom_line(aes(group = variable)) +
    geom_point(aes(group = variable), size = 2) +
    geom_errorbar(aes(ymin = stdev.lwr, ymax = stdev.upr, group = variable), width = 0.3) +
    xlab("Frequency (Hz)") +
    ylab("Ear Canal Level (dB SPL)") +
    ggtitle("Signal Level in the Ear Canal") +
    theme_bw() +
    # values and labels follow the factor levels of variable, which are
    # stored alphabetically ("foam" is last); check with levels(signal$variable)
    scale_color_manual(name = "Variable Type",
        values = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E"),
        labels = c("3cm Tube", "4cm Tube", "5cm Tube", "6cm Tube", "ER3A Foam Tip")) +
    facet_wrap(~variable, nrow = 1) +
    theme(legend.position = "none")

plot of chunk unnamed-chunk-12

Wrapping up

We now have an aesthetically pleasing graph that is easy to interpret. We can easily compare the signal level at each frequency across the different test conditions, and also clearly see how the variability at a particular frequency might be different for different test conditions (as well as within test conditions).