Representing Data Using ggplot2

This tutorial is a step-by-step guide to using ggplot2 to plot data taken from a table in a published journal article. Specifically, we will create a line graph of pre- and post-training scores on consonant, vowel, sentence, and gender perception tests in cochlear implant users, to examine whether an auditory training program improves performance.

Source:

Fu, Q., Galvin, J., Wang, X., & Nogaki, G. (2005). Moderate auditory training can improve speech performance of adult cochlear implant patients. Acoustics Research Letters Online 6:106-111.

This article provides a table which includes subject information (age, gender, whether they were pre- or post-lingually deafened, duration of cochlear implant use, implant device, and processing strategy) as well as pre- and post-training scores for vowel, consonant, sentence, and gender recognition tasks.

Step 1: Examine the original data set and organize it to be read into R.

To create a line plot of pre- and post-training scores, we need each subject's identification number, so that individual subjects can be told apart, and the test type, so that each participant's data can be separated into the four test categories (consonant, vowel, sentence, or gender recognition). We also need each individual's score on each test and whether that score comes from a pre-test or a post-test, so that we can visualize the effect of the training program on recognition scores.

The data can be organized into columns in an Excel spreadsheet so that R can read it easily. Here, four columns are needed: participant, time of testing (pre- or post-test), score, and type of test. All other subject information (for example, age or gender) is not needed for this graph and can be left out of the spreadsheet. The resulting spreadsheet can be seen below, in Step 2.
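The same reorganization can also be done in R instead of a spreadsheet. The following is a minimal sketch, assuming the published table has been typed in with one row per subject; the column names vowel_pre and vowel_post are hypothetical, and only the vowel scores for subjects 1 and 2 (taken from the listing in Step 2) are included.

```r
# Hypothetical "wide" layout: one row per subject (column names are assumptions).
wide <- data.frame(
  subject    = c(1, 2),
  vowel_pre  = c(9.6, 10.0),
  vowel_post = c(26.9, 18.6)
)

# Stack into "long" format: one row per subject per test time.
# rbind() interleaves the two columns so each pre score is followed by its post score.
long <- data.frame(
  subject = rep(wide$subject, each = 2),
  time    = rep(c(0, 1), times = nrow(wide)),  # 0 = pre-test, 1 = post-test
  score   = as.vector(rbind(wide$vowel_pre, wide$vowel_post)),
  test    = "vowel"
)

# Round-trip through a temporary .csv file to mimic saving and importing the data.
csv_path <- tempfile(fileext = ".csv")
write.csv(long, csv_path, row.names = FALSE)
reread <- read.csv(csv_path)
print(reread)
```

Repeating the stacking step for the other three tests and combining the pieces with rbind() would give the full four-column layout.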

Step 2: Save the data set as a .csv file and read it into R through the “import dataset” option (choose the .csv file from your computer to open the spreadsheet in R).

CI2 <- read.csv("~/Desktop/Fall 2012/EPSY 8261/csv/CI2.csv")  # import the data set
CI2  #examine the data set to verify that it is correct

##    subject time score      test
## 1        1    0   9.6     vowel
## 2        1    1  26.9     vowel
## 3        2    0  10.0     vowel
## 4        2    1  18.6     vowel
## 5        3    0  11.8     vowel
## 6        3    1  27.8     vowel
## 7        4    0  14.1     vowel
## 8        4    1  27.2     vowel
## 9        5    0  24.5     vowel
## 10       5    1  35.9     vowel
## 11       6    0  26.0     vowel
## 12       6    1  37.5     vowel
## 13       7    0  32.7     vowel
## 14       7    1  56.7     vowel
## 15       8    0  33.1     vowel
## 16       8    1  60.0     vowel
## 17       9    0  34.0     vowel
## 18       9    1  47.7     vowel
## 19      10    0  41.5     vowel
## 20      10    1  56.6     vowel
## 21       3    0   6.0 consonant
## 22       3    1  11.4 consonant
## 23       4    0  15.0 consonant
## 24       4    1  27.8 consonant
## 25       5    0  16.1 consonant
## 26       5    1  24.0 consonant
## 27       7    0  27.5 consonant
## 28       7    1  57.0 consonant
## 29       8    0  46.5 consonant
## 30       8    1  68.5 consonant
## 31       9    0  34.0 consonant
## 32       9    1  41.7 consonant
## 33      10    0  30.4 consonant
## 34      10    1  40.0 consonant
## 35       1    0  47.9    gender
## 36       1    1  48.3    gender
## 37       2    0  55.8    gender
## 38       2    1  54.6    gender
## 39       4    0  63.0    gender
## 40       4    1  64.5    gender
## 41       5    0  85.8    gender
## 42       5    1  85.0    gender
## 43       8    0  89.1    gender
## 44       8    1  89.5    gender
## 45       9    0  85.6    gender
## 46       9    1  88.3    gender
## 47       7    0   0.0  sentence
## 48       7    1  29.7  sentence
## 49       8    0  51.4  sentence
## 50       8    1  81.4  sentence
## 51      10    0  32.3  sentence
## 52      10    1  56.4  sentence

As you can see, this data set includes columns for subject number, time of testing (0 for a pre-test, 1 for a post-test), individual score, and test type (vowel, consonant, sentence, or gender). Not every subject has scores for every test, so the number of rows differs by test type.
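Beyond reading the printout, a few quick checks can confirm that the import worked. The sketch below is one possible approach, not part of the original workflow. It reads a handful of rows from the listing above through read.csv(text = ...) so that it runs without the .csv file, and the object names (sample_rows and so on) are hypothetical.

```r
# A few rows copied from the listing above, read from a string instead of a file.
sample_rows <- read.csv(text = "subject,time,score,test
1,0,9.6,vowel
1,1,26.9,vowel
3,0,6.0,consonant
3,1,11.4,consonant
7,0,0.0,sentence
7,1,29.7,sentence")

# Column types: subject and time should be numeric, test should be text.
str(sample_rows)

# Number of rows for each combination of test type and time code.
print(table(sample_rows$test, sample_rows$time))

# Every subject-test pair that appears should have exactly two rows
# (one pre-test and one post-test score).
rows_per_pair  <- table(sample_rows$subject, sample_rows$test)
complete_pairs <- all(rows_per_pair[rows_per_pair > 0] == 2)

# Percent-correct scores should fall between 0 and 100.
scores_in_range <- all(sample_rows$score >= 0 & sample_rows$score <= 100)
```

The same checks can be run on the full data set by replacing sample_rows with CI2.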

Step 3: To make the graph easier to read, convert the time column to a factor that replaces the numeric codes 0 and 1 with the labels “pre-test” and “post-test.”

# create new labels in the 'time' column
CI2$time <- factor(CI2$time, labels = c("pre-test", "post-test"))
head(CI2)  # examine the first few lines of the data

##   subject      time score  test
## 1       1  pre-test   9.6 vowel
## 2       1 post-test  26.9 vowel
## 3       2  pre-test  10.0 vowel
## 4       2 post-test  18.6 vowel
## 5       3  pre-test  11.8 vowel
## 6       3 post-test  27.8 vowel

Now, instead of 0 or 1, the data is labeled as “pre-test” or “post-test.”
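The relabeling works because factor() pairs its labels with the sorted values of the column, so 0 becomes “pre-test” and 1 becomes “post-test.” As a small standalone illustration (the vector names are hypothetical), the pairing can be made explicit with the levels argument and then checked with a cross-tabulation:

```r
# Numeric time codes as they appear in the raw data.
time_codes <- c(0, 1, 0, 1)

# Spell out which code receives which label.
time_labels <- factor(time_codes,
                      levels = c(0, 1),
                      labels = c("pre-test", "post-test"))

# The order of the levels is also the left-to-right order on a discrete x-axis.
print(levels(time_labels))

# Cross-tabulate codes against labels to confirm the mapping.
print(table(code = time_codes, label = time_labels))
```

Writing out levels is optional here, but it guards against the labels being attached in an unexpected order if the codes ever change.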

Step 4: Load the ggplot2 package.

library("ggplot2")  # load the ggplot2 package
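If library() reports that the package cannot be found, ggplot2 has not been installed yet. A cautious way to check before loading is sketched below; the variable name has_ggplot2 is hypothetical.

```r
# Check whether ggplot2 is installed without attaching it.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library("ggplot2")
  print(packageVersion("ggplot2"))  # report the installed version
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first.")
}
```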

Step 5: Use ggplot to create a line graph.

ggplot(data = CI2, aes(x = time, y = score, group = subject, color = factor(subject))) +
    geom_line() +
    geom_point() +
    facet_wrap(~test) +
    ylim(0, 100) +
    xlab("Test Time") +
    ylab("Percent Correct") +
    scale_color_manual(values = c("#8B7355", "#EE2C2C", "#1C86EE", "#8B0000", "#228B22",
        "#000000", "#9400D3", "#FF7F00", "#FFD700", "#8B7D6B"), guide = "none") +
    theme_bw()

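[Figure: line graph of pre-test and post-test percent-correct scores, one colored line per subject, with a separate panel for each test type (consonant, gender, sentence, vowel).]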

Let's look at each line of code used to create the graph separately:

1.) ggplot(data=CI2, aes(x=time, y=score, group=subject, color=factor(subject))) +

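This line passes the data frame to ggplot and uses aes() to map time to the x-axis and score to the y-axis. The group argument tells ggplot which rows belong to the same subject, so that each subject gets its own line, and color gives each subject its own color.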

2.) geom_line() +

Here, we draw a line connecting each subject's pre- and post-test scores.

3.) geom_point() +

We also add points marking the pre- and post-test scores at the ends of each line.

4.) facet_wrap(~test) +

This creates four separate panels, one for each test evaluated.

5.) ylim(0,100) +

Sets the y-axis limits to 0 and 100 so that the whole range of possible scores is visible.

6.) xlab("Test Time") +
ylab("Percent Correct") +

Changes the x- and y-axis labels.

7.) scale_color_manual(values=c("#8B7355","#EE2C2C","#1C86EE","#8B0000","#228B22","#000000","#9400D3","#FF7F00","#FFD700","#8B7D6B"), guide = "none") +

Manually assigns a color to each participant's data and removes the legend (guide).

8.) theme_bw()

Replaces the default grey background with a white one.
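To see what each layer contributes, the plot can be built up one piece at a time and printed at any stage. The sketch below is an illustration under a few assumptions rather than the original code. It uses only the vowel and sentence scores for subjects 7, 8, and 10 from Step 2, and all object names are hypothetical. It passes a named color vector so that each subject number is tied to a color explicitly, and it writes guide = "none", the spelling that current ggplot2 releases accept for hiding a legend.

```r
library(ggplot2)

# Subset of the scores listed in Step 2 (subjects 7, 8, and 10 only).
scores <- data.frame(
  subject = rep(c(7, 8, 10), each = 2, times = 2),
  time    = rep(c(0, 1), times = 6),
  score   = c(32.7, 56.7, 33.1, 60.0, 41.5, 56.6,  # vowel
              0.0, 29.7, 51.4, 81.4, 32.3, 56.4),  # sentence
  test    = rep(c("vowel", "sentence"), each = 6)
)
scores$time <- factor(scores$time, levels = c(0, 1),
                      labels = c("pre-test", "post-test"))

# Named vector: the names match the levels of factor(subject).
subject_colors <- c("7" = "#9400D3", "8" = "#FF7F00", "10" = "#8B7D6B")

# 1. Data and aesthetic mappings only (an empty plot with axes).
base_plot <- ggplot(scores, aes(x = time, y = score, group = subject,
                                color = factor(subject)))

# 2-3. Add one line per subject, then the points at each end.
with_lines  <- base_plot + geom_line()
with_points <- with_lines + geom_point()

# 4-8. Split into panels, fix the y-axis, label the axes, set colors and theme.
final_plot <- with_points +
  facet_wrap(~test) +
  ylim(0, 100) +
  xlab("Test Time") +
  ylab("Percent Correct") +
  scale_color_manual(values = subject_colors, guide = "none") +
  theme_bw()

# Draw any stage with print(), for example print(with_lines) or print(final_plot).
```

Printing base_plot, with_lines, with_points, and final_plot in turn shows the graph taking shape in the same order as the numbered explanations above.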