We could probably go on delving deeper into programming the R language for quite a while…but it's probably time that we looked at the reason most people want to learn R: so that they can make plots.

Plots are both the starting point for analysis as we explore our data, then the end point as after our filtering, normalisation, modelling and so on we produce a Figure for a paper or presentation. I'm going to introduce you first to the base R plotting functions- but then I'm going to quickly move on to show a newer and altogether more elegant plotting package called ggplot2. Along the way we will also have a quick word about getting your data in the right format for plotting (or other analyses).

Basic Plots

Begin by loading the 'tut4.RData' file into your workspace. The data.frame here (facs.df) is taken from a paper studying T-cell responses to KSHV infection (rubbish paper but handy for demonstrating graphics). The TotalT, CD8 and CD4 columns are cell counts from Fluorescence-activated cell sorting (FACS), there were 3 technical replicates, and it was all repeated in 10 patients. Then there were some treatment and response factors.. it's not important.

Robey RC, Lagos D, Gratrix F, Henderson S, Matthews NC, Vart RJ, Bower M,Boshoff C, Gotch FM. The CD8 and CD4 T-cell response against Kaposi's sarcoma-associated herpesvirus is skewed towards early and late lytic antigens.PLoS One. 2009 Jun 17;4(6)

load("tut4.RData")
ls()
## [1] "facs.df"
colnames(facs.df)
## [1] "response"  "Patient"   "TotalT"    "CD8"       "CD4"       "Replicate"
## [7] "Gene"      "GeneClass"
summary(facs.df)
##           response      Patient        TotalT          CD8       
##  nonresponder :288   1      : 48   Min.   : 3.0   Min.   : 1.10  
##  responder    :240   10     : 48   1st Qu.:13.7   1st Qu.: 8.32  
##  weakresponder: 48   11     : 48   Median :20.3   Median :14.15  
##                      2      : 48   Mean   :21.1   Mean   :16.50  
##                      3      : 48   3rd Qu.:27.9   3rd Qu.:23.10  
##                      4      : 48   Max.   :48.3   Max.   :51.30  
##                      (Other):288   NA's   :18     NA's   :14     
##       CD4       Replicate          Gene       GeneClass  
##  Min.   : 2.2   1:192     A_pSIN     : 36   Control: 69  
##  1st Qu.:15.4   2:192     K11.1K12K15: 36   Early  :144  
##  Median :23.4   3:192     K3         : 36   IEarly :180  
##  Mean   :24.3             K5         : 36   Late   : 78  
##  3rd Qu.:32.5             K6K8K14    : 36   Latent :105  
##  Max.   :54.4             K9         : 36                
##  NA's   :20               (Other)    :360                
# the attach function is useful when dealing with data.frames it simply
# makes columns accessible by name e.g. CD8 rather than facs.df$CD8
attach(facs.df)
# plot the first hundred rows of CD4 vs CD8
plot(x = CD4[1:100], y = CD8[1:100], xlab = "CD4")
points(CD8[101:200] ~ CD4[101:200], col = "green")

plot of chunk firstPlot

There are a few lessons to take from our first plot. Note that the y-axis label has been set to correspond to the input but like on the x-axis we can replace that if the variable name or subsetting is unhelpful using xlab=''. Next we have made a plot and then added further data in green using the points() function. However points() only adds to an existing plot and if the data lies outside of the range of the existing plot it will be unseen. Lastly we specified the input data to points using a model formula. For many R functions model formulae are used to specify the statistical model that we wish to apply to the data. Here it just means y vs x the reverse order to our previous plot command with named arguments.This is because the 'response' variable comes first in a model formula such as lm(TotalT~CD8) and the 'explanatory' variable' is second. So if this was a times series we would write (Height~Years) or if it was drug response we might write (Height~Dose). Usually though it makes no diffrence here we plot reponse variable on the y-axis. Here's a more useful plot

plot(TotalT ~ CD8, col = Replicate)
legend("topleft", legend = paste0("Rep", 1:3), fill = 1:3)
# I fit a regression line and extract the coefficients
cf = coef(lm(TotalT ~ CD8))
# I plot the fit
abline(coef = cf, lty = 2)
# add a note on the fit coefficients to the plot
cf.text = paste0("y = ", signif(cf[2], 3), "x + ", signif(cf[1], 
    3))
text(x = 40, y = 10, cf.text)
title(main = "FACS cell count")

plot of chunk secondPlot

This is a slightly more useful plot. This time colour represents different replicates and the replicates are probably similar - so we can ignore this factor. The legend will help for presentation figures. Next we calculate the linear fit of the data using lm() an R signature function. We extract the coefficients of the fit and plot it by feeding these into the abline() command. Finally I add a bit of textual description and add a title().

There are many many different types of plot in R in many many different packages but most of them have the same or very similar parameters and arguments. Like the summary() function the plot() command is object oriented and will often create the correct default graph based on either the mode of the x and y variables or the object passed to it.

# We can create two plots on top of each other like so...  see the par()
# command for many other graphical parameters
par(mfrow = c(2, 1))
# density() returns a density object that plot() recognises
plot(density(CD4, na.rm = T))
# CD4~GeneClass is a numeric~factor model hence a boxplot
plot(CD4 ~ GeneClass)

plot of chunk ooPlot

Exercises

  1. There were a lot of other factors in this experiment.. In the scatterplot first make the colour representdifferent Patients.
  2. There is a factor for GeneClass can you make a different symbol for each in the scatterplot.
  3. Calculate and add vertical and horizontal lines for the means of CD4 and CD8.
  4. Make a plot of the sum of CD4 and CD8 vs TotalT using a model formula with both different symbols for response and different colour for GeneClass.
  5. Calculate a regression or linear model again using lm(), save as an object and see what happens when you plot the object.

Beyond Base Graphics: ggplot2

There is a lot more we could say about base graphics. There are many many different types of graph, titles, legends and parameters that can be set with the par() command. Most of it is obvious iteration and combination of what you have learnt so far. In the last couple of years though a new package written by a rising star of the R world Hadley Wickham has nearly supplanted base graphics: ggplot2. Certainly online amongst R-bloggers and enthusiasts, ggplot2 has been very quickly adopted as the default plotting method.

The package is based on an earlier book 'The Grammar of Graphics' that proposed a unified model of how graphics should be produced by computers.

The main strengths are:

The drawbacks are:

Let's have a look at a quick example of ggplot2 in action using the convenience command qplot() which wraps up many ggplot2 features into a single step similar to the base plot() command.

library(ggplot2)
# the legend is automatic, faceting splits into mini-plots for each
# patient colour is a group factor
qplot(x = CD4, y = CD8, data = facs.df, col = Replicate, facets = ~Patient)

plot of chunk qplot

# I don't like grey backgrounds
theme_set(theme_bw())
# NB here size is a continuous not a factor scale
qplot(x = CD4, y = CD8, data = facs.df, col = Replicate, size = TotalT, 
    facets = ~Patient)

plot of chunk qplot

Probably the first thing you notice is that the graphics are split up into mini-plots for each Patient, this is the faceting. Note that the order of the mini-plots or facets is alphanumeric for a factor. Faceting is a fantastic way of comparing different groups (factors) within your data and can be combined with size, colour or at a pinch line or symbol type to clearly plot 5 or 6 scales or dimensions of data. Here we have firstly combined faceting with colour representing the Replicates and a legend is added automatically. Next we simplify the style of the plot to a more austere white background, then plot a new plot with a further dimension the TotalT as size of point. Again the legend is handled automatically.

Perhaps the most jarring aspect of ggplot2 to newcomers is shown in this next example where I add another command to fit each facet with an lm regression line.

# for simplicity let's ignore Replicate and TotalT for now
qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient) + geom_smooth(method = "lm")

plot of chunk addFunc

This looks different to what you have seen in the tutorial so far. We haven't created new functions or commands by adding commands or functions together before. In ggplot2 however this is the dominant syntax. Plots are often constructed over multiple lines of code by adding elements together. In this example we add a second plot-type or geom on top of the first. In actual fact the first command here might equally be written:

qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient, geom = "point")
qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient) + geom_point()

Either additional is redundant as qplot() recognises that a point scatterplot is the appropriate plot type (geom). Addition and recombination of different geom and other elements can lead to a myriad of different plots.

Exercises

  1. Have a look at the different geoms on the ggplot2 website
  2. Remove the facets but then add a rug geom to the last plot.
  3. Create a boxplot of CD4 by GeneClass again faceted by Patient.
  4. Can you plot the y-axis on a log-2 scale?

The elements of ggplot2

Let's go back and look at the underlying system. The elements of ggplot2 can be broken into 5 different types - we've already mentioned geoms. The combination of all elements is called a layer. Each complete plot will have at least one layer, but some plots may have many. The elements are:

  1. Data
  2. Mapping
  3. Geom
  4. Statistic
  5. Position

The first of these Data is obvious in most cases. However it is perfectly possible to overlay two different layers that draw on separate data from separate data.frames. I won't go into that here, but keep it in mind. The second Mapping is the relation between data and an aesthetic or scale (size, colour, position, chartype) that can represent continuous or discrete (i.e. factors) data. The third geom the plot type we have seen already. The Statistic relates to plots where a summary of the data rather than the actual data is plotted. For instance a histogram or barchart is a plot of the sum of a numeric variable. Usually the Statistic transformation required is handled as implicit to the particular geom. The exception is if you want to create a bar chart from data that is not already summarised:

# diamonds is a built-in ggplot2 dataset price is vector of many thousands
# of values we can chart the median price like so
qplot(x = cut, y = price, geom = "bar", fill = color, data = diamonds, 
    stat = "summary", fun.y = "median")

plot of chunk barchart

Finally Position in contrast to faceting which separates groups into different mini-plots will separate groups within the same plot. For example the last plot has a hidden automatic positioning, 'position-stack' which stacks all the different colors of diamond on top of each other. An alternative would be 'position_dodge':

qplot(x = cut, y = price, geom = "bar", fill = color, data = diamonds, 
    stat = "summary", fun.y = "median", position = "dodge")

plot of chunk barDodge

So whilst often many of these elements of a plot or layer will be automatic dependent upon the data input or the chosen 'geom' you can explicitly change any of them. Sometimes that won't make much sense but sometimes it enables you to build combinations of elements into original plots. Indeed many other packages have been built that rely on ggplot2 to create complex novel graphics:

colnums = c(2:5, 8)
require(tabplot)
tableplot(facs.df[, colnums], sortCol = "TotalT")

plot of chunk originalPlot

require(GGally)
ggpairs(facs.df[, colnums])

plot of chunk originalPlot

require(hexbin)
qplot(CD8, CD4, data = facs.df) + stat_binhex(na.rm = T, bins = 30)

plot of chunk originalPlot

However the real satisfaction comes in building your own graphics. Often for more complex or unique graphics you will want to use the ggplot command. This is similar to the 'qplot' command but the emphasis is more on building up layers of elements onto a plot. For example:

# Data is facs.df then aes or aesthetics are the mappings
bp = ggplot(facs.df, aes(x = GeneClass, y = TotalT))
# bp is an incomplete layer it has no geom
bp
## Error: No layers in plot
bp = bp + geom_boxplot()
# now it is complete
bp

plot of chunk ggplot2

# but we can still iteratively add more elements
bp = bp + facet_wrap(~Patient)
bp

plot of chunk ggplot2

# and so on and so forth
bp + geom_jitter(position = position_jitter(width = 0.1))

plot of chunk ggplot2

There are subtle differences between using qplot and ggplot but it's the iterative style of experimentation that you see above that is the hallmark of the ggplot method. The possible combinations of plot and the many many different geoms available are far too many to detail here. I would encourage to take a few minutes now though to have a look at these example sites and bookmark them for later:

There are many more websites with examples of ggplot2 code out there that you can borrow and adapt. The majority of R packages for now are full of code dependent upon base graphics but most new developers and new users are moving to ggplot2.

Exercises

  1. Use the ggplot command to create an incomplete plot from facs.df that maps CD4 verus CD8
  2. Make a scatterplot of the data but on a logarithmic scale
  3. Add a smoothing loess line.
  4. Go back make it binhexed like before and then re-add the loess line
  5. It's probably better on a linear scale, and perhaps faceted by patient. Do it.

Getting data into the right format

Coincidentally the data we have used so far has been in just the right format for feeding straight into plot, qplot or ggplot. Each column of our data.frame is a single variable that we might for example specify in a model formula e.g. TotalT~CD8+CD4. Often however the data that we load from various data files is not in this format. Consider for instance an experiment with samples collected over a series of timepoints or from a number of different individuals. You might find a datafile something like this:

Sample Time(1min) Time(2min)
A 54 23
B 45 43
C 33 23
D 11 12

The model formula desired would be something like Value~Time+Sample but the data in this format does not allow it. Formats like this are sometimes called wide and they are often useful for making plots in programs like Excel. We want it in long format so it looks like this:

Sample Time Value
A 1 54
A 2 23
B 1 45
B 2 43
C 1 33
C 2 23
D 1 11
D 2 12

We can now use qplot

# not plotted
qplot(Time, Value, col = Sample, data = long.df, geom = "line")

It's quite a common task so there are a number of R base commands and packages that can handle this. One of the easiest to use is reshape2 and the melt function. There is a good built in example dataset with this package. The data investigates the effect of different frying oils on the taste of French fries over time. There are three different types of frying oils (treatment), each in two different fryers (rep), tested by 12 people (subject) on 10 different days (time). The sensory attributes recorded, in order of desirability, are potato, buttery, grassy, rancid, painty flavours.

The following example is adapted from the reshape2 paper in the Journal of Statistical Software.

library(reshape2)
## Attaching package: 'reshape2'
## The following object(s) are masked from 'package:reshape':
## 
## colsplit, melt, recast
# this data is not in a good plotting or linear modelling format
head(french_fries)
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3
# columns 1 to 4 will stay the same bu tbe repeated 5 times 5 being the
# number of value columns that will be squished or melted into a single
# column
melted_fries <- melt(french_fries, id = 1:4, na.rm = TRUE, value.name = "desirability")
head(melted_fries)
##   time treatment subject rep variable value
## 1    1         1       3   1   potato   2.9
## 2    1         1       3   2   potato  14.0
## 3    1         1      10   1   potato  11.0
## 4    1         1      10   2   potato   9.9
## 5    1         1      15   1   potato   1.2
## 6    1         1      15   2   potato   8.8
# a fussy fussy plot but you get the point
qplot(variable, desirability, data = melted_fries, facets = time ~ 
    treatment, geom = "boxplot")
## Error: object 'desirability' not found

We have ended this tutorial starting to talk about model formulae whilst we earlier mentioned linear models and the lm function in the context of object orientation. In the next tutorial we will return to the lm command both because it is both a really powerful function in itself but also because most of the other statistical methods for R closely follow the same syntax and downstream. So if you know lm then you know glm (generalised linear models), you know 'nls' (nonlinear least squares), or even nlme (non linear mixed effects). Or at least you know how to do it if not what it is.