We could probably go on delving deeper into programming the R language for quite a while…but it's probably time that we looked at the reason most people want to learn R: so that they can make plots.
Plots are both the starting point for analysis as we explore our data, then the end point as after our filtering, normalisation, modelling and so on we produce a Figure for a paper or presentation. I'm going to introduce you first to the base R plotting functions- but then I'm going to quickly move on to show a newer and altogether more elegant plotting package called ggplot2. Along the way we will also have a quick word about getting your data in the right format for plotting (or other analyses).
Begin by loading the 'tut4.RData' file into your workspace. The data.frame here (facs.df) is taken from a paper studying T-cell responses to KSHV infection (rubbish paper but handy for demonstrating graphics). The TotalT, CD8 and CD4 columns are cell counts from Fluorescence-activated cell sorting (FACS), there were 3 technical replicates, and it was all repeated in 10 patients. Then there were some treatment and response factors.. it's not important.
Robey RC, Lagos D, Gratrix F, Henderson S, Matthews NC, Vart RJ, Bower M,Boshoff C, Gotch FM. The CD8 and CD4 T-cell response against Kaposi's sarcoma-associated herpesvirus is skewed towards early and late lytic antigens.PLoS One. 2009 Jun 17;4(6)
load("tut4.RData")
ls()
## [1] "facs.df"
colnames(facs.df)
## [1] "response" "Patient" "TotalT" "CD8" "CD4" "Replicate"
## [7] "Gene" "GeneClass"
summary(facs.df)
## response Patient TotalT CD8
## nonresponder :288 1 : 48 Min. : 3.0 Min. : 1.10
## responder :240 10 : 48 1st Qu.:13.7 1st Qu.: 8.32
## weakresponder: 48 11 : 48 Median :20.3 Median :14.15
## 2 : 48 Mean :21.1 Mean :16.50
## 3 : 48 3rd Qu.:27.9 3rd Qu.:23.10
## 4 : 48 Max. :48.3 Max. :51.30
## (Other):288 NA's :18 NA's :14
## CD4 Replicate Gene GeneClass
## Min. : 2.2 1:192 A_pSIN : 36 Control: 69
## 1st Qu.:15.4 2:192 K11.1K12K15: 36 Early :144
## Median :23.4 3:192 K3 : 36 IEarly :180
## Mean :24.3 K5 : 36 Late : 78
## 3rd Qu.:32.5 K6K8K14 : 36 Latent :105
## Max. :54.4 K9 : 36
## NA's :20 (Other) :360
# the attach function is useful when dealing with data.frames it simply
# makes columns accessible by name e.g. CD8 rather than facs.df$CD8
attach(facs.df)
# plot the first hundred rows of CD4 vs CD8
plot(x = CD4[1:100], y = CD8[1:100], xlab = "CD4")
points(CD8[101:200] ~ CD4[101:200], col = "green")
There are a few lessons to take from our first plot. Note that the y-axis label has been set to correspond to the input but like on the x-axis we can replace that if the variable name or subsetting is unhelpful using xlab=''. Next we have made a plot and then added further data in green using the points() function. However points() only adds to an existing plot and if the data lies outside of the range of the existing plot it will be unseen. Lastly we specified the input data to points using a model formula. For many R functions model formulae are used to specify the statistical model that we wish to apply to the data. Here it just means y vs x the reverse order to our previous plot command with named arguments.This is because the 'response' variable comes first in a model formula such as lm(TotalT~CD8) and the 'explanatory' variable' is second. So if this was a times series we would write (Height~Years) or if it was drug response we might write (Height~Dose). Usually though it makes no diffrence here we plot reponse variable on the y-axis. Here's a more useful plot
plot(TotalT ~ CD8, col = Replicate)
legend("topleft", legend = paste0("Rep", 1:3), fill = 1:3)
# I fit a regression line and extract the coefficients
cf = coef(lm(TotalT ~ CD8))
# I plot the fit
abline(coef = cf, lty = 2)
# add a note on the fit coefficients to the plot
cf.text = paste0("y = ", signif(cf[2], 3), "x + ", signif(cf[1],
3))
text(x = 40, y = 10, cf.text)
title(main = "FACS cell count")
This is a slightly more useful plot. This time colour represents different replicates and the replicates are probably similar - so we can ignore this factor. The legend will help for presentation figures. Next we calculate the linear fit of the data using lm() an R signature function. We extract the coefficients of the fit and plot it by feeding these into the abline() command. Finally I add a bit of textual description and add a title().
There are many many different types of plot in R in many many different packages but most of them have the same or very similar parameters and arguments. Like the summary() function the plot() command is object oriented and will often create the correct default graph based on either the mode of the x and y variables or the object passed to it.
# We can create two plots on top of each other like so... see the par()
# command for many other graphical parameters
par(mfrow = c(2, 1))
# density() returns a density object that plot() recognises
plot(density(CD4, na.rm = T))
# CD4~GeneClass is a numeric~factor model hence a boxplot
plot(CD4 ~ GeneClass)
There is a lot more we could say about base graphics. There are many many different types of graph, titles, legends and parameters that can be set with the par() command. Most of it is obvious iteration and combination of what you have learnt so far. In the last couple of years though a new package written by a rising star of the R world Hadley Wickham has nearly supplanted base graphics: ggplot2. Certainly online amongst R-bloggers and enthusiasts, ggplot2 has been very quickly adopted as the default plotting method.
The package is based on an earlier book 'The Grammar of Graphics' that proposed a unified model of how graphics should be produced by computers.
The main strengths are:
The drawbacks are:
Let's have a look at a quick example of ggplot2 in action using the convenience command qplot() which wraps up many ggplot2 features into a single step similar to the base plot() command.
library(ggplot2)
# the legend is automatic, faceting splits into mini-plots for each
# patient colour is a group factor
qplot(x = CD4, y = CD8, data = facs.df, col = Replicate, facets = ~Patient)
# I don't like grey backgrounds
theme_set(theme_bw())
# NB here size is a continuous not a factor scale
qplot(x = CD4, y = CD8, data = facs.df, col = Replicate, size = TotalT,
facets = ~Patient)
Probably the first thing you notice is that the graphics are split up into mini-plots for each Patient, this is the faceting. Note that the order of the mini-plots or facets is alphanumeric for a factor. Faceting is a fantastic way of comparing different groups (factors) within your data and can be combined with size, colour or at a pinch line or symbol type to clearly plot 5 or 6 scales or dimensions of data. Here we have firstly combined faceting with colour representing the Replicates and a legend is added automatically. Next we simplify the style of the plot to a more austere white background, then plot a new plot with a further dimension the TotalT as size of point. Again the legend is handled automatically.
Perhaps the most jarring aspect of ggplot2 to newcomers is shown in this next example where I add another command to fit each facet with an lm regression line.
# for simplicity let's ignore Replicate and TotalT for now
qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient) + geom_smooth(method = "lm")
This looks different to what you have seen in the tutorial so far. We haven't created new functions or commands by adding commands or functions together before. In ggplot2 however this is the dominant syntax. Plots are often constructed over multiple lines of code by adding elements together. In this example we add a second plot-type or geom on top of the first. In actual fact the first command here might equally be written:
qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient, geom = "point")
qplot(x = CD4, y = CD8, data = facs.df, facets = ~Patient) + geom_point()
Either additional is redundant as qplot() recognises that a point scatterplot is the appropriate plot type (geom). Addition and recombination of different geom and other elements can lead to a myriad of different plots.
Let's go back and look at the underlying system. The elements of ggplot2 can be broken into 5 different types - we've already mentioned geoms. The combination of all elements is called a layer. Each complete plot will have at least one layer, but some plots may have many. The elements are:
The first of these Data is obvious in most cases. However it is perfectly possible to overlay two different layers that draw on separate data from separate data.frames. I won't go into that here, but keep it in mind. The second Mapping is the relation between data and an aesthetic or scale (size, colour, position, chartype) that can represent continuous or discrete (i.e. factors) data. The third geom the plot type we have seen already. The Statistic relates to plots where a summary of the data rather than the actual data is plotted. For instance a histogram or barchart is a plot of the sum of a numeric variable. Usually the Statistic transformation required is handled as implicit to the particular geom. The exception is if you want to create a bar chart from data that is not already summarised:
# diamonds is a built-in ggplot2 dataset price is vector of many thousands
# of values we can chart the median price like so
qplot(x = cut, y = price, geom = "bar", fill = color, data = diamonds,
stat = "summary", fun.y = "median")
Finally Position in contrast to faceting which separates groups into different mini-plots will separate groups within the same plot. For example the last plot has a hidden automatic positioning, 'position-stack' which stacks all the different colors of diamond on top of each other. An alternative would be 'position_dodge':
qplot(x = cut, y = price, geom = "bar", fill = color, data = diamonds,
stat = "summary", fun.y = "median", position = "dodge")
So whilst often many of these elements of a plot or layer will be automatic dependent upon the data input or the chosen 'geom' you can explicitly change any of them. Sometimes that won't make much sense but sometimes it enables you to build combinations of elements into original plots. Indeed many other packages have been built that rely on ggplot2 to create complex novel graphics:
colnums = c(2:5, 8)
require(tabplot)
tableplot(facs.df[, colnums], sortCol = "TotalT")
require(GGally)
ggpairs(facs.df[, colnums])
require(hexbin)
qplot(CD8, CD4, data = facs.df) + stat_binhex(na.rm = T, bins = 30)
However the real satisfaction comes in building your own graphics. Often for more complex or unique graphics you will want to use the ggplot command. This is similar to the 'qplot' command but the emphasis is more on building up layers of elements onto a plot. For example:
# Data is facs.df then aes or aesthetics are the mappings
bp = ggplot(facs.df, aes(x = GeneClass, y = TotalT))
# bp is an incomplete layer it has no geom
bp
## Error: No layers in plot
bp = bp + geom_boxplot()
# now it is complete
bp
# but we can still iteratively add more elements
bp = bp + facet_wrap(~Patient)
bp
# and so on and so forth
bp + geom_jitter(position = position_jitter(width = 0.1))
There are subtle differences between using qplot and ggplot but it's the iterative style of experimentation that you see above that is the hallmark of the ggplot method. The possible combinations of plot and the many many different geoms available are far too many to detail here. I would encourage to take a few minutes now though to have a look at these example sites and bookmark them for later:
There are many more websites with examples of ggplot2 code out there that you can borrow and adapt. The majority of R packages for now are full of code dependent upon base graphics but most new developers and new users are moving to ggplot2.
ggplot command to create an incomplete plot from facs.df that maps CD4 verus CD8Coincidentally the data we have used so far has been in just the right format for feeding straight into plot, qplot or ggplot. Each column of our data.frame is a single variable that we might for example specify in a model formula e.g. TotalT~CD8+CD4. Often however the data that we load from various data files is not in this format. Consider for instance an experiment with samples collected over a series of timepoints or from a number of different individuals. You might find a datafile something like this:
| Sample | Time(1min) | Time(2min) |
|---|---|---|
| A | 54 | 23 |
| B | 45 | 43 |
| C | 33 | 23 |
| D | 11 | 12 |
The model formula desired would be something like Value~Time+Sample but the data in this format does not allow it. Formats like this are sometimes called wide and they are often useful for making plots in programs like Excel. We want it in long format so it looks like this:
| Sample | Time | Value |
|---|---|---|
| A | 1 | 54 |
| A | 2 | 23 |
| B | 1 | 45 |
| B | 2 | 43 |
| C | 1 | 33 |
| C | 2 | 23 |
| D | 1 | 11 |
| D | 2 | 12 |
We can now use qplot
# not plotted
qplot(Time, Value, col = Sample, data = long.df, geom = "line")
It's quite a common task so there are a number of R base commands and packages that can handle this. One of the easiest to use is reshape2 and the melt function. There is a good built in example dataset with this package. The data investigates the effect of different frying oils on the taste of French fries over time. There are three different types of frying oils (treatment), each in two different fryers (rep), tested by 12 people (subject) on 10 different days (time). The sensory attributes recorded, in order of desirability, are potato, buttery, grassy, rancid, painty flavours.
The following example is adapted from the reshape2 paper in the Journal of Statistical Software.
library(reshape2)
## Attaching package: 'reshape2'
## The following object(s) are masked from 'package:reshape':
##
## colsplit, melt, recast
# this data is not in a good plotting or linear modelling format
head(french_fries)
## time treatment subject rep potato buttery grassy rancid painty
## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5
## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0
## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0
## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0
## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1
## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3
# columns 1 to 4 will stay the same bu tbe repeated 5 times 5 being the
# number of value columns that will be squished or melted into a single
# column
melted_fries <- melt(french_fries, id = 1:4, na.rm = TRUE, value.name = "desirability")
head(melted_fries)
## time treatment subject rep variable value
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## 6 1 1 15 2 potato 8.8
# a fussy fussy plot but you get the point
qplot(variable, desirability, data = melted_fries, facets = time ~
treatment, geom = "boxplot")
## Error: object 'desirability' not found
We have ended this tutorial starting to talk about model formulae whilst we earlier mentioned linear models and the lm function in the context of object orientation. In the next tutorial we will return to the lm command both because it is both a really powerful function in itself but also because most of the other statistical methods for R closely follow the same syntax and downstream. So if you know lm then you know glm (generalised linear models), you know 'nls' (nonlinear least squares), or even nlme (non linear mixed effects). Or at least you know how to do it if not what it is.