R Graphics using lattice

We will work with a mouse dataset, containing “gene expression profiles of purified photoreceptors at distinct developmental stages and from different genetic backgrounds”. The microarray platform was Affymetrix mouse genomic expression array 430 2.0.

Load the photoRec (mini) data

kDat <- readRDS("GSE4051_MINI.rds")
str(kDat)
## 'data.frame':    39 obs. of  7 variables:
##  $ sidChar   : chr  "Sample_11" "Sample_12" "Sample_2" "Sample_9" ...
##  $ sidNum    : num  11 12 2 9 36 37 38 39 16 17 ...
##  $ devStage  : Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 5 5 5 5 1 1 ...
##  $ gType     : Factor w/ 2 levels "wt","NrlKO": 2 2 2 2 1 1 1 1 2 2 ...
##  $ crabHammer: num  9.68 9.13 9.74 9.82 9.96 ...
##  $ eggBomb   : num  7.2 7.16 7.11 6.56 7.87 ...
##  $ poisonFang: num  6.98 7.35 7.08 7.04 6.99 ...
table(kDat$devStage)
## 
##     E16      P2      P6     P10 4_weeks 
##       7       8       8       8       8
table(kDat$gType)
## 
##    wt NrlKO 
##    20    19
with(kDat, table(devStage, gType))
##          gType
## devStage  wt NrlKO
##   E16      4     3
##   P2       4     4
##   P6       4     4
##   P10      4     4
##   4_weeks  4     4

We see there are 39 samples (rows or observations). We have

- sample ID character (sidChar): a string recording sample ID
- sample ID number (sidNum): integers between 1 and 39
- developmental stage of the mouse (devStage): a factor with levels 
    * E16 (day 16 of embryonic development), 
    * P2 (postnatal day 2), 
    * P6 (postnatal day 6), 
    * P10 (postnatal day 10), 
    * 4_weeks (4 weeks postnatal)
- genotype of the mouse (gType): 
    * wt (wild type) or 
    * NrlKO (gene Nrl has been knocked out)
- gene expression level (in some abstract unit-less sense) for 
    * crabHammer, 
    * eggBomb, and 
    * poisonFang

It's pretty clear the intent was to have 4 mice at each combination of genotype and developmental stage, but there was some mishap for the embryonic knockouts, i.e. E16-NrlKO, where we only have data from 3 mice.

Scatterplots

The most important plot type is arguably the scatterplot. The lattice function for this is xyplot(). Let's plot the expression data for two of these genes against each other using the lattice function xyplot().

To write the figure to file with an R code precede the figure-making code by opening a graphics device and follow it with a command that closes the device. You can view the file whatever way is appropriate to your OS. It's a silly figure but we've shown the process. Don't be surprised when you don't see any figure appear in your screen device. While you're sending graphics output to, e.g., the pdf() device, you'll be “flying blind”, which is why it's important to work out the graphics commands in advance.

library(lattice)
pdf("Fig1_eggBomb_vs_crabHammer.pdf")
xyplot(eggBomb ~ crabHammer, kDat)
dev.off()
## pdf 
##   2
list.files(pattern = "Fig1_eggBomb_vs_crabHammer.pdf")
## [1] "Fig1_eggBomb_vs_crabHammer.pdf"

lattice graphing functions, like many other functions in R, utilize a special formula syntax and that is what is expected as the first inputted argument. The general form is y ~ x, read as “y twiddle x”. Here it requests that the first variable, eggBomb, be associated with the y direction and the second, crabHammer, be associated with x. The second argument kDat specifies the data.frame where the variables eggBomb and crabHammer can be found. It is being matched against the formal argument data by its position. As with the formula syntax, the use of a data argument, usually in the second position, is common in R, though sadly not universal. Take advantage of it whenever you can!

You try: request a scatterplot of the variable poisonFang against crabHammer.

xyplot(poisonFang ~ crabHammer, kDat)

plot of chunk Fig2_scatter_poisonFang_vs_crabHammer

Plotting two response variables against explanatory variable

Let's imagine that crabHammer is somehow a natural explanatory variable or predictor (weird here, I admit, but go with me) and eggBomb and poisonFang are natural response variables. We might want to see both responses plotted against crabHammer at the same time. Here is a first way to do so, using a bit of a cheat known as the “extended formula interface” in lattice.


xyplot(poisonFang + eggBomb ~ crabHammer, kDat, auto.key = T)

plot of chunk Fig3_scatter_poisonFangANDeggBomb_vs_crabHammer

The “+” sign between the two response variables is, in this context, purely artificial, i.e. we don't literally add them up and scatterplot the sum against crabHammer. Read it as “and”. We've added auto.key = TRUE which produces the legend at the top and is a great convenience. Custom legends are absolutely possible but don't go there until you need to!

What if we want each response to have it's own scatterplot, but we want to put them side-by-side for comparison?


xyplot(poisonFang + eggBomb ~ crabHammer, kDat, outer = T, grid = T)

plot of chunk Fig4_scatter_poisonFangANDeggBomb_vs_crabHammer_separate

The addition of outer = TRUE has caused the information to be presented in two panels, which is the lattice term for the individual plotting cells. The addition of a background grid, via grid = TRUE, is a great help for viewers who want to make comparisons across panels.

What if we'd like to know which points are from the wild type mice versus the Nrl knockouts?


xyplot(poisonFang + eggBomb ~ crabHammer, kDat, group = gType, outer = T, grid = T, 
    auto.key = T)

plot of chunk Fig5_scatter by gType

The more proper way to make this last figure requires us to reshape the data.

The lattice paradigm is for panels to correspond to a level of a factor or to a unique combination of levels of two or more factors.

Since reshaping is a topic in and of itself, we present the reshaping code without explanation here.

nDat <- with(kDat, data.frame(sidChar, sidNum, devStage, gType, crabHammer, 
    probeset = factor(rep(c("eggBomb", "poisonFang"), each = nrow(kDat))), geneExp = c(eggBomb, 
        poisonFang)))
str(nDat)
## 'data.frame':    78 obs. of  7 variables:
##  $ sidChar   : Factor w/ 39 levels "Sample_1","Sample_10",..: 3 4 12 39 30 31 32 33 8 9 ...
##  $ sidNum    : num  11 12 2 9 36 37 38 39 16 17 ...
##  $ devStage  : Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 5 5 5 5 1 1 ...
##  $ gType     : Factor w/ 2 levels "wt","NrlKO": 2 2 2 2 1 1 1 1 2 2 ...
##  $ crabHammer: num  9.68 9.13 9.74 9.82 9.96 ...
##  $ probeset  : Factor w/ 2 levels "eggBomb","poisonFang": 1 1 1 1 1 1 1 1 1 1 ...
##  $ geneExp   : num  7.2 7.16 7.11 6.56 7.87 ...
head(nDat)
##     sidChar sidNum devStage gType crabHammer probeset geneExp
## 1 Sample_11     11  4_weeks NrlKO      9.677  eggBomb   7.204
## 2 Sample_12     12  4_weeks NrlKO      9.129  eggBomb   7.165
## 3  Sample_2      2  4_weeks NrlKO      9.744  eggBomb   7.107
## 4  Sample_9      9  4_weeks NrlKO      9.822  eggBomb   6.558
## 5 Sample_36     36  4_weeks    wt      9.960  eggBomb   7.866
## 6 Sample_37     37  4_weeks    wt      9.667  eggBomb   6.992
tail(nDat)
##      sidChar sidNum devStage gType crabHammer   probeset geneExp
## 73  Sample_4      4       P6 NrlKO      8.992 poisonFang   7.342
## 74  Sample_7      7       P6 NrlKO      8.803 poisonFang   7.754
## 75 Sample_28     28       P6    wt      8.214 poisonFang   7.428
## 76 Sample_29     29       P6    wt     10.130 poisonFang   7.100
## 77 Sample_30     30       P6    wt      8.951 poisonFang   7.274
## 78 Sample_31     31       P6    wt      8.693 poisonFang   7.409

Our reshaped dataset nDat has 78 observations of 7 variables, whereas kDat had 39 = 78 / 2 observations of 7 variables. The variables sidChar, sidNum, devStage, gType, and crabHammer have simply been repeated twice, resulting 39 * 2 = 78 rows. The variables eggBomb and poisonFang have been concatenated into a new variable geneExp with a new factor probeset identifying which gene the data is for.

Now we can make the previous plot with more canonical lattice syntax, i.e. this workflow and way of thinking will serve you better in the future:


xyplot(geneExp ~ crabHammer | probeset, nDat, grid = T, group = gType, auto.key = T)

plot of chunk Fig6_proper scatter by gType

We are using another feature of the formula syntax, y ~ x | z, which requests a scatterplot of y versus x for each level of the factor z. In general, in lattice plots the use of the vertical bar | requests individual plots in panels for each value of the factor specified on the right.

You try: Remake this plot but instead of conveying genotype via color, show developmental stage.


xyplot(geneExp ~ crabHammer | probeset, nDat, grid = T, group = devStage, auto.key = T)

plot of chunk Fig7_proper scatter by devStage

Stripplot

The next set of figures we will make requires yet more data reshaping, which is a substantial background task in many analyses. We drop the idea of crabHammer being a predictor and eggBomb and poisonFang being responses and we just treat them all equivalently.

oDat <- with(kDat, data.frame(sidChar, sidNum, devStage, gType, probeset = factor(rep(c("crabHammer", 
    "eggBomb", "poisonFang"), each = nrow(kDat))), geneExp = c(crabHammer, eggBomb, 
    poisonFang)))
str(oDat)
## 'data.frame':    117 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_1","Sample_10",..: 3 4 12 39 30 31 32 33 8 9 ...
##  $ sidNum  : num  11 12 2 9 36 37 38 39 16 17 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 5 5 5 5 1 1 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 2 2 2 2 1 1 1 1 2 2 ...
##  $ probeset: Factor w/ 3 levels "crabHammer","eggBomb",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geneExp : num  9.68 9.13 9.74 9.82 9.96 ...

Our newly reshaped version of kDat, named oDat, has 39 * 3 = 117 observations but is otherwise very similar to nDat.

A stripplot is a univariate scatterplot. Let's inspect the gene expression data, plain and simple.


stripplot(~geneExp, oDat)

plot of chunk Fig8_simple stripplot

Pretty boring and slightly nonsensical! We had to start somewhere.

Let's split things out for the three different genes.


stripplot(probeset ~ geneExp, oDat)

plot of chunk Fig9_stripplot for the 3 genes

Sometimes, it can help to add jitter, a small bit of meaningless noise, in the horizontal position.


stripplot(probeset ~ geneExp, oDat, jitter.data = T)

plot of chunk Fig10_stripplot with jitter

Notice that all the data is presented in one panel but with the different genes corresponding to different locations in the y direction.

What if we want to put the different genes in different panels?


stripplot(~geneExp | probeset, oDat, layout = c(nlevels(oDat$probeset), 1))

plot of chunk Fig11_stripplot in diff panels

What if we want to see information about wild type versus Nrl knockout?


stripplot(~geneExp | probeset, oDat, layout = c(nlevels(oDat$probeset), 1), 
    groups = gType, auto.key = T)

plot of chunk Fig12_stripplot panels with gType coding

Let's start exploring gene expression changes over the course of development.


stripplot(geneExp ~ devStage, oDat)

plot of chunk Fig13_Expr change across devStage

Retaining one panel per gene ….


stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1))

plot of chunk Fig14_Expr change across devStage_panels

Adding back the genotype information ….


stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = T)

plot of chunk Fig15_Expr change across devStage_panels coded by gType

Adding averages


stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = T, grid = T, type = c("a", "p"))

plot of chunk Fig16_stripplot with averages

The argument 'type' can be used to add a variety of enhancements. Type is specified as a vector (through the use of 'c'). The option 'p' in the above example specifies the data as points on the plot, 'a' refers to getting the average of each category and joining them by a line (other summaries can be requested too). Some of the other options include 'l' for joining points by lines, 'b' for both points and lines, 'r' for adding the fit from a simple linear regression and 'smooth' for adding a nonparametric “smooth” fitted curve.

stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = T, grid = T, type = c("r", "p"))

plot of chunk unnamed-chunk-4


stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = T, grid = T, type = c("smooth", "p"))
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1
## Warning: pseudoinverse used at 3
## Warning: neighborhood radius 1
## Warning: reciprocal condition number  0
## Warning: There are other near singularities as well. 1

plot of chunk unnamed-chunk-4


stripplot(geneExp ~ devStage | probeset, oDat, layout = c(nlevels(oDat$probeset), 
    1), groups = gType, auto.key = T, grid = T, type = c("b", "p"))

plot of chunk unnamed-chunk-4

Density plot

Here's a nice alternative to histograms!

densityplot(~geneExp, oDat)

plot of chunk unnamed-chunk-5

The vertical bar works as usual.

densityplot(~geneExp | gType, oDat, grid = T)

plot of chunk unnamed-chunk-6

groups works as usual – a real advantage over histogram.

densityplot(~geneExp, oDat, groups = gType, auto.key = T)

plot of chunk unnamed-chunk-7

The argument 'bw' specifies the bandwidth or the spread of the underlying Gaussian distributions.

It controls how smooth this smoothed histogram will be. Though densityplot() has a sensible default, you can always specify directly if you wish. The argument 'n' controls the number of points at which the kernel density estimate is evaluated. It is easy to confuse this with the usual use of 'n' to denote sample size, so beware. If your density looks jaggedy, try increasing 'n'.

jBw <- 0.2
jn <- 400
densityplot(~geneExp, oDat, groups = gType, auto.key = T, bw = jBw, n = jn, 
    main = paste("bw=", jBw, ",n=", jn))

plot of chunk unnamed-chunk-8

You try: use densityplot() to explore the gene expression distribution by gene and/or developmental stage. Play with 'bw' and 'n' if you like.

jBw <- 0.5
jn <- 200
densityplot(~geneExp, oDat, groups = gType, auto.key = T, bw = jBw, n = jn, 
    main = paste("bw=", jBw, ",n=", jn))

plot of chunk unnamed-chunk-9


densityplot(~geneExp, oDat, groups = devStage, auto.key = T, bw = jBw, n = jn, 
    main = paste("bw=", jBw, ", n=", jn))

plot of chunk unnamed-chunk-9


densityplot(~geneExp | devStage, oDat, groups = gType, auto.key = T, grid = T, 
    bw = jBw, n = jn, main = paste("bw=", jBw, ", n=", jn))

plot of chunk unnamed-chunk-9

Boxplot

There is also a time and place for boxplots, obtained with the lattice function bwplot() for “box-and-whiskers plot”.

bwplot(geneExp ~ devStage, oDat)

plot of chunk unnamed-chunk-10

The vertical bar | still works ….

bwplot(geneExp ~ devStage | gType, oDat)

plot of chunk unnamed-chunk-11

A violinplot is a hybrid of densityplot and histogram.

bwplot(geneExp ~ devStage, oDat, panel = panel.violin)

plot of chunk unnamed-chunk-12

This is our first explicit glimpse of how a panel function might be specified to some non-default value. More on that can be found in Sarkar's book and the STAT545A materials.

Heatmaps

Now we need a larger dataset. Let's load the real full data matrix and experimental design now.

The experimental data, i.e. covariate information for the samples, is available in plain text form in the file GSE4051_design.tsv but you will notice below that we prefer to load it from a RDS file for better factor levels.

prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)
## 'data.frame':    29949 obs. of  39 variables:

prDes <- readRDS("GSE4051_design.rds")
str(prDes)
## 'data.frame':    39 obs. of  4 variables:
##  $ sidChar : chr  "Sample_11" "Sample_12" "Sample_2" "Sample_9" ...
##  $ sidNum  : num  11 12 2 9 36 37 38 39 16 17 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 5 5 5 5 1 1 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 2 2 2 2 1 1 1 1 2 2 ...

Let's draw 50 probesets at random – but in a repeatable way!

set.seed(1)
(yo <- sample(1:nrow(prDat), size = 50))
##  [1]  7952 11145 17156 27198  6040 26902 28287 19786 18837  1850  6167
## [12]  5286 20568 11499 23046 14899 21481 29690 11375 23269 27975  6350
## [23] 19503  3758  7997 11555   401 11442 26023 10184 14424 17938 14766
## [34]  5571 24751 19997 23759  3229 21647 12302 24554 19353 23416 16540
## [45] 15842 23605   698 14271 21897 20713

hDat <- prDat[yo, ]
str(hDat)
## 'data.frame':    50 obs. of  39 variables:
##  $ Sample_11: num  9.09 6.23 7 6.4 6.22 ...
##  $ Sample_12: num  9.21 6.47 7.17 6.78 5.76 ...
##  $ Sample_2 : num  9.05 6.2 7.2 6.54 5.9 ...
##  $ Sample_9 : num  8.89 7.67 6.86 6.72 6.26 ...
##  $ Sample_36: num  10.41 7.9 6.57 6.87 6.09 ...
##  $ Sample_37: num  9.48 6.36 6.99 6.85 6.08 ...
##  $ Sample_38: num  10.14 7.24 6.95 6.99 6.13 ...
##  $ Sample_39: num  9.48 6.21 6.97 7.13 6.02 ...
##  $ Sample_16: num  8.51 6.66 7.75 7.25 5.5 ...
##  $ Sample_17: num  8.18 7.95 6.85 6.38 7.51 ...
##  $ Sample_6 : num  7.96 8.45 7.42 6.19 7.64 ...
##  $ Sample_20: num  8.3 8.25 6.91 6.79 6.4 ...
##  $ Sample_21: num  8.33 9.14 6.55 6.23 6.69 ...
##  $ Sample_22: num  8.43 8.19 6.59 6.69 6.14 ...
##  $ Sample_23: num  8.49 8.66 6.58 6.34 6.34 ...
##  $ Sample_13: num  8.73 6.83 7.14 6.46 6.25 ...
##  $ Sample_15: num  8.57 6.86 7.12 6.72 5.84 ...
##  $ Sample_18: num  8.96 6.95 6.96 6.31 6.44 ...
##  $ Sample_19: num  8.65 6.69 7.03 6.91 6.32 ...
##  $ Sample_32: num  9.7 7.62 6.96 6.89 6.82 ...
##  $ Sample_33: num  8.72 6.83 7.36 7.21 5.93 ...
##  $ Sample_34: num  8.58 6.73 7.63 7.19 5.93 ...
##  $ Sample_35: num  8.54 6.91 7.14 6.71 5.59 ...
##  $ Sample_14: num  8.46 6.99 7.14 6.78 6.29 ...
##  $ Sample_3 : num  8.53 7.14 7.23 6.88 6.15 ...
##  $ Sample_5 : num  8.45 6.71 7.36 6.92 6 ...
##  $ Sample_8 : num  8.62 6.66 7.9 6.97 5.95 ...
##  $ Sample_24: num  8.34 7.49 7.17 6.84 5.83 ...
##  $ Sample_25: num  8.14 7.39 7.12 7.02 5.85 ...
##  $ Sample_26: num  8.45 6.94 7.46 7.43 6.32 ...
##  $ Sample_27: num  8.25 6.5 7.23 6.91 5.8 ...
##  $ Sample_1 : num  8.66 6.81 7.22 6.56 6.03 ...
##  $ Sample_10: num  8.68 7.41 6.97 6.5 5.99 ...
##  $ Sample_4 : num  8.74 7.23 6.91 6.78 5.82 ...
##  $ Sample_7 : num  8.69 6.61 7.43 7.1 5.64 ...
##  $ Sample_28: num  8.63 6.46 7.45 7.17 6.03 ...
##  $ Sample_29: num  8.58 7.84 6.72 6.91 6.31 ...
##  $ Sample_30: num  8.28 7.01 6.81 6.75 5.81 ...
##  $ Sample_31: num  8.47 6.88 7.18 6.89 5.79 ...

The functions for heatmapping expect a matrix not a data.frame, so we will convert hDat and also transpose for a nicer heatmap orientation below. I also give the samples more informative names that capture genotype and developmental stage.

hDat <- as.matrix(t(hDat))
rownames(hDat) <- with(prDes, paste(devStage, gType, sidChar, sep = "_"))
str(hDat)
##  num [1:39, 1:50] 9.09 9.21 9.05 8.89 10.41 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:39] "4_weeks_NrlKO_Sample_11" "4_weeks_NrlKO_Sample_12" "4_weeks_NrlKO_Sample_2" "4_weeks_NrlKO_Sample_9" ...
##   ..$ : chr [1:50] "1426822_at" "1431375_s_at" "1440076_at" "1456157_at" ...

The basic built-in function for heatmapping is heatmap()

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8))

plot of chunk unnamed-chunk-16

Some of the other built-in color schemes aren't quite as likely to make your eyes bleed …

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", col = cm.colors(256), margins = c(5, 
    8))

plot of chunk unnamed-chunk-17

But long-term it is good to learn how to take control of the color scheme. To help with that, we load the package RColorBrewer that has some useful pre-selected palettes I often use as the basis for a variety of color tasks.

install.packages("RColorBrewer")
## Installing package into 'C:/Users/JML/Documents/R/win-library/3.0'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library(RColorBrewer)
display.brewer.all()

plot of chunk unnamed-chunk-18

Warning: this will seem arcane. To map a small number of colors into a ramp or palette suitable for encoding a quantitative variable as colors, the function colorRampPalette() is useful. Its somewhat surprising return value is a function itself that takes a number n as input and returns a correspondingly sized color ramp. Here I prepare to use a subdued gray palette and also a more stimulating pale blue to purple palette.

jGreysFun <- colorRampPalette(brewer.pal(n = 9, "Greys"))
jBuPuFun <- colorRampPalette(brewer.pal(n = 9, "BuPu"))

Let's revisit the first heatmap in grays….

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8), col = jGreysFun(256))

plot of chunk unnamed-chunk-20

and the blue-purple palette.

heatmap(hDat, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8), col = jBuPuFun(256))

plot of chunk unnamed-chunk-21

By specifying Rowv = NA, Colv = NA, scale = “none”, we have been suppressing some rather common heatmap features:

the inclusion of row and column dendrograms and the normalization of the data.

Let's look at the heatmap as it would be rendered by default.

heatmap(hDat, margins = c(5, 8), col = jBuPuFun(256))

plot of chunk unnamed-chunk-22

Now we allow scaling within column:

heatmap(hDat, col = jBuPuFun(256), margins = c(5, 8), scale = c("column"))

plot of chunk unnamed-chunk-23

Adding color legend

Finally we try out another popular heatmapping function heatmap.2() from the gplots package. This adds an automatic color legend, which helps you determine what each color extreme actually means.

install.packages("gplots")
## Installing package into 'C:/Users/JML/Documents/R/win-library/3.0'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library(gplots)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(hDat, col = jGreysFun(256), trace = "none")

plot of chunk unnamed-chunk-24

heatmap.2(hDat, col = jBuPuFun(256), trace = "none")

plot of chunk unnamed-chunk-24

Overplotting

Now that we've loaded the main dataset we can also explore high-volume scatterplotting and the solutions to overplotting. First let's pick two samples at random to plot against each other.

set.seed(924)
(yo <- sample(1:ncol(prDat), size = 2))
## [1] 25 24
y <- prDat[[yo[1]]]
z <- prDat[[yo[2]]]
str(y)
##  num [1:29949] 7.16 9.55 9.84 8.33 8.5 ...
str(z)
##  num [1:29949] 7.09 9.56 9.88 8.57 8.59 ...
library(lattice)
xyplot(y ~ z, asp = 1)

plot of chunk unnamed-chunk-25

See the giant dark point cloud of death? What's going on in there? Who knows. The overplotting is potentially hiding information about the joint distribution. Also, these plots use up lots of ink when printing and take a long time to load if you're giving a talk. Stop using plain vanilla scatterplot tools for very high-volume plots. The superior alternatives divide the plane into small neighborhoods and shade them according to how many points fall there.

The base package graphics ('base' meaning always installed, always loaded) offers the smoothScatter() function which shades the plane according to a 2D kernel density estimate.

smoothScatter(y ~ z, asp = 1)

plot of chunk unnamed-chunk-26

You can see that we were missing some information in the dark cloud above. There is one main clump of data, concentrated around (6, 6) and then petering out diagonally up the x = y line. There's arguably a second, smaller clump of data on a steeper line running through the points ~(10, 8) and ~(14, 14).

The xyplot() function in lattice can produce a similar plot by specifying a smoothScatter-type of panel function.

xyplot(y ~ z, asp = 1, panel = panel.smoothScatter, nbin = 150)

plot of chunk unnamed-chunk-27

The add-on package hexbin implements hexagonal binning. Basically the plane is divided into hexagons and shaded as described above.

install.packages("hexbin")
## Installing package into 'C:/Users/JML/Documents/R/win-library/3.0'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library(grid)
library(hexbin)
hexbinplot(y ~ z)

plot of chunk unnamed-chunk-28

Plot Matrix

Finally, if you want to plot several variables against each other, i.e. create a scatterplot matrix, there are several functions for that.

- base graphics offers **pairs()**
- lattice offers **splom()** for scatterplot matrix
- hexbin offers **hexplom()**
- ggplot2 offers **plotmatrix()** (sorry we won't go here today)

We need to take a slightly larger sample of columns now.

set.seed(3)
(yo <- sample(1:ncol(prDat), size = 4))
## [1]  7 31 15 12
pairDat <- subset(prDat, select = yo)
str(pairDat)
## 'data.frame':    29949 obs. of  4 variables:
##  $ Sample_38: num  7.37 9.44 9.48 8.49 8.34 ...
##  $ Sample_27: num  7.21 8.64 9.43 8.33 8.43 ...
##  $ Sample_23: num  7.07 10.13 9.91 8.49 8.64 ...
##  $ Sample_20: num  7.24 9.48 10.01 8.36 8.59 ...

plot matrix using the base function pairs() - is a bit slow and we get the usual awful dark point clouds.

pairs(pairDat)

plot of chunk unnamed-chunk-30

However, pairs() can be combined with smoothScatter() for a better result. Somewhat faster and definitely better looking, more informative.

pairs(pairDat, panel = function(...) smoothScatter(..., add = T))

plot of chunk unnamed-chunk-31

Here's splom() from lattice, first using the default, non-high-volume panel function.

splom(pairDat)

plot of chunk unnamed-chunk-32

splom() from lattice again, but using a smoothScatter-type panel function. Much faster! More informative!

splom(pairDat, panel = panel.smoothScatter, raster = T)

plot of chunk unnamed-chunk-33

plot matrix using hexplom()

hexplom(pairDat)

plot of chunk unnamed-chunk-34