STAT540 Seminar 03: Introduction to R graphics using ggplot2

Rebecca Johnston 22/01/13

Load required packages
Read photoRec data
Scatter plots
Strip plots
Density plots
Boxplots
Overplotting and plot matrix
Take home problem

Load required packages

library("ggplot2")  # for graphing
library("hexbin")  # for making hexbin objects

## Loading required package: grid
## Loading required package: lattice

library("reshape2")  # for reshaping data from wide to tall format (melt)

Read photoRec data

Save rds file from STAT540 github. Note this is a different data structure compared to sem01. There are now 7 variables including sidChar and sidNum.

kDat <- readRDS("GSE4051_MINI.rds")

Sanity checks to ensure the data loaded successfully:

str(kDat)

## 'data.frame':    39 obs. of  7 variables:
##  $ sidChar   : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum    : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType     : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

head(kDat)

##      sidChar sidNum devStage gType crabHammer eggBomb poisonFang
## 12 Sample_20     20      E16    wt     10.220   7.462      7.370
## 13 Sample_21     21      E16    wt     10.020   6.890      7.177
## 14 Sample_22     22      E16    wt      9.642   6.720      7.350
## 15 Sample_23     23      E16    wt      9.652   6.529      7.040
## 9  Sample_16     16      E16 NrlKO      8.583   6.470      7.494
## 10 Sample_17     17      E16 NrlKO     10.140   7.065      7.005

table(kDat$devStage, kDat$gType)

##          
##           wt NrlKO
##   E16      4     3
##   P2       4     4
##   P6       4     4
##   P10      4     4
##   4_weeks  4     4

The full photoRec dataset has 39 samples and 29,949 probesets. Choose 2 or 20 or 200 random probesets/genes and look for gene expression differences between the two genotypes, wild type versus knockout. Make use of the graphing techniques discussed this week such as scatter plots, box plot, etc.

Choose 20 random genes to start:

set.seed(50)
(someGene <- sample(nrow(prDat), 20))

##  [1] 21226 13108  5990 22971 15367  1339 20958 19353  1259  3220 11693
## [12]  8077 19187  2322  8301 20239 25006 10913  2219  5057

Create new dataframe with only these 20 random genes:

tDat <- prDat[someGene, ]
str(tDat, max.level = 0)

## 'data.frame':    20 obs. of  39 variables:

What structure do we need the data to be in? To use ggplot2, we will need separate columns for: probeID, sampleID, gExp, gType.

What about creating a dataframe manually by writing a function? First “stack” the gene expression values:

stDat <- stack(tDat)
head(stDat)  # Lose probeID

##   values       ind
## 1  7.582 Sample_20
## 2  6.711 Sample_20
## 3  8.169 Sample_20
## 4  6.291 Sample_20
## 5  6.484 Sample_20
## 6  9.889 Sample_20

Write function:

makeDf <- function(a, b, c) {
    # a = tDat b = stDat c = prDes
    probeID <- as.factor(row.names(a))
    gExp <- b$values  # I don't like this code :S better way to get gExp values?
    gType <- c$gType
    sID <- c$sidChar
    data.frame(probeID, gExp, gType)
}

# Call function
newDf <- makeDf(tDat, stDat, prDes)
head(newDf)

##        probeID  gExp gType
## 1   1447589_at 7.582    wt
## 2 1434502_x_at 6.711    wt
## 3   1424102_at 8.169    wt
## 4   1450094_at 6.291    wt
## 5   1437425_at 6.484 NrlKO
## 6   1417242_at 9.889 NrlKO

I don't know if I trust the above result… particularly whether the gene expression values match the correct sample… Is there a simple way I could check this?

I will use the reshape package instead. First, need to make row names a column:

tDat <- data.frame(probeID = row.names(tDat), tDat)
head(tDat, n = 3)

##                   probeID Sample_20 Sample_21 Sample_22 Sample_23
## 1447589_at     1447589_at     7.582     7.106     7.551     7.313
## 1434502_x_at 1434502_x_at     6.711     6.774     6.412     6.573
## 1424102_at     1424102_at     8.169     8.265     8.565     8.419
##              Sample_16 Sample_17 Sample_6 Sample_24 Sample_25 Sample_26
## 1447589_at       8.139     7.374    7.430     7.836     7.894     7.707
## 1434502_x_at     7.456     6.835    6.561     7.081     6.960     7.237
## 1424102_at       8.432     8.022    8.388     8.388     8.425     8.344
##              Sample_27 Sample_14 Sample_3 Sample_5 Sample_8 Sample_28
## 1447589_at       7.849     7.943    7.927    8.028    8.281     7.791
## 1434502_x_at     7.131     6.938    7.323    7.311    7.203     6.999
## 1424102_at       8.368     8.719    8.645    8.792    8.164     8.393
##              Sample_29 Sample_30 Sample_31 Sample_1 Sample_10 Sample_4
## 1447589_at       7.495     7.704     7.885    7.904     7.956    7.897
## 1434502_x_at     6.771     6.911     7.080    7.151     6.857    7.153
## 1424102_at       8.322     8.548     8.550    8.467     8.340    8.576
##              Sample_7 Sample_32 Sample_33 Sample_34 Sample_35 Sample_13
## 1447589_at      8.293     7.410     7.968     8.261     7.657     7.748
## 1434502_x_at    7.321     6.592     7.149     7.358     6.913     6.849
## 1424102_at      8.316     8.506     8.032     8.192     8.673     7.951
##              Sample_15 Sample_18 Sample_19 Sample_36 Sample_37 Sample_38
## 1447589_at       7.967     7.464     7.952     7.435     7.647     7.317
## 1434502_x_at     6.895     6.685     6.679     6.649     6.763     7.018
## 1424102_at       7.986     8.071     8.329     8.807     8.371     8.258
##              Sample_39 Sample_11 Sample_12 Sample_2 Sample_9
## 1447589_at       7.506     7.672     7.693    7.613    7.576
## 1434502_x_at     6.759     6.671     7.003    6.894    6.576
## 1424102_at       8.184     8.353     8.388    8.033    8.306

Reshape to tall format using melt, where ID variable is “probeID”:

tDat <- melt(tDat, id.vars = "probeID")
head(tDat, n = 3)

##        probeID  variable value
## 1   1447589_at Sample_20 7.582
## 2 1434502_x_at Sample_20 6.711
## 3   1424102_at Sample_20 8.169

names(tDat) <- c("probeID", "sidChar", "gExp")  # Rename columns

Merge prDes to obtain gType using “sidChar” as common column:

tDat <- merge(tDat, prDes, by = "sidChar")

For simplicity, rename probes to smaller names:

levels(tDat$probeID) <- c(LETTERS[1:length(levels(tDat$probeID))])

STRIP PLOT:
Gene expression per gene:

(strPlot <- ggplot(tDat, aes(gExp, probeID, colour = gType)) + geom_point())

plot of chunk unnamed-chunk-49

This is a bit messy… Let's reorder probeID based on gExp and replot:

tDat <- within(tDat, probeID <- reorder(probeID, gExp))

(strPlot <- ggplot(tDat, aes(gExp, probeID, colour = gType)) + geom_point())

plot of chunk unnamed-chunk-50

BOX PLOT:
Gene expression across all probes comparing gType:

(bxPlot <- ggplot(tDat, aes(gType, gExp)) + geom_boxplot())

plot of chunk unnamed-chunk-51

Gene expression per probeID comparing gType:

(bxPlot <- ggplot(tDat, aes(probeID, gExp, fill = interaction(gType))) + geom_boxplot() + 
    theme())

plot of chunk unnamed-chunk-52

DENSITY PLOT:
Gene expression across all 20 genes comparing gType:

(densPlot <- ggplot(tDat, aes(gExp, colour = gType)) + stat_density(geom = "line", 
    position = "identity"))

plot of chunk unnamed-chunk-53

Gene expression per probeID comparing gType:

(densPlot <- ggplot(tDat, aes(gExp, colour = gType)) + stat_density(geom = "line", 
    position = "identity") + facet_wrap(~probeID))

plot of chunk unnamed-chunk-54

Using ggplot visualisation tools alone, it appears that for the 20 probes that have been randomly chosen here, there is no significant difference between expression for wt and KO mice.