STAT540 Seminar 01: Exploration of small gene expression dataset

Gene expression data from photo receptor cells in mice
Rebecca Johnston 11/01/2014

Here I will be exploring prDat, a gene expression dataset from photo recptor cells in mice, from various developmental stages and two genotypes. Note we will only work with gene expression data for three genes, and their names have been replaced with 3 Pokemon attacks (crabHammer, eggBomb, and poisonFang).

Load GSE4051_MINI.txt prDat
Basic exploration of prDat
Indexing and subsetting prDat

Load GSE4051_MINI.txt prDat

Save GSE4051_MINI.txt to wd from STAT540 github

prDat <- read.table("GSE4051_MINI.txt", header = TRUE, row.names = 1)

Basic exploration of prDat

str(prDat)  # Data structure

## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

summary(prDat)  # Data summary

##      sample        devStage   gType      crabHammer       eggBomb    
##  Min.   : 1.0   4_weeks:8   NrlKO:19   Min.   : 8.21   Min.   :6.14  
##  1st Qu.:10.5   E16    :7   wt   :20   1st Qu.: 8.94   1st Qu.:6.28  
##  Median :20.0   P10    :8              Median : 9.61   Median :6.76  
##  Mean   :20.0   P2     :8              Mean   : 9.43   Mean   :6.79  
##  3rd Qu.:29.5   P6     :8              3rd Qu.: 9.83   3rd Qu.:7.09  
##  Max.   :39.0                          Max.   :10.34   Max.   :8.17  
##    poisonFang  
##  Min.   :6.74  
##  1st Qu.:7.19  
##  Median :7.35  
##  Mean   :7.38  
##  3rd Qu.:7.48  
##  Max.   :8.58

nrow(prDat)  # How many rows are there?

## [1] 39

ncol(prDat)  # How many columns or variables are there?

## [1] 6

head(prDat)  # Inspect the first few observations

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_20     20      E16    wt     10.220   7.462      7.370
## Sample_21     21      E16    wt     10.020   6.890      7.177
## Sample_22     22      E16    wt      9.642   6.720      7.350
## Sample_23     23      E16    wt      9.652   6.529      7.040
## Sample_16     16      E16 NrlKO      8.583   6.470      7.494
## Sample_17     17      E16 NrlKO     10.140   7.065      7.005

tail(prDat)  # Or the last few

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_38     38  4_weeks    wt      9.767   6.608      7.329
## Sample_39     39  4_weeks    wt     10.200   7.003      7.320
## Sample_11     11  4_weeks NrlKO      9.677   7.204      6.981
## Sample_12     12  4_weeks NrlKO      9.129   7.165      7.350
## Sample_2       2  4_weeks NrlKO      9.744   7.107      7.075
## Sample_9       9  4_weeks NrlKO      9.822   6.558      7.043

prDat[5:16, ]  # Or a random sample

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_16     16      E16 NrlKO      8.583   6.470      7.494
## Sample_17     17      E16 NrlKO     10.140   7.065      7.005
## Sample_6       6      E16 NrlKO     10.340   7.017      6.735
## Sample_24     24       P2    wt      8.869   6.587      7.508
## Sample_25     25       P2    wt      9.038   6.170      7.449
## Sample_26     26       P2    wt      9.611   6.870      7.511
## Sample_27     27       P2    wt      8.613   6.800      7.843
## Sample_14     14       P2 NrlKO      9.572   6.138      7.250
## Sample_3       3       P2 NrlKO      9.414   6.166      7.200
## Sample_5       5       P2 NrlKO      8.925   6.269      7.405
## Sample_8       8       P2 NrlKO      9.116   6.264      8.016
## Sample_28     28       P6    wt      8.214   6.530      7.428

prDat[sample(nrow(prDat), size = 6), ]  # Random sample using Jenny's code! Cool!

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_34     34      P10    wt      8.519   6.757      8.584
## Sample_20     20      E16    wt     10.220   7.462      7.370
## Sample_8       8       P2 NrlKO      9.116   6.264      8.016
## Sample_13     13      P10 NrlKO      9.838   7.228      7.459
## Sample_4       4       P6 NrlKO      8.992   6.270      7.342
## Sample_16     16      E16 NrlKO      8.583   6.470      7.494

QUESTION: What does row correspond to – different genes or different mice?
ANSWER: Rows correspond to different mice (samples).

names(prDat)  # What are the variable names?

## [1] "sample"     "devStage"   "gType"      "crabHammer" "eggBomb"   
## [6] "poisonFang"

str(prDat)  # What 'flavor' is each variable, i.e. numeric, character, factor?

## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

QUESTION: For “sample”, does each integer between 1 and nrows occur exactly once?

nrow(prDat)  # length(prDat$sample)

## [1] 39

table(prDat$sample)

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1

sort(prDat$sample)  # Sort sample numbers in ascending order

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

seq(1, nrow(prDat))  # Create sequence of numbers from 1 to nrow in dataset

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

sort(prDat$sample) == seq(1, nrow(prDat))  # Determine if 2 variables are equal

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

all.equal(sort(prDat$sample), seq(1, nrow(prDat)))

## [1] TRUE

identical(sort(prDat$sample), seq(1, nrow(prDat)))

## [1] TRUE

ANSWER: Yes, each integer between 1 and nrows occur exactly once for “sample”.

levels(prDat$devStage)  # For each factor variable, what are the levels?

## [1] "4_weeks" "E16"     "P10"     "P2"      "P6"

levels(prDat$gType)

## [1] "NrlKO" "wt"

summary(prDat$devStage)  # How many observations for each level of devStage?

## 4_weeks     E16     P10      P2      P6 
##       8       7       8       8       8

summary(prDat$gType)  # For gType?

## NrlKO    wt 
##    19    20

table(prDat$devStage, prDat$gType)  # Perform a cross-tabulation of devStage and gType

##          
##           NrlKO wt
##   4_weeks     4  4
##   E16         3  4
##   P10         4  4
##   P2          4  4
##   P6          4  4

QUESTION: What do you think the intended experimental design was? What actually happened in real life?
ANSWER: From the cross-tabulation above of genotype (gType) and developmental stage (devStage), I would assume that the intended experimental design was to compare two different genotypes in mice (wild type and knock out) across 5 different developmental stages. There were meant to be 4 mice for each of the 5 developmental stages, and 4 mice for each of the two genotypes, but it appears in “real life”, one of the knock out mice (NrlKO) at developmental stage E16 was not able to be obtained.

QUESTION: For each quantitative variable, what are the extremes? How about average or median?
ANSWER: Use summary for the variable names we want

summary(prDat[, c("crabHammer", "eggBomb", "poisonFang")])

##    crabHammer       eggBomb       poisonFang  
##  Min.   : 8.21   Min.   :6.14   Min.   :6.74  
##  1st Qu.: 8.94   1st Qu.:6.28   1st Qu.:7.19  
##  Median : 9.61   Median :6.76   Median :7.35  
##  Mean   : 9.43   Mean   :6.79   Mean   :7.38  
##  3rd Qu.: 9.83   3rd Qu.:7.09   3rd Qu.:7.48  
##  Max.   :10.34   Max.   :8.17   Max.   :8.58

range(prDat$crabHammer)  # returns min and max

## [1]  8.214 10.340

fivenum(prDat$crabHammer)  # returns Tukey's five number summary

## [1]  8.214  8.938  9.611  9.830 10.340

Indexing and subsetting prDat

Create a new data.frame called weeDat only containing observations for which expression of poisonFang is above 7.5

prDat$poisonFang > 7.5

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [12] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE

weeDat <- prDat$poisonFang[which(prDat$poisonFang > 7.5)]  # Use 'which' argument and indexing
str(weeDat)

##  num [1:9] 7.51 7.51 7.84 8.02 7.75 ...

For how many observations poisonFang > 7.5? How do they break down by genotype and developmental stage?

length(weeDat)

## [1] 9

prDat[prDat$poisonFang %in% weeDat, ]

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_24     24       P2    wt      8.869   6.587      7.508
## Sample_26     26       P2    wt      9.611   6.870      7.511
## Sample_27     27       P2    wt      8.613   6.800      7.843
## Sample_8       8       P2 NrlKO      9.116   6.264      8.016
## Sample_7       7       P6 NrlKO      8.803   6.188      7.754
## Sample_33     33      P10    wt      9.004   7.082      8.086
## Sample_34     34      P10    wt      8.519   6.757      8.584
## Sample_15     15      P10 NrlKO      9.746   7.226      7.786
## Sample_19     19      P10 NrlKO      9.771   7.081      7.586

prDat[prDat$poisonFang %in% weeDat, c("poisonFang", "gType", "devStage")]

##           poisonFang gType devStage
## Sample_24      7.508    wt       P2
## Sample_26      7.511    wt       P2
## Sample_27      7.843    wt       P2
## Sample_8       8.016 NrlKO       P2
## Sample_7       7.754 NrlKO       P6
## Sample_33      8.086    wt      P10
## Sample_34      8.584    wt      P10
## Sample_15      7.786 NrlKO      P10
## Sample_19      7.586 NrlKO      P10

prDat[prDat$poisonFang > 7.5, ]

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_24     24       P2    wt      8.869   6.587      7.508
## Sample_26     26       P2    wt      9.611   6.870      7.511
## Sample_27     27       P2    wt      8.613   6.800      7.843
## Sample_8       8       P2 NrlKO      9.116   6.264      8.016
## Sample_7       7       P6 NrlKO      8.803   6.188      7.754
## Sample_33     33      P10    wt      9.004   7.082      8.086
## Sample_34     34      P10    wt      8.519   6.757      8.584
## Sample_15     15      P10 NrlKO      9.746   7.226      7.786
## Sample_19     19      P10 NrlKO      9.771   7.081      7.586

Print the observations with row names “Sample_16” and “Sample_38” to screen, showing only the 3 gene expression variables.

prDat[c("Sample_16", "Sample_38"), c("crabHammer", "eggBomb", "poisonFang")]

##           crabHammer eggBomb poisonFang
## Sample_16      8.583   6.470      7.494
## Sample_38      9.767   6.608      7.329

Which samples have expression of eggBomb less than the 0.10 quantile?

sort(prDat$eggBomb)

##  [1] 6.138 6.155 6.166 6.170 6.188 6.211 6.264 6.269 6.269 6.270 6.286
## [12] 6.347 6.470 6.529 6.530 6.558 6.587 6.608 6.720 6.757 6.800 6.870
## [23] 6.890 6.992 7.003 7.017 7.065 7.081 7.082 7.107 7.165 7.204 7.226
## [34] 7.228 7.438 7.462 7.574 7.866 8.173

quantile(prDat$eggBomb, probs = 0.1)  # Answer = 6.1844

##   10% 
## 6.184

# quantile does not return a number. Can you code this instead of quoting
# it?
prDat[prDat$eggBomb < 6.1844, c("eggBomb", "sample")]

##           eggBomb sample
## Sample_25   6.170     25
## Sample_14   6.138     14
## Sample_3    6.166      3
## Sample_35   6.155     35

# Yes you can! Just add the two lines together and voila! Jenny's code:
rownames(prDat[prDat$eggBomb < quantile(prDat$eggBomb, 0.1), ])

## [1] "Sample_25" "Sample_14" "Sample_3"  "Sample_35"