Seminar 1: Explore a small gene expression dataset

By Erica Acton

Load photoRec data.

prDat <- read.table("GSE4051_MINI.txt", header = TRUE, row.names = 1)
str(prDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

Basic exploration of prDat

How many rows?

nrow(prDat)

## [1] 39

How many columns?

ncol(prDat)

## [1] 6

Inspect the first few observations or the last few or for a random sample.

head(prDat)

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_20     20      E16    wt     10.220   7.462      7.370
## Sample_21     21      E16    wt     10.020   6.890      7.177
## Sample_22     22      E16    wt      9.642   6.720      7.350
## Sample_23     23      E16    wt      9.652   6.529      7.040
## Sample_16     16      E16 NrlKO      8.583   6.470      7.494
## Sample_17     17      E16 NrlKO     10.140   7.065      7.005

tail(prDat)

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_38     38  4_weeks    wt      9.767   6.608      7.329
## Sample_39     39  4_weeks    wt     10.200   7.003      7.320
## Sample_11     11  4_weeks NrlKO      9.677   7.204      6.981
## Sample_12     12  4_weeks NrlKO      9.129   7.165      7.350
## Sample_2       2  4_weeks NrlKO      9.744   7.107      7.075
## Sample_9       9  4_weeks NrlKO      9.822   6.558      7.043

prDat[9, ]

##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_25     25       P2    wt      9.038    6.17      7.449

What does row correspond to - different genes or different mice?

rownames(prDat)

##  [1] "Sample_20" "Sample_21" "Sample_22" "Sample_23" "Sample_16"
##  [6] "Sample_17" "Sample_6"  "Sample_24" "Sample_25" "Sample_26"
## [11] "Sample_27" "Sample_14" "Sample_3"  "Sample_5"  "Sample_8" 
## [16] "Sample_28" "Sample_29" "Sample_30" "Sample_31" "Sample_1" 
## [21] "Sample_10" "Sample_4"  "Sample_7"  "Sample_32" "Sample_33"
## [26] "Sample_34" "Sample_35" "Sample_13" "Sample_15" "Sample_18"
## [31] "Sample_19" "Sample_36" "Sample_37" "Sample_38" "Sample_39"
## [36] "Sample_11" "Sample_12" "Sample_2"  "Sample_9"

# each row corresponds to different mice

What are the variable names?

names(prDat)

## [1] "sample"     "devStage"   "gType"      "crabHammer" "eggBomb"   
## [6] "poisonFang"

What flavor is each variable?

str(prDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

# sample = integer, devStage & gType = factors, genes = numbers)

Do a sanity check that each integer between 1 and the number of rows in the dataset occurs exactly once.

seq(prDat)

## [1] 1 2 3 4 5 6

sort(prDat$sample)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

seq_len(nrow(prDat))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39


all(sort(prDat$sample) == seq_len(nrow(prDat)))

## [1] TRUE

identical(sort(prDat$sample), seq_len(nrow(prDat)))

## [1] TRUE

For each factor variable, what are the levels?

# for devStage
levels(prDat$devStage)

## [1] "4_weeks" "E16"     "P10"     "P2"      "P6"

# for gType
levels(prDat$gType)

## [1] "NrlKO" "wt"

How many observations do we have for each level of devStage? gType?

# for devStage
summary(prDat$devStage)

## 4_weeks     E16     P10      P2      P6 
##       8       7       8       8       8

# for gType
summary(prDat$gType)

## NrlKO    wt 
##    19    20

Perform cross-tabulation of devStage and gType.

table(prDat$devStage, prDat$gType)

##          
##           NrlKO wt
##   4_weeks     4  4
##   E16         3  4
##   P10         4  4
##   P2          4  4
##   P6          4  4

addmargins(with(prDat, table(devStage, gType)))

##          gType
## devStage  NrlKO wt Sum
##   4_weeks     4  4   8
##   E16         3  4   7
##   P10         4  4   8
##   P2          4  4   8
##   P6          4  4   8
##   Sum        19 20  39

If you had to guess, what do you think the intended experimental design was? What happened in real life?

I would assumed the experimental design was to have 4 mice for each genotype and developmental stage, but that one of the NrlKO mice at embryonic stage died.

For each quantitative variable, what are the extremes? How about the average or median?

# samples
range(prDat$sample)

## [1]  1 39


# poisonfang
summary(prDat$poisonFang)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.74    7.19    7.35    7.38    7.48    8.58

# eggbomb
summary(prDat$eggBomb)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.14    6.28    6.76    6.79    7.09    8.17

# crabhammer
summary(prDat$crabHammer)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.21    8.94    9.61    9.43    9.83   10.30

Print the observations with row names “Sample_16” and “Sample_38” to screen, showing only the 3 gene expression variables.

prDat[c("Sample_16", "Sample_38"), c("crabHammer", "eggBomb", "poisonFang")]

##           crabHammer eggBomb poisonFang
## Sample_16      8.583   6.470      7.494
## Sample_38      9.767   6.608      7.329

Which samples have eggBomb less than the 0.10 quartile?

rownames(prDat[prDat$eggBomb < quantile(prDat$eggBomb, 0.1), ])

## [1] "Sample_25" "Sample_14" "Sample_3"  "Sample_35"