BIOF 540 Seminar 1

The purpose of this document is to practice using R with the BIOF/STAT 540 seminar 1 tutorial.

Start by loading the data:

prDat <- read.table("http://www.ugrad.stat.ubc.ca/~stat540/data/photoRec/GSE4051_MINI.txt", 
    header = TRUE, row.names = 1)
str(prDat)
## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

# loading without arguments
prDatTest <- read.table("http://www.ugrad.stat.ubc.ca/~stat540/data/photoRec/GSE4051_MINI.txt")
str(prDatTest)
## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

As far as I can tell, in this case there does not seem to be any difference if no arguments are passed. Though I suspect this may not always be the case.

Exploration of the data:

How many rows are there?

nrow(prDat)
## [1] 39

How many variables?

ncol(prDat)
## [1] 6
length(prDat)
## [1] 6

overall dimensions

dim(prDat)
## [1] 39  6

First few observations or the last few or a random sample

head(prDat)
##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_20     20      E16    wt     10.220   7.462      7.370
## Sample_21     21      E16    wt     10.020   6.890      7.177
## Sample_22     22      E16    wt      9.642   6.720      7.350
## Sample_23     23      E16    wt      9.652   6.529      7.040
## Sample_16     16      E16 NrlKO      8.583   6.470      7.494
## Sample_17     17      E16 NrlKO     10.140   7.065      7.005
tail(prDat)
##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_38     38  4_weeks    wt      9.767   6.608      7.329
## Sample_39     39  4_weeks    wt     10.200   7.003      7.320
## Sample_11     11  4_weeks NrlKO      9.677   7.204      6.981
## Sample_12     12  4_weeks NrlKO      9.129   7.165      7.350
## Sample_2       2  4_weeks NrlKO      9.744   7.107      7.075
## Sample_9       9  4_weeks NrlKO      9.822   6.558      7.043
prDat[sample(nrow(prDat), 6), ]
##           sample devStage gType crabHammer eggBomb poisonFang
## Sample_12     12  4_weeks NrlKO      9.129   7.165      7.350
## Sample_9       9  4_weeks NrlKO      9.822   6.558      7.043
## Sample_13     13      P10 NrlKO      9.838   7.228      7.459
## Sample_31     31       P6    wt      8.693   6.211      7.409
## Sample_33     33      P10    wt      9.004   7.082      8.086
## Sample_21     21      E16    wt     10.020   6.890      7.177

What does row correspond to – different genes or different mice? As seen above the genes are the 3 columns (pokemon attacks), and there seems to be information about mice age and treatment (wt or NrlKO). Thus it would seem each row corresponds to different mice.

What are the variable names?

names(prDat)
## [1] "sample"     "devStage"   "gType"      "crabHammer" "eggBomb"   
## [6] "poisonFang"

What “flavor” is each variable, i.e. numeric, character, factor?

str(prDat)
## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...

Sanity check that each integer between 1 and the number of rows in the dataset occurs exactly once.

sort(prDat$sample)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
seq_len(nrow(prDat))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
sort(prDat$sample) == seq_len(nrow(prDat))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all.equal(sort(prDat$sample), seq_len(nrow(prDat)))
## [1] TRUE

For each factor variable, what are the levels?

str(prDat)
## 'data.frame':    39 obs. of  6 variables:
##  $ sample    : int  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage  : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ gType     : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
##  $ crabHammer: num  10.22 10.02 9.64 9.65 8.58 ...
##  $ eggBomb   : num  7.46 6.89 6.72 6.53 6.47 ...
##  $ poisonFang: num  7.37 7.18 7.35 7.04 7.49 ...
levels(prDat$devStage)
## [1] "4_weeks" "E16"     "P10"     "P2"      "P6"
levels(prDat$gType)
## [1] "NrlKO" "wt"

How many observations do we have for each level of devStage? For gType?

summary(prDat$devStage)
## 4_weeks     E16     P10      P2      P6 
##       8       7       8       8       8
summary(prDat$gType)
## NrlKO    wt 
##    19    20

Perform a cross-tabulation of devStage and gType

table(prDat$devStage, prDat$gType)
##          
##           NrlKO wt
##   4_weeks     4  4
##   E16         3  4
##   P10         4  4
##   P2          4  4
##   P6          4  4

For each quantitative variable, what are the extremes? How about average or median? Extremes:

min(prDat$crabHammer)
## [1] 8.214
max(prDat$crabHammer)
## [1] 10.34
min(prDat$eggBomb)
## [1] 6.138
max(prDat$eggBomb)
## [1] 8.173
min(prDat$poisonFang)
## [1] 6.735
max(prDat$poisonFang)
## [1] 8.584

average or median:

mean(prDat$crabHammer)
## [1] 9.428
median(prDat$crabHammer)
## [1] 9.611
mean(prDat$eggBomb)
## [1] 6.788
median(prDat$eggBomb)
## [1] 6.757
mean(prDat$poisonFang)
## [1] 7.379
median(prDat$poisonFang)
## [1] 7.35

Spread/Summary information:

range(prDat$crabHammer)
## [1]  8.214 10.340
summary(prDat$crabHammer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.21    8.94    9.61    9.43    9.83   10.30
fivenum(prDat$crabHammer)
## [1]  8.214  8.938  9.611  9.830 10.340
quantile(prDat$crabHammer)
##     0%    25%    50%    75%   100% 
##  8.214  8.938  9.611  9.830 10.340

range(prDat$eggBomb)
## [1] 6.138 8.173
summary(prDat$eggBomb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.14    6.28    6.76    6.79    7.09    8.17
fivenum(prDat$eggBomb)
## [1] 6.138 6.278 6.757 7.095 8.173
quantile(prDat$eggBomb)
##    0%   25%   50%   75%  100% 
## 6.138 6.278 6.757 7.095 8.173

range(prDat$poisonFang)
## [1] 6.735 8.584
summary(prDat$poisonFang)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.74    7.19    7.35    7.38    7.48    8.58
fivenum(prDat$poisonFang)
## [1] 6.735 7.188 7.350 7.476 8.584
quantile(prDat$poisonFang)
##    0%   25%   50%   75%  100% 
## 6.735 7.188 7.350 7.476 8.584

If you had to take a wild guess, what do you think the intended experimental design was? What actually happened in real life?

Based on the type of columns I would guess that the experiment was to see if knocking out a gene (Nrl) had any effect on the expression of genes (in real life these would be genes, not pokemon attacks).

Indexing and subsetting prDat

Create a new data.frame called weeDat only containing observations for which expression of poisonFang is above 7.5.

weeDat <- subset(prDat, poisonFang > 7.5)

For how many observations poisonFang > 7.5? How do they break down by genotype and developmental stage?

nrow(weeDat)
## [1] 9
table(weeDat$gType)
## 
## NrlKO    wt 
##     4     5
table(weeDat$devStage)
## 
## 4_weeks     E16     P10      P2      P6 
##       0       0       4       4       1

Print the observations with row names “Sample_16” and “Sample_38” to screen, showing only the 3 gene expression variables.

prDat[c("Sample_16", "Sample_38"), c("crabHammer", "eggBomb", "poisonFang")]
##           crabHammer eggBomb poisonFang
## Sample_16      8.583   6.470      7.494
## Sample_38      9.767   6.608      7.329

Which samples have expression of eggBomb less than the 0.10 quantile?

quantileVal <- quantile(prDat$eggBomb, 0.1)
prDat[prDat$eggBomb < quantileVal, 1]
## [1] 25 14  3 35