Gene expression data from photo receptor cells in mice
Rebecca Johnston 11/01/2014
Here I will be exploring prDat, a gene expression dataset from photo recptor cells in mice, from various developmental stages and two genotypes. Note we will only work with gene expression data for three genes, and their names have been replaced with 3 Pokemon attacks (crabHammer, eggBomb, and poisonFang).
Save GSE4051_MINI.txt to wd from STAT540 github
prDat <- read.table("GSE4051_MINI.txt", header = TRUE, row.names = 1)
str(prDat) # Data structure
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
summary(prDat) # Data summary
## sample devStage gType crabHammer eggBomb
## Min. : 1.0 4_weeks:8 NrlKO:19 Min. : 8.21 Min. :6.14
## 1st Qu.:10.5 E16 :7 wt :20 1st Qu.: 8.94 1st Qu.:6.28
## Median :20.0 P10 :8 Median : 9.61 Median :6.76
## Mean :20.0 P2 :8 Mean : 9.43 Mean :6.79
## 3rd Qu.:29.5 P6 :8 3rd Qu.: 9.83 3rd Qu.:7.09
## Max. :39.0 Max. :10.34 Max. :8.17
## poisonFang
## Min. :6.74
## 1st Qu.:7.19
## Median :7.35
## Mean :7.38
## 3rd Qu.:7.48
## Max. :8.58
nrow(prDat) # How many rows are there?
## [1] 39
ncol(prDat) # How many columns or variables are there?
## [1] 6
head(prDat) # Inspect the first few observations
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_20 20 E16 wt 10.220 7.462 7.370
## Sample_21 21 E16 wt 10.020 6.890 7.177
## Sample_22 22 E16 wt 9.642 6.720 7.350
## Sample_23 23 E16 wt 9.652 6.529 7.040
## Sample_16 16 E16 NrlKO 8.583 6.470 7.494
## Sample_17 17 E16 NrlKO 10.140 7.065 7.005
tail(prDat) # Or the last few
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_38 38 4_weeks wt 9.767 6.608 7.329
## Sample_39 39 4_weeks wt 10.200 7.003 7.320
## Sample_11 11 4_weeks NrlKO 9.677 7.204 6.981
## Sample_12 12 4_weeks NrlKO 9.129 7.165 7.350
## Sample_2 2 4_weeks NrlKO 9.744 7.107 7.075
## Sample_9 9 4_weeks NrlKO 9.822 6.558 7.043
prDat[5:16, ] # Or a random sample
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_16 16 E16 NrlKO 8.583 6.470 7.494
## Sample_17 17 E16 NrlKO 10.140 7.065 7.005
## Sample_6 6 E16 NrlKO 10.340 7.017 6.735
## Sample_24 24 P2 wt 8.869 6.587 7.508
## Sample_25 25 P2 wt 9.038 6.170 7.449
## Sample_26 26 P2 wt 9.611 6.870 7.511
## Sample_27 27 P2 wt 8.613 6.800 7.843
## Sample_14 14 P2 NrlKO 9.572 6.138 7.250
## Sample_3 3 P2 NrlKO 9.414 6.166 7.200
## Sample_5 5 P2 NrlKO 8.925 6.269 7.405
## Sample_8 8 P2 NrlKO 9.116 6.264 8.016
## Sample_28 28 P6 wt 8.214 6.530 7.428
prDat[sample(nrow(prDat), size = 6), ] # Random sample using Jenny's code! Cool!
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_34 34 P10 wt 8.519 6.757 8.584
## Sample_20 20 E16 wt 10.220 7.462 7.370
## Sample_8 8 P2 NrlKO 9.116 6.264 8.016
## Sample_13 13 P10 NrlKO 9.838 7.228 7.459
## Sample_4 4 P6 NrlKO 8.992 6.270 7.342
## Sample_16 16 E16 NrlKO 8.583 6.470 7.494
QUESTION: What does row correspond to – different genes or different mice?
ANSWER: Rows correspond to different mice (samples).
names(prDat) # What are the variable names?
## [1] "sample" "devStage" "gType" "crabHammer" "eggBomb"
## [6] "poisonFang"
str(prDat) # What 'flavor' is each variable, i.e. numeric, character, factor?
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
QUESTION: For “sample”, does each integer between 1 and nrows occur exactly once?
nrow(prDat) # length(prDat$sample)
## [1] 39
table(prDat$sample)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
sort(prDat$sample) # Sort sample numbers in ascending order
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
seq(1, nrow(prDat)) # Create sequence of numbers from 1 to nrow in dataset
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
sort(prDat$sample) == seq(1, nrow(prDat)) # Determine if 2 variables are equal
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all.equal(sort(prDat$sample), seq(1, nrow(prDat)))
## [1] TRUE
identical(sort(prDat$sample), seq(1, nrow(prDat)))
## [1] TRUE
ANSWER: Yes, each integer between 1 and nrows occur exactly once for “sample”.
levels(prDat$devStage) # For each factor variable, what are the levels?
## [1] "4_weeks" "E16" "P10" "P2" "P6"
levels(prDat$gType)
## [1] "NrlKO" "wt"
summary(prDat$devStage) # How many observations for each level of devStage?
## 4_weeks E16 P10 P2 P6
## 8 7 8 8 8
summary(prDat$gType) # For gType?
## NrlKO wt
## 19 20
table(prDat$devStage, prDat$gType) # Perform a cross-tabulation of devStage and gType
##
## NrlKO wt
## 4_weeks 4 4
## E16 3 4
## P10 4 4
## P2 4 4
## P6 4 4
QUESTION: What do you think the intended experimental design was? What actually happened in real life?
ANSWER: From the cross-tabulation above of genotype (gType) and developmental stage (devStage), I would assume that the intended experimental design was to compare two different genotypes in mice (wild type and knock out) across 5 different developmental stages. There were meant to be 4 mice for each of the 5 developmental stages, and 4 mice for each of the two genotypes, but it appears in “real life”, one of the knock out mice (NrlKO) at developmental stage E16 was not able to be obtained.
QUESTION: For each quantitative variable, what are the extremes? How about average or median?
ANSWER: Use summary for the variable names we want
summary(prDat[, c("crabHammer", "eggBomb", "poisonFang")])
## crabHammer eggBomb poisonFang
## Min. : 8.21 Min. :6.14 Min. :6.74
## 1st Qu.: 8.94 1st Qu.:6.28 1st Qu.:7.19
## Median : 9.61 Median :6.76 Median :7.35
## Mean : 9.43 Mean :6.79 Mean :7.38
## 3rd Qu.: 9.83 3rd Qu.:7.09 3rd Qu.:7.48
## Max. :10.34 Max. :8.17 Max. :8.58
range(prDat$crabHammer) # returns min and max
## [1] 8.214 10.340
fivenum(prDat$crabHammer) # returns Tukey's five number summary
## [1] 8.214 8.938 9.611 9.830 10.340
Create a new data.frame called weeDat only containing observations for which expression of poisonFang is above 7.5
prDat$poisonFang > 7.5
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE
weeDat <- prDat$poisonFang[which(prDat$poisonFang > 7.5)] # Use 'which' argument and indexing
str(weeDat)
## num [1:9] 7.51 7.51 7.84 8.02 7.75 ...
For how many observations poisonFang > 7.5? How do they break down by genotype and developmental stage?
length(weeDat)
## [1] 9
prDat[prDat$poisonFang %in% weeDat, ]
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_24 24 P2 wt 8.869 6.587 7.508
## Sample_26 26 P2 wt 9.611 6.870 7.511
## Sample_27 27 P2 wt 8.613 6.800 7.843
## Sample_8 8 P2 NrlKO 9.116 6.264 8.016
## Sample_7 7 P6 NrlKO 8.803 6.188 7.754
## Sample_33 33 P10 wt 9.004 7.082 8.086
## Sample_34 34 P10 wt 8.519 6.757 8.584
## Sample_15 15 P10 NrlKO 9.746 7.226 7.786
## Sample_19 19 P10 NrlKO 9.771 7.081 7.586
prDat[prDat$poisonFang %in% weeDat, c("poisonFang", "gType", "devStage")]
## poisonFang gType devStage
## Sample_24 7.508 wt P2
## Sample_26 7.511 wt P2
## Sample_27 7.843 wt P2
## Sample_8 8.016 NrlKO P2
## Sample_7 7.754 NrlKO P6
## Sample_33 8.086 wt P10
## Sample_34 8.584 wt P10
## Sample_15 7.786 NrlKO P10
## Sample_19 7.586 NrlKO P10
prDat[prDat$poisonFang > 7.5, ]
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_24 24 P2 wt 8.869 6.587 7.508
## Sample_26 26 P2 wt 9.611 6.870 7.511
## Sample_27 27 P2 wt 8.613 6.800 7.843
## Sample_8 8 P2 NrlKO 9.116 6.264 8.016
## Sample_7 7 P6 NrlKO 8.803 6.188 7.754
## Sample_33 33 P10 wt 9.004 7.082 8.086
## Sample_34 34 P10 wt 8.519 6.757 8.584
## Sample_15 15 P10 NrlKO 9.746 7.226 7.786
## Sample_19 19 P10 NrlKO 9.771 7.081 7.586
Print the observations with row names “Sample_16” and “Sample_38” to screen, showing only the 3 gene expression variables.
prDat[c("Sample_16", "Sample_38"), c("crabHammer", "eggBomb", "poisonFang")]
## crabHammer eggBomb poisonFang
## Sample_16 8.583 6.470 7.494
## Sample_38 9.767 6.608 7.329
Which samples have expression of eggBomb less than the 0.10 quantile?
sort(prDat$eggBomb)
## [1] 6.138 6.155 6.166 6.170 6.188 6.211 6.264 6.269 6.269 6.270 6.286
## [12] 6.347 6.470 6.529 6.530 6.558 6.587 6.608 6.720 6.757 6.800 6.870
## [23] 6.890 6.992 7.003 7.017 7.065 7.081 7.082 7.107 7.165 7.204 7.226
## [34] 7.228 7.438 7.462 7.574 7.866 8.173
quantile(prDat$eggBomb, probs = 0.1) # Answer = 6.1844
## 10%
## 6.184
# quantile does not return a number. Can you code this instead of quoting
# it?
prDat[prDat$eggBomb < 6.1844, c("eggBomb", "sample")]
## eggBomb sample
## Sample_25 6.170 25
## Sample_14 6.138 14
## Sample_3 6.166 3
## Sample_35 6.155 35
# Yes you can! Just add the two lines together and voila! Jenny's code:
rownames(prDat[prDat$eggBomb < quantile(prDat$eggBomb, 0.1), ])
## [1] "Sample_25" "Sample_14" "Sample_3" "Sample_35"