The purpose of this document is to practice using R with the BIOF/STAT 540 seminar 1 tutorial.
Start by loading the data:
prDat <- read.table("http://www.ugrad.stat.ubc.ca/~stat540/data/photoRec/GSE4051_MINI.txt",
header = TRUE, row.names = 1)
str(prDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
# loading without arguments
prDatTest <- read.table("http://www.ugrad.stat.ubc.ca/~stat540/data/photoRec/GSE4051_MINI.txt")
str(prDatTest)
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
As far as I can tell, in this case there does not seem to be any difference if no arguments are passed. Though I suspect this may not always be the case.
How many rows are there?
nrow(prDat)
## [1] 39
How many variables?
ncol(prDat)
## [1] 6
length(prDat)
## [1] 6
overall dimensions
dim(prDat)
## [1] 39 6
First few observations or the last few or a random sample
head(prDat)
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_20 20 E16 wt 10.220 7.462 7.370
## Sample_21 21 E16 wt 10.020 6.890 7.177
## Sample_22 22 E16 wt 9.642 6.720 7.350
## Sample_23 23 E16 wt 9.652 6.529 7.040
## Sample_16 16 E16 NrlKO 8.583 6.470 7.494
## Sample_17 17 E16 NrlKO 10.140 7.065 7.005
tail(prDat)
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_38 38 4_weeks wt 9.767 6.608 7.329
## Sample_39 39 4_weeks wt 10.200 7.003 7.320
## Sample_11 11 4_weeks NrlKO 9.677 7.204 6.981
## Sample_12 12 4_weeks NrlKO 9.129 7.165 7.350
## Sample_2 2 4_weeks NrlKO 9.744 7.107 7.075
## Sample_9 9 4_weeks NrlKO 9.822 6.558 7.043
prDat[sample(nrow(prDat), 6), ]
## sample devStage gType crabHammer eggBomb poisonFang
## Sample_12 12 4_weeks NrlKO 9.129 7.165 7.350
## Sample_9 9 4_weeks NrlKO 9.822 6.558 7.043
## Sample_13 13 P10 NrlKO 9.838 7.228 7.459
## Sample_31 31 P6 wt 8.693 6.211 7.409
## Sample_33 33 P10 wt 9.004 7.082 8.086
## Sample_21 21 E16 wt 10.020 6.890 7.177
What does row correspond to – different genes or different mice? As seen above the genes are the 3 columns (pokemon attacks), and there seems to be information about mice age and treatment (wt or NrlKO). Thus it would seem each row corresponds to different mice.
What are the variable names?
names(prDat)
## [1] "sample" "devStage" "gType" "crabHammer" "eggBomb"
## [6] "poisonFang"
What “flavor” is each variable, i.e. numeric, character, factor?
str(prDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
Sanity check that each integer between 1 and the number of rows in the dataset occurs exactly once.
sort(prDat$sample)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
seq_len(nrow(prDat))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
sort(prDat$sample) == seq_len(nrow(prDat))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all.equal(sort(prDat$sample), seq_len(nrow(prDat)))
## [1] TRUE
For each factor variable, what are the levels?
str(prDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sample : int 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage : Factor w/ 5 levels "4_weeks","E16",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ gType : Factor w/ 2 levels "NrlKO","wt": 2 2 2 2 1 1 1 2 2 2 ...
## $ crabHammer: num 10.22 10.02 9.64 9.65 8.58 ...
## $ eggBomb : num 7.46 6.89 6.72 6.53 6.47 ...
## $ poisonFang: num 7.37 7.18 7.35 7.04 7.49 ...
levels(prDat$devStage)
## [1] "4_weeks" "E16" "P10" "P2" "P6"
levels(prDat$gType)
## [1] "NrlKO" "wt"
How many observations do we have for each level of devStage? For gType?
summary(prDat$devStage)
## 4_weeks E16 P10 P2 P6
## 8 7 8 8 8
summary(prDat$gType)
## NrlKO wt
## 19 20
Perform a cross-tabulation of devStage and gType
table(prDat$devStage, prDat$gType)
##
## NrlKO wt
## 4_weeks 4 4
## E16 3 4
## P10 4 4
## P2 4 4
## P6 4 4
For each quantitative variable, what are the extremes? How about average or median? Extremes:
min(prDat$crabHammer)
## [1] 8.214
max(prDat$crabHammer)
## [1] 10.34
min(prDat$eggBomb)
## [1] 6.138
max(prDat$eggBomb)
## [1] 8.173
min(prDat$poisonFang)
## [1] 6.735
max(prDat$poisonFang)
## [1] 8.584
average or median:
mean(prDat$crabHammer)
## [1] 9.428
median(prDat$crabHammer)
## [1] 9.611
mean(prDat$eggBomb)
## [1] 6.788
median(prDat$eggBomb)
## [1] 6.757
mean(prDat$poisonFang)
## [1] 7.379
median(prDat$poisonFang)
## [1] 7.35
Spread/Summary information:
range(prDat$crabHammer)
## [1] 8.214 10.340
summary(prDat$crabHammer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.21 8.94 9.61 9.43 9.83 10.30
fivenum(prDat$crabHammer)
## [1] 8.214 8.938 9.611 9.830 10.340
quantile(prDat$crabHammer)
## 0% 25% 50% 75% 100%
## 8.214 8.938 9.611 9.830 10.340
range(prDat$eggBomb)
## [1] 6.138 8.173
summary(prDat$eggBomb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.14 6.28 6.76 6.79 7.09 8.17
fivenum(prDat$eggBomb)
## [1] 6.138 6.278 6.757 7.095 8.173
quantile(prDat$eggBomb)
## 0% 25% 50% 75% 100%
## 6.138 6.278 6.757 7.095 8.173
range(prDat$poisonFang)
## [1] 6.735 8.584
summary(prDat$poisonFang)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.74 7.19 7.35 7.38 7.48 8.58
fivenum(prDat$poisonFang)
## [1] 6.735 7.188 7.350 7.476 8.584
quantile(prDat$poisonFang)
## 0% 25% 50% 75% 100%
## 6.735 7.188 7.350 7.476 8.584
If you had to take a wild guess, what do you think the intended experimental design was? What actually happened in real life?
Based on the type of columns I would guess that the experiment was to see if knocking out a gene (Nrl) had any effect on the expression of genes (in real life these would be genes, not pokemon attacks).
Create a new data.frame called weeDat only containing observations for which expression of poisonFang is above 7.5.
weeDat <- subset(prDat, poisonFang > 7.5)
For how many observations poisonFang > 7.5? How do they break down by genotype and developmental stage?
nrow(weeDat)
## [1] 9
table(weeDat$gType)
##
## NrlKO wt
## 4 5
table(weeDat$devStage)
##
## 4_weeks E16 P10 P2 P6
## 0 0 4 4 1
Print the observations with row names “Sample_16” and “Sample_38” to screen, showing only the 3 gene expression variables.
prDat[c("Sample_16", "Sample_38"), c("crabHammer", "eggBomb", "poisonFang")]
## crabHammer eggBomb poisonFang
## Sample_16 8.583 6.470 7.494
## Sample_38 9.767 6.608 7.329
Which samples have expression of eggBomb less than the 0.10 quantile?
quantileVal <- quantile(prDat$eggBomb, 0.1)
prDat[prDat$eggBomb < quantileVal, 1]
## [1] 25 14 3 35