load() to load the contents of the R data file derisi.rda. You should now see a new variable derisi in your environment. What class() is it? load("C:/Users/cyk/Desktop/Columbia/Spring/genomics of gene regulation/Assignment 2/derisi.rda")
class(derisi)
## [1] "data.frame"
derisi. head(derisi)
## ORF Name R1 R2 R3 R4 R5 R6 R7 R1.Bkg R2.Bkg
## 1 YHR007C ERG11 7896 7484 14679 14617 9853 7599 6490 1155 1984
## 2 YBR218C PYC2 12144 11177 10241 4820 4950 7047 17035 1074 1694
## 3 YAL051W FUN43 4478 6435 6230 6848 5111 7180 4497 1140 1950
## 4 YAL053W 6343 8243 6743 3304 3556 4694 3849 1020 1897
## 5 YAL054C ACS1 1542 3044 2076 1695 1753 4806 10802 1082 1940
## 6 YAL055W 1769 3243 2094 1367 1853 3580 1956 975 1821
## R3.Bkg R4.Bkg R5.Bkg R6.Bkg R7.Bkg G1 G2 G3 G4 G5 G6
## 1 1323 1171 914 2445 981 8432 7173 11736 16798 12315 16111
## 2 1243 876 1211 2444 742 11509 10226 13372 6500 6255 9024
## 3 1649 1183 898 2637 927 5865 5895 5345 6302 5400 7933
## 4 1196 881 1045 2518 697 6762 7454 6323 3595 4689 5660
## 5 1504 1108 902 2610 980 3138 3785 2419 2114 2763 3561
## 6 1185 851 1047 2536 698 2844 4069 2583 1651 2530 3484
## G7 G1.Bkg G2.Bkg G3.Bkg G4.Bkg G5.Bkg G6.Bkg G7.Bkg
## 1 13931 2404 2561 1598 1506 1696 2667 1244
## 2 6904 2148 2527 1641 1196 1553 2569 848
## 3 5026 2422 2496 1902 1501 1644 2808 1154
## 4 4145 2107 2663 1607 1162 1577 2544 857
## 5 1897 2405 2528 1847 1445 1713 2767 1142
## 6 1550 1674 2648 1591 1114 1528 2668 870
derisi? Hint: try dim(), nrow(), and ncol(). dim(derisi)
## [1] 6102 30
nrow(derisi)
## [1] 6102
ncol(derisi)
## [1] 30
derisi? Based on your reading of DeRisi et al. (1997) what does each column represent? Extract these names from the data frame using names(). colnames(derisi)
## [1] "ORF" "Name" "R1" "R2" "R3" "R4" "R5"
## [8] "R6" "R7" "R1.Bkg" "R2.Bkg" "R3.Bkg" "R4.Bkg" "R5.Bkg"
## [15] "R6.Bkg" "R7.Bkg" "G1" "G2" "G3" "G4" "G5"
## [22] "G6" "G7" "G1.Bkg" "G2.Bkg" "G3.Bkg" "G4.Bkg" "G5.Bkg"
## [29] "G6.Bkg" "G7.Bkg"
The “ORF” represents the systematic name of each genes; the “Name” represents the name of each genes; the R1 to R7 represent the red fluorecence values of each gene under different growth periods, and R1.Bkg to R7.Bkg are the values of each background; G1 to G7 and G1.Bkg to G7.Bkg represent the green fluorescence values of each gene.
derisi that contains the expression data for the gene ACR1. Hint: use which() on a logical statement involving one of the colums of derisi and the == comparison operator? Try to construct this command step by step. derisi[which(derisi$Name == "ACR1"),]
## ORF Name R1 R2 R3 R4 R5 R6 R7 R1.Bkg R2.Bkg R3.Bkg
## 3309 YJR095W ACR1 1443 2041 2141 1233 1896 3003 11675 998 1400 916
## R4.Bkg R5.Bkg R6.Bkg R7.Bkg G1 G2 G3 G4 G5 G6 G7 G1.Bkg
## 3309 795 1286 1364 695 2664 3325 1928 1621 2355 2710 1710 1985
## G2.Bkg G3.Bkg G4.Bkg G5.Bkg G6.Bkg G7.Bkg
## 3309 2438 1350 1142 1668 2076 886
derisi$ORF[which(derisi$Name == "ACR1")]
## [1] YJR095W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W
intensities. r = derisi[,3:9]-derisi[,10:16]
g = derisi[,17:23]-derisi[,24:30]
intensities = data.frame(r,g)
intensities to compute the log-ratio (base two) of the experimental condtion over the control condition. Store these values in logratios. ratio = r/g
colnames(ratio) = c("T1","T2","T3","T4","T5","T6","T7")
logratio = log2(ratio)
ratio$T7[which(derisi$Name == "ACR1")]
## [1] 13.32524
hours that contains the time in hours at which each time point was collected (hint: use the concatenation function c()). Hours = c(1:7)*2+7
plot(Hours,logratio[which(derisi$Name == "ACR1"),], ylab="Log2 Ratio", type="b")
Yes. it is consistent with the Figure 5B of DeRisi et al.
plot(Hours,logratio[which(derisi$Name == "GPP1"),], ylab="Log2 Ratio", type="b")
derisi$ORF[which(logratio$T7 == max(logratio$T7))]
## [1] YKR097W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W
derisi$ORF[which(logratio$T7 == min(logratio$T7))]
## [1] YGR160W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W
mean(logratio$T1)
## [1] -0.08848662
This value is near the 0, which refers to that R/G is close to 1. This is reasonable since at the first time point, the twp gruops of sample are the same. So thier fluorescence values should be similar.
summary() and hist() to generate a summary and plot a histogram of the same data. summary(logratio$T1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.61801 -0.22583 -0.06102 -0.08849 0.07630 0.65626
hist(logratio$T1,main=paste("Histogram of Log2 Ratio at T1"),xlab="Log2 Ratio")
A = (log2(r$R1) + log2(g$G1))/2
M = log2(r$R1) - log2(g$G1)
plot(A,M)
Since the value of test and reference should be close at the first time 1, so the MA plot should disperse around the 0 for M. The result is consistent woth the expectation. Also, viewing the density of plots at M=0, it is illustrated that the noises of samples are random under a standard distribution.
plot.ecdf(logratio$T1)
According to the ECDF diagram, the distribution is similar to a standard distribution.
plot.ecdf(logratio$T1)
plot.ecdf(logratio$T7, add=TRUE)
The slope of ECDF diagram at the latest time point is milder, which reflects the distribution of that time is more disperse and the variation betwwen control and exprimental gruops is greater. And this might no loger be a standard distribution.