Assignment 2 - Working with microarray data

Use load() to load the contents of the R data file derisi.rda. You should now see a new variable derisi in your environment. What class() is it?

    load("C:/Users/cyk/Desktop/Columbia/Spring/genomics of gene regulation/Assignment 2/derisi.rda")
    class(derisi)

## [1] "data.frame"

Use head to look at the first few rows of derisi.

    head(derisi)

##       ORF  Name    R1    R2    R3    R4   R5   R6    R7 R1.Bkg R2.Bkg
## 1 YHR007C ERG11  7896  7484 14679 14617 9853 7599  6490   1155   1984
## 2 YBR218C  PYC2 12144 11177 10241  4820 4950 7047 17035   1074   1694
## 3 YAL051W FUN43  4478  6435  6230  6848 5111 7180  4497   1140   1950
## 4 YAL053W        6343  8243  6743  3304 3556 4694  3849   1020   1897
## 5 YAL054C  ACS1  1542  3044  2076  1695 1753 4806 10802   1082   1940
## 6 YAL055W        1769  3243  2094  1367 1853 3580  1956    975   1821
##   R3.Bkg R4.Bkg R5.Bkg R6.Bkg R7.Bkg    G1    G2    G3    G4    G5    G6
## 1   1323   1171    914   2445    981  8432  7173 11736 16798 12315 16111
## 2   1243    876   1211   2444    742 11509 10226 13372  6500  6255  9024
## 3   1649   1183    898   2637    927  5865  5895  5345  6302  5400  7933
## 4   1196    881   1045   2518    697  6762  7454  6323  3595  4689  5660
## 5   1504   1108    902   2610    980  3138  3785  2419  2114  2763  3561
## 6   1185    851   1047   2536    698  2844  4069  2583  1651  2530  3484
##      G7 G1.Bkg G2.Bkg G3.Bkg G4.Bkg G5.Bkg G6.Bkg G7.Bkg
## 1 13931   2404   2561   1598   1506   1696   2667   1244
## 2  6904   2148   2527   1641   1196   1553   2569    848
## 3  5026   2422   2496   1902   1501   1644   2808   1154
## 4  4145   2107   2663   1607   1162   1577   2544    857
## 5  1897   2405   2528   1847   1445   1713   2767   1142
## 6  1550   1674   2648   1591   1114   1528   2668    870

What are the dimensions of derisi? Hint: try dim(), nrow(), and ncol().

    dim(derisi)

## [1] 6102   30

    nrow(derisi)

## [1] 6102

    ncol(derisi)

## [1] 30

What are the names of each column in derisi? Based on your reading of DeRisi et al. (1997) what does each column represent? Extract these names from the data frame using names().

    colnames(derisi)

##  [1] "ORF"    "Name"   "R1"     "R2"     "R3"     "R4"     "R5"    
##  [8] "R6"     "R7"     "R1.Bkg" "R2.Bkg" "R3.Bkg" "R4.Bkg" "R5.Bkg"
## [15] "R6.Bkg" "R7.Bkg" "G1"     "G2"     "G3"     "G4"     "G5"    
## [22] "G6"     "G7"     "G1.Bkg" "G2.Bkg" "G3.Bkg" "G4.Bkg" "G5.Bkg"
## [29] "G6.Bkg" "G7.Bkg"

The “ORF” represents the systematic name of each genes; the “Name” represents the name of each genes; the R1 to R7 represent the red fluorecence values of each gene under different growth periods, and R1.Bkg to R7.Bkg are the values of each background; G1 to G7 and G1.Bkg to G7.Bkg represent the green fluorescence values of each gene.

What is the index of the row of derisi that contains the expression data for the gene ACR1. Hint: use which() on a logical statement involving one of the colums of derisi and the == comparison operator? Try to construct this command step by step.

    derisi[which(derisi$Name == "ACR1"),]

##          ORF Name   R1   R2   R3   R4   R5   R6    R7 R1.Bkg R2.Bkg R3.Bkg
## 3309 YJR095W ACR1 1443 2041 2141 1233 1896 3003 11675    998   1400    916
##      R4.Bkg R5.Bkg R6.Bkg R7.Bkg   G1   G2   G3   G4   G5   G6   G7 G1.Bkg
## 3309    795   1286   1364    695 2664 3325 1928 1621 2355 2710 1710   1985
##      G2.Bkg G3.Bkg G4.Bkg G5.Bkg G6.Bkg G7.Bkg
## 3309   2438   1350   1142   1668   2076    886

Construct a command that returns the systematic ORF name for ACR1.

    derisi$ORF[which(derisi$Name == "ACR1")]

## [1] YJR095W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W

For each time point, subtract the background intensity for each ‘channel’ experimental (red or green) separately. Store the result in a data frame named intensities.

    r = derisi[,3:9]-derisi[,10:16]
    g = derisi[,17:23]-derisi[,24:30]
    intensities = data.frame(r,g)

Use intensities to compute the log-ratio (base two) of the experimental condtion over the control condition. Store these values in logratios.

    ratio = r/g
    colnames(ratio) = c("T1","T2","T3","T4","T5","T6","T7")
    logratio = log2(ratio)

What is the fold induction/repression of ACR1 at the last time point?

    ratio$T7[which(derisi$Name == "ACR1")]

## [1] 13.32524

Create a one dimensional array called hours that contains the time in hours at which each time point was collected (hint: use the concatenation function c()).

    Hours = c(1:7)*2+7

Plot the logratio for ACR1 as a function of time (in hours). Is the expression profile consistent with what you see in Figure 5B of DeRisi et al.?

    plot(Hours,logratio[which(derisi$Name == "ACR1"),], ylab="Log2 Ratio", type="b")

Yes. it is consistent with the Figure 5B of DeRisi et al.

Pick a downregulated gene from Figure 5E and make a similar plot.

    plot(Hours,logratio[which(derisi$Name == "GPP1"),], ylab="Log2 Ratio", type="b")

Which gene is most strongly upregulated and downregulated, respectively, at the end of the time course?

    derisi$ORF[which(logratio$T7 == max(logratio$T7))]

## [1] YKR097W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W

    derisi$ORF[which(logratio$T7 == min(logratio$T7))]

## [1] YGR160W
## 6102 Levels: YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C ... YPR204W

Compute the mean of the logratios (across all genes) at the first time point. What is the value? Does this make sense?

    mean(logratio$T1)

## [1] -0.08848662

This value is near the 0, which refers to that R/G is close to 1. This is reasonable since at the first time point, the twp gruops of sample are the same. So thier fluorescence values should be similar.

Use summary() and hist() to generate a summary and plot a histogram of the same data.

    summary(logratio$T1)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.61801 -0.22583 -0.06102 -0.08849  0.07630  0.65626

    hist(logratio$T1,main=paste("Histogram of Log2 Ratio at T1"),xlab="Log2 Ratio")

Generate an MA plot for time point 1. Make sure to label the X and Y axes as “A” and “M”, respectively. What does this plot tell you about the data?

    A = (log2(r$R1) + log2(g$G1))/2
    M = log2(r$R1) - log2(g$G1)
    plot(A,M)

Since the value of test and reference should be close at the first time 1, so the MA plot should disperse around the 0 for M. The result is consistent woth the expectation. Also, viewing the density of plots at M=0, it is illustrated that the noises of samples are random under a standard distribution.

Plot an ECDF of the logratios at time point 1. Describe the distribution.

    plot.ecdf(logratio$T1)

According to the ECDF diagram, the distribution is similar to a standard distribution.

Add an ECDF for the latest time point to the same plot. How do they differ?

    plot.ecdf(logratio$T1)
    plot.ecdf(logratio$T7, add=TRUE)

The slope of ECDF diagram at the latest time point is milder, which reflects the distribution of that time is more disperse and the variation betwwen control and exprimental gruops is greater. And this might no loger be a standard distribution.

Assignment 2 - Working with microarray data

Yikun Chen

Mon Jan 29 23:37:27 2018