IMPORTANT: Name your file using your name: “LAST_FIRST_hw4”

Whenever you write your own code/function, it has to be annotated, describing what your code/ function does and what input arguments it takes.

For the homework to be considered complete, submit it as both an .rmd file, as well as an .html or .pfd version using the ‘Knit’ option.

Assignment:

In this homework, we are taking a look at the gene expression data from the DeRisi et al. 1997 Science paper. I am providing you with a dataset that contains >6000 yeast genes (rows), their gene ids (column 1), and 7 time points containing log ratios of fluorescence intensities as measured using a micro-array. These log ratios reflect the relative abundance of mRNA for each gene at the given time point with respect to its abundance at time point 0 (control sample). Please refer to the lecture slides for additional detail. Note: at time point 0, the log ratio was created by taking two samples and labeling the cDNA (proxy for RNA) with either Cy3 or Cy5 fluorophores and taking the ratio of that.

  1. What does the t_0h time point measure? (1p)

The t_0h time point is the baseline measurement of the relative abundance of mRNA at the start of the experiment (amount of mRNA at 0 hours) to compare changes in gene expression with at later time points.

  1. Load DERISI_logratio.RData into your environment using the load() function. Assign it to the variable (data_list). Hint: load() works just like read.csv(); you have to specify the path to the DERISI_logratio.RData file. Take a look at data_list, what do you see? Use head() on the variable stored in data_list. What does this tell you about how load() works differently than read.csv()? (1p)

load() is different in that it shows all the origincal objects in the .Rdata file under the same names. On the other hand, read.csv only returns a single data frame that is assigned a variable manually.

load("/Users/karissayan/Downloads/DERISI_logratios.RData")
head(derisi)
##   gene_name  t_0h t_9.5h t_11.5h t_13.5h t_15.5h t_18.5h t_20.5h my_fav_genes
## 1   YPL026C -0.01   0.29   -0.42   -0.29   -0.07    0.04    0.06        FALSE
## 2   YNL168C  0.00   0.19    0.12   -0.36    0.06    0.11   -0.15        FALSE
## 3   YDR007W  0.06   0.03    0.07   -0.34   -0.03    0.08   -0.07        FALSE
## 4   YHR165C  0.15   0.00    0.11   -0.23    0.03    0.01   -0.03        FALSE
## 5   YJL145W  0.30   0.07    0.19   -0.23   -0.09    0.04   -0.10        FALSE
## 6   YGR267C  0.20   0.07    0.11   -0.32   -0.07    0.26   -0.23        FALSE
  1. Transform each time point into a Z score using the mean and standard deviation time point 0 (the control). Make sure to remove ’NA’s when computing the mean and sd at t0. Do so by first creating a function named “Z_transform” that takes 3 arguments (X,m,s): the 1st argument X is the vector of values to transform, argument 2&3 are the mean (m) and st-dev (s) you want to use to normalize the data. Z_transform should output a vector of the Z-values of X. Run Z_transform on all time point columns separately, including t0! This essentially tells you how each relative expression value compares to the logratio distribution given by the control sample (t0). Store the Z-scores in a new data frame called Z_df. Hint: you can use apply on margin 2 to run Z_transform on each time point column of your data. Check if Z_df is a data.frame using class(). If not turn it into one using as.data.frame(). (4p)
Z_transform=function(X,m,s){
  (X-m)/s
}
Z_transform
## function(X,m,s){
##   (X-m)/s
## }
t0=derisi$t_0h
m=mean(t0,na.rm=TRUE)
s=sd(t0,na.rm=TRUE)
Z_df=apply(derisi[, 2:8], 2, function(col) Z_transform(col, m, s))
class(Z_df)
## [1] "matrix" "array"
Z_df1=as.data.frame(Z_df)
  1. What is the normalized (Z-transformed) logratio of the ACR1 gene (id = ‘YJR095W’) at the last time point? (1p)
ACR1=derisi[derisi$gene_name=="YJR095W",]
ACR1_last=ACR1$t_18.5h
ACR1_z=(ACR1_last-m)/s
ACR1_z
## [1] 5.783406
  1. Create a one dimensional array called hours that contains the numeric value of hours for each time point. Now plot the logratio for ACR1 as a function of time (in hours) using plot() with type=‘l’. Is the expression profile consistent with what you see in Figure 5B of DeRisi et al.? (2p)

Yes, the expression profile is consistent with ACR1’s expression profile in Fig. 5B (DeRisi et al.).

hours=c(0,1,2,3,4,5,6)
ACR1_logratio=as.numeric(ACR1[,2:8])
plot(hours,ACR1_logratio,type="l",xlab="Time (hours)",ylab="logratio",main="ACR1 Expression Over Time")

  1. compute the mean and sd for the Z-scores at t0 and time point 7 (20.5h). What do you notice? What does it mean if the mean of Z_scores is <0? (1p)

The mean of the z-score at time point 7 is negative. Since the mean of z-scores at 20.5 hours is less than 0, this means that the gene expression at that time point is lower than the baseline level of expression at 0 hours.

m_t0_z=mean(Z_df1$t_0h,na.rm=TRUE)
s_t0_z=sd(Z_df1$t_0h,na.rm=TRUE)
m_t0_z
## [1] 3.429617e-15
s_t0_z
## [1] 1
m_t7_z=mean(Z_df1$t_20.5h,na.rm=TRUE)
s_t7_z=sd(Z_df1$t_20.5h,na.rm=TRUE)
m_t7_z
## [1] -0.4650906
s_t7_z
## [1] 3.548807
  1. Turn the Z scores at t0 and t20.5h into p-values using pnorm() and setting the lower.tail=F (name them t0_pvals and t205_vals). Plot the distribution of p-values for each timepoint respectively using either hist() or plot.ecdf(). When calling plot.ecdf on the second set of t205_vals, use plot.ecdf() with the argument add=T to show both cdfs in one plot. Give them separate colors. (Note: ecdf plots the empirical cumulative probability density function, as based on the sampling data). How do the distributions of p-values differ between t0 and t20.5? What is the biological interpretation of this? (4p)

You can see that the p-values for t0 are pretty evenly spread between 0 and 1 (with higher frequencies toward the middle), whereas the p-values for t20.5 are much more skewed toward the edges and lower in frequency in the middle. Biologically, this bimodal distribution means that at t20.5, genes are significantly up- or down-regulated compared to the baseline t0 (which is evenly spread because it is the baseline distribution). The yeast cells are likely going through a diauxic shift, which would entail upregulating genes related to respiration and downregulating genes related to fermentation.

t0_pvals=pnorm(Z_df1$t_0h,lower.tail=FALSE)
t205_pvals=pnorm(Z_df1$t_20.5h,lower.tail = FALSE)
hist(t0_pvals,col="blue",main="Histogram of p-values",xlab="p-value",breaks=50)
hist(t205_pvals,col="red",add=TRUE,breaks=50)

  1. How many genes have a p-value smaller than 10−3 at t20.5? How many genes would you expect to have a p-value smaller than 10−3 under the null hypothesis that the logratios were independently drawn from the null distribution of t0? (remember to remove NA values using na.rm=T if using sum()). (2p)

688 genes have a p-value smaller than 10-3 at t20.5. Under the null hypothesis, the p-values would be more uniformly distributed, so I would expect that the number of genes with a p-value smaller than 10-3 would be the number of genes at t20.5 multiplied by 10-3.

sum(t205_pvals<1e-3,na.rm=TRUE)
## [1] 688
  1. What fraction of the genes with a p-values smaller than 10−3 do you expect to be false discoveries (i.e., what is the expected false discovery rate)? What about an alternative p-value threshold of 10−5? Explain what this FDR means in your own words. (3p)

The FDR at a p-value threshold of 10-3 is 0.87%. The FDR at an alternative p-value threshold of 10-5 is 0.014%. An alternative p-value threshold of 10-5 is lower, which leads to a lower FDR, meaning that we can be more confident that the genes are truly changing their expression and not just false positives due to random chance from a large sample size. This indicates that the yeast cells are genuinely undergoing a metabolic change, shifting their gene expression to support respiration.

obs_1e3=sum(t205_pvals<1e-3,na.rm=TRUE)
exp_1e3=sum(!is.na(t205_pvals))*1e-3
fdr_1e3=exp_1e3/obs_1e3
fdr_1e3
## [1] 0.008680233
obs_1e5=sum(t205_pvals<1e-5,na.rm=TRUE)
exp_1e5=sum(!is.na(t205_pvals))*1e-5
fdr_1e5=exp_1e5/obs_1e5
fdr_1e5
## [1] 0.0001382407
  1. Use the 10-5 threshold to identify significantly up-regulated genes at t20.5. Store these significant normalized logratios under “up_set”. Remove NA values by using the is.na() function. How many upregulated genes are there? (2p)

There are 432 upregulated genes.

up_set=Z_df1$t_20.5h[t205_pvals<1e-5 & !is.na(t205_pvals)]
length(up_set)
## [1] 432
  1. You are a long-time yeast expert and want to know if your favorite yeast genes are involved in the diauxic shift. Whether you gene is in the favorite list is stored in the ‘my_fav_genes’ column in DERISI_logratio.RData. Compute if the relative expression of your favorite gene set is significantly different from a) the null distribution (all genes at t0), and b) a sample of the exact same number of genes, but sampled at random from t20.5. Which test do you use for a) and which one for b) ? What can you conclude about your favorite genes from this result? (4p)

I could use a t-test for both A and B. The low p-value in A shows that the expression of my favorite genes are significantly different from baseline, which means they are involved in the diauxic shift. The significant p-value in B shows that the expression is also significantly different from a random sample at t20.5h, which means that these genes are not just randomly involved in the diauxic shift but functionally related to the metabolic changes occuring during the shift.

fave=Z_df1[derisi$my_fav_genes,"t_20.5h"]
t.test(fave,Z_df1$t_0h,alternative="two.sided",na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  fave and Z_df1$t_0h
## t = 12.873, df = 28.03, p-value = 2.746e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6.040842 8.327012
## sample estimates:
##    mean of x    mean of y 
## 7.183927e+00 3.429617e-15
set.seed(1)
random=sample(Z_df1$t_20.5h,length(fave),replace=FALSE)
t.test(fave,random,alternative="two.sided",na.rm=TRUE)
## 
##  Welch Two Sample t-test
## 
## data:  fave and random
## t = 8.8995, df = 55.557, p-value = 2.807e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.701380 9.014419
## sample estimates:
##  mean of x  mean of y 
##  7.1839269 -0.1739727

Bonus:

  1. Describe how you can verify whether the usage of the test in Q11 b) is justified. What other test is there that would be more suited for data that is not normally distributed? Use this test to compute whether the two samples in 11b) are different. (Ignore the warning) (2p)

I can check if the t test is justified by constructing a histogram to check for normal distribution,and I could conduct a variance test to make sure the two groups are of equal variances. Another test more suited for data that is not normally distributed is the Wilcox test. The p-value is very low, which means that my favorite genes are actively involved in the diauxic shift.

set.seed(1)
random=sample(Z_df1$t_20.5h,length(fave),replace=FALSE)
wilcox.test(fave,random,alternative="two.sided",na.rm=TRUE)
## Warning in wilcox.test.default(fave, random, alternative = "two.sided", :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  fave and random
## W = 832, p-value = 1.606e-10
## alternative hypothesis: true location shift is not equal to 0