IMPORTANT: Name your file using your name: “LAST_FIRST_hw4”
Whenever you write your own code/function, it has to be annotated, describing what your code/ function does and what input arguments it takes.
For the homework to be considered complete, submit it as both an .rmd file, as well as an .html or .pfd version using the ‘Knit’ option.
Assignment:
In this homework, we are taking a look at the gene expression data from the DeRisi et al. 1997 Science paper. I am providing you with a dataset that contains >6000 yeast genes (rows), their gene ids (column 1), and 7 time points containing log ratios of fluorescence intensities as measured using a micro-array. These log ratios reflect the relative abundance of mRNA for each gene at the given time point with respect to its abundance at time point 0 (control sample). Please refer to the lecture slides for additional detail. Note: at time point 0, the log ratio was created by taking two samples and labeling the cDNA (proxy for RNA) with either Cy3 or Cy5 fluorophores and taking the ratio of that.
The t_0h time point is the baseline measurement of the relative abundance of mRNA at the start of the experiment (amount of mRNA at 0 hours) to compare changes in gene expression with at later time points.
load() is different in that it shows all the origincal objects in the .Rdata file under the same names. On the other hand, read.csv only returns a single data frame that is assigned a variable manually.
load("/Users/karissayan/Downloads/DERISI_logratios.RData")
head(derisi)
## gene_name t_0h t_9.5h t_11.5h t_13.5h t_15.5h t_18.5h t_20.5h my_fav_genes
## 1 YPL026C -0.01 0.29 -0.42 -0.29 -0.07 0.04 0.06 FALSE
## 2 YNL168C 0.00 0.19 0.12 -0.36 0.06 0.11 -0.15 FALSE
## 3 YDR007W 0.06 0.03 0.07 -0.34 -0.03 0.08 -0.07 FALSE
## 4 YHR165C 0.15 0.00 0.11 -0.23 0.03 0.01 -0.03 FALSE
## 5 YJL145W 0.30 0.07 0.19 -0.23 -0.09 0.04 -0.10 FALSE
## 6 YGR267C 0.20 0.07 0.11 -0.32 -0.07 0.26 -0.23 FALSE
Z_transform=function(X,m,s){
(X-m)/s
}
Z_transform
## function(X,m,s){
## (X-m)/s
## }
t0=derisi$t_0h
m=mean(t0,na.rm=TRUE)
s=sd(t0,na.rm=TRUE)
Z_df=apply(derisi[, 2:8], 2, function(col) Z_transform(col, m, s))
class(Z_df)
## [1] "matrix" "array"
Z_df1=as.data.frame(Z_df)
ACR1=derisi[derisi$gene_name=="YJR095W",]
ACR1_last=ACR1$t_18.5h
ACR1_z=(ACR1_last-m)/s
ACR1_z
## [1] 5.783406
Yes, the expression profile is consistent with ACR1’s expression profile in Fig. 5B (DeRisi et al.).
hours=c(0,1,2,3,4,5,6)
ACR1_logratio=as.numeric(ACR1[,2:8])
plot(hours,ACR1_logratio,type="l",xlab="Time (hours)",ylab="logratio",main="ACR1 Expression Over Time")
The mean of the z-score at time point 7 is negative. Since the mean of z-scores at 20.5 hours is less than 0, this means that the gene expression at that time point is lower than the baseline level of expression at 0 hours.
m_t0_z=mean(Z_df1$t_0h,na.rm=TRUE)
s_t0_z=sd(Z_df1$t_0h,na.rm=TRUE)
m_t0_z
## [1] 3.429617e-15
s_t0_z
## [1] 1
m_t7_z=mean(Z_df1$t_20.5h,na.rm=TRUE)
s_t7_z=sd(Z_df1$t_20.5h,na.rm=TRUE)
m_t7_z
## [1] -0.4650906
s_t7_z
## [1] 3.548807
You can see that the p-values for t0 are pretty evenly spread between 0 and 1 (with higher frequencies toward the middle), whereas the p-values for t20.5 are much more skewed toward the edges and lower in frequency in the middle. Biologically, this bimodal distribution means that at t20.5, genes are significantly up- or down-regulated compared to the baseline t0 (which is evenly spread because it is the baseline distribution). The yeast cells are likely going through a diauxic shift, which would entail upregulating genes related to respiration and downregulating genes related to fermentation.
t0_pvals=pnorm(Z_df1$t_0h,lower.tail=FALSE)
t205_pvals=pnorm(Z_df1$t_20.5h,lower.tail = FALSE)
hist(t0_pvals,col="blue",main="Histogram of p-values",xlab="p-value",breaks=50)
hist(t205_pvals,col="red",add=TRUE,breaks=50)
688 genes have a p-value smaller than 10-3 at t20.5. Under the null hypothesis, the p-values would be more uniformly distributed, so I would expect that the number of genes with a p-value smaller than 10-3 would be the number of genes at t20.5 multiplied by 10-3.
sum(t205_pvals<1e-3,na.rm=TRUE)
## [1] 688
The FDR at a p-value threshold of 10-3 is 0.87%. The FDR at an alternative p-value threshold of 10-5 is 0.014%. An alternative p-value threshold of 10-5 is lower, which leads to a lower FDR, meaning that we can be more confident that the genes are truly changing their expression and not just false positives due to random chance from a large sample size. This indicates that the yeast cells are genuinely undergoing a metabolic change, shifting their gene expression to support respiration.
obs_1e3=sum(t205_pvals<1e-3,na.rm=TRUE)
exp_1e3=sum(!is.na(t205_pvals))*1e-3
fdr_1e3=exp_1e3/obs_1e3
fdr_1e3
## [1] 0.008680233
obs_1e5=sum(t205_pvals<1e-5,na.rm=TRUE)
exp_1e5=sum(!is.na(t205_pvals))*1e-5
fdr_1e5=exp_1e5/obs_1e5
fdr_1e5
## [1] 0.0001382407
There are 432 upregulated genes.
up_set=Z_df1$t_20.5h[t205_pvals<1e-5 & !is.na(t205_pvals)]
length(up_set)
## [1] 432
I could use a t-test for both A and B. The low p-value in A shows that the expression of my favorite genes are significantly different from baseline, which means they are involved in the diauxic shift. The significant p-value in B shows that the expression is also significantly different from a random sample at t20.5h, which means that these genes are not just randomly involved in the diauxic shift but functionally related to the metabolic changes occuring during the shift.
fave=Z_df1[derisi$my_fav_genes,"t_20.5h"]
t.test(fave,Z_df1$t_0h,alternative="two.sided",na.rm=TRUE)
##
## Welch Two Sample t-test
##
## data: fave and Z_df1$t_0h
## t = 12.873, df = 28.03, p-value = 2.746e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.040842 8.327012
## sample estimates:
## mean of x mean of y
## 7.183927e+00 3.429617e-15
set.seed(1)
random=sample(Z_df1$t_20.5h,length(fave),replace=FALSE)
t.test(fave,random,alternative="two.sided",na.rm=TRUE)
##
## Welch Two Sample t-test
##
## data: fave and random
## t = 8.8995, df = 55.557, p-value = 2.807e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.701380 9.014419
## sample estimates:
## mean of x mean of y
## 7.1839269 -0.1739727
Bonus:
I can check if the t test is justified by constructing a histogram to check for normal distribution,and I could conduct a variance test to make sure the two groups are of equal variances. Another test more suited for data that is not normally distributed is the Wilcox test. The p-value is very low, which means that my favorite genes are actively involved in the diauxic shift.
set.seed(1)
random=sample(Z_df1$t_20.5h,length(fave),replace=FALSE)
wilcox.test(fave,random,alternative="two.sided",na.rm=TRUE)
## Warning in wilcox.test.default(fave, random, alternative = "two.sided", :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: fave and random
## W = 832, p-value = 1.606e-10
## alternative hypothesis: true location shift is not equal to 0