We will be comparing the controls from each of our RNA-seq experiments. We’ve already got count tables from galaxy, which we can import into R, then use them for differential expression analysis using DESeq2. Depending of these results for each comparison, we can move forward with a GO term analysis
First, install DESeq2 using BiocManager. This has to be done each time I restart this R project to get access to the library.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2", version = "3.8")
Bioconductor version 3.8 (BiocManager 1.30.4), R 3.5.1 (2018-07-02)
Installing package(s) 'DESeq2'
cannot open URL 'https://bioconductor.org/packages/3.8/data/experiment/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'cannot open URL 'https://bioconductor.org/packages/3.8/workflows/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'trying URL 'https://bioconductor.org/packages/3.8/bioc/bin/macosx/el-capitan/contrib/3.5/DESeq2_1.22.2.tgz'
Content type 'application/x-gzip' length 4063874 bytes (3.9 MB)
==================================================
downloaded 3.9 MB
The downloaded binary packages are in
/var/folders/y5/3bzw9l010_j9cj6hmdjm2sv1rmdgzj/T//Rtmpw26jvs/downloaded_packages
installation path not writeable, unable to update packages: class, cluster, codetools, foreign, lattice, MASS, Matrix,
mgcv, nlme, survival
Update old packages: 'backports'
Update all/some/none? [a/s/n]:
a
cannot open URL 'https://bioconductor.org/packages/3.8/data/experiment/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'cannot open URL 'https://bioconductor.org/packages/3.8/workflows/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'
There is a binary version available but the source version is later:
no
trying URL 'https://cran.cnr.berkeley.edu/bin/macosx/el-capitan/contrib/3.5/backports_1.1.3.tgz'
Content type 'application/x-gzip' length 53570 bytes (52 KB)
==================================================
downloaded 52 KB
The downloaded binary packages are in
/var/folders/y5/3bzw9l010_j9cj6hmdjm2sv1rmdgzj/T//Rtmpw26jvs/downloaded_packages
library("DESeq2")
Working first on comparing N2 W/ Etoh to N2 W/ Auxin: Produced table containing appropriate file names and labels.
Importing count tables into R using sample table. Formating consistency is important in sample table. Directory refers to the folder the files from the sample table are in.
sample_table_N2_Etoh_N2_auxin <- read.table("N2_Aux_N2_Etoh.txt", header = TRUE, sep = "\t")
N2_EA_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_Etoh_N2_auxin, directory = "Count_tables", design = ~ Condition)
N2_EA_counts_vst <- vst(N2_EA_counts, blind = FALSE)
Visualizing count data using PCA. This will help us judge the consistancy of our data and the legitimacy of any potential results.
plotPCA(N2_EA_counts_vst, intgroup=c ("Condition", "Replicate"))
This plot seems decent. The N2 with auxin replicates are very spread out, while there is one outlier in the ethanol replicates, they are all aligned horizontally. A spread out PCA plot like this one could be the result of, because these controls are very similar (both N2), there isn’t an easy way to seperate them.
Calculating differential expression using DESeq2:
N2_EA_DESeq2 <- DESeq(N2_EA_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
N2_EA_results <- results(N2_EA_DESeq2, contrast = c("Condition", "aux", "etoh"))
summary(N2_EA_results)
out of 27238 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 0, 0%
LFC < 0 (down) : 0, 0%
outliers [1] : 55, 0.2%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
This summary shows that there are no genes are are significantly differently expressed between these two controls, from the zeros after LFC > 0 and LFC < 0
Exporting results into a tab-delimited table:
N2_EA_res_ordered <- N2_EA_results[order(N2_EA_results$pvalue),]
N2_EA_res_sig_0.1 <- subset(N2_EA_res_ordered, padj < 0.1)
write.table(as.data.frame(N2_EA_res_sig_0.1), file = "N2 EA Results P 0.1.txt", sep = "\t")
The table is devoid of any genes, again confirming that there are no genes that are expressed sifnificantly differently between these controls, which is more or less to be expected since they are both N2.
I would’ve generated some more plots to continue exploring the data, but given the lack of any significance, there isn’t really a point to do that.
Moving onto N2 with ethanol compared to KRY85 with ethanol, repeating the same steps:
sample_table_N2_Etoh_KRY85 <- read.table("KRY85_N2EtOH.txt", header = TRUE, sep = "\t")
KRY_N2_E_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_Etoh_KRY85, directory = "Count_tables", design = ~ Condition)
KRY_N2_E_0counts_vst <- vst(KRY_N2_E_counts, blind = FALSE)
Visualizing count data using PCA.
plotPCA(KRY_N2_E_counts_vst, intgroup=c ("Condition", "Replicate"))
This PCA plot doesn’t look great, one replicate in each treatment is pretty seperated from the rest in their treatment (KRY85:2 and N2_EtOH:3). We’ll move forward with the analysis, but if we encounter unexpected results, this could be the cause.
Calculating differential expression using DESeq2:
KRY85_N2_E_DESeq2 <- DESeq(KRY_N2_E_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
KRY85_N2_E_results <- results(KRY85_N2_E_DESeq2, contrast = c("Condition", "KRY85", "N2_EtOH"))
summary(KRY85_N2_E_results)
out of 26509 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 206, 0.78%
LFC < 0 (down) : 674, 2.5%
outliers [1] : 116, 0.44%
low counts [2] : 10958, 41%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
This summary looks much more interesting than the last one, with over 3% of the genes being either up or down regulated. We should see more significant results as we continue to analysis these data sets.
Exporting results into a tab-delimited table:
KRY85_N2_E_res_ordered <- KRY85_N2_E_results[order(KRY85_N2_E_results$pvalue),]
KRY85_N2_E_res_sig_0.1 <- subset(KRY85_N2_E_res_ordered, padj < 0.1)
write.table(as.data.frame(KRY85_N2_E_res_sig_0.1), file = "KRY85 N2 E Results P 0.1.txt", sep = "\t")
There weren’t any genes on the table we exported for the N2 with ethanol and N2 with auxin comparison, but that is certainly not the case here! There are 881 rows in this table, which means there are 880 genes that are significantly differently expressed between N2 with ethanol and KRY85 with ethanol Ideally, since these are both intended to be controls, there should be as little difference between them as possible, but these data suggest there are large differences between them.
Visualizing the results in R using MA plot:
plotMA(KRY85_N2_E_results)
This plot again confirms the large and significant differences between these two “controls”. You can also see that, as the mean of normalized counts increases, genes with a smaller log fold change are considered significant, which is pretty intuitive, but still cool to see.
Installing volcano plot tool:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("EnhancedVolcano")
Bioconductor version 3.8 (BiocManager 1.30.4), R 3.5.1 (2018-07-02)
Installing package(s) 'EnhancedVolcano'
cannot open URL 'https://bioconductor.org/packages/3.8/data/experiment/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'cannot open URL 'https://bioconductor.org/packages/3.8/workflows/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'trying URL 'https://bioconductor.org/packages/3.8/bioc/bin/macosx/el-capitan/contrib/3.5/EnhancedVolcano_1.0.1.tgz'
Content type 'application/x-gzip' length 3562216 bytes (3.4 MB)
==================================================
downloaded 3.4 MB
The downloaded binary packages are in
/var/folders/y5/3bzw9l010_j9cj6hmdjm2sv1rmdgzj/T//Rtmpw26jvs/downloaded_packages
installation path not writeable, unable to update packages: class, cluster, codetools, foreign, lattice, MASS, Matrix,
mgcv, nlme, survival
Update old packages: 'backports'
Update all/some/none? [a/s/n]:
a
cannot open URL 'https://bioconductor.org/packages/3.8/data/experiment/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'cannot open URL 'https://bioconductor.org/packages/3.8/workflows/bin/macosx/el-capitan/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'
There is a binary version available but the source version is later:
no
trying URL 'https://cran.cnr.berkeley.edu/bin/macosx/el-capitan/contrib/3.5/backports_1.1.3.tgz'
Content type 'application/x-gzip' length 53570 bytes (52 KB)
==================================================
downloaded 52 KB
The downloaded binary packages are in
/var/folders/y5/3bzw9l010_j9cj6hmdjm2sv1rmdgzj/T//Rtmpw26jvs/downloaded_packages
library(EnhancedVolcano)
Creating volcano plot of data to visualize the magnitude and significance of our data:
EnhancedVolcano(KRY85_N2_E_results,
lab = rownames(KRY85_N2_E_results),
x = "log2FoldChange",
y = "padj",
pCutoff = 0.01,
ylim = c(0, 12))
This communicates a similar idea as the MA plot, but it also helps to see that more genes have significantly lower expression than have significantly higher expression.
Moving on to comparing KRY85 with ethanol to N2 with auxin. Given the results from the N2 with ethanol and N2 with auxin comparison from before, I expect these results to be very similar to those from the KRY85 with ethanol and N2 with ethanol comparison.
sample_table_N2_Aux_KRY85 <- read.table("KRY85_N2auxin.txt", header = TRUE, sep = "\t")
KRY_N2_A_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_Aux_KRY85, directory = "Count_tables", design = ~ Condition)
KRY_N2_A_counts_vst <- vst(KRY_N2_A_counts, blind = FALSE)
Visualizing count data using PCA.
plotPCA(KRY_N2_A_counts_vst, intgroup=c ("Condition", "Replicate"))
This looks very similar to the PCA plot for the KRY85 w/ ethanol and N2 w/ ethanol comparison, which makes since given how similar each N2 control is.
Calculating differential expression using DESeq2:
KRY85_N2_A_DESeq2 <- DESeq(KRY_N2_A_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
KRY85_N2_A_results <- results(KRY85_N2_A_DESeq2, contrast = c("Condition", "KRY85", "auxin"))
summary(KRY85_N2_A_results)
out of 26819 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 244, 0.91%
LFC < 0 (down) : 651, 2.4%
outliers [1] : 132, 0.49%
low counts [2] : 11602, 43%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Again, this result is quite similar to the last, with about 3% of the genes being expressed significantly differently between these controls.
Exporting results into a tab-delimited table:
KRY85_N2_A_res_ordered <- KRY85_N2_A_results[order(KRY85_N2_A_results$pvalue),]
KRY85_N2_A_res_sig_0.1 <- subset(KRY85_N2_A_res_ordered, padj < 0.1)
write.table(as.data.frame(KRY85_N2_A_res_sig_0.1), file = "KRY85 N2 A Results P 0.1.txt", sep = "\t")
This comparison produced a table with 895 genes with an adjusted p-value less than 0.1. That is 15 more genes than from the last comparison, which isn’t a huge difference.
Visualizing the results in R:
plotMA(KRY85_N2_A_results)
This plot again demonstrates the large differences between the KRY85 with ethanol and N2 with Auxin “controls”.
Creating volcano plot of data:
EnhancedVolcano(KRY85_N2_A_results,
lab = rownames(KRY85_N2_A_results),
x = "log2FoldChange",
y = "padj")
This graph once again resembles the one produced for the last comparison and again shows the large amount of difference between KRY85 and N2 controls in general. This comparison and the last may signify that the inherent differences between KRY85 and N2, even in the abscence of auxin, may prove to be to great to allow any accurate comparisons to reveal the effect of NHR-25.
Redoing the comparison between N2 and KRY85 with ethanol, without the PCA outliers this time:
sample_table_N2_Etoh_KRY85_pca <- read.table("KRY85_N2EtOH_pca_fix.txt", header = TRUE, sep = "\t")
incomplete final line found by readTableHeader on 'KRY85_N2EtOH_pca_fix.txt'
KRY_N2_E_pca_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_Etoh_KRY85_pca, directory = "Count_tables", design = ~ Condition)
KRY_N2_E_pca_counts_vst <- vst(KRY_N2_E_pca_counts, blind = FALSE)
Visualizing count data using PCA.
plotPCA(KRY_N2_E_pca_counts_vst, intgroup=c ("Condition", "Replicate"))
At first glance, this PCA plot doesn’t look much better than the one which included the outliers. However, on closer inspection, the treatments are aligned on PC1 and even though they are pretty spread on PC2,it only has 20% variance across, so in all, these are grouped tighter than they were last time. This should allow us to have more confidence in any results of this comparison.
Calculating differential expression using DESeq2:
KRY85_N2_E_pca_DESeq2 <- DESeq(KRY_N2_E_pca_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
KRY85_N2_E_pca_results <- results(KRY85_N2_E_pca_DESeq2, contrast = c("Condition", "KRY85", "N2_EtOH"))
summary(KRY85_N2_E_pca_results)
out of 25897 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 401, 1.5%
LFC < 0 (down) : 599, 2.3%
outliers [1] : 0, 0%
low counts [2] : 11169, 43%
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
This summary looks similar to how it did when the outliers were included, though “outliers” has gone from 0.44% to 0% and LFC > 0 from .78% to 1.5%. I’m not totally sure of the significance of this, but at the very least, it shows that the differences we observed between KRY85 and N2 weren’t just because of variations in the datasets shown in the PCA plot.
Exporting results into a tab-delimited table:
KRY85_N2_E_pca_res_ordered <- KRY85_N2_E_pca_results[order(KRY85_N2_E_pca_results$pvalue),]
KRY85_N2_E_pca_res_sig_0.1 <- subset(KRY85_N2_E_pca_res_ordered, padj < 0.1)
write.table(as.data.frame(KRY85_N2_E_pca_res_sig_0.1), file = "KRY85 N2 E PCA fix Results P 0.1.txt", sep = "\t")
With this comparison, which excluded the outliers from the first PCA plot, we still see 1000 genes (exactly) with padj values less than 0.1. Again, this suggests that the differences we’ve observed between KRY85 and N2 are real and not due to just the PCA outliers.
Moving on from N2 and KRY85 into investigating the RNAi control as well, for now, I’ll just compare it to N2 with ethanol:
Want to import count tables from HT-Seq into R. That is what this next chunk does.
sample_table_N2_E_RNAi <- read.table("N2_Etoh_N2_RNAi.txt", header = TRUE, sep = "\t")
N2_E_RNAi_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_E_RNAi, directory = "Count_tables", design = ~ Condition)
N2_E_RNAi_counts_vst <- vst(N2_E_RNAi_counts, blind = FALSE)
Visualizing count data.
plotPCA(N2_E_RNAi_counts_vst, intgroup=c ("Condition", "Replicate"))
Both treatments seem relatively tightly grouped on poth principal components, so we can move on to further differential expression analysis.
Calculating differential expression using DESeq2:
N2_E_RNAi_DESeq2 <- DESeq(N2_E_RNAi_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
N2_E_RNAi_results <- results(N2_E_RNAi_DESeq2, contrast = c("Condition", "etoh", "rnai"))
summary(N2_E_RNAi_results)
out of 26005 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5382, 21%
LFC < 0 (down) : 4530, 17%
outliers [1] : 31, 0.12%
low counts [2] : 6829, 26%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Compared to previous comparisons, this is very suprising! Over one third of the entire C. elegans genome is differentially regulated in some way! It turns out that the RNAi worms were grown in a different environment and fed a different strain of E. coli, which could explain these results.
Exporting results into a tab-delimited table:
N2_E_RNAi_res_ordered <- N2_E_RNAi_results[order(N2_E_RNAi_results$pvalue),]
N2_E_RNAi_res_sig_0.1 <- subset(N2_E_RNAi_res_ordered, padj < 0.1)
write.table(as.data.frame(N2_E_RNAi_res_sig_0.1), file = "N2 EtOH RNAi Results P 0.1.txt", sep = "\t")
As in the summary from before, we see 9912 genes that are significantly differentially regulated, much more than the less than 1000 genes we’ve seen in our other comparisons.
Visualizing the results in R:
plotMA(N2_E_RNAi_results)
This plot further emphasizes the staggering number of significant genes we’ve found in this comparison.
Creating volcano plot of data:
EnhancedVolcano(N2_E_RNAi_results,
lab = rownames(N2_E_RNAi_results),
x = "log2FoldChange",
y = "padj")
Again, we see very prevalent differences between N2 with ethanol and N2 with RNAi, this also shows that (as confirmed by the excel spreadsheet) the gene with the lowest adjusted p-value, dct-16, has a padj of 1.57E-71, which is absolutely staggering.
N2 with Auxin and N2 with RNAi which should be similar:
Want to import count tables from HT-Seq into R. That is what this next chunk does.
sample_table_N2_A_RNAi <- read.table("Auxin_RNAi.txt", header = TRUE, sep = "\t")
N2_A_RNAi_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_N2_A_RNAi, directory = "Count_tables", design = ~ Condition)
N2_A_RNAi_counts_vst <- vst(N2_A_RNAi_counts, blind = FALSE)
Visualizing count data.
plotPCA(N2_A_RNAi_counts_vst, intgroup=c ("Condition", "Replicate"))
Calculation differential expression using DESeq2:
N2_A_RNAi_DESeq2 <- DESeq(N2_A_RNAi_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
N2_A_RNAi_results <- results(N2_A_RNAi_DESeq2, contrast = c("Condition", "auxin", "RNAi"))
summary(N2_A_RNAi_results)
out of 26344 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 4954, 19%
LFC < 0 (down) : 4015, 15%
outliers [1] : 38, 0.14%
low counts [2] : 7400, 28%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Exporting results into a tab-delimited table:
N2_A_RNAi_res_ordered <- N2_A_RNAi_results[order(N2_A_RNAi_results$pvalue),]
N2_A_RNAi_res_sig_0.1 <- subset(N2_A_RNAi_res_ordered, padj < 0.1)
write.table(as.data.frame(N2_A_RNAi_res_sig_0.1), file = "N2 Auxin RNAi Results P 0.1.txt", sep = "\t")
Visualizing the results in R:
plotMA(N2_A_RNAi_results)
Creating volcano plot of data:
EnhancedVolcano(N2_A_RNAi_results,
lab = rownames(N2_A_RNAi_results),
x = "log2FoldChange",
y = "padj")
Overall, the results of this comparison line up closely with the results of the last comparison, which is expected given how similar the two N2 treatments have been throughout.
KRY85 with ethanol and N2 with RNAi, this comparison may yield different results, given observed differences between N2 and KRY85:
Want to import count tables from HT-Seq into R. That is what this next chunk does.
sample_table_KRY85_RNAi <- read.table("KRY85_RNAi.txt", header = TRUE, sep = "\t")
KRY85_RNAi_counts <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table_KRY85_RNAi, directory = "Count_tables", design = ~ Condition)
KRY85_RNAi_counts_vst <- vst(KRY85_RNAi_counts, blind = FALSE)
Visualizing count data.
plotPCA(KRY85_RNAi_counts_vst, intgroup=c ("Condition", "Replicate"))
Calculation differential expression using DESeq2:
KRY85_RNAi_DESeq2 <- DESeq(KRY85_RNAi_counts)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
KRY85_RNAi_results <- results(KRY85_RNAi_DESeq2, contrast = c("Condition", "KRY85", "RNAi"))
summary(KRY85_RNAi_results)
out of 25079 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5491, 22%
LFC < 0 (down) : 5496, 22%
outliers [1] : 106, 0.42%
low counts [2] : 6093, 24%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Exporting results into a tab-delimited table:
KRY85_RNAi_res_ordered <- KRY85_RNAi_results[order(KRY85_RNAi_results$pvalue),]
KRY85_RNAi_res_sig_0.1 <- subset(KRY85_RNAi_res_ordered, padj < 0.1)
write.table(as.data.frame(KRY85_RNAi_res_sig_0.1), file = "KRY85 RNAi Results P 0.1.txt", sep = "\t")
Visualizing the results in R:
plotMA(KRY85_RNAi_results)
Creating volcano plot of data:
EnhancedVolcano(KRY85_RNAi_results,
lab = rownames(KRY85_RNAi_results),
x = "log2FoldChange",
y = "padj")
These results ended up being pretty similar, so it seems the vast differences between the RNAi and the other datasets overpowered differences between the N2s and the KRY85.
Investigating gene set enrichment for controls with differential expression (using worm base): N2 with ethanol versus KRY85 with ethanol: N2 with Auxin versus KRY85 with ethanol:
N2 with Ethanol versus N2 with RNAi:
N2 with Auxin versus N2 with RNAi:
KRY85 with Ethanol versus N2 with RNAi:
Only the upregulated genes in KRY85 as compared to N2 with Ethanol: