Download and load these libraries into your R workspace (you may need to also use the Bioconductor package to get the DEseq package)
library(stringr)
library(ggplot2)
library(dplyr)
library(tibble)
library(purrr)
library(DESeq2)
library(adegenet)
library(pheatmap)
library(RColorBrewer)
library(ggrepel)
#setwd("<PATH TO A FOLDER WITH THE SALMON COUNT FOLDERS AND SAMPLES.TXT DATA FRAME FROM CANVAS>")
Salmon will save results in a folder for each sample you run. The main results file is called quant.sf, which we will need to import into the R environment. There are also auxillary files with more details about specific aspects of the Salmon run, including details on how transcript counts were adjusted based on gc-bias and/or sequence-specific bias, statistics on mapping success, and metadata for the quantification run, including which settings were active and which files were used as input.
You can use the following code to import the quant.sf files into R.
*!!! The file paths will only work if the Salmon output folders have been copied into the working directory for your R session. My Salmon output folders are named with the sample ID and _quant (e.g. BP027_quant). You will need to name your _quant folders accordingly so the information in the Sample_Data.txt file can be matched up with the count data for the corresponding sample.*
samples.df <- read.table("./Sample_Data.txt", header=T) #read in the table with sex and capture date information on the individuals we have transcriptome data on
salmon_files <- list.files("./", pattern = "quant") #create a list of the file names with salmon read counts
for(i in 1:length(salmon_files)){ #create a for loop to read in the read count files and automatically name them with the sample ID from the file name
readcount <- read.delim(str_c("./", salmon_files[i], "/quant.sf")) #this will read in the Salmon output files for all samples
assign(str_c(str_split_1(salmon_files[i], "_")[1],"_salmon.rc"), readcount) #this will name the R data frames with the corresponding sample name
}
head(BP027_salmon.rc) #print the first few lines of one of the read count tables to see the data structure
## Name Length EffectiveLength TPM NumReads
## 1 QSF_LOC100695742.1.2 1205 954.000 0.000000 0.000
## 2 QSF_LOC100700970.1.2 3295 1815.641 7.889663 57.997
## 3 QSF_LOC100690464.1.1 1907 1656.000 0.000000 0.000
## 4 QSF_GTF2A1.1.2 855 275.991 23.268096 26.000
## 5 QSF_DEP1A.1.5 2502 2251.000 0.000000 0.000
## 6 QSF_LOC101157626.1.1 2078 1729.089 0.372177 2.605
The output data from Salmon will be saved as text files (quant.sf) with five columns:
ggplot(data=BP032_salmon.rc)+
geom_histogram(aes(x=NumReads), stat="bin", bins=400)+
labs(x="raw read counts", y="number of genes")
We can plot histograms of the read counts for each transcript. We can see that the overwhelming majority of transcripts only showed up in our samples a handful of times (<100), but there are a small number of transcripts with tens of thousands of associated reads.
This leaves us with the question of how to model the data. Most of
our typical statistical tools that rely on a normal distribution won’t
work. We can see what a normal distribution with our data’s mean and
variance looks like (in red) versus the actual distribuition of our data
looks like (in grey).
Poisson distributions are commonly used to model count data, and we
can see this is a better fit to what our data looks like:
Some statistical methods applied to read count data do use Poisson distributions to model read counts, and this works well when comparing individual samples. However, Poisson distributions rely on the assumption that the mean=variance. We can look at means and variance of across our samples to see if this is true:
salmon_count.list <- objects(pattern = "_salmon.rc") #create a list with all of the salmon read count objects in the R environment
salmon_counts <- NULL #create a null object to turn into a table with the salmon read counts from all of the samples
for(i in 1:length(salmon_count.list)){ #create a for loop to create a list with tables for each sample with just the gene names and read counts
salmon_counts[i] <- list(get(salmon_count.list[i]) %>% dplyr::select("Name", "NumReads"))
}
salmon_counts <- salmon_counts %>% purrr::reduce(left_join, by="Name") #match up the read counts by gene name across all the samples and put them into a single data frame
colnames(salmon_counts) <- c("Name", lapply(salmon_files, function(x) str_split_1(x, pattern= "_")[1])) #rename the columns of the data frame with sample IDs
salmon_counts_summary.df <- as.data.frame(salmon_counts$Name) #create a new data frame that we will use to summarize the mean and variance of each transcript across our samples
salmon_counts_summary.df$Means <- rowSums(salmon_counts[,-1]) #calculate the mean read counts of each transcript across samples
salmon_counts_summary.df$Var <- apply(salmon_counts[,-1], 1, var) #calculate the variance in read counts for each transcript across samples
ggplot(data=salmon_counts_summary.df, aes(x=Means, y=Var))+ #make a plot comparing how means and variance change with read abundance in our data
geom_point()+ #add points for each transcript
geom_line(aes(x=Means, y=Means), color="red")+ #add a red line with slope=1 showing the assumed Poisson relationship between mean and variance
scale_x_log10()+ #change the x axis to a log scale
scale_y_log10() #change the y axis to a log scale
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
From this graph, we can see that the variance in highly expressed
transcripts increases faster than the mean. If our data followed a
Poisson distribution, the black points would fall along the red
line.
So, a Poisson distribution is probably not appropriate for our data.
It is pretty good at estimating our mean, but we need a distribution
that can decouple variance from the mean. There is a distribution that
will meet our expectations, the negative binomial distribution.
Now we know what our data looks like, we can look for possible outliers and filter out information that we do not need. We will also need to standardize our data across samples so we can make direct comparisons.
Filtering out low-expressed transcripts is not strictly necessary for most analyses, but we can significantly speed up some steps downstream and require fewer computational resources if we remove transcripts that are too rare for us to draw meaningful conclusions about.
The simplest filter we can do is to remove transcripts that do not show up in our samples at all. Salmon creates entries based on sequences in the reference transcriptome, but every transcript will not be expressed in every tissue, so we can very easily remove some transcripts that will not impact our analysis at all.
salmon_counts_filtered.df <- salmon_counts[rowSums(salmon_counts[,-1]) > 0,] #get rid of transcripts that are not expressed in any of the samples
nrow(salmon_counts)
## [1] 69441
nrow(salmon_counts_filtered.df)
## [1] 56051
We can also be more aggressive with our filtering. Here, we can remove transcripts with read counts less than 5 in fewer than 3 of our samples. The exact values will depend on your dataset. The number of read counts should be based on your sequencing depth (usually 5-10, higher sequencing depths are more likely to accurately reflect the abundance of rare reads). The number of passing samples should be based on the smallest treatment group that you wish to draw comparisons about. In this case, we have 3 male fish, so we set the minimum number of passing samples to 3 so we could identify any genes are expressed above our threshold value in males.
salmon_counts_filtered.df <- salmon_counts[rowSums(salmon_counts[,-1] > 5) > 3, ]
nrow(salmon_counts)
## [1] 69441
nrow(salmon_counts_filtered.df)
## [1] 29980
salmon_counts_filtered_summary.df <- as.data.frame(salmon_counts_filtered.df$Name) #create a new data frame that we will use to summarize the mean and variance of each transcript across our samples
salmon_counts_filtered_summary.df$Means <- rowSums(salmon_counts_filtered.df[,-1]) #calculate the mean read counts of each transcript across samples
salmon_counts_filtered_summary.df$Var <- apply(salmon_counts_filtered.df[,-1], 1, var) #calculate the variance in read counts for each transcript across samples
ggplot()+
geom_histogram(data=salmon_counts_filtered_summary.df, aes(x=Means), stat="bin", bins=400, color="red")+
geom_histogram(data=salmon_counts_summary.df, aes(x=Means), stat="bin", bins=400, alpha=0.5)+
labs(x="raw read counts", y="number of genes")+
xlim(-10, 5000)
Ultimately, our goal is to compare levels of expression between samples and identify genes with different expression patterns among our sample treatments. To do this, we need to account for the fact that our samples have different sequencing depths, perhaps different overall expression levels, different levels of extraction efficiency, etc. that prevents the raw counts from being directly comparable. We have a few options to approach normalization. 1. CPM - counts per million - Divides read counts by total number of reads mapped - Most basic method, not suitable for sample comparisons with RNA-seq; ok for TagSeq 2. TPM - transcripts per million - Divides read counts by transcript length and total number of reads mapped - Suitable for sample comparisons with RNA-seq, not valid for TagSeq 3. RPKM - reads per kilobase of exon per million reads mapped - Divides read counts by length of exons of reference sequences and total number of reads mapped - Suitable for sample comparisons with RNA-seq, not valid for TagSeq. Useful for alternative splicing 4. DEseq median of ratios - read counts for each sample is re-scaled based on the median ratio of the sample counts across transcripts to the geometric mean read counts across all the samples - Assumes that most genes are not differentially expressed - Works similarly to CPM, but is less susceptible to bias due to differentially expressed high-abundance transcripts. Geometric means are biased towards lower values.
dds <- DESeqDataSetFromMatrix(countData=round(salmon_counts_filtered.df[,-1]), colData = samples.df, design= ~Sex) #create the DEseq input object
## converting counts to integer mode
dds.norm <- estimateSizeFactors(dds) #apply the DEseq median of ratios normalization to the read counts in the DEseq input object
sizeFactors(dds.norm) #print the median of ratios correction factors
## BP027 BP028 BP031 BP032 BP042 BP044 BP046 BP048
## 0.9339448 1.1729449 1.1911803 1.2804712 1.4255384 0.8476093 1.0249450 0.5712278
colSums(salmon_counts_filtered.df[,-1]) #print the total read count for each sample
## BP027 BP028 BP031 BP032 BP042 BP044 BP046 BP048
## 2840572 3348252 3452948 3619042 3641577 2513016 3139630 1816486
Above you can see how the median of ratios normalization factors relate to total read abundance for each of the samples. Read counts are divided by the sample’s median of ratios to generate the normalized read counts. Samples with more total reads have higher normalization factors.
We can see if there are any samples that are behaving particularly unusually by running some unsupervised clustering analysis on our data. If we see any outliers, we can remove them. Plus, we can begin to get a sense of whether or not our sample treatments had a large effect on the variation present in our data.
The first clustering analysis we will run is a principal components analysis:
rld <- rlog(dds.norm, blind=T) #apply a log2 transformation to our normalized samples for running the PCA
rld.mat <- assay(rld) #pull out a matrix of the log2 transformed normalized read counts from the rld object
rld.rank <- apply(rld.mat, 1, "var") #estimate the variance of each transcript
rld.rank <- rank(rld.rank) #rank the transcripts from most to least variable
rld.mat_mostvar <- rld.mat[rld.rank <= 1000, ] #make a matrix with the read counts of the 1000 most variable transcripts
sample.pca <- dudi.pca(t(rld.mat_mostvar), scannf=F, nf=6) #run a PCA on the 1000 most variable transcripts, keeping the first 6 principal components
pca.df <- samples.df %>% left_join(add_rownames(sample.pca$li), by=c("Sample_ID" = "rowname")) #combine the PCA output with the sample information data frame for plotting with ggplot
#GGplot for PCs 1 & 2
ggplot(data=pca.df, aes(x=Axis1, y=Axis2, color=Date_Sampled, shape=Sex))+
geom_point(size=4)+
theme_classic()
#GGplot for PCs 3 & 4
ggplot(data=pca.df, aes(x=Axis3, y=Axis4, color=Date_Sampled, shape=Sex))+
geom_point(size=4)+
theme_classic()
#GGplot for PCs 5 & 6
ggplot(data=pca.df, aes(x=Axis5, y=Axis6, color=Date_Sampled, shape=Sex))+
geom_point(size=4)+
theme_classic()
Another clustering analysis and visualization that can be helpful is hierarchical clustering using a UPGMA tree and a heatmap:
rownames(samples.df) <- samples.df$Sample_ID #add the sample IDs as the rownames of the sample information data frame because the heatmap matches up rownames in our sample information to the sample names in the read count matrix
pheatmap(cor(rld.mat), annotation_row = samples.df) #plot a heatmap, make sure to convert the read count matrix into a correlation matrix as input
How would you describe how the data is clustering? Do either of our “treatment” factors appear to have much of an effect on overall transcriptional variation? Are there any samples you think warrant removal as outliers?
The main function of DEseq is built around creating a model of dispersion and finding genes that significantly differ from that model.
Dispersion = (Variance - Mean)/(Mean^2) Higher variance -> higher dispersion Higher mean -> lower dispersion
Because there are typically a lot of genes and relatively few samples in transcriptomics studies, the DEseq model makes dispersion estimates more robust by averaging dispersion estimates across genes with similar levels of expression. This is called shrinkage. DEseq does not apply shirnkage to genes with much higher dispersion estimate than average.
The shrunken dispersion estimates are then used to parameterize negative binomial distributions used in generalize linear models. A GLM is run on the read counts for each transcript with the negative binomial distribution generated by DEseq used as the null model, and factors (i.e. treatments) we are interested in used as explanatory variables. The significance of those factors in the GLMs is calculated (pvalue) and adjusted for multiple comparisons based on the number of transcripts being tested (padj; Benjamin-Holchberg method is default, but others are available).
We set the factors being tested in the design option when we are setting up our DESeq object, using the formula notation in R (e.g. ~ Factor 1 + Factor 2)
dds <- DESeqDataSetFromMatrix(countData=round(salmon_counts_filtered.df[,-1]), colData = samples.df, design= ~Sex) #create the DEseq input object, the design is a model
## converting counts to integer mode
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
DEseq.run1 <- DESeq(dds) #run the DEseq on our DEseq input object
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(DEseq.run1) #plot the dispersion versus mean normalized counts
We will need to specify certain contrasts in order to calculate the statistics used to identify differentially expressed genes. Below, we set up a contrast to identify the genes that are expressed higher in males than females (and vice-versa, but note that the log fold change differences for genes expressed higher in females will be negative). Note that we can also set up the contrast with females first; the signs of some of the statistics would change, but the p-values would stay the same.
run1_results <- results(DEseq.run1, contrast = c("Sex", "M", "F")) #run differential expression tests to find transcripts that are differentially expressed by male and females
plotMA(run1_results) #plot log change of the selected contrast against the average expression across all samples
Above we can see the log fold change of the specified contrast for each gene plotted against the average expression level (mean normalized read count). This plot can tell us if our differentially expressed genes tend to be highly or lowly expressed (or have a variety of expression levels), as well as the number of genes that appear to be up-regulated and down-regulated based on treatment group.
We also will want to save a data frame of the p-values and test statistics for our contrast. We can do that using the code below. Our DEseq results include log-fold change, the Wald statistics used to assess how different our differentially expressed genes are, and the unadjusted and multiple-test corrected p-values.
run1_results.df <- cbind(salmon_counts_filtered.df, as.data.frame(run1_results@listData)) #combine the read count data frame with the DEseq results data frame
head(run1_results) #print the first few rows of the combined read count/results data frame
## log2 fold change (MLE): Sex M vs F
## Wald test p-value: Sex M vs F
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 2 48.6177 -0.2381548 0.303370 -0.785031 0.43243552 0.999944
## 4 32.7707 -0.0608359 0.348905 -0.174363 0.86158050 0.999944
## 9 35.4480 -0.3622838 0.409401 -0.884912 0.37620414 0.999944
## 10 51.1678 -0.2956637 0.385987 -0.765994 0.44367962 0.999944
## 12 82.8309 0.7995054 0.277477 2.881337 0.00395991 0.999944
## 15 140.9725 0.0883306 0.202865 0.435416 0.66326066 0.999944
In our results data frame, we can find the gene IDs of transcripts that were significantly differentially expressed and easily plot the normalized read counts of that gene from samples divided by our two factors of interest. Here, we plot the expression of transcript QSF_FKBP3.60.116 (row number 40137) and see that it tended to have almost 10x greater expression in females compared to males.
plotCounts(dds, gene="40137", intgroup = "Sex") #plot the normalized read counts for gene no. 40137, one of the transcripts identified as being significantly differentially expressed
We can build a heatmap of expression of our differentially expressed genes to quickly visualize how expression varied across all our samples. This can tell us which individuals had similar expression profiles and how consistent differences in expression were across our contrasts.
normalized_counts.df <- counts(dds.norm, normalized=T) %>% data.frame() %>% rownames_to_column(var="gene") #build a data frame with the normalized read counts and a column with the transcript numerical ID given by DEseq
normalized_counts.df$Transcript_ID <- salmon_counts_filtered.df$Name #add the transcript name that refers back the the reference transcript from the gentrome.fa file
run1_results.df <- cbind(normalized_counts.df, as.data.frame(run1_results@listData)) #make a data frame that combines the normalized counts and the results of the DEseq contrast tests
run1_sigresults.df <- run1_results.df[run1_results.df$padj < 0.05 & is.na(run1_results.df$padj)==F,] #create a data frame with only the read counts and results for the significantly differentially expressed genes
normcounts_sigresults.df <- run1_sigresults.df[,2:9] #subset the normalized counts for the significantly differentially expressed genes
pheatmap(normcounts_sigresults.df, cluster_rows = T, show_rownames = F, annotation_col = samples.df, scale="row") #create a heatmap from the normalized counts of all the genes that were significantly differentially expressed.
Finally, volcano plots are a common way of visualizing differential expression analysis results. The x-axis shows log2 fold change in both directions, so most of the genes that are not differentially expressed are shown in the middle. -Log10 adjusted p-values are plotted on the y axis. Very small p-values will become big numbers with -log10 transformed. Genes higher up on the y-axis are more confidently differentially expressed.
volcano_data.df <- run1_results.df %>% mutate(thresholdOE=padj < 0.05 & abs(log2FoldChange) >= 0.58) #create a data frame to serve as input for a ggplot object, using hte DEseq results data frame as a base and adding a new True/False column that says if a transcript was identified as being significantly differentially expressed AND if it had an absolute fold change >1.5
for(i in 1:nrow(volcano_data.df)){ #create a for loop to populate another new column in the volcano data data frame
volcano_data.df$genelabels[i] <- if((rank(volcano_data.df$padj) <= 10)[i]) volcano_data.df$Transcript_ID[i] else NA #copy the transcript name to the new column which we will used to label the top 10 differentially expressed genes with the lowest adjusted p-values
}
#GGplot for the Volcano Plot
ggplot(volcano_data.df)+
geom_point(aes(x=log2FoldChange, y=-log10(padj), color=thresholdOE))+ #add the points, coloring by TRUE/FALSE column we created, and -log10 transforming the padj value along the y-axis
geom_text_repel(aes(x=log2FoldChange, y=-log10(padj), label=genelabels))+ #add labels of the gene names for the top 10 differentially expressed genes
labs(main="M vs.F", x="Fold Change in Expression (log2)", y="Adjusted P-value (-log10)")+
theme_classic()
Run through the above code with your files and make sure you can get the same results.
Q1: Answer the questions from line 204. How would you describe how the data is clustering? Do either of our “treatment” factors appear to have much of an effect on overall transcriptional variation? Are there any samples you think warrant removal as outliers? Explain your reasoning.
Starting at the beginning of part 3, re-run the analysis, but this time use “Date_Sampled” as the variable of interest instead of “Sex”. To help with your interpretation, note that the samples collected on July 23rd were at the peak of a heat wave and were thermally stressed while the samples collected on August 1 had spent several days at less stressful temperatures.
Q2: How many transcripts did you find were differentially expressed with the Date Sampled contrast compared to the Sex constrast?
Q3: Submit the new Heatmap of Differentially Expressed Genes (as a jpeg, pdf, or png). Write a brief caption explaining what the plot is showing and what your major takeaways are.
Q4: Submit the new Volcano Plot (as a jpeg, pdf, or png). Write a brief caption explaining what the plot is showing and what your major takeaways are.