NGS Week 10 Homework Assignment

Task 1: Run Salmon

Q1.1: Review the logging output of Salmon. Report the mapping rates for each sample produced by Salmon. Do you consider these to be too low? If so, how might you re-run Salmon to increase the mapping rates (see Salmon website above)? [ 1 point ]

PDAC253= 8.71844%
PDAC266=11.1517%
PDAC273=22.8452%
PDAC282=15.9158%
PDAC286=11.8775%
PDAC306=8.93615%
PDAC316=17.2182%
PDAC318=12.1166%

# Since the mapping rate for all 8 samples seem to be at the lower end (between 8%-23%), I would consider these to be too low which is likely due to kmer lengths to be too stringent. To re-run Salmon with the goal of increasing mapping rate by choosing a smaller k-mer length, we will need to re-run the salmon indexer using the script:

salmon index -t Pdac_Barhee_chr_unan_180126.all.maker.transcripts_HC_shuffled_normalized.fa -i Pdac_Barhee_chr_unan_180126.all.maker.transcripts_HC_shuffled_normalized_index -k 31

#If we were to re-run the salmon at a different kmer length,then we would need to create a new index using the script above and specify a different, smaller k value. After, we can re-run Salmon using the new index file.

Q1.2:Choose a sample and review the output file “quant.sf”. What are the columns in the output? Please provide an explanation of each column (see “Salmon Output File Formats” in “Output” section of documentation page referenced above).

The columns in the outputs are “Name”, “length”, “effectivelength”, “TPM” and “NumReads”. The name column identifies the name of the target transcript as specified from the input transcript database file. The length column specifies the length of each target transcript in terms of nucleotides. The effectivelength column is essentially the computed effective length of each target transcript and takes into account factors affecting the the probability of sampling fragments in the transcript such as fragment length distribution and gc-fragment bias. The TPM column is essentially and estimation of the relative abundance of each transcript as in transcripts per million (TPM). Finally, the NumReads column provides and estimation of the number of reads mapping to each of the transcript.

Q1.3a. What library type did Salmon infer for the input reads?

IU (“inward”,“unstranded”)

Q1.3b What library type do you think Salmon would have inferred if the date palm RNA-seq had been a stranded library preparation? Please use the two-letter syntax reported in the Salmon documentation.

IS (“inward”, “stranded”)

Q1.3c Explain in your own words what is the difference between a stranded (=“strand-specific”) and unstranded library?

A stranded or strand-specific library describes libraries in which the orientation of the sequencing reads are identified. In other words, we would know or be able to identify which strand ( the plus or minus strand of the DNA template) each reads is from. In an unstranded library, this information is not known or in other words, we would not be able to know whether the sequencing read comes from the plus or the minus strand (orientation) of the DNA template.A stranded library require a more complicated library prep which would most often involve labeling the plus and minus strands of the cDNA to distinguish between reads coming from either strands.

Q1.3d Which do you think is typically preferred for performing DGE analysis, stranded or unstranded libraries? Why?

Stranded because genes and non-coding RNA that overlaps on opposite strands of DNA can not be determined with the standard approach or unstranded libraries and can be problematic in quantifying gene expression and thus DGE. If a non-coding RNA on one strand have features that overlaps the same gene on the opposite strand, when conducting a standard approach (unstranded libraries), we don’t know if the reads that are produced comes from the plus strand or minus strand, which is problematic in quantifying gene expression. Therefore, when performing DGE analysis, I think researchers typically prefer performing DGE analysis with stranded libraries.

Task 2: Run tximport to create a DESeq2DataSet

Q2.1: A nice property of TPMs is that TPMs for alternative transcripts can be added to get the TPM value for all transcripts in a gene. Why does DESeq2 need to convert TPMs to gene counts instead of just using the gene-level TPMs directly in the statistical analysis [ 1 point ]?

Traditional count-based methods for differential gene expression has contributed to high false positives due to differential transcript usage. For examples, genes that do not differ in the amount of mRNA per cell but do differ in mixture of isoforms. To get around this particular problem, it was recommended to use tools such as Salmon to estimate transcript abundances, sum the TPMs to get gene-level TPM values and then convert these values to counts to conduct standard count-based gene level analysis. Tools such as tximport is made flexible to be used with different tools aimed at quantifying transcript abundance and converting TPMS to counts (DESeq2 requires counts as inputs). In this way, analysts can still use traditional count-based methods for gene-level analysis (as DGE is normally performed using count-based inference method) but avoiding differential transcript usage.

Task 3: Run DESeq2

Q3.1: Please report the commands you executed to complete your analysis [ 1 point ].

directory <- "/Users/bonnie/Desktop/ngs.week10"

library(tximport)
sample_names <- c('PDAC253','PDAC282','PDAC286','PDAC316','PDAC266','PDAC273','PDAC306','PDAC318')
sample_condition <- c(rep('highSucrose',4),rep('lowSucrose',4))


files <- file.path("/Users/bonnie/Desktop/ngs.week10",sample_names,paste(sample_names,".transcripts_quant",sep=""),'quant.sf')
names(files) <- sample_names

file.exists("/Users/bonnie/Desktop/ngs.week10/PDAC253/PDAC253.transcripts_quant/quant.sf")

tx2gene <- read.table("/Users/bonnie/Desktop/ngs.week10/Pdac_Barhee_chr_unan_180126_maker_HC.tx2gene",header=F,sep=",")

txi <- tximport(files, type="salmon", tx2gene=tx2gene)

samples <- data.frame(sample_names=sample_names,condition=sample_condition)
row.names(samples) <- sample_names


library("DESeq2")
ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)
class(ddsTxi)

ddsTxi

keep <- rowSums(counts(ddsTxi)) >= 10
dds <- ddsTxi[keep,]

dds <- DESeq(dds)
class(dds)

res.shrunk <- lfcShrink(dds, contrast = c('condition','lowSucrose','highSucrose'), type= "ashr" )
res.shrunkOrdered <- res.shrunk[order(res.shrunk$pvalue),]

res.shrunkOrdered

library(ggplot2)
ggplot(as.data.frame(res.shrunk),aes(pvalue)) + geom_histogram(fill="light blue",color='black')

Q3.2: Which pattern (a-e) do you observe in your histogram? What does this suggest about the impact of sugar composition on gene expression? Please include your p-value histogram in your answer. [ 1 point ].

an enrichment of low p-values. This is the expected result if there is a large class of differentially expressed genes between treatment and control.

This suggests that sugar composition does affect gene expression as shown with the low p-values (rejecting the null hypothesis that sugar composition does not affect gene expression). This shows that certain genes are differentially expressed between date palm fruits with different sugar composition.

Q3.3. Report the results table for the top 10 differentially expressed genes according to adjusted p-value (i.e., FDR). Then, for each of the three genes below Do you consider the gene(s) to be differentially expressed, both statistically and biologically in terms of log fold-change? Why or why not [ 1 point ].

log2 fold change (MMSE): condition lowSucrose vs highSucrose 
Wald test p-value: condition lowSucrose vs highSucrose 
DataFrame with 3 rows and 5 columns
                       baseMean log2FoldChange     lfcSE      pvalue        padj
                      <numeric>      <numeric> <numeric>   <numeric>   <numeric>
Pdac_HC_chr14G0022900  8502.970       9.692887  0.607679 5.62630e-60 9.79145e-56
Pdac_HC_chr14G0023100  6923.473       5.869339  0.650034 1.23136e-21 5.35735e-18
Pdac_HC_chr14G0028200   582.807       0.364476  0.216343 4.88953e-02 2.26907e-01

# I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be differentially expressed as they have a high log2 FC of greater than 1, which means that the statistically differentially expressed gene shows at least a two-fold change in expression. Both genes also have a low padj value which is the adjusted p-value (the FDR). Both also passed the FDR threshold which I set as 0.01 which is on the more stringent side. A FDR threshold that I sat at 0.01 means that 1% or less of these genes are expected to be false positives. Since both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 have a padj value of greater than 0.01 with high log2FoldChange (at least greater than 1), I would only consider these two genes be differentially expressed, both statistically and biologically in terms of log fold-change. I would not consider Pdac_HC_chr14G0028200 to be differentially expressed as it does not have a log2FoldChange of at least 1 and did not pass the FDC threshold.

Q3.4: Show your code and plots. What is the relationship of genewise dispersions and the mean of normalized counts in each of your plots? Do you see a difference among the three methods? Should the fitType be changed from the default “parametric” method as run by the DESeq wrapper function? [ 1 point ]

dds <- estimateDispersions(dds, fitType="parametric")
plotDispEsts(dds)

dds <- estimateDispersions(dds, fitType="local")
plotDispEsts(dds)

dds <- estimateDispersions(dds, fitType="mean")
plotDispEsts(dds)

From the plot generated from the parametric fit, the parametric curve does seem to fit the observed dispersion mean relationship. The curve from the local fit also does seem to fit the observed dispersion mean relationship. I do see a difference among the three methods (please see the plots attached). The mean fit method does not seem provide a good fit to the dispersion-mean relationship. Thus fitType should not be changed as the default parametric fit does seem to fit the observed dispersion mean relationship the best.

Q3.5: In Week 8 and 9, you ran DGE analysis with an exon union approach, while this week you ran DGE using a Salmon + tximport + DESeq2 workflow. What are three reasons why the Salmon + tximport + DESeq2 workflow may be preferred over the STAR + htseq-count + DESeq2 workflow? [ 1 point ]

The use of Salmon + tximport + DESeq2 workflow may be preferred over the STAR + htseq-count + DESeq2 workflow when conducting certain gene-level analysis (differential transcript expression) due to a number of reasons. (1) The use of transcript abundance estimation method allows for correcting potential changes in gene length across samples including the differential isoform usage problem that may cause high false positives in standard method (e.g. genes that do not differ in the amount of mRNA per cell but do differ in mixture of isoforms), (2) some tools are faster and require less resources such as disk usage and memory as with Salmon they don’t output BAM files (in the case of Salmon they used quasi-mapping where reads are mapped to transcripts in which they originate from and not to their known positions) and (3) increases sensitivity as this method allows alignment to multiple genes with homologous sequence instead of discarding these fragments.