PDAC253= 8.71844%
PDAC266=11.1517%
PDAC273=22.8452%
PDAC282=15.9158%
PDAC286=11.8775%
PDAC306=8.93615%
PDAC316=17.2182%
PDAC318=12.1166%
# Since the mapping rate for all 8 samples seem to be at the lower end (between 8%-23%), I would consider these to be too low which is likely due to kmer lengths to be too stringent. To re-run Salmon with the goal of increasing mapping rate by choosing a smaller k-mer length, we will need to re-run the salmon indexer using the script:
salmon index -t Pdac_Barhee_chr_unan_180126.all.maker.transcripts_HC_shuffled_normalized.fa -i Pdac_Barhee_chr_unan_180126.all.maker.transcripts_HC_shuffled_normalized_index -k 31
#If we were to re-run the salmon at a different kmer length,then we would need to create a new index using the script above and specify a different, smaller k value. After, we can re-run Salmon using the new index file.
The columns in the outputs are “Name”, “length”, “effectivelength”, “TPM” and “NumReads”. The name column identifies the name of the target transcript as specified from the input transcript database file. The length column specifies the length of each target transcript in terms of nucleotides. The effectivelength column is essentially the computed effective length of each target transcript and takes into account factors affecting the the probability of sampling fragments in the transcript such as fragment length distribution and gc-fragment bias. The TPM column is essentially and estimation of the relative abundance of each transcript as in transcripts per million (TPM). Finally, the NumReads column provides and estimation of the number of reads mapping to each of the transcript.
A stranded or strand-specific library describes libraries in which the orientation of the sequencing reads are identified. In other words, we would know or be able to identify which strand ( the plus or minus strand of the DNA template) each reads is from. In an unstranded library, this information is not known or in other words, we would not be able to know whether the sequencing read comes from the plus or the minus strand (orientation) of the DNA template.A stranded library require a more complicated library prep which would most often involve labeling the plus and minus strands of the cDNA to distinguish between reads coming from either strands.
Stranded because genes and non-coding RNA that overlaps on opposite strands of DNA can not be determined with the standard approach or unstranded libraries and can be problematic in quantifying gene expression and thus DGE. If a non-coding RNA on one strand have features that overlaps the same gene on the opposite strand, when conducting a standard approach (unstranded libraries), we don’t know if the reads that are produced comes from the plus strand or minus strand, which is problematic in quantifying gene expression. Therefore, when performing DGE analysis, I think researchers typically prefer performing DGE analysis with stranded libraries.
Traditional count-based methods for differential gene expression has contributed to high false positives due to differential transcript usage. For examples, genes that do not differ in the amount of mRNA per cell but do differ in mixture of isoforms. To get around this particular problem, it was recommended to use tools such as Salmon to estimate transcript abundances, sum the TPMs to get gene-level TPM values and then convert these values to counts to conduct standard count-based gene level analysis. Tools such as tximport is made flexible to be used with different tools aimed at quantifying transcript abundance and converting TPMS to counts (DESeq2 requires counts as inputs). In this way, analysts can still use traditional count-based methods for gene-level analysis (as DGE is normally performed using count-based inference method) but avoiding differential transcript usage.
directory <- "/Users/bonnie/Desktop/ngs.week10"
library(tximport)
sample_names <- c('PDAC253','PDAC282','PDAC286','PDAC316','PDAC266','PDAC273','PDAC306','PDAC318')
sample_condition <- c(rep('highSucrose',4),rep('lowSucrose',4))
files <- file.path("/Users/bonnie/Desktop/ngs.week10",sample_names,paste(sample_names,".transcripts_quant",sep=""),'quant.sf')
names(files) <- sample_names
file.exists("/Users/bonnie/Desktop/ngs.week10/PDAC253/PDAC253.transcripts_quant/quant.sf")
tx2gene <- read.table("/Users/bonnie/Desktop/ngs.week10/Pdac_Barhee_chr_unan_180126_maker_HC.tx2gene",header=F,sep=",")
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
samples <- data.frame(sample_names=sample_names,condition=sample_condition)
row.names(samples) <- sample_names
library("DESeq2")
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
class(ddsTxi)
ddsTxi
keep <- rowSums(counts(ddsTxi)) >= 10
dds <- ddsTxi[keep,]
dds <- DESeq(dds)
class(dds)
res.shrunk <- lfcShrink(dds, contrast = c('condition','lowSucrose','highSucrose'), type= "ashr" )
res.shrunkOrdered <- res.shrunk[order(res.shrunk$pvalue),]
res.shrunkOrdered
library(ggplot2)
ggplot(as.data.frame(res.shrunk),aes(pvalue)) + geom_histogram(fill="light blue",color='black')
This suggests that sugar composition does affect gene expression as shown with the low p-values (rejecting the null hypothesis that sugar composition does not affect gene expression). This shows that certain genes are differentially expressed between date palm fruits with different sugar composition.
log2 fold change (MMSE): condition lowSucrose vs highSucrose
Wald test p-value: condition lowSucrose vs highSucrose
DataFrame with 3 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
Pdac_HC_chr14G0022900 8502.970 9.692887 0.607679 5.62630e-60 9.79145e-56
Pdac_HC_chr14G0023100 6923.473 5.869339 0.650034 1.23136e-21 5.35735e-18
Pdac_HC_chr14G0028200 582.807 0.364476 0.216343 4.88953e-02 2.26907e-01
# I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be differentially expressed as they have a high log2 FC of greater than 1, which means that the statistically differentially expressed gene shows at least a two-fold change in expression. Both genes also have a low padj value which is the adjusted p-value (the FDR). Both also passed the FDR threshold which I set as 0.01 which is on the more stringent side. A FDR threshold that I sat at 0.01 means that 1% or less of these genes are expected to be false positives. Since both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 have a padj value of greater than 0.01 with high log2FoldChange (at least greater than 1), I would only consider these two genes be differentially expressed, both statistically and biologically in terms of log fold-change. I would not consider Pdac_HC_chr14G0028200 to be differentially expressed as it does not have a log2FoldChange of at least 1 and did not pass the FDC threshold.
dds <- estimateDispersions(dds, fitType="parametric")
plotDispEsts(dds)
dds <- estimateDispersions(dds, fitType="local")
plotDispEsts(dds)
dds <- estimateDispersions(dds, fitType="mean")
plotDispEsts(dds)
From the plot generated from the parametric fit, the parametric curve does seem to fit the observed dispersion mean relationship. The curve from the local fit also does seem to fit the observed dispersion mean relationship. I do see a difference among the three methods (please see the plots attached). The mean fit method does not seem provide a good fit to the dispersion-mean relationship. Thus fitType should not be changed as the default parametric fit does seem to fit the observed dispersion mean relationship the best.
The use of Salmon + tximport + DESeq2 workflow may be preferred over the STAR + htseq-count + DESeq2 workflow when conducting certain gene-level analysis (differential transcript expression) due to a number of reasons. (1) The use of transcript abundance estimation method allows for correcting potential changes in gene length across samples including the differential isoform usage problem that may cause high false positives in standard method (e.g. genes that do not differ in the amount of mRNA per cell but do differ in mixture of isoforms), (2) some tools are faster and require less resources such as disk usage and memory as with Salmon they don’t output BAM files (in the case of Salmon they used quasi-mapping where reads are mapped to transcripts in which they originate from and not to their known positions) and (3) increases sensitivity as this method allows alignment to multiple genes with homologous sequence instead of discarding these fragments.