class: DESeqDataSet
dim: 28595 8
#Dimension is 28595 genes by 8 samples
metadata(1): version
assays(1): counts
rownames(28595): Pdac_HC_000007FG0000100 Pdac_HC_000007FG0000200 ... Pdac_HC_chr9G0155000
Pdac_HC_chr9G0155100
rowData names(0):
colnames(8): PDAC253.htseq_count.txt PDAC282.htseq_count.txt ... PDAC306.htseq_count.txt
PDAC318.htseq_count.txt
colData names(1): condition
class: DESeqDataSet
dim: 20029 8
#20029 genes were retained after removing the low count genes
metadata(1): version
assays(4): counts mu H cooks
rownames(20029): Pdac_HC_000007FG0000100 Pdac_HC_000007FG0000200 ... Pdac_HC_chr9G0154800
Pdac_HC_chr9G0155000
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(8): PDAC253.htseq_count.txt PDAC282.htseq_count.txt ... PDAC306.htseq_count.txt
PDAC318.htseq_count.txt
colData names(2): condition sizeFactor
####Q3.1: Include the hierarchical clustering result, your plotPCA command, and the PCA plot for you answer [ 1 point ].
Please see attached image for the hierarchical clustering result and PCA plot. PlotPCA command:
rld <- rlog(dds)
dists <- dist(t(assay(rld)))
plot(hclust(dists))
plotPCA(rld)
No, hierarchical clustering results are used to indicate the distances between sample gene expression profiles and not used for statistical analysis purposes.
f.all of the above
Yes, I do see evidence of a batch effect which is shown in one of the points (outlier) that belongs to the lowSucrose group. Three of the lowSucrose samples were clustered together on the bottom while one of the samples appear on the top right corner which may be as a result of a batch effect. However, more information such as the experimental design and how the experiment was conducted is needed to indicate if there is truly a batch effect.
Quality control steps automatically conducted by DESeq: 1. If within a row all samples have zero counts, then the baseMean column will be set to zero and the log2 fold change estimates, p value and adjusted p values will all be set to NA 2. If a row contains samples with extreme count outlier then the p value and adjsuted p value will all be set to NA. The outlier counts are detected by the Cook’s distance. 3. If a row is filtered by automatic independent filtering for having a low mean normalized count, then the adjusted p value will be set to NA
These are the three quality control steps that were described in the vignette on DESeq2 which detects and drops genes of concern or genes whose p values may be problematic from analysis based on the discussed quality control steps that are automatically conducted based on the criteria. According to the summary report as shown below, there are 199 outliers and 2709 low counts which identifies the number of genes that were impacted by default independent filtering.
summary(resOrdered)
out of 20029 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 857, 4.3%
LFC < 0 (down) : 960, 4.8%
outliers [1] : 199, 0.99%
low counts [2] : 2709, 14%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Pdac_HC_chr14G0023100 2915.858 5.653391 0.630509 8.96639 3.06399e-19 2.62293e-15
Pdac_HC_chr14G0022900 2839.320 9.978814 1.205430 8.27822 1.25031e-16 4.28130e-13
Pdac_HC_chr14G0028200 590.073 0.429695 0.249691 1.72091 8.52678e-02 2.95845e-01
#For each gene: Pdac_HC_chr14G0022900, Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0028200, low sucrose has higher expression compared to high sucrose since the log2fold change values are greater than 1 and the numerator is lowsucrose while the denominator is highsucrose. The fold-change expressed on the decimal scale:
Pdac_HC_chr14G0023100 2^5.653391= 50.33155
Pdac_HC_chr14G0022900 2^9.978814= 1009.072
Pdac_HC_chr14G0028200 2^0.429695= 1.346949
# Fold-change expressed on the decimal scale for Pdac_HC_chr14G0023100 is 50.33155, for Pdac_HC_chr14G0022900 is 1009.072 and for Pdac_HC_chr14G0028200 is 1.346949.
#For Pdac_HC_chr14G002820 gene:
count condition
PDAC253.htseq_count.txt 365.4494 highSucrose
PDAC282.htseq_count.txt 544.6490 highSucrose
PDAC286.htseq_count.txt 624.9751 highSucrose
PDAC316.htseq_count.txt 475.6036 highSucrose
PDAC266.htseq_count.txt 727.8490 lowSucrose
PDAC273.htseq_count.txt 524.2659 lowSucrose
PDAC306.htseq_count.txt 701.2766 lowSucrose
PDAC318.htseq_count.txt 760.5131 lowSucrose
#For Pdac_HC_chr14G0023100 gene:
count condition
PDAC253.htseq_count.txt 22.43685 highSucrose
PDAC282.htseq_count.txt 106.79149 highSucrose
PDAC286.htseq_count.txt 269.12122 highSucrose
PDAC316.htseq_count.txt 57.73196 highSucrose
PDAC266.htseq_count.txt 9356.85264 lowSucrose
PDAC273.htseq_count.txt 6044.40094 lowSucrose
PDAC306.htseq_count.txt 2713.47034 lowSucrose
PDAC318.htseq_count.txt 4760.05711 lowSucrose
#For Pdac_HC_chr14G0022900 gene:
count condition
PDAC253.htseq_count.txt 2.494259 highSucrose
PDAC282.htseq_count.txt 2.086440 highSucrose
PDAC286.htseq_count.txt 19.885037 highSucrose
PDAC316.htseq_count.txt 0.500000 highSucrose
PDAC266.htseq_count.txt 9441.867455 lowSucrose
PDAC273.htseq_count.txt 6666.762989 lowSucrose
PDAC306.htseq_count.txt 2941.984454 lowSucrose
PDAC318.htseq_count.txt 3642.977490 lowSucrose
Please see attached image for the plots. It is appropriate to report the shrunken estimates as DESeq2 reported log2 fold changes are often over-estimated or show higher variability specifically for low-expression genes. By shrinking the raw fold change estimates or shrunken estimates, this will generate a better or more accurate estimate of the log2 fold-change for analysis and make respective conclusion from.
res.shrunkOrdered[ row.names(resOrdered) %in% c('Pdac_HC_chr14G0022900','Pdac_HC_chr14G0023100','Pdac_HC_chr14G0028200'), ]
log2 fold change (MMSE): condition lowSucrose vs highSucrose
Wald test p-value: condition lowSucrose vs highSucrose
DataFrame with 3 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
Pdac_HC_chr14G0023100 2915.858 5.372701 0.629431 3.06399e-19 2.62293e-15
Pdac_HC_chr14G0022900 2839.320 8.824298 1.168327 1.25031e-16 4.28130e-13
Pdac_HC_chr14G0028200 590.073 0.324961 0.217307 8.52678e-02 2.95845e-01
I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be differentially expressed as they have a high log2 FC of greater than 1, which means that the statistically differentially expressed gene shows at least a two-fold change in expression. Both genes also have a low padj value which is the adjusted p-value (the FDR). In terms of false positives, since I want to select genes that are statistically significant and biologically significant, I want to be more stringent in terms of choosing the FDR cutoff parameter as we are determining the candidate genes (Q4.1c) that could be responsible for the sugar composition trait on the chromosome 14 sugar QTL based on the differential gene expression result therefore the FDR threshold I applied is 0.01. A FDR threshold at 0.01 means that 1% or less of these genes are expected to be false positives. Both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes have a padj of greater than 0.01 and since they have a log2FC of greater than 1, I would only consider these two genes to be differentially expressed and not Pdac_HC_chr14G0028200 which have a log2 FC of less than 1 and did not pass the padj value threshold.
I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be responsible for the sugar composition trait on the chromosome 14 sugar QTL as they are statistically and biologically differentially expressed genes with high log2 FC value and passes the set padj threshold. If I have to choose between the two genes, I would choose Pdac_HC_chr14G0022900 gene because it has a much higher log2 FC value (8.824298) compared to Pdac_HC_chr14G0023100, showing 453.2923 fold change in expression on the linear scale. But both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes seem to be responsible for the sugar composition trait.