NGS Week 9 Homework Assignment

Task 1: Create a DESeqDataSet object

Q1.1: Report the output indicating what type of object the dds variable is, how many genes, and how many samples are stored in the object? [ 1 point ]

class: DESeqDataSet 
dim: 28595 8 
#Dimension is 28595 genes by 8 samples
metadata(1): version
assays(1): counts
rownames(28595): Pdac_HC_000007FG0000100 Pdac_HC_000007FG0000200 ... Pdac_HC_chr9G0155000
  Pdac_HC_chr9G0155100
rowData names(0):
colnames(8): PDAC253.htseq_count.txt PDAC282.htseq_count.txt ... PDAC306.htseq_count.txt
  PDAC318.htseq_count.txt
colData names(1): condition

Task 2: Pre-filter low count genes, normalize counts and run DESeq

Q2.1 Enter dds at the console to summarize your object. Report the output for your answer and how many genes were retained after removing the low count genes [ 1 point ].

class: DESeqDataSet 
dim: 20029 8 
#20029 genes were retained after removing the low count genes
metadata(1): version
assays(4): counts mu H cooks
rownames(20029): Pdac_HC_000007FG0000100 Pdac_HC_000007FG0000200 ... Pdac_HC_chr9G0154800
  Pdac_HC_chr9G0155000
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(8): PDAC253.htseq_count.txt PDAC282.htseq_count.txt ... PDAC306.htseq_count.txt
  PDAC318.htseq_count.txt
colData names(2): condition sizeFactor

Task 3: Analyze DGE data

####Q3.1: Include the hierarchical clustering result, your plotPCA command, and the PCA plot for you answer [ 1 point ].

Please see attached image for the hierarchical clustering result and PCA plot. PlotPCA command:

rld <- rlog(dds)
dists <- dist(t(assay(rld)))
plot(hclust(dists))
plotPCA(rld)

Q3.2a: Do the samples cluster by sugar composition phenotype in the hierarchical clustering? Explain.

No, hierarchical clustering results are used to indicate the distances between sample gene expression profiles and not used for statistical analysis purposes.

Q3.2b: Does the PCA separate samples by sugar composition? If so, on which axis? Select the single best answer.

Yes. The PCA separates points on both PC1 and PC2 axes.

Q3.2c: In many contexts, the treatment will induce large effects on the transcriptome that often cause biological replicates from the same treatment to cluster together. Another factor that could cause samples to cluster together is batch effects. Which of the following are potential sources of batch effects in RNA-seq analysis?

f.all of the above

Q3.2d: Without any additional information about how the experiment was conducted, do you see any evidence of a batch effect? Why or why not?

Yes, I do see evidence of a batch effect which is shown in one of the points (outlier) that belongs to the lowSucrose group. Three of the lowSucrose samples were clustered together on the bottom while one of the samples appear on the top right corner which may be as a result of a batch effect. However, more information such as the experimental design and how the experiment was conducted is needed to indicate if there is truly a batch effect.

Q3.3: Read the section “More information on results columns” of the DESeq2 vignette and describe three quality control steps that are automatically conducted by DESeq to drop from consideration genes with suspect, or problematic, p-values How many genes were impacted by the default independent filtering? [ 1 point ].

Quality control steps automatically conducted by DESeq: 1. If within a row all samples have zero counts, then the baseMean column will be set to zero and the log2 fold change estimates, p value and adjusted p values will all be set to NA 2. If a row contains samples with extreme count outlier then the p value and adjsuted p value will all be set to NA. The outlier counts are detected by the Cook’s distance. 3. If a row is filtered by automatic independent filtering for having a low mean normalized count, then the adjusted p value will be set to NA

These are the three quality control steps that were described in the vignette on DESeq2 which detects and drops genes of concern or genes whose p values may be problematic from analysis based on the discussed quality control steps that are automatically conducted based on the criteria. According to the summary report as shown below, there are 199 outliers and 2709 low counts which identifies the number of genes that were impacted by default independent filtering.

summary(resOrdered)

out of 20029 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 857, 4.3%
LFC < 0 (down)     : 960, 4.8%
outliers [1]       : 199, 0.99%
low counts [2]     : 2709, 14%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Q3.4a: Report the table of results for the three candidate genes. For each of the candidate genes, which sugar-type (high sucrose or low sucrose) has higher expression? What is the fold-change expressed on the decimal scale (convert from log2 to linear scale)? [ 1 point ]

                       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
                      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Pdac_HC_chr14G0023100  2915.858       5.653391  0.630509   8.96639 3.06399e-19 2.62293e-15
Pdac_HC_chr14G0022900  2839.320       9.978814  1.205430   8.27822 1.25031e-16 4.28130e-13
Pdac_HC_chr14G0028200   590.073       0.429695  0.249691   1.72091 8.52678e-02 2.95845e-01

#For each gene: Pdac_HC_chr14G0022900, Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0028200, low sucrose has higher expression compared to high sucrose since the log2fold change values are greater than 1 and the numerator is lowsucrose while the denominator is highsucrose. The fold-change expressed on the decimal scale:

Pdac_HC_chr14G0023100 2^5.653391= 50.33155
Pdac_HC_chr14G0022900 2^9.978814= 1009.072
Pdac_HC_chr14G0028200 2^0.429695= 1.346949

# Fold-change expressed on the decimal scale for Pdac_HC_chr14G0023100 is 50.33155, for Pdac_HC_chr14G0022900 is 1009.072 and for Pdac_HC_chr14G0028200 is 1.346949.

Q3.4b: Sometimes its useful to report the normalized counts for each gene as a figure or table. For your answer, report the normalized counts for each candidate gene [1 point]

#For Pdac_HC_chr14G002820 gene:
                          count   condition
PDAC253.htseq_count.txt 365.4494 highSucrose
PDAC282.htseq_count.txt 544.6490 highSucrose
PDAC286.htseq_count.txt 624.9751 highSucrose
PDAC316.htseq_count.txt 475.6036 highSucrose
PDAC266.htseq_count.txt 727.8490  lowSucrose
PDAC273.htseq_count.txt 524.2659  lowSucrose
PDAC306.htseq_count.txt 701.2766  lowSucrose
PDAC318.htseq_count.txt 760.5131  lowSucrose

#For Pdac_HC_chr14G0023100 gene:
                             count   condition
PDAC253.htseq_count.txt   22.43685 highSucrose
PDAC282.htseq_count.txt  106.79149 highSucrose
PDAC286.htseq_count.txt  269.12122 highSucrose
PDAC316.htseq_count.txt   57.73196 highSucrose
PDAC266.htseq_count.txt 9356.85264  lowSucrose
PDAC273.htseq_count.txt 6044.40094  lowSucrose
PDAC306.htseq_count.txt 2713.47034  lowSucrose
PDAC318.htseq_count.txt 4760.05711  lowSucrose

#For Pdac_HC_chr14G0022900 gene:
                              count   condition
PDAC253.htseq_count.txt    2.494259 highSucrose
PDAC282.htseq_count.txt    2.086440 highSucrose
PDAC286.htseq_count.txt   19.885037 highSucrose
PDAC316.htseq_count.txt    0.500000 highSucrose
PDAC266.htseq_count.txt 9441.867455  lowSucrose
PDAC273.htseq_count.txt 6666.762989  lowSucrose
PDAC306.htseq_count.txt 2941.984454  lowSucrose
PDAC318.htseq_count.txt 3642.977490  lowSucrose

Q3.5: Use the plotMA function to generate MA plots for the res and res.shrunk objects. Report the plots in your assignment document and answer the question, why is it appropriate to report the shrunken estimates? [ 1 point ].

Please see attached image for the plots. It is appropriate to report the shrunken estimates as DESeq2 reported log2 fold changes are often over-estimated or show higher variability specifically for low-expression genes. By shrinking the raw fold change estimates or shrunken estimates, this will generate a better or more accurate estimate of the log2 fold-change for analysis and make respective conclusion from.

Task 4: The multiple testing problem

Q4.1a: Report your results from the lfcShrink output table for the candidate genes.

res.shrunkOrdered[ row.names(resOrdered) %in% c('Pdac_HC_chr14G0022900','Pdac_HC_chr14G0023100','Pdac_HC_chr14G0028200'), ]

log2 fold change (MMSE): condition lowSucrose vs highSucrose 
Wald test p-value: condition lowSucrose vs highSucrose 
DataFrame with 3 rows and 5 columns
                       baseMean log2FoldChange     lfcSE      pvalue        padj
                      <numeric>      <numeric> <numeric>   <numeric>   <numeric>
Pdac_HC_chr14G0023100  2915.858       5.372701  0.629431 3.06399e-19 2.62293e-15
Pdac_HC_chr14G0022900  2839.320       8.824298  1.168327 1.25031e-16 4.28130e-13
Pdac_HC_chr14G0028200   590.073       0.324961  0.217307 8.52678e-02 2.95845e-01

Q4.1b: Statisticians make the distinction between statistically signficant and biologically significant. Using a criterion that a statistically differentially expressed gene must also show at least a two-fold change in expression (on linear scale) to be biologically meaningful, which genes do you consider to be differentially expressed? Please specify the gene(s), the FDR threshold you applied.

I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be differentially expressed as they have a high log2 FC of greater than 1, which means that the statistically differentially expressed gene shows at least a two-fold change in expression. Both genes also have a low padj value which is the adjusted p-value (the FDR). In terms of false positives, since I want to select genes that are statistically significant and biologically significant, I want to be more stringent in terms of choosing the FDR cutoff parameter as we are determining the candidate genes (Q4.1c) that could be responsible for the sugar composition trait on the chromosome 14 sugar QTL based on the differential gene expression result therefore the FDR threshold I applied is 0.01. A FDR threshold at 0.01 means that 1% or less of these genes are expected to be false positives. Both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes have a padj of greater than 0.01 and since they have a log2FC of greater than 1, I would only consider these two genes to be differentially expressed and not Pdac_HC_chr14G0028200 which have a log2 FC of less than 1 and did not pass the padj value threshold.

Q4.1c: Which of the candidate genes do you think could be responsible for the sugar composition trait on the chromosome 14 sugar QTL?

I would consider both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes to be responsible for the sugar composition trait on the chromosome 14 sugar QTL as they are statistically and biologically differentially expressed genes with high log2 FC value and passes the set padj threshold. If I have to choose between the two genes, I would choose Pdac_HC_chr14G0022900 gene because it has a much higher log2 FC value (8.824298) compared to Pdac_HC_chr14G0023100, showing 453.2923 fold change in expression on the linear scale. But both Pdac_HC_chr14G0023100 and Pdac_HC_chr14G0022900 genes seem to be responsible for the sugar composition trait.