TRGN 510 final project

Huizi Gao

12/9/2021

Using DeSEQ2 to analyze and compare the gene expression differences in lung cancer patients with and without smoking history

Introduction:

Smoking is harmful to health and an important factor inducing cancer, so I want to explore the influence of smoking on gene expression of lung cancer patients through this analysis. In this analysis, I will look for differences in gene expression of lung cancer at stage ib between the patient groups with and without  smoking history. There are 168 files in total including 60 samples of patients who have the smoking history and 108 samples of patients who do not smoke. In order to control the same number of  samples in two groups to analyze, I selected 30 samples separately from these two groups. DeSeq2 is used in this process. 

Data:

All the datasets used in this vignette are from TCGA/GDC Data Portal (https://portal.gdc.cancer.gov)

Primary Site of Cancer: Bonchus and Lung

Program: TCGA

Project: TCGA-LUAD

Vital Status: Alive 

Experimental Strategy: RNA-seq

Workflow Type: HTSeq-Counts

After filtering, a total of 168 samples emerged, including 60 samples of patients with a history of smoking and labeled with “SK” and 108 samples of patients without a history of smoking and labeled with “NSK.” I took 30 samples separately from the two groups for analysis. 

Step 1: data input and construct a DESeqDataSet

#if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
#     install.packages("apeglm")
#     BiocManager::install("BiocParallel")
#     BiocManager::install("IHW")                 
library(apeglm)
library(BiocParallel)
## Warning: package 'BiocParallel' was built under R version 4.1.2
library(IHW)                    
#set up: install and load the packages required.
directory <- "/Users/gaohuizi/Desktop/510-HFP/twosmallgroup" 
# assign the path of my htseq-count samples to "directory" for convinience 
setwd(directory) 
getwd()     
## [1] "/Users/gaohuizi/Desktop/510-HFP/twosmallgroup"
# make sure that the current working directory is where the files located
sampleFiles <- grep("SK",list.files(directory),value=TRUE)
# specify files and read in
sampleFiles 
##  [1] "NSK1"  "NSK10" "NSK11" "NSK12" "NSK13" "NSK14" "NSK15" "NSK16" "NSK17"
## [10] "NSK18" "NSK19" "NSK2"  "NSK20" "NSK21" "NSK22" "NSK23" "NSK24" "NSK25"
## [19] "NSK26" "NSK27" "NSK28" "NSK29" "NSK3"  "NSK30" "NSK4"  "NSK5"  "NSK6" 
## [28] "NSK7"  "NSK8"  "NSK9"  "SK1"   "SK10"  "SK11"  "SK12"  "SK13"  "SK14" 
## [37] "SK15"  "SK16"  "SK17"  "SK18"  "SK19"  "SK2"   "SK20"  "SK21"  "SK22" 
## [46] "SK23"  "SK24"  "SK25"  "SK26"  "SK27"  "SK28"  "SK29"  "SK3"   "SK30" 
## [55] "SK4"   "SK5"   "SK6"   "SK7"   "SK8"   "SK9"
#to check if the "sampleFiles" is correct
sampleCondition <- sub("(.*SK).*","\\1",sampleFiles) 
# abtain condition status  
sampleCondition
##  [1] "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK"
## [13] "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "NSK"
## [25] "NSK" "NSK" "NSK" "NSK" "NSK" "NSK" "SK"  "SK"  "SK"  "SK"  "SK"  "SK" 
## [37] "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK" 
## [49] "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"  "SK"
# to check if the condition is correct
sampleTable <- data.frame(sampleName = sampleFiles,
                          fileName = sampleFiles,
                          condition = sampleCondition)
head(sampleTable,6)
##   sampleName fileName condition
## 1       NSK1     NSK1       NSK
## 2      NSK10    NSK10       NSK
## 3      NSK11    NSK11       NSK
## 4      NSK12    NSK12       NSK
## 5      NSK13    NSK13       NSK
## 6      NSK14    NSK14       NSK
# to check "sampleTable"
sampleTable$condition <- factor(sampleTable$condition)
# The data of the condition column in SampleTable are organized and stored as data objects of different levels
sampleTable$condition
##  [1] NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK
## [20] NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK SK  SK  SK  SK  SK  SK  SK  SK 
## [39] SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK 
## [58] SK  SK  SK 
## Levels: NSK SK
# to check if the levels are what I need
library("DESeq2")
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.1.2
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Warning: package 'GenomicRanges' was built under R version 4.1.2
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
#Htseq-count files are used in my analysis so I can use DESeqDataSetFromHTSeq() here.
                                       directory = directory,
                                       design= ~ condition)
ddsHTSeq
## class: DESeqDataSet 
## dim: 60483 60 
## metadata(1): version
## assays(1): counts
## rownames(60483): ENSG00000000003.13 ENSG00000000005.5 ...
##   ENSGR0000280767.1 ENSGR0000281849.1
## rowData names(0):
## colnames(60): NSK1 NSK10 ... SK8 SK9
## colData names(1): condition
# build up the DESeqDataSet
ddsHTSeq$condition <- factor(ddsHTSeq$condition, levels = c("NSK","SK"))
ddsHTSeq$condition
##  [1] NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK
## [20] NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK NSK SK  SK  SK  SK  SK  SK  SK  SK 
## [39] SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK  SK 
## [58] SK  SK  SK 
## Levels: NSK SK
#reset the factor levels, to tell DESeq2 which level I want to compare against

Step 2: gene differential expression analysis

ddsHTSeq <- DESeq(ddsHTSeq)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 8687 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(ddsHTSeq)
# extract the result table from the differential analysis above 
res
## log2 fold change (MLE): condition SK vs NSK 
## Wald test p-value: condition SK vs NSK 
## DataFrame with 60483 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.13 3125.04995      -0.101553  0.203158 -0.499871 0.6171660
## ENSG00000000005.5     3.02716       0.107599  0.550764  0.195363 0.8451088
## ENSG00000000419.11 1483.19336      -0.242811  0.127203 -1.908843 0.0562824
## ENSG00000000457.12  890.18941       0.155545  0.127797  1.217130 0.2235548
## ENSG00000000460.15  441.31424       0.161581  0.247103  0.653902 0.5131749
## ...                       ...            ...       ...       ...       ...
## ENSGR0000275287.3           0             NA        NA        NA        NA
## ENSGR0000276543.3           0             NA        NA        NA        NA
## ENSGR0000277120.3           0             NA        NA        NA        NA
## ENSGR0000280767.1           0             NA        NA        NA        NA
## ENSGR0000281849.1           0             NA        NA        NA        NA
##                         padj
##                    <numeric>
## ENSG00000000003.13  0.852917
## ENSG00000000005.5   0.947795
## ENSG00000000419.11  0.328078
## ENSG00000000457.12  0.583603
## ENSG00000000460.15  0.799798
## ...                      ...
## ENSGR0000275287.3         NA
## ENSGR0000276543.3         NA
## ENSGR0000277120.3         NA
## ENSGR0000280767.1         NA
## ENSGR0000281849.1         NA
# to check the result table
res <- results(ddsHTSeq, contrast=c("condition","SK","NSK"))
res
## log2 fold change (MLE): condition SK vs NSK 
## Wald test p-value: condition SK vs NSK 
## DataFrame with 60483 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.13 3125.04995      -0.101553  0.203158 -0.499871 0.6171660
## ENSG00000000005.5     3.02716       0.107599  0.550764  0.195363 0.8451088
## ENSG00000000419.11 1483.19336      -0.242811  0.127203 -1.908843 0.0562824
## ENSG00000000457.12  890.18941       0.155545  0.127797  1.217130 0.2235548
## ENSG00000000460.15  441.31424       0.161581  0.247103  0.653902 0.5131749
## ...                       ...            ...       ...       ...       ...
## ENSGR0000275287.3           0             NA        NA        NA        NA
## ENSGR0000276543.3           0             NA        NA        NA        NA
## ENSGR0000277120.3           0             NA        NA        NA        NA
## ENSGR0000280767.1           0             NA        NA        NA        NA
## ENSGR0000281849.1           0             NA        NA        NA        NA
##                         padj
##                    <numeric>
## ENSG00000000003.13  0.852917
## ENSG00000000005.5   0.947795
## ENSG00000000419.11  0.328078
## ENSG00000000457.12  0.583603
## ENSG00000000460.15  0.799798
## ...                      ...
## ENSGR0000275287.3         NA
## ENSGR0000276543.3         NA
## ENSGR0000277120.3         NA
## ENSGR0000280767.1         NA
## ENSGR0000281849.1         NA

Step 3: change shrinkage for visualization and ranking

resultsNames(ddsHTSeq)
## [1] "Intercept"           "condition_SK_vs_NSK"
#get the number of coefficient
resLFC <- lfcShrink(ddsHTSeq, coef="condition_SK_vs_NSK", type="apeglm") 
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
## log2 fold change (MAP): condition SK vs NSK 
## Wald test p-value: condition SK vs NSK 
## DataFrame with 60483 rows and 5 columns
##                      baseMean log2FoldChange      lfcSE    pvalue      padj
##                     <numeric>      <numeric>  <numeric> <numeric> <numeric>
## ENSG00000000003.13 3125.04995   -7.92388e-06 0.00144267 0.6171660  0.852917
## ENSG00000000005.5     3.02716   -6.01038e-06 0.00144270 0.8451088  0.947795
## ENSG00000000419.11 1483.19336   -7.02132e-05 0.00144346 0.0562824  0.328078
## ENSG00000000457.12  890.18941    1.41207e-05 0.00144264 0.2235548  0.583603
## ENSG00000000460.15  441.31424    2.83565e-06 0.00144267 0.5131749  0.799798
## ...                       ...            ...        ...       ...       ...
## ENSGR0000275287.3           0             NA         NA        NA        NA
## ENSGR0000276543.3           0             NA         NA        NA        NA
## ENSGR0000277120.3           0             NA         NA        NA        NA
## ENSGR0000280767.1           0             NA         NA        NA        NA
## ENSGR0000281849.1           0             NA         NA        NA        NA
#change shrinkage and check
#library("BiocParallel")
#register(MulticoreParam(4))
#parallelization for speed-up should be done at the beginning of the analysis if necessary

Step 4: P-value adjustment

resOrdered <- res[order(res$pvalue),]
#order the result table 
summary(res)
## 
## out of 55653 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1339, 2.4%
## LFC < 0 (down)     : 322, 0.58%
## outliers [1]       : 0, 0%
## low counts [2]     : 19393, 35%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#summarize the basic tallies 
sum(res$padj < 0.1, na.rm=TRUE)
## [1] 1661
# count how many <0.1 
res05 <- results(ddsHTSeq, alpha=0.05)
summary(res05)
## 
## out of 55653 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 760, 1.4%
## LFC < 0 (down)     : 175, 0.31%
## outliers [1]       : 0, 0%
## low counts [2]     : 20467, 37%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
sum(res05$padj < 0.05, na.rm=TRUE)
## [1] 935

Step 5: independent hypothesis weighting

library("IHW")
resIHW <- results(ddsHTSeq, filterFun=ihw)
summary(resIHW)
## 
## out of 55653 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1696, 3%
## LFC < 0 (down)     : 329, 0.59%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
sum(resIHW$padj < 0.1, na.rm=TRUE)
## [1] 2025
metadata(resIHW)$ihwResult
## ihwResult object with 60483 hypothesis tests 
## Nominal FDR control level: 0.1 
## Split into 37 bins, based on an ordinal covariate

Step 6: explore and export results

plotMA(res, ylim=c(-5,5))

#plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet

The lower left or upper right of the dots represent the more abundant and variable genes

plotMA(resLFC, ylim=c(-5,5))

#the MA-plot for the shrunken log2 fold changes,removing the noise associated with log2 fold changes from low count genes

Eliminate noise, more intuitive

#idx <- identify(res$baseMean, res$log2FoldChange)
#rownames(res)[idx]

step 7: alternative shrinkage estimators

resultsNames(ddsHTSeq)
## [1] "Intercept"           "condition_SK_vs_NSK"
resNorm <- lfcShrink(ddsHTSeq, coef=2, type="normal")
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
resAsh <- lfcShrink(ddsHTSeq, coef=2, type="ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041
# we have two coefficients
par(mfrow=c(1,3), mar=c(4,4,2,1))
xlim <- c(1,1e5); ylim <- c(-3,3)
plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm")
plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal")
plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr")

The noise in apeglm is lowest, which is better.

Step 8: plot counts

plotCounts(ddsHTSeq, gene=which.min(res$padj), intgroup="condition")

Specify the gene which had the smallest p value from the results table created the plot counts. it is ENSG00000075043.16, it is downregulated

d <- plotCounts(ddsHTSeq, gene=which.min(res$padj), intgroup="condition", 
                returnData=TRUE)
#the function should only return a data.frame for plotting with ggplot.
library("ggplot2")
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:IHW':
## 
##     alpha
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))

Specify the gene which had the smallest p value from the results table created the plot counts. it is ENSG00000075043.16, it is downregulated.

mcols(res)$description
## [1] "mean of normalized counts for all samples"  
## [2] "log2 fold change (MLE): condition SK vs NSK"
## [3] "standard error: condition SK vs NSK"        
## [4] "Wald statistic: condition SK vs NSK"        
## [5] "Wald test p-value: condition SK vs NSK"     
## [6] "BH adjusted p-values"
#which variables and tests were used can be found by calling the function mcols on the results object

Step 9: exporting results to CSV files

resSig <- subset(resOrdered, padj < 0.1)
resSig
## log2 fold change (MLE): condition SK vs NSK 
## Wald test p-value: condition SK vs NSK 
## DataFrame with 1661 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat      pvalue
##                     <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000075043.16   116.9112       -5.20579  0.741800  -7.01778 2.25419e-12
## ENSG00000198028.3     24.7702       -5.45361  0.797955  -6.83449 8.22998e-12
## ENSG00000198883.10    52.0476        4.60650  0.689186   6.68397 2.32549e-11
## ENSG00000202538.1    780.1032        5.94446  0.911730   6.51997 7.03198e-11
## ENSG00000108244.15   680.8237       -4.27175  0.655880  -6.51300 7.36639e-11
## ...                       ...            ...       ...       ...         ...
## ENSG00000275833.1     4.78618       1.152818  0.406263   2.83762  0.00454519
## ENSG00000112992.15 3674.25885       0.476019  0.167779   2.83718  0.00455146
## ENSG00000241048.1     2.23282       3.208083  1.130780   2.83705  0.00455322
## ENSG00000204021.3    14.69839       2.299613  0.810568   2.83704  0.00455340
## ENSG00000259055.1     1.88827       3.144356  1.108699   2.83608  0.00456715
##                           padj
##                      <numeric>
## ENSG00000075043.16 8.18723e-08
## ENSG00000198028.3  1.49456e-07
## ENSG00000198883.10 2.81540e-07
## ENSG00000202538.1  4.62892e-07
## ENSG00000108244.15 4.62892e-07
## ...                        ...
## ENSG00000275833.1    0.0996263
## ENSG00000112992.15   0.0996263
## ENSG00000241048.1    0.0996263
## ENSG00000204021.3    0.0996263
## ENSG00000259055.1    0.0998668
#Exporting only the results which pass an adjusted p value threshold
write.csv(as.data.frame(resSig), 
          file="condition_results.csv")

Step 10: extracting transformed values

vsd <- vst(ddsHTSeq, blind=FALSE)
head(assay(vsd), 3)
##                         NSK1     NSK10     NSK11     NSK12     NSK13     NSK14
## ENSG00000000003.13 10.629839 10.246565 10.653592 11.490291  9.931235 12.307884
## ENSG00000000005.5   4.055727  4.055727  4.558601  4.381082  4.505090  4.055727
## ENSG00000000419.11 11.041604 10.244680 11.091010 11.084157 10.616650 11.720351
##                        NSK15    NSK16     NSK17     NSK18     NSK19      NSK2
## ENSG00000000003.13 10.814780 12.20348 12.294173 12.660084 10.911129 12.985426
## ENSG00000000005.5   4.553865  8.08755  5.078974  4.513144  4.055727  5.171051
## ENSG00000000419.11 10.764180 10.32809 10.505190 10.454354 10.509104 10.827314
##                        NSK20     NSK21     NSK22     NSK23     NSK24     NSK25
## ENSG00000000003.13 11.433397 10.383498 11.410823 11.499521 11.739228 11.918269
## ENSG00000000005.5   4.055727  5.388742  4.831131  4.345698  4.055727  4.438132
## ENSG00000000419.11 10.244262 10.082618 10.266306 10.633186 10.127650 11.055699
##                        NSK26     NSK27     NSK28     NSK29      NSK3     NSK30
## ENSG00000000003.13 12.541064 10.248761 10.659039 11.279465 10.787505 12.008173
## ENSG00000000005.5   4.505203  5.066085  4.983873  4.492204  5.055465  4.481925
## ENSG00000000419.11 10.978442 10.240340 10.653378 10.328885 10.412738 10.727329
##                         NSK4      NSK5      NSK6      NSK7      NSK8      NSK9
## ENSG00000000003.13 11.628547 12.203363 12.146297 11.096884 11.496832 12.284463
## ENSG00000000005.5   4.055727  4.418644  4.769275  4.308518  4.055727  4.055727
## ENSG00000000419.11  9.530335 11.012188 10.676569  9.901513 11.412967 10.796537
##                          SK1     SK10      SK11      SK12      SK13      SK14
## ENSG00000000003.13 10.465428 12.67751 10.860296 11.350796 12.487648 11.643013
## ENSG00000000005.5   5.590728  4.62645  4.055727  4.564486  4.055727  4.767747
## ENSG00000000419.11 10.207398 10.21258 10.343061  9.546005 11.315880  9.869542
##                         SK15      SK16      SK17      SK18      SK19       SK2
## ENSG00000000003.13 10.981093 11.897893 10.191228 11.627883  9.979860 11.658049
## ENSG00000000005.5   4.327786  4.055727  4.055727  4.899873  4.424726  4.055727
## ENSG00000000419.11 10.996255 10.771323  9.388993 10.631706 10.060417  9.652692
##                         SK20      SK21      SK22      SK23      SK24      SK25
## ENSG00000000003.13 10.191619 11.890599 11.701638 11.215111 11.788175 11.061951
## ENSG00000000005.5   4.691715  4.847086  4.316406  4.347482  4.055727  4.055727
## ENSG00000000419.11 10.129510 10.924587 10.267262 10.227988 11.002766 10.388950
##                         SK26      SK27      SK28      SK29       SK3      SK30
## ENSG00000000003.13 11.241665 10.915631 11.718055 11.051335 11.762035 11.023705
## ENSG00000000005.5   4.486288  4.055727  4.055727  4.055727  4.762194  5.512894
## ENSG00000000419.11  9.721033 10.960501 10.463742 10.688946 10.753419  9.923269
##                          SK4       SK5       SK6       SK7       SK8       SK9
## ENSG00000000003.13 10.443595 10.213536 11.451066 12.290541 12.895637 12.527255
## ENSG00000000005.5   4.614486  4.985174  4.580513  5.156962  4.370546  4.674567
## ENSG00000000419.11 10.467602 10.020022 10.346851 10.597663 10.998154 10.095902
#extract the matrix of normalized values.
BiocManager::install("vsn")
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.1 (2021-08-10)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'vsn'

Step 11: effects of transformations on the variance

ntd <- normTransform(ddsHTSeq)
library("vsn")
meanSdPlot(assay(ntd))

meanSdPlot(assay(vsd))

The standard deviation of the transformed data, across samples, against the mean, using the shifted logarithm transformation

Step 12: heatmap of the count matrix

#install.packages("plyr")
#library("plyr")
#install.packages("stringr")
library("pheatmap")
library("stringr")

select <- order(rowMeans(counts(ddsHTSeq,normalized=TRUE)),
                decreasing=TRUE)[1:20]
#All genes were sorted according to their expression levels from high to low, and the top 20 are selected
df <- as.data.frame(colData(ddsHTSeq)[,c("condition")])
colnames(assay(vsd)) <- str_sub(colnames(assay(vsd)), 1, 21)
rownames(df) <- colnames(assay(vsd))
pheatmap(assay(vsd)[select,],cellwidth = 3,cellheight = 8,fontsize_row = 9,fontsize_col =4, annotation_col=df)

The 20 genes with the highest expression level in the two groups were selected to draw heat maps, with blue representing the samples with relatively low expression level, red representing the samples with relatively high expression level, and light red, light blue and white representing the genes with intermediate expression level. The greater the difference in the red and blue distribution of a certain gene between the two groups, indicating that there was a significant difference in the expression of this gene between the two groups.

Step 13: heatmap of the sample-to-sample distances

sampleDists <- dist(t(assay(vsd)))
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         fontsize_row = 5,
         col=colors)

#I apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances

A heatmap of the distance matrix, which gives similarities and dissimilarities between samples.

Step 14: principal component plot of the samples

plotPCA(vsd, intgroup=c("condition"))

#plot PCA graph

PCA can distinguish the control group from the experimental group. In PCA, if the samples are clustered together, the difference between these samples is small; on the contrary, the farther the distance between samples is, the greater the difference between samples is.

Conclusion:

  1. I deleted the Cluster_rows/cols arguments, so the heat map automatically aggregates the most relevant samples based on the correlation of genes and samples. From the analysis results of the heat map, it can be seen that the horizontal color contrast of PGC, SFTPC, SFTPA2 and SFTPA1 genes is obvious, and their expression levels vary greatly in different samples. Sixteen samples from NSK10 to SK11 are aggregated together due to similarity, and the four genes discussed above shows high expression consistently. However, SK samples and NSK samples accounted for half and half, so the expression patterns of the four genes were similar in the samples within this range. In a total of 13 samples from SK7 to SK1, the above four genes showS extremely low expression, but the number of SK samples and NSK samples was also close to 1:1, so the expression patterns of these four genes were relatively consistent in the two groups of samples within this interval. From SK17 to SK26, the samples with very low gene expression level are NSK type, and the samples with significantly high gene expression level are SK type. Compared with samples from NSK11 to NSK7, the expression levels of the four genes are higher in SK17 to SK30 regions with more SK samples. In summary, it can be concluded that there are a few differences that is not so obvious in the expression of these four genes in a certain sample interval. Low credibility of the results. The heatmap confirms this conclusion as well. The upregulation of SFTPC gene is associated with the progression of lung cancer, and some variation of SFTPA2 gene is associated with idiopathic pulmonary fibrosis. SFTPA1 is a classic biomarker of alveolar cell canceration, and PGC gene is a muscle fiber specific related gene, which seems not to be a typical lung cancer related gene.

  2. The above results may be due to data selection problems. The sample size of this analysis was too small and the statistical results were not representative, so significant differences in gene expression could not be observed. It is also possible that mutations associated with smoking were not among the genes studied in the analysis.

  3. In the PCA diagram, SK group and NSK group have a high degree of overlap, except for a region in the upper left corner where SK samples have obvious aggregation. Therefore, the available genes have no obvious differential expression in most samples, and only a few samples shows differences between smokers and non-smokers, which is consistent with the conclusion obtained in the heat map. The difference in the upper left corner of PCA does not exclude the possibility of systematic error.

Known issues:

  1. In this analysis, all data are manually downloaded, integrated, labeled and decompressed, which is a tedious and error-prone process. When I tried to write codes to implement the process from data download to decompression, I got bugs that are hard to fix. If the fix later succeeds, I might reintroduce the correct code into the analysis.
  2. The sample size of data for this analysis is too small. The original sample size was 168, but for the sake of analysis efficiency and control that the two groups had the same sample size, I reduced the sample size of each group to 30. But 168 samples of data can still be analyzed by my Vignette.
  3. In my Vignette, I omitted the pre-filting step, which was skipped because it was more error-prone and didn’t significantly improve the speed of the code.
  4. Co-variates are not taken into account in this analysis, because there is no factor data type that can be used in addition to condition in the sample information stored in Coldata, so multi-factor analysis cannot be carried out. The solution may be to re-select the analyzed data and check if paired -end/single read information is included.
  5. I used the webpage to change the differentially expressed genes into Hugo name, actually it would be better to use the code of biomart package.

Reference:

  1. Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8

  2. Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 10.1093/bioinformatics/bty895

    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2. 10.1093/biostatistics/kxw041