Final Project

Abstract

In our lab, we are interested in one disease called CHARGE syndrome, which caused by mutating the CHD7 gene in patients. For normal cells, during the process of inducing neural crest cells(NCC) to form cranial mesenchymal cells(CMC), the shape of the nucleus of cells are changing, which may related to the genome expression reorganization. However, this nucleus shape changing does not appear in the CHARGE syndrome patient cells(G08 cell lines), and comes back in the CHD7 corrected cell lines. Therefore, we want to know whether the gene expression during the transition from NCC to CMC has any defects, by compare the RNA-seq data at 0 hours(NCC) and 96 hours(CMC) in both CHARGE patient cells, and corrected cells as control.

I am going to use STAR to align the genome. Later, use the featurecounts to get the reading counts and use DEseq2 to do the differential gene analysis. In the second part, usint Salmon to get the TPMS, and using EdgeR, DEseq2 or LMMA to analyze the TPM data.

1. Command line part

Seting up STAR

Follow the instructions to install STAR at: https://github.com/alexdobin/STAR (It may requir the right version of gcc, it didn’t work on my computer, so I did the alignment on the server.) Download the Homo sapiens GRCh38 genome at: https://support.illumina.com/sequencing/sequencing_software/igenome.html

Unzip the GRCh38

tar -vxzf Homo_sapiens_NCBI_GRCh38.tar.gz

Build a STAR index

Using your template genome DNA to make an index and help you align your RNA-seq data.

–genomeDir: the direction of your final template index.
–genomeFastaFiles: your template genome.fa file. –sjdbGTFfile:your template genes.gtf file, shouls in the anotation file. –sjdbOverhang:can be think as the maximum possible overhang for your reads. Default is 100. –runThreadN: the number of cores you want to use to run this project. My computer has 4 cores, the server has 12 cores.

STAR\ 
> --runMode genomeGenerate \
> --genomeDir /home/yuhan/Finalproject/STAR_index \
> --genomeFastaFiles /home/yuhan/Finalproject/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa \
> --sjdbGTFfile /home/yuhan/Finalproject/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf \
> --sjdbOverhang 100 \
> --runThreadN 12

Get the raw data

You can download your own data on website. For this project, I got the RNA-seq data from my labmember. He collected the cells to do RNA-seq at 8 cell types:WT NCC cells 1 and 2(duplicates), WT CMC cells 1 and 2(duplicates), G08-correacted NCC(corrected the mutant chd7 gene in G08 cells), G08-corrected CMC, G08 NCC, G08 CMC.

I upload the data to the server by using:

scp [source file] [username]@[destination server]
eg: scp /Users/lundai/Desktop/RNA-seq data/KS-1-G082-0_ATCACG_L001_R1_001.fastq.gz yuhan@trgn.bioinform.io:./Finalproject/

Align the genome

Align the genome by using the following commands. –genomeDir: the direction of your final template index.
–readFilesIn: your RNA-seq fastq zip data. –readFilesCommand gunzip -c: I used the zip data, so I need to add this command to unzip my fasq file. –outFileNamePrefix: the final output direction and the prefix you want to add to your final data, to help you distinguish the different data. –outSAMtype BAM SortedByCoordinate: if you want to have sorted BAM file as your output file. –quantMode GeneCounts: count the gene numbers –outSAMattributes: a string of desired SAM sttributes, in the order desired for the output SAM. –runThreadN: the number of cores you want to use to run this project. My computer has 4 cores, the server has 12 cores.

STAR\
> --genomeDir /home/yuhan/Finalproject/STAR_index \
> --readFilesIn /home/yuhan/Finalproject/KS-1-G082-0_ATCACG_L001_R1_001.fastq.gz \
>--readFilesCommand gunzip -c \
>--outFileNamePrefix /home/yuhan/Finalproject/Datacomparation/STAR_mapping/G08-2 \
> --outSAMtype BAM SortedByCoordinate \
>--quantMode GeneCounts \
>--outSAMattributes AS nM jI jM \
> --runThreadN 8

Download the aligned BAM files

Using the following code to download the SAM and BAM files.

scp [username]@[destination server]:[source file] [directory]
eg: scp yuhan@trgn.bioinform.io:./Finalproject/Datacomparation/G08-1Aligned.out.sam /Users/lundai/Desktop/RNA-seq-compare

2.R data analyzation

Now move to R and do the data analyzation

Set up libraries and directory

library(Rsamtools)

## Loading required package: GenomeInfoDb

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: Biostrings

## Loading required package: XVector

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

library(Rsubread)
library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:XVector':
## 
##     slice

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union

## The following object is masked from 'package:XVector':
## 
##     slice

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:S4Vectors':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

library(DESeq2)

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## The following object is masked from 'package:dplyr':
## 
##     count

## Loading required package: BiocParallel

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:Biostrings':
## 
##     type

## The following objects are masked from 'package:base':
## 
##     aperm, apply

library(apeglm)
library(ggrepel)
library(EnhancedVolcano)

## Warning: package 'EnhancedVolcano' was built under R version 3.6.0

library(pheatmap)
library('RColorBrewer')
library(vsn)
setwd("/Users/lundai/Desktop/RNA-seq-compare")

featurecounts

Using the Featurecounts to count the reads of the RNA-seq data.

RNA_data <- featureCounts(files = c("WT-NCC-1.bam","WT-NCC-2.bam","G08-corr-NCC.bam","G08-NCC.bam","WT-CMC-1.bam","WT-CMC-2.bam","G08-corr-CMC.bam","G08-CMC.bam"),
                          #annotation
                          annot.inbuilt = "mm10",
                          annot.ext = "genes.gtf",
                          isGTFAnnotationFile = TRUE, 
                          GTF.featureType = "exon",
                          GTF.attrType = "gene_name",
                          chrAliases = NULL,
                          # level of summarization
                          useMetaFeatures=TRUE)

For some sample, the alignments are really small. Why this happen?

input the tables and creat the Coltable for doing the DESeq analysis.

After align the data, I saved them as a csv file and than read them. I changed tthe title to make them more understandable. Also created a Col file which contains the name of the samples and the condition of the samples such as “treatment” and “control” et al.

write.table(x=data.frame(RNA_data$annotation[,c("GeneID")],(RNA_data$counts + 1),stringsAsFactors=FALSE),file="counts.csv",quote=FALSE,sep="\t",row.names=FALSE)

##              Condition
## WT_NCC_1      control1
## WT_NCC_2      control1
## G08_corr_NCC  control2
## G08_NCC       control3
## WT_CMC_1         test1
## WT_CMC_2         test1

DEseq2

After the feature count, I use the DEseq to help me analyze the data to caculate the fold changes and p values, you can find the manual of DESeq by typing:

vignette("DESeq2")

Establish the DESeq Data Set

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = Col,
                              design = ~ Condition)
#some values in assay are negative???Didn't find any nagetive data
#check: summary(is.numeric(countdata[,1])), logical,solution:add row.name=1

Differential expression analysis

dds <- DESeq(dds)
#Error in checkForExperimentalReplicates(object, modelMatrix) : 
# The design matrix has the same number of samples and coefficients to fit,so estimation of dispersion is not possible. Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.

Solution: Install the DESeq 1.20.0 version. This time you will get an warning message, but you will still get your results. Remember to restart the R after install the new version.

dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

res <- results(dds)
head(res,3)

## log2 fold change (MLE): Condition test3 vs control1 
## Wald test p-value: Condition test3 vs control1 
## DataFrame with 3 rows and 6 columns
##                   baseMean    log2FoldChange            lfcSE
##                  <numeric>         <numeric>        <numeric>
## DDX11L1   107.789954356499 -2.13680916217934 1.25457768538966
## WASH7P    4096.69289403373 -2.92712718610039 1.27334960345585
## MIR6859-1 26.3815577107416 -1.72148064337821 1.35473112375054
##                        stat             pvalue               padj
##                   <numeric>          <numeric>          <numeric>
## DDX11L1   -1.70320992240163 0.0885287904518496  0.124384837745232
## WASH7P     -2.2987616112309 0.0215184798089485 0.0367860582785829
## MIR6859-1 -1.27071757133056  0.203829142186771  0.255788972170917

Compare different groups

res_WT <- results(dds, contrast=c("Condition","control1","test1"))
res_G08_corr <- results(dds, contrast=c("Condition","control2","test2"))
res_G08 <- results(dds, contrast=c("Condition","control3","test3"))

Log fold change shrinkage for visualization and ranking

resultsNames(dds)

## [1] "Intercept"                      "Condition_control2_vs_control1"
## [3] "Condition_control3_vs_control1" "Condition_test1_vs_control1"   
## [5] "Condition_test2_vs_control1"    "Condition_test3_vs_control1"

resLFC_WT <- lfcShrink(dds, coef="Condition_test1_vs_control1", type="apeglm")

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     bioRxiv. https://doi.org/10.1101/303255

head(resLFC_WT,3)

## log2 fold change (MAP): Condition test1 vs control1 
## Wald test p-value: Condition test1 vs control1 
## DataFrame with 3 rows and 5 columns
##                   baseMean        log2FoldChange               lfcSE
##                  <numeric>             <numeric>           <numeric>
## DDX11L1   107.789954356499 -9.49839419517838e-07 0.00144269390506929
## WASH7P    4096.69289403373 -3.71791190963565e-07 0.00144269368536049
## MIR6859-1 26.3815577107416  8.51857211052023e-07 0.00144269395683869
##                      pvalue              padj
##                   <numeric>         <numeric>
## DDX11L1   0.326427725833295 0.999102114315248
## WASH7P    0.690938958142953 0.999102114315248
## MIR6859-1 0.381981017784859 0.999102114315248

Can noly get the fold change between the WT NCC and WT CMC, so to get the fold changes among other two groups, need to seprate and prepear the data again, which is shown in the following section.

Get the fold changes from other two groups

#Cannot caculate the Fold Changes between G08 and G08-corr
counts1 <- counts[c("G08_corr_NCC","G08_corr_CMC")]
counts2 <- counts[c("G08_NCC","G08_CMC")]
Col1 <- read.csv(file = "Col1.csv",sep = ",",header = TRUE, row.names = 1)

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'Col1.csv'

Col2 <- read.csv(file = "Col2.csv",sep = ",",header = TRUE, row.names = 1)

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'Col2.csv'

dds1 <- DESeqDataSetFromMatrix(countData = counts1,
                               colData = Col1,
                               design = ~ Condition)
dds2 <- DESeqDataSetFromMatrix(countData = counts2,
                               colData = Col2,
                               design = ~ Condition)
dds1 <- DESeq(dds1)

## estimating size factors

## estimating dispersions

## Warning in checkForExperimentalReplicates(object, modelMatrix): 
## 
##   Deprectation note: Analysis of designs without replicates will be removed
##   in the Oct 2018 release: DESeq2 v1.22.0, after which DESeq2 will give an error.

## Warning in checkForExperimentalReplicates(object, modelMatrix): 
## 
##   The design matrix has the same number of samples and coefficients to fit,
##   estimating dispersion by treating samples as replicates. This analysis
##   is not useful for accurate differential expression analysis, and arguably
##   not for data exploration either, as large differences appear as high dispersion.

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

resLFC_G08_corr <- lfcShrink(dds1, coef="Condition_test2_vs_control2", type="apeglm")

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     bioRxiv. https://doi.org/10.1101/303255

dds2 <- DESeq(dds2)

## estimating size factors

## estimating dispersions

## Warning in checkForExperimentalReplicates(object, modelMatrix): 
## 
##   Deprectation note: Analysis of designs without replicates will be removed
##   in the Oct 2018 release: DESeq2 v1.22.0, after which DESeq2 will give an error.

## Warning in checkForExperimentalReplicates(object, modelMatrix): 
## 
##   The design matrix has the same number of samples and coefficients to fit,
##   estimating dispersion by treating samples as replicates. This analysis
##   is not useful for accurate differential expression analysis, and arguably
##   not for data exploration either, as large differences appear as high dispersion.

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

resLFC_G08 <- lfcShrink(dds2, coef="Condition_test3_vs_control3", type="apeglm")

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     bioRxiv. https://doi.org/10.1101/303255

p-value

Here I use alpha=0.05 as my significant p value.

#p-value
resOrdered <- res[order(res$pvalue),]
summary(res)

## 
## out of 30017 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 8760, 29%
## LFC < 0 (down)     : 3501, 12%
## outliers [1]       : 0, 0%
## low counts [2]     : 12221, 41%
## (mean count < 7)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

##p-value<0.05
res05 <- results(dds, alpha=0.05)
summary(res05)

## 
## out of 30017 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 8068, 27%
## LFC < 0 (down)     : 2930, 9.8%
## outliers [1]       : 0, 0%
## low counts [2]     : 12803, 43%
## (mean count < 8)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

sum(res05$padj < 0.05, na.rm=TRUE)

## [1] 10998

Extracting transformed values

#Top 50 changed genes in each group
#make rlog transformed data
vsd <- vst(dds, blind=FALSE)
head(assay(vsd),3)

##            WT_NCC_1  WT_NCC_2 G08_corr_NCC   G08_NCC  WT_CMC_1  WT_CMC_2
## DDX11L1    8.205501  7.458635     5.001204  5.876045  7.449673  5.983062
## WASH7P    13.550362 12.267291     9.681378 10.110276 13.269094 11.471658
## MIR6859-1  4.972664  5.342347     2.345593  3.364815  6.463764  5.458604
##           G08_corr_CMC   G08_CMC
## DDX11L1       5.114766  5.813540
## WASH7P        6.929833 10.123847
## MIR6859-1     2.308247  3.727328

Check the effects of transformations on the variance

Befor the transtoemation, the sd vs mean plot is:

meanSdPlot(assay(dds))

Aftre the transformation, the sd va mean plot is:

meanSdPlot(assay(vsd))

Heatmap for the count matrix

set up the annotation

To add the annotaion on the heatmap, I need to set up the df which shows the conditions by:

select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:50]
df <- as.data.frame(colData(dds)[,c("Condition")])
rownames(df) <- colnames(dds)
names(df)[1] <- c("Conditin")

Drawing the heatmap by using the pheatmap library.

pheatmap(assay(vsd)[select,],scale="row",
         trace="none", dendrogram="column",fontsize_row = 5, 
         cluster_rows =TRUE, cellwidth = 30,cellheight = 5,
         main = "Heatmap of the count matrix", show_rownames=TRUE,
         cluster_cols=FALSE,cluster_row=TRUE, 
         legend = TRUE, annotation_col = df)

heatmap for the log2 Fold Change between different samples

I also want to compare the log2 Fold Changes between different groups. So at first I get the first 1000 changed genes in each group and than combind them together to create the candidate highly changed genes file:FC_top.

write.table(x=data.frame(resLFC_WT$log2FoldChange, stringsAsFactors=FALSE),file="FC_WT50.csv",quote=FALSE,sep="\t",row.names=TRUE)
write.table(x=data.frame(resLFC_G08_corr$log2FoldChange, stringsAsFactors=FALSE),file="FC_Cor50.csv",quote=FALSE,sep="\t",row.names=TRUE)
write.table(x=data.frame(resLFC_G08$log2FoldChange, stringsAsFactors=FALSE),file="FC_G0850.csv",quote=FALSE,sep="\t",row.names=TRUE)
FC_WT<-read.csv(file = "FC_WT50.csv",sep = "\t",header = TRUE, row.names = NULL)
FC_WT <- FC_WT[order(-abs(FC_WT$resLFC_WT.log2FoldChange)),]
FC_Cor<-read.csv(file = "FC_Cor50.csv",sep = "\t",header = TRUE, row.names = NULL)
FC_Cor <- FC_Cor[order(-abs(FC_Cor$resLFC_G08_corr.log2FoldChange)),]
FC_G08<-read.csv(file = "FC_G0850.csv",sep = "\t",header = TRUE, row.names = NULL)
FC_G08 <- FC_G08[order(-abs(FC_G08$resLFC_G08.log2FoldChange)),]


FC_WT1000 <-slice(FC_WT, 1:1000)
FC_Cor1000 <-slice(FC_Cor, 1:1000)
FC_G081000 <-slice(FC_G08, 1:1000)
FC_top <-full_join(FC_WT1000, FC_Cor1000, by = "row.names")
FC_top <-full_join(FC_top, FC_G081000, by = "row.names")
FC_top <-inner_join(FC_top, FC_WT, by = "row.names")
FC_top <-inner_join(FC_top, FC_Cor, by = "row.names")
FC_top <-inner_join(FC_top, FC_G08, by = "row.names")
row.names(FC_top)=FC_top[,1]
FC_top <- FC_top[c(5:7)]
names(FC_top)[1:3] <- c("WT","G08_corr","G08")

Heatmap for log2 Fold change

Then draw the heatmap.

pheatmap(FC_top, scale="row",
         trace="none", dendrogram="column",cluster_rows =TRUE,
         fontsize_row = 1, show_rownames=TRUE,
         cluster_cols=FALSE, legend = TRUE,main = "log2(FoldChange) of 3000 top genes from NCC to CMC",
         color = colorRampPalette(c("red","black","green"))(255))

Volcano Plot

I also want to draw the volcano plot for the genes changes from NCC to CMC in each cell types.alphs=0.05 is my significant p value. By using the plot code it is really wordy, and the none significant spots won’t show.

WT NCC vs WT CMC

alpha <- 0.05 # Threshold on the p-value
# par(mfrow=c(1,2))
# Compute significance, with a maximum of 320 for the p-values set to 0 due to limitation of computation precision
res_WT$sig <- -log10(res_WT$padj)
sum(is.infinite(res_WT$sig))

## [1] 0

res_WT[is.infinite(res_WT$sig),"sig"] <- 350
genes.to.plot <- !is.na(res_WT$pvalue)
range(res_WT[genes.to.plot, "log2FoldChange"])

## [1] -7.238799  8.521509

## Volcano plot of adjusted p-values
cols <- densCols(res_WT$log2FoldChange, res_WT$sig)
cols[res_WT$pvalue ==0] <- "purple"
res_WT$pch <- 19
res_WT$pch[res_WT$pvalue ==0] <- 6
plot(res_WT$log2FoldChange, 
     res_WT$sig, 
     col=cols, panel.first=grid(),
     main="WT Volcano plot", 
     xlab="Effect size: log2(fold-change)",
     ylab="-log10(adjusted p-value)",
     pch=res_WT$pch, cex=0.4)
abline(v=0)
abline(v=c(-2.5,2.5), col="brown")
abline(h=-log10(alpha), col="brown")

## Plot the names of a reasonable number of genes, by selecting those begin not only significant but also having a strong effect size
gn.selected <- abs(res_WT$log2FoldChange) > 2 & res_WT$padj < alpha 
text(res_WT$log2FoldChange[gn.selected],
     -log10(res_WT$padj)[gn.selected],
     lab=rownames(res_WT)[gn.selected ], cex=0.6)

res_WT$sig <- -log10(res_WT$padj)
sum(is.infinite(res_WT$sig))

## [1] 0

res_WT[is.infinite(res_WT$sig),"sig"] <- 350
genes.to.plot <- !is.na(res_WT$pvalue)
range(res_WT[genes.to.plot, "log2FoldChange"])

## [1] -7.238799  8.521509

With Enhanced Volcano

However, if using the Enhancedvolcano library, it is much easier and the plot are prettier. So I used the Enhanced volcano to draw the volcano plots for the 3 cell types.

EnhancedVolcano(res_WT,
                lab = rownames(res_WT),
                x = "log2FoldChange",
                y = "pvalue")

G08 corrected

With Enhanced Volcano

EnhancedVolcano(res_G08_corr,
                lab = rownames(res_G08_corr),
                x = "log2FoldChange",
                y = "pvalue")

G08

With Rnhanced Volcano

EnhancedVolcano(res_G08,
                lab = rownames(res_G08),
                x = "log2FoldChange",
                y = "pvalue")

PCA

Draw a PCA plot to see the first two principal conponents

pcaData <- plotPCA(vsd, intgroup=c("Condition"), returnData=TRUE)
percentVar <- round(500 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=Condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()