## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## Welcome to enrichR
## Checking connection ...
## Enrichr ... Connection is Live!
## FlyEnrichr ... Connection is available!
## WormEnrichr ... Connection is available!
## YeastEnrichr ... Connection is available!
## FishEnrichr ... Connection is available!
##
## Attaching package: 'igraph'
##
## The following object is masked from 'package:GenomicRanges':
##
## union
##
## The following object is masked from 'package:IRanges':
##
## union
##
## The following object is masked from 'package:S4Vectors':
##
## union
##
## The following objects are masked from 'package:BiocGenerics':
##
## normalize, path, union
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
##
## Loading required package: graph
##
## Attaching package: 'graph'
##
## The following objects are masked from 'package:igraph':
##
## degree, edges, intersection
##
## Loading required package: grid
##
## Attaching package: 'Rgraphviz'
##
## The following objects are masked from 'package:IRanges':
##
## from, to
##
## The following objects are masked from 'package:S4Vectors':
##
## from, to
The data was fetched from the CGGA (Chinese Glioma Genome Atlas). The sample collection methods and details regarding the sequencing platforms and protocols used can be found on The CGGA Website. [NOTE: Hyperlink Website] The Dataset used is for 325 patients diagnosed with different grades of Glioma as classified by the WHO standards (Grade II, Grade III, Grade IV). The data was obtained from RNA-seq experiments on Illumina platforms and is presented as gene count estimates. The gene count estimates were generated using STAR + RSEM (RNA-seq with Expectation Maximization).
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323 ## Raw vs Expected Count Estimates
The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes. https://biowize.wordpress.com/2014/03/04/understanding-rsem-raw-read-counts-vs-expected-counts/
The typical procedure is to import the RSEM output that includes information about the different gene isoforms obtained from multiple transcripts using the protocol described in the DESeq2 documentation for dealing with gene-level count estimates. The protocol uses the ‘tximport’ package and takes care of pre-processing the data for import into a DESeq2 object, which can then be used to do the DEA. The issue here is that we only have the gene-level count estimates matrix with no additional information, so we do not have the entirety of RSEM output. This issue was raised before and the advised solution was to typically round the count estimates to integer values and then construct the DESeq2 object from the count matrix and run the downstream analysis as usual.
https://support.bioconductor.org/p/94003/#94028 https://www.biostars.org/p/320594/ https://support.bioconductor.org/p/94003/#94028
setwd('/home/mehdimerbah/R Projects/RNASeqDA/data')
disease_raw_data <- read.table("CGGA_mRNAseq_325_RSEM_genes.txt",head=T,row.names = 1)
disease_raw_data[1:5, 1:5]
## CGGA_1001 CGGA_1006 CGGA_1007 CGGA_1011 CGGA_1015
## A1BG 12.64 7.03 30.09 6.64 1.83
## A1BG-AS1 2.12 1.13 6.64 4.32 1.39
## A2M 452.92 106.54 206.70 707.17 824.32
## A2M-AS1 3.30 0.13 0.63 1.61 1.34
## A2ML1 0.04 0.33 4.96 1.59 0.00
setwd('/home/mehdimerbah/R Projects/RNASeqDA/data')
metadata <- read.table("CGGA_mRNAseq_325_clinical.txt", sep ="\t", head=T,row.names = 1)
meta <- metadata[order(row.names(metadata)),]
raw <- disease_raw_data[,order(colnames(disease_raw_data))]
NAs <- rownames(meta[is.na(meta$Grade),])
meta <- meta[!rownames(meta)%in%NAs, ]
raw <- raw[ ,!colnames(raw)%in%NAs]
meta$Grade <- as.factor(meta$Grade)
rounded <- round(raw)
rounded[1:10,1:5]
## CGGA_1001 CGGA_1004 CGGA_1005 CGGA_1006 CGGA_1007
## A1BG 13 21 8 7 30
## A1BG-AS1 2 2 3 1 7
## A2M 453 277 68 107 207
## A2M-AS1 3 1 0 0 1
## A2ML1 0 1 4 0 5
## A2MP1 0 0 0 0 0
## A3GALT2 0 0 0 1 0
## A4GALT 4 4 2 3 7
## A4GNT 0 0 0 1 0
## AAAS 28 29 25 24 26
#png("grade_distribution.png")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(meta, aes(x= Grade, fill=Histology)) + geom_bar() +ggtitle("Grade Distribution Across all Patients")
pca <- prcomp(t(rounded))
percentVar <- round(100*pca$sdev^2/sum(pca$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])
pca_data <- data.frame(Sample = rownames(pca$x), PC1 = pca$x[,1], PC2 = pca$x[,2], Grade = meta$Grade)
ggplot(pca_data, aes(x = PC1, y = PC2, label = Sample))+ geom_point(aes(colour = Grade)) + #geom_text()+
ggtitle("PCA of Gene-level Count Estimates") +
xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
dds <- DESeqDataSetFromMatrix(countData = rounded, colData = meta, design = ~ Grade)
## converting counts to integer mode
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
res <- DESeq(dds)
## estimating size factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## -- replacing outliers and refitting for 534 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
expRes_4v2 <- results(res)
head(expRes_4v2[order(expRes_4v2$padj),], 10)
## log2 fold change (MLE): Grade WHO.IV vs WHO.II
## Wald test p-value: Grade WHO.IV vs WHO.II
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## TIMP1 419.13034 4.51390 0.2543976 17.7435 1.93583e-70 4.11653e-66
## ALDH5A1 20.46041 -1.56108 0.0926748 -16.8448 1.14637e-63 1.21888e-59
## COL1A2 108.34726 3.93948 0.2393097 16.4618 6.89628e-61 4.88832e-57
## GLUD1 140.23541 -1.53151 0.0935634 -16.3687 3.20014e-60 1.70127e-56
## COL6A2 69.82074 3.53397 0.2164736 16.3252 6.53468e-60 2.77920e-56
## CRY2 18.17199 -1.39783 0.0857925 -16.2932 1.10313e-59 3.90969e-56
## SERPINH1 55.88977 2.57997 0.1611367 16.0111 1.06954e-57 3.24912e-54
## WDFY3-AS2 6.10986 -1.64079 0.1031799 -15.9022 6.11949e-57 1.62664e-53
## IGFBP2 221.58926 3.51387 0.2217199 15.8482 1.44588e-56 3.41630e-53
## H19 58.72693 5.89746 0.3733887 15.7944 3.39845e-56 7.22681e-53
write.csv(as.data.frame(expRes_4v2[order(expRes_4v2$padj),]), file="DEGs_4v2.csv")
DESeq2::plotDispEsts(res)
plotMA(expRes_4v2, ylim=c(-4,4))
log2FoldChange<- expRes_4v2$log2FoldChange
padj<- expRes_4v2$padj
pvalue<- expRes_4v2$pvalue
alpha <- 0.05 # Threshold on the adjusted p-value
cols <- densCols(log2FoldChange, -log10(padj))
plot(log2FoldChange, -log10(padj), col=cols, panel.first=grid(),
main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
pch=20, cex=0.6)
abline(v=0)
abline(v=c(-2,2), col="brown")
abline(h=-log10(alpha), col="brown")
filtered <- abs(log2FoldChange) > 2 & padj < alpha
text(log2FoldChange[filtered],
-log10(padj)[filtered],
lab=rownames(res)[filtered], cex=0.4)
## Filtering Expression results Grades: 4 vs 2
filtered_4v2 <- subset(expRes_4v2, padj<0.05 & abs(log2FoldChange)>1)
gene_list_4v2 <- rownames(filtered_4v2)
## Extracting Expression results Grades: 4 vs 3
expRes_4v3 <- results(res, contrast = c("Grade","WHO.IV", "WHO.III"))
#head(expRes_4v3[order(expRes_4v3$padj),], 10)
write.csv(as.data.frame(expRes_4v3[order(expRes_4v3$padj),]), file="DEGs_4v3.csv")
filtered_4v3 <- subset(expRes_4v3, expRes_4v3$padj<0.05 & abs(expRes_4v3$log2FoldChange)>1)
gene_list_4v3 <- rownames(filtered_4v3)
## Extracting Expression results Grades: 3 vs 2
expRes_3v2 <- results(res, contrast = c("Grade","WHO III", "WHO II"))
#head(expRes_3v2[order(expRes_3v2$padj),], 10)
write.csv(as.data.frame(expRes_3v2[order(expRes_3v2$padj),]), file="DEGs_3v2.csv")
filtered_3v2 <- subset(expRes_3v2, expRes_3v2$padj<0.05 & abs(expRes_3v2$log2FoldChange)>1)
gene_list_3v2 <- rownames(filtered_3v2)
# Geneset enrichment with enrichR
dbs <- c("GO_Biological_Process_2018", "Reactome_2016", "KEGG_2019_Human", "RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO", "TF_Perturbations_Followed_by_Expression")
enriched_4v2 <- enrichr(gene_list_4v2, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
enriched_4v3 <- enrichr(gene_list_4v3, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
enriched_3v2 <- enrichr(gene_list_3v2, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
plotEnrich(enriched_4v2[[1]], title= "Enrichment 4 vs 2")
plotEnrich(enriched_4v3[[1]], title= "Enrichment 4 vs 3")
plotEnrich(enriched_3v2[[1]], title= "Enrichment 3 vs 2")
length(gene_list_3v2)
## [1] 879
length(gene_list_4v3)
## [1] 712
length(gene_list_4v2)
## [1] 3526
intersect(intersect(gene_list_3v2, gene_list_4v2), gene_list_4v3)
## [1] "ABCC3" "ABCC8" "AC006126.4" "AC062021.1"
## [5] "AC132217.4" "AC147651.4" "ACBD7" "ADAM12"
## [9] "ADAMTS14" "ANGPTL4" "ANXA2" "BCL2A1"
## [13] "C15orf48" "C1QL3" "C6orf141" "CA9"
## [17] "CALN1" "CD163" "CHI3L1" "CHRNA9"
## [21] "CLCF1" "CLEC5A" "COL13A1" "COL1A1"
## [25] "COL1A2" "COL3A1" "COL5A1" "COL5A2"
## [29] "COL6A2" "COL6A3" "COL8A1" "CTB-1I21.1"
## [33] "CTD-2380F24.1" "CXCL1" "DDN" "DKK1"
## [37] "DLGAP1-AS4" "EMILIN1" "ETNPPL" "F2RL2"
## [41] "F5" "FAM196B" "FAM19A1" "FAM20A"
## [45] "FER1L4" "FMOD" "FN1" "FOSL1"
## [49] "FOXD2-AS1" "FRMPD4" "FSTL5" "G0S2"
## [53] "GABRA1" "GABRB2" "GABRG1" "GDF15"
## [57] "GPX8" "GRIN3A" "H19" "HMGCLL1"
## [61] "HP" "HPSE2" "HS3ST3A1" "HSPA6"
## [65] "HTR2A" "HTRA3" "IGF2" "IGFBP2"
## [69] "IGFBP3" "IGLV2-11" "IL2RA" "IL8"
## [73] "KCNIP2" "KCNJ11" "KDELR3" "KIF20A"
## [77] "KSR2" "LAMB1" "LIF" "LINC00152"
## [81] "LOX" "LOXL1" "LTF" "LUM"
## [85] "METTL7B" "MFAP2" "MGAT4C" "MIR155HG"
## [89] "MMP19" "MMP9" "MSMP" "MXRA5"
## [93] "NNMT" "NT5C1A" "OCIAD2" "OLFM3"
## [97] "PCOLCE" "PDLIM4" "PHLDA2" "PI3"
## [101] "PLA2G2A" "PLA2G5" "PLAU" "PLEK2"
## [105] "PLP2" "PODNL1" "PON1" "POSTN"
## [109] "PRF1" "PRKCG" "PRLHR" "PTPRT"
## [113] "PTTG1" "PTX3" "RARRES2" "RASL10A"
## [117] "RBP1" "RHOD" "RP1-20B21.4" "RP1-261G23.7"
## [121] "RP1-293L6.1" "RP1-34H18.1" "RP11-116O18.1" "RP11-143K11.1"
## [125] "RP11-158M2.5" "RP11-161H23.5" "RP11-245J24.1" "RP11-290F20.3"
## [129] "RP11-47I22.1" "RP11-806H10.4" "RP11-81K13.1" "RP4-792G4.2"
## [133] "RP5-1119A7.17" "S100A4" "S100A9" "SAA1"
## [137] "SAA2" "SCN3B" "SDC1" "SEC61G"
## [141] "SELL" "SERPINA1" "SERPINA5" "SERPINE1"
## [145] "SLC14A2" "SLC22A6" "SLC25A48" "SNAI2"
## [149] "SPHKAP" "SRPX2" "STC1" "STEAP3"
## [153] "STRA6" "TACC3" "TDO2" "THBD"
## [157] "TIMP1" "TMEM151B" "TMEM155" "TMSB4XP4"
## [161] "TNFAIP6" "TRDC" "TREM1" "TTPA"
## [165] "VEGFA" "ZMYND10"
Removing the CGGA_761 sample and restructuring the Datasets
rounded_No761 <- rounded[, !(names(rounded) %in% "CGGA_761")]
dim(rounded_No761)
## [1] 24326 320
dim(rounded)
## [1] 24326 321
meta_No761 <- meta[!(names(rounded) %in% "CGGA_761"), ]
dim(meta_No761)
## [1] 320 12
dim(meta)
## [1] 321 12
Now we can run the Diff Expression analysis on the new Datasets
dds_No761 <- DESeqDataSetFromMatrix(countData = rounded_No761, colData = meta_No761, design = ~ Grade)
## converting counts to integer mode
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
res_No761 <- DESeq(dds_No761)
## estimating size factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## -- replacing outliers and refitting for 534 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
expRes_No761<- results(res_No761)
head(expRes_No761[order(expRes_No761$padj),], 10)
## log2 fold change (MLE): Grade WHO.IV vs WHO.II
## Wald test p-value: Grade WHO.IV vs WHO.II
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## TIMP1 420.29200 4.52383 0.2545050 17.7750 1.10364e-70 2.34690e-66
## ALDH5A1 20.48372 -1.56234 0.0929785 -16.8032 2.31165e-63 2.45786e-59
## COL1A2 108.62411 3.94890 0.2395216 16.4866 4.57923e-61 3.24591e-57
## COL6A2 69.99068 3.54325 0.2165843 16.3597 3.70941e-60 1.97202e-56
## GLUD1 140.43154 -1.53088 0.0938492 -16.3121 8.09954e-60 3.44473e-56
## CRY2 18.19159 -1.39860 0.0860724 -16.2491 2.26572e-59 8.03008e-56
## SERPINH1 55.92422 2.58538 0.1614805 16.0105 1.07989e-57 2.87048e-54
## WDFY3-AS2 6.10427 -1.65560 0.1033772 -16.0151 1.00176e-57 2.87048e-54
## IGFBP2 221.76638 3.52037 0.2222465 15.8399 1.64966e-56 3.89777e-53
## H19 58.91391 5.90787 0.3735425 15.8158 2.42149e-56 5.14931e-53
filtered_No761 <- subset(expRes_No761, expRes_No761$padj<0.05 & abs(expRes_No761$log2FoldChange)>1)
dim(filtered_No761)
## [1] 3547 6
gene_list_No761 <- rownames(filtered_No761)
length(gene_list_4v2)
## [1] 3526
length(gene_list_No761)
## [1] 3547
# The following should find the conjunction of the two gene_list sets
length(intersect(gene_list_4v2, gene_list_No761))
## [1] 3516
# check which genes are not common
list_uncommon <- setdiff(union(gene_list_4v2, gene_list_No761), intersect(gene_list_4v2, gene_list_No761))
There does not seem to be a great difference between the results with and without the CGGA_761 sample
Plotting the results
cols <- densCols(expRes_No761$log2FoldChange, -log10(expRes_No761$padj))
plot(expRes_No761$log2FoldChange, -log10(expRes_No761$padj), col=cols, panel.first=grid(),
main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
pch=20, cex=0.6)
abline(v=0)
abline(v=c(-2,2), col="brown")
abline(h=-log10(alpha), col="brown")
filtered <- abs(expRes_No761$log2FoldChange) > 2 & expRes_No761$padj < alpha
text(expRes_No761$log2FoldChange[filtered],
-log10(expRes_No761$padj)[filtered],
lab=rownames(res_No761)[filtered], cex=0.4)