## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## Welcome to enrichR
## Checking connection ...
## Enrichr ... Connection is Live!
## FlyEnrichr ... Connection is available!
## WormEnrichr ... Connection is available!
## YeastEnrichr ... Connection is available!
## FishEnrichr ... Connection is available!
## 
## Attaching package: 'igraph'
## 
## The following object is masked from 'package:GenomicRanges':
## 
##     union
## 
## The following object is masked from 'package:IRanges':
## 
##     union
## 
## The following object is masked from 'package:S4Vectors':
## 
##     union
## 
## The following objects are masked from 'package:BiocGenerics':
## 
##     normalize, path, union
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## The following object is masked from 'package:base':
## 
##     union
## 
## Loading required package: graph
## 
## Attaching package: 'graph'
## 
## The following objects are masked from 'package:igraph':
## 
##     degree, edges, intersection
## 
## Loading required package: grid
## 
## Attaching package: 'Rgraphviz'
## 
## The following objects are masked from 'package:IRanges':
## 
##     from, to
## 
## The following objects are masked from 'package:S4Vectors':
## 
##     from, to

Information about the Data

The data was fetched from the CGGA (Chinese Glioma Genome Atlas). The sample collection methods and details regarding the sequencing platforms and protocols used can be found on The CGGA Website. [NOTE: Hyperlink Website] The Dataset used is for 325 patients diagnosed with different grades of Glioma as classified by the WHO standards (Grade II, Grade III, Grade IV). The data was obtained from RNA-seq experiments on Illumina platforms and is presented as gene count estimates. The gene count estimates were generated using STAR + RSEM (RNA-seq with Expectation Maximization).

RSEM

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323 ## Raw vs Expected Count Estimates

The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes. https://biowize.wordpress.com/2014/03/04/understanding-rsem-raw-read-counts-vs-expected-counts/

DESeq2 with RSEM Estimates Matrix

The typical procedure is to import the RSEM output that includes information about the different gene isoforms obtained from multiple transcripts using the protocol described in the DESeq2 documentation for dealing with gene-level count estimates. The protocol uses the ‘tximport’ package and takes care of pre-processing the data for import into a DESeq2 object, which can then be used to do the DEA. The issue here is that we only have the gene-level count estimates matrix with no additional information, so we do not have the entirety of RSEM output. This issue was raised before and the advised solution was to typically round the count estimates to integer values and then construct the DESeq2 object from the count matrix and run the downstream analysis as usual.

Data import and Wrangling

setwd('/home/mehdimerbah/R Projects/RNASeqDA/data')
disease_raw_data <- read.table("CGGA_mRNAseq_325_RSEM_genes.txt",head=T,row.names = 1)
disease_raw_data[1:5, 1:5]
##          CGGA_1001 CGGA_1006 CGGA_1007 CGGA_1011 CGGA_1015
## A1BG         12.64      7.03     30.09      6.64      1.83
## A1BG-AS1      2.12      1.13      6.64      4.32      1.39
## A2M         452.92    106.54    206.70    707.17    824.32
## A2M-AS1       3.30      0.13      0.63      1.61      1.34
## A2ML1         0.04      0.33      4.96      1.59      0.00
setwd('/home/mehdimerbah/R Projects/RNASeqDA/data')
metadata <- read.table("CGGA_mRNAseq_325_clinical.txt", sep ="\t", head=T,row.names = 1)
meta <- metadata[order(row.names(metadata)),]
raw <- disease_raw_data[,order(colnames(disease_raw_data))]
NAs <- rownames(meta[is.na(meta$Grade),])
meta <- meta[!rownames(meta)%in%NAs, ]
raw <- raw[ ,!colnames(raw)%in%NAs]
meta$Grade <- as.factor(meta$Grade)
rounded <- round(raw)
rounded[1:10,1:5]
##          CGGA_1001 CGGA_1004 CGGA_1005 CGGA_1006 CGGA_1007
## A1BG            13        21         8         7        30
## A1BG-AS1         2         2         3         1         7
## A2M            453       277        68       107       207
## A2M-AS1          3         1         0         0         1
## A2ML1            0         1         4         0         5
## A2MP1            0         0         0         0         0
## A3GALT2          0         0         0         1         0
## A4GALT           4         4         2         3         7
## A4GNT            0         0         0         1         0
## AAAS            28        29        25        24        26

EDA

Distribution of the Grades

#png("grade_distribution.png")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(meta, aes(x= Grade, fill=Histology)) + geom_bar() +ggtitle("Grade Distribution Across all Patients")

PCA

pca <- prcomp(t(rounded))

percentVar <- round(100*pca$sdev^2/sum(pca$sdev^2),1)

sd_ratio <- sqrt(percentVar[2] / percentVar[1])

pca_data <- data.frame(Sample = rownames(pca$x), PC1 = pca$x[,1], PC2 = pca$x[,2], Grade = meta$Grade)

ggplot(pca_data, aes(x = PC1, y = PC2, label = Sample))+ geom_point(aes(colour = Grade)) + #geom_text()+
  ggtitle("PCA of Gene-level Count Estimates") +
  xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
  ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))

Differantial Expression Analysis

dds <- DESeqDataSetFromMatrix(countData = rounded, colData = meta, design = ~ Grade)
## converting counts to integer mode
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
res <- DESeq(dds)
## estimating size factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## -- replacing outliers and refitting for 534 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
expRes_4v2 <- results(res)
head(expRes_4v2[order(expRes_4v2$padj),], 10)
## log2 fold change (MLE): Grade WHO.IV vs WHO.II 
## Wald test p-value: Grade WHO.IV vs WHO.II 
## DataFrame with 10 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##           <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## TIMP1     419.13034        4.51390 0.2543976   17.7435 1.93583e-70 4.11653e-66
## ALDH5A1    20.46041       -1.56108 0.0926748  -16.8448 1.14637e-63 1.21888e-59
## COL1A2    108.34726        3.93948 0.2393097   16.4618 6.89628e-61 4.88832e-57
## GLUD1     140.23541       -1.53151 0.0935634  -16.3687 3.20014e-60 1.70127e-56
## COL6A2     69.82074        3.53397 0.2164736   16.3252 6.53468e-60 2.77920e-56
## CRY2       18.17199       -1.39783 0.0857925  -16.2932 1.10313e-59 3.90969e-56
## SERPINH1   55.88977        2.57997 0.1611367   16.0111 1.06954e-57 3.24912e-54
## WDFY3-AS2   6.10986       -1.64079 0.1031799  -15.9022 6.11949e-57 1.62664e-53
## IGFBP2    221.58926        3.51387 0.2217199   15.8482 1.44588e-56 3.41630e-53
## H19        58.72693        5.89746 0.3733887   15.7944 3.39845e-56 7.22681e-53
write.csv(as.data.frame(expRes_4v2[order(expRes_4v2$padj),]), file="DEGs_4v2.csv")
DESeq2::plotDispEsts(res)

plotMA(expRes_4v2, ylim=c(-4,4))

log2FoldChange<- expRes_4v2$log2FoldChange
padj<- expRes_4v2$padj
pvalue<- expRes_4v2$pvalue
alpha <- 0.05 # Threshold on the adjusted p-value
cols <- densCols(log2FoldChange, -log10(padj))
plot(log2FoldChange, -log10(padj), col=cols, panel.first=grid(),
     main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
     pch=20, cex=0.6)
abline(v=0)
abline(v=c(-2,2), col="brown")
abline(h=-log10(alpha), col="brown")

filtered <- abs(log2FoldChange) > 2 & padj < alpha 
text(log2FoldChange[filtered],
     -log10(padj)[filtered],
     lab=rownames(res)[filtered], cex=0.4)

Functional Annotation with enrichR

## Filtering Expression results Grades: 4 vs 2
filtered_4v2 <- subset(expRes_4v2, padj<0.05 & abs(log2FoldChange)>1)
gene_list_4v2 <- rownames(filtered_4v2)


## Extracting Expression results Grades: 4 vs 3
expRes_4v3 <- results(res, contrast = c("Grade","WHO.IV", "WHO.III"))
#head(expRes_4v3[order(expRes_4v3$padj),], 10)
write.csv(as.data.frame(expRes_4v3[order(expRes_4v3$padj),]), file="DEGs_4v3.csv")
filtered_4v3 <- subset(expRes_4v3, expRes_4v3$padj<0.05 & abs(expRes_4v3$log2FoldChange)>1)
gene_list_4v3 <- rownames(filtered_4v3)

## Extracting Expression results Grades: 3 vs 2
expRes_3v2 <- results(res, contrast = c("Grade","WHO III", "WHO II"))
#head(expRes_3v2[order(expRes_3v2$padj),], 10)
write.csv(as.data.frame(expRes_3v2[order(expRes_3v2$padj),]), file="DEGs_3v2.csv")
filtered_3v2 <- subset(expRes_3v2, expRes_3v2$padj<0.05 & abs(expRes_3v2$log2FoldChange)>1)
gene_list_3v2 <- rownames(filtered_3v2)

# Geneset enrichment with enrichR
dbs <- c("GO_Biological_Process_2018", "Reactome_2016", "KEGG_2019_Human", "RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO", "TF_Perturbations_Followed_by_Expression")
enriched_4v2 <- enrichr(gene_list_4v2, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
enriched_4v3 <- enrichr(gene_list_4v3, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
enriched_3v2 <- enrichr(gene_list_3v2, "GO_Biological_Process_2018")
## Uploading data to Enrichr... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
plotEnrich(enriched_4v2[[1]], title= "Enrichment 4 vs 2")

plotEnrich(enriched_4v3[[1]], title= "Enrichment 4 vs 3")

plotEnrich(enriched_3v2[[1]], title= "Enrichment 3 vs 2")

Number of Diff Expressed Genes

length(gene_list_3v2)
## [1] 879
length(gene_list_4v3)
## [1] 712
length(gene_list_4v2)
## [1] 3526
intersect(intersect(gene_list_3v2, gene_list_4v2), gene_list_4v3)
##   [1] "ABCC3"         "ABCC8"         "AC006126.4"    "AC062021.1"   
##   [5] "AC132217.4"    "AC147651.4"    "ACBD7"         "ADAM12"       
##   [9] "ADAMTS14"      "ANGPTL4"       "ANXA2"         "BCL2A1"       
##  [13] "C15orf48"      "C1QL3"         "C6orf141"      "CA9"          
##  [17] "CALN1"         "CD163"         "CHI3L1"        "CHRNA9"       
##  [21] "CLCF1"         "CLEC5A"        "COL13A1"       "COL1A1"       
##  [25] "COL1A2"        "COL3A1"        "COL5A1"        "COL5A2"       
##  [29] "COL6A2"        "COL6A3"        "COL8A1"        "CTB-1I21.1"   
##  [33] "CTD-2380F24.1" "CXCL1"         "DDN"           "DKK1"         
##  [37] "DLGAP1-AS4"    "EMILIN1"       "ETNPPL"        "F2RL2"        
##  [41] "F5"            "FAM196B"       "FAM19A1"       "FAM20A"       
##  [45] "FER1L4"        "FMOD"          "FN1"           "FOSL1"        
##  [49] "FOXD2-AS1"     "FRMPD4"        "FSTL5"         "G0S2"         
##  [53] "GABRA1"        "GABRB2"        "GABRG1"        "GDF15"        
##  [57] "GPX8"          "GRIN3A"        "H19"           "HMGCLL1"      
##  [61] "HP"            "HPSE2"         "HS3ST3A1"      "HSPA6"        
##  [65] "HTR2A"         "HTRA3"         "IGF2"          "IGFBP2"       
##  [69] "IGFBP3"        "IGLV2-11"      "IL2RA"         "IL8"          
##  [73] "KCNIP2"        "KCNJ11"        "KDELR3"        "KIF20A"       
##  [77] "KSR2"          "LAMB1"         "LIF"           "LINC00152"    
##  [81] "LOX"           "LOXL1"         "LTF"           "LUM"          
##  [85] "METTL7B"       "MFAP2"         "MGAT4C"        "MIR155HG"     
##  [89] "MMP19"         "MMP9"          "MSMP"          "MXRA5"        
##  [93] "NNMT"          "NT5C1A"        "OCIAD2"        "OLFM3"        
##  [97] "PCOLCE"        "PDLIM4"        "PHLDA2"        "PI3"          
## [101] "PLA2G2A"       "PLA2G5"        "PLAU"          "PLEK2"        
## [105] "PLP2"          "PODNL1"        "PON1"          "POSTN"        
## [109] "PRF1"          "PRKCG"         "PRLHR"         "PTPRT"        
## [113] "PTTG1"         "PTX3"          "RARRES2"       "RASL10A"      
## [117] "RBP1"          "RHOD"          "RP1-20B21.4"   "RP1-261G23.7" 
## [121] "RP1-293L6.1"   "RP1-34H18.1"   "RP11-116O18.1" "RP11-143K11.1"
## [125] "RP11-158M2.5"  "RP11-161H23.5" "RP11-245J24.1" "RP11-290F20.3"
## [129] "RP11-47I22.1"  "RP11-806H10.4" "RP11-81K13.1"  "RP4-792G4.2"  
## [133] "RP5-1119A7.17" "S100A4"        "S100A9"        "SAA1"         
## [137] "SAA2"          "SCN3B"         "SDC1"          "SEC61G"       
## [141] "SELL"          "SERPINA1"      "SERPINA5"      "SERPINE1"     
## [145] "SLC14A2"       "SLC22A6"       "SLC25A48"      "SNAI2"        
## [149] "SPHKAP"        "SRPX2"         "STC1"          "STEAP3"       
## [153] "STRA6"         "TACC3"         "TDO2"          "THBD"         
## [157] "TIMP1"         "TMEM151B"      "TMEM155"       "TMSB4XP4"     
## [161] "TNFAIP6"       "TRDC"          "TREM1"         "TTPA"         
## [165] "VEGFA"         "ZMYND10"

DEA Without 761

Removing the CGGA_761 sample and restructuring the Datasets

rounded_No761 <- rounded[, !(names(rounded) %in% "CGGA_761")]
dim(rounded_No761)
## [1] 24326   320
dim(rounded)
## [1] 24326   321
meta_No761 <- meta[!(names(rounded) %in% "CGGA_761"), ]
dim(meta_No761)
## [1] 320  12
dim(meta)
## [1] 321  12

Now we can run the Diff Expression analysis on the new Datasets

dds_No761 <- DESeqDataSetFromMatrix(countData = rounded_No761, colData = meta_No761, design = ~ Grade)
## converting counts to integer mode
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
res_No761 <- DESeq(dds_No761)
## estimating size factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## -- replacing outliers and refitting for 534 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
expRes_No761<- results(res_No761)
head(expRes_No761[order(expRes_No761$padj),], 10)
## log2 fold change (MLE): Grade WHO.IV vs WHO.II 
## Wald test p-value: Grade WHO.IV vs WHO.II 
## DataFrame with 10 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##           <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## TIMP1     420.29200        4.52383 0.2545050   17.7750 1.10364e-70 2.34690e-66
## ALDH5A1    20.48372       -1.56234 0.0929785  -16.8032 2.31165e-63 2.45786e-59
## COL1A2    108.62411        3.94890 0.2395216   16.4866 4.57923e-61 3.24591e-57
## COL6A2     69.99068        3.54325 0.2165843   16.3597 3.70941e-60 1.97202e-56
## GLUD1     140.43154       -1.53088 0.0938492  -16.3121 8.09954e-60 3.44473e-56
## CRY2       18.19159       -1.39860 0.0860724  -16.2491 2.26572e-59 8.03008e-56
## SERPINH1   55.92422        2.58538 0.1614805   16.0105 1.07989e-57 2.87048e-54
## WDFY3-AS2   6.10427       -1.65560 0.1033772  -16.0151 1.00176e-57 2.87048e-54
## IGFBP2    221.76638        3.52037 0.2222465   15.8399 1.64966e-56 3.89777e-53
## H19        58.91391        5.90787 0.3735425   15.8158 2.42149e-56 5.14931e-53
filtered_No761 <- subset(expRes_No761, expRes_No761$padj<0.05 & abs(expRes_No761$log2FoldChange)>1)
dim(filtered_No761)
## [1] 3547    6
gene_list_No761 <- rownames(filtered_No761)
length(gene_list_4v2)
## [1] 3526
length(gene_list_No761)
## [1] 3547
# The following should find the conjunction of the two gene_list sets
length(intersect(gene_list_4v2, gene_list_No761))
## [1] 3516
# check which genes are not common
list_uncommon <- setdiff(union(gene_list_4v2, gene_list_No761), intersect(gene_list_4v2, gene_list_No761))

There does not seem to be a great difference between the results with and without the CGGA_761 sample

Plotting the results

cols <- densCols(expRes_No761$log2FoldChange, -log10(expRes_No761$padj))
plot(expRes_No761$log2FoldChange, -log10(expRes_No761$padj), col=cols, panel.first=grid(),
     main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
     pch=20, cex=0.6)
abline(v=0)
abline(v=c(-2,2), col="brown")
abline(h=-log10(alpha), col="brown")

filtered <- abs(expRes_No761$log2FoldChange) > 2 & expRes_No761$padj < alpha 
text(expRes_No761$log2FoldChange[filtered],
     -log10(expRes_No761$padj)[filtered],
     lab=rownames(res_No761)[filtered], cex=0.4)