0.1 Introduction to the Notebook

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

This notebook has been created to take you through the steps of analysing data from The Cancer Genome Atlas (TCGA). This Notebook is going to focus on the use of gene expression data but the TCGA also contains data on mutations, copy number variations and miRNA.

1 Background

Cancer is a disease where cells become in an unregulated manner that results in cellular proliferation and the avoidance of apoptosis. As the disease develops the tumour develops more complex processes including angiogenesis and metastasis, which is the process of migration from the initial solid tumour site to other locations around the body.

Successful treatment depends on our undertsanding of these transformations of well regulated into poorly regulated cells. Somatic mutations, copy number variation and the presence of mutations in cell regulatory genes are known to contribute to this process and to increase the likelihood of its occurence. The question that we are trying to address is what are the changes in gene expression between the different stages of cancer.

  1. Healthy cell
  2. Small solid tumour
  3. Larger tumour with angiogenesis
  4. Metastatic cancer

1.1 Gene Expression Data

All of the cells of the body with the exception of the gametes originally contain the same genetic code. The variation between different cell types is dependent on differences in the genes that are switched on. These differences in gene expression can be measured by looking at the levels of messenger RNA for all of the different genes. Originally this was done using microarray technology but this has now been replaced with next generation sequencing based approaches.

By studying the differences in expression between the different stages of cancer we can identify the genetic changes involved in cancer progression. These are unlikely to be dependent on only single genes but are more likely to depend on the coordinated change in regulation of many genes with regulatory and metabolic pathways.

1.2 The TCGA

The Cancer Genome Atlas is a US government project that collects data from a wide range of cancer types. The main ones for focus are lung and breast cancer but there are datasets from all human tissue types. The data can be accessed via the GDC data portal. The availble data for each case includes.

  1. The case ID - this identifies a cancer patient
  2. The project - this is the large-scale experiment from which the data was collected
  3. The gender
  4. The no. of files
  5. The Seq. files - these are the genome sequence files for the expressed genome (exome) sequencing and also miRNA sequencing when available.
  6. The Exp. files - these are the gene expression files (the ones we will use here)
  7. The SNV files - these are the single nucleotide variations
  8. The CNV files - these are the copy number variants which are largescale duplications of segments of DNA
  9. The Meth files - these document the methylation of the genome and are involved in epigenetic effects
  10. Clinical - these are the clinical files and data for the patient
  11. Bio - these are the biospecimen files for the experiment

1.3 An Experiment

You should remember that a fundamental condition for the design of an experiment is that you require two groups. One of these is usually the control and the other is the treated group. The treatment is a variable that is used to distinguish between the two groups. It is possible to have two alternative treatments rather than one treatment and a control and that will still create a valid experiment.

This is important for TCGA data as they did not collect healthy tissue from the patients (I consider this a serious design flaw but they probably forgot to ask a statistician). The reason for this is that we cannot be sure what the natural variation in gene expression is between the tissue samples of healthy cells in different individuals. If we assume that all healthy tissue has exactly the same expression profile for each tissue type then this is not an issue but this might not be true.

1.4 Setting Up

To carry out this research you need to have installed R, R-studio and the R add-on called Bioconductor. After that you will need to build a script that will do the analysis process for you including getting the data from the TCGA. Combining all of the data into a single file for each group, running the comparisons of gene expression between the groups and finally analysing the differentially expressed genes.

At the top of your R code you need the following section which will check for Bioconductor and install the libraries that you need for the analysis.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.13")
BiocManager::install("TCGAbiolinks")
BiocManager::install("clusterProfiler")
BiocManager::install("EDASeq")
BiocManager::install("edgeR")
BiocManager::install("DO.db")
BiocManager::install("GO.db")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("pathview")
install.packages("here")

2 Getting the Data

2.1 Finding your question

The first step is to define your question. The question is definied by some specific treatment that you want to check out and see how it affects gene expression. I decided to look at the clinical stage of the disease to see if there is a difference in expression between two different stages.

2.2 Searching the TCGA

To find your datasets you need to go to the TCGA and do some searches to see what is possible. I selected cases based on the Clincal Tab and used the Diagnosis of AJCC pathological stage to define the different treatment groups

I chose stage i to be the first group and stage iv to be the second group. This created two small datasets for testing purposes but you could select groups based on age, survival etc. I would NOT use ethnicity as it is poorly defined.

Once you have the sets defined you can download a TSV file containing a summary of the files from each set. You will see that some cases have 5 expression files and some have more. This is a problem as the data has been collected twice for some samples and this cannot be handled by the R-script. Any cases with more than 5 files need to be removed. This will then give you two spreadsheets summarising the data available for the two groups and most importantly their Case IDs.

You now need to create the R scripts to download the two sets of data based on their Case IDs.

2.3 Downloading the First Dataset

### Access the library of commands from TCGAbiolinks
library(TCGAbiolinks)

### Define a variable called barcodes1 that includes the ids for the experiments that you want in your dataset
# Note the format of r variables. this is a vector defined by the c() format, each element is defined
# by a comma and as they are names they are in quotation marks.
barcodes1 <- c("TCGA-5L-AAT1", "TCGA-UU-A93S","TCGA-A8-A07W", "TCGA-AN-A0FJ", "TCGA-AC-A62V", "TCGA-BH-A18J", 
         "TCGA-B6-A0I9", "TCGA-AO-A0J5", "TCGA-B6-A0IB", "TCGA-A8-A08T", "TCGA-A2-A0SW", "TCGA-LL-A73Z", 
         "TCGA-A2-A0CS", "TCGA-PL-A8LX", "TCGA-A2-A0T2", "TCGA-A8-A08J", "TCGA-A2-A0SV", "TCGA-A8-A08O",
         "TCGA-BH-A1FH", "TCGA-B6-A3ZX")

### Create a query callled BRCAiv for the database using query terms AND the barcode variable bar
query.exp.BRCAiv <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "results", 
                  sample.type = c("Primary Tumor"),
                  legacy = TRUE,
                  barcode=barcodes1)
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg19
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-BRCA
## --------------------
## oo Filtering results
## --------------------
## ooo By platform
## ooo By data.type
## ooo By file.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
### Download the result of the query to BRCAiv after you have checked that the query worked properly
GDCdownload(query.exp.BRCAiv)
## Downloading data for project TCGA-BRCA
## Of the 20 files for download 20 already exist.
## All samples have been already downloaded

2.4 Downloading the Second Dataset

For this second example the barcode variable is set out as a column this is to show that the shape of the data you put into the variable doesn’t matter. The only things that matter are the c, the brackets and the speech marks. It is easier to put a long list like this in column format because this comes directly from the spreadsheet from TCGA giving the Case ID.

### This second dataset needs a new barcode variable barcodes2
barcodes2 <- c ("TCGA-BH-A0B6",
           "TCGA-GM-A2D9",
           "TCGA-BH-A0HA",
           "TCGA-BH-A18P",
           "TCGA-AR-A2LE",
           "TCGA-BH-A0W7",
           "TCGA-E2-A1IN",
           "TCGA-E2-A1LH",
           "TCGA-GM-A2DH",
           "TCGA-GM-A2DO",
           "TCGA-A8-A095",
           "TCGA-BH-A0BP",
           "TCGA-E2-A152",
           "TCGA-BH-A0H5",
           "TCGA-BH-A209",
           "TCGA-BH-A0WA",
           "TCGA-BH-A0BW",
           "TCGA-Z7-A8R6",
           "TCGA-BH-A0B8",
           "TCGA-E2-A14Z",
           "TCGA-A8-A06O",
           "TCGA-BH-A0AV",
           "TCGA-E2-A1IH",
           "TCGA-AR-A255",
           "TCGA-BH-A0B0",
           "TCGA-BH-A0HY",
           "TCGA-OL-A66J",
           "TCGA-AR-A2LR",
           "TCGA-AR-A1AK",
           "TCGA-AR-A1AY",
           "TCGA-GM-A2DI",
           "TCGA-GM-A2DD",
           "TCGA-B6-A40B",
           "TCGA-BH-A0BR",
           "TCGA-BH-A0DX",
           "TCGA-BH-A0BL",
           "TCGA-BH-A18K",
           "TCGA-AO-A03V",
           "TCGA-E2-A154",
           "TCGA-A2-A3XZ",
           "TCGA-AR-A24S",
           "TCGA-GM-A2DL",
           "TCGA-AR-A1AJ",
           "TCGA-EW-A1J6",
           "TCGA-BH-A1FD",
           "TCGA-A2-A0YF",
           "TCGA-E2-A15C",
           "TCGA-EW-A1IY",
           "TCGA-A7-A0CD",
           "TCGA-AR-A1AP",
           "TCGA-AR-A24N",
           "TCGA-A8-A0AD",
           "TCGA-E2-A15O",
           "TCGA-BH-A0BG",
           "TCGA-S3-AA14",
           "TCGA-E2-A1II",
           "TCGA-GM-A2DK",
           "TCGA-BH-A0DO",
           "TCGA-AR-A1AX",
           "TCGA-BH-A1FG",
           "TCGA-A1-A0SE",
           "TCGA-BH-A18S",
           "TCGA-BH-A0H6",
           "TCGA-BH-A0BO",
           "TCGA-AR-A24P",
           "TCGA-BH-A1ET",
           "TCGA-BH-A1EU",
           "TCGA-E2-A156",
           "TCGA-B6-A1KI",
           "TCGA-B6-A0X0",
           "TCGA-E2-A15J",
           "TCGA-BH-A0BQ",
           "TCGA-A2-A259",
           "TCGA-E2-A1IF",
           "TCGA-A2-A0EP",
           "TCGA-BH-A0C3",
           "TCGA-E2-A1IJ",
           "TCGA-BH-A0H3",
           "TCGA-E2-A1IO",
           "TCGA-A2-A0YI",
           "TCGA-AR-A252",
           "TCGA-A8-A08A",
           "TCGA-B6-A402",
           "TCGA-AO-A03M",
           "TCGA-E2-A14U",
           "TCGA-A7-A3IY",
           "TCGA-AO-A03U",
           "TCGA-E2-A15F",
           "TCGA-E2-A14S",
           "TCGA-A1-A0SB"
)

### Create a query caller BRCAi using the bar1 variable in the barcodes and the same query format as before
query.exp.BRCAi <- GDCquery(project = "TCGA-BRCA",
                             data.category = "Gene expression",
                             data.type = "Gene expression quantification",
                             platform = "Illumina HiSeq", 
                             file.type  = "results", 
                             sample.type = c("Primary Tumor"),
                             legacy = TRUE,
                             barcode=barcodes2)
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg19
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-BRCA
## --------------------
## oo Filtering results
## --------------------
## ooo By platform
## ooo By data.type
## ooo By file.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
###  Download this second dataset into BRCAi
GDCdownload(query.exp.BRCAi)
## Downloading data for project TCGA-BRCA
## Of the 90 files for download 90 already exist.
## All samples have been already downloaded

2.5 Creating the R Datafiles

You can now write R datafiles for each of the two datasets so that these can be loaded into R for any future analysis and you do not need to download the data again from the TCGA.

BRCAi.exp <- GDCprepare(query.exp.BRCAi, 
                      save = TRUE, 
                      summarizedExperiment = TRUE, 
                      save.filename = "BRCAillumina_HiSeq.rda")
## -------------------
## oo Reading 90 files
## -------------------
## 
|                                                    |  0%                      
|                                                    |1.111111% ~2 s remaining  
|=                                                   |2.222222% ~1 s remaining  
|=                                                   |3.333333% ~1 s remaining  
|==                                                  |4.444444% ~1 s remaining  
|==                                                  |5.555556% ~1 s remaining  
|===                                                 |6.666667% ~1 s remaining  
|====                                                |7.777778% ~1 s remaining  
|====                                                |8.888889% ~1 s remaining  
|=====                                               | 10% ~1 s remaining       
|=====                                               |11.11111% ~1 s remaining  
|======                                              |12.22222% ~1 s remaining  
|======                                              |13.33333% ~1 s remaining  
|=======                                             |14.44444% ~1 s remaining  
|========                                            |15.55556% ~1 s remaining  
|========                                            |16.66667% ~1 s remaining  
|=========                                           |17.77778% ~1 s remaining  
|=========                                           |18.88889% ~1 s remaining  
|==========                                          | 20% ~1 s remaining       
|==========                                          |21.11111% ~1 s remaining  
|===========                                         |22.22222% ~1 s remaining  
|============                                        |23.33333% ~1 s remaining  
|============                                        |24.44444% ~1 s remaining  
|=============                                       |25.55556% ~1 s remaining  
|=============                                       |26.66667% ~1 s remaining  
|==============                                      |27.77778% ~2 s remaining  
|===============                                     |28.88889% ~1 s remaining  
|===============                                     | 30% ~1 s remaining       
|================                                    |31.11111% ~1 s remaining  
|================                                    |32.22222% ~1 s remaining  
|=================                                   |33.33333% ~1 s remaining  
|=================                                   |34.44444% ~1 s remaining  
|==================                                  |35.55556% ~1 s remaining  
|===================                                 |36.66667% ~1 s remaining  
|===================                                 |37.77778% ~1 s remaining  
|====================                                |38.88889% ~1 s remaining  
|====================                                | 40% ~1 s remaining       
|=====================                               |41.11111% ~1 s remaining  
|=====================                               |42.22222% ~1 s remaining  
|======================                              |43.33333% ~1 s remaining  
|=======================                             |44.44444% ~1 s remaining  
|=======================                             |45.55556% ~1 s remaining  
|========================                            |46.66667% ~1 s remaining  
|========================                            |47.77778% ~1 s remaining  
|=========================                           |48.88889% ~1 s remaining  
|==========================                          | 50% ~1 s remaining       
|==========================                          |51.11111% ~1 s remaining  
|===========================                         |52.22222% ~1 s remaining  
|===========================                         |53.33333% ~1 s remaining  
|============================                        |54.44444% ~1 s remaining  
|============================                        |55.55556% ~1 s remaining  
|=============================                       |56.66667% ~1 s remaining  
|==============================                      |57.77778% ~1 s remaining  
|==============================                      |58.88889% ~1 s remaining  
|===============================                     | 60% ~1 s remaining       
|===============================                     |61.11111% ~1 s remaining  
|================================                    |62.22222% ~1 s remaining  
|================================                    |63.33333% ~1 s remaining  
|=================================                   |64.44444% ~1 s remaining  
|==================================                  |65.55556% ~1 s remaining  
|==================================                  |66.66667% ~1 s remaining  
|===================================                 |67.77778% ~1 s remaining  
|===================================                 |68.88889% ~1 s remaining  
|====================================                | 70% ~1 s remaining       
|====================================                |71.11111% ~0 s remaining  
|=====================================               |72.22222% ~0 s remaining  
|======================================              |73.33333% ~0 s remaining  
|======================================              |74.44444% ~0 s remaining  
|=======================================             |75.55556% ~0 s remaining  
|=======================================             |76.66667% ~0 s remaining  
|========================================            |77.77778% ~0 s remaining  
|=========================================           |78.88889% ~0 s remaining  
|=========================================           | 80% ~0 s remaining       
|==========================================          |81.11111% ~0 s remaining  
|==========================================          |82.22222% ~0 s remaining  
|===========================================         |83.33333% ~0 s remaining  
|===========================================         |84.44444% ~0 s remaining  
|============================================        |85.55556% ~0 s remaining  
|=============================================       |86.66667% ~0 s remaining  
|=============================================       |87.77778% ~0 s remaining  
|==============================================      |88.88889% ~0 s remaining  
|==============================================      | 90% ~0 s remaining       
|===============================================     |91.11111% ~0 s remaining  
|===============================================     |92.22222% ~0 s remaining  
|================================================    |93.33333% ~0 s remaining  
|=================================================   |94.44444% ~0 s remaining  
|=================================================   |95.55556% ~0 s remaining  
|==================================================  |96.66667% ~0 s remaining  
|==================================================  |97.77778% ~0 s remaining  
|=================================================== |98.88889% ~0 s remaining  
|====================================================|100% ~0 s remaining       
|====================================================|100%                      Completed after 1 s
## -------------------
## oo Merging 90 files
## -------------------
## Starting to add information to samples
##  => Add clinical information to samples
##  => Adding TCGA molecular information from marker papers
##  => Information will have prefix 'paper_'
## brca subtype information from:doi.org/10.1016/j.ccell.2018.03.014
## => Saving file: BRCAillumina_HiSeq.rda
## => File saved
BRCAiv.exp <- GDCprepare(query.exp.BRCAiv, 
                      save = TRUE, 
                      summarizedExperiment = TRUE, 
                      save.filename = "BRCAivIllumina_HiSeq.rda")
## -------------------
## oo Reading 20 files
## -------------------
## 
|                                                    |  0%                      
|==                                                  |  5% ~0 s remaining       
|=====                                               | 10% ~0 s remaining       
|=======                                             | 15% ~0 s remaining       
|==========                                          | 20% ~0 s remaining       
|=============                                       | 25% ~0 s remaining       
|===============                                     | 30% ~0 s remaining       
|==================                                  | 35% ~0 s remaining       
|====================                                | 40% ~0 s remaining       
|=======================                             | 45% ~0 s remaining       
|==========================                          | 50% ~0 s remaining       
|============================                        | 55% ~0 s remaining       
|===============================                     | 60% ~0 s remaining       
|=================================                   | 65% ~0 s remaining       
|====================================                | 70% ~0 s remaining       
|=======================================             | 75% ~0 s remaining       
|=========================================           | 80% ~0 s remaining       
|============================================        | 85% ~0 s remaining       
|==============================================      | 90% ~0 s remaining       
|=================================================   | 95% ~0 s remaining       
|====================================================|100% ~0 s remaining       
|====================================================|100%                      Completed after 0 s
## -------------------
## oo Merging 20 files
## -------------------
## Starting to add information to samples
##  => Add clinical information to samples
##  => Adding TCGA molecular information from marker papers
##  => Information will have prefix 'paper_'
## brca subtype information from:doi.org/10.1016/j.ccell.2018.03.014
## => Saving file: BRCAivIllumina_HiSeq.rda
## => File saved

3 Finding the Differentially Expressed Genes

The next step is to actually identify the genes that are differentially expressed between the two experimental groups. This also involved preprocessing and normalisation steps.

3.1 Preprocessing and Normalisation

You apply a filter based on correlation to remove any samples within each group that are potential outliers. Then you normalise the two datasets to make sure that all of the samples are measured at a common scale.

dataPrep_BRCAi <- TCGAanalyze_Preprocessing(object = BRCAi.exp,
                                          cor.cut = 0.6,    
                                          datatype = "raw_count",
                                          filename = "BRCAi_IlluminaHiSeq_RNASeqV2.png")

dataPrep_BRCAiv <- TCGAanalyze_Preprocessing(object = BRCAiv.exp,
                                          cor.cut = 0.6, 
                                          datatype = "raw_count",
                                          filename = "BRCAiv_IlluminaHiSeq_RNASeqV2.png")

dataNorm <- TCGAanalyze_Normalization(tabDF = cbind(dataPrep_BRCAi, dataPrep_BRCAiv),
                                      geneInfo = TCGAbiolinks::geneInfo,
                                      method = "gcContent") 
## I Need about  27 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: .quantileNormalization ...

You need to take a quick look at the images generated from the correlation step to see if there are any outliers with a low level of correlation.

knitr::include_graphics(here::here("BRCAiv_IlluminaHiSeq_RNASeqV2.png"))
The Boxplot of the correlations between the BRCAiv samples

The Boxplot of the correlations between the BRCAiv samples

As you can see from the plot the correlations are all very high and there is good overlap between all of the boxplots. Three might be slightly lower than the rest but they are not significantly different.

3.2 Filtering the Combined Data

This will remove genes from the dataset that fall outside a specified range. This filtering can be based on the mean, the variance, the fold change or the effect size.

Filtering based on the mean is common but there are statistical and biological reasons why this might not be a good approach to take. 1) The data is unlikely to be normally distributed and so the mean is a poor measure 2) Genes with very low or very high expression levels might be the ones with significant differential expression.

It might be best not to filter the data at all or to use another method. Fold change or effect size could be reasonable but variance is likely to have the same statistical issues as the mean if the data is not normally distributed.

### The filtering cutoff for the data was set to 0.05 that means the middle 95% of the data will be kept
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut =  0.05)  

### Save the composite normalised datafile
save(dataFilt, file = paste0("BRCAi_BRCAiv_Norm_IlluminaHiSeq.rda"))

Alternatively using fold change

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "filter2")  

### Save the composite normalised datafile
save(dataFilt, file = paste0("BRCAi_BRCAiv_Norm_IlluminaHiSeq.rda"))

3.3 Splitting The Normalised Data

You now need to create two R objects that correspond to the normalised and filtered data for the two experimental conditions. This is based on the barcodes1 and barcodes2 variables.

### Split the data into subsets according the different barcodes.
dataFiltBRCAi <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% barcodes2)
dataFiltBRCAiv <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% barcodes1)

3.4 Determining the Differentially Expressed Genes

This is the final step of the process and there are two different options for doing this. One based on the exact test and one using the General Linear Model of the Negative Binomial. Don’t worry about the second process it is just called the glmLRT model and it is the preferred model.

You also need to adjust the FDR cut to a false discovery rate that you are happy with. I have set it to be 5% for this case. Ideally you want to have between 1000 and 3000 differentially expressed genes as this is likely to show a number of pathways. If you don’t have the cut off you can be overwhelmed by the number of identified genes.

### Determine the set of differentially expressed genes between the two subsets
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltBRCAi,
                            mat2 = dataFiltBRCAiv,
                            Cond1type = "BRCAi",
                            Cond2type = "BRCAiv",
                            fdr.cut = 0.1,
                            method = "exactTest")
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## there are Cond1 type BRCAi in  90 samples
## there are Cond2 type BRCAiv in  20 samples
## there are  19866 features as miRNA or genes
## I Need about  73 seconds for this DEA. [Processing 30k elements /s]
## ----------------------- END DEA -------------------------------
### Check how many genes you have.
nrow(dataDEGs)
## [1] 2421

Now that you have a list of differentially expressed genes the next step is the genome set enrichment analysis. This identifies the pathways and biological processes that are affected by the differential expression.

4 Genome Set Enrichment Analysis (GSEA)

4.1 Genome Ontology Analysis

The genome ontology (GO) is a specific grammar and vocabulary used to describe biological pathways, cellular locations and biological processes. It was created to try and standardise biological knowledge in the post genomic era.

Genome set enrichment analysis takes the list of genes identified as differentially expressed and uses the genome ontology to idnetify process and locations where there are a disproportionate number of differentially expressed genes.

This produces four barplots one for each of; 1) Biological Process 2) Cellular Component 3) Pathways 4) Molecular Function

### Gene Set Enrichment Analysis GSEA 
ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes BRCAi Vs BRCAiv", RegulonList = rownames(dataDEGs))
## [1] "I need about  1 minute to finish complete  Enrichment analysis GO[BP,MF,CC] and Pathways... "
## [1] "GO Enrichment Analysis BP completed....done"
## [1] "GO Enrichment Analysis MF completed....done"
## [1] "GO Enrichment Analysis CC completed....done"
## [1] "Pathway Enrichment Analysis completed....done"
## Display the GSEA Results ##

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        GOBPTab = ansEA$ResBP, 
                        nRGTab = rownames(dataDEGs),
                        nBar = 20,
                        filename="BP.pdf")
## png 
##   2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResCC),
                        GOCCTab = ansEA$ResCC,
                        nRGTab = rownames(dataDEGs),
                        nBar = 20,
                        filename="CC.pdf")
## png 
##   2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResPat),
                        PathTab = ansEA$ResPat,
                        nRGTab = rownames(dataDEGs),
                        nBar = 20,
                        filename="Pat.pdf")
## png 
##   2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResMF),
                        GOMFTab = ansEA$ResMF, 
                        nRGTab = rownames(dataDEGs),
                        nBar = 20,
                        filename="MF.pdf")
## png 
##   2

While GO analysis reveals some of the features of the differential gene expression we would like to take a more detailed look at pathway analysis.

4.2 Detailed Look at Pathway Analysis

The first step is to convert the gene symbols used in the experimental analysis to Entrez IDs that can be used for searching the KEGG database of pathways. This is complex code but it just does a simple function.

library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(clusterProfiler)
## 
## clusterProfiler v4.0.0  For help: https://guangchuangyu.github.io/software/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:IRanges':
## 
##     slice
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:stats':
## 
##     filter
library(pathview)
## 
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
# DEGs TopTable
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"BRCAi","BRCAiv",
                                          dataFilt[,colnames(dataFiltBRCAi)],
                                          dataFilt[,colnames(dataFiltBRCAiv)])

dataDEGsFiltLevel$GeneID <- 0


# Converting Gene symbol to geneID
eg = as.data.frame(bitr(dataDEGsFiltLevel$mRNA,
                        fromType="SYMBOL",
                        toType="ENTREZID",
                        OrgDb="org.Hs.eg.db"))
## Loading required package: org.Hs.eg.db
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(dataDEGsFiltLevel$mRNA, fromType = "SYMBOL", toType =
## "ENTREZID", : 13.26% of input gene IDs are fail to map...
eg <- eg[!duplicated(eg$SYMBOL),]

dataDEGsFiltLevel <- dataDEGsFiltLevel[dataDEGsFiltLevel$mRNA %in% eg$SYMBOL,]

dataDEGsFiltLevel <- dataDEGsFiltLevel[order(dataDEGsFiltLevel$mRNA,decreasing=FALSE),]
eg <- eg[order(eg$SYMBOL,decreasing=FALSE),]

# table(eg$SYMBOL == dataDEGsFiltLevel$mRNA) should be TRUE
all(eg$SYMBOL == dataDEGsFiltLevel$mRNA)
## [1] TRUE
dataDEGsFiltLevel$GeneID <- eg$ENTREZID

dataDEGsFiltLevel_sub <- subset(dataDEGsFiltLevel, select = c("GeneID", "logFC"))
genelistDEGs <- as.numeric(dataDEGsFiltLevel_sub$logFC)
names(genelistDEGs) <- dataDEGsFiltLevel_sub$GeneID

###################################################
# The following exports the Entrez Gene IDs to a csv file

output <- names(genelistDEGs)
write.csv(output, "gene_list.csv")

Once you have a list of Entrez Gene IDs you can use this to search KEGG by using DAVID https://david.ncifcrf.gov/

You upload the list as a gene list and carry out the functional annotation analysis. Then you pic the chart for KEGG-PATHWAY. In this case there is only one pathway with a significant value for Benjamini and Hochberg which is the proteasome if you click on this in the bottom left hand corner of the image is a number. For the proteasome this is 03050. You need this number for the next steps. This is used in the R code to create the pathway images.

Next you want to plot the differential expression on the KEGG pathway in R and Bioconductor. You do this with the following code.

hsa03050 <- pathview::pathview(gene.data  = genelistDEGs,
                               pathway.id = "hsa03050",
                               species    = "hsa",
                               limit = list(gene=as.integer(max(abs(genelistDEGs)))))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory D:/Dropbox/R-TCGA
## Info: Writing image file hsa03050.pathview.png
hsa05204 <- pathview::pathview(gene.data  = genelistDEGs,
                               pathway.id = "hsa05204",
                               species    = "hsa",
                               limit = list(gene=as.integer(max(abs(genelistDEGs)))))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory D:/Dropbox/R-TCGA
## Info: Writing image file hsa05204.pathview.png

This generates the following image.

knitr::include_graphics(here::here("hsa03050.pathview.png"))
The Differential Gene Expression in the Proteasome

The Differential Gene Expression in the Proteasome