This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
This notebook has been created to take you through the steps of analysing data from The Cancer Genome Atlas (TCGA). This Notebook is going to focus on the use of gene expression data but the TCGA also contains data on mutations, copy number variations and miRNA.
Cancer is a disease where cells become in an unregulated manner that results in cellular proliferation and the avoidance of apoptosis. As the disease develops the tumour develops more complex processes including angiogenesis and metastasis, which is the process of migration from the initial solid tumour site to other locations around the body.
Successful treatment depends on our undertsanding of these transformations of well regulated into poorly regulated cells. Somatic mutations, copy number variation and the presence of mutations in cell regulatory genes are known to contribute to this process and to increase the likelihood of its occurence. The question that we are trying to address is what are the changes in gene expression between the different stages of cancer.
All of the cells of the body with the exception of the gametes originally contain the same genetic code. The variation between different cell types is dependent on differences in the genes that are switched on. These differences in gene expression can be measured by looking at the levels of messenger RNA for all of the different genes. Originally this was done using microarray technology but this has now been replaced with next generation sequencing based approaches.
By studying the differences in expression between the different stages of cancer we can identify the genetic changes involved in cancer progression. These are unlikely to be dependent on only single genes but are more likely to depend on the coordinated change in regulation of many genes with regulatory and metabolic pathways.
The Cancer Genome Atlas is a US government project that collects data from a wide range of cancer types. The main ones for focus are lung and breast cancer but there are datasets from all human tissue types. The data can be accessed via the GDC data portal. The availble data for each case includes.
You should remember that a fundamental condition for the design of an experiment is that you require two groups. One of these is usually the control and the other is the treated group. The treatment is a variable that is used to distinguish between the two groups. It is possible to have two alternative treatments rather than one treatment and a control and that will still create a valid experiment.
This is important for TCGA data as they did not collect healthy tissue from the patients (I consider this a serious design flaw but they probably forgot to ask a statistician). The reason for this is that we cannot be sure what the natural variation in gene expression is between the tissue samples of healthy cells in different individuals. If we assume that all healthy tissue has exactly the same expression profile for each tissue type then this is not an issue but this might not be true.
To carry out this research you need to have installed R, R-studio and the R add-on called Bioconductor. After that you will need to build a script that will do the analysis process for you including getting the data from the TCGA. Combining all of the data into a single file for each group, running the comparisons of gene expression between the groups and finally analysing the differentially expressed genes.
At the top of your R code you need the following section which will check for Bioconductor and install the libraries that you need for the analysis.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(version = "3.13")
BiocManager::install("TCGAbiolinks")
BiocManager::install("clusterProfiler")
BiocManager::install("EDASeq")
BiocManager::install("edgeR")
BiocManager::install("DO.db")
BiocManager::install("GO.db")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("pathview")
BiocManagerinstall.packages("here")
The first step is to define your question. The question is definied by some specific treatment that you want to check out and see how it affects gene expression. I decided to look at the clinical stage of the disease to see if there is a difference in expression between two different stages.
To find your datasets you need to go to the TCGA and do some searches to see what is possible. I selected cases based on the Clincal Tab and used the Diagnosis of AJCC pathological stage to define the different treatment groups
I chose stage i to be the first group and stage iv to be the second group. This created two small datasets for testing purposes but you could select groups based on age, survival etc. I would NOT use ethnicity as it is poorly defined.
Once you have the sets defined you can download a TSV file containing a summary of the files from each set. You will see that some cases have 5 expression files and some have more. This is a problem as the data has been collected twice for some samples and this cannot be handled by the R-script. Any cases with more than 5 files need to be removed. This will then give you two spreadsheets summarising the data available for the two groups and most importantly their Case IDs.
You now need to create the R scripts to download the two sets of data based on their Case IDs.
### Access the library of commands from TCGAbiolinks
library(TCGAbiolinks)
### Define a variable called barcodes1 that includes the ids for the experiments that you want in your dataset
# Note the format of r variables. this is a vector defined by the c() format, each element is defined
# by a comma and as they are names they are in quotation marks.
<- c("TCGA-5L-AAT1", "TCGA-UU-A93S","TCGA-A8-A07W", "TCGA-AN-A0FJ", "TCGA-AC-A62V", "TCGA-BH-A18J",
barcodes1 "TCGA-B6-A0I9", "TCGA-AO-A0J5", "TCGA-B6-A0IB", "TCGA-A8-A08T", "TCGA-A2-A0SW", "TCGA-LL-A73Z",
"TCGA-A2-A0CS", "TCGA-PL-A8LX", "TCGA-A2-A0T2", "TCGA-A8-A08J", "TCGA-A2-A0SV", "TCGA-A8-A08O",
"TCGA-BH-A1FH", "TCGA-B6-A3ZX")
### Create a query callled BRCAiv for the database using query terms AND the barcode variable bar
<- GDCquery(project = "TCGA-BRCA",
query.exp.BRCAiv data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = c("Primary Tumor"),
legacy = TRUE,
barcode=barcodes1)
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg19
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-BRCA
## --------------------
## oo Filtering results
## --------------------
## ooo By platform
## ooo By data.type
## ooo By file.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
### Download the result of the query to BRCAiv after you have checked that the query worked properly
GDCdownload(query.exp.BRCAiv)
## Downloading data for project TCGA-BRCA
## Of the 20 files for download 20 already exist.
## All samples have been already downloaded
For this second example the barcode variable is set out as a column this is to show that the shape of the data you put into the variable doesn’t matter. The only things that matter are the c, the brackets and the speech marks. It is easier to put a long list like this in column format because this comes directly from the spreadsheet from TCGA giving the Case ID.
### This second dataset needs a new barcode variable barcodes2
<- c ("TCGA-BH-A0B6",
barcodes2 "TCGA-GM-A2D9",
"TCGA-BH-A0HA",
"TCGA-BH-A18P",
"TCGA-AR-A2LE",
"TCGA-BH-A0W7",
"TCGA-E2-A1IN",
"TCGA-E2-A1LH",
"TCGA-GM-A2DH",
"TCGA-GM-A2DO",
"TCGA-A8-A095",
"TCGA-BH-A0BP",
"TCGA-E2-A152",
"TCGA-BH-A0H5",
"TCGA-BH-A209",
"TCGA-BH-A0WA",
"TCGA-BH-A0BW",
"TCGA-Z7-A8R6",
"TCGA-BH-A0B8",
"TCGA-E2-A14Z",
"TCGA-A8-A06O",
"TCGA-BH-A0AV",
"TCGA-E2-A1IH",
"TCGA-AR-A255",
"TCGA-BH-A0B0",
"TCGA-BH-A0HY",
"TCGA-OL-A66J",
"TCGA-AR-A2LR",
"TCGA-AR-A1AK",
"TCGA-AR-A1AY",
"TCGA-GM-A2DI",
"TCGA-GM-A2DD",
"TCGA-B6-A40B",
"TCGA-BH-A0BR",
"TCGA-BH-A0DX",
"TCGA-BH-A0BL",
"TCGA-BH-A18K",
"TCGA-AO-A03V",
"TCGA-E2-A154",
"TCGA-A2-A3XZ",
"TCGA-AR-A24S",
"TCGA-GM-A2DL",
"TCGA-AR-A1AJ",
"TCGA-EW-A1J6",
"TCGA-BH-A1FD",
"TCGA-A2-A0YF",
"TCGA-E2-A15C",
"TCGA-EW-A1IY",
"TCGA-A7-A0CD",
"TCGA-AR-A1AP",
"TCGA-AR-A24N",
"TCGA-A8-A0AD",
"TCGA-E2-A15O",
"TCGA-BH-A0BG",
"TCGA-S3-AA14",
"TCGA-E2-A1II",
"TCGA-GM-A2DK",
"TCGA-BH-A0DO",
"TCGA-AR-A1AX",
"TCGA-BH-A1FG",
"TCGA-A1-A0SE",
"TCGA-BH-A18S",
"TCGA-BH-A0H6",
"TCGA-BH-A0BO",
"TCGA-AR-A24P",
"TCGA-BH-A1ET",
"TCGA-BH-A1EU",
"TCGA-E2-A156",
"TCGA-B6-A1KI",
"TCGA-B6-A0X0",
"TCGA-E2-A15J",
"TCGA-BH-A0BQ",
"TCGA-A2-A259",
"TCGA-E2-A1IF",
"TCGA-A2-A0EP",
"TCGA-BH-A0C3",
"TCGA-E2-A1IJ",
"TCGA-BH-A0H3",
"TCGA-E2-A1IO",
"TCGA-A2-A0YI",
"TCGA-AR-A252",
"TCGA-A8-A08A",
"TCGA-B6-A402",
"TCGA-AO-A03M",
"TCGA-E2-A14U",
"TCGA-A7-A3IY",
"TCGA-AO-A03U",
"TCGA-E2-A15F",
"TCGA-E2-A14S",
"TCGA-A1-A0SB"
)
### Create a query caller BRCAi using the bar1 variable in the barcodes and the same query format as before
<- GDCquery(project = "TCGA-BRCA",
query.exp.BRCAi data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = c("Primary Tumor"),
legacy = TRUE,
barcode=barcodes2)
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg19
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-BRCA
## --------------------
## oo Filtering results
## --------------------
## ooo By platform
## ooo By data.type
## ooo By file.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
### Download this second dataset into BRCAi
GDCdownload(query.exp.BRCAi)
## Downloading data for project TCGA-BRCA
## Of the 90 files for download 90 already exist.
## All samples have been already downloaded
You can now write R datafiles for each of the two datasets so that these can be loaded into R for any future analysis and you do not need to download the data again from the TCGA.
<- GDCprepare(query.exp.BRCAi,
BRCAi.exp save = TRUE,
summarizedExperiment = TRUE,
save.filename = "BRCAillumina_HiSeq.rda")
## -------------------
## oo Reading 90 files
## -------------------
##
| | 0%
| |1.111111% ~2 s remaining
|= |2.222222% ~1 s remaining
|= |3.333333% ~1 s remaining
|== |4.444444% ~1 s remaining
|== |5.555556% ~1 s remaining
|=== |6.666667% ~1 s remaining
|==== |7.777778% ~1 s remaining
|==== |8.888889% ~1 s remaining
|===== | 10% ~1 s remaining
|===== |11.11111% ~1 s remaining
|====== |12.22222% ~1 s remaining
|====== |13.33333% ~1 s remaining
|======= |14.44444% ~1 s remaining
|======== |15.55556% ~1 s remaining
|======== |16.66667% ~1 s remaining
|========= |17.77778% ~1 s remaining
|========= |18.88889% ~1 s remaining
|========== | 20% ~1 s remaining
|========== |21.11111% ~1 s remaining
|=========== |22.22222% ~1 s remaining
|============ |23.33333% ~1 s remaining
|============ |24.44444% ~1 s remaining
|============= |25.55556% ~1 s remaining
|============= |26.66667% ~1 s remaining
|============== |27.77778% ~2 s remaining
|=============== |28.88889% ~1 s remaining
|=============== | 30% ~1 s remaining
|================ |31.11111% ~1 s remaining
|================ |32.22222% ~1 s remaining
|================= |33.33333% ~1 s remaining
|================= |34.44444% ~1 s remaining
|================== |35.55556% ~1 s remaining
|=================== |36.66667% ~1 s remaining
|=================== |37.77778% ~1 s remaining
|==================== |38.88889% ~1 s remaining
|==================== | 40% ~1 s remaining
|===================== |41.11111% ~1 s remaining
|===================== |42.22222% ~1 s remaining
|====================== |43.33333% ~1 s remaining
|======================= |44.44444% ~1 s remaining
|======================= |45.55556% ~1 s remaining
|======================== |46.66667% ~1 s remaining
|======================== |47.77778% ~1 s remaining
|========================= |48.88889% ~1 s remaining
|========================== | 50% ~1 s remaining
|========================== |51.11111% ~1 s remaining
|=========================== |52.22222% ~1 s remaining
|=========================== |53.33333% ~1 s remaining
|============================ |54.44444% ~1 s remaining
|============================ |55.55556% ~1 s remaining
|============================= |56.66667% ~1 s remaining
|============================== |57.77778% ~1 s remaining
|============================== |58.88889% ~1 s remaining
|=============================== | 60% ~1 s remaining
|=============================== |61.11111% ~1 s remaining
|================================ |62.22222% ~1 s remaining
|================================ |63.33333% ~1 s remaining
|================================= |64.44444% ~1 s remaining
|================================== |65.55556% ~1 s remaining
|================================== |66.66667% ~1 s remaining
|=================================== |67.77778% ~1 s remaining
|=================================== |68.88889% ~1 s remaining
|==================================== | 70% ~1 s remaining
|==================================== |71.11111% ~0 s remaining
|===================================== |72.22222% ~0 s remaining
|====================================== |73.33333% ~0 s remaining
|====================================== |74.44444% ~0 s remaining
|======================================= |75.55556% ~0 s remaining
|======================================= |76.66667% ~0 s remaining
|======================================== |77.77778% ~0 s remaining
|========================================= |78.88889% ~0 s remaining
|========================================= | 80% ~0 s remaining
|========================================== |81.11111% ~0 s remaining
|========================================== |82.22222% ~0 s remaining
|=========================================== |83.33333% ~0 s remaining
|=========================================== |84.44444% ~0 s remaining
|============================================ |85.55556% ~0 s remaining
|============================================= |86.66667% ~0 s remaining
|============================================= |87.77778% ~0 s remaining
|============================================== |88.88889% ~0 s remaining
|============================================== | 90% ~0 s remaining
|=============================================== |91.11111% ~0 s remaining
|=============================================== |92.22222% ~0 s remaining
|================================================ |93.33333% ~0 s remaining
|================================================= |94.44444% ~0 s remaining
|================================================= |95.55556% ~0 s remaining
|================================================== |96.66667% ~0 s remaining
|================================================== |97.77778% ~0 s remaining
|=================================================== |98.88889% ~0 s remaining
|====================================================|100% ~0 s remaining
|====================================================|100% Completed after 1 s
## -------------------
## oo Merging 90 files
## -------------------
## Starting to add information to samples
## => Add clinical information to samples
## => Adding TCGA molecular information from marker papers
## => Information will have prefix 'paper_'
## brca subtype information from:doi.org/10.1016/j.ccell.2018.03.014
## => Saving file: BRCAillumina_HiSeq.rda
## => File saved
<- GDCprepare(query.exp.BRCAiv,
BRCAiv.exp save = TRUE,
summarizedExperiment = TRUE,
save.filename = "BRCAivIllumina_HiSeq.rda")
## -------------------
## oo Reading 20 files
## -------------------
##
| | 0%
|== | 5% ~0 s remaining
|===== | 10% ~0 s remaining
|======= | 15% ~0 s remaining
|========== | 20% ~0 s remaining
|============= | 25% ~0 s remaining
|=============== | 30% ~0 s remaining
|================== | 35% ~0 s remaining
|==================== | 40% ~0 s remaining
|======================= | 45% ~0 s remaining
|========================== | 50% ~0 s remaining
|============================ | 55% ~0 s remaining
|=============================== | 60% ~0 s remaining
|================================= | 65% ~0 s remaining
|==================================== | 70% ~0 s remaining
|======================================= | 75% ~0 s remaining
|========================================= | 80% ~0 s remaining
|============================================ | 85% ~0 s remaining
|============================================== | 90% ~0 s remaining
|================================================= | 95% ~0 s remaining
|====================================================|100% ~0 s remaining
|====================================================|100% Completed after 0 s
## -------------------
## oo Merging 20 files
## -------------------
## Starting to add information to samples
## => Add clinical information to samples
## => Adding TCGA molecular information from marker papers
## => Information will have prefix 'paper_'
## brca subtype information from:doi.org/10.1016/j.ccell.2018.03.014
## => Saving file: BRCAivIllumina_HiSeq.rda
## => File saved
The next step is to actually identify the genes that are differentially expressed between the two experimental groups. This also involved preprocessing and normalisation steps.
You apply a filter based on correlation to remove any samples within each group that are potential outliers. Then you normalise the two datasets to make sure that all of the samples are measured at a common scale.
<- TCGAanalyze_Preprocessing(object = BRCAi.exp,
dataPrep_BRCAi cor.cut = 0.6,
datatype = "raw_count",
filename = "BRCAi_IlluminaHiSeq_RNASeqV2.png")
<- TCGAanalyze_Preprocessing(object = BRCAiv.exp,
dataPrep_BRCAiv cor.cut = 0.6,
datatype = "raw_count",
filename = "BRCAiv_IlluminaHiSeq_RNASeqV2.png")
<- TCGAanalyze_Normalization(tabDF = cbind(dataPrep_BRCAi, dataPrep_BRCAiv),
dataNorm geneInfo = TCGAbiolinks::geneInfo,
method = "gcContent")
## I Need about 27 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: .quantileNormalization ...
You need to take a quick look at the images generated from the correlation step to see if there are any outliers with a low level of correlation.
::include_graphics(here::here("BRCAiv_IlluminaHiSeq_RNASeqV2.png")) knitr
The Boxplot of the correlations between the BRCAiv samples
As you can see from the plot the correlations are all very high and there is good overlap between all of the boxplots. Three might be slightly lower than the rest but they are not significantly different.
This will remove genes from the dataset that fall outside a specified range. This filtering can be based on the mean, the variance, the fold change or the effect size.
Filtering based on the mean is common but there are statistical and biological reasons why this might not be a good approach to take. 1) The data is unlikely to be normally distributed and so the mean is a poor measure 2) Genes with very low or very high expression levels might be the ones with significant differential expression.
It might be best not to filter the data at all or to use another method. Fold change or effect size could be reasonable but variance is likely to have the same statistical issues as the mean if the data is not normally distributed.
### The filtering cutoff for the data was set to 0.05 that means the middle 95% of the data will be kept
<- TCGAanalyze_Filtering(tabDF = dataNorm,
dataFilt method = "quantile",
qnt.cut = 0.05)
### Save the composite normalised datafile
save(dataFilt, file = paste0("BRCAi_BRCAiv_Norm_IlluminaHiSeq.rda"))
Alternatively using fold change
<- TCGAanalyze_Filtering(tabDF = dataNorm,
dataFilt method = "filter2")
### Save the composite normalised datafile
save(dataFilt, file = paste0("BRCAi_BRCAiv_Norm_IlluminaHiSeq.rda"))
You now need to create two R objects that correspond to the normalised and filtered data for the two experimental conditions. This is based on the barcodes1 and barcodes2 variables.
### Split the data into subsets according the different barcodes.
<- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% barcodes2)
dataFiltBRCAi <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% barcodes1) dataFiltBRCAiv
This is the final step of the process and there are two different options for doing this. One based on the exact test and one using the General Linear Model of the Negative Binomial. Don’t worry about the second process it is just called the glmLRT model and it is the preferred model.
You also need to adjust the FDR cut to a false discovery rate that you are happy with. I have set it to be 5% for this case. Ideally you want to have between 1000 and 3000 differentially expressed genes as this is likely to show a number of pathways. If you don’t have the cut off you can be overwhelmed by the number of identified genes.
### Determine the set of differentially expressed genes between the two subsets
<- TCGAanalyze_DEA(mat1 = dataFiltBRCAi,
dataDEGs mat2 = dataFiltBRCAiv,
Cond1type = "BRCAi",
Cond2type = "BRCAiv",
fdr.cut = 0.1,
method = "exactTest")
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## there are Cond1 type BRCAi in 90 samples
## there are Cond2 type BRCAiv in 20 samples
## there are 19866 features as miRNA or genes
## I Need about 73 seconds for this DEA. [Processing 30k elements /s]
## ----------------------- END DEA -------------------------------
### Check how many genes you have.
nrow(dataDEGs)
## [1] 2421
Now that you have a list of differentially expressed genes the next step is the genome set enrichment analysis. This identifies the pathways and biological processes that are affected by the differential expression.
The genome ontology (GO) is a specific grammar and vocabulary used to describe biological pathways, cellular locations and biological processes. It was created to try and standardise biological knowledge in the post genomic era.
Genome set enrichment analysis takes the list of genes identified as differentially expressed and uses the genome ontology to idnetify process and locations where there are a disproportionate number of differentially expressed genes.
This produces four barplots one for each of; 1) Biological Process 2) Cellular Component 3) Pathways 4) Molecular Function
### Gene Set Enrichment Analysis GSEA
<- TCGAanalyze_EAcomplete(TFname="DEA genes BRCAi Vs BRCAiv", RegulonList = rownames(dataDEGs)) ansEA
## [1] "I need about 1 minute to finish complete Enrichment analysis GO[BP,MF,CC] and Pathways... "
## [1] "GO Enrichment Analysis BP completed....done"
## [1] "GO Enrichment Analysis MF completed....done"
## [1] "GO Enrichment Analysis CC completed....done"
## [1] "Pathway Enrichment Analysis completed....done"
## Display the GSEA Results ##
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
GOBPTab = ansEA$ResBP,
nRGTab = rownames(dataDEGs),
nBar = 20,
filename="BP.pdf")
## png
## 2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResCC),
GOCCTab = ansEA$ResCC,
nRGTab = rownames(dataDEGs),
nBar = 20,
filename="CC.pdf")
## png
## 2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResPat),
PathTab = ansEA$ResPat,
nRGTab = rownames(dataDEGs),
nBar = 20,
filename="Pat.pdf")
## png
## 2
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResMF),
GOMFTab = ansEA$ResMF,
nRGTab = rownames(dataDEGs),
nBar = 20,
filename="MF.pdf")
## png
## 2
While GO analysis reveals some of the features of the differential gene expression we would like to take a more detailed look at pathway analysis.
The first step is to convert the gene symbols used in the experimental analysis to Entrez IDs that can be used for searching the KEGG database of pathways. This is complex code but it just does a simple function.
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
library(clusterProfiler)
##
## clusterProfiler v4.0.0 For help: https://guangchuangyu.github.io/software/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
library(pathview)
##
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
##
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
# DEGs TopTable
<- TCGAanalyze_LevelTab(dataDEGs,"BRCAi","BRCAiv",
dataDEGsFiltLevel colnames(dataFiltBRCAi)],
dataFilt[,colnames(dataFiltBRCAiv)])
dataFilt[,
$GeneID <- 0
dataDEGsFiltLevel
# Converting Gene symbol to geneID
= as.data.frame(bitr(dataDEGsFiltLevel$mRNA,
eg fromType="SYMBOL",
toType="ENTREZID",
OrgDb="org.Hs.eg.db"))
## Loading required package: org.Hs.eg.db
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
##
## select
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(dataDEGsFiltLevel$mRNA, fromType = "SYMBOL", toType =
## "ENTREZID", : 13.26% of input gene IDs are fail to map...
<- eg[!duplicated(eg$SYMBOL),]
eg
<- dataDEGsFiltLevel[dataDEGsFiltLevel$mRNA %in% eg$SYMBOL,]
dataDEGsFiltLevel
<- dataDEGsFiltLevel[order(dataDEGsFiltLevel$mRNA,decreasing=FALSE),]
dataDEGsFiltLevel <- eg[order(eg$SYMBOL,decreasing=FALSE),]
eg
# table(eg$SYMBOL == dataDEGsFiltLevel$mRNA) should be TRUE
all(eg$SYMBOL == dataDEGsFiltLevel$mRNA)
## [1] TRUE
$GeneID <- eg$ENTREZID
dataDEGsFiltLevel
<- subset(dataDEGsFiltLevel, select = c("GeneID", "logFC"))
dataDEGsFiltLevel_sub <- as.numeric(dataDEGsFiltLevel_sub$logFC)
genelistDEGs names(genelistDEGs) <- dataDEGsFiltLevel_sub$GeneID
###################################################
# The following exports the Entrez Gene IDs to a csv file
<- names(genelistDEGs)
output write.csv(output, "gene_list.csv")
Once you have a list of Entrez Gene IDs you can use this to search KEGG by using DAVID https://david.ncifcrf.gov/
You upload the list as a gene list and carry out the functional annotation analysis. Then you pic the chart for KEGG-PATHWAY. In this case there is only one pathway with a significant value for Benjamini and Hochberg which is the proteasome if you click on this in the bottom left hand corner of the image is a number. For the proteasome this is 03050. You need this number for the next steps. This is used in the R code to create the pathway images.
Next you want to plot the differential expression on the KEGG pathway in R and Bioconductor. You do this with the following code.
<- pathview::pathview(gene.data = genelistDEGs,
hsa03050 pathway.id = "hsa03050",
species = "hsa",
limit = list(gene=as.integer(max(abs(genelistDEGs)))))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory D:/Dropbox/R-TCGA
## Info: Writing image file hsa03050.pathview.png
<- pathview::pathview(gene.data = genelistDEGs,
hsa05204 pathway.id = "hsa05204",
species = "hsa",
limit = list(gene=as.integer(max(abs(genelistDEGs)))))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory D:/Dropbox/R-TCGA
## Info: Writing image file hsa05204.pathview.png
This generates the following image.
::include_graphics(here::here("hsa03050.pathview.png")) knitr
The Differential Gene Expression in the Proteasome