Head and Neck Cancer

Author

Dr Andrew Dalby

The Project

The objective of the project is to investigate possible biomarkers for head and neck cancer, with a possible focus on subtypes with hypoxia. As there is no keyword for hypoxia in the TCGA the question was changed to look at the differences between stages i and ii and stage iii as this is when cancer transitions from local to invade more remote environments.

The data to be used on the project comes from The Cancer Genome Atlas (TCGA). This is a database of cancer samples. collected under a grant by the US government. Originally the data was in a wide range of formats but this legacy data has been replaced with more homogenous experiments in order to reduce experimental variation. The legacy data that I orginally worked on is no longer available through the download portal.

You access the data from the GDC Portal https://portal.gdc.cancer.gov/

First you can select the tumour region visually on the schematic drawings of the male and female bodies. In this case the head and neck. Once you have selected a cancer type you need to use the cohort builder to build the set of patients that you will use for the analysis.

From the general menu I suggest picking a single project as all of the data within a project will have been collected following a specific protocol. It is very important to remove as much variation between patients as possible so that you can see as clearly as possible any patterns produced by cancer itself.

Ignore the demographics tab as it is pointless and quite simply racist.

From the general diagnosis menu you are going to use this to define the two groups you are going to use for training. You are going to use AJCC clinical stage. Stages I and II are most likely to contain hypoxia which is when the tumour is starved of oxygen. When you get to stage III the tumour has its own blood supply and hypoxia is less likely. You will create two groups for the first select stage i and stage ii and later for the second select stage iii.

From the Disease History menu you do not want people with previous tumours as tumours of the jaw and throat can be secondary tumours from lung cancer and that would not be a fair test as that is really a different type of cancer. You also want patients with no prior treatments and no other synchronous malignancy.

Ignore the Treatment and Exposure menus as this data is too incomplete to use.

Ignore the Biospecimen men. In an ideal world there would be paired data for each patient with a tumour and non-tumour sample taken from similar locations so that you can control for between patient variation but in this case only a small number of control cases were collected.

In available data the simplest data to start with is Transcriptome Profile data as the data category. This is how the amounts of genes vary between samples. This is if they are switched on and if they are how intense is the signal. The data type is Gene Expression Quantification and the type of experiment is RNA-Seq (before there were other legacy methods but the TCGA has repeated all the experiments with a single experimental platform). The platform used in this case is Illumina for all samples and this is important as different providers can gives different results. Finally you need the access for the data to be open so that you can download it.

Once you have the 106 (or 100 for stage iii cases) you need to download a tsv - tab separated variable file that describes the data you wish to download. You can open with with open office calc. numbers or Excel to see column delimited data. Do not forget to save it in the spreadsheet format to make it easier to open next time.

Setting up R with Bioconductor

There is an R/bioconductor library that allows you to connect to the TCGA and download the data. First you need to install R and R-studio. Then within R you can install Bioconductor and the TCGABiolinks library.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.23")
BiocManager::install("TCGAbiolinks")
BiocManager::install("EDASeq")
BiocManager::install("edgeR")
BiocManager::install("SummarizedExperiment")
BiocManager::install("genefilter")
BiocManager::install("ConsensusClusterPlus")
BiocManager::install("ComplexHeatmap")

You only need to do this once.

Normalisation of the Data and Filtering

You next need to normalise the data to put all of the samples on to the same scale as you are looking at differences and if one sample is more intense than another then the differences will not reflect reality. This also means merging the two datasets to make a single dataset where both are scaled together.

hnsc_norm <- TCGAanalyze_Normalization(
  tabDF = cbind(dataPrep_hnsc_early,dataPrep_hnsc_late),
  geneInfo = geneInfoHT)
I Need about  136 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Step 3 of 4: betweenLaneNormalization ...
Step 4 of 4: exprs ...

Then quantile filtering is used to clean up the normalised data.

hnsc_filt <- TCGAanalyze_Filtering(tabDF = hnsc_norm,
                                  method = "quantile",
                                  qnt.cut =  0.05)  

write.csv2(hnsc_filt, "hnsc_normalised_filtered.csv")

Now the data set is divided again so that the two sets can be compared and hypothesis testing can distinguish which genes are differentially expressed between the two sets.

hnsc_early_filt <- subset(hnsc_filt, select = substr(colnames(hnsc_filt),1,12) %in% bar1)
hnsc_late_filt <- subset(hnsc_filt, select = substr(colnames(hnsc_filt),1,12) %in% bar2)

Identifying Differentially Expressed Genes

Now you can identify the genes that are differentially expressed.

diff_expressed_genes <- TCGAanalyze_DEA(mat1 = hnsc_early_filt,
                                        mat2 = hnsc_late_filt,
                                        Cond1type = "hnsc_early",
                                        Cond2type = "hnsc_late",
                                        fdr.cut = 0.01 ,
                                        logFC.cut = 1,
                                        method = "glmLRT")
Batch correction skipped since no factors provided
----------------------- DEA -------------------------------
o 102 samples in Cond1type hnsc_early
o 92 samples in Cond2type hnsc_late
o 42540 features as miRNA or genes 
This may take some minutes...
----------------------- END DEA -------------------------------

Gene Set Enrichment Analysis

This can then be processed to create a genelist and you can carry out tests for enrichment analysis - are specific processes/pathways being regulated rather than looking at individual genes.

gene.list <- diff_expressed_genes[,6]

ansEA <- TCGAanalyze_EAcomplete(
  TFname = "DEA Genes Stages I and II Vs Stage III",
  RegulonList=gene.list
)
[1] "I need about  1 minute to finish complete  Enrichment analysis GO[BP,MF,CC] and Pathways... "
[1] "GO Enrichment Analysis BP completed....done"
[1] "GO Enrichment Analysis MF completed....done"
[1] "GO Enrichment Analysis CC completed....done"
[1] "Pathway Enrichment Analysis completed....done"

This gene list enrichment analysis can then be visualised graphically

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        filename = NULL,
                        GOBPTab = ansEA$ResBP, 
                        nRGTab = gene.list,
                        nBar = 10)

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        filename = NULL,
                        GOCCTab = ansEA$ResCC,
                        nRGTab = gene.list,
                        nBar = 10)

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        filename = NULL,
                        PathTab = ansEA$ResPat,
                        nRGTab = gene.list,
                        nBar = 10)

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        filename = NULL,
                        GOMFTab = ansEA$ResMF, 
                        nRGTab = gene.list,
                        nBar = 10)

TCGAvisualize_EAbarplot(
  tf = rownames(ansEA$ResBP), 
  GOBPTab = ansEA$ResBP,
  GOCCTab = ansEA$ResCC,
  GOMFTab = ansEA$ResMF,
  PathTab = ansEA$ResPat,
  nRGTab = gene.list, 
  nBar = 10
)
png 
  2