++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ # ANÁLISIS DE ENRIQUECIMIENTO ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PUNTOS CLAVES

Un gen puede estar asociado a varios terminos GO/KEGG
Los resultados de los análisis de enriquecimiento usando terminos GO o KEGG frecuentemente difieren entre sí.
Los métodos de eliminación de redundancia de terminos son dinámicos y se siguen desarrollando.
Los análisis de enriquecimiento se hacen sobre las anotaciones y no sobre los genes.

1. OBJETIVOS

Aprender la secuencia de pasos necesaria para asociar los ID de genes con los terminos GO/KEGG
Definir las rutas metabólicas sobrerrepresentadas en un grupo de genes

2. TEMARIO

Biomart
- Generalidades
- Instalación de Biomart
- Grupo de edatos en Biomart
- Hacere un “query” en Biomart
- Búsqueda para definir parámetros
- Fijando resultados
Asociación de genes
- Asociación de términos GO con Biomart
- Asociación de términos GO con base de datos externos
Análisis de sobreexpresión

3. BIOMART

3.1 Generalidades

Biomart es un repositorio de proyectos intercinacionales, abierto a la comunidad científica. Veamos una pequeña introducción sobre el tema (https://youtu.be/C2g37X_uMok) Este repositorio tiene un “browser” específico para la busqueda de datos. R nos permite hacer las busquedas y bajar las bases de datos de forma remota con el paquete BiomaRt. Podemos ingresar a la herramienta en linea desde este link: https://www.ensembl.org/biomart/martview/96f2cd8f0a3ce4d797271467fd332b69

3.2 Instalación de Biomart

Primero necesitamos instalar biomart a traves de bioconductor, con la siguiente función (esto toma un poco de tiempo y hay que colocar “a” en la consola, demora bastante tiempo): NOTA: Para instalar la última versión de BiomaRt es necesario tener una versión de R de 4.0 (para verificar la versión de R, podemos usar la función verison() ejecutada directamente en la consola)

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.11")
BiocManager::install("biomaRt")

Una vez que se haya descargado el paquete, es necesario instalarlo en el entorno de trabajo utilizando la función library()

library(biomaRt)

3.3 Grupo de datos en Biomart

Como mencionamos anteriormente Biomart es un repositorio de bases de datos internacional de libre accesso. Para saber a que datos podemos acceder con BiomaRt podemos utilizar la función listmart. Si Uds entrar por su navegador, podrán ver que esto son las posibiliades que se desplegan en “CHOOSE DATABASE” (https://m.ensembl.org/biomart/martview/d5341457766611dc7d72538ef6c78f9e)

biomaRt::listMarts()

##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 101
## 2   ENSEMBL_MART_MOUSE      Mouse strains 101
## 3     ENSEMBL_MART_SNP  Ensembl Variation 101
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 101

Podemos observar que tiene cuatro elementos. Para establecer una conección con cualquiera de las bases de datos disponibles en BiomaRt usarermos la función usemart() y la guardaremos en un objeto llamado ensembl:

ensembl <- biomaRt::useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL")

Una vez que seleccionamos la base de datos, podemos seleccionar un grupo de datos (subset). Para saber la lista de subsets que tiene la base de datos usamos la función listDatasets():

datasets <- biomaRt::listDatasets(ensembl)

Podemos observar que la función listDatasets() genera un objeto data.frame que tiene 3 columnas (“dataset”, “description”, “version”), y que cuenta con 203 observaciones

head(datasets)

##                      dataset                                  description
## 1   acalliptera_gene_ensembl             Eastern happy genes (fAstCal1.2)
## 2 acarolinensis_gene_ensembl               Anole lizard genes (AnoCar2.0)
## 3  acchrysaetos_gene_ensembl              Golden eagle genes (bAquChr1.2)
## 4  acitrinellus_gene_ensembl               Midas cichlid genes (Midas_v5)
## 5  amelanoleuca_gene_ensembl                        Panda genes (ailMel1)
## 6    amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0)
##                  version
## 1             fAstCal1.2
## 2              AnoCar2.0
## 3             bAquChr1.2
## 4               Midas_v5
## 5                ailMel1
## 6 Astyanax_mexicanus-2.0

dim(datasets)

## [1] 203   3

str(datasets)

## 'data.frame':    203 obs. of  3 variables:
##  $ dataset    : 'AsIs' chr  "acalliptera_gene_ensembl" "acarolinensis_gene_ensembl" "acchrysaetos_gene_ensembl" "acitrinellus_gene_ensembl" ...
##  $ description: 'AsIs' chr  "Eastern happy genes (fAstCal1.2)" "Anole lizard genes (AnoCar2.0)" "Golden eagle genes (bAquChr1.2)" "Midas cichlid genes (Midas_v5)" ...
##  $ version    : 'AsIs' chr  "fAstCal1.2" "AnoCar2.0" "bAquChr1.2" "Midas_v5" ...

Para encontrar donde se encuantra el registro correspondinete a " Homo sapiens " (hsapiens_gene_ensembl), podemos usar la función str_detect() del paquete “stringr”.

# install.packages("stringr", dependencies = T)
library(stringr)
datasets[stringr::str_detect(string = "hsapiens_gene_ensembl",pattern = as.character(datasets$dataset) ),]

##                  dataset              description    version
## 80 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Observamos que el registro de ser humano está en la posición número 80 y que la versión es la GRCh38.p13 (http://www.ensembl.org/Homo_sapiens/Info/Annotation) Para seleccinar la base de datos de seres humanos, podemos resescribir nuestro objeto “ensembl” usando la función useDataset():

ensembl <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)
str(ensembl)

## Formal class 'Mart' [package "biomaRt"] with 7 slots
##   ..@ biomart   : chr "ENSEMBL_MART_ENSEMBL"
##   ..@ host      : chr "https://www.ensembl.org:443/biomart/martservice"
##   ..@ vschema   : chr "default"
##   ..@ version   : chr(0) 
##   ..@ dataset   : chr "hsapiens_gene_ensembl"
##   ..@ filters   :'data.frame':   432 obs. of  9 variables:
##   .. ..$ name           : chr [1:432] "chromosome_name" "start" "end" "band_start" ...
##   .. ..$ description    : chr [1:432] "Chromosome/scaffold name" "Start" "End" "Band Start" ...
##   .. ..$ options        : chr [1:432] "[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,CHR_HG1_PATCH,CHR_HG26_PATCH,CHR_HG28_PATCH,CHR_HG30_"| __truncated__ "[]" "[]" "[]" ...
##   .. ..$ fullDescription: chr [1:432] "" "Determine which base pair on the specified chromosome/scaffold to begin range" "Determine which base pair on the specified chromosome/scaffold to end range" "" ...
##   .. ..$ filters        : chr [1:432] "filters" "filters" "filters" "filters" ...
##   .. ..$ type           : chr [1:432] "text" "text" "text" "text" ...
##   .. ..$ operation      : chr [1:432] "=" ">=" "<=" "=" ...
##   .. ..$ filters8       : chr [1:432] "hsapiens_gene_ensembl__gene__main" "hsapiens_gene_ensembl__gene__main" "hsapiens_gene_ensembl__gene__main" "pointer dataset" ...
##   .. ..$ filters9       : chr [1:432] "name_1059" "seq_region_end_1020" "seq_region_start_1020" "band_1027" ...
##   ..@ attributes:'data.frame':   3061 obs. of  7 variables:
##   .. ..$ name           : chr [1:3061] "ensembl_gene_id" "ensembl_gene_id_version" "ensembl_transcript_id" "ensembl_transcript_id_version" ...
##   .. ..$ description    : chr [1:3061] "Gene stable ID" "Gene stable ID version" "Transcript stable ID" "Transcript stable ID version" ...
##   .. ..$ fullDescription: chr [1:3061] "Stable ID of the Gene" "Versionned stable ID of the Gene" "Stable ID of the Transcript" "Versionned stable ID of the Transcript" ...
##   .. ..$ page           : chr [1:3061] "feature_page" "feature_page" "feature_page" "feature_page" ...
##   .. ..$ attributes5    : chr [1:3061] "html,txt,csv,tsv,xls" "html,txt,csv,tsv,xls" "html,txt,csv,tsv,xls" "html,txt,csv,tsv,xls" ...
##   .. ..$ attributes6    : chr [1:3061] "hsapiens_gene_ensembl__gene__main" "hsapiens_gene_ensembl__gene__main" "hsapiens_gene_ensembl__transcript__main" "hsapiens_gene_ensembl__transcript__main" ...
##   .. ..$ attributes7    : chr [1:3061] "stable_id_1023" "gene__main_stable_id_version" "stable_id_1066" "transcript__main_stable_id_version" ...

Ahora el objeto “ensembl” generado es tipo lista, creado por el paquete (Objeto S4), que tiene 7 slots:

@ biomart: vector tipo caracter que tiene el nombre del datasets (“ENSEMBL_MART_ENSEMBL”).
@ host: vector tipo caracter que tiene la dirección de donde se descargó el dataset “https://www.ensembl.org:443/biomart/martservice”)
@ vschema: vector tipo caracter que tiene el detalle de como fue importado el dataset (“default”)
@ version: vector tipo caracter que debe especificar la versión del dataset.
@ dataset: vector tipo caracter que especifica el nombre del dataset (“hsapiens_gene_ensembl”).
@ filters: objeto tipo data.frame que tiene las varibles por las que podemos filtra nuestra información (432)
@ attributes: objeto tipo data.frame que especifica los atributos de las variables (3061)

Para darnos una idea del contenido de los filtros que podemos usar veamos una parte del data.frame “@filters”

head(ensembl@filters)[,1:2]

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

El paquete Biomart nos ofrece una función que nos permite ver que filtros podríamos utilizar en nuestro dataset con la función “listFilters()”.

filtros <- biomaRt::listFilters(ensembl)
head(filtros)

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

Como verán esta función es muy parecida a la que usamos anteriormente. De la misma manera podemos explorar el data.frame “Attrributes”

head(ensembl@attributes)

##                            name                  description
## 1               ensembl_gene_id               Gene stable ID
## 2       ensembl_gene_id_version       Gene stable ID version
## 3         ensembl_transcript_id         Transcript stable ID
## 4 ensembl_transcript_id_version Transcript stable ID version
## 5            ensembl_peptide_id            Protein stable ID
## 6    ensembl_peptide_id_version    Protein stable ID version
##                          fullDescription         page          attributes5
## 1                  Stable ID of the Gene feature_page html,txt,csv,tsv,xls
## 2       Versionned stable ID of the Gene feature_page html,txt,csv,tsv,xls
## 3            Stable ID of the Transcript feature_page html,txt,csv,tsv,xls
## 4 Versionned stable ID of the Transcript feature_page html,txt,csv,tsv,xls
## 5                                        feature_page html,txt,csv,tsv,xls
## 6                                        feature_page html,txt,csv,tsv,xls
##                                attributes6                         attributes7
## 1        hsapiens_gene_ensembl__gene__main                      stable_id_1023
## 2        hsapiens_gene_ensembl__gene__main        gene__main_stable_id_version
## 3  hsapiens_gene_ensembl__transcript__main                      stable_id_1066
## 4  hsapiens_gene_ensembl__transcript__main  transcript__main_stable_id_version
## 5 hsapiens_gene_ensembl__translation__main                      stable_id_1070
## 6 hsapiens_gene_ensembl__translation__main translation__main_stable_id_version

Y el paquete BiomaRt también nos ofrece una función para listar los atributos que podemos utilizar con nuestro dataset, con la función listAtrributes()

atributos <- biomaRt::listAttributes(ensembl)
head(atributos)

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

La lista de atributos es muy extensa. Biomart agrupo estos atributos en grupos, usando la columna “page” como factor. Para saber cuales son las categorias disponibles para este factor, Biomart nos ofrece la función attributePages

biomaRt::attributePages(mart = ensembl)

## [1] "feature_page" "structure"    "homologs"     "snp"          "snp_somatic" 
## [6] "sequences"

Vemos que se consideran 6 categorías. Nosotros podemos llamar a una de las categorías en específico y mostrar en pantalla la lista de atributos que están considerado dentro “snp” por ejemplo.

head(biomaRt::listAttributes(mart = ensembl, page = "snp"))

##                               name                  description page
## 2922               ensembl_gene_id               Gene stable ID  snp
## 2923       ensembl_gene_id_version       Gene stable ID version  snp
## 2924                       version               Version (gene)  snp
## 2925         ensembl_transcript_id         Transcript stable ID  snp
## 2926 ensembl_transcript_id_version Transcript stable ID version  snp
## 2927            transcript_version         Version (transcript)  snp

3.4 Hacer un “query” en Biomart

Para generar el “query” usamos la función getBM(), que tiene 3 parámetros:

attributes: Es un objeto vector con la lista de atributos que quiero importar
filters: Es un objeto vector con la lista de filtros para seleccionar mi “query”.
values: Es un objeto vector con valores para los filtros.
mart: es el objeto Mart que vamos a analizar, en nuestro caso es ensembl.

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id', 'external_gene_name'), 
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)

##   affy_hg_u133_plus_2 entrezgene_id external_gene_name
## 1         209310_s_at           837              CASP4
## 2           207500_at           838              CASP5
## 3           202763_at           836              CASP3

geneID = c("3064","9370","6855", "351", "7124")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id', 'external_gene_name'), 
      filters = 'entrezgene_id', 
      values = geneID, 
      mart = ensembl)

##    affy_hg_u133_plus_2 entrezgene_id external_gene_name
## 1                               3064                HTT
## 2          202389_s_at          3064                HTT
## 3          202390_s_at          3064                HTT
## 4          214953_s_at           351                APP
## 5            200602_at           351                APP
## 6          211277_x_at           351                APP
## 7                                351                APP
## 8            213200_at          6855                SYP
## 9                               6855                SYP
## 10         207113_s_at          7124                TNF
## 11           207175_at          9370             ADIPOQ

3.5 Busqueda para definir parámetros en los filtros, atributos y datasets

3.5.1 Busqueda de parámetros válidos para filtros, atributos y datasets

Cuando desconozcamos el nombre exacto del atributo, filtro o dataset; el paquete BiomaRt tiene funciones específicas para hacerlo como las funciones searchDatasets(), searchAtributes() y searchFilters(). Por ejemplo si queremos buscar una base de datos de Homo sapiens podemos usar la función searchDatasets():

biomaRt::searchDatasets(mart = ensembl, pattern = "hsapiens")

##                  dataset              description    version
## 80 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Ahora vamos a buscar las posiciones de los datasets para los siguientes organimos:ratón de liberatorio( Mus musculus ), rata de laboratorio ( rnorvegicus ), zebra fish ( Dario rerio )

biomaRt::searchDatasets(mart = ensembl, pattern = "mmusculus")

##                    dataset             description   version
## 106 mmusculus_gene_ensembl Mouse genes (GRCm38.p6) GRCm38.p6

biomaRt::searchDatasets(mart = ensembl, pattern = "drerio")

##                dataset              description version
## 55 drerio_gene_ensembl Zebrafish genes (GRCz11)  GRCz11

biomaRt::searchDatasets(mart = ensembl, pattern = "rnorvegicus")

##                      dataset          description  version
## 160 rnorvegicus_gene_ensembl Rat genes (Rnor_6.0) Rnor_6.0

Para buscar en los atributos aquellos que estén relacionados con el patrón “hgnc”, usamos la siguiente línea de comando:

biomaRt::searchAttributes(mart = ensembl, pattern = "hgnc")

##               name        description         page
## 63         hgnc_id            HGNC ID feature_page
## 64     hgnc_symbol        HGNC symbol feature_page
## 94 hgnc_trans_name Transcript name ID feature_page

Si queremos encontrar los atributos relacionados a “affymetrics” podemos buscar aquellos que tengan el patrón “affy”

biomaRt::searchAttributes(mart = ensembl, pattern = "affy")

##                      name                 description         page
## 104          affy_hc_g110          AFFY HC G110 probe feature_page
## 105         affy_hg_focus         AFFY HG Focus probe feature_page
## 106         affy_hg_u133a         AFFY HG U133A probe feature_page
## 107       affy_hg_u133a_2       AFFY HG U133A 2 probe feature_page
## 108         affy_hg_u133b         AFFY HG U133B probe feature_page
## 109   affy_hg_u133_plus_2   AFFY HG U133 Plus 2 probe feature_page
## 110          affy_hg_u95a          AFFY HG U95A probe feature_page
## 111        affy_hg_u95av2        AFFY HG U95Av2 probe feature_page
## 112          affy_hg_u95b          AFFY HG U95B probe feature_page
## 113          affy_hg_u95c          AFFY HG U95C probe feature_page
## 114          affy_hg_u95d          AFFY HG U95D probe feature_page
## 115          affy_hg_u95e          AFFY HG U95E probe feature_page
## 116          affy_hta_2_0          AFFY HTA 2 0 probe feature_page
## 117   affy_huex_1_0_st_v2   AFFY HuEx 1 0 st v2 probe feature_page
## 118         affy_hugenefl         AFFY HuGeneFL probe feature_page
## 119 affy_hugene_1_0_st_v1 AFFY HuGene 1 0 st v1 probe feature_page
## 120 affy_hugene_2_0_st_v1 AFFY HuGene 2 0 st v1 probe feature_page
## 121        affy_primeview        AFFY PrimeView probe feature_page
## 122         affy_u133_x3p         AFFY U133 X3P probe feature_page

Por ejemplo, si deseamos hacer una busqueda más fina usando dos parámetros, por ejemplo que el filtro contenga las palabras “ensembl” y la palabra “id” en el nombre lo podemos ejecutar la siguiente linea de código:

biomaRt::searchFilters(mart = ensembl, pattern = "ensembl.*id")

##                             name
## 56               ensembl_gene_id
## 57       ensembl_gene_id_version
## 58         ensembl_transcript_id
## 59 ensembl_transcript_id_version
## 60            ensembl_peptide_id
## 61    ensembl_peptide_id_version
## 62               ensembl_exon_id
##                                                       description
## 56                       Gene stable ID(s) [e.g. ENSG00000000003]
## 57       Gene stable ID(s) with version [e.g. ENSG00000000003.15]
## 58                 Transcript stable ID(s) [e.g. ENST00000000233]
## 59 Transcript stable ID(s) with version [e.g. ENST00000000233.10]
## 60                    Protein stable ID(s) [e.g. ENSP00000000233]
## 61     Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 62                              Exon ID(s) [e.g. ENSE00000000003]

3.5.2. Búsqueda de valores válidos para los parámetros definidos para filtros, atributos y datasets

Para definir que valores son válidos en el parámetro “chromosome_name” de los filtros podemos usar la función listFilterValues()

biomaRt::listFilterValues(mart = ensembl, filter = "chromosome_name")[20:40]

##  [1] "20"                         "21"                        
##  [3] "22"                         "CHR_HG1_PATCH"             
##  [5] "CHR_HG26_PATCH"             "CHR_HG28_PATCH"            
##  [7] "CHR_HG30_PATCH"             "CHR_HG76_PATCH"            
##  [9] "CHR_HG107_PATCH"            "CHR_HG109_PATCH"           
## [11] "CHR_HG126_PATCH"            "CHR_HG142_HG150_NOVEL_TEST"
## [13] "CHR_HG151_NOVEL_TEST"       "CHR_HG439_PATCH"           
## [15] "CHR_HG545_PATCH"            "CHR_HG699_PATCH"           
## [17] "CHR_HG705_PATCH"            "CHR_HG708_PATCH"           
## [19] "CHR_HG721_PATCH"            "CHR_HG926_PATCH"           
## [21] "CHR_HG986_PATCH"

Para refinar la busqueda podemos agregar patrones en la busqueda del valor con el parámetro pattern.

biomaRt::searchFilterValues(mart = ensembl, filter = "phenotype_description", pattern = "Crohn")

## [1] "INFLAMMATORY BOWEL DISEASE CROHN DISEASE 1" 
## [2] "INFLAMMATORY BOWEL DISEASE CROHN DISEASE 10"
## [3] "INFLAMMATORY BOWEL DISEASE CROHN DISEASE 19"
## [4] "NON RARE IN EUROPE: Crohn disease"

3.6 Fijando resultados

Para ahorrar el tiempo en las busquedas que ya realizamos anteriormente, el paquete Biomart guarda dicha busqueda en nuestra computadora. Para saber la localización de dichos archivos usamos la fucnión biomartCacheInfo()

biomaRt::biomartCacheInfo()

## biomaRt cache
## - Location: C:\Users\acard\AppData\Local\biomaRt\biomaRt\Cache
## - No. of files: 2
## - Total size: 536 bytes

4. ASOCIACIÓN DE GENES

4.1 Asociación de terminos GO usando base de datos Biomart

Para hacer esta asociación debemos seguir tres sencillos pasos: + Definir el dataset de Biomart correspondiente, para el ejemplo será el de ser humano. + Cargar la lista de genes (de nuestro experimento) + Hacere la asociación con la función getBM()

# Paso 1 definimos el dataset
ensembl <- biomaRt::useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL")
ensembl <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)
# Paso 2 cargamos la base de datos
immunome <- read.csv(paste0(dirname(getwd()),"/data/clase_3/human_gene_set.txt"))
head(immunome)

##   EntrezGeneID
## 1          100
## 2         1003
## 3        10068
## 4          101
## 5        10134
## 6        10154

# Paso 3 asociamos anotacionese
goannot <- biomaRt::getBM(attributes = c("entrezgene_id", "go_id", "name_1006"),
                          filters = "entrezgene_id", 
                          values = immunome,
                          mart = ensembl)

La función getBM() genera un objeto lista, dentro del cual encontramos un data.frame con el número de columnas que nosotros indicamos en el parámetro attributes de la función getBM.

typeof(goannot)

## [1] "list"

str(goannot)

## 'data.frame':    27563 obs. of  3 variables:
##  $ entrezgene_id: int  100 100 100 100 100 100 100 100 100 100 ...
##  $ go_id        : chr  "" "GO:0019239" "GO:0004000" "GO:0009168" ...
##  $ name_1006    : chr  "" "deaminase activity" "adenosine deaminase activity" "purine ribonucleoside monophosphate biosynthetic process" ...

Este data.frame posee 27563 registros

nrow(goannot)

## [1] 27563

head(goannot)

##   entrezgene_id      go_id
## 1           100           
## 2           100 GO:0019239
## 3           100 GO:0004000
## 4           100 GO:0009168
## 5           100 GO:0005737
## 6           100 GO:0016020
##                                                  name_1006
## 1                                                         
## 2                                       deaminase activity
## 3                             adenosine deaminase activity
## 4 purine ribonucleoside monophosphate biosynthetic process
## 5                                                cytoplasm
## 6                                                 membrane

Si generamos una tabla de frecuencias para observar las 20 funciones GO más frecuentes, observamos que

cbind(head(sort(table(goannot$name_1006), decreasing = TRUE),20)/nrow(goannot))

##                                              [,1]
## protein binding                       0.026811305
##                                       0.024598193
## membrane                              0.020135689
## plasma membrane                       0.017886297
## integral component of membrane        0.016398795
## extracellular region                  0.012153975
## cytoplasm                             0.009977143
## extracellular space                   0.009614338
## signal transduction                   0.009106411
## immune response                       0.009070130
## integral component of plasma membrane 0.008598483
## immune system process                 0.007981715
## cytosol                               0.007401226
## extracellular exosome                 0.007292385
## cell surface                          0.006748177
## nucleus                               0.006711896
## inflammatory response                 0.006167689
## innate immune response                0.006058847
## external side of plasma membrane      0.005877444
## cytokine-mediated signaling pathway   0.005696042

Esta tabla puede ser utilizada directamente para el análisis ORA con el paquete topGO.

4.2 Asociación de terminos GO usando base de datos externa

Tambien podemos utilizar bases de datos externas para anotar los terminos GO a nuestra lista de genes, con la base de datos org.Hs.eg.db que está incorporada en el paquete AnnotationDbi. Para ello necesitamos cargar la librería:

library(org.Hs.eg.db)

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

##

AnnotationDbi::columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
## [26] "UNIPROT"

La base de edatos org.Hs.eg.db tiene las siguientes columnas que puede ser asociadas con nuesetra lista de genes. Si deseamos alguna definción específica de alguna columna podememos usar la función help() y el nombre de la columna.

help("SYMBOL")

## starting httpd help server ... done

Para imprimir los valores de una columna podemeos usar la función keys() y especificar el nombre de la columna en el parámetro keytype =

cbind(name = head(keys(org.Hs.eg.db, keytype="GENENAME"),10),
      symbol = head(keys(org.Hs.eg.db, keytype="SYMBOL"),10),
      uniprot = head(keys(org.Hs.eg.db, keytype="UNIPROT"),10),
      path = head(keys(org.Hs.eg.db, keytype="PATH"),10),
      GO = head(keys(org.Hs.eg.db, keytype="GO"),10),
      entrezID = head(keys(org.Hs.eg.db, keytype="ENTREZID"),10)
      )

##       name                                      symbol     uniprot      path   
##  [1,] "alpha-1-B glycoprotein"                  "A1BG"     "P04217"     "04610"
##  [2,] "alpha-2-macroglobulin"                   "A2M"      "V9HWD8"     "00232"
##  [3,] "alpha-2-macroglobulin pseudogene 1"      "A2MP1"    "P01023"     "00983"
##  [4,] "N-acetyltransferase 1"                   "NAT1"     "P18440"     "01100"
##  [5,] "N-acetyltransferase 2"                   "NAT2"     "Q400J6"     "00380"
##  [6,] "N-acetyltransferase pseudogene"          "NATP"     "F5H5R8"     "00970"
##  [7,] "serpin family A member 3"                "SERPINA3" "A4Z6T7"     "00250"
##  [8,] "arylacetamide deacetylase"               "AADAC"    "P11245"     "00280"
##  [9,] "angio associated migratory cell protein" "AAMP"     "A0A024R6P0" "00410"
## [10,] "aralkylamine N-acetyltransferase"        "AANAT"    "P01011"     "00640"
##       GO           entrezID
##  [1,] "GO:0002576" "1"     
##  [2,] "GO:0003674" "2"     
##  [3,] "GO:0005576" "3"     
##  [4,] "GO:0005615" "9"     
##  [5,] "GO:0008150" "10"    
##  [6,] "GO:0031093" "11"    
##  [7,] "GO:0034774" "12"    
##  [8,] "GO:0043312" "13"    
##  [9,] "GO:0062023" "14"    
## [10,] "GO:0070062" "15"

Para poder asociar los téminos GO de la base de datos org.Hs.eg.db con la lista de nuestros genes, que está en el objeto inmunome, podemos hacerlo con la función select() del paquete Biomart. La función select() de Biomart tiene 4 parámetros:

x = que define la base de datos externa
keys = es el vector caracter que define los ID de los genes
columns = La columna de la base de datos externa que se va a asociar a los IDs
keytype = define el tipo de datos del parámetro key

head(immunome)

##   EntrezGeneID
## 1          100
## 2         1003
## 3        10068
## 4          101
## 5        10134
## 6        10154

La función que generaría uan base de datos asociación el termino GO con los códigos EntrezID sería la siguiente.

goannot2 <- biomaRt::select(x = org.Hs.eg.db,
                            keys = as.character(immunome$EntrezGeneID), 
                            columns = "GO", 
                            keytype = "ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

Si nosotros quisieramos asociar el código de las rutas KEGG se utilizaría las siguientes líneas de código

pathannot <- biomaRt::select(x = org.Hs.eg.db,
                            keys = as.character(immunome$EntrezGeneID), 
                            columns = "PATH", 
                            keytype = "ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

NOTA: Este tiene un bug, por lo que su funcionamiento es inestable

5. ANÁLISIS DE SOBREEXPRESIÓN

Para hacer estos análisis debemos cargar nuestra lista de genes de referencia, que para nuestro caso será “gen.ref” y tendrá una longitud de 176 registros

gen.ref <- read.csv(paste0(dirname(getwd()),"/data/clase_3/human_gene_selected_set.txt"))
head(gen.ref)

##   EntrezGeneID
## 1         4282
## 2          717
## 3         7097
## 4          629
## 5         8993
## 6         8685

dim(immunome)

## [1] 892   1

dim(gen.ref)

## [1] 176   1

5.1 Análisis hipergeométrico con GOstat

Vamos a utilizar el paquete GOstat para lo cual necesitamos instalarlo por Bioconductor

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GOstats")

## Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)

## Installing package(s) 'GOstats'

## Installation path not writeable, unable to update packages: MASS, mgcv, nlme

## Old packages: 'cpp11', 'rmarkdown', 'RSQLite', 'survival', 'xfun'

Una vez instalado el paquete “GOstats”, cargamos la librería a nuestro entorno de trabajo con la función library().

library("GOstats")

## Loading required package: Category

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:S4Vectors':
## 
##     expand

## Loading required package: graph

## 
## Attaching package: 'graph'

## The following object is masked from 'package:stringr':
## 
##     boundary

##

## 
## Attaching package: 'GOstats'

## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph

library("org.Hs.eg.db")

Luego creamos un objeto GOHyperGParams para realizar los cálculos hipergeométricos, con la función new. El objeto GOHyperGParams tiene 7 parámetros: + geneIds = donde se define los ID de los genes de interes + universeGeneIds = donde se define los ID de los genes de referencia (puede ser todo el genoma) + annotation = donde se define se define la base de datos donde se va a referenciar los terminos GO buscados + ontology = donde se define el grupo de terminos GO que se van a asociar * BP = de los procesos biológicos relacionados * CC = de los componentes celulares asociados * MF = de la función molecular asociada + pvalueCutoff = donde se define el valor de p de la significancia + conditional = donde se define los parámetros de las anotaciones del paquete + testDirection = donde se define el tipo de relación que queremos estudiar

paramsGO <- new("GOHyperGParams", 
                geneIds = gen.ref$EntrezGeneID,
                universeGeneIds = immunome$EntrezGeneID,
                annotation = "org.Hs.eg.db", 
                ontology = "BP", 
                pvalueCutoff = 0.01,
                conditional = FALSE, 
                testDirection = "over")

Una vez que ya se creo el objeto GOHyperGParams, podemos realizar el cálculo de enriquecimiento con la función hyperGTest() sobre el objeto creado

over.go <- GOstats::hyperGTest(paramsGO)
head(summary(over.go))

##       GOBPID       Pvalue OddsRatio ExpCount Count Size
## 1 GO:0045087 7.020613e-36  9.531915 56.67047   128  284
## 2 GO:0098542 1.050883e-27  7.095217 68.64310   132  344
## 3 GO:0043207 2.902310e-25  6.800000 79.81756   140  400
## 4 GO:0051707 2.902310e-25  6.800000 79.81756   140  400
## 5 GO:0009607 4.136164e-25  6.758621 80.01710   140  401
## 6 GO:0044419 1.519166e-23  6.873355 89.99430   147  451
##                                         Term
## 1                     innate immune response
## 2         defense response to other organism
## 3       response to external biotic stimulus
## 4                 response to other organism
## 5                response to biotic stimulus
## 6 interspecies interaction between organisms

Los resultados muestran que los términos GO: BP más relevantes son: + innate immune response + defense response to other organism + response to external biotic stimulus + response to other organism + response to biotic stimulus

Ahora podemos realizar el mismo procedimiento para realizar el test hipergeometrico en los caminos referenciados en KEGG

paramsKEGG <- new("KEGGHyperGParams",
                geneIds = gen.ref$EntrezGeneID,
                universeGeneIds = immunome$EntrezGeneID,
                annotation = "org.Hs.eg.db", 
                pvalueCutoff = 0.01, 
                testDirection = "over")
over.kegg <- GOstats::hyperGTest(paramsKEGG)
head(summary(over.kegg))

## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be considered
##   deprecated and future versions of Bioconductor may not have it
##   available.  Users who want more current data are encouraged to look
##   at the KEGGREST or reactome.db packages

##   KEGGID       Pvalue OddsRatio  ExpCount Count Size
## 1  04610 2.125415e-15 19.080107  8.653465    31   38
## 2  04620 1.409175e-10  7.847222 10.475248    30   46
## 3  05215 1.911128e-07 24.232000  3.415842    13   15
## 4  04622 1.829330e-06  9.390244  4.782178    15   21
## 5  04010 3.140323e-06  5.496368  7.742574    20   34
## 6  04650 1.782577e-05  3.235736 15.257426    30   67
##                                        Term
## 1       Complement and coagulation cascades
## 2      Toll-like receptor signaling pathway
## 3                           Prostate cancer
## 4     RIG-I-like receptor signaling pathway
## 5                    MAPK signaling pathway
## 6 Natural killer cell mediated cytotoxicity

El análisis de las rutas mas relevantes es: + Complement and coagulation cascades + Toll-like receptor signaling pathway + Prostate cancer + RIG-I-like receptor signaling pathway + MAPK signaling pathway

5.2 Análisis ORA usando topGO

El paquete topGO al igual que GOstat, permite el análisis hipergeometrico para determinar el enriquecimiento de las vias, pero además topGO permite el análisis utilizando valores de significancia que no es procesado por GOstat. Además cuenta con otro tipo de pruebas aparte de la hipergeométrica. Primero usaremos topGO para realizar la prueba hipergeométrica como GOstat, para ello asociaremos los terminos GO a la base de datos de genes de referencia, que en nuestro caso es immunome

library(org.Hs.eg.db)
library(topGO)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

## 
## groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

## 
## Attaching package: 'topGO'

## The following object is masked from 'package:GOstats':
## 
##     testName

## The following objects are masked from 'package:Category':
## 
##     ontology, ontology<-, testName

## The following object is masked from 'package:IRanges':
## 
##     members

goannot2 <- select(org.Hs.eg.db, 
                   keys = as.character(immunome$EntrezGeneID),
                   columns="GO",
                   keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

Si vemos la estructura del objeto goannot2 observamos que es un data.frame de 4 columnas, la primera es el ID del gen, la segunda son los terminos GO en código, la tercera es el nivel de evidencia, y la cuarta el termino parámetro GO (BP, MF, CC)

str(goannot2)

## 'data.frame':    25550 obs. of  4 variables:
##  $ ENTREZID: chr  "100" "100" "100" "100" ...
##  $ GO      : chr  "GO:0001666" "GO:0001821" "GO:0001829" "GO:0001883" ...
##  $ EVIDENCE: chr  "IDA" "IEA" "IEA" "IEA" ...
##  $ ONTOLOGY: chr  "BP" "BP" "BP" "MF" ...

Para poder utilizar estos datos en el paquete topGO es necesario generar dos vectores: uno con el ID de los genes (que en nuestro caso se llamará genid) y otro obejeto lista con el código de los terminos GO cuyos nombres de fila correspondan al ID del gen (este vector lo llamaremos gene2GO)

a <- 1
gene2GO <- c()
genevec <- c()
for (genid in immunome$EntrezGeneID){
  genevec[a] <- genid
  gene2GO[a] <- list(goannot2[goannot2$ENTREZID==genid,"GO"])
  a <- a+1
}
names(gene2GO)<-genevec

Ahora, debemos anotar cuales son los genes de interes en la lista de genes de referencia usando una codificación de 1 para gen de interes y 0 para el resto. Esto lo podemos hacer con las siguiente código

geneList <- rep(0,length(immunome$EntrezGeneID))
geneList[immunome$EntrezGeneID %in% gen.ref$EntrezGeneID] <- 1
names(geneList) <- genevec
geneList <- factor(geneList)

Una vez con todos estos elementos podemos construir el objeto topGOdata necesario para el análisis de topGO. Este obejto tiene 5 parámetros: + ontology = donde se define el parámetro de los terminos GO que se va a emplear (BP, CC, MF) + description = donde se asigna un nombre/descripción personalizada al objeto + allGenes = donde se asigna un vector numerico binomial (0 y 1) indicando las posiciones de los genes de inters en la lista de referencia de genes + allGenes = en este parpametro se define que casle de información va a ser almacenada en la variable resultante. + ene2GO = donde se define el objeto lista con los terminos GO

GOdata.MF <- new("topGOdata", 
                 ontology = "MF", 
                 description = "MF on innate immunity genes", 
                 allGenes = geneList, 
                 annot = annFUN.gene2GO,
                 gene2GO = gene2GO)

## 
## Building most specific GOs .....

##  ( 815 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 1132 GO terms and 1441 relations. )

## 
## Annotating nodes ...............

##  ( 852 genes annotated to the GO terms. )

El siguiente paso en es el análisis

# ------------------------------------------------------------------------------
resultFisher.MF.classic <- runTest(GOdata.MF, 
                                   algorithm = "classic", 
                                   statistic = "fisher")

## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 485 nontrivial nodes
##       parameters: 
##           test statistic: fisher

allRes.MF.classic <- GenTable(GOdata.MF, 
                              classicFisher = resultFisher.MF.classic,
                              topNodes = 20)
head(allRes.MF.classic)

##        GO.ID                                    Term Annotated Significant
## 1 GO:0140096 catalytic activity, acting on a protein       103          42
## 2 GO:0035325              Toll-like receptor binding         9           9
## 3 GO:0036094                  small molecule binding        94          38
## 4 GO:0001848                      complement binding        15          12
## 5 GO:0008144                            drug binding        67          30
## 6 GO:0017076               purine nucleotide binding        61          28
##   Expected classicFisher
## 1    20.19       8.4e-08
## 2     1.76       3.6e-07
## 3    18.42       5.5e-07
## 4     2.94       6.0e-07
## 5    13.13       8.1e-07
## 6    11.96       1.1e-06

# ------------------------------------------------------------------------------
resultFisher.MF.weight <- runTest(GOdata.MF, 
                                  algorithm = "weight", 
                                  statistic = "fisher")

## 
##           -- Weight Algorithm -- 
## 
##       The algorithm is scoring 485 nontrivial nodes
##       parameters: 
##           test statistic: fisher : ratio

## 
##   Level 10:  6 nodes to be scored.

## 
##   Level 9:   20 nodes to be scored.

## 
##   Level 8:   26 nodes to be scored.

## 
##   Level 7:   58 nodes to be scored.

## 
##   Level 6:   98 nodes to be scored.

## 
##   Level 5:   133 nodes to be scored.

## 
##   Level 4:   96 nodes to be scored.

## 
##   Level 3:   38 nodes to be scored.

## 
##   Level 2:   9 nodes to be scored.

## 
##   Level 1:   1 nodes to be scored.

allRes.MF.weight <- GenTable(GOdata.MF, 
                             classicFisher = resultFisher.MF.weight,
                             topNodes = 20)
head(allRes.MF.weight)

##        GO.ID                                        Term Annotated Significant
## 1 GO:0035325                  Toll-like receptor binding         9           9
## 2 GO:0001848                          complement binding        15          12
## 3 GO:0035639 purine ribonucleoside triphosphate bindi...        61          28
## 4 GO:0008144                                drug binding        67          30
## 5 GO:0042802                   identical protein binding       178          58
## 6 GO:0046983               protein dimerization activity       118          40
##   Expected classicFisher
## 1     1.76       3.6e-07
## 2     2.94       6.0e-07
## 3    11.96       1.1e-06
## 4    13.13       1.9e-06
## 5    34.89       2.1e-06
## 6    23.13       5.2e-05

# ------------------------------------------------------------------------------
resultFisher.MF.parentchild <- runTest(GOdata.MF, 
                                       algorithm = "parentchild",
                                       statistic = "fisher")

## 
##           -- Parent-Child Algorithm -- 
## 
##       the algorithm is scoring 485 nontrivial nodes
##       parameters: 
##           test statistic: fisher : joinFun = union

## 
##   Level 10:  6 nodes to be scored.

## 
##   Level 9:   20 nodes to be scored.

## 
##   Level 8:   26 nodes to be scored.

## 
##   Level 7:   58 nodes to be scored.

## 
##   Level 6:   98 nodes to be scored.

## 
##   Level 5:   133 nodes to be scored.

## 
##   Level 4:   96 nodes to be scored.

## 
##   Level 3:   38 nodes to be scored.

## 
##   Level 2:   9 nodes to be scored.

allRes.MF.parentchild <- GenTable(GOdata.MF, 
                                  classicFisher = resultFisher.MF.parentchild,
                                  topNodes=20)
head(allRes.MF.parentchild)

##        GO.ID                                        Term Annotated Significant
## 1 GO:0035325                  Toll-like receptor binding         9           9
## 2 GO:0001848                          complement binding        15          12
## 3 GO:0036094                      small molecule binding        94          38
## 4 GO:0008144                                drug binding        67          30
## 5 GO:0042802                   identical protein binding       178          58
## 6 GO:0035639 purine ribonucleoside triphosphate bindi...        61          28
##   Expected classicFisher
## 1     1.76       1.4e-07
## 2     2.94       8.5e-07
## 3    18.42       1.4e-06
## 4    13.13       1.8e-06
## 5    34.89       4.9e-06
## 6    11.96       1.1e-05

Enriquecimiento

Alfredo Cárdenas Rivera

18/9/2020