Authors: Pablo Fonseca / Angela Cánovas

December 3, 2017

Why to use R to search positional and functional candidate genes?

In the last tutorials, we learned how to search for positional and functional candidate genes using several software. This is a great strategy. However, during the process, several output files are created and we must to manage this files in a very caution way. We must convert different outputs to use as input files in others software, combine different results to select candidate genes, to create plots in order to summarize the results. The larger number of genes in the input files and the larger amount of information retrieved using the software showed up to now, can increase the complexity of those processes. R is a good alternative to manage and combine multiple results in a very straightforward way. There are several packages available in R that performs the same (and additional) analyses. Work with this kind of analysis within R allows the design of a pipeline to combine different steps and to create outputs in the same environment. These characteristics result in an analysis that is performed in a much more efficient way and simplify the management process of the intermediary files. Additionally, using R it is possible to create new kind of analysis and /or to adapt previous approaches to better fit to you data. Here, we will see some possibilities to use R to identify positional and functional candidate genes.

Package: biomaRt

The package biomaRt “provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries”. Therefore, it is possibke to retrive all the information available using the online version of Biomart, using R.

Here, we will use the same input file used in the online version of Biomart. The file to be imported in R is called “input_candidate_markers.txt”.

#The following command can  be used to install the package (you just need to remove the comment character #)
#install.packages("biomaRt")

#Loading the input files
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Checking the imported file
head(input.regions)

##   CHR                    SNP        BP A1 TEST NMISS    BETA  STAT
## 1   6 Hapmap25274-BTA-143756  14314482  A  ADD  1093 0.06197 7.352
## 2  12      ARS-BFGL-NGS-1794  88834796  B  ADD  1093 0.14470 7.259
## 3  13  Hapmap48963-BTA-96734    999229  B  ADD  1089 0.05813 7.062
## 4  20 Hapmap61037-rs29016327  45106413  A  ADD  1093 0.05491 6.909
## 5   1     ARS-BFGL-NGS-97757 149746651  A  ADD  1093 0.06505 6.872
## 6  18     ARS-BFGL-NGS-13803  14526709  B  ADD  1090 0.05189 6.780
##          P
## 1 3.82e-13
## 2 7.41e-13
## 3 2.93e-12
## 4 8.28e-12
## 5 1.07e-11
## 6 1.97e-11

#Loading the package
library(biomaRt)

After load the input file and load the package, it is possible to start the search of genes within our selected interval.

The first step is to choose the database and the organism to choose the genes

#The following command shows the databases that are available to retrive information
listMarts()

##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 90
## 2   ENSEMBL_MART_MOUSE      Mouse strains 90
## 3     ENSEMBL_MART_SNP  Ensembl Variation 90
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 90

#We will choose ENSEMBL_MART_ENSEMBL, which correspond to Ensembl Genes 90
#It is necessary to creat a object of the type "Mart" to load the database information. This is performed with the following command:
mart <- useMart("ENSEMBL_MART_ENSEMBL")

#Once the object mart was created using useMart() function, we need to select the organism. The following comand can be used to identify the correspondent argument to each organism

#listDatasets(mart)

#The function useDataset() can be used to insert the organism infortation in the mart object
mart <- useDataset("btaurus_gene_ensembl", mart)

Now, it is time to define the filters and attributes to be retrived from the database

#The following command can be used to list the filters available to use in the retriving process

#listFilters(mart)

#Here, we will use the chromosome region filter ("chromosomal_region") to retrive the genes mapped in the intervals selected
filter<-"chromosomal_region"

#The following command will be used to creat the chromosomal coordinates using a 500Kb upstream and downstream for each slected marker

value<-paste(input.regions$CHR, ":", (input.regions$BP)-500000, ":", (input.regions$BP)+500000, sep="")

value

##  [1] "6:13814482:14814482"   "12:88334796:89334796" 
##  [3] "13:499229:1499229"     "20:44606413:45606413" 
##  [5] "1:149246651:150246651" "18:14026709:15026709" 
##  [7] "14:22752097:23752097"  "12:82524081:83524081" 
##  [9] "13:14646915:15646915"  "15:70206906:71206906" 
## [11] "3:52843172:53843172"   "8:16073737:17073737"  
## [13] "10:93530802:94530802"  "6:230937:1230937"     
## [15] "8:18297386:19297386"   "5:108999140:109999140"
## [17] "1:138594903:139594903" "6:13754604:14754604"  
## [19] "5:63690317:64690317"   "18:13646373:14646373"

#The following command allows to list all the attirbutes possible to be retrived from Biomart

#listAttributes(mart)

#The attributes to be retrived will be inserted in a object
attributes <- c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position","entrezgene","external_gene_name")

Now, after to define the filters and the attirbutes, it is time to retrive the information

#At this point, we need to inform the objects created in the previous steps as arguments of the function getBM()
all.genes <- getBM(attributes=attributes, filters=filter, values=value, mart=mart)

#Checking the output
str(all.genes)

## 'data.frame':    157 obs. of  7 variables:
##  $ ensembl_gene_id   : chr  "ENSBTAG00000012798" "ENSBTAG00000030105" "ENSBTAG00000018682" "ENSBTAG00000023384" ...
##  $ hgnc_symbol       : chr  "KCNH8" "" "SETD4" "" ...
##  $ chromosome_name   : int  1 1 1 1 1 1 1 10 10 10 ...
##  $ start_position    : int  138455304 149720302 150025827 150054221 150108438 150144149 150169936 93417338 93617616 93638609 ...
##  $ end_position      : int  138676649 149720392 150048269 150064636 150110508 150153934 150293609 93571380 93677055 93638752 ...
##  $ entrezgene        : int  100336609 NA 540440 515946 510180 516036 514520 281553 785612 NA ...
##  $ external_gene_name: chr  "KCNH8" "" "SETD4" "" ...

head(all.genes)

##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798       KCNH8               1      138455304
## 2 ENSBTAG00000030105                           1      149720302
## 3 ENSBTAG00000018682       SETD4               1      150025827
## 4 ENSBTAG00000023384                           1      150054221
## 5 ENSBTAG00000030629                           1      150108438
## 6 ENSBTAG00000018688        CBR3               1      150144149
##   end_position entrezgene external_gene_name
## 1    138676649  100336609              KCNH8
## 2    149720392         NA                   
## 3    150048269     540440              SETD4
## 4    150064636     515946                   
## 5    150110508     510180                   
## 6    150153934     516036               CBR3

#Saving the ouput

write.table(all.genes, file="output_biomart_genes_within_interval.txt", row.names=F, sep="\t", quote=F)

Using the command listed above, it is possible to obtain the same results obtained in the online version of biomart. Notice that it is possible to ru these commands for all the organisms available in the Ensembl database, for all the filters and atributes. Check the help material for each function to obtain more information. For example “?getBM()”.

The manual of biomaRt can de found here:

https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

Retriving the positional candidate genes and QTLs within selected interval using a R function

We created a R function to search genes or QTLs within selected intervals. This function can handle with SNP and haplotype data.

The function is called “find_genes_qtls_around_markers”. For this function, you should inform some arguments:

db_file: For this argument you should inform the name of the file with the gene mapping or QTL information. For the gene mapping, you should use the .gtf file download from Ensembl data base (ftp://ftp.ensembl.org/pub/release-90/gtf/bos_taurus/). For the QTL search, you need to inform the .gff file that can be downloaded from Cattle QTlLdb (https://www.animalgenome.org/cgi-bin/QTLdb/BT/download?file=gbpUMD_3.1).

marker_file: The file with the SNP or haplotype positions. Detail: For SNP files, you must have a column called “CHR” and a column called “BP” with the chromosome and base pair position, respectively. For the haplotype, you must have three columns: “CHR”, “BP1” and “BP2”. All the columns names are capitals, as weel as the PLINK output.

method: “gene” or “qtl”

marker: “snp” or “haplotype”

interval: The interval in base pair

All positions must be informed using base pairs (not Kb or Mb).

Important: This function is a beta version Therefore, errors or bugs might be found. Use with carefully.

In order to execute this fundtion, it is necessary that the packages refGenome, mefa and gtools. These packages can be installed using the following command.

#install.packages(c("refGenome", "mefa","gtools"))

#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function.R")

#Running the function to search genes in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="Bos_taurus.UMD3.1.90.chr.gtf",marker_file="input_candidate_markers.txt",method="qtl",marker="snp",interval=500000)

## Loading required package: doBy

## Loading required package: RSQLite

## mefa 3.2-7    2016-01-11

## You are using the method: qtl with snp

## Warning in find_genes_qtls_around_markers(db_file =
## "Bos_taurus.UMD3.1.90.chr.gtf", : NAs introduzidos por coerção

## Starting QTL searching using 5e+05 bp as interval

## Starting analysis for chromosome 1

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 3

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 5

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 6

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 8

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 10

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 12

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 13

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 14

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 15

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 18

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 20

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

#An output file will be save in your computer. This file is called: "genes_around_SNPs.txt". Now we will import this file and check the output format.
genes.interval<-read.table("genes_around_SNPs.txt", h=T, sep="\t")
head(genes.interval)

##   CHR                    SNP       BP A1 TEST NMISS    BETA  STAT        P
## 1  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 2  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 3  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 4  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 5  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 6  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
##   chr                 ID gene_name start_pos  end_pos
## 1  20 ENSBTAG00000043813   5S_rRNA  45050487 45050605
## 2  20 ENSBTAG00000043184        U6  45414399 45414504
## 3  18 ENSBTAG00000045507    ZNF469  13740622 13753849
## 4  18 ENSBTAG00000015766     ZFPM1  13809878 13826787
## 5  18 ENSBTAG00000018507      <NA>  13860761 13865260
## 6  18 ENSBTAG00000019520    ZC3H18  13879228 13922305
##   Distance_from_gene_start Distance_from_gene_end
## 1                    55926                  55808
## 2                  -307986                -308091
## 3                   405751                 392524
## 4                   336495                 319586
## 5                   285612                 281113
## 6                   267145                 224068

Note that the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each inverval is repeated n time. Where n is equal the number of genes within this interval.

#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function.R")

#Running the function to search QTLs in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="QTL_UMD_3.1.gff",marker_file="input_candidate_markers.txt",method="qtl",marker="snp",interval=500000)

## You are using the method: qtl with snp

## Starting QTL searching using 5e+05 bp as interval

## Starting analysis for chromosome 1

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 3

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 5

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 6

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 8

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 10

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 12

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 13

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 14

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 15

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 18

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 20

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

#An output file will be save in your computer. This file is called: "QTLs_around_SNPs.txt". Now we will import this file and check the output format.
qtls.interval<-read.table("QTLs_around_SNPs.txt", h=T, sep="\t")
head(qtls.interval)

##   CHR                    SNP       BP A1 TEST NMISS    BETA  STAT        P
## 1  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 2  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 3  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 4  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 5  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 6  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
##   chr     database                 QTL_type start_pos  end_pos
## 1  20 Animal QTLdb   Production_Association  44916096 44916136
## 2  20 Animal QTLdb     Exterior_Association  44694544 44694584
## 3  20 Animal QTLdb Reproduction_Association  44694544 44694584
## 4  20 Animal QTLdb     Exterior_Association  44694544 44694584
## 5  20 Animal QTLdb         Milk_Association  44694544 44694584
## 6  20 Animal QTLdb         Milk_Association  44694544 44694584
##                                                                                                                                                                                                                                                                                                                                               extra_info
## 1 QTL_ID=69077;Name=Body weight gain;Abbrev=BWG;PUBMED_ID=19966163;trait_ID=1612;trait=Body weight gain;breed=charolais,gelbvieh,hereford,limousin,pinzgauer,red angus,red poll,simmental,Angus;FlankMarkers=rs29015683;VTO_name=postnatal growth trait;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05
## 2                                                                           QTL_ID=50811;Name=Dairy form;Abbrev=DYF;PUBMED_ID=21831322;trait_ID=1137;trait=Dairy form;breed=holstein;FlankMarkers=rs41641100;VTO_name=back conformation trait;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 3                                                 QTL_ID=50812;Name=Stillbirth (maternal);Abbrev=SB;PUBMED_ID=21831322;trait_ID=1278;trait=Stillbirth (maternal);breed=holstein;FlankMarkers=rs41641100;VTO_name=stillborn offspring quantity;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 4                                                                                       QTL_ID=50813;Name=Foot angle;Abbrev=FANG;PUBMED_ID=21831322;trait_ID=1101;trait=Foot angle;breed=holstein;FlankMarkers=rs41641100;VTO_name=hoof angle;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 5           QTL_ID=50814;Name=Milk fat percentage;Abbrev=FP;PUBMED_ID=21831322;trait_ID=1037;trait=Milk fat percentage;breed=holstein;FlankMarkers=rs41641100;VTO_name=milk fat amount;PTO_name=milk fat content;CMO_name=milk fat percentage;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 6                            QTL_ID=50815;Name=Milk fat yield;Abbrev=FY;PUBMED_ID=21831322;trait_ID=1039;trait=Milk fat yield;breed=holstein;FlankMarkers=rs41641100;VTO_name=milk fat amount;PTO_name=milk fat yield;CMO_name=milk fat yield;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
##   Distance_from_gene_start Distance_from_gene_end
## 1                   190317                 190277
## 2                   411869                 411829
## 3                   411869                 411829
## 4                   411869                 411829
## 5                   411869                 411829
## 6                   411869                 411829

Note that, as well as the gene searching output, the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each inverval is repeated n time. Where n is equal the number of genes within this interval.

The same analyses can be performed to haplotype data. You only need to change the argument informed to “marker=”. To choose the analysis using haplotypes, you need to inform “haplotype”.

Gene Ontology

The Gene Ontology enrichment analyses are also possible to be performed using R. The functions available in the packages GO.db, topGO and GOstats can be used together to perform a enrichment analysis. Additionally, the information present in the database “org.Bt.eg.db” will be used.

Firt, we need to import a table with the genes mapped inside an group of intervals. For this, we will import the table created using the function getBM() in biomaRt package.

#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")
head(genes.interval)

##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798       KCNH8               1      138455304
## 2 ENSBTAG00000030105                           1      149720302
## 3 ENSBTAG00000018682       SETD4               1      150025827
## 4 ENSBTAG00000023384                           1      150054221
## 5 ENSBTAG00000030629                           1      150108438
## 6 ENSBTAG00000018688        CBR3               1      150144149
##   end_position entrezgene external_gene_name
## 1    138676649  100336609              KCNH8
## 2    149720392         NA                   
## 3    150048269     540440              SETD4
## 4    150064636     515946                   
## 5    150110508     510180                   
## 6    150153934     516036               CBR3

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GO.db","topGO","GOstats", "org.Bt.eg.db""))

After to import the list with the positional candidate genes, we must map all the genes annoated in the org.Bt.eg.db to the correspondent GO terms.

#Loading packages
library(org.Bt.eg.db)

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

##

library(GO.db)

##

#org.Bt.egGO is an R object that provides mappings between entrez gene identifers and the GO identifers that they are directly associated with
entrez_ID<-org.Bt.egGO

#Getting the entrez gene identifiers that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_ID)

Using the above commands, it was possible to create a list with the GO terms and the genes associated to each one of the terms. Now, we can use functions from the GO.db and GOstats packages to perform an enrichment test.

#Loading the packages
library("GO.db")
library("GOstats")

## Loading required package: Category

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:IRanges':
## 
##     expand

## Loading required package: graph

## 
## Attaching package: 'GOstats'

## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph

#Creating the object with the entrez ID for each mapped gene (universe) 
universe <- mapped_genes

#Creating a object with the entrez ID for our positional candidate genes. Notice that we are using each Id only once and the class of the object must be the same that in the universe object (character)
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Creating the list of parameters to perform the enrichment test. Here, we will perform a enrichment analysis for Biological Process (BP)
params <- new('GOHyperGParams', geneIds=candidate_ID, universeGeneIds=universe, ontology='BP', pvalueCutoff=0.05, conditional=F, testDirection='over', annotation="org.Bt.eg.db")

## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds

#Running a hypergeometric test
hgOver <- hyperGTest(params)
hgOver

## Gene to GO BP  test for over-representation 
## 1068 GO BP ids tested (129 have p < 0.05)
## Selected gene set size: 39 
##     Gene universe size: 4931 
##     Annotation package: org.Bt.eg

#Saving the summary of the results in a object
result <- summary(hgOver)

#Checking the dimensions of the results
dim(result)

## [1] 129   7

#Checking the results
head(result,10)

##        GOBPID       Pvalue OddsRatio    ExpCount Count Size
## 1  GO:0017144 0.0001819737 264.37838 0.023727439     2    3
## 2  GO:0032784 0.0068208644  18.83398 0.126546339     2   16
## 3  GO:0006423 0.0079091462       Inf 0.007909146     1    1
## 4  GO:0016999 0.0079091462       Inf 0.007909146     1    1
## 5  GO:0031635 0.0079091462       Inf 0.007909146     1    1
## 6  GO:0034137 0.0079091462       Inf 0.007909146     1    1
## 7  GO:0035696 0.0079091462       Inf 0.007909146     1    1
## 8  GO:0035723 0.0079091462       Inf 0.007909146     1    1
## 9  GO:0035724 0.0079091462       Inf 0.007909146     1    1
## 10 GO:0038194 0.0079091462       Inf 0.007909146     1    1
##                                                              Term
## 1                                          drug metabolic process
## 2           regulation of DNA-templated transcription, elongation
## 3                                   cysteinyl-tRNA aminoacylation
## 4                                    antibiotic metabolic process
## 5  adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6   positive regulation of toll-like receptor 2 signaling pathway
## 7                                          monocyte extravasation
## 8                       interleukin-15-mediated signaling pathway
## 9                                       CD24 biosynthetic process
## 10                  thyroid-stimulating hormone signaling pathway

Notice that the output table shows all the GO terms with a p-value smaller than the threshold defined in the parameters. However, we do not know which genes are associated with the enriched terms. It is possible to retrive this information using the following command line. For example, let’s seacrh the genes associated with the GO:0032784, “regulation of DNA-templated transcription, elongation”.

#First, we need to create a object with a mapping information for each GO term to Entrez gene ids
go_object <- as.list(org.Bt.egGO2EG)

#After this step, we can easily retrive this information using a indexing filtering in the go_object
go.genes<-go_object[["GO:0032784"]][which(go_object[["GO:0032784"]] %in% candidate_ID)]

#Showing the Entrez ID
go.genes

##      IEA      ISS 
## "505722" "533643"

#Filtering the gene information from the input table
genes.interval[genes.interval$entrezgene %in% go.genes,]

##       ensembl_gene_id hgnc_symbol chromosome_name start_position
## 46 ENSBTAG00000003460                          14       23620858
## 99 ENSBTAG00000019133      ZNF326               3       53385189
##    end_position entrezgene external_gene_name
## 46     23637730     505722              TCEA1
## 99     53419460     533643             ZNF326

Additionally, a multiple testing correction can be in our results. This approach reduce the number of false-positives in our data.

#Creating a new column with the adjusted p-value using FDR correction
result$P.adjust<-p.adjust(result$Pvalue, method = "fdr", n = length(result$Pvalue))

#Checking the results
head(result,10)

##        GOBPID       Pvalue OddsRatio    ExpCount Count Size
## 1  GO:0017144 0.0001819737 264.37838 0.023727439     2    3
## 2  GO:0032784 0.0068208644  18.83398 0.126546339     2   16
## 3  GO:0006423 0.0079091462       Inf 0.007909146     1    1
## 4  GO:0016999 0.0079091462       Inf 0.007909146     1    1
## 5  GO:0031635 0.0079091462       Inf 0.007909146     1    1
## 6  GO:0034137 0.0079091462       Inf 0.007909146     1    1
## 7  GO:0035696 0.0079091462       Inf 0.007909146     1    1
## 8  GO:0035723 0.0079091462       Inf 0.007909146     1    1
## 9  GO:0035724 0.0079091462       Inf 0.007909146     1    1
## 10 GO:0038194 0.0079091462       Inf 0.007909146     1    1
##                                                              Term
## 1                                          drug metabolic process
## 2           regulation of DNA-templated transcription, elongation
## 3                                   cysteinyl-tRNA aminoacylation
## 4                                    antibiotic metabolic process
## 5  adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6   positive regulation of toll-like receptor 2 signaling pathway
## 7                                          monocyte extravasation
## 8                       interleukin-15-mediated signaling pathway
## 9                                       CD24 biosynthetic process
## 10                  thyroid-stimulating hormone signaling pathway
##      P.adjust
## 1  0.02347461
## 2  0.04049741
## 3  0.04049741
## 4  0.04049741
## 5  0.04049741
## 6  0.04049741
## 7  0.04049741
## 8  0.04049741
## 9  0.04049741
## 10 0.04049741

Notice that different multiple correction tests can be used. For more details, check the help material of the p.adjust function.

Metabolic pathways

The metabolic pathways enrichment is an important additional functional analysis to help in the selection of functional candidate genes. The package KEGGprofile allows to perform enrichment analysis using several especies. Here, we will perform a encrihment analysis using the same dataset used for the GO enrichment analysis. The function find_enriched_pathway() is responsible to perform the enrichment test. We need to inform the entrezID list of our positional candidate genes for the function.

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite("KEGGprofile")

#Loading the package
library(KEGGprofile)

## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be
##   considered deprecated and future versions of Bioconductor may
##   not have it available.  Users who want more current data are
##   encouraged to look at the KEGGREST or reactome.db packages

#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")

#Creating a object with the entrez ID for our positional candidate genes.
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Running the enrichment test
keeg.enriched<-find_enriched_pathway(candidate_ID, species = "bta", returned_pvalue = 0.05, returned_adjpvalue = 0.05, returned_genenumber = 1, download_latest = TRUE)

str(keeg.enriched)

## List of 2
##  $ stastic:'data.frame': 4 obs. of  6 variables:
##   ..$ Pathway_Name: chr [1:4] "Metabolism of xenobiotics by cytochrome P450" "RNA polymerase" "Phagosome" "Synaptic vesicle cycle"
##   ..$ Gene_Found  : int [1:4] 3 2 5 3
##   ..$ Gene_Pathway: int [1:4] 66 30 182 65
##   ..$ Percentage  : num [1:4] 0.05 0.07 0.03 0.05
##   ..$ pvalue      : num [1:4] 0.000352 0.000478 0.000291 0.000332
##   ..$ pvalueAdj   : num [1:4] 0.0373 0.038 0.0373 0.0373
##  $ detail :List of 4
##   ..$ 00980: chr [1:3] "510180" "515946" "516036"
##   ..$ 03020: chr [1:2] "107133113" "614776"
##   ..$ 04145: chr [1:5] "281111" "282657" "287017" "768036" ...
##   ..$ 04721: chr [1:3] "282657" "287017" "529659"

The arguments informed to find_enriched_pathway() were the list of unique Entrez IDs, the especie (bta= Bos taurus), the p-value threshold, the adjusted p-value threshold, the minimum number of annotated genes for enriched pathways. The last argument inform if the lastest version of the KEGG pathway database should be download for the analysis.

Notice that the output of this function is a list. This list is composed by the table with the enrichment results and a vector with the Entrez IDs associated with each pathway. Therefore, we can handle this output to obtain individual objects with this information and filter our positional candidate genes from out input file.

#Creating a object with the enriched pathways
enrich_table<-keeg.enriched$stastic

enrich_table

##                                       Pathway_Name Gene_Found Gene_Pathway
## 00980 Metabolism of xenobiotics by cytochrome P450          3           66
## 03020                               RNA polymerase          2           30
## 04145                                    Phagosome          5          182
## 04721                       Synaptic vesicle cycle          3           65
##       Percentage       pvalue  pvalueAdj
## 00980       0.05 0.0003518575 0.03729690
## 03020       0.07 0.0004775888 0.03796831
## 04145       0.03 0.0002914335 0.03729690
## 04721       0.05 0.0003319200 0.03729690

#Filtering the input file based on the genes associated with the enriched pathway
enriched.genes<-genes.interval[genes.interval$entrezgene %in% keeg.enriched$detail[["04145"]],]

enriched.genes

##        ensembl_gene_id hgnc_symbol chromosome_name start_position
## 44  ENSBTAG00000003450     ATP6V1H              14       23511301
## 58  ENSBTAG00000003895        CYBA              18       13931107
## 89  ENSBTAG00000023730       TUBB3              18       14761260
## 122 ENSBTAG00000014238    ATP6V1E1               5      109564503
## 128 ENSBTAG00000018530       TUBA8               5      109876442
##     end_position entrezgene external_gene_name
## 44      23558678     282657            ATP6V1H
## 58      13938075     281111               CYBA
## 89      14769914     768070              TUBB3
## 122    109578159     287017           ATP6V1E1
## 128    109894993     768036              TUBA8

Notice that only the pathway “04145” (Phagosome) was enriched. Tehrefore, its ID was used in the filtering process. The commands above can be adapter to filter multiple the enriched pathways.

Gene network

The gene network analysis is a interesting approach to better understand the relationship between the positional candidate genes. The identification of gene networks can help to select the functional candidate genes, to identify key regulatory genes and to better understand the biological processes related with the development of complex traits.

The package STRINGdb provides a R interface to the STRING protein-protein interactions database (http://www.string-db.org).

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite("STRINGdb"))

#Loading the package
library(STRINGdb)

## Loading required package: png

## Loading required package: sqldf

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: plyr

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:graph':
## 
##     join

## The following object is masked from 'package:IRanges':
## 
##     desc

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:graph':
## 
##     degree, edges, intersection, union

## The following objects are masked from 'package:IRanges':
## 
##     compare, simplify, union

## The following objects are masked from 'package:S4Vectors':
## 
##     compare, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     normalize, union

## The following object is masked from 'package:gtools':
## 
##     permute

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## Loading required package: RCurl

## Loading required package: bitops

## Loading required package: plotrix

## Loading required package: RColorBrewer

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:plotrix':
## 
##     plotCI

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

## Loading required package: hash

## hash-2.2.6 provided by Decision Patterns

## 
## Attaching package: 'hash'

## The following object is masked from 'package:AnnotationDbi':
## 
##     keys

## The following objects are masked from 'package:IRanges':
## 
##     values, values<-

## The following objects are masked from 'package:S4Vectors':
## 
##     values, values<-

## The following object is masked from 'package:biomaRt':
## 
##     keys

#Creating a Data frame with the Gene symbols
candidate_ID<-data.frame(gene=unique(genes.interval$hgnc_symbol))

#Loading the STRING interface for Bos taurus (ID=9913)
string_db <- STRINGdb$new(species=9913)

## WARNING: You didn't specify a version of the STRING database to use. Hence we will use STRING 10

#Mapping the interactions of the genes presen in our input list
gene_mapped <- string_db$map(candidate_ID, "gene", removeUnmappedRows = TRUE )

## Warning:  we couldn't map to STRING 4% of your identifiers

hits <- gene_mapped$STRING_id

#Plotting the gene network
string_db$plot_network(hits, payload_id=NULL, required_score=NULL, add_link=T, add_summary=T)

It is possible to perform GO and KEGG pathways enrichment analyses using STRINGdb package, as well as in the online version o STRING db.

#GO enrichment analysis
enrichmentGO <- string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )

head(enrichmentGO)

##      term_id proteins hits       pvalue  pvalue_fdr
## 1 GO:0050896     1416   10 1.237745e-05 0.009543015
## 2 GO:0019538     1047    8 6.687328e-05 0.017436151
## 3 GO:0051716     1057    8 7.151369e-05 0.017436151
## 4 GO:0071822      334    5 9.045993e-05 0.017436151
## 5 GO:0044267      878    7 1.618923e-04 0.021268626
## 6 GO:0051047       80    3 1.880052e-04 0.021268626
##                       term_description
## 1                 response to stimulus
## 2            protein metabolic process
## 3        cellular response to stimulus
## 4 protein complex subunit organization
## 5   cellular protein metabolic process
## 6     positive regulation of secretion

#KEGG enrichment analysis
enrichmentKEGG <- string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )

head(enrichmentKEGG)

##   term_id proteins hits       pvalue   pvalue_fdr
## 1   04145      151    5 2.358668e-06 0.0002052041
## 2   04540       84    3 2.415520e-04 0.0088705582
## 3   05323       91    3 3.058813e-04 0.0088705582
## 4   05146      103    3 4.401131e-04 0.0095724593
## 5   04721       61    2 3.525329e-03 0.0546216535
## 6   04918       68    2 4.360907e-03 0.0546216535
##            term_description
## 1                 Phagosome
## 2              Gap junction
## 3      Rheumatoid arthritis
## 4                Amoebiasis
## 5    Synaptic vesicle cycle
## 6 Thyroid hormone synthesis

There are several other options to perform clustering and additional enrichment analysis in the STRINGdb package. Some examples can be found here:

https://rdrr.io/bioc/STRINGdb/f/inst/doc/STRINGdb.pdf

This tutorial showed some interesting analyses that can be performed using R. These analyses can be integrated in a pipeline, where the output from different functions and/or packages can be directly used by other functions/packages. This procedure increase the efficiency and the management process for functional studies when hundred (or even thousands) of genes are analyzed.

Identification of positional and functional candidate genes using R

Authors: Pablo Fonseca / Angela Cánovas

December 3, 2017

Why to use R to search positional and functional candidate genes?

Package: biomaRt

Retriving the positional candidate genes and QTLs within selected interval using a R function

Gene Ontology

Metabolic pathways

Gene network