Authors: Pablo Fonseca / Angela Cánovas

March 22, 2018

Why use R to search for positional and functional candidate genes?

In the last tutorials, we learned how to search for positional and functional candidate genes using several softwares. This is a great strategy. However, during the process, several output files are created and we must manage these files in a very cautious way. We must convert different outputs to use as input files in other software, combine different results to select candidate genes, to create plots in order to summarize the results. The larger number of genes in the input files and the larger amount of information retrieved using the software showed up to now, can increase the complexity of those processes. R is a good alternative to manage and combine multiple results in a very straightforward way. There are several packages available in R that performs the same (and additional) analyses. Work with this kind of analysis within R allows the design of a pipeline to combine different steps and to create outputs in the same environment. These characteristics result in an analysis that is performed in a much more efficient way and simplify the management process of the intermediary files. Additionally, with using R, it is possible to create new kind of analysis and /or to adapt previous approaches to better fit to you data. Here, we will see some possibilities to use R to identify positional and functional candidate genes.

Package: biomaRt

The package biomaRt “provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries”. Therefore, it is possible to retrieve all the information available using the online version of Biomart, using R.

Here, we will use the same input file used in the online version of Biomart. The file to be imported in R is called “input_candidate_markers.txt”.

#The following command can  be used to install the package (you just need to remove the comment character #)
#install.packages("biomaRt")

#Loading the input files
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Checking the imported file
head(input.regions)

##   CHR                    SNP        BP A1 TEST NMISS    BETA  STAT
## 1   6 Hapmap25274-BTA-143756  14314482  A  ADD  1093 0.06197 7.352
## 2  12      ARS-BFGL-NGS-1794  88834796  B  ADD  1093 0.14470 7.259
## 3  13  Hapmap48963-BTA-96734    999229  B  ADD  1089 0.05813 7.062
## 4  20 Hapmap61037-rs29016327  45106413  A  ADD  1093 0.05491 6.909
## 5   1     ARS-BFGL-NGS-97757 149746651  A  ADD  1093 0.06505 6.872
## 6  18     ARS-BFGL-NGS-13803  14526709  B  ADD  1090 0.05189 6.780
##          P
## 1 3.82e-13
## 2 7.41e-13
## 3 2.93e-12
## 4 8.28e-12
## 5 1.07e-11
## 6 1.97e-11

#Loading the package
library(biomaRt)

After loading the input file and loading the package, it is possible to start the search of genes within our selected interval.

The first step is to choose the database and the organism to choose the genes

#The following command shows the databases that are available to retrive information
listMarts()

##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 91
## 2   ENSEMBL_MART_MOUSE      Mouse strains 91
## 3     ENSEMBL_MART_SNP  Ensembl Variation 91
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 91

#We will choose ENSEMBL_MART_ENSEMBL, which correspond to Ensembl Genes 91
#It is necessary to creat a object of the type "Mart" to load the database information. This is performed with the following command:
mart <- useMart("ENSEMBL_MART_ENSEMBL")

#Once the object mart was created using useMart() function, we need to select the organism. The following comand can be used to identify the correspondent argument to each organism

#listDatasets(mart)

#The function useDataset() can be used to insert the organism infortation in the mart object
mart <- useDataset("btaurus_gene_ensembl", mart)

Now, it is time to define the filters and attributes to be retrieved from the database

#The following command can be used to list the filters available to use in the retriving process

#listFilters(mart)

#Here, we will use the chromosome region filter ("chromosomal_region") to retrive the genes mapped in the intervals selected
filter<-"chromosomal_region"

#The following command will be used to creat the chromosomal coordinates using a 500Kb upstream and downstream for each slected marker

value<-paste(input.regions$CHR, ":", (input.regions$BP)-500000, ":", (input.regions$BP)+500000, sep="")

value

##  [1] "6:13814482:14814482"   "12:88334796:89334796" 
##  [3] "13:499229:1499229"     "20:44606413:45606413" 
##  [5] "1:149246651:150246651" "18:14026709:15026709" 
##  [7] "14:22752097:23752097"  "12:82524081:83524081" 
##  [9] "13:14646915:15646915"  "15:70206906:71206906" 
## [11] "3:52843172:53843172"   "8:16073737:17073737"  
## [13] "10:93530802:94530802"  "6:230937:1230937"     
## [15] "8:18297386:19297386"   "5:108999140:109999140"
## [17] "1:138594903:139594903" "6:13754604:14754604"  
## [19] "5:63690317:64690317"   "18:13646373:14646373"

#The following command allows to list all the attirbutes possible to be retrived from Biomart

#listAttributes(mart)

#The attributes to be retrived will be inserted in a object
attributes <- c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position","entrezgene","external_gene_name")

Now, after defining the filters and the attributes, it is time to retrieve the information

#At this point, we need to inform the objects created in the previous steps as arguments of the function getBM()
all.genes <- getBM(attributes=attributes, filters=filter, values=value, mart=mart)

#Checking the output
str(all.genes)

## 'data.frame':    157 obs. of  7 variables:
##  $ ensembl_gene_id   : chr  "ENSBTAG00000012798" "ENSBTAG00000030105" "ENSBTAG00000018682" "ENSBTAG00000023384" ...
##  $ hgnc_symbol       : chr  "KCNH8" "" "" "" ...
##  $ chromosome_name   : int  1 1 1 1 1 1 1 10 10 10 ...
##  $ start_position    : int  138455304 149720302 150025827 150054221 150108438 150144149 150169936 93417338 93617616 93638609 ...
##  $ end_position      : int  138676649 149720392 150048269 150064636 150110508 150153934 150293609 93571380 93677055 93638752 ...
##  $ entrezgene        : int  100336609 NA 540440 515946 510180 516036 514520 281553 785612 NA ...
##  $ external_gene_name: chr  "KCNH8" "" "SETD4" "CBR1" ...

head(all.genes)

##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798       KCNH8               1      138455304
## 2 ENSBTAG00000030105                           1      149720302
## 3 ENSBTAG00000018682                           1      150025827
## 4 ENSBTAG00000023384                           1      150054221
## 5 ENSBTAG00000030629                           1      150108438
## 6 ENSBTAG00000018688                           1      150144149
##   end_position entrezgene external_gene_name
## 1    138676649  100336609              KCNH8
## 2    149720392         NA                   
## 3    150048269     540440              SETD4
## 4    150064636     515946               CBR1
## 5    150110508     510180          MGC127133
## 6    150153934     516036               CBR3

#Saving the ouput

write.table(all.genes, file="output_biomart_genes_within_interval.txt", row.names=F, sep="\t", quote=F)

Using the command listed above, it is possible to obtain the same results obtained in the online version of biomart. Notice that it is possible to run these commands for all the organisms available in the Ensembl database, for all the filters and attributes. Check the help material for each function to obtain more information. For example “?getBM()”.

The manual of biomaRt can be found here:

https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

Retrieving the positional candidate genes and QTLs within selected interval using a R function

We created a R function to search genes or QTLs within selected intervals. This function can handle with SNP and haplotype data.

The function is called “find_genes_qtls_around_markers”. For this function, you should inform some arguments:

**db_file:** For this argument you should inform the name of the file with the gene mapping or QTL information. For the gene mapping, you should use the .gtf file download from Ensembl data base (ftp://ftp.ensembl.org/pub/release-90/gtf/bos_taurus/). For the QTL search, you need to inform the .gff file that can be downloaded from Cattle QTlLdb (https://www.animalgenome.org/cgi-bin/QTLdb/BT/download?file=gbpUMD_3.1).

**marker_file:** The file with the SNP or haplotype positions. Detail: For SNP files, you must have a column called “CHR” and a column called “BP” with the chromosome and base pair position, respectively. For the haplotype, you must have three columns: “CHR”, “BP1” and “BP2”. All the columns names are capitals, as weel as the PLINK output.

method: “gene” or “qtl”

marker: “snp” or “haplotype”

interval: The interval in base pair

All positions must be informed using base pairs (not Kb or Mb).

**Important:* This function is a beta version Therefore, errors or bugs might be found. Use with care.

In order to execute this function, it is necessary to installthe packages refGenome, mefa and gtools. These packages can be installed using the following command.

#install.packages(c("refGenome", "mefa","gtools"))

#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function_v1.1.R")

#Running the function to search genes in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="Bos_taurus.UMD3.1.91.gtf",marker_file="input_candidate_markers.txt",method="gene",marker="snp",interval=500000)

## Loading required package: doBy

## Loading required package: RSQLite

## mefa 3.2-7    2016-01-11

## You are using the method: gene with snp

## 
## [read.gtf.refGenome] Reading file 'Bos_taurus.UMD3.1.91.gtf'.
## 
[GTF]   100000 lines processed.
[GTF]   200000 lines processed.
[GTF]   300000 lines processed.
[GTF]   400000 lines processed.
[GTF]   500000 lines processed.
[GTF]   563839 lines processed.
## [read.gtf.refGenome] Extracting genes table.
## [read.gtf.refGenome] Found 24,616 gene records.
## [read.gtf.refGenome] Finished.

## Starting Gene searching using 5e+05 bp as interval

## Starting analysis for chromosome 1

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 3

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 5

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 6

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 8

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 10

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 12

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 13

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 14

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 15

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 18

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 20

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

#An output file will be save in your computer. This file is called: "genes_around_SNPs.txt". Now we will import this file and check the output format.
genes.interval<-read.table("genes_around_SNPs.txt", h=T, sep="\t")
head(genes.interval)

##   CHR                    SNP       BP A1 TEST NMISS    BETA  STAT        P
## 1  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 2  20 Hapmap61037-rs29016327 45106413  A  ADD  1093 0.05491 6.909 8.28e-12
## 3  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 4  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 5  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
## 6  18     ARS-BFGL-NGS-39953 14146373  A  ADD  1091 0.04613 6.166 9.83e-10
##   chr                 ID gene_name start_pos  end_pos
## 1  20 ENSBTAG00000043813   5S_rRNA  45050487 45050605
## 2  20 ENSBTAG00000043184        U6  45414399 45414504
## 3  18 ENSBTAG00000045507    ZNF469  13740622 13753849
## 4  18 ENSBTAG00000015766     ZFPM1  13809878 13826787
## 5  18 ENSBTAG00000018507      <NA>  13860761 13865260
## 6  18 ENSBTAG00000019520    ZC3H18  13879228 13922305
##   Distance_from_gene_start Distance_from_gene_end
## 1                    55926                  55808
## 2                  -307986                -308091
## 3                   405751                 392524
## 4                   336495                 319586
## 5                   285612                 281113
## 6                   267145                 224068

Note that the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each interval is repeated in time. Where n is equal the number of genes within this interval.

#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function_v1.1.R")

#Running the function to search QTLs in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="QTL_UMD_3.1.gff.txt",marker_file="input_candidate_markers.txt",method="qtl",marker="snp",interval=500000)

## You are using the method: qtl with snp

## Starting QTL searching using 5e+05 bp as interval

## Starting analysis for chromosome 6

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 12

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 13

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 20

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 1

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 18

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 14

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 15

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 3

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 8

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 10

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

## Starting analysis for chromosome 5

## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

#An output file will be save in your computer. This file is called: "QTLs_around_SNPs.txt". Now we will import this file and check the output format.
qtls.interval<-read.table("QTLs_around_SNPs.txt", h=T, sep="\t")
head(qtls.interval)

##   CHR             SNP        BP A1 TEST NMISS     BETA   STAT        P chr
## 1   5 BTA-73797-no-rs  64190317  A  ADD  1092 -0.05269 -6.197 8.17e-10   5
## 2   5 BTA-73797-no-rs  64190317  A  ADD  1092 -0.05269 -6.197 8.17e-10   5
## 3   5 BTA-73797-no-rs  64190317  A  ADD  1092 -0.05269 -6.197 8.17e-10   5
## 4   5 BTA-73797-no-rs  64190317  A  ADD  1092 -0.05269 -6.197 8.17e-10   5
## 5   5 BTA-73797-no-rs  64190317  A  ADD  1092 -0.05269 -6.197 8.17e-10   5
## 6   5   UA-IFASA-6163 109499140  A  ADD  1092  0.05058  6.279 4.91e-10   5
##       database                 QTL_type start_pos   end_pos
## 1 Animal QTLdb Reproduction_Association  63899433  63899473
## 2 Animal QTLdb Reproduction_Association  63899433  63899473
## 3 Animal QTLdb     Meat_and_Carcass_QTL  47539203  90486800
## 4 Animal QTLdb         Reproduction_QTL  63700911  68433351
## 5 Animal QTLdb         Reproduction_QTL  63700911  68433351
## 6 Animal QTLdb   Production_Association 109149833 109149873
##                                                                                                                                                                                                                                                                                                                                                           extra_info
## 1                                                        QTL_ID=15409;Name=Gestation length;Abbrev=GLENGTH;PUBMED_ID=22034999;trait_ID=1076;trait=Gestation length;breed=holstein;FlankMarkers=rs110718748;VTO_name=gestation period duration;CMO_name=gestation period length;Map_Type=Genome;Model=Mendelian;peak_cM=73.57;Significance=Significant;P-value=0.0014
## 2                                                     QTL_ID=15410;Name=Gestation length;Abbrev=GLENGTH;PUBMED_ID=22034999;trait_ID=1076;trait=Gestation length;breed=brown swiss;FlankMarkers=rs110718748;VTO_name=gestation period duration;CMO_name=gestation period length;Map_Type=Genome;Model=Mendelian;peak_cM=73.57;Significance=Significant;P-value=0.0184
## 3 QTL_ID=22864;Name=Intramuscular fat;Abbrev=EEF;PUBMED_ID=23038745;trait_ID=1103;trait=Intramuscular fat;breed=brangus;FlankMarkers=rs110911808,rs109843295;VTO_name=intramuscular adipose amount;PTO_name=intramuscular fat content;Map_Type=Genome;Model=Mendelian;Test_Base=Comparison-wise;peak_cM=56.82;Significance=Significant;P-value=0.001;Variance=0.0131
## 4                                                                                 QTL_ID=126849;Name=Calving ease;Abbrev=CALEASE;PUBMED_ID=28109604;trait_ID=1073;trait=Calving ease;breed=holstein;FlankMarkers=rs110412715,rs109571239;VTO_name=parturition trait;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.01
## 5                                                                               QTL_ID=126850;Name=Stillbirth;Abbrev=SB;PUBMED_ID=28109604;trait_ID=1153;trait=Stillbirth;breed=holstein;FlankMarkers=rs110412715,rs109571239;VTO_name=stillborn offspring quantity;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.01
## 6                                                                      QTL_ID=65454;Name=Maturity rate;Abbrev=MATRATE;PUBMED_ID=26445451;trait_ID=1601;trait=Maturity rate;breed=brahman;FlankMarkers=rs135153778;VTO_name=postnatal growth trait;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=0.457091
##   Distance_from_gene_start Distance_from_gene_end
## 1                   290884                 290844
## 2                   290884                 290844
## 3                 16651114              -26296483
## 4                   489406               -4243034
## 5                   489406               -4243034
## 6                   349307                 349267

Note that, as well as the gene searching output, the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each interval is repeated n times. Where n is equal the number of genes within this interval.

The same analyses can be performed to haplotype data. You only need to change the argument informed to “marker=”. To choose the analysis using haplotypes, you need to inform “haplotype”.

Gene Ontology

The Gene Ontology enrichment analyses can also be performed using R. The functions available in the packages GO.db, topGO and GOstats can be used together to perform an enrichment analysis. Additionally, the information present in the database “org.Bt.eg.db” will be used.

First, we need to import a table with the genes mapped inside a group of intervals. For this, we will import the table created using the function getBM() in biomaRt package.

#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")
head(genes.interval)

##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798       KCNH8               1      138455304
## 2 ENSBTAG00000030105                           1      149720302
## 3 ENSBTAG00000018682                           1      150025827
## 4 ENSBTAG00000023384                           1      150054221
## 5 ENSBTAG00000030629                           1      150108438
## 6 ENSBTAG00000018688                           1      150144149
##   end_position entrezgene external_gene_name
## 1    138676649  100336609              KCNH8
## 2    149720392         NA                   
## 3    150048269     540440              SETD4
## 4    150064636     515946               CBR1
## 5    150110508     510180          MGC127133
## 6    150153934     516036               CBR3

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GO.db","topGO","GOstats", "org.Bt.eg.db""))

After, importing the list with the positional candidate genes, we must map all the genes annotated in the org.Bt.eg.db to the corresponding GO terms.

#Loading packages
library(org.Bt.eg.db)

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

##

library(GO.db)

##

#org.Bt.egGO is an R object that provides mappings between entrez gene identifers and the GO identifers that they are directly associated with
entrez_ID<-org.Bt.egGO

#Getting the entrez gene identifiers that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_ID)

Using the above commands, it is possible to create a list with the GO terms and the genes associated to each one of the terms. Now, we can use functions from the GO.db and GOstats packages to perform an enrichment test.

#Loading the packages
library("GO.db")
library("GOstats")

## Loading required package: Category

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:IRanges':
## 
##     expand

## Loading required package: graph

## 
## Attaching package: 'GOstats'

## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph

#Creating the object with the entrez ID for each mapped gene (universe) 
universe <- mapped_genes

#Creating an object with the entrez ID for our positional candidate genes. Notice that we are using each Id only once and the class of the object must be the same that in the universe object (character)
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Creating the list of parameters to perform the enrichment test. Here, we will perform an enrichment analysis for Biological Process (BP)
params <- new('GOHyperGParams', geneIds=candidate_ID, universeGeneIds=universe, ontology='BP', pvalueCutoff=0.05, conditional=F, testDirection='over', annotation="org.Bt.eg.db")

## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds

#Running a hypergeometric test
hgOver <- hyperGTest(params)
hgOver

## Gene to GO BP  test for over-representation 
## 1068 GO BP ids tested (129 have p < 0.05)
## Selected gene set size: 39 
##     Gene universe size: 4931 
##     Annotation package: org.Bt.eg

#Saving the summary of the results in a object
result <- summary(hgOver)

#Checking the dimensions of the results
dim(result)

## [1] 129   7

#Checking the results
head(result,10)

##        GOBPID       Pvalue OddsRatio    ExpCount Count Size
## 1  GO:0017144 0.0001819737 264.37838 0.023727439     2    3
## 2  GO:0032784 0.0068208644  18.83398 0.126546339     2   16
## 3  GO:0006423 0.0079091462       Inf 0.007909146     1    1
## 4  GO:0016999 0.0079091462       Inf 0.007909146     1    1
## 5  GO:0031635 0.0079091462       Inf 0.007909146     1    1
## 6  GO:0034137 0.0079091462       Inf 0.007909146     1    1
## 7  GO:0035696 0.0079091462       Inf 0.007909146     1    1
## 8  GO:0035723 0.0079091462       Inf 0.007909146     1    1
## 9  GO:0035724 0.0079091462       Inf 0.007909146     1    1
## 10 GO:0038194 0.0079091462       Inf 0.007909146     1    1
##                                                              Term
## 1                                          drug metabolic process
## 2           regulation of DNA-templated transcription, elongation
## 3                                   cysteinyl-tRNA aminoacylation
## 4                                    antibiotic metabolic process
## 5  adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6   positive regulation of toll-like receptor 2 signaling pathway
## 7                                          monocyte extravasation
## 8                       interleukin-15-mediated signaling pathway
## 9                                       CD24 biosynthetic process
## 10                  thyroid-stimulating hormone signaling pathway

Notice that the output table shows all the GO terms with a p-value smaller than the threshold defined in the parameters. However, we do not know which genes are associated with the enriched terms. It is possible to retrieve this information using the following command line. For example, let’s search the genes associated with the GO:0032784, “regulation of DNA-templated transcription, elongation”.

#First, we need to create a object with a mapping information for each GO term to Entrez gene ids
go_object <- as.list(org.Bt.egGO2EG)

#After this step, we can easily retrive this information using a indexing filtering in the go_object
go.genes<-go_object[["GO:0032784"]][which(go_object[["GO:0032784"]] %in% candidate_ID)]

#Showing the Entrez ID
go.genes

##      IEA      ISS 
## "505722" "533643"

#Filtering the gene information from the input table
genes.interval[genes.interval$entrezgene %in% go.genes,]

##       ensembl_gene_id hgnc_symbol chromosome_name start_position
## 46 ENSBTAG00000003460                          14       23620858
## 99 ENSBTAG00000019133                           3       53385189
##    end_position entrezgene external_gene_name
## 46     23637730     505722              TCEA1
## 99     53419460     533643             ZNF326

Additionally, a multiple testing correction can be in our results. This approach reduces the number of false-positives in our data.

#Creating a new column with the adjusted p-value using FDR correction
result$P.adjust<-p.adjust(result$Pvalue, method = "fdr", n = length(result$Pvalue))

#Checking the results
head(result,10)

##        GOBPID       Pvalue OddsRatio    ExpCount Count Size
## 1  GO:0017144 0.0001819737 264.37838 0.023727439     2    3
## 2  GO:0032784 0.0068208644  18.83398 0.126546339     2   16
## 3  GO:0006423 0.0079091462       Inf 0.007909146     1    1
## 4  GO:0016999 0.0079091462       Inf 0.007909146     1    1
## 5  GO:0031635 0.0079091462       Inf 0.007909146     1    1
## 6  GO:0034137 0.0079091462       Inf 0.007909146     1    1
## 7  GO:0035696 0.0079091462       Inf 0.007909146     1    1
## 8  GO:0035723 0.0079091462       Inf 0.007909146     1    1
## 9  GO:0035724 0.0079091462       Inf 0.007909146     1    1
## 10 GO:0038194 0.0079091462       Inf 0.007909146     1    1
##                                                              Term
## 1                                          drug metabolic process
## 2           regulation of DNA-templated transcription, elongation
## 3                                   cysteinyl-tRNA aminoacylation
## 4                                    antibiotic metabolic process
## 5  adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6   positive regulation of toll-like receptor 2 signaling pathway
## 7                                          monocyte extravasation
## 8                       interleukin-15-mediated signaling pathway
## 9                                       CD24 biosynthetic process
## 10                  thyroid-stimulating hormone signaling pathway
##      P.adjust
## 1  0.02347461
## 2  0.04049741
## 3  0.04049741
## 4  0.04049741
## 5  0.04049741
## 6  0.04049741
## 7  0.04049741
## 8  0.04049741
## 9  0.04049741
## 10 0.04049741

Notice that different multiple correction tests can be used. For more details, check the help material of the p.adjust function.

Metabolic pathways

The metabolic pathways enrichment is an important additional functional analysis to help in the selection of functional candidate genes. The package KEGGprofile allows us to perform enrichment analysis using several species. Here, we will perform an enrichment analysis using the same dataset used for the GO enrichment analysis. The function find_enriched_pathway() is responsible to perform the enrichment test. We need to inform the entrezID list of our positional candidate genes for the function.

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite("KEGGprofile")

#Loading the package
library(KEGGprofile)

## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be
##   considered deprecated and future versions of Bioconductor may
##   not have it available.  Users who want more current data are
##   encouraged to look at the KEGGREST or reactome.db packages

#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")

#Creating a object with the entrez ID for our positional candidate genes.
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Running the enrichment test
keeg.enriched<-find_enriched_pathway(candidate_ID, species = "bta", returned_pvalue = 0.05, returned_adjpvalue = 0.05, returned_genenumber = 1, download_latest = TRUE)

str(keeg.enriched)

## List of 2
##  $ stastic:'data.frame': 4 obs. of  6 variables:
##   ..$ Pathway_Name: chr [1:4] "Metabolism of xenobiotics by cytochrome P450" "RNA polymerase" "Phagosome" "Synaptic vesicle cycle"
##   ..$ Gene_Found  : int [1:4] 3 2 5 3
##   ..$ Gene_Pathway: int [1:4] 67 30 182 65
##   ..$ Percentage  : num [1:4] 0.04 0.07 0.03 0.05
##   ..$ pvalue      : num [1:4] 0.00036 0.000465 0.000277 0.00032
##   ..$ pvalueAdj   : num [1:4] 0.0374 0.0374 0.0374 0.0374
##  $ detail :List of 4
##   ..$ 00980: chr [1:3] "510180" "515946" "516036"
##   ..$ 03020: chr [1:2] "107133113" "614776"
##   ..$ 04145: chr [1:5] "281111" "282657" "287017" "768036" ...
##   ..$ 04721: chr [1:3] "282657" "287017" "529659"

The arguments informed to find_enriched_pathway() were the list of unique Entrez IDs, the species (bta= Bos taurus), the p-value threshold, the adjusted p-value threshold, the minimum number of annotated genes for enriched pathways. The last argument informs if the latest version of the KEGG pathway database should be download for the analysis.

Notice that the output of this function is a list. This list is composed by the table with the enrichment results and a vector with the Entrez IDs associated with each pathway. Therefore, we can handle this output to obtain individual objects with this information and filter our positional candidate genes from our input file.

#Creating an object with the enriched pathways
enrich_table<-keeg.enriched$stastic

enrich_table

##                                       Pathway_Name Gene_Found Gene_Pathway
## 00980 Metabolism of xenobiotics by cytochrome P450          3           67
## 03020                               RNA polymerase          2           30
## 04145                                    Phagosome          5          182
## 04721                       Synaptic vesicle cycle          3           65
##       Percentage       pvalue  pvalueAdj
## 00980       0.04 0.0003597641 0.03741374
## 03020       0.07 0.0004647670 0.03741374
## 04145       0.03 0.0002774512 0.03741374
## 04721       0.05 0.0003204313 0.03741374

#Filtering the input file based on the genes associated with the enriched pathway
enriched.genes<-genes.interval[genes.interval$entrezgene %in% keeg.enriched$detail[["04145"]],]

enriched.genes

##        ensembl_gene_id hgnc_symbol chromosome_name start_position
## 44  ENSBTAG00000003450                          14       23511301
## 58  ENSBTAG00000003895                          18       13931107
## 89  ENSBTAG00000023730                          18       14761260
## 122 ENSBTAG00000014238                           5      109564503
## 128 ENSBTAG00000018530                           5      109876442
##     end_position entrezgene external_gene_name
## 44      23558678     282657            ATP6V1H
## 58      13938075     281111               CYBA
## 89      14769914     768070              TUBB3
## 122    109578159     287017           ATP6V1E1
## 128    109894993     768036              TUBA8

Notice that only the pathway “04145” (Phagosome) was enriched. Therefore, its ID was used in the filtering process. The commands above can be adapted to filter multiple the enriched pathways.

Identification of positional and functional candidate genes using R