Identification of positional and functional candidate genes using R

Authors: Pablo Fonseca / Angela Cánovas

March 22, 2018

Why use R to search for positional and functional candidate genes?

In the last tutorials, we learned how to search for positional and functional candidate genes using several softwares. This is a great strategy. However, during the process, several output files are created and we must manage these files in a very cautious way. We must convert different outputs to use as input files in other software, combine different results to select candidate genes, to create plots in order to summarize the results. The larger number of genes in the input files and the larger amount of information retrieved using the software showed up to now, can increase the complexity of those processes. R is a good alternative to manage and combine multiple results in a very straightforward way. There are several packages available in R that performs the same (and additional) analyses. Work with this kind of analysis within R allows the design of a pipeline to combine different steps and to create outputs in the same environment. These characteristics result in an analysis that is performed in a much more efficient way and simplify the management process of the intermediary files. Additionally, with using R, it is possible to create new kind of analysis and /or to adapt previous approaches to better fit to you data. Here, we will see some possibilities to use R to identify positional and functional candidate genes.

Package: biomaRt

The package biomaRt “provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries”. Therefore, it is possible to retrieve all the information available using the online version of Biomart, using R.

Here, we will use the same input file used in the online version of Biomart. The file to be imported in R is called “input_candidate_markers.txt”.

#The following command can  be used to install the package (you just need to remove the comment character #)
#install.packages("biomaRt")

#Loading the input files
input.regions<-read.table("input_candidate_markers_SC_Fortes2013.txt", h=T, sep="\t")

#Checking the imported file
head(input.regions)
##        Associated.marker SNP.reference                 Trait CHR       BP
## 1    ARS-BFGL-NGS-116764   rs109169629 Scrotal circumference   7  8741497
## 2     ARS-BFGL-NGS-40630    rs43556880 Scrotal circumference   8 60800770
## 3 Hapmap32552-BTA-129045   rs109890227 Scrotal circumference  14 22323719
## 4     ARS-BFGL-BAC-12159    rs42024431 Scrotal circumference  14 22587081
## 5 Hapmap31202-BTA-162588   rs110849144 Scrotal circumference  14 23988778
## 6    ARS-BFGL-NGS-104268   rs110845339 Scrotal circumference  14 24057354
##    P.value Variance.explained Allelic.substitution.effect   Breed
## 1 4.53e-06               0.03                       -0.70 Brahman
## 2 7.95e-06               0.03                        1.73 Brahman
## 3 3.58e-05               0.03                       -0.40 Brahman
## 4 3.67e-06               0.03                        0.47 Brahman
## 5 8.92e-05               0.02                        0.39 Brahman
## 6 4.73e-08               0.05                        0.56 Brahman
##              Reference
## 1 Fortes et al. (2012)
## 2 Fortes et al. (2012)
## 3 Fortes et al. (2012)
## 4 Fortes et al. (2012)
## 5 Fortes et al. (2012)
## 6 Fortes et al. (2012)
#Loading the package
library(biomaRt)

After loading the input file and loading the package, it is possible to start the search of genes within our selected interval.

  1. The first step is to choose the database and the organism to choose the genes
#The following command shows the databases that are available to retrive information
listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 91
## 2   ENSEMBL_MART_MOUSE      Mouse strains 91
## 3     ENSEMBL_MART_SNP  Ensembl Variation 91
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 91
#We will choose ENSEMBL_MART_ENSEMBL, which correspond to Ensembl Genes 91
#It is necessary to creat a object of the type "Mart" to load the database information. This is performed with the following command:
mart <- useMart("ENSEMBL_MART_ENSEMBL")

#Once the object mart was created using useMart() function, we need to select the organism. The following comand can be used to identify the correspondent argument to each organism

#listDatasets(mart)

#The function useDataset() can be used to insert the organism infortation in the mart object
mart <- useDataset("btaurus_gene_ensembl", mart)
  1. Now, it is time to define the filters and attributes to be retrieved from the database
#The following command can be used to list the filters available to use in the retriving process

#listFilters(mart)

#Here, we will use the chromosome region filter ("chromosomal_region") to retrive the genes mapped in the intervals selected
filter<-"chromosomal_region"

#The following command will be used to creat the chromosomal coordinates using a 500Kb upstream and downstream for each slected marker

value<-paste(input.regions$CHR, ":", (input.regions$BP)-500000, ":", (input.regions$BP)+500000, sep="")

value
##  [1] "7:8241497:9241497"    "8:60300770:61300770"  "14:21823719:22823719"
##  [4] "14:22087081:23087081" "14:23488778:24488778" "14:23557354:24557354"
##  [7] "14:23937778:24937778" "14:23982969:24982969" "14:24024205:25024205"
## [10] "14:24073257:25073257" "14:24143266:25143266" "14:24287245:25287245"
## [13] "14:24607556:25607556" "14:24675950:25675950" "14:24807116:25807116"
## [16] "14:24851733:25851733" "14:25138580:26138580" "14:25198286:26198286"
## [19] "14:26449215:27449215" "14:26698715:27698715" "14:26860366:27860366"
## [22] "14:26880992:27880992" "30:15508013:16508013" "30:61941560:62941560"
## [25] "30:64677231:65677231" "30:68696959:69696959" "30:69557309:70557309"
## [28] "30:69734032:70734032" "30:70490145:71490145" "30:71450688:72450688"
## [31] "30:71947651:72947651" "30:72535611:73535611" "30:72756546:73756546"
## [34] "30:72879138:73879138" "30:72983564:73983564" "30:74052529:75052529"
## [37] "30:74734781:75734781" "30:75006946:76006946" "30:76869290:77869290"
## [40] "30:78738433:79738433" "30:78777137:79777137" "30:80977451:81977451"
## [43] "30:81138519:82138519" "30:81908727:82908727" "30:83835452:84835452"
## [46] "30:84367168:85367168" "30:84834862:85834862" "30:84963982:85963982"
## [49] "30:84995447:85995447" "30:85089749:86089749" "30:85212052:86212052"
## [52] "30:85333472:86333472" "30:86056922:87056922" "30:90410050:91410050"
## [55] "30:90770661:91770661" "30:91549559:92549559" "30:93131688:94131688"
#The following command allows to list all the attirbutes possible to be retrived from Biomart

#listAttributes(mart)

#The attributes to be retrived will be inserted in a object
attributes <- c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position","entrezgene","external_gene_name")
  1. Now, after defining the filters and the attributes, it is time to retrieve the information
#At this point, we need to inform the objects created in the previous steps as arguments of the function getBM()
all.genes <- getBM(attributes=attributes, filters=filter, values=value, mart=mart)

#Checking the output
str(all.genes)
## 'data.frame':    112 obs. of  7 variables:
##  $ ensembl_gene_id   : chr  "ENSBTAG00000027760" "ENSBTAG00000002448" "ENSBTAG00000044404" "ENSBTAG00000044616" ...
##  $ hgnc_symbol       : chr  "" "" "" "" ...
##  $ chromosome_name   : int  14 14 14 14 14 14 14 14 14 14 ...
##  $ start_position    : int  21890883 22212868 22605811 22646987 22654044 22669363 22815477 23067159 23511301 23561303 ...
##  $ end_position      : int  21891976 22350058 22605948 22647093 22654320 22717576 22872779 23068103 23558678 23613020 ...
##  $ entrezgene        : int  NA 517353 NA NA NA 521261 536336 NA 282657 614441 ...
##  $ external_gene_name: chr  "" "SNTG1" "U2" "U6" ...
head(all.genes)
##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000027760                          14       21890883
## 2 ENSBTAG00000002448                          14       22212868
## 3 ENSBTAG00000044404                          14       22605811
## 4 ENSBTAG00000044616                          14       22646987
## 5 ENSBTAG00000043686                          14       22654044
## 6 ENSBTAG00000017492                          14       22669363
##   end_position entrezgene external_gene_name
## 1     21891976         NA                   
## 2     22350058     517353              SNTG1
## 3     22605948         NA                 U2
## 4     22647093         NA                 U6
## 5     22654320         NA                7SK
## 6     22717576     521261             PCMTD1
#Saving the ouput

write.table(all.genes, file="output_biomart_genes_within_interval.txt", row.names=F, sep="\t", quote=F)

Using the command listed above, it is possible to obtain the same results obtained in the online version of biomart. Notice that it is possible to run these commands for all the organisms available in the Ensembl database, for all the filters and attributes. Check the help material for each function to obtain more information. For example “?getBM()”.

The manual of biomaRt can be found here:

https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

Retrieving the positional candidate genes and QTLs within selected interval using a R function

We created a R function to search genes or QTLs within selected intervals. This function can handle with SNP and haplotype data.

The function is called “find_genes_qtls_around_markers”. For this function, you should inform some arguments:

**db_file:** For this argument you should inform the name of the file with the gene mapping or QTL information. For the gene mapping, you should use the .gtf file download from Ensembl data base (ftp://ftp.ensembl.org/pub/release-90/gtf/bos_taurus/). For the QTL search, you need to inform the .gff file that can be downloaded from Cattle QTlLdb (https://www.animalgenome.org/cgi-bin/QTLdb/BT/download?file=gbpUMD_3.1).

**marker_file:** The file with the SNP or haplotype positions. Detail: For SNP files, you must have a column called “CHR” and a column called “BP” with the chromosome and base pair position, respectively. For the haplotype, you must have three columns: “CHR”, “BP1” and “BP2”. All the columns names are capitals, as weel as the PLINK output.

method: “gene” or “qtl”

marker: “snp” or “haplotype”

interval: The interval in base pair

All positions must be informed using base pairs (not Kb or Mb).

**Important:* This function is a beta version Therefore, errors or bugs might be found. Use with care.

In order to execute this function, it is necessary to installthe packages refGenome, mefa and gtools. These packages can be installed using the following command.

#install.packages(c("refGenome", "mefa","gtools"))
#Importing the input file
input.regions<-read.table("input_candidate_markers_SC_Fortes2013.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function_v1.1.R")

#Running the function to search genes in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="Bos_taurus.UMD3.1.91.gtf",marker_file="input_candidate_markers_SC_Fortes2013.txt",method="gene",marker="snp",interval=500000)
## Loading required package: doBy
## Loading required package: RSQLite
## mefa 3.2-7    2016-01-11
## You are using the method: gene with snp
## 
## [read.gtf.refGenome] Reading file 'Bos_taurus.UMD3.1.91.gtf'.
## 
[GTF]   100000 lines processed.
[GTF]   200000 lines processed.
[GTF]   300000 lines processed.
[GTF]   400000 lines processed.
[GTF]   500000 lines processed.
[GTF]   563839 lines processed.
## [read.gtf.refGenome] Extracting genes table.
## [read.gtf.refGenome] Found 24,616 gene records.
## [read.gtf.refGenome] Finished.
## Starting Gene searching using 5e+05 bp as interval
## Starting analysis for chromosome 7
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Starting analysis for chromosome 8
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Starting analysis for chromosome 14
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  45%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |====================================                             |  55%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  65%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |=================================================================| 100%
#An output file will be save in your computer. This file is called: "genes_around_SNPs.txt". Now we will import this file and check the output format.
genes.interval<-read.table("genes_around_SNPs.txt", h=T, sep="\t")
head(genes.interval)
##        Associated.marker SNP.reference                 Trait CHR       BP
## 1 Hapmap26621-BTC-072953   rs109286496 Scrotal circumference  14 27380992
## 2 Hapmap26621-BTC-072953   rs109286496 Scrotal circumference  14 27380992
## 3 Hapmap26621-BTC-072953   rs109286496 Scrotal circumference  14 27380992
## 4 Hapmap23060-BTC-072978    rs42474245 Scrotal circumference  14 27360366
## 5 Hapmap23060-BTC-072978    rs42474245 Scrotal circumference  14 27360366
## 6 Hapmap23509-BTC-073113   rs109462631 Scrotal circumference  14 27198715
##    P.value Variance.explained Allelic.substitution.effect   Breed
## 1 2.23e-05               0.03                        0.44 Brahman
## 2 2.23e-05               0.03                        0.44 Brahman
## 3 2.23e-05               0.03                        0.44 Brahman
## 4 1.02e-05               0.03                       -0.46 Brahman
## 5 1.02e-05               0.03                       -0.46 Brahman
## 6 7.42e-05               0.02                        0.38 Brahman
##              Reference chr                 ID gene_name start_pos  end_pos
## 1 Fortes et al. (2012)  14 ENSBTAG00000004954       TOX  26631190 26941726
## 2 Fortes et al. (2012)  14 ENSBTAG00000017529       CA8  27637561 27719848
## 3 Fortes et al. (2012)  14 ENSBTAG00000000948     RAB2A  27864735 27937015
## 4 Fortes et al. (2012)  14 ENSBTAG00000004954       TOX  26631190 26941726
## 5 Fortes et al. (2012)  14 ENSBTAG00000017529       CA8  27637561 27719848
## 6 Fortes et al. (2012)  14 ENSBTAG00000004954       TOX  26631190 26941726
##   Distance_from_gene_start Distance_from_gene_end
## 1                   749802                 439266
## 2                  -256569                -338856
## 3                  -483743                -556023
## 4                   729176                 418640
## 5                  -277195                -359482
## 6                   567525                 256989

Note that the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each interval is repeated in time. Where n is equal the number of genes within this interval.

#Importing the input file
input.regions<-read.table("input_candidate_markers_SC_Fortes2013.txt", h=T, sep="\t")

#Loading the function from a external file
source("complete_function_v1.1.R")

#Running the function to search QTLs in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="QTL_UMD_3.1.gff",marker_file="input_candidate_markers_SC_Fortes2013.txt",method="qtl",marker="snp",interval=500000)
## You are using the method: qtl with snp
## Starting QTL searching using 5e+05 bp as interval
## Starting analysis for chromosome 7
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Starting analysis for chromosome 8
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Starting analysis for chromosome 14
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  45%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |====================================                             |  55%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  65%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |=================================================================| 100%
## Starting analysis for chromosome 30
## 
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  11%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |===========                                                      |  17%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |===============                                                  |  23%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |===================                                              |  29%
  |                                                                       
  |====================                                             |  31%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |========================                                         |  37%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |============================                                     |  43%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |================================                                 |  49%
  |                                                                       
  |=================================                                |  51%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |=====================================                            |  57%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |=========================================                        |  63%
  |                                                                       
  |===========================================                      |  66%
  |                                                                       
  |=============================================                    |  69%
  |                                                                       
  |==============================================                   |  71%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |==================================================               |  77%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |===========================================================      |  91%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |===============================================================  |  97%
  |                                                                       
  |=================================================================| 100%
#An output file will be save in your computer. This file is called: "QTLs_around_SNPs.txt". Now we will import this file and check the output format.
qtls.interval<-read.table("QTLs_around_SNPs.txt", h=T, sep="\t")
head(qtls.interval)
##       Associated.marker SNP.reference                 Trait CHR       BP
## 1 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
## 2 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
## 3 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
## 4 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
## 5 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
## 6 Hapmap47906-BTA-29386    rs41632485 Scrotal circumference  30 93631688
##    P.value Variance.explained Allelic.substitution.effect   Breed
## 1 2.61e-06               0.01                       -0.43 Brahman
## 2 2.61e-06               0.01                       -0.43 Brahman
## 3 2.61e-06               0.01                       -0.43 Brahman
## 4 2.61e-06               0.01                       -0.43 Brahman
## 5 2.61e-06               0.01                       -0.43 Brahman
## 6 2.61e-06               0.01                       -0.43 Brahman
##              Reference chr     database                 QTL_type start_pos
## 1 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93408980
## 2 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93621106
## 3 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93622421
## 4 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93493363
## 5 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93509417
## 6 Fortes et al. (2012)  30 Animal QTLdb Reproduction_Association  93541439
##    end_pos
## 1 93409020
## 2 93621146
## 3 93622461
## 4 93493403
## 5 93509457
## 6 93541479
##                                                                                                                                                                                                                                                                                                                                  extra_info
## 1 QTL_ID=72492;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs134273685;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=16.32;Additive_Effect=1.966
## 2 QTL_ID=72590;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs134819079;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=17.85;Additive_Effect=1.858
## 3 QTL_ID=72591;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs137529726;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=15.99;Additive_Effect=1.858
## 4 QTL_ID=72599;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs137376212;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=19.11;Additive_Effect=1.855
## 5 QTL_ID=72642;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs109407688;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=18.96;Additive_Effect=1.834
## 6 QTL_ID=72643;Name=Scrotal circumference;Abbrev=SCRCIR;PUBMED_ID=23785023;trait_ID=1270;trait=Scrotal circumference;breed=Tropical Composite;FlankMarkers=rs135158935;VTO_name=scrotum circumference;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05;Variance=20.07;Additive_Effect=1.834
##   Distance_from_gene_start Distance_from_gene_end
## 1                   222708                 222668
## 2                    10582                  10542
## 3                     9267                   9227
## 4                   138325                 138285
## 5                   122271                 122231
## 6                    90249                  90209

Note that, as well as the gene searching output, the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each interval is repeated n times. Where n is equal the number of genes within this interval.

The same analyses can be performed to haplotype data. You only need to change the argument informed to “marker=”. To choose the analysis using haplotypes, you need to inform “haplotype”.

Gene Ontology

The Gene Ontology enrichment analyses can also be performed using R. The functions available in the packages GO.db, topGO and GOstats can be used together to perform an enrichment analysis. Additionally, the information present in the database “org.Bt.eg.db” will be used.

First, we need to import a table with the genes mapped inside a group of intervals. For this, we will import the table created using the function getBM() in biomaRt package.

#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")
head(genes.interval)
##      ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000027760                          14       21890883
## 2 ENSBTAG00000002448                          14       22212868
## 3 ENSBTAG00000044404                          14       22605811
## 4 ENSBTAG00000044616                          14       22646987
## 5 ENSBTAG00000043686                          14       22654044
## 6 ENSBTAG00000017492                          14       22669363
##   end_position entrezgene external_gene_name
## 1     21891976         NA                   
## 2     22350058     517353              SNTG1
## 3     22605948         NA                 U2
## 4     22647093         NA                 U6
## 5     22654320         NA                7SK
## 6     22717576     521261             PCMTD1

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GO.db","topGO","GOstats", "org.Bt.eg.db""))

After, importing the list with the positional candidate genes, we must map all the genes annotated in the org.Bt.eg.db to the corresponding GO terms.

#Loading packages
library(org.Bt.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
library(GO.db)
## 
#org.Bt.egGO is an R object that provides mappings between entrez gene identifers and the GO identifers that they are directly associated with
entrez_ID<-org.Bt.egGO

#Getting the entrez gene identifiers that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_ID)

Using the above commands, it is possible to create a list with the GO terms and the genes associated to each one of the terms. Now, we can use functions from the GO.db and GOstats packages to perform an enrichment test.

#Loading the packages
library("GO.db")
library("GOstats")
## Loading required package: Category
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:IRanges':
## 
##     expand
## Loading required package: graph
## 
## Attaching package: 'GOstats'
## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph
#Creating the object with the entrez ID for each mapped gene (universe) 
universe <- mapped_genes

#Creating an object with the entrez ID for our positional candidate genes. Notice that we are using each Id only once and the class of the object must be the same that in the universe object (character)
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Creating the list of parameters to perform the enrichment test. Here, we will perform an enrichment analysis for Biological Process (BP)
params <- new('GOHyperGParams', geneIds=candidate_ID, universeGeneIds=universe, ontology='BP', pvalueCutoff=0.05, conditional=F, testDirection='over', annotation="org.Bt.eg.db")
## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds
#Running a hypergeometric test
hgOver <- hyperGTest(params)
hgOver
## Gene to GO BP  test for over-representation 
## 479 GO BP ids tested (51 have p < 0.05)
## Selected gene set size: 19 
##     Gene universe size: 4931 
##     Annotation package: org.Bt.eg
#Saving the summary of the results in a object
result <- summary(hgOver)

#Checking the dimensions of the results
dim(result)
## [1] 51  7
#Checking the results
head(result,10)
##        GOBPID      Pvalue  OddsRatio    ExpCount Count Size
## 1  GO:0051345 0.003374592   7.720325 0.647333198     4  168
## 2  GO:0002118 0.003853174        Inf 0.003853174     1    1
## 3  GO:0019042 0.003853174        Inf 0.003853174     1    1
## 4  GO:0019043 0.003853174        Inf 0.003853174     1    1
## 5  GO:0019046 0.003853174        Inf 0.003853174     1    1
## 6  GO:0090045 0.003853174        Inf 0.003853174     1    1
## 7  GO:2000326 0.003853174        Inf 0.003853174     1    1
## 8  GO:0043547 0.006193945   9.307345 0.385317380     3  100
## 9  GO:0043087 0.007669316   8.583929 0.416142770     3  108
## 10 GO:0002084 0.007692279 272.833333 0.007706348     1    2
##                                                                                           Term
## 1                                                    positive regulation of hydrolase activity
## 2                                                                          aggressive behavior
## 3                                                                                viral latency
## 4                                                               establishment of viral latency
## 5                                                                   release from viral latency
## 6                                                  positive regulation of deacetylase activity
## 7  negative regulation of ligand-dependent nuclear receptor transcription coactivator activity
## 8                                                       positive regulation of GTPase activity
## 9                                                                regulation of GTPase activity
## 10                                                                    protein depalmitoylation

Notice that the output table shows all the GO terms with a p-value smaller than the threshold defined in the parameters. However, we do not know which genes are associated with the enriched terms. It is possible to retrieve this information using the following command line. For example, let’s search the genes associated with the GO:0032784, “regulation of DNA-templated transcription, elongation”.

#First, we need to create a object with a mapping information for each GO term to Entrez gene ids
go_object <- as.list(org.Bt.egGO2EG)

#After this step, we can easily retrive this information using a indexing filtering in the go_object
go.genes<-go_object[["GO:0032784"]][which(go_object[["GO:0032784"]] %in% candidate_ID)]

#Showing the Entrez ID
go.genes
##      IEA 
## "505722"
#Filtering the gene information from the input table
genes.interval[genes.interval$entrezgene %in% go.genes,]
##       ensembl_gene_id hgnc_symbol chromosome_name start_position
## 11 ENSBTAG00000003460                          14       23620858
##    end_position entrezgene external_gene_name
## 11     23637730     505722              TCEA1

Additionally, a multiple testing correction can be in our results. This approach reduces the number of false-positives in our data.

#Creating a new column with the adjusted p-value using FDR correction
result$P.adjust<-p.adjust(result$Pvalue, method = "fdr", n = length(result$Pvalue))

#Checking the results
head(result,10)
##        GOBPID      Pvalue  OddsRatio    ExpCount Count Size
## 1  GO:0051345 0.003374592   7.720325 0.647333198     4  168
## 2  GO:0002118 0.003853174        Inf 0.003853174     1    1
## 3  GO:0019042 0.003853174        Inf 0.003853174     1    1
## 4  GO:0019043 0.003853174        Inf 0.003853174     1    1
## 5  GO:0019046 0.003853174        Inf 0.003853174     1    1
## 6  GO:0090045 0.003853174        Inf 0.003853174     1    1
## 7  GO:2000326 0.003853174        Inf 0.003853174     1    1
## 8  GO:0043547 0.006193945   9.307345 0.385317380     3  100
## 9  GO:0043087 0.007669316   8.583929 0.416142770     3  108
## 10 GO:0002084 0.007692279 272.833333 0.007706348     1    2
##                                                                                           Term
## 1                                                    positive regulation of hydrolase activity
## 2                                                                          aggressive behavior
## 3                                                                                viral latency
## 4                                                               establishment of viral latency
## 5                                                                   release from viral latency
## 6                                                  positive regulation of deacetylase activity
## 7  negative regulation of ligand-dependent nuclear receptor transcription coactivator activity
## 8                                                       positive regulation of GTPase activity
## 9                                                                regulation of GTPase activity
## 10                                                                    protein depalmitoylation
##      P.adjust
## 1  0.02807312
## 2  0.02807312
## 3  0.02807312
## 4  0.02807312
## 5  0.02807312
## 6  0.02807312
## 7  0.02807312
## 8  0.03269219
## 9  0.03269219
## 10 0.03269219

Notice that different multiple correction tests can be used. For more details, check the help material of the p.adjust function.

Metabolic pathways

The metabolic pathways enrichment is an important additional functional analysis to help in the selection of functional candidate genes. The package KEGGprofile allows us to perform enrichment analysis using several species. Here, we will perform an enrichment analysis using the same dataset used for the GO enrichment analysis. The function find_enriched_pathway() is responsible to perform the enrichment test. We need to inform the entrezID list of our positional candidate genes for the function.

To install the packages, the following commands can be used:

#source("https://bioconductor.org/biocLite.R")
#biocLite("KEGGprofile")
#Loading the package
library(KEGGprofile)
## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be
##   considered deprecated and future versions of Bioconductor may
##   not have it available.  Users who want more current data are
##   encouraged to look at the KEGGREST or reactome.db packages
#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")

#Creating a object with the entrez ID for our positional candidate genes.
candidate_ID<-as.character(unique(genes.interval$entrezgene))

#Running the enrichment test
keeg.enriched<-find_enriched_pathway(candidate_ID, species = "bta", returned_pvalue = 0.05, returned_adjpvalue = 0.05, returned_genenumber = 1, download_latest = TRUE)

str(keeg.enriched)
## List of 2
##  $ stastic:'data.frame': 2 obs. of  6 variables:
##   ..$ Pathway_Name: chr [1:2] "Arachidonic acid metabolism" "Olfactory transduction"
##   ..$ Gene_Found  : int [1:2] 5 23
##   ..$ Gene_Pathway: int [1:2] 91 1245
##   ..$ Percentage  : num [1:2] 0.05 0.02
##   ..$ pvalue      : num [1:2] 2.23e-05 1.31e-08
##   ..$ pvalueAdj   : num [1:2] 3.58e-03 4.23e-06
##  $ detail :List of 2
##   ..$ 00590: chr [1:5] "100295883" "507016" "507615" "509506" ...
##   ..$ 04740: chr [1:23] "100299372" "504766" "506202" "506533" ...

The arguments informed to find_enriched_pathway() were the list of unique Entrez IDs, the species (bta= Bos taurus), the p-value threshold, the adjusted p-value threshold, the minimum number of annotated genes for enriched pathways. The last argument informs if the latest version of the KEGG pathway database should be download for the analysis.

Notice that the output of this function is a list. This list is composed by the table with the enrichment results and a vector with the Entrez IDs associated with each pathway. Therefore, we can handle this output to obtain individual objects with this information and filter our positional candidate genes from our input file.

#Creating an object with the enriched pathways
enrich_table<-keeg.enriched$stastic

enrich_table
##                      Pathway_Name Gene_Found Gene_Pathway Percentage
## 00590 Arachidonic acid metabolism          5           91       0.05
## 04740      Olfactory transduction         23         1245       0.02
##             pvalue    pvalueAdj
## 00590 2.225140e-05 3.582475e-03
## 04740 1.313752e-08 4.230283e-06
#Filtering the input file based on the genes associated with the enriched pathway
enriched.genes<-genes.interval[genes.interval$entrezgene %in% keeg.enriched$detail[["04145"]],]

enriched.genes
## [1] ensembl_gene_id    hgnc_symbol        chromosome_name   
## [4] start_position     end_position       entrezgene        
## [7] external_gene_name
## <0 rows> (or 0-length row.names)

Notice that only the pathway “04145” (Phagosome) was enriched. Therefore, its ID was used in the filtering process. The commands above can be adapted to filter multiple the enriched pathways.