In the last tutorials, we learned how to search for positional and functional candidate genes using several software. This is a great strategy. However, during the process, several output files are created and we must to manage this files in a very caution way. We must convert different outputs to use as input files in others software, combine different results to select candidate genes, to create plots in order to summarize the results. The larger number of genes in the input files and the larger amount of information retrieved using the software showed up to now, can increase the complexity of those processes. R is a good alternative to manage and combine multiple results in a very straightforward way. There are several packages available in R that performs the same (and additional) analyses. Work with this kind of analysis within R allows the design of a pipeline to combine different steps and to create outputs in the same environment. These characteristics result in an analysis that is performed in a much more efficient way and simplify the management process of the intermediary files. Additionally, using R it is possible to create new kind of analysis and /or to adapt previous approaches to better fit to you data. Here, we will see some possibilities to use R to identify positional and functional candidate genes.
The package biomaRt “provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries”. Therefore, it is possibke to retrive all the information available using the online version of Biomart, using R.
Here, we will use the same input file used in the online version of Biomart. The file to be imported in R is called “input_candidate_markers.txt”.
#The following command can be used to install the package (you just need to remove the comment character #)
#install.packages("biomaRt")
#Loading the input files
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")
#Checking the imported file
head(input.regions)
## CHR SNP BP A1 TEST NMISS BETA STAT
## 1 6 Hapmap25274-BTA-143756 14314482 A ADD 1093 0.06197 7.352
## 2 12 ARS-BFGL-NGS-1794 88834796 B ADD 1093 0.14470 7.259
## 3 13 Hapmap48963-BTA-96734 999229 B ADD 1089 0.05813 7.062
## 4 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909
## 5 1 ARS-BFGL-NGS-97757 149746651 A ADD 1093 0.06505 6.872
## 6 18 ARS-BFGL-NGS-13803 14526709 B ADD 1090 0.05189 6.780
## P
## 1 3.82e-13
## 2 7.41e-13
## 3 2.93e-12
## 4 8.28e-12
## 5 1.07e-11
## 6 1.97e-11
#Loading the package
library(biomaRt)
After load the input file and load the package, it is possible to start the search of genes within our selected interval.
#The following command shows the databases that are available to retrive information
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 90
## 2 ENSEMBL_MART_MOUSE Mouse strains 90
## 3 ENSEMBL_MART_SNP Ensembl Variation 90
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 90
#We will choose ENSEMBL_MART_ENSEMBL, which correspond to Ensembl Genes 90
#It is necessary to creat a object of the type "Mart" to load the database information. This is performed with the following command:
mart <- useMart("ENSEMBL_MART_ENSEMBL")
#Once the object mart was created using useMart() function, we need to select the organism. The following comand can be used to identify the correspondent argument to each organism
#listDatasets(mart)
#The function useDataset() can be used to insert the organism infortation in the mart object
mart <- useDataset("btaurus_gene_ensembl", mart)
#The following command can be used to list the filters available to use in the retriving process
#listFilters(mart)
#Here, we will use the chromosome region filter ("chromosomal_region") to retrive the genes mapped in the intervals selected
filter<-"chromosomal_region"
#The following command will be used to creat the chromosomal coordinates using a 500Kb upstream and downstream for each slected marker
value<-paste(input.regions$CHR, ":", (input.regions$BP)-500000, ":", (input.regions$BP)+500000, sep="")
value
## [1] "6:13814482:14814482" "12:88334796:89334796"
## [3] "13:499229:1499229" "20:44606413:45606413"
## [5] "1:149246651:150246651" "18:14026709:15026709"
## [7] "14:22752097:23752097" "12:82524081:83524081"
## [9] "13:14646915:15646915" "15:70206906:71206906"
## [11] "3:52843172:53843172" "8:16073737:17073737"
## [13] "10:93530802:94530802" "6:230937:1230937"
## [15] "8:18297386:19297386" "5:108999140:109999140"
## [17] "1:138594903:139594903" "6:13754604:14754604"
## [19] "5:63690317:64690317" "18:13646373:14646373"
#The following command allows to list all the attirbutes possible to be retrived from Biomart
#listAttributes(mart)
#The attributes to be retrived will be inserted in a object
attributes <- c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position","entrezgene","external_gene_name")
#At this point, we need to inform the objects created in the previous steps as arguments of the function getBM()
all.genes <- getBM(attributes=attributes, filters=filter, values=value, mart=mart)
#Checking the output
str(all.genes)
## 'data.frame': 157 obs. of 7 variables:
## $ ensembl_gene_id : chr "ENSBTAG00000012798" "ENSBTAG00000030105" "ENSBTAG00000018682" "ENSBTAG00000023384" ...
## $ hgnc_symbol : chr "KCNH8" "" "SETD4" "" ...
## $ chromosome_name : int 1 1 1 1 1 1 1 10 10 10 ...
## $ start_position : int 138455304 149720302 150025827 150054221 150108438 150144149 150169936 93417338 93617616 93638609 ...
## $ end_position : int 138676649 149720392 150048269 150064636 150110508 150153934 150293609 93571380 93677055 93638752 ...
## $ entrezgene : int 100336609 NA 540440 515946 510180 516036 514520 281553 785612 NA ...
## $ external_gene_name: chr "KCNH8" "" "SETD4" "" ...
head(all.genes)
## ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798 KCNH8 1 138455304
## 2 ENSBTAG00000030105 1 149720302
## 3 ENSBTAG00000018682 SETD4 1 150025827
## 4 ENSBTAG00000023384 1 150054221
## 5 ENSBTAG00000030629 1 150108438
## 6 ENSBTAG00000018688 CBR3 1 150144149
## end_position entrezgene external_gene_name
## 1 138676649 100336609 KCNH8
## 2 149720392 NA
## 3 150048269 540440 SETD4
## 4 150064636 515946
## 5 150110508 510180
## 6 150153934 516036 CBR3
#Saving the ouput
write.table(all.genes, file="output_biomart_genes_within_interval.txt", row.names=F, sep="\t", quote=F)
Using the command listed above, it is possible to obtain the same results obtained in the online version of biomart. Notice that it is possible to ru these commands for all the organisms available in the Ensembl database, for all the filters and atributes. Check the help material for each function to obtain more information. For example “?getBM()”.
The manual of biomaRt can de found here:
https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf
We created a R function to search genes or QTLs within selected intervals. This function can handle with SNP and haplotype data.
The function is called “find_genes_qtls_around_markers”. For this function, you should inform some arguments:
db_file: For this argument you should inform the name of the file with the gene mapping or QTL information. For the gene mapping, you should use the .gtf file download from Ensembl data base (ftp://ftp.ensembl.org/pub/release-90/gtf/bos_taurus/). For the QTL search, you need to inform the .gff file that can be downloaded from Cattle QTlLdb (https://www.animalgenome.org/cgi-bin/QTLdb/BT/download?file=gbpUMD_3.1).
marker_file: The file with the SNP or haplotype positions. Detail: For SNP files, you must have a column called “CHR” and a column called “BP” with the chromosome and base pair position, respectively. For the haplotype, you must have three columns: “CHR”, “BP1” and “BP2”. All the columns names are capitals, as weel as the PLINK output.
method: “gene” or “qtl”
marker: “snp” or “haplotype”
interval: The interval in base pair
All positions must be informed using base pairs (not Kb or Mb).
Important: This function is a beta version Therefore, errors or bugs might be found. Use with carefully.
In order to execute this fundtion, it is necessary that the packages refGenome, mefa and gtools. These packages can be installed using the following command.
#install.packages(c("refGenome", "mefa","gtools"))
#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")
#Loading the function from a external file
source("complete_function.R")
#Running the function to search genes in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="Bos_taurus.UMD3.1.90.chr.gtf",marker_file="input_candidate_markers.txt",method="qtl",marker="snp",interval=500000)
## Loading required package: doBy
## Loading required package: RSQLite
## mefa 3.2-7 2016-01-11
## You are using the method: qtl with snp
## Warning in find_genes_qtls_around_markers(db_file =
## "Bos_taurus.UMD3.1.90.chr.gtf", : NAs introduzidos por coerção
## Starting QTL searching using 5e+05 bp as interval
## Starting analysis for chromosome 1
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 3
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 5
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 6
##
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
## Starting analysis for chromosome 8
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 10
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 12
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 13
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 14
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 15
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 18
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 20
##
|
| | 0%
|
|=================================================================| 100%
#An output file will be save in your computer. This file is called: "genes_around_SNPs.txt". Now we will import this file and check the output format.
genes.interval<-read.table("genes_around_SNPs.txt", h=T, sep="\t")
head(genes.interval)
## CHR SNP BP A1 TEST NMISS BETA STAT P
## 1 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 2 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 3 18 ARS-BFGL-NGS-39953 14146373 A ADD 1091 0.04613 6.166 9.83e-10
## 4 18 ARS-BFGL-NGS-39953 14146373 A ADD 1091 0.04613 6.166 9.83e-10
## 5 18 ARS-BFGL-NGS-39953 14146373 A ADD 1091 0.04613 6.166 9.83e-10
## 6 18 ARS-BFGL-NGS-39953 14146373 A ADD 1091 0.04613 6.166 9.83e-10
## chr ID gene_name start_pos end_pos
## 1 20 ENSBTAG00000043813 5S_rRNA 45050487 45050605
## 2 20 ENSBTAG00000043184 U6 45414399 45414504
## 3 18 ENSBTAG00000045507 ZNF469 13740622 13753849
## 4 18 ENSBTAG00000015766 ZFPM1 13809878 13826787
## 5 18 ENSBTAG00000018507 <NA> 13860761 13865260
## 6 18 ENSBTAG00000019520 ZC3H18 13879228 13922305
## Distance_from_gene_start Distance_from_gene_end
## 1 55926 55808
## 2 -307986 -308091
## 3 405751 392524
## 4 336495 319586
## 5 285612 281113
## 6 267145 224068
Note that the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each inverval is repeated n time. Where n is equal the number of genes within this interval.
#Importing the input file
input.regions<-read.table("input_candidate_markers.txt", h=T, sep="\t")
#Loading the function from a external file
source("complete_function.R")
#Running the function to search QTLs in a interval of 500Kb upstream and downstream from each marker
find_genes_qtls_around_markers(db_file="QTL_UMD_3.1.gff",marker_file="input_candidate_markers.txt",method="qtl",marker="snp",interval=500000)
## You are using the method: qtl with snp
## Starting QTL searching using 5e+05 bp as interval
## Starting analysis for chromosome 1
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 3
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 5
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 6
##
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
## Starting analysis for chromosome 8
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 10
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 12
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 13
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 14
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 15
##
|
| | 0%
|
|=================================================================| 100%
## Starting analysis for chromosome 18
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
## Starting analysis for chromosome 20
##
|
| | 0%
|
|=================================================================| 100%
#An output file will be save in your computer. This file is called: "QTLs_around_SNPs.txt". Now we will import this file and check the output format.
qtls.interval<-read.table("QTLs_around_SNPs.txt", h=T, sep="\t")
head(qtls.interval)
## CHR SNP BP A1 TEST NMISS BETA STAT P
## 1 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 2 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 3 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 4 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 5 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## 6 20 Hapmap61037-rs29016327 45106413 A ADD 1093 0.05491 6.909 8.28e-12
## chr database QTL_type start_pos end_pos
## 1 20 Animal QTLdb Production_Association 44916096 44916136
## 2 20 Animal QTLdb Exterior_Association 44694544 44694584
## 3 20 Animal QTLdb Reproduction_Association 44694544 44694584
## 4 20 Animal QTLdb Exterior_Association 44694544 44694584
## 5 20 Animal QTLdb Milk_Association 44694544 44694584
## 6 20 Animal QTLdb Milk_Association 44694544 44694584
## extra_info
## 1 QTL_ID=69077;Name=Body weight gain;Abbrev=BWG;PUBMED_ID=19966163;trait_ID=1612;trait=Body weight gain;breed=charolais,gelbvieh,hereford,limousin,pinzgauer,red angus,red poll,simmental,Angus;FlankMarkers=rs29015683;VTO_name=postnatal growth trait;Map_Type=Genome;Model=Mendelian;Test_Base=Experiment-wise;Significance=Significant;P-value=<0.05
## 2 QTL_ID=50811;Name=Dairy form;Abbrev=DYF;PUBMED_ID=21831322;trait_ID=1137;trait=Dairy form;breed=holstein;FlankMarkers=rs41641100;VTO_name=back conformation trait;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 3 QTL_ID=50812;Name=Stillbirth (maternal);Abbrev=SB;PUBMED_ID=21831322;trait_ID=1278;trait=Stillbirth (maternal);breed=holstein;FlankMarkers=rs41641100;VTO_name=stillborn offspring quantity;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 4 QTL_ID=50813;Name=Foot angle;Abbrev=FANG;PUBMED_ID=21831322;trait_ID=1101;trait=Foot angle;breed=holstein;FlankMarkers=rs41641100;VTO_name=hoof angle;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 5 QTL_ID=50814;Name=Milk fat percentage;Abbrev=FP;PUBMED_ID=21831322;trait_ID=1037;trait=Milk fat percentage;breed=holstein;FlankMarkers=rs41641100;VTO_name=milk fat amount;PTO_name=milk fat content;CMO_name=milk fat percentage;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## 6 QTL_ID=50815;Name=Milk fat yield;Abbrev=FY;PUBMED_ID=21831322;trait_ID=1039;trait=Milk fat yield;breed=holstein;FlankMarkers=rs41641100;VTO_name=milk fat amount;PTO_name=milk fat yield;CMO_name=milk fat yield;Map_Type=Genome;Model=Mendelian;Test_Base=Genome-wise;peak_cM=51.72;Significance=Significant;P-value=<0.05
## Distance_from_gene_start Distance_from_gene_end
## 1 190317 190277
## 2 411869 411829
## 3 411869 411829
## 4 411869 411829
## 5 411869 411829
## 6 411869 411829
Note that, as well as the gene searching output, the output was created merging the columns from your input files with the gene coordinates obtained in the gtf file. Therefore, each inverval is repeated n time. Where n is equal the number of genes within this interval.
The same analyses can be performed to haplotype data. You only need to change the argument informed to “marker=”. To choose the analysis using haplotypes, you need to inform “haplotype”.
The Gene Ontology enrichment analyses are also possible to be performed using R. The functions available in the packages GO.db, topGO and GOstats can be used together to perform a enrichment analysis. Additionally, the information present in the database “org.Bt.eg.db” will be used.
Firt, we need to import a table with the genes mapped inside an group of intervals. For this, we will import the table created using the function getBM() in biomaRt package.
#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")
head(genes.interval)
## ensembl_gene_id hgnc_symbol chromosome_name start_position
## 1 ENSBTAG00000012798 KCNH8 1 138455304
## 2 ENSBTAG00000030105 1 149720302
## 3 ENSBTAG00000018682 SETD4 1 150025827
## 4 ENSBTAG00000023384 1 150054221
## 5 ENSBTAG00000030629 1 150108438
## 6 ENSBTAG00000018688 CBR3 1 150144149
## end_position entrezgene external_gene_name
## 1 138676649 100336609 KCNH8
## 2 149720392 NA
## 3 150048269 540440 SETD4
## 4 150064636 515946
## 5 150110508 510180
## 6 150153934 516036 CBR3
To install the packages, the following commands can be used:
#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GO.db","topGO","GOstats", "org.Bt.eg.db""))
After to import the list with the positional candidate genes, we must map all the genes annoated in the org.Bt.eg.db to the correspondent GO terms.
#Loading packages
library(org.Bt.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist, unsplit
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
library(GO.db)
##
#org.Bt.egGO is an R object that provides mappings between entrez gene identifers and the GO identifers that they are directly associated with
entrez_ID<-org.Bt.egGO
#Getting the entrez gene identifiers that are mapped to a GO ID
mapped_genes <- mappedkeys(entrez_ID)
Using the above commands, it was possible to create a list with the GO terms and the genes associated to each one of the terms. Now, we can use functions from the GO.db and GOstats packages to perform an enrichment test.
#Loading the packages
library("GO.db")
library("GOstats")
## Loading required package: Category
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:IRanges':
##
## expand
## Loading required package: graph
##
## Attaching package: 'GOstats'
## The following object is masked from 'package:AnnotationDbi':
##
## makeGOGraph
#Creating the object with the entrez ID for each mapped gene (universe)
universe <- mapped_genes
#Creating a object with the entrez ID for our positional candidate genes. Notice that we are using each Id only once and the class of the object must be the same that in the universe object (character)
candidate_ID<-as.character(unique(genes.interval$entrezgene))
#Creating the list of parameters to perform the enrichment test. Here, we will perform a enrichment analysis for Biological Process (BP)
params <- new('GOHyperGParams', geneIds=candidate_ID, universeGeneIds=universe, ontology='BP', pvalueCutoff=0.05, conditional=F, testDirection='over', annotation="org.Bt.eg.db")
## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds
#Running a hypergeometric test
hgOver <- hyperGTest(params)
hgOver
## Gene to GO BP test for over-representation
## 1068 GO BP ids tested (129 have p < 0.05)
## Selected gene set size: 39
## Gene universe size: 4931
## Annotation package: org.Bt.eg
#Saving the summary of the results in a object
result <- summary(hgOver)
#Checking the dimensions of the results
dim(result)
## [1] 129 7
#Checking the results
head(result,10)
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0017144 0.0001819737 264.37838 0.023727439 2 3
## 2 GO:0032784 0.0068208644 18.83398 0.126546339 2 16
## 3 GO:0006423 0.0079091462 Inf 0.007909146 1 1
## 4 GO:0016999 0.0079091462 Inf 0.007909146 1 1
## 5 GO:0031635 0.0079091462 Inf 0.007909146 1 1
## 6 GO:0034137 0.0079091462 Inf 0.007909146 1 1
## 7 GO:0035696 0.0079091462 Inf 0.007909146 1 1
## 8 GO:0035723 0.0079091462 Inf 0.007909146 1 1
## 9 GO:0035724 0.0079091462 Inf 0.007909146 1 1
## 10 GO:0038194 0.0079091462 Inf 0.007909146 1 1
## Term
## 1 drug metabolic process
## 2 regulation of DNA-templated transcription, elongation
## 3 cysteinyl-tRNA aminoacylation
## 4 antibiotic metabolic process
## 5 adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6 positive regulation of toll-like receptor 2 signaling pathway
## 7 monocyte extravasation
## 8 interleukin-15-mediated signaling pathway
## 9 CD24 biosynthetic process
## 10 thyroid-stimulating hormone signaling pathway
Notice that the output table shows all the GO terms with a p-value smaller than the threshold defined in the parameters. However, we do not know which genes are associated with the enriched terms. It is possible to retrive this information using the following command line. For example, let’s seacrh the genes associated with the GO:0032784, “regulation of DNA-templated transcription, elongation”.
#First, we need to create a object with a mapping information for each GO term to Entrez gene ids
go_object <- as.list(org.Bt.egGO2EG)
#After this step, we can easily retrive this information using a indexing filtering in the go_object
go.genes<-go_object[["GO:0032784"]][which(go_object[["GO:0032784"]] %in% candidate_ID)]
#Showing the Entrez ID
go.genes
## IEA ISS
## "505722" "533643"
#Filtering the gene information from the input table
genes.interval[genes.interval$entrezgene %in% go.genes,]
## ensembl_gene_id hgnc_symbol chromosome_name start_position
## 46 ENSBTAG00000003460 14 23620858
## 99 ENSBTAG00000019133 ZNF326 3 53385189
## end_position entrezgene external_gene_name
## 46 23637730 505722 TCEA1
## 99 53419460 533643 ZNF326
Additionally, a multiple testing correction can be in our results. This approach reduce the number of false-positives in our data.
#Creating a new column with the adjusted p-value using FDR correction
result$P.adjust<-p.adjust(result$Pvalue, method = "fdr", n = length(result$Pvalue))
#Checking the results
head(result,10)
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0017144 0.0001819737 264.37838 0.023727439 2 3
## 2 GO:0032784 0.0068208644 18.83398 0.126546339 2 16
## 3 GO:0006423 0.0079091462 Inf 0.007909146 1 1
## 4 GO:0016999 0.0079091462 Inf 0.007909146 1 1
## 5 GO:0031635 0.0079091462 Inf 0.007909146 1 1
## 6 GO:0034137 0.0079091462 Inf 0.007909146 1 1
## 7 GO:0035696 0.0079091462 Inf 0.007909146 1 1
## 8 GO:0035723 0.0079091462 Inf 0.007909146 1 1
## 9 GO:0035724 0.0079091462 Inf 0.007909146 1 1
## 10 GO:0038194 0.0079091462 Inf 0.007909146 1 1
## Term
## 1 drug metabolic process
## 2 regulation of DNA-templated transcription, elongation
## 3 cysteinyl-tRNA aminoacylation
## 4 antibiotic metabolic process
## 5 adenylate cyclase-inhibiting opioid receptor signaling pathway
## 6 positive regulation of toll-like receptor 2 signaling pathway
## 7 monocyte extravasation
## 8 interleukin-15-mediated signaling pathway
## 9 CD24 biosynthetic process
## 10 thyroid-stimulating hormone signaling pathway
## P.adjust
## 1 0.02347461
## 2 0.04049741
## 3 0.04049741
## 4 0.04049741
## 5 0.04049741
## 6 0.04049741
## 7 0.04049741
## 8 0.04049741
## 9 0.04049741
## 10 0.04049741
Notice that different multiple correction tests can be used. For more details, check the help material of the p.adjust function.
The metabolic pathways enrichment is an important additional functional analysis to help in the selection of functional candidate genes. The package KEGGprofile allows to perform enrichment analysis using several especies. Here, we will perform a encrihment analysis using the same dataset used for the GO enrichment analysis. The function find_enriched_pathway() is responsible to perform the enrichment test. We need to inform the entrezID list of our positional candidate genes for the function.
To install the packages, the following commands can be used:
#source("https://bioconductor.org/biocLite.R")
#biocLite("KEGGprofile")
#Loading the package
library(KEGGprofile)
##
## KEGG.db contains mappings based on older data because the original
## resource was removed from the the public domain before the most
## recent update was produced. This package should now be
## considered deprecated and future versions of Bioconductor may
## not have it available. Users who want more current data are
## encouraged to look at the KEGGREST or reactome.db packages
#Importing the list with the positional candiate genes
genes.interval<-read.table("output_biomart_genes_within_interval.txt", h=T, sep="\t")
#Creating a object with the entrez ID for our positional candidate genes.
candidate_ID<-as.character(unique(genes.interval$entrezgene))
#Running the enrichment test
keeg.enriched<-find_enriched_pathway(candidate_ID, species = "bta", returned_pvalue = 0.05, returned_adjpvalue = 0.05, returned_genenumber = 1, download_latest = TRUE)
str(keeg.enriched)
## List of 2
## $ stastic:'data.frame': 4 obs. of 6 variables:
## ..$ Pathway_Name: chr [1:4] "Metabolism of xenobiotics by cytochrome P450" "RNA polymerase" "Phagosome" "Synaptic vesicle cycle"
## ..$ Gene_Found : int [1:4] 3 2 5 3
## ..$ Gene_Pathway: int [1:4] 66 30 182 65
## ..$ Percentage : num [1:4] 0.05 0.07 0.03 0.05
## ..$ pvalue : num [1:4] 0.000352 0.000478 0.000291 0.000332
## ..$ pvalueAdj : num [1:4] 0.0373 0.038 0.0373 0.0373
## $ detail :List of 4
## ..$ 00980: chr [1:3] "510180" "515946" "516036"
## ..$ 03020: chr [1:2] "107133113" "614776"
## ..$ 04145: chr [1:5] "281111" "282657" "287017" "768036" ...
## ..$ 04721: chr [1:3] "282657" "287017" "529659"
The arguments informed to find_enriched_pathway() were the list of unique Entrez IDs, the especie (bta= Bos taurus), the p-value threshold, the adjusted p-value threshold, the minimum number of annotated genes for enriched pathways. The last argument inform if the lastest version of the KEGG pathway database should be download for the analysis.
Notice that the output of this function is a list. This list is composed by the table with the enrichment results and a vector with the Entrez IDs associated with each pathway. Therefore, we can handle this output to obtain individual objects with this information and filter our positional candidate genes from out input file.
#Creating a object with the enriched pathways
enrich_table<-keeg.enriched$stastic
enrich_table
## Pathway_Name Gene_Found Gene_Pathway
## 00980 Metabolism of xenobiotics by cytochrome P450 3 66
## 03020 RNA polymerase 2 30
## 04145 Phagosome 5 182
## 04721 Synaptic vesicle cycle 3 65
## Percentage pvalue pvalueAdj
## 00980 0.05 0.0003518575 0.03729690
## 03020 0.07 0.0004775888 0.03796831
## 04145 0.03 0.0002914335 0.03729690
## 04721 0.05 0.0003319200 0.03729690
#Filtering the input file based on the genes associated with the enriched pathway
enriched.genes<-genes.interval[genes.interval$entrezgene %in% keeg.enriched$detail[["04145"]],]
enriched.genes
## ensembl_gene_id hgnc_symbol chromosome_name start_position
## 44 ENSBTAG00000003450 ATP6V1H 14 23511301
## 58 ENSBTAG00000003895 CYBA 18 13931107
## 89 ENSBTAG00000023730 TUBB3 18 14761260
## 122 ENSBTAG00000014238 ATP6V1E1 5 109564503
## 128 ENSBTAG00000018530 TUBA8 5 109876442
## end_position entrezgene external_gene_name
## 44 23558678 282657 ATP6V1H
## 58 13938075 281111 CYBA
## 89 14769914 768070 TUBB3
## 122 109578159 287017 ATP6V1E1
## 128 109894993 768036 TUBA8
Notice that only the pathway “04145” (Phagosome) was enriched. Tehrefore, its ID was used in the filtering process. The commands above can be adapter to filter multiple the enriched pathways.
The gene network analysis is a interesting approach to better understand the relationship between the positional candidate genes. The identification of gene networks can help to select the functional candidate genes, to identify key regulatory genes and to better understand the biological processes related with the development of complex traits.
The package STRINGdb provides a R interface to the STRING protein-protein interactions database (http://www.string-db.org).
To install the packages, the following commands can be used:
#source("https://bioconductor.org/biocLite.R")
#biocLite("STRINGdb"))
#Loading the package
library(STRINGdb)
## Loading required package: png
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:graph':
##
## join
## The following object is masked from 'package:IRanges':
##
## desc
## The following object is masked from 'package:S4Vectors':
##
## rename
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:graph':
##
## degree, edges, intersection, union
## The following objects are masked from 'package:IRanges':
##
## compare, simplify, union
## The following objects are masked from 'package:S4Vectors':
##
## compare, union
## The following objects are masked from 'package:BiocGenerics':
##
## normalize, union
## The following object is masked from 'package:gtools':
##
## permute
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: plotrix
## Loading required package: RColorBrewer
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:plotrix':
##
## plotCI
## The following object is masked from 'package:IRanges':
##
## space
## The following object is masked from 'package:stats':
##
## lowess
## Loading required package: hash
## hash-2.2.6 provided by Decision Patterns
##
## Attaching package: 'hash'
## The following object is masked from 'package:AnnotationDbi':
##
## keys
## The following objects are masked from 'package:IRanges':
##
## values, values<-
## The following objects are masked from 'package:S4Vectors':
##
## values, values<-
## The following object is masked from 'package:biomaRt':
##
## keys
#Creating a Data frame with the Gene symbols
candidate_ID<-data.frame(gene=unique(genes.interval$hgnc_symbol))
#Loading the STRING interface for Bos taurus (ID=9913)
string_db <- STRINGdb$new(species=9913)
## WARNING: You didn't specify a version of the STRING database to use. Hence we will use STRING 10
#Mapping the interactions of the genes presen in our input list
gene_mapped <- string_db$map(candidate_ID, "gene", removeUnmappedRows = TRUE )
## Warning: we couldn't map to STRING 4% of your identifiers
hits <- gene_mapped$STRING_id
#Plotting the gene network
string_db$plot_network(hits, payload_id=NULL, required_score=NULL, add_link=T, add_summary=T)
It is possible to perform GO and KEGG pathways enrichment analyses using STRINGdb package, as well as in the online version o STRING db.
#GO enrichment analysis
enrichmentGO <- string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )
head(enrichmentGO)
## term_id proteins hits pvalue pvalue_fdr
## 1 GO:0050896 1416 10 1.237745e-05 0.009543015
## 2 GO:0019538 1047 8 6.687328e-05 0.017436151
## 3 GO:0051716 1057 8 7.151369e-05 0.017436151
## 4 GO:0071822 334 5 9.045993e-05 0.017436151
## 5 GO:0044267 878 7 1.618923e-04 0.021268626
## 6 GO:0051047 80 3 1.880052e-04 0.021268626
## term_description
## 1 response to stimulus
## 2 protein metabolic process
## 3 cellular response to stimulus
## 4 protein complex subunit organization
## 5 cellular protein metabolic process
## 6 positive regulation of secretion
#KEGG enrichment analysis
enrichmentKEGG <- string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )
head(enrichmentKEGG)
## term_id proteins hits pvalue pvalue_fdr
## 1 04145 151 5 2.358668e-06 0.0002052041
## 2 04540 84 3 2.415520e-04 0.0088705582
## 3 05323 91 3 3.058813e-04 0.0088705582
## 4 05146 103 3 4.401131e-04 0.0095724593
## 5 04721 61 2 3.525329e-03 0.0546216535
## 6 04918 68 2 4.360907e-03 0.0546216535
## term_description
## 1 Phagosome
## 2 Gap junction
## 3 Rheumatoid arthritis
## 4 Amoebiasis
## 5 Synaptic vesicle cycle
## 6 Thyroid hormone synthesis
There are several other options to perform clustering and additional enrichment analysis in the STRINGdb package. Some examples can be found here:
This tutorial showed some interesting analyses that can be performed using R. These analyses can be integrated in a pipeline, where the output from different functions and/or packages can be directly used by other functions/packages. This procedure increase the efficiency and the management process for functional studies when hundred (or even thousands) of genes are analyzed.