Let's load the necessary libraries first.
#Uncomment these lines if the BiocManager & org.Hs.eg.db packages are not installed.
#if (!require("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db) #Annotation package that contains the database itself
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
##
library(readxl) #To read and support excel documents
library(clusterProfiler) #To access bitr function from this package
##
## clusterProfiler v4.12.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
##
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
The code below reads gene IDs from an Excel sheet in a given directory.
If the user's data is already loaded in the R environment and has a column with gene IDs, the user can skip data <- read_excel("../Documents/Gene IDs.xlsx"), access that column directly, and store the IDs in a variable (as done in the genes <- ... line below); the rest is the same.
#Load the data from the Excel file in the directory
data <- read_excel("../Documents/Gene IDs.xlsx")
#Access the column holding the gene IDs in the Excel file & store it in the variable "genes"
genes <- data$`Gene Ids`
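Gene ID columns that come from spreadsheets often carry stray spaces, empty cells, or repeated entries, and any of these can cause a lookup to miss. The following is a minimal sketch of a cleaning step that could be applied to a vector such as genes before annotation. The helper name clean_gene_ids and the toy input are assumptions made for illustration; they are not part of the original workflow.

```r
# Toy stand-in for a gene ID column read from a spreadsheet (illustrative values).
raw_ids <- c(" TP53", "BRCA1 ", "TP53", NA, "", "EGFR")

# Hypothetical helper: tidy a vector of gene IDs before mapping.
clean_gene_ids <- function(x) {
  x <- trimws(as.character(x))     # remove leading/trailing whitespace
  x <- x[!is.na(x) & nzchar(x)]    # drop missing and empty entries
  unique(x)                        # keep each ID once, in first-seen order
}

genes_clean <- clean_gene_ids(raw_ids)
print(genes_clean)
```

In the workflow above, the same helper could be applied as genes <- clean_gene_ids(genes) before calling the annotation function.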
After we have the data prepared, we’re good to go for annotation.
#To see available Key Types (Supported Identifiers) in org.Hs.eg.db Package
keytypes <- keytypes(org.Hs.eg.db)
keytypes
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
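Key types are case-sensitive strings, so a small typo such as "Symbol" instead of "SYMBOL" makes the mapping fail. As a hedged sketch, the check below stops early with a readable message when a requested key type is not in the supported list. The helper check_keytype is hypothetical, and the vector available is a hand-copied subset of the output above; in practice it could be replaced by keytypes(org.Hs.eg.db).

```r
# Subset of the key types printed above, copied by hand for this sketch.
available <- c("ALIAS", "ENSEMBL", "ENTREZID", "SYMBOL", "UNIPROT")

# Hypothetical helper: fail early if a key type is not supported.
check_keytype <- function(type, available) {
  if (!type %in% available) {
    stop(sprintf("Unsupported key type '%s'. Choose one of: %s",
                 type, paste(available, collapse = ", ")))
  }
  invisible(type)
}

check_keytype("SYMBOL", available)   # passes silently
check_keytype("ENSEMBL", available)  # passes silently

# A wrongly capitalised key type is caught; the message is kept for inspection.
bad <- tryCatch(check_keytype("Symbol", available),
                error = function(e) conditionMessage(e))
print(bad)
```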
#Annotation (mapping) of gene IDs from gene symbols to ENSEMBL IDs using the bitr function from clusterProfiler
annotated_ids1 <- bitr(genes, fromType = "SYMBOL", toType = "ENSEMBL", OrgDb = org.Hs.eg.db)
## 'select()' returned 1:1 mapping between keys and columns
View(annotated_ids1)
#Use this line only if the user wants to drop the original gene column & keep just the annotated one
annotated_ids1 <- annotated_ids1[, 2]
View(annotated_ids1)
#Converting the character vector of gene IDs into a data frame, to save it to Excel later
annotated_ids1 <- data.frame(annotated_ids1)
View(annotated_ids1)
colnames(annotated_ids1) <- "ENSEMBL_IDs" #Renaming it
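Once the symbol column is dropped, there is no way to tell which input gene produced which ENSEMBL ID, and bitr by default leaves unmapped symbols out of its result. The sketch below, written in base R on a toy table shaped like bitr output, shows one possible way to flag symbols with several hits, list the unmapped ones, and keep every input symbol in the final table. All identifiers in it are placeholders, not real gene or Ensembl IDs.

```r
# Toy table shaped like bitr() output (columns SYMBOL and ENSEMBL).
# Every identifier here is a placeholder.
mapping <- data.frame(
  SYMBOL  = c("GENE_A", "GENE_B", "GENE_B"),
  ENSEMBL = c("ENS_A1", "ENS_B1", "ENS_B2"),
  stringsAsFactors = FALSE
)
input_ids <- c("GENE_A", "GENE_B", "GENE_C")

# Symbols that map to more than one target ID.
n_hits <- table(mapping$SYMBOL)
multi_mapped <- names(n_hits)[n_hits > 1]

# Input symbols with no match at all.
unmapped <- setdiff(input_ids, mapping$SYMBOL)

# Left join: every input symbol is kept, unmapped ones get NA.
full_table <- merge(data.frame(SYMBOL = input_ids, stringsAsFactors = FALSE),
                    mapping, by = "SYMBOL", all.x = TRUE)
print(full_table)
```

Keeping both columns in the saved file makes it easier to trace each ENSEMBL ID back to its gene symbol later.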
Now that the annotations are done and stored in a data.frame, we can write an xlsx file and save it to a directory. We call the “write.xlsx” function directly from the “openxlsx” library.
#install.packages("openxlsx")
library(openxlsx) #To write Excel (.xlsx) sheets directly from the R environment
#Storing the annotated IDs in an (.xlsx) file & saving it to a directory
openxlsx::write.xlsx(annotated_ids1, "../Documents/annotated_ids1.xlsx",
                     colNames = TRUE)
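To confirm that the file was written as expected, it can be read back and compared with the data frame in memory. The sketch below is one possible check, assuming the openxlsx package is installed; it writes to a temporary file rather than to the directory used above, and the IDs are placeholders.

```r
# Placeholder result table; the IDs are not real Ensembl IDs.
result <- data.frame(ENSEMBL_IDs = c("ENS_A1", "ENS_B1"),
                     stringsAsFactors = FALSE)

# Temporary path used only for this sketch.
out_file <- tempfile(fileext = ".xlsx")

if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Write the table with a header row, then read it back.
  openxlsx::write.xlsx(result, out_file, colNames = TRUE, overwrite = TRUE)
  read_back <- openxlsx::read.xlsx(out_file)
  # The round trip should return the same IDs in the same order.
  stopifnot(identical(read_back$ENSEMBL_IDs, result$ENSEMBL_IDs))
} else {
  message("openxlsx is not installed; skipping the Excel round trip.")
}
```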
This method uses the biomaRt package for annotation; it lets us annotate thousands of genes at once. Let's load the necessary libraries first.
Tip: The user should run this code chunk line by line, to see what happens at each step and to inspect the variables in the Global Environment.
#Uncomment these lines if the BiocManager & org.Hs.eg.db packages are not installed.
#if (!require("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")
#Or the user can install a package like this if it is not found through BiocManager. Just uncomment the line below.
#install.packages("Package Name")
library(org.Hs.eg.db) #Annotation package that contains the database itself
library(biomaRt)
#Read the dataset from a directory
gene_IDs <- read.csv("../Documents/Gene IDs.csv", header = TRUE)
#Start from this line if the data is already loaded in R
gene_IDs2 <- gene_IDs$Gene.Ids #variable_name$column_name
#Listing available database in biomaRt
avail_dataset <- listEnsembl()
#Selecting the available database to use in biomaRt
ensembl <- useEnsembl(biomart = "genes")
## Ensembl site unresponsive, trying asia mirror
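The message above shows that the main Ensembl site can be unresponsive and that a mirror is then tried. As a rough sketch of the same idea under the user's control, the helper below tries a list of mirrors in order and returns the first connection that succeeds. The helper connect_with_fallback and the stand-in connector fake_connect are hypothetical; the commented lines at the end show how a real connector might be built, assuming that the mirror argument of useEnsembl accepts these mirror names in the installed biomaRt version.

```r
# Hypothetical helper: call connect(mirror) for each mirror until one works.
connect_with_fallback <- function(connect, mirrors) {
  for (m in mirrors) {
    conn <- tryCatch(connect(m), error = function(e) NULL)
    if (!is.null(conn)) {
      return(list(mirror = m, connection = conn))
    }
    message("Mirror '", m, "' failed, trying the next one")
  }
  stop("No mirror could be reached")
}

# Stand-in connector for this demo; it fails for "www" to mimic an outage.
fake_connect <- function(mirror) {
  if (mirror == "www") stop("site unresponsive")
  paste("connected to", mirror)
}

res <- connect_with_fallback(fake_connect, c("www", "useast", "asia"))
print(res$mirror)

# With biomaRt installed, a real connector might look like this (not run here):
# real_connect <- function(m) {
#   biomaRt::useEnsembl(biomart = "genes",
#                       dataset = "hsapiens_gene_ensembl",
#                       mirror = m)
# }
```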
#Listing the datasets available in the selected biomaRt database
datasets <- listDatasets(ensembl)
#Connecting the dataset by using useMart function
ensembl_conn <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
#To see attributes and filters in biomaRt database
attr <- listAttributes(ensembl_conn)
filters <- listFilters(ensembl_conn)
#Building the query
annotated_ids2 <- getBM(attributes = c("uniprot_gn_symbol","ensembl_gene_id"), #Attributes (columns) to retrieve
                        filters = "uniprot_gn_symbol", #Filter matching the ID type of your own data
                        values = gene_IDs2, #Vector of your own gene IDs (an object name, so no double quotes)
                        mart = ensembl_conn)
#Use this line only if the user wants to drop the original gene column & keep just the annotated one
annotated_ids2 <- as.data.frame(annotated_ids2[,2]) #This changes the column name (header), so it is renamed next
colnames(annotated_ids2) <- "ENSEMBL_IDs" #Renaming it
View(annotated_ids2) #Now View the results
#Storing the annotated IDs in an (.xlsx) file & saving it to a directory
openxlsx::write.xlsx(annotated_ids2, "../Documents/annotated_ids2.xlsx",
                     colNames = TRUE)
View(annotated_ids2)
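A getBM result contains only the IDs that matched, and its rows may not follow the order of the input vector. If the saved file has to line up row by row with the original gene list, the alignment must be done while both columns are still present, that is, before the symbol column is dropped. The sketch below shows one way to do this with match() on a toy table shaped like the getBM output; all identifiers are placeholders.

```r
# Toy result shaped like the getBM() output; IDs are placeholders.
bm_result <- data.frame(
  uniprot_gn_symbol = c("GENE_C", "GENE_A"),
  ensembl_gene_id   = c("ENS_C1", "ENS_A1"),
  stringsAsFactors = FALSE
)
input_ids <- c("GENE_A", "GENE_B", "GENE_C")

# For each input ID, match() gives the row of its first hit (NA if none).
idx <- match(input_ids, bm_result$uniprot_gn_symbol)

# One row per input ID, in input order; unmatched IDs get NA.
aligned <- data.frame(
  Gene.Ids    = input_ids,
  ENSEMBL_IDs = bm_result$ensembl_gene_id[idx],
  stringsAsFactors = FALSE
)
print(aligned)
```

Note that match() keeps only the first hit per symbol, so symbols with several ENSEMBL IDs would need separate handling.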
This method uses only two libraries. The EnsDb.Hsapiens.v86 package is the database that provides the data, and the AnnotationDbi package provides the function used to build the annotation query and map the IDs.
#Let's install / load the libraries
#BiocManager::install("EnsDb.Hsapiens.v86")
library(EnsDb.Hsapiens.v86)
## Loading required package: ensembldb
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicFeatures
## Loading required package: AnnotationFilter
##
## Attaching package: 'ensembldb'
## The following object is masked from 'package:openxlsx':
##
## addFilter
## The following object is masked from 'package:clusterProfiler':
##
## filter
## The following object is masked from 'package:stats':
##
## filter
#BiocManager::install("AnnotationDbi")
library(AnnotationDbi)
#Available information in EnsDb.Hsapiens.v86 Package about annotations
keytypes(EnsDb.Hsapiens.v86)
## [1] "ENTREZID" "EXONID" "GENEBIOTYPE"
## [4] "GENEID" "GENENAME" "PROTDOMID"
## [7] "PROTEINDOMAINID" "PROTEINDOMAINSOURCE" "PROTEINID"
## [10] "SEQNAME" "SEQSTRAND" "SYMBOL"
## [13] "TXBIOTYPE" "TXID" "TXNAME"
## [16] "UNIPROTID"
columns(EnsDb.Hsapiens.v86)
## [1] "ENTREZID" "EXONID" "EXONIDX"
## [4] "EXONSEQEND" "EXONSEQSTART" "GENEBIOTYPE"
## [7] "GENEID" "GENENAME" "GENESEQEND"
## [10] "GENESEQSTART" "INTERPROACCESSION" "ISCIRCULAR"
## [13] "PROTDOMEND" "PROTDOMSTART" "PROTEINDOMAINID"
## [16] "PROTEINDOMAINSOURCE" "PROTEINID" "PROTEINSEQUENCE"
## [19] "SEQCOORDSYSTEM" "SEQLENGTH" "SEQNAME"
## [22] "SEQSTRAND" "SYMBOL" "TXBIOTYPE"
## [25] "TXCDSSEQEND" "TXCDSSEQSTART" "TXID"
## [28] "TXNAME" "TXSEQEND" "TXSEQSTART"
## [31] "UNIPROTDB" "UNIPROTID" "UNIPROTMAPPINGTYPE"
#We'll be using the mapIds function from the AnnotationDbi package
annotated_ids3 <- mapIds(EnsDb.Hsapiens.v86, #The database to use is specified here
                         keys = gene_IDs$Gene.Ids, #Genes to annotate are given here
                         keytype = "SYMBOL", #The ID format we have is given here
                         column = "GENEID" #The ID format we want to retrieve
                         )
#View the type aka class of our annotated_ids3 variable here
class(annotated_ids3)
## [1] "character"
#Converting the character vector into a data frame to store it in Excel and then save it to a directory
annotated_ids3 <- data.frame(annotated_ids3)
colnames(annotated_ids3) <- "ENSEMBL_IDs" #Renaming it
#Storing the annotated IDs in an (.xlsx) file & saving it to a directory
openxlsx::write.xlsx(annotated_ids3, "../Documents/annotated_ids3.xlsx",
                     colNames = TRUE)
View(annotated_ids3)
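The character vector returned by mapIds is typically named by the input keys, so after data.frame() the gene symbols survive only as row names, and row names are not written to the Excel file unless that is requested. A minimal sketch of an alternative is given below: it turns a named vector into a two-column data frame so that symbol and ENSEMBL ID are saved together. The toy vector stands in for the mapIds result, and its values are placeholders.

```r
# Toy named vector shaped like a mapIds() result; values are placeholders.
ids <- c(GENE_A = "ENS_A1", GENE_B = NA, GENE_C = "ENS_C1")

# Keep the names as a proper column instead of as row names.
annotated <- data.frame(
  SYMBOL      = names(ids),
  ENSEMBL_IDs = unname(ids),
  row.names   = NULL,
  stringsAsFactors = FALSE
)

# Count the symbols that did not map to any ID.
n_unmapped <- sum(is.na(annotated$ENSEMBL_IDs))
print(annotated)
print(n_unmapped)
```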
sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.0
## [3] AnnotationFilter_1.28.0 GenomicFeatures_1.56.0
## [5] GenomicRanges_1.56.0 GenomeInfoDb_1.40.1
## [7] biomaRt_2.60.0 openxlsx_4.2.5.2
## [9] clusterProfiler_4.12.0 readxl_1.4.3
## [11] org.Hs.eg.db_3.19.1 AnnotationDbi_1.66.0
## [13] IRanges_2.38.0 S4Vectors_0.42.0
## [15] Biobase_2.64.0 BiocGenerics_0.50.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.16.0
## [3] jsonlite_1.8.8 magrittr_2.0.3
## [5] farver_2.1.2 rmarkdown_2.27
## [7] BiocIO_1.14.0 fs_1.6.4
## [9] zlibbioc_1.50.0 vctrs_0.6.5
## [11] Rsamtools_2.20.0 memoise_2.0.1
## [13] RCurl_1.98-1.14 ggtree_3.12.0
## [15] S4Arrays_1.4.1 htmltools_0.5.8.1
## [17] progress_1.2.3 curl_5.2.1
## [19] cellranger_1.1.0 SparseArray_1.4.8
## [21] gridGraphics_0.5-1 sass_0.4.9
## [23] bslib_0.7.0 plyr_1.8.9
## [25] httr2_1.0.1 cachem_1.1.0
## [27] GenomicAlignments_1.40.0 igraph_2.0.3
## [29] lifecycle_1.0.4 pkgconfig_2.0.3
## [31] Matrix_1.7-0 R6_2.5.1
## [33] fastmap_1.2.0 gson_0.1.0
## [35] MatrixGenerics_1.16.0 GenomeInfoDbData_1.2.12
## [37] digest_0.6.35 aplot_0.2.2
## [39] enrichplot_1.24.0 colorspace_2.1-0
## [41] patchwork_1.2.0 RSQLite_2.3.7
## [43] filelock_1.0.3 fansi_1.0.6
## [45] abind_1.4-5 httr_1.4.7
## [47] polyclip_1.10-6 compiler_4.4.0
## [49] bit64_4.0.5 withr_3.0.0
## [51] BiocParallel_1.38.0 viridis_0.6.5
## [53] DBI_1.2.3 ggforce_0.4.2
## [55] MASS_7.3-60.2 DelayedArray_0.30.1
## [57] rappdirs_0.3.3 rjson_0.2.21
## [59] HDO.db_0.99.1 tools_4.4.0
## [61] ape_5.8 scatterpie_0.2.3
## [63] zip_2.3.1 glue_1.7.0
## [65] restfulr_0.0.15 nlme_3.1-164
## [67] GOSemSim_2.30.0 grid_4.4.0
## [69] shadowtext_0.1.3 reshape2_1.4.4
## [71] fgsea_1.30.0 generics_0.1.3
## [73] gtable_0.3.5 tidyr_1.3.1
## [75] data.table_1.15.4 hms_1.1.3
## [77] xml2_1.3.6 tidygraph_1.3.1
## [79] utf8_1.2.4 XVector_0.44.0
## [81] ggrepel_0.9.5 pillar_1.9.0
## [83] stringr_1.5.1 yulab.utils_0.1.4
## [85] splines_4.4.0 dplyr_1.1.4
## [87] tweenr_2.0.3 BiocFileCache_2.12.0
## [89] treeio_1.28.0 lattice_0.22-6
## [91] rtracklayer_1.64.0 bit_4.0.5
## [93] tidyselect_1.2.1 GO.db_3.19.1
## [95] Biostrings_2.72.1 knitr_1.47
## [97] gridExtra_2.3 ProtGenerics_1.36.0
## [99] SummarizedExperiment_1.34.0 xfun_0.44
## [101] graphlayouts_1.1.1 matrixStats_1.3.0
## [103] stringi_1.8.4 UCSC.utils_1.0.0
## [105] lazyeval_0.2.2 ggfun_0.1.5
## [107] yaml_2.3.8 evaluate_0.23
## [109] codetools_0.2-20 ggraph_2.2.1
## [111] tibble_3.2.1 qvalue_2.36.0
## [113] ggplotify_0.1.2 cli_3.6.2
## [115] munsell_0.5.1 jquerylib_0.1.4
## [117] Rcpp_1.0.12 dbplyr_2.5.0
## [119] png_0.1-8 XML_3.99-0.16.1
## [121] parallel_4.4.0 ggplot2_3.5.1
## [123] blob_1.2.4 prettyunits_1.2.0
## [125] DOSE_3.30.1 bitops_1.0-7
## [127] viridisLite_0.4.2 tidytree_0.4.6
## [129] scales_1.3.0 purrr_1.0.2
## [131] crayon_1.5.2 rlang_1.1.4
## [133] cowplot_1.1.3 fastmatch_1.1-4
## [135] KEGGREST_1.44.0