METHOD NO 1.

Let's load the necessary libraries first.

#Uncomment these lines if you don't have the BiocManager and org.Hs.eg.db packages installed.

#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")

library(org.Hs.eg.db) #Package containing the annotation database itself

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

##

library(readxl) #To read and support excel documents
library(clusterProfiler) #To access bitr function from this package

##

## clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

## 
## Attaching package: 'clusterProfiler'

## The following object is masked from 'package:AnnotationDbi':
## 
##     select

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

Data preparations

The code below reads the gene IDs from an Excel sheet in a given directory.

If your data is already loaded in the R environment and has a column of gene IDs, you can skip “data <- read_excel("../Documents/Gene IDs.xlsx")”, access the column directly, and store the IDs in a variable (as done in the genes assignment below). The rest is the same.

#Load the data from the Excel file in the directory
data <- read_excel("../Documents/Gene IDs.xlsx")

#Access the column holding the gene IDs in the Excel file and store it in the variable "genes"
genes <- data$'Gene Ids'
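
If the gene IDs already sit in a data frame in the R session, the read_excel() step can be skipped as described above. The following is only a minimal sketch: the data frame `my_data` and the symbols in it are hypothetical and made up for illustration. It also shows an optional clean-up (dropping missing and duplicated symbols) that is not part of the original workflow.

```r
# Hypothetical data frame standing in for data that is already loaded in R
my_data <- data.frame(
  "Gene Ids" = c("TP53", "BRCA1", "EGFR", "TP53", NA),
  check.names = FALSE,          # keep the space in the column name
  stringsAsFactors = FALSE
)

# Backticks let us access a column whose name contains a space
genes <- my_data$`Gene Ids`

# Optional clean-up: drop missing values and duplicated symbols
genes <- unique(genes[!is.na(genes)])

print(genes)
```

The resulting `genes` vector can then be passed on to the annotation step in the same way as the one read from Excel.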

After we have the data prepared, we’re good to go for annotation.

#List the available key types (supported identifiers) in the org.Hs.eg.db package
available_keytypes <- keytypes(org.Hs.eg.db)
available_keytypes

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

#Map the gene IDs from gene symbols to ENSEMBL IDs using the bitr function from clusterProfiler
annotated_ids1 <- bitr(genes, fromType = "SYMBOL", toType = "ENSEMBL", OrgDb = org.Hs.eg.db)

## 'select()' returned 1:1 mapping between keys and columns

View(annotated_ids1) 
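
Before dropping the symbol column, it can be worth checking whether every input symbol was actually mapped, since bitr() by default drops symbols it cannot map and may return several rows for one symbol. The sketch below does not call bitr(); it uses a small hand-written data frame (`mapped`, a hypothetical stand-in with the SYMBOL and ENSEMBL columns that bitr() would return for this conversion) so that it runs without any Bioconductor package. The symbol "NOTAGENE" is deliberately invalid.

```r
# Hypothetical input symbols; "NOTAGENE" is a deliberately invalid symbol
genes <- c("TP53", "BRCA1", "NOTAGENE")

# Hand-written stand-in for the data frame returned by bitr()
# (columns named after fromType and toType)
mapped <- data.frame(
  SYMBOL  = c("TP53", "BRCA1"),
  ENSEMBL = c("ENSG00000141510", "ENSG00000012048"),
  stringsAsFactors = FALSE
)

# Input symbols that have no match in the result
unmapped <- setdiff(genes, mapped$SYMBOL)

# Symbols that map to more than one ENSEMBL ID (none in this toy example)
multi_mapped <- unique(mapped$SYMBOL[duplicated(mapped$SYMBOL)])

print(unmapped)
print(multi_mapped)
```

With real data, replace `mapped` with the data frame returned by bitr() and `genes` with your own vector of symbols.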

#Use this line only if you want to drop the original gene column and keep only the annotated one
annotated_ids1 <- annotated_ids1[, 2]
View(annotated_ids1)

#Convert the character vector of gene IDs into a data frame, to save it to Excel later
annotated_ids1 <- data.frame(annotated_ids1)
View(annotated_ids1)

colnames(annotated_ids1) <- "ENSEMBL_IDs" #Renaming it

Now that the annotations are done and stored in a data.frame, we can write an .xlsx file and save it to a directory. We call the “write.xlsx” function directly from the “openxlsx” package.

#install.packages("openxlsx")
library(openxlsx) #To write Excel (.xlsx) sheets directly from the R environment

#Store the annotated IDs in an .xlsx file and save it to a directory
openxlsx::write.xlsx(annotated_ids1, "../Documents/annotated_ids1.xlsx",
                     colNames = TRUE)

METHOD NO 2

This method uses the biomaRt package, which lets us annotate thousands of genes at once. Let's load the necessary libraries first.

Tip: Run this code chunk line by line to see what each step does, and inspect the variables as they appear in the Global Environment.

#Uncomment these lines if you don't have the BiocManager, org.Hs.eg.db and biomaRt packages installed.

#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")
#BiocManager::install("biomaRt")

#Packages that cannot be found through BiocManager can be installed like this. Just uncomment the line below.
#install.packages("Package Name")

library(org.Hs.eg.db) #Package containing the annotation database itself
library(biomaRt)

#Read the dataset from a directory
gene_IDs <- read.csv("../Documents/Gene IDs.csv", header = TRUE)

#Start from this line if the data is already loaded in R
gene_IDs2 <- gene_IDs$Gene.Ids #Variable_name$Column_name

#List the available databases (marts) in biomaRt
avail_dataset <- listEnsembl()

#Selecting the available database to use in biomaRt
ensembl <- useEnsembl(biomart = "genes")

## Ensembl site unresponsive, trying asia mirror

#List the datasets available in the selected database
datasets <- listDatasets(ensembl)

#Connect to the human dataset using the useMart function
ensembl_conn <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

#To see attributes and filters in biomaRt database
attr <- listAttributes(ensembl_conn)
filters <- listFilters(ensembl_conn)

#Build the query
annotated_ids2 <- getBM(attributes = c("uniprot_gn_symbol","ensembl_gene_id"), #Attributes (columns) to retrieve
      filters = "uniprot_gn_symbol", #ID type of your input values, chosen from the biomaRt filters
      values = gene_IDs2, #Vector of your own gene IDs (the object name, without double quotes)
      mart = ensembl_conn)
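
The data frame returned by getBM() contains only the symbols that matched, and its rows are not guaranteed to follow the order of the input. If the original order matters, the result can be aligned back to the input. The sketch below is a minimal illustration under assumptions: `input_symbols` and `bm_result` are hypothetical, hand-written stand-ins (no query is sent to Ensembl), and `match()` keeps only the first hit when a symbol maps to several IDs.

```r
# Hypothetical input symbols, in the order the user supplied them
input_symbols <- c("EGFR", "TP53", "NOTAGENE")

# Hand-written stand-in for a getBM() result: different order, unmatched symbol missing
bm_result <- data.frame(
  uniprot_gn_symbol = c("TP53", "EGFR"),
  ensembl_gene_id   = c("ENSG00000141510", "ENSG00000146648"),
  stringsAsFactors  = FALSE
)

# Look up each input symbol in the result; unmatched symbols get NA
# Note: match() returns only the first match for symbols with several IDs
idx <- match(input_symbols, bm_result$uniprot_gn_symbol)

aligned <- data.frame(
  uniprot_gn_symbol = input_symbols,
  ensembl_gene_id   = bm_result$ensembl_gene_id[idx],
  stringsAsFactors  = FALSE
)

print(aligned)
```

Rows with NA in `ensembl_gene_id` point to input symbols that were not found and may need a different filter or a corrected symbol.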

#Use this line only if you want to drop the original gene column and keep only the annotated one
annotated_ids2 <- as.data.frame(annotated_ids2[,2]) #This changes the column name/header
colnames(annotated_ids2) <- "ENSEMBL_IDs" #Renaming it
View(annotated_ids2) #Now view the results

#Store the annotated IDs in an .xlsx file and save it to a directory
openxlsx::write.xlsx(annotated_ids2, "../Documents/annotated_ids2.xlsx",
                     colNames = TRUE)
View(annotated_ids2)

METHOD NO 3

This method uses only two libraries. The EnsDb.Hsapiens.v86 package is the database that provides the data, and the AnnotationDbi package provides the function used to build the annotation query and map the IDs.

#Let's install / load the libraries
#BiocManager::install("EnsDb.Hsapiens.v86")
library(EnsDb.Hsapiens.v86)

## Loading required package: ensembldb

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: GenomicFeatures

## Loading required package: AnnotationFilter

## 
## Attaching package: 'ensembldb'

## The following object is masked from 'package:openxlsx':
## 
##     addFilter

## The following object is masked from 'package:clusterProfiler':
## 
##     filter

## The following object is masked from 'package:stats':
## 
##     filter

#BiocManager::install("AnnotationDbi")
library(AnnotationDbi)

#Key types and columns available for annotation in the EnsDb.Hsapiens.v86 package
keytypes(EnsDb.Hsapiens.v86)

##  [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
##  [4] "GENEID"              "GENENAME"            "PROTDOMID"          
##  [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
## [10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
## [13] "TXBIOTYPE"           "TXID"                "TXNAME"             
## [16] "UNIPROTID"

columns(EnsDb.Hsapiens.v86)

##  [1] "ENTREZID"            "EXONID"              "EXONIDX"            
##  [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
##  [7] "GENEID"              "GENENAME"            "GENESEQEND"         
## [10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
## [13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
## [16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
## [19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
## [22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
## [25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
## [28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
## [31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"

#We'll use the mapIds function from the AnnotationDbi package
annotated_ids3 <- mapIds(EnsDb.Hsapiens.v86, #The database to use
       keys = gene_IDs$Gene.Ids, #The genes to annotate
       keytype = "SYMBOL", #The ID format we currently have
       column = "GENEID" #The ID format we want to retrieve
       )
#View the type (class) of the annotated_ids3 variable
class(annotated_ids3)

## [1] "character"

#Convert the character vector into a data frame so it can be saved as an Excel file in a directory
annotated_ids3 <- data.frame(annotated_ids3)

colnames(annotated_ids3) <- "ENSEMBL_IDs" #Renaming it
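
The character vector returned by mapIds() carries the input symbols as its names, so the symbol-to-ID link can be kept as an explicit column instead of relying on row names. A minimal sketch, assuming a hand-written named vector (`annotated`, a hypothetical stand-in for the mapIds() output, with NA marking a symbol that could not be mapped):

```r
# Hand-written stand-in for the named character vector returned by mapIds():
# names are the input symbols, values are the mapped IDs (NA when unmapped)
annotated <- c(TP53 = "ENSG00000141510",
               EGFR = "ENSG00000146648",
               NOTAGENE = NA)

# Build a two-column data frame that keeps the symbol next to its ID
annotated_df <- data.frame(
  SYMBOL      = names(annotated),
  ENSEMBL_IDs = unname(annotated),
  stringsAsFactors = FALSE
)

# Count the symbols that could not be mapped
n_unmapped <- sum(is.na(annotated_df$ENSEMBL_IDs))

print(annotated_df)
print(n_unmapped)
```

A data frame built this way can be written with write.xlsx in the same manner as the single-column version, with the symbols preserved as a regular column.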

#Store the annotated IDs in an .xlsx file and save it to a directory
openxlsx::write.xlsx(annotated_ids3, "../Documents/annotated_ids3.xlsx",
                     colNames = TRUE)
View(annotated_ids3)

sessionInfo()

## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.0         
##  [3] AnnotationFilter_1.28.0   GenomicFeatures_1.56.0   
##  [5] GenomicRanges_1.56.0      GenomeInfoDb_1.40.1      
##  [7] biomaRt_2.60.0            openxlsx_4.2.5.2         
##  [9] clusterProfiler_4.12.0    readxl_1.4.3             
## [11] org.Hs.eg.db_3.19.1       AnnotationDbi_1.66.0     
## [13] IRanges_2.38.0            S4Vectors_0.42.0         
## [15] Biobase_2.64.0            BiocGenerics_0.50.0      
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.16.0          
##   [3] jsonlite_1.8.8              magrittr_2.0.3             
##   [5] farver_2.1.2                rmarkdown_2.27             
##   [7] BiocIO_1.14.0               fs_1.6.4                   
##   [9] zlibbioc_1.50.0             vctrs_0.6.5                
##  [11] Rsamtools_2.20.0            memoise_2.0.1              
##  [13] RCurl_1.98-1.14             ggtree_3.12.0              
##  [15] S4Arrays_1.4.1              htmltools_0.5.8.1          
##  [17] progress_1.2.3              curl_5.2.1                 
##  [19] cellranger_1.1.0            SparseArray_1.4.8          
##  [21] gridGraphics_0.5-1          sass_0.4.9                 
##  [23] bslib_0.7.0                 plyr_1.8.9                 
##  [25] httr2_1.0.1                 cachem_1.1.0               
##  [27] GenomicAlignments_1.40.0    igraph_2.0.3               
##  [29] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [31] Matrix_1.7-0                R6_2.5.1                   
##  [33] fastmap_1.2.0               gson_0.1.0                 
##  [35] MatrixGenerics_1.16.0       GenomeInfoDbData_1.2.12    
##  [37] digest_0.6.35               aplot_0.2.2                
##  [39] enrichplot_1.24.0           colorspace_2.1-0           
##  [41] patchwork_1.2.0             RSQLite_2.3.7              
##  [43] filelock_1.0.3              fansi_1.0.6                
##  [45] abind_1.4-5                 httr_1.4.7                 
##  [47] polyclip_1.10-6             compiler_4.4.0             
##  [49] bit64_4.0.5                 withr_3.0.0                
##  [51] BiocParallel_1.38.0         viridis_0.6.5              
##  [53] DBI_1.2.3                   ggforce_0.4.2              
##  [55] MASS_7.3-60.2               DelayedArray_0.30.1        
##  [57] rappdirs_0.3.3              rjson_0.2.21               
##  [59] HDO.db_0.99.1               tools_4.4.0                
##  [61] ape_5.8                     scatterpie_0.2.3           
##  [63] zip_2.3.1                   glue_1.7.0                 
##  [65] restfulr_0.0.15             nlme_3.1-164               
##  [67] GOSemSim_2.30.0             grid_4.4.0                 
##  [69] shadowtext_0.1.3            reshape2_1.4.4             
##  [71] fgsea_1.30.0                generics_0.1.3             
##  [73] gtable_0.3.5                tidyr_1.3.1                
##  [75] data.table_1.15.4           hms_1.1.3                  
##  [77] xml2_1.3.6                  tidygraph_1.3.1            
##  [79] utf8_1.2.4                  XVector_0.44.0             
##  [81] ggrepel_0.9.5               pillar_1.9.0               
##  [83] stringr_1.5.1               yulab.utils_0.1.4          
##  [85] splines_4.4.0               dplyr_1.1.4                
##  [87] tweenr_2.0.3                BiocFileCache_2.12.0       
##  [89] treeio_1.28.0               lattice_0.22-6             
##  [91] rtracklayer_1.64.0          bit_4.0.5                  
##  [93] tidyselect_1.2.1            GO.db_3.19.1               
##  [95] Biostrings_2.72.1           knitr_1.47                 
##  [97] gridExtra_2.3               ProtGenerics_1.36.0        
##  [99] SummarizedExperiment_1.34.0 xfun_0.44                  
## [101] graphlayouts_1.1.1          matrixStats_1.3.0          
## [103] stringi_1.8.4               UCSC.utils_1.0.0           
## [105] lazyeval_0.2.2              ggfun_0.1.5                
## [107] yaml_2.3.8                  evaluate_0.23              
## [109] codetools_0.2-20            ggraph_2.2.1               
## [111] tibble_3.2.1                qvalue_2.36.0              
## [113] ggplotify_0.1.2             cli_3.6.2                  
## [115] munsell_0.5.1               jquerylib_0.1.4            
## [117] Rcpp_1.0.12                 dbplyr_2.5.0               
## [119] png_0.1-8                   XML_3.99-0.16.1            
## [121] parallel_4.4.0              ggplot2_3.5.1              
## [123] blob_1.2.4                  prettyunits_1.2.0          
## [125] DOSE_3.30.1                 bitops_1.0-7               
## [127] viridisLite_0.4.2           tidytree_0.4.6             
## [129] scales_1.3.0                purrr_1.0.2                
## [131] crayon_1.5.2                rlang_1.1.4                
## [133] cowplot_1.1.3               fastmatch_1.1-4            
## [135] KEGGREST_1.44.0

Gene and Protein IDs Annotation Methods in R

2024-06-13
