We used an internet search to find a way to map our GenBank accession IDs to gene symbols, so that we can add this database of genes to our Tableau collection of EBV-associated and non-associated pathologies for comparison.
library(rmarkdown)
Data <- read.csv("autismDB_41kgenes_notSamples_needs_symbolID.csv", header = TRUE, na.strings = c("", "NA", " ")) # 41472 x 8
paged_table(Data)
data <- Data[!duplicated(Data$GB_ACC),] #39930X8
#remove that 1 NA as well
data2 <- data[!is.na(data$GB_ACC),] #39929X8
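The two filtering steps above can be sketched on a toy table. This is a minimal illustration, assuming a made-up data frame `toy` with a `GB_ACC` column like the real one; the values are placeholders, not rows from our file.

```r
# Toy stand-in for the real table (hypothetical values).
toy <- data.frame(
  GB_ACC = c("M86720", "AA486138", "M86720", NA, "N51018"),
  value  = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each accession; a missing accession
# is treated as one more distinct value at this stage.
toy_unique <- toy[!duplicated(toy$GB_ACC), ]

# Drop the remaining row whose accession is missing.
toy_clean <- toy_unique[!is.na(toy_unique$GB_ACC), ]

dim(toy_unique)
dim(toy_clean)
```

Doing the de-duplication first leaves at most one NA row, which is why a single NA is removed in the second step.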
Here are the instructions the search returned:
AI Overview
To make a list of gene synonyms from gene accession IDs in R, use Bioconductor’s organism-specific annotation packages (e.g., org.Hs.eg.db for humans). The mapIds() function allows you to directly query NCBI/RefSeq/Ensembl accessions to return ALIAS (synonym) values.
Step-by-Step Instructions
Install and Load Packages: Ensure you have the core AnnotationDbi and your species-specific package (e.g., org.Hs.eg.db) installed via Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
##
library(AnnotationDbi)
Define Accession IDs and Keytype: Because accession numbers (e.g., NM_001369 or NP_001618) often include version numbers, it is best practice to strip them (e.g., using sub("\\..*", "", …)), as the database stores base accessions.
Map and Retrieve Synonyms: Synonyms are stored under the "ALIAS" column. Because one accession ID can have multiple synonyms, use multiVals = "list" (the argument name is case-sensitive) to return a list of synonyms for each query.
# Example from the instructions (not run):
# accession_ids <- c("NM_004448", "NM_005228")
# clean_accessions <- sub("\\..*", "", accession_ids)
accession_ids <- data2$GB_ACC
clean_accessions <- sub("\\..*", "", accession_ids)
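As a quick check of what the version-stripping pattern does, here is a small sketch on a few example accessions. The version suffixes are illustrative only.

```r
# Example accessions, some with version suffixes (illustrative values).
ids <- c("NM_004448.4", "NP_001618.2", "M86720", "AA486138.1")

# Remove the first dot and everything after it; IDs without a dot are unchanged.
clean <- sub("\\..*", "", ids)

clean
# "NM_004448" "NP_001618" "M86720" "AA486138"
```

A stricter pattern such as `"\\.[0-9]+$"` would remove only a trailing numeric version, which may be safer if an ID could contain a dot elsewhere.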
# Example call from the instructions (not run):
# gene_synonyms <- mapIds(org.Hs.eg.db,
#                         keys = clean_accessions,
#                         column = "ALIAS",
#                         keytype = "REFSEQ", # or "ACCNUM", "ENSEMBL", etc.
#                         multiVals = "list")
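gene_synonyms <- mapIds(org.Hs.eg.db,
keys = clean_accessions,
column = "ALIAS",
keytype = "ACCNUM", # GenBank accessions; alternatives include "REFSEQ", "ENSEMBL", etc.
multiVals = "first") # argument name has a capital V; "first" (the default) keeps one alias per accession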
## 'select()' returned 1:many mapping between keys and columns
print(gene_synonyms)
geneID <- unlist(gene_synonyms) #39929 length of string
str(geneID[1:50])
## Named chr [1:50] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "names")= chr [1:50] "M86720" "AA486138" "N51018" "H65481" ...
Tips for Best Results
Check Your Keytype: The most common accession types are REFSEQ (for RNA/protein accessions), ACCNUM (GenBank accessions), and ENSEMBL. Use keytypes(org.Hs.eg.db) to view all supported accession types for your organism.
No Matches? If your accessions don’t map, check that your keytype matches the exact format of your IDs, and remove any .1, .2, etc., suffixes at the end of the accession string.
For a full list of mapping keys available for your species, view the official Bioconductor org.Hs.eg.db documentation.
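One way to follow the keytype tip is to look at the shape of the IDs before choosing a keytype. The helper below, `guess_keytype()`, is hypothetical and not part of any package. Its regular expressions are rough assumptions about common RefSeq, Ensembl, and GenBank formats and will not cover every accession style.

```r
# Hypothetical helper: rough guess at the keytype from the ID format.
# The patterns are simplifying assumptions, not an official specification.
guess_keytype <- function(ids) {
  ifelse(grepl("^(NM|NR|NP|XM|XR|XP|NC|NG)_[0-9]+$", ids), "REFSEQ",
  ifelse(grepl("^ENS[GTP][0-9]{11}$", ids), "ENSEMBL",
  ifelse(grepl("^[A-Z]{1,2}[0-9]{5,8}$", ids), "ACCNUM", "unknown")))
}

example_ids <- c("NM_004448", "ENSG00000141736", "M86720", "AA486138", "foo")
guessed <- guess_keytype(example_ids)

# Tabulate the guesses to see which keytype dominates.
table(guessed)
```

If most IDs come back as one type, that is a reasonable first keytype to try with mapIds(); IDs flagged as "unknown" are worth inspecting by hand.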
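unlist() drops NULL elements, so if the data contained a null value that the na.strings parameter did not catch on import, the result would be shorter than the table. Here geneID has 39929 elements, matching the rows of data2, so it can be added as a column.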
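The length check matters most when a mapping returns a list with several aliases per accession, because unlist() then yields more elements than there are rows. A minimal sketch, assuming a made-up list `alias_list` with placeholder names and aliases, shows one way to collapse each entry to a single string so the result always has one value per accession:

```r
# Hypothetical list shaped like a one-to-many mapping result.
alias_list <- list(
  ACC1 = c("ALIAS_A", "ALIAS_B"),
  ACC2 = NA_character_,
  ACC3 = "ALIAS_C"
)

# unlist() flattens everything, so the length no longer matches the keys.
length(alias_list)          # 3 keys
length(unlist(alias_list))  # 4 values

# Collapse each entry to one string; keep NA where nothing mapped.
collapsed <- vapply(alias_list, function(x) {
  if (all(is.na(x))) NA_character_ else paste(x, collapse = ";")
}, character(1))

collapsed
```

The collapsed vector has the same length and order as the keys, so it can be assigned directly as a data frame column.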
data2$Gene_ID <- geneID
paged_table(data2[1:100,])
data3 <- data2[!is.na(data2$Gene_ID),] #only 538 genes remained
paged_table(data3)
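Since only 538 of the 39929 accessions received a value, it can help to report the mapping rate and list what failed. The sketch below uses a toy named vector `mapped` with a placeholder symbol; only the two counts in the last line come from this document.

```r
# Toy mapping result: names are accessions, values are mapped IDs or NA.
# "GENE_A" is a placeholder, not a real lookup result.
mapped <- c(M86720 = NA, AA486138 = NA, N51018 = "GENE_A", H65481 = NA)

n_mapped <- sum(!is.na(mapped))                    # accessions with a value
pct_mapped <- round(100 * n_mapped / length(mapped), 1)
unmapped <- names(mapped)[is.na(mapped)]           # accessions to investigate

pct_mapped
unmapped

# The same calculation with the counts reported above.
round(100 * 538 / 39929, 2)  # 1.35
```

A rate this low suggests checking whether a different keytype or column fits these accessions better before relying on the table.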
This is our table for Tableau. Let's write it out to CSV.
write.csv(data2, 'dataMergedSymbols_NAs.csv', row.names=F)
write.csv(data3,"dataMerged_Symbols_GBacc_538X9.csv", row.names=F)
This document of the 538 genes is available here.