Using internet search to find a way to get gene symbols to our gene bank accession IDs. So we can add this database of genes to our Tableau collection of EBV associated and non-associated pathologies for comparison.
library(rmarkdown)
Data <- read.csv("autismDB_41kgenes_notSamples_needs_symbolID.csv", header=T, na.strings=c("","NA"," ") )#41472X8
paged_table(Data)
data <- Data[!duplicated(Data$GB_ACC),] #39930X8
#remove that 1 NA as well
data2 <- data[!is.na(data$GB_ACC),] #39929X8
Here are internet instructions:
AI Overview
To make a list of gene synonyms from gene accession IDs in R, use Bioconductor’s organism-specific annotation packages (e.g., org.Hs.eg.db for humans). The mapIds() function allows you to directly query NCBI/RefSeq/Ensembl accessions to return ALIAS (synonym) values.Step-by-Step InstructionsInstall and Load Packages:Ensure you have the core AnnotationDbi and your species-specific package (e.g., org.Hs.eg.db) installed via Bioconductor.Rif
(!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
##
library(AnnotationDbi)
Define Accession IDs and Keytype:Because accession numbers (e.g., NM_001369 or NP_001618) often include version numbers, it is best practice to strip them (e.g., using sub(“\..*“,”“, …)), as the database stores base accessions.Map and Retrieve Synonyms:Synonyms are stored under the”ALIAS” keytype. Because one accession ID can have multiple synonyms, use multivals = “list” to return a list of synonyms for each query.R#
accession_ids <- c(“NM_004448”, “NM_005228”) clean_accessions <- sub(“\..*“,”“, accession_ids)
accession_ids <- data2$GB_ACC
clean_accessions <- sub("\\..*", "", accession_ids)
gene_synonyms <- mapIds(org.Hs.eg.db, keys = clean_accessions, column = “ALIAS”, keytype = “REFSEQ”, # or “ACCNUM”, “ENSEMBL”, etc. multivals = “list”)
gene_synonyms <- mapIds(org.Hs.eg.db,
keys = clean_accessions,
column = "ALIAS",
keytype = "ACCNUM", # or "ACCNUM", "ENSEMBL", etc.
multivals = "list")
## 'select()' returned 1:many mapping between keys and columns
print(gene_synonyms)
geneID <- unlist(gene_synonyms) #39929 length of string
str(geneID[1:50])
## Named chr [1:50] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "names")= chr [1:50] "M86720" "AA486138" "N51018" "H65481" ...
Tips for Best ResultsCheck Your Keytype: The most common accession types are REFSEQ (for RNA/protein accessions), ACCNUM (GenBank accessions), and ENSEMBL. Use keytypes(org.Hs.eg.db) to view all supported accession types for your organism.
No Matches? If your accessions don’t map, check that your keytype matches the exact format of your IDs, and remove any .1, .2, etc., suffixes at the end of the accession string.For a full list of mapping keys available for your species, view the official Bioconductor org.Hs.eg.db documentation.
The unlist() removes null values, we probably have a null value in the data that the na.strings parameter reading it in didn’t find.
data2$Gene_ID <- geneID
paged_table(data2[1:100,])
data3 <- data2[!is.na(data2$Gene_ID),] #only 538 genes remained
paged_table(data3)
This is our table for Tableau. Lets write it out to csv.
write.csv(data2, 'dataMergedSymbols_NAs.csv', row.names=F)
write.csv(data3,"dataMerged_Symbols_GBacc_538X9.csv", row.names=F)
This documet of the 538 genes is available here.
==============================
5-17-2026 adding these genes to our pathology database and then to Tableau for the non-associated EBV pathologies.
We are using the data3 dataset and need to import the pathologies database.
path <- "C:/...Path to current Pathologies Database/"
setwd(path)
pathologyDB <- read.csv("Pathology_database_ICC_added_5_15_2026.csv",
header=T)
colnames(pathologyDB)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
colnames(data3
)
## [1] "healthy_mean" "language_mean" "mild_mean"
## [4] "savant_mean" "FC_language_healthy" "FC_mild_healthy"
## [7] "FC_savant_healthy" "GB_ACC" "Gene_ID"
We need Ensembl IDs and we have it already here.
path2 <- "path to Ensemble IDs merger for Genecards IDs GSE271486 resource/"
setwd(path2)
ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv", header=T)
colnames(ensembl)
## [1] "gene_id" "gene_name" "description"
## [4] "locus" "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
## [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count" "HNE_1_WT_LMP1_2_count"
## [10] "HNE_1_WT_LMP1_3_count" "HNE_1_MUT_LMP1_1_FPKM" "HNE_1_MUT_LMP1_2_FPKM"
## [13] "HNE_1_MUT_LMP1_3_FPKM" "HNE_1_WT_LMP1_1_FPKM" "HNE_1_WT_LMP1_2_FPKM"
## [16] "HNE_1_WT_LMP1_3_FPKM"
ensembl2 <- ensembl[,c(1,2)]
head(ensembl2)
## gene_id gene_name
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000005 TNMD
## 3 ENSG00000000419 DPM1
## 4 ENSG00000000457 SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938 FGR
data4 <- merge(ensembl2, data3, by.x="gene_name", by.y="Gene_ID")
paged_table(data4)
write.csv(data4, "autism_geneIDs538X10.csv", row.names=F)
We want to only add the top genes to the pathology database. We need to get that file from that folder.
path3 <- "C:/Users/jlcor/Desktop/r programs 2025-2026/savant autism GSE15402/"
setwd(path3)
savant20 <- read.csv("savant_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))
mild20 <- read.csv("mild_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))
language20 <- read.csv("Language_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))
colnames(savant20)
## [1] "UID" "R" "C"
## [4] "GSM386518" "GSM386519" "GSM386520"
## [7] "GSM386521" "GSM386522" "GSM386523"
## [10] "GSM386524" "GSM386525" "GSM386526"
## [13] "GSM386527" "GSM386528" "GSM386529"
## [16] "GSM386530" "GSM386531" "GSM386532"
## [19] "GSM386533" "GSM386534" "GSM386535"
## [22] "GSM386536" "GSM386537" "GSM386538"
## [25] "GSM386539" "GSM386540" "GSM386541"
## [28] "GSM386542" "GSM386543" "GSM386544"
## [31] "GSM386545" "GSM386546" "GSM386547"
## [34] "GSM386548" "GSM386549" "GSM386550"
## [37] "GSM386551" "GSM386552" "GSM386553"
## [40] "GSM386554" "GSM386555" "GSM386556"
## [43] "GSM386557" "GSM386558" "GSM386559"
## [46] "GSM386560" "GSM386561" "GSM386562"
## [49] "GSM386563" "GSM386564" "GSM386565"
## [52] "GSM386566" "GSM386567" "GSM386568"
## [55] "GSM386569" "GSM386570" "GSM386571"
## [58] "GSM386572" "GSM386573" "GSM386574"
## [61] "GSM386575" "GSM386576" "GSM386577"
## [64] "GSM386578" "GSM386579" "GSM386580"
## [67] "GSM386581" "GSM386582" "GSM386583"
## [70] "GSM386584" "GSM386585" "GSM386586"
## [73] "GSM386587" "GSM386588" "GSM386589"
## [76] "GSM386590" "GSM386591" "GSM386592"
## [79] "GSM386593" "GSM386594" "GSM386595"
## [82] "GSM386596" "GSM386597" "GSM386598"
## [85] "GSM386599" "GSM386600" "GSM386601"
## [88] "GSM386602" "GSM386603" "GSM386604"
## [91] "GSM386605" "GSM386606" "GSM386607"
## [94] "GSM386608" "GSM386609" "GSM386610"
## [97] "GSM386611" "GSM386612" "GSM386613"
## [100] "GSM386614" "GSM386615" "GSM386616"
## [103] "GSM386617" "GSM386618" "GSM386619"
## [106] "GSM386620" "GSM386621" "GSM386622"
## [109] "GSM386623" "GSM386624" "GSM386625"
## [112] "GSM386626" "GSM386627" "GSM386628"
## [115] "GSM386629" "GSM386630" "GSM386631"
## [118] "GSM386632" "GSM386633" "healthy_mean"
## [121] "language_mean" "mild_mean" "savant_mean"
## [124] "FC_language_healthy" "FC_mild_healthy" "FC_savant_healthy"
## [127] "ID" "GB_ACC" "SPOT_ID"
str(data4$GB_ACC)
## chr [1:140] "AI038092" "AI208428" "N95187" "AA939070" "AA911712" ...
str(savant20$GB_ACC)
## chr [1:20] "N63864" NA "AI222331" NA "T99336" NA "AA486089" NA "AI018277" ...
Lets just keep all the Gene IDs but do a left join in the merge function to keep all Gene IDs but add in the Ensembl IDs if they are available. We have only 140 genes of autism when merging by the ensembl ID so we want to keep as many genes as possible. It looks like when we combine the top genes with ensembl ID by GB_ACC because we didn’t merge to find the gene ID that there may not be any gene symbols for our top genes. So we will just add it to Tableau but not the top genes if there are none with gene symbols.
data5 <- merge(data3, ensembl2, by.x="Gene_ID", by.y="gene_name", all.x=T)
Lets combine top genes together but add in a feature describing where it fits in for top gene by type of autism as mild, language, or savant.
mild20$autismType <- "mild"
language20$autismType <- 'language'
savant20$autismType <- 'savant'
allTopGenes60 <- rbind(mild20, language20, savant20)
paged_table(allTopGenes60)
TopGeneList <- allTopGenes60$GB_ACC
TopGeneList2 <- TopGeneList[!is.na(TopGeneList)]
TopGeneList2
## [1] "AI346406" "R53966" "R64686" "AA778346" "R40855" "W95682"
## [7] "AI351587" "AA448277" "N56875" "AA158162" "AA001718" "N73877"
## [13] "AI348022" "T86999" "T65948" "H27752" "N62522" "R93875"
## [19] "AI016962" "W03687" "R64686" "AI351317" "R38172" "AA456715"
## [25] "N51529" "AA400272" "R76772" "AA158162" "H63994" "AI198536"
## [31] "AA908954" "AI348022" "T86983" "N63864" "AI222331" "T99336"
## [37] "AA486089" "AI018277" "AI016962" "AA427367" "W03687" "N35115"
## [43] "AA778346" "N62729" "AI348022" "AA927340" "AI074469"
data6 <- data5[which(data5$GB_ACC %in% TopGeneList2),]
paged_table(data6)
None of the Autism top genes are in the database of 540 genes, so we cannot add it to our pathology database, but we can still make the tableau dashboard and compare top gene expression for EBV related pathologies to the Autism data of provided gene symbols.
write.csv(data5, 'autism540X10_ensemble_geneSymbolsAdded.csv', row.names=F)
We will go to the Tableau dashboard for this.