We used an internet search to find a way to map our GenBank accession IDs to gene symbols, so that we can add this gene database to our Tableau collection of EBV-associated and non-associated pathologies for comparison.

library(rmarkdown)

Data <- read.csv("autismDB_41kgenes_notSamples_needs_symbolID.csv", header=T, na.strings=c("","NA"," ") )#41472X8

paged_table(Data)

data <- Data[!duplicated(Data$GB_ACC),] #39930X8

#remove that 1 NA as well
data2 <- data[!is.na(data$GB_ACC),] #39929X8
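
The two filtering steps above can be tried on a small made-up table first. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from the autism data.

```r
# Toy stand-in for Data; the values are made up for illustration only.
toy <- data.frame(
  GB_ACC = c("M86720", "AA486138", "M86720", NA, "N51018"),
  FC     = c(1.2, 0.8, 1.2, 2.0, 0.5),
  stringsAsFactors = FALSE
)

# Keep the first row for each accession, then drop the row with a missing accession.
toy_unique <- toy[!duplicated(toy$GB_ACC), ]
toy_clean  <- toy_unique[!is.na(toy_unique$GB_ACC), ]

# Row counts at each step, comparable to the dimension comments above.
step_counts <- c(raw = nrow(toy), unique = nrow(toy_unique), clean = nrow(toy_clean))
print(step_counts)
```

Printing the counts at each step makes it easier to confirm that only duplicates and the single missing accession were dropped.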

Here are the instructions from the internet search:

AI Overview

To make a list of gene synonyms from gene accession IDs in R, use Bioconductor’s organism-specific annotation packages (e.g., org.Hs.eg.db for humans). The mapIds() function allows you to directly query NCBI/RefSeq/Ensembl accessions to return ALIAS (synonym) values.

Step-by-Step Instructions

Install and Load Packages: Ensure you have the core AnnotationDbi and your species-specific package (e.g., org.Hs.eg.db) installed via Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)

library(AnnotationDbi)

Define Accession IDs and Keytype: Because accession numbers (e.g., NM_001369 or NP_001618) often include version numbers, it is best practice to strip them (e.g., using sub("\\..*", "", …)), as the database stores base accessions.

Map and Retrieve Synonyms: Synonyms are stored under the "ALIAS" keytype. Because one accession ID can have multiple synonyms, use multiVals = "list" to return a list of synonyms for each query.

Your example accession IDs (stripping version numbers if necessary)

accession_ids <- c("NM_004448", "NM_005228")
clean_accessions <- sub("\\..*", "", accession_ids)

accession_ids <- data2$GB_ACC

clean_accessions <- sub("\\..*", "", accession_ids)
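
To see what the version-stripping pattern does, here is a minimal sketch on a few hypothetical accessions (the vector `example_ids` is made up for illustration and is not part of our data):

```r
# Hypothetical accessions, some with a version suffix such as ".4".
example_ids <- c("NM_004448.4", "NM_005228", "AA486138.1")

# "\\..*" matches the first literal dot and everything after it.
example_clean <- sub("\\..*", "", example_ids)
print(example_clean)

# Count how many IDs actually carried a version suffix.
n_versioned <- sum(grepl("\\.", example_ids))
print(n_versioned)
```

IDs without a dot pass through unchanged, so it is safe to run the pattern over the whole GB_ACC column.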

2. Map accessions to ALIAS (synonyms)

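gene_synonyms <- mapIds(org.Hs.eg.db,
                        keys = clean_accessions,
                        column = "ALIAS",
                        keytype = "REFSEQ", # or "ACCNUM", "ENSEMBL", etc.
                        multiVals = "list")

# Our GB_ACC values are GenBank accessions, so the keytype here is ACCNUM.
# mapIds() spells this argument multiVals; a lowercase multivals is not matched
# and the default ("first") applies. "first" keeps one alias per accession,
# which is what the column assignment further down needs.
# column = "SYMBOL" would return the official symbol instead of an alias.
gene_synonyms <- mapIds(org.Hs.eg.db,
                        keys = clean_accessions,
                        column = "ALIAS",
                        keytype = "ACCNUM", # other options include "REFSEQ" and "ENSEMBL"
                        multiVals = "first")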

## 'select()' returned 1:many mapping between keys and columns

3. View the results

print(gene_synonyms)

geneID <- unlist(gene_synonyms) # length 39929, one entry per accession

str(geneID[1:50])

##  Named chr [1:50] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "names")= chr [1:50] "M86720" "AA486138" "N51018" "H65481" ...

Tips for Best Results

Check Your Keytype: The most common accession types are REFSEQ (for RNA/protein accessions), ACCNUM (GenBank accessions), and ENSEMBL. Use keytypes(org.Hs.eg.db) to view all supported accession types for your organism.

No Matches? If your accessions don’t map, check that your keytype matches the exact format of your IDs, and remove any .1, .2, etc., suffixes at the end of the accession string.

For a full list of mapping keys available for your species, view the official Bioconductor org.Hs.eg.db documentation.
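
One way to act on the keytype tip is to look at the format of the IDs before calling mapIds(). The sketch below uses a hypothetical helper, `guess_keytype()`, that is not part of any package. It assumes RefSeq accessions look like two capital letters plus an underscore (NM_, NP_), human Ensembl gene IDs start with ENSG, and everything else is treated as a GenBank accession.

```r
# Hypothetical helper: guess an org.Hs.eg.db keytype from the accession format.
# Anything that is not Ensembl-like or RefSeq-like falls back to "ACCNUM" (an assumption).
guess_keytype <- function(ids) {
  ifelse(grepl("^ENSG[0-9]+", ids), "ENSEMBL",
    ifelse(grepl("^[A-Z]{2}_[0-9]+", ids), "REFSEQ", "ACCNUM"))
}

# Example IDs taken from the formats mentioned in this document.
ids <- c("NM_004448", "NP_001618", "ENSG00000000003", "AA486138", "M86720")
keytype_guess <- guess_keytype(ids)

# Tabulate the guesses to see which keytype dominates the column.
print(table(keytype_guess))
```

If one keytype dominates, that is probably the one to pass to mapIds(); a mixed column may need to be split and mapped in separate calls.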

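geneID has one entry per accession (39,929), which matches the number of rows in data2, so it can be added directly as a column. Accessions with no match in org.Hs.eg.db come back as NA. Note that unlist() drops NULL entries but keeps NA values.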

data2$Gene_ID <- geneID

paged_table(data2[1:100,])

data3 <- data2[!is.na(data2$Gene_ID),] #only 538 genes remained

paged_table(data3)
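
It can help to put a number on how many accessions mapped. A minimal sketch, assuming a made-up named vector `toy_map` shaped like the mapIds() result (the symbols GENE1 and GENE2 are placeholders), followed by the same arithmetic with the counts reported above:

```r
# Toy mapIds()-style result: names are accessions, values are symbols or NA.
toy_map <- c(M86720 = NA, AA486138 = "GENE1", N51018 = NA, H65481 = "GENE2")

# Number and percentage of accessions that received a symbol.
n_mapped <- sum(!is.na(toy_map))
pct_toy  <- round(100 * mean(!is.na(toy_map)), 2)

# Same calculation with the counts reported in this document (538 of 39929).
pct_autism <- round(100 * 538 / 39929, 2)
print(c(toy = pct_toy, autism = pct_autism))
```

A low percentage like ours suggests many of these accessions are ESTs or other records that org.Hs.eg.db does not link to a gene, though we have not verified that here.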

This is our table for Tableau. Let's write it out to CSV.

write.csv(data2, 'dataMergedSymbols_NAs.csv', row.names=F)
write.csv(data3,"dataMerged_Symbols_GBacc_538X9.csv", row.names=F)

This document of the 538 genes is available here.

==============================

5-17-2026: Adding these genes to our pathology database and then to Tableau as one of the pathologies not associated with EBV.

We are using the data3 dataset and need to import the pathologies database.

path <- "C:/...Path to current Pathologies Database/"

setwd(path)

pathologyDB <- read.csv("Pathology_database_ICC_added_5_15_2026.csv",
                        header=T)

colnames(pathologyDB)

## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

colnames(data3)

## [1] "healthy_mean"        "language_mean"       "mild_mean"          
## [4] "savant_mean"         "FC_language_healthy" "FC_mild_healthy"    
## [7] "FC_savant_healthy"   "GB_ACC"              "Gene_ID"

We need Ensembl IDs, and we already have them in this file.

path2 <- "path to Ensemble IDs merger for Genecards IDs GSE271486 resource/"

setwd(path2)

ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv", header=T)

colnames(ensembl)

##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"

ensembl2 <- ensembl[,c(1,2)]

head(ensembl2)

##           gene_id gene_name
## 1 ENSG00000000003    TSPAN6
## 2 ENSG00000000005      TNMD
## 3 ENSG00000000419      DPM1
## 4 ENSG00000000457     SCYL3
## 5 ENSG00000000460  C1orf112
## 6 ENSG00000000938       FGR

data4 <- merge(ensembl2, data3, by.x="gene_name", by.y="Gene_ID")

paged_table(data4)

write.csv(data4, "autism_geneIDs538X10.csv", row.names=F)

We want to add only the top genes to the pathology database, so we need to read those files from their folder.

path3 <- "C:/Users/jlcor/Desktop/r programs 2025-2026/savant autism GSE15402/"

setwd(path3)

savant20 <- read.csv("savant_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))

mild20 <- read.csv("mild_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))

language20 <- read.csv("Language_foldchange_top20_genebankAccession.csv", header=T, na.strings=c("",NA," ", "NA"))

colnames(savant20)

##   [1] "UID"                 "R"                   "C"                  
##   [4] "GSM386518"           "GSM386519"           "GSM386520"          
##   [7] "GSM386521"           "GSM386522"           "GSM386523"          
##  [10] "GSM386524"           "GSM386525"           "GSM386526"          
##  [13] "GSM386527"           "GSM386528"           "GSM386529"          
##  [16] "GSM386530"           "GSM386531"           "GSM386532"          
##  [19] "GSM386533"           "GSM386534"           "GSM386535"          
##  [22] "GSM386536"           "GSM386537"           "GSM386538"          
##  [25] "GSM386539"           "GSM386540"           "GSM386541"          
##  [28] "GSM386542"           "GSM386543"           "GSM386544"          
##  [31] "GSM386545"           "GSM386546"           "GSM386547"          
##  [34] "GSM386548"           "GSM386549"           "GSM386550"          
##  [37] "GSM386551"           "GSM386552"           "GSM386553"          
##  [40] "GSM386554"           "GSM386555"           "GSM386556"          
##  [43] "GSM386557"           "GSM386558"           "GSM386559"          
##  [46] "GSM386560"           "GSM386561"           "GSM386562"          
##  [49] "GSM386563"           "GSM386564"           "GSM386565"          
##  [52] "GSM386566"           "GSM386567"           "GSM386568"          
##  [55] "GSM386569"           "GSM386570"           "GSM386571"          
##  [58] "GSM386572"           "GSM386573"           "GSM386574"          
##  [61] "GSM386575"           "GSM386576"           "GSM386577"          
##  [64] "GSM386578"           "GSM386579"           "GSM386580"          
##  [67] "GSM386581"           "GSM386582"           "GSM386583"          
##  [70] "GSM386584"           "GSM386585"           "GSM386586"          
##  [73] "GSM386587"           "GSM386588"           "GSM386589"          
##  [76] "GSM386590"           "GSM386591"           "GSM386592"          
##  [79] "GSM386593"           "GSM386594"           "GSM386595"          
##  [82] "GSM386596"           "GSM386597"           "GSM386598"          
##  [85] "GSM386599"           "GSM386600"           "GSM386601"          
##  [88] "GSM386602"           "GSM386603"           "GSM386604"          
##  [91] "GSM386605"           "GSM386606"           "GSM386607"          
##  [94] "GSM386608"           "GSM386609"           "GSM386610"          
##  [97] "GSM386611"           "GSM386612"           "GSM386613"          
## [100] "GSM386614"           "GSM386615"           "GSM386616"          
## [103] "GSM386617"           "GSM386618"           "GSM386619"          
## [106] "GSM386620"           "GSM386621"           "GSM386622"          
## [109] "GSM386623"           "GSM386624"           "GSM386625"          
## [112] "GSM386626"           "GSM386627"           "GSM386628"          
## [115] "GSM386629"           "GSM386630"           "GSM386631"          
## [118] "GSM386632"           "GSM386633"           "healthy_mean"       
## [121] "language_mean"       "mild_mean"           "savant_mean"        
## [124] "FC_language_healthy" "FC_mild_healthy"     "FC_savant_healthy"  
## [127] "ID"                  "GB_ACC"              "SPOT_ID"

str(data4$GB_ACC)

##  chr [1:140] "AI038092" "AI208428" "N95187" "AA939070" "AA911712" ...

str(savant20$GB_ACC)

##  chr [1:20] "N63864" NA "AI222331" NA "T99336" NA "AA486089" NA "AI018277" ...

Merging by gene symbol with the Ensembl table left only 140 autism genes, and we want to keep as many genes as possible. So instead we will do a left join in the merge function (all.x=T) to keep all the gene symbols and add the Ensembl ID where one is available. The top-gene files are keyed by GB_ACC and were never mapped to gene symbols, so there may not be any gene symbols for our top genes. If there are none, we will still add the mapped genes to Tableau but leave the top genes out.

data5 <- merge(data3, ensembl2, by.x="Gene_ID", by.y="gene_name", all.x=T)
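
The difference between the earlier inner merge and this left join can be seen on a tiny example. A minimal sketch with made-up tables (`genes` and `ens` are hypothetical, and "XYZ1" is a placeholder symbol with no Ensembl match):

```r
# Toy versions of data3 and ensembl2; values are for illustration only.
genes <- data.frame(Gene_ID = c("TSPAN6", "DPM1", "XYZ1"),
                    FC = c(2.1, 0.4, 5.0),
                    stringsAsFactors = FALSE)
ens <- data.frame(gene_id = c("ENSG00000000003", "ENSG00000000419"),
                  gene_name = c("TSPAN6", "DPM1"),
                  stringsAsFactors = FALSE)

# Inner merge: only symbols present in both tables survive.
inner <- merge(ens, genes, by.x = "gene_name", by.y = "Gene_ID")

# Left join: every row of genes is kept; unmatched rows get NA for gene_id.
left <- merge(genes, ens, by.x = "Gene_ID", by.y = "gene_name", all.x = TRUE)

print(nrow(inner))
print(left)
```

The left join keeps the unmatched symbol with an NA Ensembl ID, which is the behavior we want for data5.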

Let's combine the top genes and add a column recording which autism type (mild, language, or savant) each top-gene list came from.

mild20$autismType <- "mild"
language20$autismType <- 'language'
savant20$autismType <- 'savant'

allTopGenes60 <- rbind(mild20, language20, savant20)

paged_table(allTopGenes60)

TopGeneList <- allTopGenes60$GB_ACC

TopGeneList2 <- TopGeneList[!is.na(TopGeneList)]

TopGeneList2

##  [1] "AI346406" "R53966"   "R64686"   "AA778346" "R40855"   "W95682"  
##  [7] "AI351587" "AA448277" "N56875"   "AA158162" "AA001718" "N73877"  
## [13] "AI348022" "T86999"   "T65948"   "H27752"   "N62522"   "R93875"  
## [19] "AI016962" "W03687"   "R64686"   "AI351317" "R38172"   "AA456715"
## [25] "N51529"   "AA400272" "R76772"   "AA158162" "H63994"   "AI198536"
## [31] "AA908954" "AI348022" "T86983"   "N63864"   "AI222331" "T99336"  
## [37] "AA486089" "AI018277" "AI016962" "AA427367" "W03687"   "N35115"  
## [43] "AA778346" "N62729"   "AI348022" "AA927340" "AI074469"

data6 <- data5[which(data5$GB_ACC %in% TopGeneList2),]

paged_table(data6)
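
Before filtering with %in%, a quick overlap check shows whether any rows can match at all. A minimal sketch, assuming two short accession vectors copied from the outputs above (`db_acc` and `top_acc` are illustrative names, not objects in this analysis):

```r
# A few accessions from the symbol table and from the top-gene list shown above.
db_acc  <- c("AI038092", "AI208428", "N95187")
top_acc <- c("AI346406", "R53966", "R64686", "R64686")

# intersect() returns the accessions present in both vectors.
shared <- intersect(db_acc, top_acc)

# The top-gene list repeats accessions across autism types, so count unique ones too.
n_unique_top <- length(unique(top_acc))

print(length(shared))
print(n_unique_top)
```

An empty intersection means the %in% filter will return zero rows, which is what we found for data6.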

None of the autism top genes are in the table of 540 genes with symbols, so we cannot add them to our pathology database this way. We can still make the Tableau dashboard and compare top gene expression in EBV-related pathologies with the autism genes that have gene symbols.

write.csv(data5, 'autism540X10_ensemble_geneSymbolsAdded.csv', row.names=F)

We will go to the Tableau dashboard for this.

=======================

We do have the top genes from months ago, with gene summaries added manually. We copied that file into this folder alongside the pathology database file. Below we use pathologyDB and read in the top genes with summaries.

topGenes44 <- read.csv("Top44_chromosomeLocations_geneIDs_gbACC_geneSummariesAdded.csv", header=T)

colnames(topGenes44)

##   [1] "Gene_ID"        "GB_ACC"         "ASD_group"      "gene_id"       
##   [5] "description"    "Gene_Summaries" "locus"          "GSM386518"     
##   [9] "GSM386519"      "GSM386520"      "GSM386521"      "GSM386522"     
##  [13] "GSM386523"      "GSM386524"      "GSM386525"      "GSM386526"     
##  [17] "GSM386527"      "GSM386528"      "GSM386529"      "GSM386530"     
##  [21] "GSM386531"      "GSM386532"      "GSM386533"      "GSM386534"     
##  [25] "GSM386535"      "GSM386536"      "GSM386537"      "GSM386538"     
##  [29] "GSM386539"      "GSM386540"      "GSM386541"      "GSM386542"     
##  [33] "GSM386543"      "GSM386544"      "GSM386545"      "GSM386546"     
##  [37] "GSM386547"      "GSM386548"      "GSM386549"      "GSM386550"     
##  [41] "GSM386551"      "GSM386552"      "GSM386553"      "GSM386554"     
##  [45] "GSM386555"      "GSM386556"      "GSM386557"      "GSM386558"     
##  [49] "GSM386559"      "GSM386560"      "GSM386561"      "GSM386562"     
##  [53] "GSM386563"      "GSM386564"      "GSM386565"      "GSM386566"     
##  [57] "GSM386567"      "GSM386568"      "GSM386569"      "GSM386570"     
##  [61] "GSM386571"      "GSM386572"      "GSM386573"      "GSM386574"     
##  [65] "GSM386575"      "GSM386576"      "GSM386577"      "GSM386578"     
##  [69] "GSM386579"      "GSM386580"      "GSM386581"      "GSM386582"     
##  [73] "GSM386583"      "GSM386584"      "GSM386585"      "GSM386586"     
##  [77] "GSM386587"      "GSM386588"      "GSM386589"      "GSM386590"     
##  [81] "GSM386591"      "GSM386592"      "GSM386593"      "GSM386594"     
##  [85] "GSM386595"      "GSM386596"      "GSM386597"      "GSM386598"     
##  [89] "GSM386599"      "GSM386600"      "GSM386601"      "GSM386602"     
##  [93] "GSM386603"      "GSM386604"      "GSM386605"      "GSM386606"     
##  [97] "GSM386607"      "GSM386608"      "GSM386609"      "GSM386610"     
## [101] "GSM386611"      "GSM386612"      "GSM386613"      "GSM386614"     
## [105] "GSM386615"      "GSM386616"      "GSM386617"      "GSM386618"     
## [109] "GSM386619"      "GSM386620"      "GSM386621"      "GSM386622"     
## [113] "GSM386623"      "GSM386624"      "GSM386625"      "GSM386626"     
## [117] "GSM386627"      "GSM386628"      "GSM386629"      "GSM386630"     
## [121] "GSM386631"      "GSM386632"      "GSM386633"

We ran the script above to get the fold change values from allTopGenes60, the last dataset that has them.

data60_44 <- merge(allTopGenes60[,c(120:128,130)], topGenes44[,c(1:4,6)], by.x="GB_ACC",by.y="GB_ACC", all.x=T)

summary(data60_44)

## Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 9 is invalid
## UTF-8

##        GB_ACC    healthy_mean      language_mean        mild_mean      
##  Length   :70   Min.   :0.002817   Min.   :0.003074   Min.   :0.00000  
##  N.unique :40   1st Qu.:0.004910   1st Qu.:0.019652   1st Qu.:0.01871  
##  N.blank  : 0   Median :0.007419   Median :0.029503   Median :0.02641  
##  Min.nchar: 6   Mean   :0.024873   Mean   :0.029890   Mean   :0.02604  
##  Max.nchar: 8   3rd Qu.:0.046016   3rd Qu.:0.041119   3rd Qu.:0.03667  
##  NAs      :13   Max.   :0.087555   Max.   :0.064139   Max.   :0.05772  
##   savant_mean      FC_language_healthy FC_mild_healthy   FC_savant_healthy
##  Min.   :0.00449   Min.   : 0.05518    Min.   : 0.0000   Min.   : 0.1452  
##  1st Qu.:0.02123   1st Qu.: 0.42589    1st Qu.: 0.5315   1st Qu.: 0.6031  
##  Median :0.03734   Median : 4.29281    Median : 2.6876   Median : 5.8483  
##  Mean   :0.03530   Mean   : 4.46047    Mean   : 3.7646   Mean   : 5.1896  
##  3rd Qu.:0.04645   3rd Qu.: 8.46997    3rd Qu.: 6.7303   3rd Qu.: 8.7775  
##  Max.   :0.07048   Max.   :12.98571    Max.   :11.7146   Max.   :14.7887  
##          ID         autismType      Gene_ID       ASD_group       gene_id  
##  Length   :70   Length   :70   Length   :70   Length   :70   Length   :70  
##  N.unique :50   N.unique : 3   N.unique :38   N.unique : 3   N.unique :38  
##  N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0  
##  Min.nchar: 3   Min.nchar: 4   Min.nchar: 4   Min.nchar: 4   Min.nchar:15  
##  Max.nchar: 7   Max.nchar: 8   Max.nchar:10   Max.nchar: 8   Max.nchar:15  
##                                NAs      :20   NAs      :20   NAs      :20  
##    Gene_Summaries
##  Length   :70    
##  N.unique :37    
##  N.blank  : 0    
##  Min.nchar:NA    
##  Max.nchar:NA    
##  NAs      :20

data60_44_noNAs <- data60_44[!is.na(data60_44$Gene_ID),]

summary(data60_44_noNAs)

## Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 9 is invalid
## UTF-8

##        GB_ACC    healthy_mean      language_mean        mild_mean      
##  Length   :50   Min.   :0.003441   Min.   :0.003074   Min.   :0.00000  
##  N.unique :36   1st Qu.:0.005007   1st Qu.:0.018465   1st Qu.:0.02045  
##  N.blank  : 0   Median :0.021384   Median :0.029503   Median :0.02629  
##  Min.nchar: 6   Mean   :0.028328   Mean   :0.030798   Mean   :0.02567  
##  Max.nchar: 8   3rd Qu.:0.046904   3rd Qu.:0.045449   3rd Qu.:0.03400  
##                 Max.   :0.087555   Max.   :0.064139   Max.   :0.05772  
##   savant_mean      FC_language_healthy FC_mild_healthy  FC_savant_healthy
##  Min.   :0.00643   Min.   : 0.05518    Min.   :0.0000   Min.   : 0.1474  
##  1st Qu.:0.02123   1st Qu.: 0.36208    1st Qu.:0.5159   1st Qu.: 0.5941  
##  Median :0.03447   Median : 2.60424    Median :1.7398   Median : 2.4552  
##  Mean   :0.03483   Mean   : 3.90610    Mean   :3.0186   Mean   : 4.1897  
##  3rd Qu.:0.04645   3rd Qu.: 6.54350    3rd Qu.:5.1475   3rd Qu.: 7.6451  
##  Max.   :0.07048   Max.   :10.69432    Max.   :9.8809   Max.   :13.4551  
##          ID         autismType      Gene_ID       ASD_group       gene_id  
##  Length   :50   Length   :50   Length   :50   Length   :50   Length   :50  
##  N.unique :36   N.unique : 3   N.unique :38   N.unique : 3   N.unique :38  
##  N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0  
##  Min.nchar: 3   Min.nchar: 4   Min.nchar: 4   Min.nchar: 4   Min.nchar:15  
##  Max.nchar: 6   Max.nchar: 8   Max.nchar:10   Max.nchar: 8   Max.nchar:15  
##                                                                            
##    Gene_Summaries
##  Length   :50    
##  N.unique :37    
##  N.blank  : 0    
##  Min.nchar:NA    
##  Max.nchar:NA    
##

paged_table(data60_44_noNAs)

dataFiltered60_44 <- data60_44_noNAs[data60_44_noNAs$ASD_group == data60_44_noNAs$autismType,]

summary(dataFiltered60_44)

## Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 7 is invalid
## UTF-8

##        GB_ACC    healthy_mean      language_mean        mild_mean      
##  Length   :42   Min.   :0.003441   Min.   :0.003074   Min.   :0.00000  
##  N.unique :36   1st Qu.:0.005724   1st Qu.:0.015901   1st Qu.:0.01458  
##  N.blank  : 0   Median :0.042595   Median :0.026142   Median :0.02629  
##  Min.nchar: 6   Mean   :0.032746   Mean   :0.028569   Mean   :0.02517  
##  Max.nchar: 8   3rd Qu.:0.048933   3rd Qu.:0.037851   3rd Qu.:0.03511  
##                 Max.   :0.087555   Max.   :0.064139   Max.   :0.05772  
##   savant_mean      FC_language_healthy FC_mild_healthy  FC_savant_healthy
##  Min.   :0.00643   Min.   : 0.05518    Min.   :0.0000   Min.   : 0.1474  
##  1st Qu.:0.02123   1st Qu.: 0.31747    1st Qu.:0.4833   1st Qu.: 0.4278  
##  Median :0.03347   Median : 0.79052    Median :0.9131   Median : 0.7998  
##  Mean   :0.03313   Mean   : 3.05465    Mean   :2.4213   Mean   : 3.4079  
##  3rd Qu.:0.04254   3rd Qu.: 5.96785    3rd Qu.:4.2691   3rd Qu.: 6.2473  
##  Max.   :0.07048   Max.   :10.69432    Max.   :9.8809   Max.   :13.4551  
##          ID         autismType      Gene_ID       ASD_group       gene_id  
##  Length   :42   Length   :42   Length   :42   Length   :42   Length   :42  
##  N.unique :36   N.unique : 3   N.unique :38   N.unique : 3   N.unique :38  
##  N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0   N.blank  : 0  
##  Min.nchar: 3   Min.nchar: 4   Min.nchar: 4   Min.nchar: 4   Min.nchar:15  
##  Max.nchar: 6   Max.nchar: 8   Max.nchar:10   Max.nchar: 8   Max.nchar:15  
##                                                                            
##    Gene_Summaries
##  Length   :42    
##  N.unique :37    
##  N.blank  : 0    
##  Min.nchar:NA    
##  Max.nchar:NA    
##

paged_table(dataFiltered60_44)
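
The row counts above go from 60 stacked top genes to 70 after the merge, then down to 42 after matching ASD_group to autismType. One possible explanation (an assumption, not checked against the real files) is that merge() pairs every matching row on both sides, so an accession listed under more than one group multiplies rows. A minimal sketch with made-up tables:

```r
# Toy versions of allTopGenes60 and topGenes44; the same accession appears in two groups.
left_tbl <- data.frame(GB_ACC = c("R64686", "R64686", "N63864"),
                       autismType = c("mild", "language", "savant"),
                       stringsAsFactors = FALSE)
right_tbl <- data.frame(GB_ACC = c("R64686", "R64686"),
                        ASD_group = c("mild", "language"),
                        stringsAsFactors = FALSE)

# Left join: 2 x 2 pairings for R64686 plus one unmatched row for N63864.
merged <- merge(left_tbl, right_tbl, by = "GB_ACC", all.x = TRUE)

# Keep only pairings where the two group labels agree; which() drops NA comparisons.
matched <- merged[which(merged$ASD_group == merged$autismType), ]

print(nrow(merged))
print(matched)
```

Filtering on matching group labels removes the cross-group pairings, which is the role of the ASD_group == autismType step.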

dataKeptPathology <- dataFiltered60_44[,c(2:8,11:14)]

paged_table(dataKeptPathology)

mild1 <- dataKeptPathology[dataKeptPathology$ASD_group == "mild",]
savant1 <- dataKeptPathology[dataKeptPathology$ASD_group == "savant",]
language1 <- dataKeptPathology[dataKeptPathology$ASD_group == "language",]

mild2 <- mild1[,c(6,8:11)]
savant2 <- savant1[,c(7:11)]
language2 <- language1[,c(5,8:11)]

str(mild2)

## 'data.frame':    12 obs. of  5 variables:
##  $ FC_mild_healthy: num  0.0672 9.8809 0.0756 6.7417 0.0836 ...
##  $ Gene_ID        : chr  "BMPER" "TUT1" "FOXO1" "PRMT2" ...
##  $ ASD_group      : chr  "mild" "mild" "mild" "mild" ...
##  $ gene_id        : chr  "ENSG00000164619" "ENSG00000149016" "ENSG00000150907" "ENSG00000160310" ...
##  $ Gene_Summaries : chr  "NCBI Gene Summary for BMPER Gene \nThis gene encodes a secreted protein that interacts with, and inhibits bone "| __truncated__ "GeneCards Summary for ENSG00000308510 Gene\nENSG00000308510 (Novel Transcript, Antisense To MTA2 And TUT1) is a"| __truncated__ "Aliases for LINC00598 Gene\nGeneCards Symbol: LINC00598 2 \nLong Intergenic Non-Protein Coding RNA 598 2 3 5\nL"| __truncated__ "GeneCards Summary for MCM3AP-AS1 Gene\nMCM3AP-AS1 (MCM3AP Antisense RNA 1) is an RNA Gene, and is affiliated wi"| __truncated__ ...

colnames(mild2) <- c("FC_pathology_control","Genecards_ID","topGenePathology","Ensembl_ID","Gene_Summaries")

mild2$topGenePathology <- "mild Autism"

paged_table(mild2)

colnames(savant2) <- c("FC_pathology_control","Genecards_ID","topGenePathology","Ensembl_ID","Gene_Summaries")

savant2$topGenePathology <- "savant Autism"

paged_table(savant2)

colnames(language2) <- c("FC_pathology_control","Genecards_ID","topGenePathology","Ensembl_ID","Gene_Summaries")

language2$topGenePathology <- "language Autism"

paged_table(language2)

mild3 <- mild2[,c(4,2,1,3)]
savant3 <- savant2[,c(4,2,1,3)]
language3 <- language2[,c(4,2,1,3)]

mild3$mediaType <- "total RNA lymphoblastoid cell lines or LCLs in PBMCs"
mild3$studySummarized <- paste("Autism was studied and compared with healthy patients and 3 types of autism of mild, language, and savant. This group is mild Autism for fold change of the mean of mild over the mean of healthy.", mild2$Gene_Summaries, sep = "--- gene summaries added manually --- ")

mild3$GSE_study_ID <- "GSE15402"

savant3$mediaType <- "total RNA lymphoblastoid cell lines or LCLs in PBMCs"
savant3$studySummarized <- paste("Autism was studied and compared with healthy patients and 3 types of autism of mild, language, and savant. This group is savant Autism for fold change of the mean of savant over the mean of healthy.",savant2$Gene_Summaries, sep = "--- gene summaries added manually --- ")

savant3$GSE_study_ID <- "GSE15402"

language3$mediaType <- "total RNA lymphoblastoid cell lines or LCLs in PBMCs"
language3$studySummarized <- paste("Autism was studied and compared with healthy patients and 3 types of autism of mild, language, and savant. This group is language Autism for fold change of the mean of language over the mean of healthy.", language2$Gene_Summaries, sep = "--- gene summaries added manually --- ")

language3$GSE_study_ID <- "GSE15402"


combinedGenesTop <- rbind(mild3,language3,savant3)

paged_table(combinedGenesTop)

We can add these now to the pathology database.

pathologyDB_added <- rbind(pathologyDB, combinedGenesTop)
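
Because the new rows were built by column position, it may be worth confirming that the column names line up before stacking. A minimal sketch with small made-up tables (`db` and `new_rows` are hypothetical and use only three of the seven columns):

```r
# Toy pathology table and toy new rows with the same columns in a different order.
db <- data.frame(Ensembl_ID = "ENSG00000000003",
                 Genecards_ID = "TSPAN6",
                 FC_pathology_control = 2.5,
                 stringsAsFactors = FALSE)
new_rows <- data.frame(FC_pathology_control = 0.07,
                       Genecards_ID = "BMPER",
                       Ensembl_ID = "ENSG00000164619",
                       stringsAsFactors = FALSE)

# Stop early if the two tables do not share exactly the same column names.
stopifnot(setequal(names(db), names(new_rows)))

# Reorder the new rows by name so the stacked table follows the database layout.
combined <- rbind(db, new_rows[, names(db)])
print(combined)
```

rbind() on data frames matches columns by name and stops with an error when the names differ, so the explicit check mainly gives a clearer failure point.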

paged_table(pathologyDB_added)

There are now 521 genes in the pathology database with the autism genes added.

Let's write this file out to the all-pathologies data folder.

setwd(path)

write.csv(pathologyDB_added, "pathology_DB_autismAdded_5-18-2026.csv", row.names=F)

extract symbols to accession IDs autism data

Janis Harris

2026-05-16
