savant Autism GEO GSE15402 2009data MEV files

This file is a project from the GSE15402 GEO gene expression omnibus on National Center for Bioinformatics Information (NCBI). It explores autism and has 3 different types using lymphoblastic cell lines (LCL) from many autistic and control patients. The study is a gene chip study and has a published article last updated in 2019 but this data was uploaded to GEO in 2009 and you can tell by how many digits are in the GSE ID of only 5 when the newer ones I have worked on recent publications of 2024-2026 are 6 digits. It discovered 5 genes in the published article that was not read yet, but will if I need more information, which is likely. They said that the androgens seem to be impacted and the genes related to circadian rhythm in Autistic folk. The diagnosis of Autism Spectrum Disorder is made clinically by answering with questions and objective findings in clinic using a questionnaire.

I have never worked with MEV files but it says that they are just tab spaced and read.delimit function can open each one up. But there were 133 separate zipped files extracted to separate folders with each file inside the folder. I manually unzipped each one in Windows and then manually copied each file to a folder called ‘files unzipped no folders’ to retrieve. Lets look at the series text data first.

library(rmarkdown)

series_36 <- read.table("GSE15402_series_matrix.txt", nrows=35)

paged_table(series_36)

series9 <- read.table("GSE15402_series_matrix.txt", skip=36, nrow=9)

paged_table(series9)

We know the sample ID and type of sample for our project are in the 2nd table first 2 rows we named it series9.

dataframe <- series9[c(1:2),]
colnames(dataframe) <- series9[1,]
paged_table(dataframe)

data <- dataframe[2,c(2:117)]

paged_table(data)

Lets make our class vector to use later in predicting the class, its nice that these are ordered but necessary as grep will save the indices under that name. You can enter each name and it gives the indices, or you can just write the name of the grep object in the brackets and it is a placeholcer for columns of that type by index.

control <- grep('control',colnames(data))
language <- grep('language',colnames(data))
mild <- grep('mild', colnames(data))
savant <- grep('savant', colnames(data))

class <- c(1:116)
class[1:29] <- 'control'
class[30:60] <- 'language'
class[61:86] <- 'mild'
class[87:116] <- 'savant'

class

##   [1] "control"  "control"  "control"  "control"  "control"  "control" 
##   [7] "control"  "control"  "control"  "control"  "control"  "control" 
##  [13] "control"  "control"  "control"  "control"  "control"  "control" 
##  [19] "control"  "control"  "control"  "control"  "control"  "control" 
##  [25] "control"  "control"  "control"  "control"  "control"  "language"
##  [31] "language" "language" "language" "language" "language" "language"
##  [37] "language" "language" "language" "language" "language" "language"
##  [43] "language" "language" "language" "language" "language" "language"
##  [49] "language" "language" "language" "language" "language" "language"
##  [55] "language" "language" "language" "language" "language" "language"
##  [61] "mild"     "mild"     "mild"     "mild"     "mild"     "mild"    
##  [67] "mild"     "mild"     "mild"     "mild"     "mild"     "mild"    
##  [73] "mild"     "mild"     "mild"     "mild"     "mild"     "mild"    
##  [79] "mild"     "mild"     "mild"     "mild"     "mild"     "mild"    
##  [85] "mild"     "mild"     "savant"   "savant"   "savant"   "savant"  
##  [91] "savant"   "savant"   "savant"   "savant"   "savant"   "savant"  
##  [97] "savant"   "savant"   "savant"   "savant"   "savant"   "savant"  
## [103] "savant"   "savant"   "savant"   "savant"   "savant"   "savant"  
## [109] "savant"   "savant"   "savant"   "savant"   "savant"   "savant"  
## [115] "savant"   "savant"

table(class)

## class
##  control language     mild   savant 
##       29       31       26       30

We don’t have our genes yet. But we do have a folder for it. Lets go into the folder and open up the files one by one.

path <- "files unzipped no folders/"

setwd(path)

list.files()

##   [1] "GSM386518.mev" "GSM386519.mev" "GSM386520.mev" "GSM386521.mev"
##   [5] "GSM386522.mev" "GSM386523.mev" "GSM386524.mev" "GSM386525.mev"
##   [9] "GSM386526.mev" "GSM386527.mev" "GSM386528.mev" "GSM386529.mev"
##  [13] "GSM386530.mev" "GSM386531.mev" "GSM386532.mev" "GSM386533.mev"
##  [17] "GSM386534.mev" "GSM386535.mev" "GSM386536.mev" "GSM386537.mev"
##  [21] "GSM386538.mev" "GSM386539.mev" "GSM386540.mev" "GSM386541.mev"
##  [25] "GSM386542.mev" "GSM386543.mev" "GSM386544.mev" "GSM386545.mev"
##  [29] "GSM386546.mev" "GSM386547.mev" "GSM386548.mev" "GSM386549.mev"
##  [33] "GSM386550.mev" "GSM386551.mev" "GSM386552.mev" "GSM386553.mev"
##  [37] "GSM386554.mev" "GSM386555.mev" "GSM386556.mev" "GSM386557.mev"
##  [41] "GSM386558.mev" "GSM386559.mev" "GSM386560.mev" "GSM386561.mev"
##  [45] "GSM386562.mev" "GSM386563.mev" "GSM386564.mev" "GSM386565.mev"
##  [49] "GSM386566.mev" "GSM386567.mev" "GSM386568.mev" "GSM386569.mev"
##  [53] "GSM386570.mev" "GSM386571.mev" "GSM386572.mev" "GSM386573.mev"
##  [57] "GSM386574.mev" "GSM386575.mev" "GSM386576.mev" "GSM386577.mev"
##  [61] "GSM386578.mev" "GSM386579.mev" "GSM386580.mev" "GSM386581.mev"
##  [65] "GSM386582.mev" "GSM386583.mev" "GSM386584.mev" "GSM386585.mev"
##  [69] "GSM386586.mev" "GSM386587.mev" "GSM386588.mev" "GSM386589.mev"
##  [73] "GSM386590.mev" "GSM386591.mev" "GSM386592.mev" "GSM386593.mev"
##  [77] "GSM386594.mev" "GSM386595.mev" "GSM386596.mev" "GSM386597.mev"
##  [81] "GSM386598.mev" "GSM386599.mev" "GSM386600.mev" "GSM386601.mev"
##  [85] "GSM386602.mev" "GSM386603.mev" "GSM386604.mev" "GSM386605.mev"
##  [89] "GSM386606.mev" "GSM386607.mev" "GSM386608.mev" "GSM386609.mev"
##  [93] "GSM386610.mev" "GSM386611.mev" "GSM386612.mev" "GSM386613.mev"
##  [97] "GSM386614.mev" "GSM386615.mev" "GSM386616.mev" "GSM386617.mev"
## [101] "GSM386618.mev" "GSM386619.mev" "GSM386620.mev" "GSM386621.mev"
## [105] "GSM386622.mev" "GSM386623.mev" "GSM386624.mev" "GSM386625.mev"
## [109] "GSM386626.mev" "GSM386627.mev" "GSM386628.mev" "GSM386629.mev"
## [113] "GSM386630.mev" "GSM386631.mev" "GSM386632.mev" "GSM386633.mev"

There are 116 files that are mev type in this folder, each one extracted from a folder that it was unzipped to individually.

setwd(path)

file1 <- read.table("GSM386518.mev", header=T,comment.char='#')


dim(file1)

## [1] 41472    18

paged_table(file1[41470:41472,])

The columns for each file look like the micro array chip readings. UID must be row number, C must be count, IA and IB are integrated intensities for Cy3 and Cy5 respectively in the machine. I looked up on the internet for last few and got all columns:

UID is unique identifier for each spot
intensity is IA and IB for channel 1 and channel 2
R and C is Meta-Row and Meta-column, the block row and column
MR and MC is sub-row and sub-column a specific row or column in the block
SR and SC are spot area of spot row and spot column
FlagA and FlagB are for letter codes describing spot with A=0 non-saturated pixels in spot, B=0-50 non-saturated pixels in spot, C=50 or more, X is spot rejected due to spot shape or intensity relative to background, Y is background higher than spot intensity, Z is spot not detected by Spotfinder.
SA is actual spot in channel A
SF is saturation factor in channel A
QCScore is cumulative quality control score
QCA and QCB is quality control score in channel A and B respectively
BKgA and BKgB is background value in channel A and channel B

Nowhere is there a gene to go with this data that I am seeing. The internet says its in the Data section, but no Data section when I removed the comment tags of ‘#’ to read in the file. Each of the 116 files has one to read in.

These UID are unique identifiers of the TIGR system that has a table that translates the UID into the GenBank Accession number in the GPL or gene platform used, it is in the series information. In series_36 its line 32.

series_36[32,]

##                     V1      V2
## 32 !Series_platform_id GPL3427

We can download the GPL3427 platform and read it in.

GPL3427 <- read.delim('GPL3427_family.soft', nrow=235)

paged_table(GPL3427) #235 X 1

GenBank accession IDs are the nucleotide sequence reference, if you input it into NCBI database system search of all databases, the GenBank Accession ID takes you to a report of the nucleotide sequence or where referenced as a loci of a chromosome, similar to the address of a chromosome. The very first hundreds of thousands of rows had empty SPOT_IDs so I thought it was a problem with the delimiter reading in the commas at the ID. But checked and GB_acc is the correct entry and further down the 8million long platform dataframe there are SPOT_ID entries that are numeric of ‘null’ not just empty. The ID must be the row and column of the array, so a matrix ID by row and column, so probably why separated by a comma.

platform <- read.delim('GPL3427_family.soft', header=F, skip=235)

colnames(platform) <- c("ID","GB_ACC","SPOT_ID")

paged_table(platform[1800000:1800010,]) #8180580 X 3

And if you look at the file we opened of one of the MEV files, the R and C for row and column match up with the ID, the row of the platform could be the UID or unique identifier.

file1[1:5,]

##   UID     IA     IB R C MR MC SR SC FlagA FlagB SA SF QCscore    QCA    QCB
## 1   1      0      0 1 1  1  1  1  1     X     X 25  0  0.0000 0.0000 0.0000
## 2   2 117298 119918 1 2  1  1  1  2     B     B 36  1  0.9860 0.9860 0.9860
## 3   3  96678 233351 1 3  1  1  1  3     C     C 50  1  0.5967 0.2367 0.9567
## 4   4  61484  64172 1 4  1  1  1  4     B     B 49  1  0.7890 0.7806 0.7974
## 5   5  40932  22366 1 5  1  1  1  5     B     B 44  1  0.7640 0.7731 0.7549
##    BkgA  BkgB
## 1     0     0
## 2 15876 16200
## 3 20100 21850
## 4 18130 22687
## 5 17072 16192

platform[1:5,]

##    ID   GB_ACC SPOT_ID
## 1 1,2 AA486138        
## 2 1,3   N51018        
## 3 1,4   H65481        
## 4 1,5   T98628        
## 5 1,6   N34799

The row names can be used to see where the Row and Column are, but the first 5 rows of the file are the first row and down the columns excluding the first column and first row that is 0 or no value of intensity in channel 1 or IA or channel 2 or IB. We would then have to know what the values are, it looks like the Flag columns give the saturation of pixels in that row and column for the channels of A and B, with quality control scores. There is a combined score and individual scores per channel. We can revert to the series information to see which channel they used to get gene expression values. I have seen channel 1 used a lot, but there are so many machines.

Lets go back to the series9 dataframe of empty but meta data to each 116 samples with GSM ID.

paged_table(series9) #9 X 117

They have that they used 2 channels in the Sample_channel_count for each sample as a column. So the gene expression value is probably the QCscore that combines both channels.

The tail of the file1 we uploaded of 116 files should show number of rows and columns to compare in the platform table.

paged_table(file1[41470:41472,])

There are 384 Rows and 108 columns, the platform ID will be 384,108 as the entry. Lets see what the platform looks like at 384,108 for the ID column

platform_384_108 <- platform[platform$ID == "384,108",]
platform_384_108

##              ID    GB_ACC     SPOT_ID
## 41471   384,108                 blank
## 82989   384,108                      
## 124507  384,108                      
## 166025  384,108                      
## 207545  384,108                      
## 249065  384,108                      
## 290585  384,108                      
## 332105  384,108                      
## 373625  384,108                      
## 415145  384,108                      
## 456665  384,108                      
## 498185  384,108                      
## 539705  384,108                      
## 581225  384,108                      
## 622745  384,108                      
## 664265  384,108                      
## 705785  384,108                      
## 747305  384,108                      
## 788825  384,108                      
## 830345  384,108                      
## 871869  384,108      null        null
## 913393  384,108      null        null
## 954917  384,108      null        null
## 996441  384,108      null        null
## 1037965 384,108      null        null
## 1079489 384,108      null        null
## 1121013 384,108      null        null
## 1162537 384,108      null        null
## 1204061 384,108      null        null
## 1245585 384,108      null        null
## 1287109 384,108      null        null
## 1328633 384,108      null        null
## 1370157 384,108      null        null
## 1411681 384,108      null        null
## 1453205 384,108      null        null
## 1494729 384,108      null        null
## 1536253 384,108      null        null
## 1577777 384,108 -0.117222 0.117222026
## 1619301 384,108 -0.919426   0.9194264
## 1660825 384,108 -0.550725   0.5507253
## 1702349 384,108      null        null
## 1743873 384,108      null        null
## 1785397 384,108      null        null
## 1826921 384,108      null        null
## 1868445 384,108      null        null
## 1909969 384,108      null        null
## 1951493 384,108      null        null
## 1993017 384,108      null        null
## 2034541 384,108      null        null
## 2076065 384,108      null        null
## 2117589 384,108      null        null
## 2159113 384,108      null        null
## 2200637 384,108      null        null
## 2242161 384,108      null        null
## 2283685 384,108      null        null
## 2325209 384,108      null        null
## 2366733 384,108      null        null
## 2408257 384,108      null        null
## 2449781 384,108      null        null
## 2491305 384,108      null        null
## 2532829 384,108      null        null
## 2574353 384,108      null        null
## 2615877 384,108      null        null
## 2657401 384,108      null        null
## 2698925 384,108      null        null
## 2740449 384,108      null        null
## 2781973 384,108      null        null
## 2823497 384,108      null        null
## 2865021 384,108      null        null
## 2906545 384,108      null        null
## 2948069 384,108      null        null
## 2989593 384,108      null        null
## 3031117 384,108      null        null
## 3072641 384,108      null        null
## 3114165 384,108      null        null
## 3155689 384,108      null        null
## 3197213 384,108      null        null
## 3238737 384,108      null        null
## 3280261 384,108      null        null
## 3321785 384,108      null        null
## 3363309 384,108      null        null
## 3404833 384,108      null        null
## 3446357 384,108      null        null
## 3487881 384,108      null        null
## 3529405 384,108      null        null
## 3570929 384,108      null        null
## 3612453 384,108  -1.10021   1.1002067
## 3653977 384,108      null        null
## 3695501 384,108      null        null
## 3737025 384,108      null        null
## 3778549 384,108      null        null
## 3820073 384,108      null        null
## 3861597 384,108      null        null
## 3903121 384,108      null        null
## 3944645 384,108      null        null
## 3986169 384,108      null        null
## 4027693 384,108      null        null
## 4069217 384,108      null        null
## 4110741 384,108      null        null
## 4152265 384,108      null        null
## 4193789 384,108      null        null
## 4235313 384,108      null        null
## 4276837 384,108      null        null
## 4318361 384,108      null        null
## 4359885 384,108      null        null
## 4401409 384,108      null        null
## 4442933 384,108      null        null
## 4484457 384,108      null        null
## 4525981 384,108      null        null
## 4567505 384,108      null        null
## 4609029 384,108      null        null
## 4650553 384,108      null        null
## 4692077 384,108      null        null
## 4733601 384,108      null        null
## 4775125 384,108      null        null
## 4816649 384,108      null        null
## 4858173 384,108      null        null
## 4899697 384,108      null        null
## 4941221 384,108      null        null
## 4982745 384,108      null        null
## 5024269 384,108      null        null
## 5065793 384,108      null        null
## 5107317 384,108      null        null
## 5148841 384,108      null        null
## 5190365 384,108      null        null
## 5231889 384,108      null        null
## 5273413 384,108      null        null
## 5314937 384,108      null        null
## 5356461 384,108      null        null
## 5397985 384,108      null        null
## 5439509 384,108      null        null
## 5481033 384,108      null        null
## 5522557 384,108      null        null
## 5564081 384,108      null        null
## 5605605 384,108      null        null
## 5647129 384,108      null        null
## 5688655 384,108                      
## 5730182 384,108                      
## 5771709 384,108                      
## 5813236 384,108                      
## 5854763 384,108                      
## 5896290 384,108                      
## 5937816 384,108 -0.380484  0.38048366
## 5979343 384,108                      
## 6020870 384,108                      
## 6062397 384,108                      
## 6103923 384,108                      
## 6145450 384,108                      
## 6186976 384,108                      
## 6228502 384,108                      
## 6270028 384,108                      
## 6311554 384,108                      
## 6353080 384,108                      
## 6394606 384,108                      
## 6436132 384,108                      
## 6477658 384,108                      
## 6519184 384,108                      
## 6560710 384,108                      
## 6602236 384,108                      
## 6643762 384,108                      
## 6685288 384,108                      
## 6726814 384,108                      
## 6768340 384,108                      
## 6809866 384,108                      
## 6851392 384,108                      
## 6892918 384,108                      
## 6934444 384,108                      
## 6975970 384,108                      
## 7017496 384,108                      
## 7059022 384,108                      
## 7100548 384,108                      
## 7142074 384,108                      
## 7183600 384,108                      
## 7225126 384,108                      
## 7266653 384,108      null            
## 7308180 384,108      null            
## 7349707 384,108      null            
## 7391234 384,108      null            
## 7432761 384,108      null            
## 7474288 384,108      null            
## 7515815 384,108      null            
## 7557342 384,108      null            
## 7598869 384,108      null            
## 7640396 384,108      null            
## 7681923 384,108      null            
## 7723450 384,108      null            
## 7764977 384,108      null            
## 7806504 384,108      null            
## 7848031 384,108      null            
## 7889558 384,108      null            
## 7931085 384,108      null            
## 7972612 384,108      null            
## 8014139 384,108      null            
## 8055666 384,108      null            
## 8097193 384,108      null            
## 8138720 384,108      null            
## 8180247 384,108      null

This is not the tail of platform because many entries for the row and column end of the file in the platform. But the ID in the platform is the unique ID to this series as platforms or GPL are shared with multiple series and this is the catalog of its entry when using this platform and identifying the machine values.

The rows are off by 1 where it looks like uploading the platform file I cut off 1 extra line. Lets try it again I skipped one too many lines maybe. array, so a matrix ID by row and column, so probably why separated by a comma.

platform <- read.delim('GPL3427_family.soft', header=F, skip=234)

colnames(platform) <- c("ID","GB_ACC","SPOT_ID")

paged_table(platform[41470:41472,])

That works! the row number is the same in the file1 as it should be in the platform when reading in the file I originally skipped too many lines. But we see above it is corrected and row 41472 is the ID 384,108 for the row 384 and column 108 of the file.

Lets see if at least one more file also has this same amount of rows.

setwd(path)

file2 <- read.table("GSM386519.mev", header=T,comment.char='#')

paged_table(file2[41470:41472,])

It looks the same.

Now lets remove the file1 and file2 and start reading in each file by their file name from this list, and double check again to see if the QCscore is the gene expression reading for that ID.

Lets look at all the files we must individually read in. A for loop would be great for this but I forgot how to do that. Maybe AI has a solution.

setwd(path)

list.files()

##   [1] "GSM386518.mev" "GSM386519.mev" "GSM386520.mev" "GSM386521.mev"
##   [5] "GSM386522.mev" "GSM386523.mev" "GSM386524.mev" "GSM386525.mev"
##   [9] "GSM386526.mev" "GSM386527.mev" "GSM386528.mev" "GSM386529.mev"
##  [13] "GSM386530.mev" "GSM386531.mev" "GSM386532.mev" "GSM386533.mev"
##  [17] "GSM386534.mev" "GSM386535.mev" "GSM386536.mev" "GSM386537.mev"
##  [21] "GSM386538.mev" "GSM386539.mev" "GSM386540.mev" "GSM386541.mev"
##  [25] "GSM386542.mev" "GSM386543.mev" "GSM386544.mev" "GSM386545.mev"
##  [29] "GSM386546.mev" "GSM386547.mev" "GSM386548.mev" "GSM386549.mev"
##  [33] "GSM386550.mev" "GSM386551.mev" "GSM386552.mev" "GSM386553.mev"
##  [37] "GSM386554.mev" "GSM386555.mev" "GSM386556.mev" "GSM386557.mev"
##  [41] "GSM386558.mev" "GSM386559.mev" "GSM386560.mev" "GSM386561.mev"
##  [45] "GSM386562.mev" "GSM386563.mev" "GSM386564.mev" "GSM386565.mev"
##  [49] "GSM386566.mev" "GSM386567.mev" "GSM386568.mev" "GSM386569.mev"
##  [53] "GSM386570.mev" "GSM386571.mev" "GSM386572.mev" "GSM386573.mev"
##  [57] "GSM386574.mev" "GSM386575.mev" "GSM386576.mev" "GSM386577.mev"
##  [61] "GSM386578.mev" "GSM386579.mev" "GSM386580.mev" "GSM386581.mev"
##  [65] "GSM386582.mev" "GSM386583.mev" "GSM386584.mev" "GSM386585.mev"
##  [69] "GSM386586.mev" "GSM386587.mev" "GSM386588.mev" "GSM386589.mev"
##  [73] "GSM386590.mev" "GSM386591.mev" "GSM386592.mev" "GSM386593.mev"
##  [77] "GSM386594.mev" "GSM386595.mev" "GSM386596.mev" "GSM386597.mev"
##  [81] "GSM386598.mev" "GSM386599.mev" "GSM386600.mev" "GSM386601.mev"
##  [85] "GSM386602.mev" "GSM386603.mev" "GSM386604.mev" "GSM386605.mev"
##  [89] "GSM386606.mev" "GSM386607.mev" "GSM386608.mev" "GSM386609.mev"
##  [93] "GSM386610.mev" "GSM386611.mev" "GSM386612.mev" "GSM386613.mev"
##  [97] "GSM386614.mev" "GSM386615.mev" "GSM386616.mev" "GSM386617.mev"
## [101] "GSM386618.mev" "GSM386619.mev" "GSM386620.mev" "GSM386621.mev"
## [105] "GSM386622.mev" "GSM386623.mev" "GSM386624.mev" "GSM386625.mev"
## [109] "GSM386626.mev" "GSM386627.mev" "GSM386628.mev" "GSM386629.mev"
## [113] "GSM386630.mev" "GSM386631.mev" "GSM386632.mev" "GSM386633.mev"

We will do that next time. This is part 1. Keep checking in.

colnames(file1)

##  [1] "UID"     "IA"      "IB"      "R"       "C"       "MR"      "MC"     
##  [8] "SR"      "SC"      "FlagA"   "FlagB"   "SA"      "SF"      "QCscore"
## [15] "QCA"     "QCB"     "BkgA"    "BkgB"

The columns we want to keep are 1, 4, 5, 14. We will column bind the other samples of 116 samples total, so 115 other samples we will only combine the QCscore or 14th column.

setwd(path)

GSM386518 <- read.table("GSM386518.mev", header=T, comment.char='#')

GSM386518_start <- GSM386518[,c(1,4,5,14)]

setwd(path)

GSM386519 <- read.table("GSM386519.mev", header=T, comment.char='#')

setwd(path)

GSM386520 <- read.table("GSM386520.mev", header=T, comment.char='#')

setwd(path)

GSM386521 <- read.table("GSM386521.mev", header=T, comment.char='#')

setwd(path)

GSM386522 <- read.table("GSM386522.mev", header=T, comment.char='#')

setwd(path)

GSM386523 <- read.table("GSM386523.mev", header=T, comment.char='#')

setwd(path)

GSM386524 <- read.table("GSM386524.mev", header=T, comment.char='#')

setwd(path)

GSM386525 <- read.table("GSM386525.mev", header=T, comment.char='#')

setwd(path)

GSM386526 <- read.table("GSM386526.mev", header=T, comment.char='#')

setwd(path)

GSM386527 <- read.table("GSM386527.mev", header=T, comment.char='#')

setwd(path)

GSM386528 <- read.table("GSM386528.mev", header=T, comment.char='#')

setwd(path)

GSM386529 <- read.table("GSM386529.mev", header=T, comment.char='#')

setwd(path)

GSM386530 <- read.table("GSM386530.mev", header=T, comment.char='#')

setwd(path)

GSM386531 <- read.table("GSM386531.mev", header=T, comment.char='#')

setwd(path)

GSM386532 <- read.table("GSM386532.mev", header=T, comment.char='#')

setwd(path)

GSM386533 <- read.table("GSM386533.mev", header=T, comment.char='#')

setwd(path)

GSM386534 <- read.table("GSM386534.mev", header=T, comment.char='#')

setwd(path)

GSM386535 <- read.table("GSM386535.mev", header=T, comment.char='#')

setwd(path)

GSM386536 <- read.table("GSM386536.mev", header=T, comment.char='#')

setwd(path)

GSM386537 <- read.table("GSM386537.mev", header=T, comment.char='#')

setwd(path)

GSM386538 <- read.table("GSM386538.mev", header=T, comment.char='#')

setwd(path)

GSM386539 <- read.table("GSM386539.mev", header=T, comment.char='#')

setwd(path)

GSM386540 <- read.table("GSM386540.mev", header=T, comment.char='#')

setwd(path)

GSM386541 <- read.table("GSM386541.mev", header=T, comment.char='#')

setwd(path)

GSM386542 <- read.table("GSM386542.mev", header=T, comment.char='#')

setwd(path)

GSM386543 <- read.table("GSM386543.mev", header=T, comment.char='#')

setwd(path)

GSM386544 <- read.table("GSM386544.mev", header=T, comment.char='#')

setwd(path)

GSM386545 <- read.table("GSM386545.mev", header=T, comment.char='#')

setwd(path)

GSM386546 <- read.table("GSM386546.mev", header=T, comment.char='#')

setwd(path)

GSM386547 <- read.table("GSM386547.mev", header=T, comment.char='#')

setwd(path)

GSM386548 <- read.table("GSM386548.mev", header=T, comment.char='#')

setwd(path)

GSM386549 <- read.table("GSM386549.mev", header=T, comment.char='#')

setwd(path)

GSM386550 <- read.table("GSM386550.mev", header=T, comment.char='#')

setwd(path)

GSM386551 <- read.table("GSM386551.mev", header=T, comment.char='#')

setwd(path)

GSM386552 <- read.table("GSM386552.mev", header=T, comment.char='#')

setwd(path)

GSM386553 <- read.table("GSM386553.mev", header=T, comment.char='#')

setwd(path)

GSM386554 <- read.table("GSM386554.mev", header=T, comment.char='#')

setwd(path)

GSM386555 <- read.table("GSM386555.mev", header=T, comment.char='#')

setwd(path)

GSM386556 <- read.table("GSM386556.mev", header=T, comment.char='#')

setwd(path)

GSM386557 <- read.table("GSM386557.mev", header=T, comment.char='#')

setwd(path)

GSM386558 <- read.table("GSM386558.mev", header=T, comment.char='#')

setwd(path)

GSM386559 <- read.table("GSM386559.mev", header=T, comment.char='#')

setwd(path)

GSM386560 <- read.table("GSM386560.mev", header=T, comment.char='#')

setwd(path)

GSM386561 <- read.table("GSM386561.mev", header=T, comment.char='#')

setwd(path)

GSM386562 <- read.table("GSM386562.mev", header=T, comment.char='#')

setwd(path)

GSM386563 <- read.table("GSM386563.mev", header=T, comment.char='#')

setwd(path)

GSM386564 <- read.table("GSM386564.mev", header=T, comment.char='#')

setwd(path)

GSM386565 <- read.table("GSM386565.mev", header=T, comment.char='#')

setwd(path)

GSM386566 <- read.table("GSM386566.mev", header=T, comment.char='#')

setwd(path)

GSM386567 <- read.table("GSM386567.mev", header=T, comment.char='#')

setwd(path)

GSM386568 <- read.table("GSM386568.mev", header=T, comment.char='#')

setwd(path)

GSM386569 <- read.table("GSM386569.mev", header=T, comment.char='#')

setwd(path)

GSM386570 <- read.table("GSM386570.mev", header=T, comment.char='#')

setwd(path)

GSM386571 <- read.table("GSM386571.mev", header=T, comment.char='#')

setwd(path)

GSM386572 <- read.table("GSM386572.mev", header=T, comment.char='#')

setwd(path)

GSM386573 <- read.table("GSM386573.mev", header=T, comment.char='#')

setwd(path)

GSM386574 <- read.table("GSM386574.mev", header=T, comment.char='#')

setwd(path)

GSM386575 <- read.table("GSM386575.mev", header=T, comment.char='#')

setwd(path)

GSM386576 <- read.table("GSM386576.mev", header=T, comment.char='#')

setwd(path)

GSM386577 <- read.table("GSM386577.mev", header=T, comment.char='#')

setwd(path)

GSM386578 <- read.table("GSM386578.mev", header=T, comment.char='#')

setwd(path)

GSM386579 <- read.table("GSM386579.mev", header=T, comment.char='#')

setwd(path)

GSM386580 <- read.table("GSM386580.mev", header=T, comment.char='#')

setwd(path)

GSM386581 <- read.table("GSM386581.mev", header=T, comment.char='#')

setwd(path)

GSM386582 <- read.table("GSM386582.mev", header=T, comment.char='#')

setwd(path)

GSM386583 <- read.table("GSM386583.mev", header=T, comment.char='#')

setwd(path)

GSM386584 <- read.table("GSM386584.mev", header=T, comment.char='#')

setwd(path)

GSM386585 <- read.table("GSM386585.mev", header=T, comment.char='#')

setwd(path)

GSM386586 <- read.table("GSM386586.mev", header=T, comment.char='#')

setwd(path)

GSM386587 <- read.table("GSM386587.mev", header=T, comment.char='#')

setwd(path)

GSM386588 <- read.table("GSM386588.mev", header=T, comment.char='#')

setwd(path)

GSM386589 <- read.table("GSM386589.mev", header=T, comment.char='#')

setwd(path)

GSM386590 <- read.table("GSM386590.mev", header=T, comment.char='#')

setwd(path)

GSM386591 <- read.table("GSM386591.mev", header=T, comment.char='#')

setwd(path)

GSM386592 <- read.table("GSM386592.mev", header=T, comment.char='#')

setwd(path)

GSM386593 <- read.table("GSM386593.mev", header=T, comment.char='#')

setwd(path)

GSM386594 <- read.table("GSM386594.mev", header=T, comment.char='#')

setwd(path)

GSM386595 <- read.table("GSM386595.mev", header=T, comment.char='#')

setwd(path)

GSM386596 <- read.table("GSM386596.mev", header=T, comment.char='#')

setwd(path)

GSM386597 <- read.table("GSM386597.mev", header=T, comment.char='#')

setwd(path)

GSM386598 <- read.table("GSM386598.mev", header=T, comment.char='#')

setwd(path)

GSM386599 <- read.table("GSM386599.mev", header=T, comment.char='#')

setwd(path)

GSM386600 <- read.table("GSM386600.mev", header=T, comment.char='#')

setwd(path)

GSM386601 <- read.table("GSM386601.mev", header=T, comment.char='#')

setwd(path)

GSM386602 <- read.table("GSM386602.mev", header=T, comment.char='#')

setwd(path)

GSM386603 <- read.table("GSM386603.mev", header=T, comment.char='#')

setwd(path)

GSM386604 <- read.table("GSM386604.mev", header=T, comment.char='#')

setwd(path)

GSM386605 <- read.table("GSM386605.mev", header=T, comment.char='#')

setwd(path)

GSM386606 <- read.table("GSM386606.mev", header=T, comment.char='#')

setwd(path)

GSM386607 <- read.table("GSM386607.mev", header=T, comment.char='#')

setwd(path)

GSM386608 <- read.table("GSM386608.mev", header=T, comment.char='#')

setwd(path)

GSM386609 <- read.table("GSM386609.mev", header=T, comment.char='#')

setwd(path)

GSM386610 <- read.table("GSM386610.mev", header=T, comment.char='#')

setwd(path)

GSM386611 <- read.table("GSM386611.mev", header=T, comment.char='#')

setwd(path)

GSM386612 <- read.table("GSM386612.mev", header=T, comment.char='#')

setwd(path)

GSM386613 <- read.table("GSM386613.mev", header=T, comment.char='#')

setwd(path)

GSM386614 <- read.table("GSM386614.mev", header=T, comment.char='#')

setwd(path)

GSM386615 <- read.table("GSM386615.mev", header=T, comment.char='#')

setwd(path)

GSM386616 <- read.table("GSM386616.mev", header=T, comment.char='#')

setwd(path)

GSM386617 <- read.table("GSM386617.mev", header=T, comment.char='#')

setwd(path)

GSM386618 <- read.table("GSM386618.mev", header=T, comment.char='#')

setwd(path)

GSM386619 <- read.table("GSM386619.mev", header=T, comment.char='#')

setwd(path)

GSM386620 <- read.table("GSM386620.mev", header=T, comment.char='#')

setwd(path)

GSM386621 <- read.table("GSM386621.mev", header=T, comment.char='#')

setwd(path)

GSM386622 <- read.table("GSM386622.mev", header=T, comment.char='#')

setwd(path)

GSM386623 <- read.table("GSM386623.mev", header=T, comment.char='#')

setwd(path)

GSM386624 <- read.table("GSM386624.mev", header=T, comment.char='#')

setwd(path)

GSM386625 <- read.table("GSM386625.mev", header=T, comment.char='#')

setwd(path)

GSM386626 <- read.table("GSM386626.mev", header=T, comment.char='#')

setwd(path)

GSM386627 <- read.table("GSM386627.mev", header=T, comment.char='#')

setwd(path)

GSM386628 <- read.table("GSM386628.mev", header=T, comment.char='#')

setwd(path)

GSM386629 <- read.table("GSM386629.mev", header=T, comment.char='#')

setwd(path)

GSM386630 <- read.table("GSM386630.mev", header=T, comment.char='#')

setwd(path)

GSM386631 <- read.table("GSM386631.mev", header=T, comment.char='#')

setwd(path)

GSM386632 <- read.table("GSM386632.mev", header=T, comment.char='#')

setwd(path)

GSM386633 <- read.table("GSM386633.mev", header=T, comment.char='#')

BIG <- cbind(GSM386518_start, GSM386519[,14],GSM386520[,14],GSM386521[,14],GSM386522[,14],GSM386523[,14],
             GSM386524[,14],GSM386525[,14],GSM386526[,14],GSM386527[,14],GSM386528[,14],GSM386529[,14],
             GSM386530[,14],GSM386531[,14],GSM386532[,14],GSM386533[,14],GSM386534[,14],GSM386535[,14],
             GSM386536[,14],GSM386537[,14],GSM386538[,14],GSM386539[,14],GSM386540[,14],GSM386541[,14],
             GSM386542[,14],GSM386543[,14],GSM386544[,14],GSM386545[,14],GSM386546[,14],GSM386547[,14],
             GSM386548[,14],GSM386549[,14],GSM386550[,14],GSM386551[,14],GSM386552[,14],GSM386553[,14],
             GSM386554[,14],GSM386555[,14],GSM386556[,14],GSM386557[,14],GSM386558[,14],GSM386559[,14],
             GSM386560[,14],GSM386561[,14],GSM386562[,14],GSM386563[,14],GSM386564[,14],GSM386565[,14],
             GSM386566[,14],GSM386567[,14],GSM386568[,14],GSM386569[,14],GSM386570[,14],GSM386571[,14],
             GSM386572[,14],GSM386573[,14],GSM386574[,14],GSM386575[,14],GSM386576[,14],GSM386577[,14],
             GSM386578[,14],GSM386579[,14],GSM386580[,14],GSM386581[,14],GSM386582[,14],GSM386583[,14],
             GSM386584[,14],GSM386585[,14],GSM386586[,14],GSM386587[,14],GSM386588[,14],GSM386589[,14],
             GSM386590[,14],GSM386591[,14],GSM386592[,14],GSM386593[,14],GSM386594[,14],GSM386595[,14],
             GSM386596[,14],GSM386597[,14],GSM386598[,14],GSM386599[,14],GSM386600[,14],GSM386601[,14],
             GSM386602[,14],GSM386603[,14],GSM386604[,14],GSM386605[,14],GSM386606[,14],GSM386607[,14],
GSM386608[,14],GSM386609[,14],GSM386610[,14],GSM386611[,14],GSM386612[,14],GSM386613[,14],GSM386614[,14],
GSM386615[,14],GSM386616[,14],GSM386617[,14],GSM386618[,14],GSM386619[,14],
GSM386620[,14],GSM386621[,14],GSM386622[,14],GSM386623[,14],GSM386624[,14],GSM386625[,14],GSM386626[,14],
GSM386627[,14],GSM386628[,14],GSM386629[,14],GSM386630[,14],GSM386631[,14],GSM386632[,14],GSM386633[,14]
)

colnames(BIG)[4] <- "GSM386518"

colnames(BIG) <- gsub("[, 14]", "", colnames(BIG), fixed=T)

colnames(BIG)

##   [1] "UID"       "R"         "C"         "GSM386518" "GSM386519" "GSM386520"
##   [7] "GSM386521" "GSM386522" "GSM386523" "GSM386524" "GSM386525" "GSM386526"
##  [13] "GSM386527" "GSM386528" "GSM386529" "GSM386530" "GSM386531" "GSM386532"
##  [19] "GSM386533" "GSM386534" "GSM386535" "GSM386536" "GSM386537" "GSM386538"
##  [25] "GSM386539" "GSM386540" "GSM386541" "GSM386542" "GSM386543" "GSM386544"
##  [31] "GSM386545" "GSM386546" "GSM386547" "GSM386548" "GSM386549" "GSM386550"
##  [37] "GSM386551" "GSM386552" "GSM386553" "GSM386554" "GSM386555" "GSM386556"
##  [43] "GSM386557" "GSM386558" "GSM386559" "GSM386560" "GSM386561" "GSM386562"
##  [49] "GSM386563" "GSM386564" "GSM386565" "GSM386566" "GSM386567" "GSM386568"
##  [55] "GSM386569" "GSM386570" "GSM386571" "GSM386572" "GSM386573" "GSM386574"
##  [61] "GSM386575" "GSM386576" "GSM386577" "GSM386578" "GSM386579" "GSM386580"
##  [67] "GSM386581" "GSM386582" "GSM386583" "GSM386584" "GSM386585" "GSM386586"
##  [73] "GSM386587" "GSM386588" "GSM386589" "GSM386590" "GSM386591" "GSM386592"
##  [79] "GSM386593" "GSM386594" "GSM386595" "GSM386596" "GSM386597" "GSM386598"
##  [85] "GSM386599" "GSM386600" "GSM386601" "GSM386602" "GSM386603" "GSM386604"
##  [91] "GSM386605" "GSM386606" "GSM386607" "GSM386608" "GSM386609" "GSM386610"
##  [97] "GSM386611" "GSM386612" "GSM386613" "GSM386614" "GSM386615" "GSM386616"
## [103] "GSM386617" "GSM386618" "GSM386619" "GSM386620" "GSM386621" "GSM386622"
## [109] "GSM386623" "GSM386624" "GSM386625" "GSM386626" "GSM386627" "GSM386628"
## [115] "GSM386629" "GSM386630" "GSM386631" "GSM386632" "GSM386633"

So we have our 116 samples plus the row(R) and column(C) and the UID for unique ID that matches row number in platform GPL table to the Gene Bank accession number.

paged_table(BIG)

paged_table(series9)

The controls or healthy start with GSM386518 through GSM386546, the language is GSM386547 through GSM386577, and mild is GSM386578 through GSM386603, and the savant are GSM386604 through GSM386633.

BIG$healthy_mean <- rowMeans(BIG[,c(4:32)])
BIG$language_mean <- rowMeans(BIG[,c(33:63)])
BIG$mild_mean <- rowMeans(BIG[,c(64:89)])
BIG$savant_mean <- rowMeans(BIG[,c(90:119)])

colnames(BIG)

##   [1] "UID"           "R"             "C"             "GSM386518"    
##   [5] "GSM386519"     "GSM386520"     "GSM386521"     "GSM386522"    
##   [9] "GSM386523"     "GSM386524"     "GSM386525"     "GSM386526"    
##  [13] "GSM386527"     "GSM386528"     "GSM386529"     "GSM386530"    
##  [17] "GSM386531"     "GSM386532"     "GSM386533"     "GSM386534"    
##  [21] "GSM386535"     "GSM386536"     "GSM386537"     "GSM386538"    
##  [25] "GSM386539"     "GSM386540"     "GSM386541"     "GSM386542"    
##  [29] "GSM386543"     "GSM386544"     "GSM386545"     "GSM386546"    
##  [33] "GSM386547"     "GSM386548"     "GSM386549"     "GSM386550"    
##  [37] "GSM386551"     "GSM386552"     "GSM386553"     "GSM386554"    
##  [41] "GSM386555"     "GSM386556"     "GSM386557"     "GSM386558"    
##  [45] "GSM386559"     "GSM386560"     "GSM386561"     "GSM386562"    
##  [49] "GSM386563"     "GSM386564"     "GSM386565"     "GSM386566"    
##  [53] "GSM386567"     "GSM386568"     "GSM386569"     "GSM386570"    
##  [57] "GSM386571"     "GSM386572"     "GSM386573"     "GSM386574"    
##  [61] "GSM386575"     "GSM386576"     "GSM386577"     "GSM386578"    
##  [65] "GSM386579"     "GSM386580"     "GSM386581"     "GSM386582"    
##  [69] "GSM386583"     "GSM386584"     "GSM386585"     "GSM386586"    
##  [73] "GSM386587"     "GSM386588"     "GSM386589"     "GSM386590"    
##  [77] "GSM386591"     "GSM386592"     "GSM386593"     "GSM386594"    
##  [81] "GSM386595"     "GSM386596"     "GSM386597"     "GSM386598"    
##  [85] "GSM386599"     "GSM386600"     "GSM386601"     "GSM386602"    
##  [89] "GSM386603"     "GSM386604"     "GSM386605"     "GSM386606"    
##  [93] "GSM386607"     "GSM386608"     "GSM386609"     "GSM386610"    
##  [97] "GSM386611"     "GSM386612"     "GSM386613"     "GSM386614"    
## [101] "GSM386615"     "GSM386616"     "GSM386617"     "GSM386618"    
## [105] "GSM386619"     "GSM386620"     "GSM386621"     "GSM386622"    
## [109] "GSM386623"     "GSM386624"     "GSM386625"     "GSM386626"    
## [113] "GSM386627"     "GSM386628"     "GSM386629"     "GSM386630"    
## [117] "GSM386631"     "GSM386632"     "GSM386633"     "healthy_mean" 
## [121] "language_mean" "mild_mean"     "savant_mean"

BIG$FC_language_healthy <- BIG$language_mean/BIG$healthy_mean

BIG$FC_mild_healthy <- BIG$mild_mean/BIG$healthy_mean

BIG$FC_savant_healthy <- BIG$savant_mean/BIG$healthy_mean

paged_table(BIG)

summary(BIG[,124:126])

##  FC_language_healthy FC_mild_healthy   FC_savant_healthy
##  Min.   : 0.0000     Min.   : 0.0000   Min.   : 0.1452  
##  1st Qu.: 0.8751     1st Qu.: 0.8512   1st Qu.: 0.8705  
##  Median : 0.9615     Median : 0.9415   Median : 0.9639  
##  Mean   : 0.9528     Mean   : 0.9436   Mean   : 0.9818  
##  3rd Qu.: 1.0260     3rd Qu.: 1.0095   3rd Qu.: 1.0176  
##  Max.   :12.9857     Max.   :11.7146   Max.   :14.7887

BIG_language_order <- BIG[order(BIG$FC_language_healthy, decreasing=T),]

tail(BIG_language_order$FC_language_healthy)

## [1] 0.05517816 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

BIG_languageFC <- BIG_language_order[BIG_language_order$FC_language_healthy > 0,]
top_languageFC <- BIG_languageFC[c(1:10,41458:41467),]

BIG_mild_order <- BIG[order(BIG$FC_mild_healthy, decreasing=T),]

tail(BIG_mild_order$FC_language_healthy)

## [1] 1.5290631 0.5179005 0.6304165 0.3940067 0.3623758 2.8225370

BIG_mildFC <- BIG_mild_order[BIG_mild_order$FC_mild_healthy > 0,] #41450X126
top_mildFC <- BIG_mildFC[c(1:10,41441:41450),]

paged_table(top_mildFC)

BIG_savant_order <- BIG[order(BIG$FC_savant_healthy, decreasing=T),]

tail(BIG_savant_order$FC_language_healthy)

## [1] 0.4267026 0.3269997 0.6662624 0.2694706 0.4897680 1.3663469

BIG_savantFC <- BIG_savant_order[BIG_savant_order$FC_savant_healthy > 0,] #41472X126
top_savantFC <- BIG_savantFC[c(1:10,41463:41472),]

paged_table(top_savantFC)

shortPlatform <- platform[1:41472,]

dim(shortPlatform)

## [1] 41472     3

shortPlatform$UID <- row.names(shortPlatform)

paged_table(shortPlatform)

Language_20gb <- merge(top_languageFC, shortPlatform, by.x='UID', by.y="UID")

paged_table(Language_20gb)

write.csv(Language_20gb, "Language_foldchange_top20_genebankAccession.csv", row.names=F)

mild_20gb <- merge(top_mildFC, shortPlatform, by.x='UID', by.y="UID")

paged_table(mild_20gb)

savant_20gb <- merge(top_savantFC, shortPlatform, by.x='UID', by.y="UID")

paged_table(savant_20gb)

write.csv(mild_20gb, 'mild_foldchange_top20_genebankAccession.csv', row.names=F)
write.csv(savant_20gb, 'savant_foldchange_top20_genebankAccession.csv', row.names=F)

language top 20 here. mild top 20 here. savant top 20 here.

Thanks so much we got our top 20 genes of stimulated or up regulated and inhibited or down regulated.

We will have to go to another site to get the gene names from the gene bank accession IDs, using BLAST as one way of retrieving those genes for our top genes in ASD, autism spectrum disorder.

=================================================

******* Part 2 Machine Learning

rm("GSM386518" ,        
 "GSM386518_start" ,   "GSM386519"    ,      "GSM386520" ,         "GSM386521" ,         "GSM386522" ,        
 "GSM386523"        ,  "GSM386524"     ,     "GSM386525"  ,        "GSM386526"  ,        "GSM386527"  ,       
  "GSM386528"        ,  "GSM386529"     ,     "GSM386530"  ,        "GSM386531"  ,        "GSM386532"  ,       
  "GSM386533"         , "GSM386534"      ,    "GSM386535"   ,       "GSM386536"   ,       "GSM386537"   ,      
  "GSM386538"          ,"GSM386539"       ,   "GSM386540"    ,      "GSM386541"    ,      "GSM386542"    ,     
  "GSM386543"     ,     "GSM386544"        ,  "GSM386545"     ,     "GSM386546"     ,     "GSM386547"     ,    
  "GSM386548"      ,    "GSM386549"         , "GSM386550"      ,    "GSM386551"      ,    "GSM386552"      ,   
  "GSM386553"       ,   "GSM386554" ,         "GSM386555"       ,   "GSM386556"       ,   "GSM386557" ,        
  "GSM386558"        ,  "GSM386559"  ,        "GSM386560"        ,  "GSM386561",          "GSM386562"  ,       
  "GSM386563"        ,  "GSM386564"   ,       "GSM386565",          "GSM386566" ,         "GSM386567"   ,      
  "GSM386568"         , "GSM386569"    ,      "GSM386570" ,         "GSM386571"  ,        "GSM386572"    ,     
  "GSM386573"          ,"GSM386574"     ,     "GSM386575"  ,        "GSM386576"   ,       "GSM386577"     ,    
  "GSM386578"     ,     "GSM386579"      ,    "GSM386580"   ,       "GSM386581"    ,      "GSM386582"      ,   
  "GSM386583"      ,    "GSM386584"       ,   "GSM386585"    ,      "GSM386586"     ,     "GSM386587"       ,  
  "GSM386588"       ,   "GSM386589"        ,  "GSM386590"     ,     "GSM386591"      ,    "GSM386592" ,        
  "GSM386593"        ,  "GSM386594"         , "GSM386595"      ,    "GSM386596" ,         "GSM386597"  ,       
  "GSM386598"         , "GSM386599" ,         "GSM386600"       ,   "GSM386601"  ,        "GSM386602"   ,      
 "GSM386603"    ,      "GSM386604"   ,       "GSM386605"         , "GSM386606"    ,      "GSM386607"     ,    
 "GSM386608"     ,     "GSM386609"    ,      "GSM386610",          "GSM386611"     ,     "GSM386612"      ,   
 "GSM386613"      ,    "GSM386614"     ,     "GSM386615" ,         "GSM386616"      ,    "GSM386617"       ,  
 "GSM386618"       ,   "GSM386619"      ,    "GSM386620"  ,        "GSM386621"       ,   "GSM386622"        , 
 "GSM386623"        ,  "GSM386624"       ,   "GSM386625"   ,       "GSM386626"        ,  "GSM386627"  ,       
 "GSM386628"         , "GSM386629"        ,  "GSM386630"    ,      "GSM386631"         , "GSM386632"   ,      
"GSM386633"        )

paged_table(savant_20gb)

colnames(savant_20gb)

##   [1] "UID"                 "R"                   "C"                  
##   [4] "GSM386518"           "GSM386519"           "GSM386520"          
##   [7] "GSM386521"           "GSM386522"           "GSM386523"          
##  [10] "GSM386524"           "GSM386525"           "GSM386526"          
##  [13] "GSM386527"           "GSM386528"           "GSM386529"          
##  [16] "GSM386530"           "GSM386531"           "GSM386532"          
##  [19] "GSM386533"           "GSM386534"           "GSM386535"          
##  [22] "GSM386536"           "GSM386537"           "GSM386538"          
##  [25] "GSM386539"           "GSM386540"           "GSM386541"          
##  [28] "GSM386542"           "GSM386543"           "GSM386544"          
##  [31] "GSM386545"           "GSM386546"           "GSM386547"          
##  [34] "GSM386548"           "GSM386549"           "GSM386550"          
##  [37] "GSM386551"           "GSM386552"           "GSM386553"          
##  [40] "GSM386554"           "GSM386555"           "GSM386556"          
##  [43] "GSM386557"           "GSM386558"           "GSM386559"          
##  [46] "GSM386560"           "GSM386561"           "GSM386562"          
##  [49] "GSM386563"           "GSM386564"           "GSM386565"          
##  [52] "GSM386566"           "GSM386567"           "GSM386568"          
##  [55] "GSM386569"           "GSM386570"           "GSM386571"          
##  [58] "GSM386572"           "GSM386573"           "GSM386574"          
##  [61] "GSM386575"           "GSM386576"           "GSM386577"          
##  [64] "GSM386578"           "GSM386579"           "GSM386580"          
##  [67] "GSM386581"           "GSM386582"           "GSM386583"          
##  [70] "GSM386584"           "GSM386585"           "GSM386586"          
##  [73] "GSM386587"           "GSM386588"           "GSM386589"          
##  [76] "GSM386590"           "GSM386591"           "GSM386592"          
##  [79] "GSM386593"           "GSM386594"           "GSM386595"          
##  [82] "GSM386596"           "GSM386597"           "GSM386598"          
##  [85] "GSM386599"           "GSM386600"           "GSM386601"          
##  [88] "GSM386602"           "GSM386603"           "GSM386604"          
##  [91] "GSM386605"           "GSM386606"           "GSM386607"          
##  [94] "GSM386608"           "GSM386609"           "GSM386610"          
##  [97] "GSM386611"           "GSM386612"           "GSM386613"          
## [100] "GSM386614"           "GSM386615"           "GSM386616"          
## [103] "GSM386617"           "GSM386618"           "GSM386619"          
## [106] "GSM386620"           "GSM386621"           "GSM386622"          
## [109] "GSM386623"           "GSM386624"           "GSM386625"          
## [112] "GSM386626"           "GSM386627"           "GSM386628"          
## [115] "GSM386629"           "GSM386630"           "GSM386631"          
## [118] "GSM386632"           "GSM386633"           "healthy_mean"       
## [121] "language_mean"       "mild_mean"           "savant_mean"        
## [124] "FC_language_healthy" "FC_mild_healthy"     "FC_savant_healthy"  
## [127] "ID"                  "GB_ACC"              "SPOT_ID"

We need columns 4-119 for samples plus the loci of gene bank accession ID as GB_ACC column 128.

savant_df20 <- savant_20gb[,c(4:119,128)]

paged_table(savant_df20)

savant_df20$GB_ACC <- as.factor(savant_df20$GB_ACC)

savant_df20$GB_ACC

##  [1] N63864            AI222331          T99336            AA486089         
##  [9] AI018277 AI016962          AA427367 W03687   N35115   AA778346         
## [17] N62729   AI348022 AA927340 AI074469
## 15 Levels:  AA427367 AA486089 AA778346 AA927340 AI016962 AI018277 ... W03687

sv <- savant_df20[order(savant_df20$GB_ACC, decreasing=T),]

sv$GB_ACC

##  [1] W03687   T99336   N63864   N62729   N35115   AI348022 AI222331 AI074469
##  [9] AI018277 AI016962 AA927340 AA778346 AA486089 AA427367                  
## [17]                                    
## 15 Levels:  AA427367 AA486089 AA778346 AA927340 AI016962 AI018277 ... W03687

We can only use the first 14 rows of GB_ACC IDs.

savant_GB_acc14 <- sv[c(1:14),]


paged_table(savant_GB_acc14)

We have the savant top 14 genes due to limitations of the GB_ACC ID. Now we will get the Language and mild top 20 of useable GB_ACC gene loci and then merge with the genes in the larger data to get the healthy

sv_mx <- data.frame(t(savant_GB_acc14[,1:116]))
colnames(sv_mx) <- savant_GB_acc14$GB_ACC

sv_mx$class <- as.factor(class)

paged_table(sv_mx)

Filter to only get the healthy controls and the savant ASD.

healthy_savant_mx <- sv_mx[sv_mx$class == "control" | sv_mx$class == "savant",]

healthy_savant_mx$class <- as.character(healthy_savant_mx$class) #remove the 2 classes attached to original vector of 4 classes

healthy_savant_mx$class <- as.factor(healthy_savant_mx$class)

paged_table(healthy_savant_mx)

table(healthy_savant_mx$class)

## 
## control  savant 
##      29      30

# install.packages("randomForest")

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

set.seed(123)

inTrain <- sample(1:59, .80*59)

training <- healthy_savant_mx[inTrain,]

testing <- healthy_savant_mx[-inTrain,]

table(training$class)

## 
## control  savant 
##      24      23

table(testing$class)

## 
## control  savant 
##       5       7

rf_hs <- randomForest(training[1:14], training$class, mtry=4, ntree=5000, confusion=T, importance=T)

rf_hs$confusion

##         control savant class.error
## control      23      1  0.04166667
## savant        3     20  0.13043478

Pretty good predictor set of genes for savant versus healthy samples on a two class model. But we need to see how it does on prediction.

predicted_hs <- predict(rf_hs, testing)

results <- data.frame(predicted=predicted_hs, actual=testing$class)

results

##           predicted  actual
## GSM386521   control control
## GSM386528   control control
## GSM386533    savant control
## GSM386535    savant control
## GSM386539   control control
## GSM386608    savant  savant
## GSM386610    savant  savant
## GSM386613    savant  savant
## GSM386615    savant  savant
## GSM386619   control  savant
## GSM386623    savant  savant
## GSM386629   control  savant

On the predicted model, there were 12 samples, and 7 savants total with 5/7 correctly predicted savant, the other 2/7 incorrectly predicted healthy control. There were 5 healthy control samples which 3/5 correctly predicted control and 2/5 incorrectly predicted control or healthy.

The accuracy on these genes in prediction for savant vs healthy on the validation test set. There were a total of 8/12 correctly predicted that is 67% accuracy, with savant having a 71% accuracy and control having a 60% accuracy.

We still need to see how well the language GB_ACC IDs and mild ones compare to healthy in predicting their ASD group compared to healthy, then we will see how well all these genes together or separately or both predict a 4 class model of healthy, language, mild, or savant.

Now we are going to test the Language ASD genes on how well they perform in predicting the class of healthy control or language ASD. We have to remove the blank or empty gene bank accession IDs and work with remaining genes, then form a matrix with the data and then see how well those genes predict a 2 class model of healthy control or language ASD.

language_ordered <- Language_20gb[order(Language_20gb$GB_ACC, decreasing=T),]
paged_table(language_ordered)

We want rows 1-17 and columns 4-119 and 128 for building our matrix.

language_df17 <- language_ordered[c(1:17),c(4:119,128)]

paged_table(language_df17)

table(class)

## class
##  control language     mild   savant 
##       29       31       26       30

language_mx <- data.frame(t(language_df17[,1:116]))
colnames(language_mx) <- language_df17$GB_ACC

language_mx$class <- class

language_mx <- subset(language_mx, language_mx$class == "control" | language_mx$class == "language")

language_mx$class <- as.factor(as.character(language_mx$class))#for 2 levels not 4 of origin

paged_table(language_mx)

set.seed(123)

inTrain <- sample(1:60, .80*60)

training <- language_mx[inTrain,]

testing <- language_mx[-inTrain,]

table(training$class)

## 
##  control language 
##       24       24

table(testing$class)

## 
##  control language 
##        5        7

rf_language <- randomForest(training[1:16], training$class, mtry=6, ntrees=5000, confusion=T, importance=T)

rf_language$confusion

##          control language class.error
## control       20        4  0.16666667
## language       1       23  0.04166667

The training set scored well, the healthy controls had 20/24 correctly predicted and the language had 23/24 predicted correctly.

Now lets see how well this model predicts on the hold out validation test set.

predicted_language <- predict(rf_language, testing)

results <- data.frame(predicted=predicted_language, actual=testing$class)

results

##           predicted   actual
## GSM386521   control  control
## GSM386528  language  control
## GSM386533   control  control
## GSM386535  language  control
## GSM386539   control  control
## GSM386551  language language
## GSM386557  language language
## GSM386563  language language
## GSM386569   control language
## GSM386571  language language
## GSM386572  language language
## GSM386573  language language

The model scored 9/12 correct with 3/5 healthy controls predicted correctly and 6/7 language predicted correctly. That is 75% accuracy overall. Not bad.

Now we do the same for the mild ASD versus healthy controls.

mild_ordered20 <- mild_20gb[order(mild_20gb$GB_ACC, decreasing=T),]

mild_ordered20$GB_ACC

##  [1] "W95682"   "T86999"   "T65948"   "R64686"   "R53966"   "R40855"  
##  [7] "N73877"   "N56875"   "H27752"   "AI351587" "AI348022" "AI346406"
## [13] "AA778346" "AA448277" "AA158162" "AA001718" ""         ""        
## [19] ""         ""

We want rows 1:16 and for columns 4:119 and 128

mild_df16 <- mild_ordered20[c(1:16),c(4:119, 128)]

paged_table(mild_df16)

Now we make the matrix.

mild_mx <- data.frame(t(mild_df16[,1:116]))

colnames(mild_mx) <- mild_df16$GB_ACC

mild_mx$class <- class

paged_table(mild_mx)

mild_mx <- subset(mild_mx, mild_mx$class == "control" | mild_mx$class == "mild")

mild_mx$class <- as.factor(as.character(mild_mx$class))

table(mild_mx$class)

## 
## control    mild 
##      29      26

str(mild_mx$class)

##  Factor w/ 2 levels "control","mild": 1 1 1 1 1 1 1 1 1 1 ...

set.seed(123)

inTrain <- sample(1:55, .8*55)

training <- mild_mx[inTrain,]
testing <- mild_mx[-inTrain,]

table(training$class)

## 
## control    mild 
##      26      18

table(testing$class)

## 
## control    mild 
##       3       8

rf_mild <- randomForest(training[1:16], training$class, mtry=5, ntree=5000, confusion=T, importance=T)

rf_mild$confusion

##         control mild class.error
## control      20    6   0.2307692
## mild          4   14   0.2222222

In the training set, the healthy controls scored 81% accuracy with 21/26 correct and the mild ASD scored 78% accuracy with 14/18 correct.

Lets see how well it predicts, this model.

predicted_mild <- predict(rf_mild, testing)

results <- data.frame(predicted=predicted_mild, actual=testing$class)

results

##           predicted  actual
## GSM386518   control control
## GSM386533   control control
## GSM386535   control control
## GSM386578      mild    mild
## GSM386580   control    mild
## GSM386581      mild    mild
## GSM386584      mild    mild
## GSM386587   control    mild
## GSM386594      mild    mild
## GSM386601      mild    mild
## GSM386603      mild    mild

In the hold out validation set, there were 2/11 predicted incorrectly and they were both mild samples. All controls 3/3 were predicted correctly, and 7/9 mild samples predicted correctly. The overall accuracy was 9/11 or 82% accuracy.

Seems these gene bank accession IDs are good locations of identifyng the class of ASD or healthy control.

language_df17$ASD_group <- "language"
mild_df16$ASD_group <- "mild"
savant_GB_acc14$ASD_group <- "savant"

topGB_IDs <- rbind(language_df17, mild_df16, savant_GB_acc14)

paged_table(topGB_IDs)

write.csv(topGB_IDs, "topGB_IDs_47total_GB_accessionIDs_language_mild_savant.csv", row.names=F)

You can get this file here.

Lets look up a few of these Gene Bank IDs in BLAST. But we get the nucleotide sequence from NCBI after looking it up by Gene Bank Accession ID to get the FASTA file of nucleotide sequence.

Lets look up AA427367 and AI018277 both savant genes in NCBI and then in BLAST with the nucleotide sequence found in NCBI.

BLAST image screenshot inserting FASTA nucleotide sequence

We will look at this more in depth next time in Part 4. We found that these genes were great with more than 70% accuracy in all sets of language, mild, and savant in predicting the class of the ASD type or a healthy control. Next, we will see how well these genes together can predict a 4 class model of healthy, language, mild, or savant.

=====================================================================

The data set we are using is topGB_IDs, we are going to basically look up each of the 47 gene bank accesssion IDs and get the FASTA sequence of nucleotides and copy and paste that into the BLAST site and select the genomic + transcripts similar name and then open in a new window to get the top genome selected by BLAST the Basic Local Alignment Search Tool and then click that top ranked gene location locus/loci and then go to the graphics link and select it then search for the query and the neighboring genes near it.

paged_table(topGB_IDs)

topGB_IDs$GB_ACC

##  [1] "W03687"   "T86983"   "R93875"   "R76772"   "R64686"   "R38172"  
##  [7] "N62522"   "N51529"   "H63994"   "AI351317" "AI348022" "AI198536"
## [13] "AI016962" "AA908954" "AA456715" "AA400272" "AA158162" "W95682"  
## [19] "T86999"   "T65948"   "R64686"   "R53966"   "R40855"   "N73877"  
## [25] "N56875"   "H27752"   "AI351587" "AI348022" "AI346406" "AA778346"
## [31] "AA448277" "AA158162" "AA001718" "W03687"   "T99336"   "N63864"  
## [37] "N62729"   "N35115"   "AI348022" "AI222331" "AI074469" "AI018277"
## [43] "AI016962" "AA927340" "AA778346" "AA486089" "AA427367"

These are the GB_ACC IDs we will find:

W03687 , MLIP
T86983 , RPAIN
R93875 , NAP1L1
R76772, SEPTIN6
R64686 , CFAP144
R38172 , RIF1
N62522 , SLAMF8 & SNHG28
N51529 , TET3
H63994 , FTCDNL1
AI351317, RAP1B
AI348022 , NA
AI198536 , TLCD4
AI016962, RIC1
AA908954, ATF2
AA456715 , KBTBD2
AA400272, TIPARP
AA158162 , TUT1
W95682 , MYO1B
T86999 , WNK1
T65948 , STING1
R64686 , CFAP144
R53966 , CHN1
R40855 , ADAMTS5
N73877 , NA
N56875 , CST7
H27752 , AQP7
AI351587 , NA
AI348022, NA
AI346406 , ABCB8
AA778346 , PRMT2
AA448277 , FOXO1
AA158162 , TUT1
AA001718 , BMPER
W03687 , MLIP
T99336 , CEP164
N63864 , PBX3
N62729 , CNKSR3
N35115 , FAM111A-DT
AI348022 , NA
AI222331 , MKRN1
AI074469 , THOC1
AI018277 , FGGY
AI016962 , RIC1
AA927340 , PSEN2
AA778346 , PRMT2
AA486089 , NAV1
AA427367 , BECN1 & CNTD1

genes_to_GB_ACC <- c("MLIP",
                     "RPAIN",
                     "NAP1L1",
                     "SEPTIN6",
                     "CFAP144",
                     "RIF1",
                     "SLAMF8 & SNHG28",
                     "TET3",
                     "FTCDNL1",
                     "RAP1B",
                     NA,
                     "TLCD4",
                     "RIC1",
                     "ATF2",
                     "KBTBD2",
                     "TIPARP",
                    "TUT1",
                    "MYO1B",
                    "WNK1",
                    "STING1",
                    "CFAP144",
                    "CHN1",
                    "ADAMTS5",
                    NA,
                    "CST7",
                    "AQP7",
                    NA,
                    NA,
                    "ABCB8",
                    "PRMT2",
                    "FOXO1",
                     "TUT1",
                    "BMPER",
                    "MLIP",
                    "CEP164",
                    "PBX3",
                    "CNKSR3",
                    "FAM111A-DT",
                    NA,
                    "MKRN1",
                    "THOC1",
                    "FGGY",
                    "RIC1",
                    "PSEN2",
                    "PRMT2",
                    "NAV1",
                    "CNTD1 & BECN1"
                    )

Lets go ahead and add this vector of gene IDs to our data of 47 observations of gene bank accession IDs, we have 5 NAs and 2 Gene Banck Accession IDs with 2 genes overlapping. We didn’t include the LOC genes or NC genes if present but looked for closest gene on same strand of reverse or forward (to the right) or different strand.

topGB_IDs$Gene_ID <- genes_to_GB_ACC

paged_table(topGB_IDs)

row 7 for language has 2 genes, and so does row 47. We will make 2 separate rows for those entries and leave the other rows with other entry.

row7_47 <- topGB_IDs[c(7,47),]
row7_47[,c(118:119)]

##     ASD_group         Gene_ID
## 1    language SLAMF8 & SNHG28
## 122    savant   CNTD1 & BECN1

row7_47$Gene_ID <- c("SNHG28","BECN1")
topGB_IDs$Gene_ID[7] <- "SLAMF8"
topGB_IDs$Gene_ID[47] <- "CNTD1"

topGenes <- rbind(topGB_IDs, row7_47)

paged_table(topGenes)

We went ahead and made two more observations for the double gene entries of row 7 and row 47 and changed the double entry at those rows to the 1st entry while putting the 2nd entry for Gene_ID in the added 2 rows. Then we rbind or row bound the 2 rows to the data. Now we are going to remove the NAs.

top44 <- which(is.na(topGenes$Gene_ID))

Top44 <- topGenes[-top44,]

paged_table(Top44)

top44_ordered <- Top44[order(Top44$ASD_group, decreasing=T),]

Now we will look at the genes and remove the duplicates just for our machine learning.

top44_df <- top44_ordered[,c(1:116,119)]

colnames(top44_df)

##   [1] "GSM386518" "GSM386519" "GSM386520" "GSM386521" "GSM386522" "GSM386523"
##   [7] "GSM386524" "GSM386525" "GSM386526" "GSM386527" "GSM386528" "GSM386529"
##  [13] "GSM386530" "GSM386531" "GSM386532" "GSM386533" "GSM386534" "GSM386535"
##  [19] "GSM386536" "GSM386537" "GSM386538" "GSM386539" "GSM386540" "GSM386541"
##  [25] "GSM386542" "GSM386543" "GSM386544" "GSM386545" "GSM386546" "GSM386547"
##  [31] "GSM386548" "GSM386549" "GSM386550" "GSM386551" "GSM386552" "GSM386553"
##  [37] "GSM386554" "GSM386555" "GSM386556" "GSM386557" "GSM386558" "GSM386559"
##  [43] "GSM386560" "GSM386561" "GSM386562" "GSM386563" "GSM386564" "GSM386565"
##  [49] "GSM386566" "GSM386567" "GSM386568" "GSM386569" "GSM386570" "GSM386571"
##  [55] "GSM386572" "GSM386573" "GSM386574" "GSM386575" "GSM386576" "GSM386577"
##  [61] "GSM386578" "GSM386579" "GSM386580" "GSM386581" "GSM386582" "GSM386583"
##  [67] "GSM386584" "GSM386585" "GSM386586" "GSM386587" "GSM386588" "GSM386589"
##  [73] "GSM386590" "GSM386591" "GSM386592" "GSM386593" "GSM386594" "GSM386595"
##  [79] "GSM386596" "GSM386597" "GSM386598" "GSM386599" "GSM386600" "GSM386601"
##  [85] "GSM386602" "GSM386603" "GSM386604" "GSM386605" "GSM386606" "GSM386607"
##  [91] "GSM386608" "GSM386609" "GSM386610" "GSM386611" "GSM386612" "GSM386613"
##  [97] "GSM386614" "GSM386615" "GSM386616" "GSM386617" "GSM386618" "GSM386619"
## [103] "GSM386620" "GSM386621" "GSM386622" "GSM386623" "GSM386624" "GSM386625"
## [109] "GSM386626" "GSM386627" "GSM386628" "GSM386629" "GSM386630" "GSM386631"
## [115] "GSM386632" "GSM386633" "Gene_ID"

top44_DF <- top44_df[!duplicated(top44_df$Gene_ID),]

dim(top44_DF)

## [1]  39 117

We removed duplicate genes in MLIP, RIC1, PRMT2, TUT1, and CFAP144. Now we can make our matrix for the 4 class model in prediction to see how well all these 39 genes after removing duplicates and NAs of the GB_ACC loci.

top44_mx <- data.frame(t(top44_DF[,1:116]))
colnames(top44_mx) <- top44_DF$Gene_ID

top44_mx$class <- as.factor(class)

paged_table(top44_mx)

This is our matrix. Now we can random sample indices and make our 80% training set and 20% validation hold-out testing set to see how well our random forest model predicts these 4 classes of control (healthy), language ASD, mild ASD, or savant ASD.

set.seed(123)

inTrain <- sample(1:116, .8*116)

training <- top44_mx[inTrain,]
testing <- top44_mx[-inTrain,]

table(training$class)

## 
##  control language     mild   savant 
##       20       25       23       24

table(testing$class)

## 
##  control language     mild   savant 
##        9        6        3        6

rf_4class <- randomForest(training[1:39], training$class, mtry=12, ntree=5000, confusion=T, importance=T)

rf_4class$confusion

##          control language mild savant class.error
## control        8        2    6      4   0.6000000
## language       3       16    3      3   0.3600000
## mild           5        5   11      2   0.5217391
## savant         2        4    1     17   0.2916667

In our confusion matrix by classification, The savant class scored best with 71% accuracy 17/24 predicted correctly, 4/24 predicted as language, 2/24 as healthy control, and 1/24 as mild. The next best was language in 64% accuracy with 16/25 correctly predicted, 3/25 predicted healthy, 3/25 predicted mild, and 3/25 predicted savant. The mild class was 48% accurate with 11/23 correctly predicted, and then healthy controls worse with only 40% accuracy in 8/20 predicted correctly. We thought that might occur due to the imbalance in samples with less healthy samples to train.

Now we will see how well this model predicts the validation set.

prediction_4class <- predict(rf_4class, testing)

results <- data.frame(predicted=prediction_4class, actual=testing$class)

results

##           predicted   actual
## GSM386518   control  control
## GSM386519   control  control
## GSM386527      mild  control
## GSM386528   control  control
## GSM386535   control  control
## GSM386536   control  control
## GSM386537   control  control
## GSM386541  language  control
## GSM386545    savant  control
## GSM386550  language language
## GSM386557  language language
## GSM386562    savant language
## GSM386563  language language
## GSM386566  language language
## GSM386572  language language
## GSM386579   control     mild
## GSM386581  language     mild
## GSM386582  language     mild
## GSM386606    savant   savant
## GSM386617  language   savant
## GSM386619   control   savant
## GSM386625    savant   savant
## GSM386627    savant   savant
## GSM386629      mild   savant

There are 9 controls with 6/9 correctly predicted,, and 3/9 incorrectly predicted one in each other class. There are 6 of the language class and 5/6 predicted correctly but 1/6 predicted savant. The mild there were 3 and all 3 incorrectly predicted with 0% accuracy for 0/3 correct, 2/3 as language and 1/3 as healthy control. There were 6 savant samples, and 3/6 predicted correctly with 50% accuracy, with the other 3 incorrectly predicted in one of each other class.

The overall score in accuracy is 14/24 correct among all 4 classes just above 50% at 58.3% accuracy. Language scored the best with 5/6 correct or 83.3% accuracy.

Lets look at the importance of these gene features in predicting the class in this model.

rf_4class$importance

##                  control      language          mild        savant
## MLIP        0.0196141436 -0.0001079565 -0.0020763858 -1.477116e-03
## CEP164     -0.0019396370 -0.0005230902  0.0016963142  3.416798e-03
## PBX3        0.0135131840 -0.0008264462  0.0115194186  4.926165e-02
## CNKSR3      0.0015907992 -0.0011029137 -0.0033310001  4.821112e-03
## FAM111A-DT -0.0020600072 -0.0007202231 -0.0004218204  6.914652e-04
## MKRN1      -0.0011694322 -0.0015518330 -0.0004451010  3.052068e-03
## THOC1      -0.0007736580 -0.0003763037  0.0005916667 -8.679654e-06
## FGGY       -0.0013577850 -0.0033738048 -0.0064913276  3.944372e-03
## RIC1        0.0038665729 -0.0029339716 -0.0030003169  1.936546e-03
## PSEN2       0.0024399567 -0.0009820313 -0.0052615218  6.600651e-03
## PRMT2       0.0020386386 -0.0011069285  0.0006958847 -1.706098e-03
## NAV1        0.0007760245 -0.0008788564 -0.0027281530  4.766800e-03
## CNTD1       0.0053226429 -0.0044379900  0.0054030320  7.016235e-03
## BECN1       0.0058631746 -0.0054869074  0.0065221373  7.043365e-03
## MYO1B      -0.0080827101  0.0083040132  0.0393362915 -4.681939e-04
## WNK1       -0.0059078960 -0.0036855391  0.0254285287 -2.841657e-03
## STING1     -0.0053742785 -0.0001914259  0.0145299323 -2.258959e-03
## CFAP144     0.0031309596  0.0021516391  0.0036729226 -3.837084e-03
## CHN1        0.0017453680 -0.0022763534 -0.0050988739 -1.301563e-03
## ADAMTS5     0.0056677128  0.0007451871  0.0129055089 -3.485653e-03
## CST7        0.0022497042  0.0001411747 -0.0013187951  9.693504e-04
## AQP7        0.0036330231 -0.0003630858  0.0089055983  1.092918e-03
## ABCB8      -0.0042662438 -0.0041739677  0.0140588406  1.501179e-03
## FOXO1       0.0100846703  0.0074818042  0.0139100771 -2.261987e-03
## TUT1        0.0062244733 -0.0004046034  0.0020368171  8.173241e-04
## BMPER      -0.0011204329  0.0049488486  0.0019286380 -1.124377e-03
## RPAIN       0.0024049822  0.0043015716 -0.0018967657 -1.063939e-04
## NAP1L1     -0.0036360872  0.0125823310 -0.0100306038 -3.340630e-03
## SEPTIN6     0.0180368803  0.0063385492 -0.0074179304 -1.931274e-03
## RIF1       -0.0086684244  0.0413524897 -0.0087698465  2.583202e-02
## SLAMF8      0.0036529742  0.0104367517 -0.0003997047 -1.708729e-03
## TET3        0.0023573354 -0.0025290970 -0.0035780725  1.452323e-04
## FTCDNL1     0.0167948762  0.0102672481  0.0028062676 -1.864472e-03
## RAP1B       0.0016096226  0.0002653971  0.0025609085 -5.193484e-04
## TLCD4       0.0033860817  0.0054703402 -0.0021870485 -1.569696e-03
## ATF2        0.0021823255  0.0137346970  0.0044828533 -1.939509e-03
## KBTBD2     -0.0055049090  0.0198126457  0.0020236530 -2.415300e-03
## TIPARP      0.0051943434  0.0192765422 -0.0049994805 -2.139075e-03
## SNHG28      0.0030867749  0.0099969536 -0.0005158858 -1.113049e-03
##            MeanDecreaseAccuracy MeanDecreaseGini
## MLIP               2.948602e-03        2.5292573
## CEP164             8.098680e-04        1.3146458
## PBX3               1.766068e-02        3.9249676
## CNKSR3             4.181854e-04        1.5585023
## FAM111A-DT        -5.549257e-04        0.5211649
## MKRN1              2.322043e-05        1.4256611
## THOC1             -1.103119e-04        0.2585823
## FGGY              -1.848726e-03        1.4738872
## RIC1              -1.750769e-04        1.1284747
## PSEN2              6.865761e-04        1.9361936
## PRMT2             -1.717936e-05        1.1361110
## NAV1               5.105883e-04        1.1731100
## CNTD1              2.831829e-03        1.7842142
## BECN1              2.908556e-03        1.7017189
## MYO1B              9.034131e-03        2.6326275
## WNK1               2.497875e-03        1.9041070
## STING1             1.556358e-03        1.9368503
## CFAP144            1.181375e-03        1.3598107
## CHN1              -1.814265e-03        1.0628419
## ADAMTS5            3.257445e-03        1.8132883
## CST7               3.980064e-04        1.1407448
## AQP7               2.949016e-03        1.6325775
## ABCB8              1.230356e-03        1.4227803
## FOXO1              6.702237e-03        2.7598204
## TUT1               1.864217e-03        1.4071016
## BMPER              1.264619e-03        1.1777672
## RPAIN              1.055661e-03        1.2921315
## NAP1L1            -8.183188e-04        2.0241247
## SEPTIN6            3.014036e-03        2.8197942
## RIF1               1.278322e-02        3.5296894
## SLAMF8             2.841338e-03        1.5815549
## TET3              -9.688841e-04        1.0765874
## FTCDNL1            6.221253e-03        2.4599227
## RAP1B              9.591689e-04        1.3364054
## TLCD4              1.361158e-03        1.9618149
## ATF2               4.443218e-03        2.0445974
## KBTBD2             3.604229e-03        2.1444071
## TIPARP             4.250009e-03        2.0855893
## SNHG28             2.621116e-03        1.5800650

Lets try with a higher mtry and ntree.

rf_4classB <- randomForest(training[1:39], training$class, mtry=14, ntree=15000, confusion=T, importance=T)

rf_4classB$confusion

##          control language mild savant class.error
## control        8        2    6      4   0.6000000
## language       3       16    3      3   0.3600000
## mild           6        4   11      2   0.5217391
## savant         1        5    0     18   0.2500000

predictionB <- predict(rf_4classB, testing)

results <- data.frame(predicted=predictionB, actual=testing$class)
results

##           predicted   actual
## GSM386518   control  control
## GSM386519   control  control
## GSM386527   control  control
## GSM386528   control  control
## GSM386535   control  control
## GSM386536   control  control
## GSM386537   control  control
## GSM386541  language  control
## GSM386545    savant  control
## GSM386550  language language
## GSM386557  language language
## GSM386562    savant language
## GSM386563  language language
## GSM386566  language language
## GSM386572  language language
## GSM386579   control     mild
## GSM386581  language     mild
## GSM386582  language     mild
## GSM386606    savant   savant
## GSM386617      mild   savant
## GSM386619   control   savant
## GSM386625    savant   savant
## GSM386627    savant   savant
## GSM386629      mild   savant

The model did better with more mtrys and ntrees on the testing set, so it generalized better and did worse on its training set it seems at first. Controls are 7/9 correct. Language is 5/6 correct. Mild is still terrible at 0/3 correct. Savant is still 3/6 correct. The total accuracy is 7+5+3 = 15/24 correct. An improvement of 1/15 of 6.7% improvement in accuracy overall with a score of 62.5% up from 58.3% with different parameter settings.

That is it for today. We may or may not add these to our pathology database.

=====================================================================

Adding in the meta information for the genes we found to see what the genes describe as the top genes using another previous study for that information provided in that study GSE271486.

path <- "C:...Genecards IDs GSE271486 resource/" #location of the file on your desktop

setwd(path)

ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv", header=T)

paged_table(ensembl[1:10,]) #50868 X 16

colnames(ensembl)

##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"

ensembl_4 <- ensembl[,c(1:4)]

paged_table(ensembl_4[1:10,])

Now lets merge the meta information with the Top44 dataset and with the BIG data set for use with our larger global goals in a larger pathology machine model to predict all pathologies we have looked at so far.

metaTop44 <- merge(Top44, ensembl_4, by.x="Gene_ID", by.y="gene_name")

colnames(metaTop44)

##   [1] "Gene_ID"     "GSM386518"   "GSM386519"   "GSM386520"   "GSM386521"  
##   [6] "GSM386522"   "GSM386523"   "GSM386524"   "GSM386525"   "GSM386526"  
##  [11] "GSM386527"   "GSM386528"   "GSM386529"   "GSM386530"   "GSM386531"  
##  [16] "GSM386532"   "GSM386533"   "GSM386534"   "GSM386535"   "GSM386536"  
##  [21] "GSM386537"   "GSM386538"   "GSM386539"   "GSM386540"   "GSM386541"  
##  [26] "GSM386542"   "GSM386543"   "GSM386544"   "GSM386545"   "GSM386546"  
##  [31] "GSM386547"   "GSM386548"   "GSM386549"   "GSM386550"   "GSM386551"  
##  [36] "GSM386552"   "GSM386553"   "GSM386554"   "GSM386555"   "GSM386556"  
##  [41] "GSM386557"   "GSM386558"   "GSM386559"   "GSM386560"   "GSM386561"  
##  [46] "GSM386562"   "GSM386563"   "GSM386564"   "GSM386565"   "GSM386566"  
##  [51] "GSM386567"   "GSM386568"   "GSM386569"   "GSM386570"   "GSM386571"  
##  [56] "GSM386572"   "GSM386573"   "GSM386574"   "GSM386575"   "GSM386576"  
##  [61] "GSM386577"   "GSM386578"   "GSM386579"   "GSM386580"   "GSM386581"  
##  [66] "GSM386582"   "GSM386583"   "GSM386584"   "GSM386585"   "GSM386586"  
##  [71] "GSM386587"   "GSM386588"   "GSM386589"   "GSM386590"   "GSM386591"  
##  [76] "GSM386592"   "GSM386593"   "GSM386594"   "GSM386595"   "GSM386596"  
##  [81] "GSM386597"   "GSM386598"   "GSM386599"   "GSM386600"   "GSM386601"  
##  [86] "GSM386602"   "GSM386603"   "GSM386604"   "GSM386605"   "GSM386606"  
##  [91] "GSM386607"   "GSM386608"   "GSM386609"   "GSM386610"   "GSM386611"  
##  [96] "GSM386612"   "GSM386613"   "GSM386614"   "GSM386615"   "GSM386616"  
## [101] "GSM386617"   "GSM386618"   "GSM386619"   "GSM386620"   "GSM386621"  
## [106] "GSM386622"   "GSM386623"   "GSM386624"   "GSM386625"   "GSM386626"  
## [111] "GSM386627"   "GSM386628"   "GSM386629"   "GSM386630"   "GSM386631"  
## [116] "GSM386632"   "GSM386633"   "GB_ACC"      "ASD_group"   "gene_id"    
## [121] "description" "locus"

Top44_meta <- metaTop44[,c(1,118:122,2:117)]

paged_table(Top44_meta)

Now we have to do a few merges with the BIG dataset and the shortPlatform.

BIG_gbAdded <- merge(BIG, shortPlatform, by.x="UID", by.y="UID")

colnames(BIG_gbAdded)

##   [1] "UID"                 "R"                   "C"                  
##   [4] "GSM386518"           "GSM386519"           "GSM386520"          
##   [7] "GSM386521"           "GSM386522"           "GSM386523"          
##  [10] "GSM386524"           "GSM386525"           "GSM386526"          
##  [13] "GSM386527"           "GSM386528"           "GSM386529"          
##  [16] "GSM386530"           "GSM386531"           "GSM386532"          
##  [19] "GSM386533"           "GSM386534"           "GSM386535"          
##  [22] "GSM386536"           "GSM386537"           "GSM386538"          
##  [25] "GSM386539"           "GSM386540"           "GSM386541"          
##  [28] "GSM386542"           "GSM386543"           "GSM386544"          
##  [31] "GSM386545"           "GSM386546"           "GSM386547"          
##  [34] "GSM386548"           "GSM386549"           "GSM386550"          
##  [37] "GSM386551"           "GSM386552"           "GSM386553"          
##  [40] "GSM386554"           "GSM386555"           "GSM386556"          
##  [43] "GSM386557"           "GSM386558"           "GSM386559"          
##  [46] "GSM386560"           "GSM386561"           "GSM386562"          
##  [49] "GSM386563"           "GSM386564"           "GSM386565"          
##  [52] "GSM386566"           "GSM386567"           "GSM386568"          
##  [55] "GSM386569"           "GSM386570"           "GSM386571"          
##  [58] "GSM386572"           "GSM386573"           "GSM386574"          
##  [61] "GSM386575"           "GSM386576"           "GSM386577"          
##  [64] "GSM386578"           "GSM386579"           "GSM386580"          
##  [67] "GSM386581"           "GSM386582"           "GSM386583"          
##  [70] "GSM386584"           "GSM386585"           "GSM386586"          
##  [73] "GSM386587"           "GSM386588"           "GSM386589"          
##  [76] "GSM386590"           "GSM386591"           "GSM386592"          
##  [79] "GSM386593"           "GSM386594"           "GSM386595"          
##  [82] "GSM386596"           "GSM386597"           "GSM386598"          
##  [85] "GSM386599"           "GSM386600"           "GSM386601"          
##  [88] "GSM386602"           "GSM386603"           "GSM386604"          
##  [91] "GSM386605"           "GSM386606"           "GSM386607"          
##  [94] "GSM386608"           "GSM386609"           "GSM386610"          
##  [97] "GSM386611"           "GSM386612"           "GSM386613"          
## [100] "GSM386614"           "GSM386615"           "GSM386616"          
## [103] "GSM386617"           "GSM386618"           "GSM386619"          
## [106] "GSM386620"           "GSM386621"           "GSM386622"          
## [109] "GSM386623"           "GSM386624"           "GSM386625"          
## [112] "GSM386626"           "GSM386627"           "GSM386628"          
## [115] "GSM386629"           "GSM386630"           "GSM386631"          
## [118] "GSM386632"           "GSM386633"           "healthy_mean"       
## [121] "language_mean"       "mild_mean"           "savant_mean"        
## [124] "FC_language_healthy" "FC_mild_healthy"     "FC_savant_healthy"  
## [127] "ID"                  "GB_ACC"              "SPOT_ID"

largeDF_GBstats_added <- BIG_gbAdded[,c(4:126,128)]

colnames(largeDF_GBstats_added)

##   [1] "GSM386518"           "GSM386519"           "GSM386520"          
##   [4] "GSM386521"           "GSM386522"           "GSM386523"          
##   [7] "GSM386524"           "GSM386525"           "GSM386526"          
##  [10] "GSM386527"           "GSM386528"           "GSM386529"          
##  [13] "GSM386530"           "GSM386531"           "GSM386532"          
##  [16] "GSM386533"           "GSM386534"           "GSM386535"          
##  [19] "GSM386536"           "GSM386537"           "GSM386538"          
##  [22] "GSM386539"           "GSM386540"           "GSM386541"          
##  [25] "GSM386542"           "GSM386543"           "GSM386544"          
##  [28] "GSM386545"           "GSM386546"           "GSM386547"          
##  [31] "GSM386548"           "GSM386549"           "GSM386550"          
##  [34] "GSM386551"           "GSM386552"           "GSM386553"          
##  [37] "GSM386554"           "GSM386555"           "GSM386556"          
##  [40] "GSM386557"           "GSM386558"           "GSM386559"          
##  [43] "GSM386560"           "GSM386561"           "GSM386562"          
##  [46] "GSM386563"           "GSM386564"           "GSM386565"          
##  [49] "GSM386566"           "GSM386567"           "GSM386568"          
##  [52] "GSM386569"           "GSM386570"           "GSM386571"          
##  [55] "GSM386572"           "GSM386573"           "GSM386574"          
##  [58] "GSM386575"           "GSM386576"           "GSM386577"          
##  [61] "GSM386578"           "GSM386579"           "GSM386580"          
##  [64] "GSM386581"           "GSM386582"           "GSM386583"          
##  [67] "GSM386584"           "GSM386585"           "GSM386586"          
##  [70] "GSM386587"           "GSM386588"           "GSM386589"          
##  [73] "GSM386590"           "GSM386591"           "GSM386592"          
##  [76] "GSM386593"           "GSM386594"           "GSM386595"          
##  [79] "GSM386596"           "GSM386597"           "GSM386598"          
##  [82] "GSM386599"           "GSM386600"           "GSM386601"          
##  [85] "GSM386602"           "GSM386603"           "GSM386604"          
##  [88] "GSM386605"           "GSM386606"           "GSM386607"          
##  [91] "GSM386608"           "GSM386609"           "GSM386610"          
##  [94] "GSM386611"           "GSM386612"           "GSM386613"          
##  [97] "GSM386614"           "GSM386615"           "GSM386616"          
## [100] "GSM386617"           "GSM386618"           "GSM386619"          
## [103] "GSM386620"           "GSM386621"           "GSM386622"          
## [106] "GSM386623"           "GSM386624"           "GSM386625"          
## [109] "GSM386626"           "GSM386627"           "GSM386628"          
## [112] "GSM386629"           "GSM386630"           "GSM386631"          
## [115] "GSM386632"           "GSM386633"           "healthy_mean"       
## [118] "language_mean"       "mild_mean"           "savant_mean"        
## [121] "FC_language_healthy" "FC_mild_healthy"     "FC_savant_healthy"  
## [124] "GB_ACC"

No way to merge actually because there are no gene IDs in the platform but only accession IDs. This is why we didn’t do this earlier in the project.

But we do have the gene bank accession ID attached to the largeDF_GBstats_added and the Top44_meta data frames.

write.csv(Top44_meta, 'Top44_chromosomeLocations_geneIDs_gbACC.csv', row.names=F)

write.csv(largeDF_GBstats_added, 'largeData_needsGeneIDs_onlyGB_accIDs_41472X124.csv', row.names=F)

We added gene summaries in this file with different arrangement of columns, here is the link.

==============================

Later examination of the 9p24.1 loci seen in EBV of classic Hodgkin’s lymphoma (CHL) and diffuse large b-cell lymphomas (DLBCL) in last project. After reviewing the summaries and saw that the RIC1 gene is in the same loci of the PDL1 (CD247) and PDL2 (PDCD1LG2) genes.

We have an interest in comparing the gene expression values of PDL1 and PDL2 for this Autism Spectrum Disorder data. We have to pull up the gene bank accession IDs for those genes and reference them in the BIG data set made much earlier in this project.

We are using the BIG_gbAdded dataframe to find the GB_ACC IDs for the PDL1 and PDL2 genes of CD247 and PDCD1LG2 respectively.

Lets look at a screenshot of the 9p24.1 loci of chormosome 9 where the RIC1 and CD247 and PDCD1LG2 genes live.

chromosome 9p24.1 with high variation genes in CHL and DLBCL From the image you can see that RIC1 is upstream from the PDL1 and PDL2 genes, where PDL2 is closer to it and actually has a nucleotide location that overlaps it.

Now we have to find the GB_ACC IDs for PDL1 and PDL2 to get from our BIG_gbAdded data set. And then compare fold change values across the 3 types of ASD.

Google search is PDL1 as NM_005228.3 and DQ336699.1 for PDL2.

PDCD1LG2 <- grep("NM_025239", BIG_gbAdded$GB_ACC) #integer 0 is empty, not in data
CD274 <- grep("NC_000009.12", BIG_gbAdded$GB_ACC)#integer 0 is empty, not in data

RIC1 <- BIG_gbAdded[grep("AI016962", BIG_gbAdded$GB_ACC),]    # 13248 is location

paged_table(RIC1)

The RIC1 gene is over expressed, up regulated, or stimulated as alternate descriptors for this gene on the gene expression fold change value for Language, Mild, and Savant. With the Language and Savant types of ASD most over expressed compared to healthy levels of the mean.

colnames(RIC1)

##   [1] "UID"                 "R"                   "C"                  
##   [4] "GSM386518"           "GSM386519"           "GSM386520"          
##   [7] "GSM386521"           "GSM386522"           "GSM386523"          
##  [10] "GSM386524"           "GSM386525"           "GSM386526"          
##  [13] "GSM386527"           "GSM386528"           "GSM386529"          
##  [16] "GSM386530"           "GSM386531"           "GSM386532"          
##  [19] "GSM386533"           "GSM386534"           "GSM386535"          
##  [22] "GSM386536"           "GSM386537"           "GSM386538"          
##  [25] "GSM386539"           "GSM386540"           "GSM386541"          
##  [28] "GSM386542"           "GSM386543"           "GSM386544"          
##  [31] "GSM386545"           "GSM386546"           "GSM386547"          
##  [34] "GSM386548"           "GSM386549"           "GSM386550"          
##  [37] "GSM386551"           "GSM386552"           "GSM386553"          
##  [40] "GSM386554"           "GSM386555"           "GSM386556"          
##  [43] "GSM386557"           "GSM386558"           "GSM386559"          
##  [46] "GSM386560"           "GSM386561"           "GSM386562"          
##  [49] "GSM386563"           "GSM386564"           "GSM386565"          
##  [52] "GSM386566"           "GSM386567"           "GSM386568"          
##  [55] "GSM386569"           "GSM386570"           "GSM386571"          
##  [58] "GSM386572"           "GSM386573"           "GSM386574"          
##  [61] "GSM386575"           "GSM386576"           "GSM386577"          
##  [64] "GSM386578"           "GSM386579"           "GSM386580"          
##  [67] "GSM386581"           "GSM386582"           "GSM386583"          
##  [70] "GSM386584"           "GSM386585"           "GSM386586"          
##  [73] "GSM386587"           "GSM386588"           "GSM386589"          
##  [76] "GSM386590"           "GSM386591"           "GSM386592"          
##  [79] "GSM386593"           "GSM386594"           "GSM386595"          
##  [82] "GSM386596"           "GSM386597"           "GSM386598"          
##  [85] "GSM386599"           "GSM386600"           "GSM386601"          
##  [88] "GSM386602"           "GSM386603"           "GSM386604"          
##  [91] "GSM386605"           "GSM386606"           "GSM386607"          
##  [94] "GSM386608"           "GSM386609"           "GSM386610"          
##  [97] "GSM386611"           "GSM386612"           "GSM386613"          
## [100] "GSM386614"           "GSM386615"           "GSM386616"          
## [103] "GSM386617"           "GSM386618"           "GSM386619"          
## [106] "GSM386620"           "GSM386621"           "GSM386622"          
## [109] "GSM386623"           "GSM386624"           "GSM386625"          
## [112] "GSM386626"           "GSM386627"           "GSM386628"          
## [115] "GSM386629"           "GSM386630"           "GSM386631"          
## [118] "GSM386632"           "GSM386633"           "healthy_mean"       
## [121] "language_mean"       "mild_mean"           "savant_mean"        
## [124] "FC_language_healthy" "FC_mild_healthy"     "FC_savant_healthy"  
## [127] "ID"                  "GB_ACC"              "SPOT_ID"

Lets compare median gene expression values for this gene, RIC1, over the healthy, Language, Mild, and Savant groups. And then compare to the mean values to see if skew values of outliers are in the data.

RIC1_samples <- RIC1[,c(4:119)]


healthy_RIC1 <- RIC1_samples[,c(1:29)] #29 samples the median is 0
language_RIC1 <- RIC1_samples[,c(30:60)] #31
mild_RIC1 <- RIC1_samples[,c(61:86)] #26
savant_RIC1 <- RIC1_samples[,c(87:116)] #30

healthy

paged_table(healthy_RIC1)

The median value of healthy samples is 0.

median(as.numeric(healthy_RIC1))

## [1] 0

language

paged_table(language_RIC1)

language_RIC1[,order(as.numeric(language_RIC1))]

##       GSM386547 GSM386549 GSM386550 GSM386551 GSM386553 GSM386554 GSM386556
## 13248         0         0         0         0         0         0         0
##       GSM386558 GSM386559 GSM386563 GSM386564 GSM386566 GSM386567 GSM386568
## 13248         0         0         0         0         0         0         0
##       GSM386569 GSM386570 GSM386571 GSM386572 GSM386574 GSM386575 GSM386576
## 13248         0         0         0         0         0         0         0
##       GSM386577 GSM386555 GSM386573 GSM386561 GSM386562 GSM386552 GSM386560
## 13248         0    0.0878    0.0986    0.1064     0.107    0.1083    0.1324
##       GSM386557 GSM386565 GSM386548
## 13248    0.1454    0.2891    0.9133

median(as.numeric(language_RIC1))

## [1] 0

The median of language is 0 as well.

mild

mild_RIC1[,order(as.numeric(mild_RIC1))]

##       GSM386579 GSM386580 GSM386581 GSM386582 GSM386583 GSM386584 GSM386585
## 13248         0         0         0         0         0         0         0
##       GSM386586 GSM386587 GSM386589 GSM386590 GSM386591 GSM386592 GSM386593
## 13248         0         0         0         0         0         0         0
##       GSM386594 GSM386595 GSM386596 GSM386597 GSM386598 GSM386600 GSM386601
## 13248         0         0         0         0         0         0         0
##       GSM386603 GSM386599 GSM386602 GSM386588 GSM386578
## 13248         0    0.0928    0.1064    0.1311    0.2014

median(as.numeric(mild_RIC1))

## [1] 0

The median of mild is also 0.

savant

savant_RIC1[,order(as.numeric(savant_RIC1))]

##       GSM386604 GSM386605 GSM386606 GSM386607 GSM386608 GSM386609 GSM386611
## 13248         0         0         0         0         0         0         0
##       GSM386612 GSM386613 GSM386614 GSM386615 GSM386618 GSM386619 GSM386620
## 13248         0         0         0         0         0         0         0
##       GSM386621 GSM386622 GSM386624 GSM386625 GSM386627 GSM386629 GSM386630
## 13248         0         0         0         0         0         0         0
##       GSM386631 GSM386610 GSM386616 GSM386617 GSM386633 GSM386628 GSM386626
## 13248     0.085    0.0888       0.1    0.1024    0.1349    0.1395    0.1529
##       GSM386623 GSM386632
## 13248    0.1602    0.9277

median(as.numeric(savant_RIC1))

## [1] 0

The median of savant is also 0.

Clearly, the median is going to skew the average or mean of the samples as there were no zeros in the mean values of each group. But only a few of the samples like up to a handful each of 29-31 samples each had a numeric value other than 0 for the RIC1 gene.

That could be due to the sampling error or dropping very low numeric values. But somehow RIC1 was a gene that was able to make it to the top genes in top 10 most up regulated fold change by mean value of pathology group over mean value of healthy.

Lets look at the summary stats for the GB_ACC feature in the data of 41k genes.

sum(is.na(BIG_gbAdded$GB_ACC) == FALSE)

## [1] 41472

We cannot compare to see if the data of ASD has the EBV gene targets of PDL1 and PDL2 seen in CHL or DLBCL.

Because the genes aren’t in the data by accession ID, the only way to uncover the genes by manually searching for the gene by GB_ACC.

One more gene to check is LMP1 by GB_ACC, google search is in genecards goes by TRAF3 and locus NC_000014

NC_000014 <- BIG_gbAdded[grep("NC_000014", BIG_gbAdded$GB_ACC),] #empty

LMP1 was a gene in nasopharyngeal cancer and gastric cancer from mutations caused by EBV latent type activated.

We can go ahead and look at the other genes that had summaries other than long non-codeing lnc summaries for RNA.

The Merck manual 20th edition says that vaccinations are strongly proven not to cause ASD, the genetic component is that if a monozygotic twin with ASD, then the other likely to have ASD, and 60-100% likely to have a sibling subsequently born with ASD after having a child with ASD. There are 1/68 individuals estimated to be on the spectrum of ASD, and it is a neurodevelopmental disorder often found in childhood by ritualistic patterns, avoidance, speech problems, learning disabilities, or stimming that can be hand flapping or rocking of body when stressed and a disinterest in many things especially socializing. They stim their body to avoid or cope with stress and changes. But allowing an ASD person to know ahead of time of changes helps them accept the change and adapt but can be extremely overwhelming and stressful to cope with change to any of their routines or ritualistic patterns.

This data is one of few that showed up and was done with a machine I hadn’t seen before of the TIGR type where most today is Affymetrix or Illumina in the projects I have worked with in RNA or mRNA data. It seems like many samples were skewed by the average gene expression values in the top fold change. But we used a basic simple fold change of group average of pathology over group average of healthy. There is a method to actually handle differential expression using binomial distribution and the negative binomial in linear regression that will handle mean logistic error or MLE in the DESeq package seen on many youtube instructional videos. But they go ahead with filtering out the skewed data where variation within is fine if far from the total line of averages per samples but handling variability with in low or high values. Not necessarily fine but why the negative binomial distribution is found.

For now it would be interesting to compare it to EBV known or putative gene targets and see if some connection there as well as some studies recently worked on have had statements that known EBV pathologies are expected to increase in next 10 years such as CHL 10% in next 10 years and that 90% of the population have EBV. But ASD seems to be something that is not currently known as well as EBV as to the source of pathology. EBV is airborne initially and stays in your body after infection in a latent form activated in low immune periods of life such as high stress and poor health from other factors. While ASD is seen in early developmental years of the body and brain. ASD, dyslexia, ADHD, and intellectual disability are all under the umbrella of neurodevelopmental disorders.

savant Autism GEO GSE15402 2009data MEV files

Janis Harris

2026-04-14

These are the GB_ACC IDs we will find: