MS genes top 41 with cDNA amino acids attempt2 with bioconductor

An extension to the Multiple Sclerosis gene expression data of barcodes done earlier in week.

barcodes <- read.csv('MS_41genes_15samples_Foldchange_mRNA_AAs.csv', header=T, sep=',', na.strings=c('',' ','na','NA'))
str(barcodes)

## 'data.frame':    41 obs. of  28 variables:
##  $ ID_REF                            : chr  "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC" "TGCGGTCGCGACCTTTCAGC" ...
##  $ control1.4362                     : int  1 1 31 21 27 2 28 20 1 2 ...
##  $ control2.4363                     : int  1 1 23 22 24 4 29 26 1 1 ...
##  $ control3.4364                     : int  3 1 25 30 32 2 23 13 2 4 ...
##  $ MS1_r1_4370                       : int  119 99 2 1 1 135 1 1 61 209 ...
##  $ MS1_r2_4371                       : int  106 89 2 2 1 124 1 2 114 176 ...
##  $ MS1_r3_4372                       : int  101 49 2 2 4 99 3 4 65 135 ...
##  $ MS1_r4_4373                       : int  200 99 1 2 4 235 3 2 103 242 ...
##  $ MS1_r5_4374                       : int  99 108 1 3 5 133 3 2 116 225 ...
##  $ MS2_r1_4375                       : int  75 85 3 5 3 122 1 3 101 104 ...
##  $ MS2_r2_4376                       : int  65 49 1 5 2 109 3 3 110 114 ...
##  $ MS2_r3_4377                       : int  63 53 3 2 3 97 4 4 87 107 ...
##  $ MS2_r4_4378                       : int  78 36 1 3 1 123 6 1 110 138 ...
##  $ MS2_r5_4379                       : int  87 43 1 3 4 98 1 2 68 111 ...
##  $ commercial1o.commercial_r1_4365   : int  128 83 3 2 5 106 3 3 88 124 ...
##  $ commercial2o.commercial_r2_4366   : int  98 52 2 3 1 116 4 2 105 114 ...
##  $ commercial3o.commercial_r3_4367   : int  109 60 2 3 1 156 2 2 97 156 ...
##  $ commercial4o.commercial_r4_4368   : int  93 67 1 1 5 122 1 2 86 95 ...
##  $ commercial5o.commercial_r5_4369   : int  57 48 1 1 1 99 2 2 64 103 ...
##  $ controlMeans                      : num  1.67 1 26.33 24.33 27.67 ...
##  $ MS1_Means                         : num  125 88.8 1.6 2 3 ...
##  $ MS2_Means                         : num  73.6 53.2 1.8 3.6 2.6 ...
##  $ commercial_Means                  : num  97 62 1.8 2 2.6 ...
##  $ foldchange_MS1_vs_control         : num  75 88.8 0.0608 0.0822 0.1084 ...
##  $ foldchange_MS2_vs_control         : num  44.16 53.2 0.0684 0.1479 0.094 ...
##  $ foldchange_commercialMS_vs_control: num  0.0172 0.0161 14.6296 12.1667 10.641 ...
##  $ mRNA                              : chr  "AAAGCCCCUUGGCUCAGCUA" "AAAUCUCAGCCACCAUCUAG" "AAGUGCCAGGAAAACCAGUG" "ACGCCAGCGCUGGAAAGUCG" ...
##  $ aminoAcids_simulated              : chr  "-Lys-Ala-CC-Leu-G-Leu-AG-Leu-" "-Lys-U-Leu-A-Ala-A-Pro-U-Leu-G" "AA-Val-Pro-G-Glu-AA-Pro-Val-" "AC-Ala-A-Ala-Leu-Glu-A-Val-G" ...

Read in the meta data that has commented information on methods and procedures of aquiring the lab results. When originally read in, done with read.table and it excluded the ‘!’ prefixed notation to each comment and only included the header with no observations. We will read in with read.csv and look it over to get the methods on the media type and what the barcodes are.

meta <- read.csv("GSE293036-GPL34284_series_matrix.txt", sep='\t',comment.char="#")

str(meta)

## 'data.frame':    373 obs. of  2 variables:
##  $ X.Series_title                                                                              : chr  "!Series_geo_accession" "!Series_status" "!Series_submission_date" "!Series_last_update_date" ...
##  $ Genome.wide.discovery.of.multiple.sclerosis.genetic.risk.variant.allelic.regulatory.activity: chr  "GSE293036" "Public on Nov 13 2025" "Mar 26 2025" "Nov 14 2025" ...

It is better to view it in your Rstudio panel and scroll to see the information.

information <- meta[!duplicated(meta$X.Series_title),]
str(information)

## 'data.frame':    144 obs. of  2 variables:
##  $ X.Series_title                                                                              : chr  "!Series_geo_accession" "!Series_status" "!Series_submission_date" "!Series_last_update_date" ...
##  $ Genome.wide.discovery.of.multiple.sclerosis.genetic.risk.variant.allelic.regulatory.activity: chr  "GSE293036" "Public on Nov 13 2025" "Mar 26 2025" "Nov 14 2025" ...

The information is in row 141 and row 181 on what these barcodes are.

info <- meta[c(141,181),1]
info

## [1] "Total RNA was extracted by the RNeasy Midi Kit following the manufacturer’s instruction. Extracted RNA was subjected to DNase treatment. eGFP mRNA enrichment and elution were performed using biotinylated probes as in Lu et al (PMID: 33712590), and the mRNA was reverse transcribed to cDNA using SuperScript™ IV First-Strand Synthesis System with gene specific primer MPRA_v3_Amp2Sc_R, following the manufacturer’s instruction."                                                                                                                                  
## [2] "Oligo-barcode association: Paired-end, 150bp reads were first quality filtered using Trimmomatic (v0.38) (flags: PE -phred33, LEADING:25, TRAILING:25, MINLEN:80). Read 1 was separated into the 20bp barcode region and the oligo-matching region. The trimmed oligo-matching regions of Read 1 and Read 2 were mapped back to the synthesized oligo sequences using Bowtie2 (v2.3.4.1) (flags: -X 250, -very-sensitive, -p 16). Barcodes were then associated with the oligo sequences using the read ID. Only uniquely mapped barcodes were used for downstream analysis."

These barcodes are complementary DNA that has been split into at least 80 bp reads of 150 bp reads with oligo sequences leading the first 25 and the last 25, to get the middle 30 as a region of the barcode. The read ID given each read was what would pair it back with its original oligo sequence. Sounds confusing. And two software programs used for this to split it and separate it and then to map it back to original oligo sequence. An internet search for what an oligo sequence is in cDNA barcode analysis said it is to protect the actual biological processes from artifacts the process can create that wouldn’t be there normally but added by scanning the genes. A unique oligonucleotide sequence is added to each mRNA molecule before doing reverse transcription into cDNA it is added to 5’ end of mRNA (DNA is copied in the 5’ to 3’ direction into mRNA and mRNA is translated into the 5’ to 3’ direction as it is the forward strand while the cDNA strand is the reverse strand in DNA). It is to ensure accurate quantification of gene expression as well as transcript variants as well as protect the PCR (polymerase chain reaction) bias from creating artifacts and instead showing the true biological variation in the samples. Each mRNA molecule has an oligo identifier attached.

We made the mRNA feature from the cDNA ID_REF feature barcodes, and from the mRNA attempted to gsub in the amino acid for each respective triplet codon of nucleotides earlier.

strands <- barcodes[,c(1,27,28)]
strands

##                  ID_REF                 mRNA           aminoAcids_simulated
## 1  TTTCGGGGAACCGAGTCGAT AAAGCCCCUUGGCUCAGCUA  -Lys-Ala-CC-Leu-G-Leu-AG-Leu-
## 2  TTTAGAGTCGGTGGTAGATC AAAUCUCAGCCACCAUCUAG -Lys-U-Leu-A-Ala-A-Pro-U-Leu-G
## 3  TTCACGGTCCTTTTGGTCAC AAGUGCCAGGAAAACCAGUG   AA-Val-Pro-G-Glu-AA-Pro-Val-
## 4  TGCGGTCGCGACCTTTCAGC ACGCCAGCGCUGGAAAGUCG   AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 5  TGCCCGCGCCTACAGTAGTG ACGGGCGCGGAUGUCAUCAC   -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 6  TGAATTTTAGAGTCGGTTTC ACUUAAAAUCUCAGCCAAAG   AC-Leu-Lys-U-Leu-A-Ala-Lys-G
## 7  TAGTGGCGTGAGATTTGCGT AUCACCGCACUCUAAACGCA  -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 8  TAGAGTACCGTTTTTGAACT AUCUCAUGGCAAAAACUUGA    AU-Leu-Met-Ala-Lys-AC-Leu-A
## 9  TAGACATGCAGTCGTTTCGA AUCUGUACGUCAGCAAAGCU   AU-Leu-Tyr-Val-A-Ala-AA-Ala-
## 10 GTGATTCCACAGTCGTTAAT CACUAAGGUGUCAGCAAUUA  CA-Leu-Arg-U-Val-A-Ala-A-Leu-
## 11 GTGAGGATACAGTCGGTTTT CACUCCUAUGUCAGCCAAAA   CA-Leu-Leu-U-Val-A-Ala-Lys-A
## 12 GTAGAGTCGTTACCCGACAC CAUCUCAGCAAUGGGCUGUG  -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 13 GTACCGTCGGTTGCTCGTGC CAUGGCAGCCAACGAGCACG   -His-G-Ala-Ala-AA-Arg-Ala-CG
## 14 GGTGTTGTCAGAGTCGTTAA CCACAACAGUCUCAGCAAUU  -Pro-CA-Thr-GU-Leu-A-Ala-Ile-
## 15 GGTCCTGTCTTTTCTGCTGA CCAGGACAGAAAAGACGACU  -Pro-G-Asp-A-Glu-AA-Asp-Asp-U
## 16 GGGCGTGTTTTTCTGGAGTA CCCGCACAAAAAGACCUCAU    CCC-Ala-C-Lys-AA-Asp-Leu-AU
## 17 GGCTCGGAGTCGCTGAAAAT CCGAGCCUCAGCGACUUUUA   CC-Glu-C-Leu-A-Ala-AC-Phe-UA
## 18 GGAGTCGTCTTTTTATCCCC CCUCAGCAGAAAAAUAGGGG   C-Leu-A-Ala-Glu-AA-Ile-Gln-G
## 19 GCTTCGCAGTCGTTAGAGTT CGAAGCGUCAGCAAUCUCAA   C-Glu-GC-Val-A-Ala-AU-Leu-AA
## 20 GCCGTCCTGTCTTTCTCATT CGGCAGGACAGAAAGAGUAA   CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 21 GCCGAGTCGTTATGGACCCA CGGCUCAGCAAUACCUGGGU   CGG-Leu-A-Ala-Ile-C-Leu-Gln-
## 22 GAGTCGTTTAAAGGCTCTCT CUCAGCAAAUUUCCGAGAGA  -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
## 23 GAGTCGTTCTCGTTTCGCAG CUCAGCAAGAGCAAAGCGUC  -Leu-A-Ala-Arg-Ala-Lys-C-Val-
## 24 CTCCAGAGCCGTTTTCGGTG GAGGUCUCGGCAAAAGCCAC   -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 25 CTATGGTCCCTTAGTGTTTA GAUACCAGGGAAUCACAAAU   G-Ile-CC-Arg-GA-Ile-Thr-Asp-
## 26 CTATCAACAGAGTCGCTAAT GAUAGUUGUCUCAGCGAUUA G-Ile-G-Leu-U-Leu-A-Ala-A-Leu-
## 27 CGGTTAGAGTCGATAGCTTT GCCAAUCUCAGCUAUCGAAA   -Ala-Asp-Leu-AG-Leu-UC-Glu-A
## 28 CGAGTCGTTTGACCGGCGCA GCUCAGCAAACUGGCCGCGU   G-Leu-A-Ala-AA-Leu-Ala-Ala-U
## 29 CCTCTCACCAGTCGTTTTGG GGAGAGUGGUCAGCAAAACC  G-Glu-A-Val-Val-A-Ala-AA-Thr-
## 30 CCAGTCGATTCTTTTCATAT GGUCAGCUAAGAAAAGUAUA  G-Val-AG-Leu-A-Glu-Lys-U-Ile-
## 31 CCAGCAGAGTCGCTCGAAAT GGUCGUCUCAGCGAGCUUUA   G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 32 CACCGTCGTTTTTGTGACCG GUGGCAGCAAAAACACUGGC    -Val-Ala-Ala-Lys-Thr-Leu-GC
## 33 ATCGTCGTTTTAGCCGTAGG UAGCAGCAAAAUCGGCAUCC    UA-Ala-Ala-AA-Ile-Gln-Ile-C
## 34 ATCGTCGGTCTTAGCGGTCA UAGCAGCCAGAAUCGCCAGU     UA-Ala-Ala-Arg-Ile-Ala-AGU
## 35 AGATTAACCCAATACATTAT UCUAAUUGGGUUAUGUAAUA  U-Leu-A-Leu-GG-Leu-U-Val-Ile-
## 36 AGAGTCGCTCGTTAGGATCT UCUCAGCGAGCAAUCCUAGA   U-Leu-A-Ala-A-Ala-Ile-Leu-GA
## 37 AGAGTCGATTTGTCCAATCG UCUCAGCUAAACAGGUUAGC    U-Leu-AG-Leu-Asp-Arg-Leu-GC
## 38 AGAGGCGTTCGATCTTAGAC UCUCCGCAAGCUAGAAUCUG  U-Leu-C-Ala-AG-Leu-Glu-U-Leu-
## 39 ACTGTCGTTTCAACGTTGAA UGACAGCAAAGUUGCAACUU   U-Asp-A-Ala-Lys-Leu-Gln-Leu-
## 40 ACCCCCGTCGTTAATTCGAC UGGGGGCAGCAAUUAAGCUG  -Trp-GG-Ala-Ala-A-Leu-AG-Leu-
## 41 AACGCACGGGCGTGTTAGTC UUGCGUGCCCGCACAAUCAG   -Leu-C-Val-CCC-Ala-CA-Ile-AG

It produced some singlet and double nucleotide gaps between triplet codons of amino acids that could be deletions, insertions, or translocations that happen when the nucleus of the cell has been damaged enough to destroy the DNA wrapped heavily around histones that encircle the chromosomes in the nucleus of the cell. The cell naturally repairs what it can or programs the cell to die and create another replacement from adjacent cells. When those cells get reproduced with damaged cells it can impact how genes are transcribed and proteins translated from those damaged cells. It can lead to systemic body failure and organ failure, as well as cancer and uncontrolled cell growth into tumors encapsulated which are survivable or non-encapsulated and deadlier as it mutates and turns adjacent cells into tumors and spreads until it damages major system functions of the body.

We want to see what these top strands in the barcodes of the ID_REF are and we found 41 that are up or down regulated. We can check the end features to. Lets see if most of these are up or down regulated in the MS samples that participated and the EBV cell line of B-cell commercially reproduced MS patient, when compared to the healthy control sample used.

foldchanges <- barcodes[order(barcodes$foldchange_commercialMS_vs_control, decreasing=T),c(1,27,28,24:26)]
foldchanges

##                  ID_REF                 mRNA           aminoAcids_simulated
## 3  TTCACGGTCCTTTTGGTCAC AAGUGCCAGGAAAACCAGUG   AA-Val-Pro-G-Glu-AA-Pro-Val-
## 4  TGCGGTCGCGACCTTTCAGC ACGCCAGCGCUGGAAAGUCG   AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 7  TAGTGGCGTGAGATTTGCGT AUCACCGCACUCUAAACGCA  -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 5  TGCCCGCGCCTACAGTAGTG ACGGGCGCGGAUGUCAUCAC   -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 20 GCCGTCCTGTCTTTCTCATT CGGCAGGACAGAAAGAGUAA   CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 41 AACGCACGGGCGTGTTAGTC UUGCGUGCCCGCACAAUCAG   -Leu-C-Val-CCC-Ala-CA-Ile-AG
## 15 GGTCCTGTCTTTTCTGCTGA CCAGGACAGAAAAGACGACU  -Pro-G-Asp-A-Glu-AA-Asp-Asp-U
## 8  TAGAGTACCGTTTTTGAACT AUCUCAUGGCAAAAACUUGA    AU-Leu-Met-Ala-Lys-AC-Leu-A
## 16 GGGCGTGTTTTTCTGGAGTA CCCGCACAAAAAGACCUCAU    CCC-Ala-C-Lys-AA-Asp-Leu-AU
## 25 CTATGGTCCCTTAGTGTTTA GAUACCAGGGAAUCACAAAU   G-Ile-CC-Arg-GA-Ile-Thr-Asp-
## 23 GAGTCGTTCTCGTTTCGCAG CUCAGCAAGAGCAAAGCGUC  -Leu-A-Ala-Arg-Ala-Lys-C-Val-
## 11 GTGAGGATACAGTCGGTTTT CACUCCUAUGUCAGCCAAAA   CA-Leu-Leu-U-Val-A-Ala-Lys-A
## 6  TGAATTTTAGAGTCGGTTTC ACUUAAAAUCUCAGCCAAAG   AC-Leu-Lys-U-Leu-A-Ala-Lys-G
## 29 CCTCTCACCAGTCGTTTTGG GGAGAGUGGUCAGCAAAACC  G-Glu-A-Val-Val-A-Ala-AA-Thr-
## 27 CGGTTAGAGTCGATAGCTTT GCCAAUCUCAGCUAUCGAAA   -Ala-Asp-Leu-AG-Leu-UC-Glu-A
## 33 ATCGTCGTTTTAGCCGTAGG UAGCAGCAAAAUCGGCAUCC    UA-Ala-Ala-AA-Ile-Gln-Ile-C
## 38 AGAGGCGTTCGATCTTAGAC UCUCCGCAAGCUAGAAUCUG  U-Leu-C-Ala-AG-Leu-Glu-U-Leu-
## 28 CGAGTCGTTTGACCGGCGCA GCUCAGCAAACUGGCCGCGU   G-Leu-A-Ala-AA-Leu-Ala-Ala-U
## 21 GCCGAGTCGTTATGGACCCA CGGCUCAGCAAUACCUGGGU   CGG-Leu-A-Ala-Ile-C-Leu-Gln-
## 35 AGATTAACCCAATACATTAT UCUAAUUGGGUUAUGUAAUA  U-Leu-A-Leu-GG-Leu-U-Val-Ile-
## 10 GTGATTCCACAGTCGTTAAT CACUAAGGUGUCAGCAAUUA  CA-Leu-Arg-U-Val-A-Ala-A-Leu-
## 26 CTATCAACAGAGTCGCTAAT GAUAGUUGUCUCAGCGAUUA G-Ile-G-Leu-U-Leu-A-Ala-A-Leu-
## 39 ACTGTCGTTTCAACGTTGAA UGACAGCAAAGUUGCAACUU   U-Asp-A-Ala-Lys-Leu-Gln-Leu-
## 30 CCAGTCGATTCTTTTCATAT GGUCAGCUAAGAAAAGUAUA  G-Val-AG-Leu-A-Glu-Lys-U-Ile-
## 19 GCTTCGCAGTCGTTAGAGTT CGAAGCGUCAGCAAUCUCAA   C-Glu-GC-Val-A-Ala-AU-Leu-AA
## 40 ACCCCCGTCGTTAATTCGAC UGGGGGCAGCAAUUAAGCUG  -Trp-GG-Ala-Ala-A-Leu-AG-Leu-
## 18 GGAGTCGTCTTTTTATCCCC CCUCAGCAGAAAAAUAGGGG   C-Leu-A-Ala-Glu-AA-Ile-Gln-G
## 1  TTTCGGGGAACCGAGTCGAT AAAGCCCCUUGGCUCAGCUA  -Lys-Ala-CC-Leu-G-Leu-AG-Leu-
## 37 AGAGTCGATTTGTCCAATCG UCUCAGCUAAACAGGUUAGC    U-Leu-AG-Leu-Asp-Arg-Leu-GC
## 2  TTTAGAGTCGGTGGTAGATC AAAUCUCAGCCACCAUCUAG -Lys-U-Leu-A-Ala-A-Pro-U-Leu-G
## 17 GGCTCGGAGTCGCTGAAAAT CCGAGCCUCAGCGACUUUUA   CC-Glu-C-Leu-A-Ala-AC-Phe-UA
## 36 AGAGTCGCTCGTTAGGATCT UCUCAGCGAGCAAUCCUAGA   U-Leu-A-Ala-A-Ala-Ile-Leu-GA
## 13 GTACCGTCGGTTGCTCGTGC CAUGGCAGCCAACGAGCACG   -His-G-Ala-Ala-AA-Arg-Ala-CG
## 14 GGTGTTGTCAGAGTCGTTAA CCACAACAGUCUCAGCAAUU  -Pro-CA-Thr-GU-Leu-A-Ala-Ile-
## 34 ATCGTCGGTCTTAGCGGTCA UAGCAGCCAGAAUCGCCAGU     UA-Ala-Ala-Arg-Ile-Ala-AGU
## 9  TAGACATGCAGTCGTTTCGA AUCUGUACGUCAGCAAAGCU   AU-Leu-Tyr-Val-A-Ala-AA-Ala-
## 24 CTCCAGAGCCGTTTTCGGTG GAGGUCUCGGCAAAAGCCAC   -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 12 GTAGAGTCGTTACCCGACAC CAUCUCAGCAAUGGGCUGUG  -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 31 CCAGCAGAGTCGCTCGAAAT GGUCGUCUCAGCGAGCUUUA   G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 32 CACCGTCGTTTTTGTGACCG GUGGCAGCAAAAACACUGGC    -Val-Ala-Ala-Lys-Thr-Leu-GC
## 22 GAGTCGTTTAAAGGCTCTCT CUCAGCAAAUUUCCGAGAGA  -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
##    foldchange_MS1_vs_control foldchange_MS2_vs_control
## 3                 0.06075949                0.06835443
## 4                 0.08219178                0.14794521
## 7                 0.08250000                0.11250000
## 5                 0.10843373                0.09397590
## 20                0.08955224                0.10746269
## 41                0.08737864                0.12233010
## 15                0.07800000                0.12600000
## 8                 0.11186441                0.13220339
## 16                0.09320388                0.13980583
## 25                0.09230769                0.08076923
## 23               54.00000000               52.60000000
## 11               63.00000000               47.04000000
## 6                54.45000000               41.17500000
## 29               75.80000000               54.50000000
## 27               66.36000000               52.20000000
## 33               60.80000000               47.40000000
## 38               54.90000000               42.67500000
## 28               57.23076923               50.86153846
## 21               54.40000000               42.20000000
## 35               68.52000000               48.60000000
## 10               84.60000000               49.20000000
## 26               64.40000000               41.00000000
## 39               51.60000000               44.00000000
## 30               76.35000000               62.10000000
## 19               71.76000000               46.68000000
## 40               61.60000000               64.20000000
## 18               73.05000000               61.50000000
## 1                75.00000000               44.16000000
## 37               75.40000000               60.40000000
## 2                88.80000000               53.20000000
## 17               87.60000000               70.40000000
## 36               74.10000000               63.45000000
## 13               77.55000000               45.90000000
## 14               63.90000000               61.20000000
## 34               74.60000000               77.40000000
## 9                68.85000000               71.40000000
## 24               94.80000000               71.55000000
## 12               80.00000000               71.60000000
## 31               69.40000000               58.00000000
## 32              103.40000000               61.40000000
## 22              161.40000000              110.40000000
##    foldchange_commercialMS_vs_control
## 3                        14.629629630
## 4                        12.166666667
## 7                        11.111111111
## 5                        10.641025641
## 20                       10.151515152
## 41                        9.537037037
## 15                        9.259259259
## 8                         8.939393939
## 16                        8.583333333
## 25                        7.878787879
## 23                        0.024038462
## 11                        0.023607177
## 6                         0.022259321
## 29                        0.021052632
## 27                        0.020990764
## 33                        0.020833333
## 38                        0.020544427
## 28                        0.020420986
## 21                        0.020325203
## 35                        0.019794141
## 10                        0.019707207
## 26                        0.019193858
## 39                        0.018939394
## 30                        0.018365473
## 19                        0.018315018
## 40                        0.017730496
## 18                        0.017590150
## 1                         0.017182131
## 37                        0.016501650
## 2                         0.016129032
## 17                        0.016129032
## 36                        0.015948963
## 13                        0.015741834
## 14                        0.015290520
## 34                        0.015243902
## 9                         0.015151515
## 24                        0.014556041
## 12                        0.014245014
## 31                        0.013927577
## 32                        0.010989011
## 22                        0.007022472

The commercial line has an inverse relationship with each of the participating MS patient samples in that the barcodes that are under expressed in the commercial line are over expressed or upregulated in each of the MS samples. We saw the randomForest() was able to predict with 100% accuracy the sample as healthy or MS using this data of 15 samples and 41 genes without the fold change or means added to this data. It was set on cross validation with 10,000 trees.

Up regulation or over expression is also called enhanced or enhancer gene while the down regulation or under expression of gene transcripts is called silenced or silencer genes.

In seeing if there are copy variants, what it is looking at are those genes that have different nucleotides in their mRNA made after dropping introns or non-coding regions of DNA and keeping the exons as the coding region of DNA. This is from the pre-mRNA that is actually transcribed from DNA when copied from the complementary strand or anti-sense strand to make the adjacent strand or sense strand of DNA using RNA that substitutes Thymine for Uracil as a nucleic acid.

Since DNA is a double helix strand of complementary bases made up of DNA nucleic acids, the sense strand of forward strand moves up but not towards the centromere as that is on the chromosome where the DNA is wrapped around beads very tightly called histones that are then wrapped around the chromosome. We have 23 chromosomes, the first 22 are autosomes and the 23rd chromosome is the sex chromosome for identity. There are 2 copies in each cell nucleus where 1 is a copy of mother and other of father of human. So, the sense strand is just moving they say 5’ to 3’ and not up or down, and when DNA is copied, it is copying the reverse strand or anti-sense strand in the 5’ to 3’ direction to make the molecule of 1 strand of DNA not the double strand that makes the helix shape. That strand is 3’ to 5’ and RNA is not copied in that direction. It is reiterated many times and this makes nonsense at first because you wonder why when first learning it and reviewing it again and again that you don’t even know which direction it moves, but its because it doesn’t move up or down its in space the relation of one strand to the other. So, mRNA is fed into the ribosome in 5’ to 3’ direction the same as the DNA strand is transcribed into the mRNa molecule in the 5’ to 3’ direction.

Environmental toxins, pollution, ultraviolet radiation, any harmful exposure to radiation like x-rays or gamma rays, or higher frequency radiation, viral infection, bacterial infection, immune responses, poisoning, inherited mutations, and stress that fights own body systems as in autoimmune disease can create enough disruption in the body to reach the inside of the cell and the nucleus where chromosomes and DNA is housed can damage the DNA. There are regulating mechanisms that do immune surveilance to protect DNA and repair damaged DNA when alerted to it. When the body survives some catastrophes like malaria, they can pass on gene mutations to their offspring such as sickle cell anemia that is basically the same hemoglobin molecule but the shape is no longer a discoid shape but a sickle or half moon shape to prevent the survival of the virus that causes malaria but it can cause death if left unregulated or not managed well. This example was due to a gene mutation or copy variant in the valine amino acid of the gene for hemoglobin that in normal hemoglobin gives it the discoid shape. The internet says that glutamic acid is what is replaced and it is due to the hydrophobic affect of valine so that the malaria virus won’t survive in the environment if there isn’t water in the cell.

I’m just spilling all I know about what I know from a course in genetics and biology courses. There are shapes to these proteins. They aren’t just single strand molecules, because once the protein is translated it is sent to another organelle the golgi apparatus to shape and form the final protein product that can have a single peptide shape, a double peptide shape of alpha and beta folds, a tertiary shape that bends and winds the same strand of alpha and beta folds, or a quaternary shape of more complex winds and turns of peptides formation. If the shape is destroyed, usually the protein is dysfunctional or defective such as the brain proteins in Jakob-Creutzfeld disease or AKA prion or AKA mad cow disease. This could be from a virus, pH changes, or even extremely elevated body temperatures. That information off hand was mentioned in various chiropractic courses in neurology, human biology, anatomy, physiology, pathology, and probably cell tissue anatomy and pathology. It is highly contagious once caught and the instrument used to dissect a person with it to diagnose them has to be thrown away because the tools cannot be sterilized. It is a Hazmat situation where personal protective equipment is fully suited up to prevent exposure. It is said cannibals got it from eating their own meat from instructer from India who taught Pathology and Immunity and Infection in chiropractic school.

Given what I spilled about everything I know in a ‘nut shell’ as the saying goes, when we find these genes that the barcodes represent in MS cases versus healthy cases from B-cells or peripheral mononucleiated blood cells or the plasma WBC portion of blood, we can link those copy variants or allele variations in those genes with the pathology of MS. We will then add those genes to our machine to predict various classes using same media and samples but normalized by dividing each gene by the max of all observations and compare to see if it can identify EBV infection, mononucleosis, fibromyalgia, multiple sclerosis, Burkett’s lymphoms, Hodgkin’s lymphoma, or head and neck or throat cancer or sarcoma.

I have not been having luck with that simple line of code for the bioconductor as I think it has to do more with NCBI not allowing scraping to go on as they ask as well as genecards if the request is human by checking a box after using their inquiry search box. So, we can try again and use another code or we can manually look them up. There is a ranking system that is done by barcode or 20 base pair cDNA to see if the gene is the top gene for that sequence. We will just use the top ranked gene for that sequence.

I will manually input that data and return to upload the dataset and explore the results with information on those genes.

write.csv(strands,'barcodes41_cDNA_mRNA_AAs.csv', row.names=F)

IDs <- barcodes$ID_REF
IDs

##  [1] "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC"
##  [4] "TGCGGTCGCGACCTTTCAGC" "TGCCCGCGCCTACAGTAGTG" "TGAATTTTAGAGTCGGTTTC"
##  [7] "TAGTGGCGTGAGATTTGCGT" "TAGAGTACCGTTTTTGAACT" "TAGACATGCAGTCGTTTCGA"
## [10] "GTGATTCCACAGTCGTTAAT" "GTGAGGATACAGTCGGTTTT" "GTAGAGTCGTTACCCGACAC"
## [13] "GTACCGTCGGTTGCTCGTGC" "GGTGTTGTCAGAGTCGTTAA" "GGTCCTGTCTTTTCTGCTGA"
## [16] "GGGCGTGTTTTTCTGGAGTA" "GGCTCGGAGTCGCTGAAAAT" "GGAGTCGTCTTTTTATCCCC"
## [19] "GCTTCGCAGTCGTTAGAGTT" "GCCGTCCTGTCTTTCTCATT" "GCCGAGTCGTTATGGACCCA"
## [22] "GAGTCGTTTAAAGGCTCTCT" "GAGTCGTTCTCGTTTCGCAG" "CTCCAGAGCCGTTTTCGGTG"
## [25] "CTATGGTCCCTTAGTGTTTA" "CTATCAACAGAGTCGCTAAT" "CGGTTAGAGTCGATAGCTTT"
## [28] "CGAGTCGTTTGACCGGCGCA" "CCTCTCACCAGTCGTTTTGG" "CCAGTCGATTCTTTTCATAT"
## [31] "CCAGCAGAGTCGCTCGAAAT" "CACCGTCGTTTTTGTGACCG" "ATCGTCGTTTTAGCCGTAGG"
## [34] "ATCGTCGGTCTTAGCGGTCA" "AGATTAACCCAATACATTAT" "AGAGTCGCTCGTTAGGATCT"
## [37] "AGAGTCGATTTGTCCAATCG" "AGAGGCGTTCGATCTTAGAC" "ACTGTCGTTTCAACGTTGAA"
## [40] "ACCCCCGTCGTTAATTCGAC" "AACGCACGGGCGTGTTAGTC"

# Install Bioconductor packages if not already installed
#if (!requireNamespace("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")

# Install required packages
#BiocManager::install(c("Biostrings"), ask = FALSE)
#install.packages("dplyr")

# Load libraries
library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: generics

## 
## Attaching package: 'generics'

## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## Loading required package: Seqinfo

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

#library(dplyr) #needs higher version

#install.packages('dplyr')

Installing package into ‘C:/…/AppData/Local/R/win-library/4.5’ (as ‘lib’ is unspecified) trying URL ‘https://cran.rstudio.com/bin/windows/contrib/4.5/dplyr_1.1.4.zip’ Content type ‘application/zip’ length 1593322 bytes (1.5 MB) downloaded 1.5 MB

package ‘dplyr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in C:...7ZA_packages Loading required package: BiocGenerics Loading required package: generics

Attaching package: ‘generics’

The following objects are masked from ‘package:base’:

as.difftime, as.factor, as.ordered, intersect, is.element,
setdiff, setequal, union

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename,
cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, is.unsorted,
lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
rownames, sapply, saveRDS, table, tapply, unique, unsplit,
which.max, which.min

Loading required package: S4Vectors Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

findMatches

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges

Attaching package: ‘IRanges’

The following object is masked from ‘package:grDevices’:

windows

Loading required package: XVector Loading required package: Seqinfo

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

strsplit

Error: package or namespace load failed for ‘dplyr’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace ‘rlang’ 1.1.6 is already loaded, but >= 1.1.7 is required

#Example: Known barcode-to-gene mapping table
barcode_map <- data.frame(
  barcode = c("ACTGAC", "TTGACA", "GCTTGA"),
  gene_id = c("GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)
barcode_map

##   barcode gene_id
## 1  ACTGAC   GeneA
## 2  TTGACA   GeneB
## 3  GCTTGA   GeneC

barcode gene_id ACTGAC GeneA
TTGACA GeneB
GCTTGA GeneC
3 rows

# Example: Sequenced barcodes from experiment
sequenced_barcodes <- c("ACTGAC", "TTGACA", "GCTTGA", "ACTGAT")  # last one has a mismatch
sequenced_barcodes

## [1] "ACTGAC" "TTGACA" "GCTTGA" "ACTGAT"

[1] “ACTGAC” “TTGACA” “GCTTGA” “ACTGAT”

#Convert to DNAStringSet for sequence operations
seq_set <- DNAStringSet(sequenced_barcodes)
ref_set <- DNAStringSet(barcode_map$barcode)

seq_set

## DNAStringSet object of length 4:
##     width seq
## [1]     6 ACTGAC
## [2]     6 TTGACA
## [3]     6 GCTTGA
## [4]     6 ACTGAT

DNAStringSet object of length 4: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA [4] 6 ACTGAT

ref_set

## DNAStringSet object of length 3:
##     width seq
## [1]     6 ACTGAC
## [2]     6 TTGACA
## [3]     6 GCTTGA

DNAStringSet object of length 3: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA

#Convert to DNAStringSet for sequence operations
seq_set1 <- DNAStringSet(IDs)
ref_set1 <- DNAStringSet(barcode_map$barcode)

seq_set1

## DNAStringSet object of length 41:
##      width seq
##  [1]    20 TTTCGGGGAACCGAGTCGAT
##  [2]    20 TTTAGAGTCGGTGGTAGATC
##  [3]    20 TTCACGGTCCTTTTGGTCAC
##  [4]    20 TGCGGTCGCGACCTTTCAGC
##  [5]    20 TGCCCGCGCCTACAGTAGTG
##  ...   ... ...
## [37]    20 AGAGTCGATTTGTCCAATCG
## [38]    20 AGAGGCGTTCGATCTTAGAC
## [39]    20 ACTGTCGTTTCAACGTTGAA
## [40]    20 ACCCCCGTCGTTAATTCGAC
## [41]    20 AACGCACGGGCGTGTTAGTC

DNAStringSet object of length 41: width seq [1] 20 TTTCGGGGAACCGAGTCGAT [2] 20 TTTAGAGTCGGTGGTAGATC [3] 20 TTCACGGTCCTTTTGGTCAC [4] 20 TGCGGTCGCGACCTTTCAGC [5] 20 TGCCCGCGCCTACAGTAGTG … … … [37] 20 AGAGTCGATTTGTCCAATCG [38] 20 AGAGGCGTTCGATCTTAGAC [39] 20 ACTGTCGTTTCAACGTTGAA [40] 20 ACCCCCGTCGTTAATTCGAC [41] 20 AACGCACGGGCGTGTTAGTC

ref_set1

## DNAStringSet object of length 3:
##     width seq
## [1]     6 ACTGAC
## [2]     6 TTGACA
## [3]     6 GCTTGA

DNAStringSet object of length 3: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA

# Function to find closest matching barcode (allowing mismatches)
map_barcode <- function(query_seq, ref_set, ref_map, max_mismatch = 1) {
  # Find matches allowing up to max_mismatch mismatches
  hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch, with.indels = FALSE)

  if (sum(hits) == 1) {
    matched_barcode <- as.character(ref_set[which(hits == 1)])
    return(ref_map$gene_id[ref_map$barcode == matched_barcode])
  } else {
    return(NA) # No unique match
  }
}

map_barcode

## function (query_seq, ref_set, ref_map, max_mismatch = 1) 
## {
##     hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch, 
##         with.indels = FALSE)
##     if (sum(hits) == 1) {
##         matched_barcode <- as.character(ref_set[which(hits == 
##             1)])
##         return(ref_map$gene_id[ref_map$barcode == matched_barcode])
##     }
##     else {
##         return(NA)
##     }
## }

function(query_seq, ref_set, ref_map, max_mismatch = 1) { # Find matches allowing up to max_mismatch mismatches hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch, with.indels = FALSE)

if (sum(hits) == 1) { matched_barcode <- as.character(ref_set[which(hits == 1)]) return(ref_map\(gene_id[ref_map\)barcode == matched_barcode]) } else { return(NA) # No unique match } }

# Map all sequenced barcodes to genes
mapped_genes <- sapply(sequenced_barcodes, map_barcode, ref_set = ref_set, ref_map = barcode_map)

mapped_genes

##  ACTGAC  TTGACA  GCTTGA  ACTGAT 
## "GeneA" "GeneB" "GeneC" "GeneA"

ACTGAC TTGACA GCTTGA ACTGAT “GeneA” “GeneB” “GeneC” “GeneA”

# Combine results into a data frame
result <- data.frame(
  sequenced_barcode = sequenced_barcodes,
  mapped_gene = mapped_genes,
  stringsAsFactors = FALSE
)

print(result)

##        sequenced_barcode mapped_gene
## ACTGAC            ACTGAC       GeneA
## TTGACA            TTGACA       GeneB
## GCTTGA            GCTTGA       GeneC
## ACTGAT            ACTGAT       GeneA

   sequenced_barcode mapped_gene

ACTGAC ACTGAC GeneA TTGACA TTGACA GeneB GCTTGA GCTTGA GeneC ACTGAT ACTGAT GeneA

The internet gave me example above but my dplyr package won’t load now and needs a new version higher than 1.1.4

BiocManager::available()

‘getOption(“repos”)’ replaces Bioconductor standard repositories, see ‘help(“repositories”, package = “BiocManager”)’ for details. Replacement repositories: CRAN: https://cran.rstudio.com/ [1] “a4”
[2] “a4Base”
[3] “a4Classif”
[4] “a4Core”
[5] “a4Preproc”
[6] “a4Reporting”
[7] “AalenJohansen”
[8] “aamatch”
[9] “AATtools”

…

[993] “atena”
[994] “ath1121501.db”
[995] “ath1121501cdf”
[996] “ath1121501frmavecs”
[997] “ath1121501probe”
[998] “Athlytics”
[999] “atime”
[1000] “atlas”
[ reached ‘max’ / getOption(“max.print”) – omitted 25643 entries ]

Different tutorial on biostrings

cDNA <- DNAStringSet(barcodes$ID_REF)
cDNA

## DNAStringSet object of length 41:
##      width seq
##  [1]    20 TTTCGGGGAACCGAGTCGAT
##  [2]    20 TTTAGAGTCGGTGGTAGATC
##  [3]    20 TTCACGGTCCTTTTGGTCAC
##  [4]    20 TGCGGTCGCGACCTTTCAGC
##  [5]    20 TGCCCGCGCCTACAGTAGTG
##  ...   ... ...
## [37]    20 AGAGTCGATTTGTCCAATCG
## [38]    20 AGAGGCGTTCGATCTTAGAC
## [39]    20 ACTGTCGTTTCAACGTTGAA
## [40]    20 ACCCCCGTCGTTAATTCGAC
## [41]    20 AACGCACGGGCGTGTTAGTC

proteins <- AAStringSet(barcodes$ID_REF)
proteins

## AAStringSet object of length 41:
##      width seq
##  [1]    20 TTTCGGGGAACCGAGTCGAT
##  [2]    20 TTTAGAGTCGGTGGTAGATC
##  [3]    20 TTCACGGTCCTTTTGGTCAC
##  [4]    20 TGCGGTCGCGACCTTTCAGC
##  [5]    20 TGCCCGCGCCTACAGTAGTG
##  ...   ... ...
## [37]    20 AGAGTCGATTTGTCCAATCG
## [38]    20 AGAGGCGTTCGATCTTAGAC
## [39]    20 ACTGTCGTTTCAACGTTGAA
## [40]    20 ACCCCCGTCGTTAATTCGAC
## [41]    20 AACGCACGGGCGTGTTAGTC

AAStringSet object of length 41: width seq [1] 20 TTTCGGGGAACCGAGTCGAT [2] 20 TTTAGAGTCGGTGGTAGATC [3] 20 TTCACGGTCCTTTTGGTCAC [4] 20 TGCGGTCGCGACCTTTCAGC [5] 20 TGCCCGCGCCTACAGTAGTG … … … [37] 20 AGAGTCGATTTGTCCAATCG [38] 20 AGAGGCGTTCGATCTTAGAC [39] 20 ACTGTCGTTTCAACGTTGAA [40] 20 ACCCCCGTCGTTAATTCGAC [41] 20 AACGCACGGGCGTGTTAGTC

RNAs <- RNAStringSet(barcodes$ID_REF)
RNAs

Error in .Call2(“new_XStringSet_from_CHARACTER”, class(x0), elementType(x0), : key 84 (char ‘T’) not in lookup table

Needs the strings to have U

Pro <- translate(cDNA)
Pro

## AAStringSet object of length 41:
##      width seq
##  [1]     6 FRGTES
##  [2]     6 FRVGGR
##  [3]     6 FTVLLV
##  [4]     6 CGRDLS
##  [5]     6 CPRLQ*
##  ...   ... ...
## [37]     6 RVDLSN
## [38]     6 RGVRS*
## [39]     6 TVVSTL
## [40]     6 TPVVNS
## [41]     6 NARAC*

AAStringSet object of length 41: width seq [1] 6 FRGTES [2] 6 FRVGGR [3] 6 FTVLLV [4] 6 CGRDLS [5] 6 CPRLQ … … … [37] 6 RVDLSN [38] 6 RGVRS [39] 6 TVVSTL [40] 6 TPVVNS [41] 6 NARAC*

Those are the one letter abbreviations for the 21 amino acids seen above.

It wasn’t working for what I needed, I believe it is due to the source not allowing what it thinks is web scraping off the site, or maybe user error. I didn’t get anything usable from AI generated code to retrieve the gene names for these 20 base long nucleotides of DNA as complementary DNA or cDNA. So, took a few days to enter them in manually.

==========================================

I had to use 2-3 days to tediously find all the gene associations top ranked for each 20 base pair cDNA strand using BLAST for Basic Local Alignment Scoring Tool or something similar in name.

Here are the details and you can get the file here.

To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, it is part of NIH or National Institute of Health. There is a text box at the top left to enter the query sequence where I pasted the cDNA 20bp string and then scrolled down to select the ‘Genomic + transcript databases’ in the ‘Database’ row, It selects Human Genomic + Transcript (Human G + T), then scroll down and select the option to open results in new window to easily get back to that window to enter in each of the 41 top genes used from our work earlier. Then select the tab at the bottom for the blue ‘BLAST’ for something similar to Basic Linear Alignmnet Scoring Tool that will give all results by a measure it ranks them by points in deductions for wrong and additions for correct in the string. I ignored all top section transcripts and scrolled down to 2nd section of genome results and selected the top score in ‘Per Ident’ column of 100%, then looked to other columns and the top listed max score, then the top score which was the highest, and sometimes if low and others in max score were higher in top score I would use those if the Per Ident field is 100% and the query cover was 100% or close. Query cover is 100% if all 20 base pairs found in the string of DNA. I found that many of the top gene base pair strands are from NC_000002.12, but it is a long gene coverage field and many other alternate genes overlap that section. I also avoided use of filling the geneSynonyms field of this database if there was no entry in genecards.org for an Ensembl ID. So some loci involved in other research but not named specifically as a gene or other were closer but had no Ensembl ID. And the populated first web link is the results of the top ranked genome range selected but some of the given or provided genes that were involved with that string on the range were not zoomed in as close to that string set but other genes were and so the closest gene was found from the chromosome view that is the blast web link found in selecting the ‘graphics’ link over the selected top ranked gene and then using the ‘link to this view’ option to copy the link to the graphic of the chromosome view specific to that target region. The bottom of the chromosome is zoomed in or panned in to the string searched and gene rank selected, from the top portion that has a tiny window of where that band of region is located and the green chunks along it also tiny are alternate genes or exons or mRNA, noncoding, uncharacterized loci, etc. that are in the neighborhood of the gene strip selected and searched. But not all were close to the string of nucleotides searched and some had overlap immediately when looking at the chromosome view. The reverse strand is the complementary DNA strand and the arrow points left, if the arrow points right it is the forward strand of DNA or opposite strand from the cDNA strand.

I didn’t go to genecards.org to actually put in the definition of each gene once I had all the information gathered that took an hour to get about five of the 41 strings each due to quality filtering and selection based on above procedure.

Lets go ahead and read in that table and list are top gene associations with the results that we couldn’t get from Bioconductor.

genes <- read.csv('barcodes41_cDNA_mRNA_AAs_BLAST_BLAT_ENSEMBL_added.csv',sep=',', header=T, na.strings=c('',' ','na','NA'))

str(genes)

## 'data.frame':    41 obs. of  9 variables:
##  $ ID_REF                             : chr  "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC" "TGCGGTCGCGACCTTTCAGC" ...
##  $ ensembl_blast_blat_short           : chr  "NC_000008.11" "NC_000002.12" "NC_000001.11" "NC_000002.12" ...
##  $ geneSynonyms                       : chr  "CPA6" "RPL7P61" "DDAH1" "KLHL29" ...
##  $ location                           : chr  "chromosome 8, GRCh38.p14 Primary Assembly, Range 1: 67542483 to 67542496, forward strand,\nSequence ID: NC_0000"| __truncated__ "chromosome 2, GRCh38.p14 Primary Assembly, Range 1: 163177186 to 163177200, reverse strand,\nSequence ID: NC_00"| __truncated__ "chromosome 1, GRCh38.p14 Primary Assembly, Range 1: 85369881 to 85369895, reverse strand\nSequence ID: NC_00000"| __truncated__ "chromosome 2, GRCh38.p14 Primary Assembly, Range 1: 23651127 to 23651139, reverse strand,\nSequence ID: NC_0000"| __truncated__ ...
##  $ Ensembl_Name                       : chr  "ENSG00000165078" "ENSG00000230282" "ENSG00000153904" "ENSG00000119771" ...
##  $ weblink                            : chr  "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815590" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815596" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815597" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815596" ...
##  $ blast_chromosome_graphics_view_link: chr  "https://www.ncbi.nlm.nih.gov/nuccore/568815590?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815596?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815597?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815596?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ ...
##  $ note                               : chr  "Alignment:\tQuery_780917 x NC_000008.11\nAnchor:\tNC_000008.11 (67,542,483..67,542,496)\nQuery:\tQuery_780917 ("| __truncated__ "Alignment:\tQuery_3907771 x NC_000002.12\nAnchor:\tNC_000002.12 (103,430,601..186,861,488)\nQuery:\tQuery_39077"| __truncated__ "Alignment:\tQuery_3572341 x NC_000001.11\nAnchor:\tNC_000001.11 (85,369,881..85,369,895)\nQuery:\tQuery_3572341"| __truncated__ "Alignment:\tQuery_3047309 x NC_000002.12\nAnchor:\tNC_000002.12 (23,651,127..23,651,139)\nQuery:\tQuery_3047309"| __truncated__ ...
##  $ selecting_best_ranked_gene         : chr  "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ ...

Lets just view the genes by synonym name and by Ensembl ID

geneNames <- genes[,c(1,3,5)]
geneNames

##                  ID_REF geneSynonyms    Ensembl_Name
## 1  TTTCGGGGAACCGAGTCGAT         CPA6 ENSG00000165078
## 2  TTTAGAGTCGGTGGTAGATC      RPL7P61 ENSG00000230282
## 3  TTCACGGTCCTTTTGGTCAC        DDAH1 ENSG00000153904
## 4  TGCGGTCGCGACCTTTCAGC       KLHL29 ENSG00000119771
## 5  TGCCCGCGCCTACAGTAGTG     RPL31P30 ENSG00000230702
## 6  TGAATTTTAGAGTCGGTTTC      ATP5MC3 ENSG00000154518
## 7  TAGTGGCGTGAGATTTGCGT    RNU6-280P ENSG00000201015
## 8  TAGAGTACCGTTTTTGAACT         TFPI ENSG00000003436
## 9  TAGACATGCAGTCGTTTCGA      ST3GAL3 ENSG00000126091
## 10 GTGATTCCACAGTCGTTAAT         CDH8 ENSG00000150394
## 11 GTGAGGATACAGTCGGTTTT       CAMKMT ENSG00000143919
## 12 GTAGAGTCGTTACCCGACAC      SMARCA2 ENSG00000080503
## 13 GTACCGTCGGTTGCTCGTGC   POM121L15P ENSG00000161103
## 14 GGTGTTGTCAGAGTCGTTAA        NLRP3 ENSG00000162711
## 15 GGTCCTGTCTTTTCTGCTGA        ACTN4 ENSG00000130402
## 16 GGGCGTGTTTTTCTGGAGTA        WDR35 ENSG00000118965
## 17 GGCTCGGAGTCGCTGAAAAT        ESRP1 ENSG00000104413
## 18 GGAGTCGTCTTTTTATCCCC        ERBB4 ENSG00000178568
## 19 GCTTCGCAGTCGTTAGAGTT     TMEM200A ENSG00000164484
## 20 GCCGTCCTGTCTTTCTCATT    MIR4432HG ENSG00000228590
## 21 GCCGAGTCGTTATGGACCCA    KCNA6-AS1 ENSG00000256988
## 22 GAGTCGTTTAAAGGCTCTCT       CLINT1 ENSG00000113282
## 23 GAGTCGTTCTCGTTTCGCAG       AQP12B ENSG00000185176
## 24 CTCCAGAGCCGTTTTCGGTG       HSD3B1 ENSG00000203857
## 25 CTATGGTCCCTTAGTGTTTA         LZIC ENSG00000162441
## 26 CTATCAACAGAGTCGCTAAT      CRLF3P3 ENSG00000228225
## 27 CGGTTAGAGTCGATAGCTTT        HDAC4 ENSG00000068024
## 28 CGAGTCGTTTGACCGGCGCA      SLIRPP1 ENSG00000227505
## 29 CCTCTCACCAGTCGTTTTGG      SCHLAP1 ENSG00000281131
## 30 CCAGTCGATTCTTTTCATAT     NDUFB5P1 ENSG00000251025
## 31 CCAGCAGAGTCGCTCGAAAT       FAM20C ENSG00000177706
## 32 CACCGTCGTTTTTGTGACCG       IGFBP7 ENSG00000163453
## 33 ATCGTCGTTTTAGCCGTAGG      ST8SIA4 ENSG00000113532
## 34 ATCGTCGGTCTTAGCGGTCA        TSHZ2 ENSG00000182463
## 35 AGATTAACCCAATACATTAT      PIK3C2A ENSG00000011405
## 36 AGAGTCGCTCGTTAGGATCT        CASC8 ENSG00000246228
## 37 AGAGTCGATTTGTCCAATCG       PLSCR5 ENSG00000231213
## 38 AGAGGCGTTCGATCTTAGAC    PDE4DIPP5 ENSG00000275064
## 39 ACTGTCGTTTCAACGTTGAA        STX12 ENSG00000117758
## 40 ACCCCCGTCGTTAATTCGAC      CACNA1E ENSG00000198216
## 41 AACGCACGGGCGTGTTAGTC      TSNARE1 ENSG00000171045

Lets now combine this new information to our other database and see the results of the fold changes on the samples with the gene name, we will later add in the gene summaries and see what type of picture we can develop in our minds about the process of multiple sclerosis.

DATA <- merge(geneNames, barcodes, by.x='ID_REF', by.y="ID_REF")
str(DATA)

## 'data.frame':    41 obs. of  30 variables:
##  $ ID_REF                            : chr  "AACGCACGGGCGTGTTAGTC" "ACCCCCGTCGTTAATTCGAC" "ACTGTCGTTTCAACGTTGAA" "AGAGGCGTTCGATCTTAGAC" ...
##  $ geneSynonyms                      : chr  "TSNARE1" "CACNA1E" "STX12" "PDE4DIPP5" ...
##  $ Ensembl_Name                      : chr  "ENSG00000171045" "ENSG00000198216" "ENSG00000117758" "ENSG00000275064" ...
##  $ control1.4362                     : int  25 1 1 3 2 2 2 1 1 1 ...
##  $ control2.4363                     : int  46 1 1 4 2 1 1 1 1 1 ...
##  $ control3.4364                     : int  32 1 1 1 2 1 2 1 1 1 ...
##  $ MS1_r1_4370                       : int  1 25 49 63 70 94 160 17 78 103 ...
##  $ MS1_r2_4371                       : int  2 104 31 180 172 101 66 81 34 63 ...
##  $ MS1_r3_4372                       : int  1 34 35 107 68 49 75 70 50 85 ...
##  $ MS1_r4_4373                       : int  3 75 72 203 224 152 147 102 72 124 ...
##  $ MS1_r5_4374                       : int  8 70 71 179 220 98 123 103 70 142 ...
##  $ MS2_r1_4375                       : int  6 52 34 118 107 74 88 66 41 55 ...
##  $ MS2_r2_4376                       : int  6 60 51 94 126 85 75 110 44 61 ...
##  $ MS2_r3_4377                       : int  3 68 21 122 120 88 81 89 53 52 ...
##  $ MS2_r4_4378                       : int  3 75 67 112 132 95 75 74 49 75 ...
##  $ MS2_r5_4379                       : int  3 66 47 123 119 81 86 48 50 64 ...
##  $ commercial1o.commercial_r1_4365   : int  5 65 56 143 108 70 62 68 69 96 ...
##  $ commercial2o.commercial_r2_4366   : int  4 50 52 133 117 98 93 46 23 65 ...
##  $ commercial3o.commercial_r3_4367   : int  3 79 65 153 181 103 102 85 66 123 ...
##  $ commercial4o.commercial_r4_4368   : int  2 42 24 107 126 77 79 73 47 64 ...
##  $ commercial5o.commercial_r5_4369   : int  4 46 67 113 74 70 85 56 35 107 ...
##  $ controlMeans                      : num  34.33 1 1 2.67 2 ...
##  $ MS1_Means                         : num  3 61.6 51.6 146.4 150.8 ...
##  $ MS2_Means                         : num  4.2 64.2 44 113.8 120.8 ...
##  $ commercial_Means                  : num  3.6 56.4 52.8 129.8 121.2 ...
##  $ foldchange_MS1_vs_control         : num  0.0874 61.6 51.6 54.9 75.4 ...
##  $ foldchange_MS2_vs_control         : num  0.122 64.2 44 42.675 60.4 ...
##  $ foldchange_commercialMS_vs_control: num  9.537 0.0177 0.0189 0.0205 0.0165 ...
##  $ mRNA                              : chr  "UUGCGUGCCCGCACAAUCAG" "UGGGGGCAGCAAUUAAGCUG" "UGACAGCAAAGUUGCAACUU" "UCUCCGCAAGCUAGAAUCUG" ...
##  $ aminoAcids_simulated              : chr  "-Leu-C-Val-CCC-Ala-CA-Ile-AG" "-Trp-GG-Ala-Ala-A-Leu-AG-Leu-" "U-Asp-A-Ala-Lys-Leu-Gln-Leu-" "U-Leu-C-Ala-AG-Leu-Glu-U-Leu-" ...

I know why the values are running opposite directions for the commercial MS and the actual participating MS patients. I accidentally put the control means over the commercial means in obtaining the fold change values. So we can go ahead and correct that right now.

Since its control means / commercial means, we have to multiply by commercial means/control means to get 1 then multiply by commercial means/control means.We still have those values in our data frame.

colnames(DATA)

##  [1] "ID_REF"                             "geneSynonyms"                      
##  [3] "Ensembl_Name"                       "control1.4362"                     
##  [5] "control2.4363"                      "control3.4364"                     
##  [7] "MS1_r1_4370"                        "MS1_r2_4371"                       
##  [9] "MS1_r3_4372"                        "MS1_r4_4373"                       
## [11] "MS1_r5_4374"                        "MS2_r1_4375"                       
## [13] "MS2_r2_4376"                        "MS2_r3_4377"                       
## [15] "MS2_r4_4378"                        "MS2_r5_4379"                       
## [17] "commercial1o.commercial_r1_4365"    "commercial2o.commercial_r2_4366"   
## [19] "commercial3o.commercial_r3_4367"    "commercial4o.commercial_r4_4368"   
## [21] "commercial5o.commercial_r5_4369"    "controlMeans"                      
## [23] "MS1_Means"                          "MS2_Means"                         
## [25] "commercial_Means"                   "foldchange_MS1_vs_control"         
## [27] "foldchange_MS2_vs_control"          "foldchange_commercialMS_vs_control"
## [29] "mRNA"                               "aminoAcids_simulated"

Change foldchange_commercialMS_vs_control so that it is multiplied by commercial_Means/controlMeans to get 1 and then multiply by commercial_Means/controlMeans again to get the actual fold change value we originally wanted. You can go back to the original publication posted earlier called, “Machine Learning on MS data 9M rows for Top Allele variant Genes” to see the error and why the values are not correlated with the other MS samples when they should be. Actually, you can just add the column now but it will be after selecting top genes. with just the actual commercial_Means/controlMeans.

DATA$RealFC_commercial_control_means <- DATA$commercial_Means / DATA$controlMeans

DATA[,c(26:28,31)]

##    foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1                 0.08737864                0.12233010
## 2                61.60000000               64.20000000
## 3                51.60000000               44.00000000
## 4                54.90000000               42.67500000
## 5                75.40000000               60.40000000
## 6                74.10000000               63.45000000
## 7                68.52000000               48.60000000
## 8                74.60000000               77.40000000
## 9                60.80000000               47.40000000
## 10              103.40000000               61.40000000
## 11               69.40000000               58.00000000
## 12               76.35000000               62.10000000
## 13               75.80000000               54.50000000
## 14               57.23076923               50.86153846
## 15               66.36000000               52.20000000
## 16               64.40000000               41.00000000
## 17                0.09230769                0.08076923
## 18               94.80000000               71.55000000
## 19               54.00000000               52.60000000
## 20              161.40000000              110.40000000
## 21               54.40000000               42.20000000
## 22                0.08955224                0.10746269
## 23               71.76000000               46.68000000
## 24               73.05000000               61.50000000
## 25               87.60000000               70.40000000
## 26                0.09320388                0.13980583
## 27                0.07800000                0.12600000
## 28               63.90000000               61.20000000
## 29               77.55000000               45.90000000
## 30               80.00000000               71.60000000
## 31               63.00000000               47.04000000
## 32               84.60000000               49.20000000
## 33               68.85000000               71.40000000
## 34                0.11186441                0.13220339
## 35                0.08250000                0.11250000
## 36               54.45000000               41.17500000
## 37                0.10843373                0.09397590
## 38                0.08219178                0.14794521
## 39                0.06075949                0.06835443
## 40               88.80000000               53.20000000
## 41               75.00000000               44.16000000
##    foldchange_commercialMS_vs_control RealFC_commercial_control_means
## 1                         9.537037037                      0.10485437
## 2                         0.017730496                     56.40000000
## 3                         0.018939394                     52.80000000
## 4                         0.020544427                     48.67500000
## 5                         0.016501650                     60.60000000
## 6                         0.015948963                     62.70000000
## 7                         0.019794141                     50.52000000
## 8                         0.015243902                     65.60000000
## 9                         0.020833333                     48.00000000
## 10                        0.010989011                     91.00000000
## 11                        0.013927577                     71.80000000
## 12                        0.018365473                     54.45000000
## 13                        0.021052632                     47.50000000
## 14                        0.020420986                     48.96923077
## 15                        0.020990764                     47.64000000
## 16                        0.019193858                     52.10000000
## 17                        7.878787879                      0.12692308
## 18                        0.014556041                     68.70000000
## 19                        0.024038462                     41.60000000
## 20                        0.007022472                    142.40000000
## 21                        0.020325203                     49.20000000
## 22                       10.151515152                      0.09850746
## 23                        0.018315018                     54.60000000
## 24                        0.017590150                     56.85000000
## 25                        0.016129032                     62.00000000
## 26                        8.583333333                      0.11650485
## 27                        9.259259259                      0.10800000
## 28                        0.015290520                     65.40000000
## 29                        0.015741834                     63.52500000
## 30                        0.014245014                     70.20000000
## 31                        0.023607177                     42.36000000
## 32                        0.019707207                     50.74285714
## 33                        0.015151515                     66.00000000
## 34                        8.939393939                      0.11186441
## 35                       11.111111111                      0.09000000
## 36                        0.022259321                     44.92500000
## 37                       10.641025641                      0.09397590
## 38                       12.166666667                      0.08219178
## 39                       14.629629630                      0.06835443
## 40                        0.016129032                     62.00000000
## 41                        0.017182131                     58.20000000

You can see now that the values are correlated and not inversely related, as the foldchange is high in the commercial set when the participating MS patient samples are high, and low when those same samples are low. As it should be. Lets get rid of the other inverse relation fold change or name it inverse relation commercial. Lets do the latter just to see the inverse relation when control means are over the sample pathology means. Instead of vice versa.

colnames(DATA)[28] <- "inverse_FC_control_over_commercial_means"
DATA <- DATA[,c(1:27,31,28:30)]
colnames(DATA)

##  [1] "ID_REF"                                  
##  [2] "geneSynonyms"                            
##  [3] "Ensembl_Name"                            
##  [4] "control1.4362"                           
##  [5] "control2.4363"                           
##  [6] "control3.4364"                           
##  [7] "MS1_r1_4370"                             
##  [8] "MS1_r2_4371"                             
##  [9] "MS1_r3_4372"                             
## [10] "MS1_r4_4373"                             
## [11] "MS1_r5_4374"                             
## [12] "MS2_r1_4375"                             
## [13] "MS2_r2_4376"                             
## [14] "MS2_r3_4377"                             
## [15] "MS2_r4_4378"                             
## [16] "MS2_r5_4379"                             
## [17] "commercial1o.commercial_r1_4365"         
## [18] "commercial2o.commercial_r2_4366"         
## [19] "commercial3o.commercial_r3_4367"         
## [20] "commercial4o.commercial_r4_4368"         
## [21] "commercial5o.commercial_r5_4369"         
## [22] "controlMeans"                            
## [23] "MS1_Means"                               
## [24] "MS2_Means"                               
## [25] "commercial_Means"                        
## [26] "foldchange_MS1_vs_control"               
## [27] "foldchange_MS2_vs_control"               
## [28] "RealFC_commercial_control_means"         
## [29] "inverse_FC_control_over_commercial_means"
## [30] "mRNA"                                    
## [31] "aminoAcids_simulated"

DATA[,c(26:29)]

##    foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1                 0.08737864                0.12233010
## 2                61.60000000               64.20000000
## 3                51.60000000               44.00000000
## 4                54.90000000               42.67500000
## 5                75.40000000               60.40000000
## 6                74.10000000               63.45000000
## 7                68.52000000               48.60000000
## 8                74.60000000               77.40000000
## 9                60.80000000               47.40000000
## 10              103.40000000               61.40000000
## 11               69.40000000               58.00000000
## 12               76.35000000               62.10000000
## 13               75.80000000               54.50000000
## 14               57.23076923               50.86153846
## 15               66.36000000               52.20000000
## 16               64.40000000               41.00000000
## 17                0.09230769                0.08076923
## 18               94.80000000               71.55000000
## 19               54.00000000               52.60000000
## 20              161.40000000              110.40000000
## 21               54.40000000               42.20000000
## 22                0.08955224                0.10746269
## 23               71.76000000               46.68000000
## 24               73.05000000               61.50000000
## 25               87.60000000               70.40000000
## 26                0.09320388                0.13980583
## 27                0.07800000                0.12600000
## 28               63.90000000               61.20000000
## 29               77.55000000               45.90000000
## 30               80.00000000               71.60000000
## 31               63.00000000               47.04000000
## 32               84.60000000               49.20000000
## 33               68.85000000               71.40000000
## 34                0.11186441                0.13220339
## 35                0.08250000                0.11250000
## 36               54.45000000               41.17500000
## 37                0.10843373                0.09397590
## 38                0.08219178                0.14794521
## 39                0.06075949                0.06835443
## 40               88.80000000               53.20000000
## 41               75.00000000               44.16000000
##    RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 1                       0.10485437                              9.537037037
## 2                      56.40000000                              0.017730496
## 3                      52.80000000                              0.018939394
## 4                      48.67500000                              0.020544427
## 5                      60.60000000                              0.016501650
## 6                      62.70000000                              0.015948963
## 7                      50.52000000                              0.019794141
## 8                      65.60000000                              0.015243902
## 9                      48.00000000                              0.020833333
## 10                     91.00000000                              0.010989011
## 11                     71.80000000                              0.013927577
## 12                     54.45000000                              0.018365473
## 13                     47.50000000                              0.021052632
## 14                     48.96923077                              0.020420986
## 15                     47.64000000                              0.020990764
## 16                     52.10000000                              0.019193858
## 17                      0.12692308                              7.878787879
## 18                     68.70000000                              0.014556041
## 19                     41.60000000                              0.024038462
## 20                    142.40000000                              0.007022472
## 21                     49.20000000                              0.020325203
## 22                      0.09850746                             10.151515152
## 23                     54.60000000                              0.018315018
## 24                     56.85000000                              0.017590150
## 25                     62.00000000                              0.016129032
## 26                      0.11650485                              8.583333333
## 27                      0.10800000                              9.259259259
## 28                     65.40000000                              0.015290520
## 29                     63.52500000                              0.015741834
## 30                     70.20000000                              0.014245014
## 31                     42.36000000                              0.023607177
## 32                     50.74285714                              0.019707207
## 33                     66.00000000                              0.015151515
## 34                      0.11186441                              8.939393939
## 35                      0.09000000                             11.111111111
## 36                     44.92500000                              0.022259321
## 37                      0.09397590                             10.641025641
## 38                      0.08219178                             12.166666667
## 39                      0.06835443                             14.629629630
## 40                     62.00000000                              0.016129032
## 41                     58.20000000                              0.017182131

Lets now order this by decreasing order of fold change value by means of commercial samples over means of control samples and write it out to csv.

DATA1 <- DATA[order(DATA$RealFC_commercial_control_means, decreasing=T),]
head(DATA1)

##                  ID_REF geneSynonyms    Ensembl_Name control1.4362
## 20 GAGTCGTTTAAAGGCTCTCT       CLINT1 ENSG00000113282             1
## 10 CACCGTCGTTTTTGTGACCG       IGFBP7 ENSG00000163453             1
## 11 CCAGCAGAGTCGCTCGAAAT       FAM20C ENSG00000177706             1
## 30 GTAGAGTCGTTACCCGACAC      SMARCA2 ENSG00000080503             1
## 18 CTCCAGAGCCGTTTTCGGTG       HSD3B1 ENSG00000203857             1
## 33 TAGACATGCAGTCGTTTCGA      ST3GAL3 ENSG00000126091             1
##    control2.4363 control3.4364 MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## 20             1             1         125         185          97         209
## 10             1             1         103          63          85         124
## 11             1             1          85          47          65          83
## 30             1             1          82          67          55          99
## 18             2             1         141          84          61         190
## 33             1             2          61         114          65         103
##    MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377 MS2_r4_4378 MS2_r5_4379
## 20         191         102         122         111         114         103
## 10         142          55          61          52          75          64
## 11          67          53          60          58          61          58
## 30          97          69          51          77          94          67
## 18         156          69          88         110         105         105
## 33         116         101         110          87         110          68
##    commercial1o.commercial_r1_4365 commercial2o.commercial_r2_4366
## 20                             132                             155
## 10                              96                              65
## 11                              55                              84
## 30                              78                              82
## 18                             104                              92
## 33                              88                             105
##    commercial3o.commercial_r3_4367 commercial4o.commercial_r4_4368
## 20                             203                             123
## 10                             123                              64
## 11                              96                              67
## 30                              94                              51
## 18                             109                              89
## 33                              97                              86
##    commercial5o.commercial_r5_4369 controlMeans MS1_Means MS2_Means
## 20                              99     1.000000     161.4     110.4
## 10                             107     1.000000     103.4      61.4
## 11                              57     1.000000      69.4      58.0
## 30                              46     1.000000      80.0      71.6
## 18                              64     1.333333     126.4      95.4
## 33                              64     1.333333      91.8      95.2
##    commercial_Means foldchange_MS1_vs_control foldchange_MS2_vs_control
## 20            142.4                    161.40                    110.40
## 10             91.0                    103.40                     61.40
## 11             71.8                     69.40                     58.00
## 30             70.2                     80.00                     71.60
## 18             91.6                     94.80                     71.55
## 33             88.0                     68.85                     71.40
##    RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 20                           142.4                              0.007022472
## 10                            91.0                              0.010989011
## 11                            71.8                              0.013927577
## 30                            70.2                              0.014245014
## 18                            68.7                              0.014556041
## 33                            66.0                              0.015151515
##                    mRNA          aminoAcids_simulated
## 20 CUCAGCAAAUUUCCGAGAGA -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
## 10 GUGGCAGCAAAAACACUGGC   -Val-Ala-Ala-Lys-Thr-Leu-GC
## 11 GGUCGUCUCAGCGAGCUUUA  G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 30 CAUCUCAGCAAUGGGCUGUG -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 18 GAGGUCUCGGCAAAAGCCAC  -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 33 AUCUGUACGUCAGCAAAGCU  AU-Leu-Tyr-Val-A-Ala-AA-Ala-

tail(DATA1)

##                  ID_REF geneSynonyms    Ensembl_Name control1.4362
## 1  AACGCACGGGCGTGTTAGTC      TSNARE1 ENSG00000171045            25
## 22 GCCGTCCTGTCTTTCTCATT    MIR4432HG ENSG00000228590            28
## 37 TGCCCGCGCCTACAGTAGTG     RPL31P30 ENSG00000230702            27
## 35 TAGTGGCGTGAGATTTGCGT    RNU6-280P ENSG00000201015            28
## 38 TGCGGTCGCGACCTTTCAGC       KLHL29 ENSG00000119771            21
## 39 TTCACGGTCCTTTTGGTCAC        DDAH1 ENSG00000153904            31
##    control2.4363 control3.4364 MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## 1             46            32           1           2           1           3
## 22            16            23           1           3           2           3
## 37            24            32           1           1           4           4
## 35            29            23           1           1           3           3
## 38            22            30           1           2           2           2
## 39            23            25           2           2           2           1
##    MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377 MS2_r4_4378 MS2_r5_4379
## 1            8           6           6           3           3           3
## 22           1           3           1           2           2           4
## 37           5           3           2           3           1           4
## 35           3           1           3           4           6           1
## 38           3           5           5           2           3           3
## 39           1           3           1           3           1           1
##    commercial1o.commercial_r1_4365 commercial2o.commercial_r2_4366
## 1                                5                               4
## 22                               2                               2
## 37                               5                               1
## 35                               3                               4
## 38                               2                               3
## 39                               3                               2
##    commercial3o.commercial_r3_4367 commercial4o.commercial_r4_4368
## 1                                3                               2
## 22                               1                               2
## 37                               1                               5
## 35                               2                               1
## 38                               3                               1
## 39                               2                               1
##    commercial5o.commercial_r5_4369 controlMeans MS1_Means MS2_Means
## 1                                4     34.33333       3.0       4.2
## 22                               4     22.33333       2.0       2.4
## 37                               1     27.66667       3.0       2.6
## 35                               2     26.66667       2.2       3.0
## 38                               1     24.33333       2.0       3.6
## 39                               1     26.33333       1.6       1.8
##    commercial_Means foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1               3.6                0.08737864                0.12233010
## 22              2.2                0.08955224                0.10746269
## 37              2.6                0.10843373                0.09397590
## 35              2.4                0.08250000                0.11250000
## 38              2.0                0.08219178                0.14794521
## 39              1.8                0.06075949                0.06835443
##    RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 1                       0.10485437                                 9.537037
## 22                      0.09850746                                10.151515
## 37                      0.09397590                                10.641026
## 35                      0.09000000                                11.111111
## 38                      0.08219178                                12.166667
## 39                      0.06835443                                14.629630
##                    mRNA          aminoAcids_simulated
## 1  UUGCGUGCCCGCACAAUCAG  -Leu-C-Val-CCC-Ala-CA-Ile-AG
## 22 CGGCAGGACAGAAAGAGUAA  CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 37 ACGGGCGCGGAUGUCAUCAC  -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 35 AUCACCGCACUCUAAACGCA -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 38 ACGCCAGCGCUGGAAAGUCG  AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 39 AAGUGCCAGGAAAACCAGUG  AA-Val-Pro-G-Glu-AA-Pro-Val-

It seems to be ordered from enhancer genes at the top and silencer genes at the bottom. We will now write this out to csv and I will add in the gene summaries to these genes. These genes will be in our machine model to predict Lyme Disease, fibromyalgia, mononucleosis, multiple sclerosis, and when we get to it Burkett and Hodgkin lymphoma separately and then head and neck cancer or sarcoma related to EBV.

We have to get the data first on the lymphomas and the sarcoma of head and neck related to EBV, and then we have to see which media is the same and extracted the same or just do a whole other formatting of data extraction to get the same media in gene expression data from i.e. peripheral mononucleated blood cells, extracted with high throughput sequencing of same type, and counts or fragments, but all have to be one type and also not normalized by dividing by the max of each gene across all samples.

Lets list our genes now before writing it out to csv.

DATA1[,c(2,28)]

##    geneSynonyms RealFC_commercial_control_means
## 20       CLINT1                    142.40000000
## 10       IGFBP7                     91.00000000
## 11       FAM20C                     71.80000000
## 30      SMARCA2                     70.20000000
## 18       HSD3B1                     68.70000000
## 33      ST3GAL3                     66.00000000
## 8         TSHZ2                     65.60000000
## 28        NLRP3                     65.40000000
## 29   POM121L15P                     63.52500000
## 6         CASC8                     62.70000000
## 25        ESRP1                     62.00000000
## 40      RPL7P61                     62.00000000
## 5        PLSCR5                     60.60000000
## 41         CPA6                     58.20000000
## 24        ERBB4                     56.85000000
## 2       CACNA1E                     56.40000000
## 23     TMEM200A                     54.60000000
## 12     NDUFB5P1                     54.45000000
## 3         STX12                     52.80000000
## 16      CRLF3P3                     52.10000000
## 32         CDH8                     50.74285714
## 7       PIK3C2A                     50.52000000
## 21    KCNA6-AS1                     49.20000000
## 14      SLIRPP1                     48.96923077
## 4     PDE4DIPP5                     48.67500000
## 9       ST8SIA4                     48.00000000
## 15        HDAC4                     47.64000000
## 13      SCHLAP1                     47.50000000
## 36      ATP5MC3                     44.92500000
## 31       CAMKMT                     42.36000000
## 19       AQP12B                     41.60000000
## 17         LZIC                      0.12692308
## 26        WDR35                      0.11650485
## 34         TFPI                      0.11186441
## 27        ACTN4                      0.10800000
## 1       TSNARE1                      0.10485437
## 22    MIR4432HG                      0.09850746
## 37     RPL31P30                      0.09397590
## 35    RNU6-280P                      0.09000000
## 38       KLHL29                      0.08219178
## 39        DDAH1                      0.06835443

All genes above 1 in value are enhanced or upregulated, while those below the value of 1 are silenced or down regulated genes. We will write the DATA1 table to csv and you can get it here.

write.csv(DATA1, 'MS_top41genes_FCs_summariesNeeded.csv', row.names=F)

I will add in the gene summaries and share it as an attachment to same file link above when I do.

Thanks.

MS genes top 41 with cDNA amino acids attempt2 with bioconductor

Janis Harris

2026-01-18