An extension to the Multiple Sclerosis gene expression data of barcodes done earlier in week.
barcodes <- read.csv('MS_41genes_15samples_Foldchange_mRNA_AAs.csv', header=T, sep=',', na.strings=c('',' ','na','NA'))
str(barcodes)
## 'data.frame': 41 obs. of 28 variables:
## $ ID_REF : chr "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC" "TGCGGTCGCGACCTTTCAGC" ...
## $ control1.4362 : int 1 1 31 21 27 2 28 20 1 2 ...
## $ control2.4363 : int 1 1 23 22 24 4 29 26 1 1 ...
## $ control3.4364 : int 3 1 25 30 32 2 23 13 2 4 ...
## $ MS1_r1_4370 : int 119 99 2 1 1 135 1 1 61 209 ...
## $ MS1_r2_4371 : int 106 89 2 2 1 124 1 2 114 176 ...
## $ MS1_r3_4372 : int 101 49 2 2 4 99 3 4 65 135 ...
## $ MS1_r4_4373 : int 200 99 1 2 4 235 3 2 103 242 ...
## $ MS1_r5_4374 : int 99 108 1 3 5 133 3 2 116 225 ...
## $ MS2_r1_4375 : int 75 85 3 5 3 122 1 3 101 104 ...
## $ MS2_r2_4376 : int 65 49 1 5 2 109 3 3 110 114 ...
## $ MS2_r3_4377 : int 63 53 3 2 3 97 4 4 87 107 ...
## $ MS2_r4_4378 : int 78 36 1 3 1 123 6 1 110 138 ...
## $ MS2_r5_4379 : int 87 43 1 3 4 98 1 2 68 111 ...
## $ commercial1o.commercial_r1_4365 : int 128 83 3 2 5 106 3 3 88 124 ...
## $ commercial2o.commercial_r2_4366 : int 98 52 2 3 1 116 4 2 105 114 ...
## $ commercial3o.commercial_r3_4367 : int 109 60 2 3 1 156 2 2 97 156 ...
## $ commercial4o.commercial_r4_4368 : int 93 67 1 1 5 122 1 2 86 95 ...
## $ commercial5o.commercial_r5_4369 : int 57 48 1 1 1 99 2 2 64 103 ...
## $ controlMeans : num 1.67 1 26.33 24.33 27.67 ...
## $ MS1_Means : num 125 88.8 1.6 2 3 ...
## $ MS2_Means : num 73.6 53.2 1.8 3.6 2.6 ...
## $ commercial_Means : num 97 62 1.8 2 2.6 ...
## $ foldchange_MS1_vs_control : num 75 88.8 0.0608 0.0822 0.1084 ...
## $ foldchange_MS2_vs_control : num 44.16 53.2 0.0684 0.1479 0.094 ...
## $ foldchange_commercialMS_vs_control: num 0.0172 0.0161 14.6296 12.1667 10.641 ...
## $ mRNA : chr "AAAGCCCCUUGGCUCAGCUA" "AAAUCUCAGCCACCAUCUAG" "AAGUGCCAGGAAAACCAGUG" "ACGCCAGCGCUGGAAAGUCG" ...
## $ aminoAcids_simulated : chr "-Lys-Ala-CC-Leu-G-Leu-AG-Leu-" "-Lys-U-Leu-A-Ala-A-Pro-U-Leu-G" "AA-Val-Pro-G-Glu-AA-Pro-Val-" "AC-Ala-A-Ala-Leu-Glu-A-Val-G" ...
Read in the meta data that has commented information on methods and procedures of aquiring the lab results. When originally read in, done with read.table and it excluded the ‘!’ prefixed notation to each comment and only included the header with no observations. We will read in with read.csv and look it over to get the methods on the media type and what the barcodes are.
meta <- read.csv("GSE293036-GPL34284_series_matrix.txt", sep='\t',comment.char="#")
str(meta)
## 'data.frame': 373 obs. of 2 variables:
## $ X.Series_title : chr "!Series_geo_accession" "!Series_status" "!Series_submission_date" "!Series_last_update_date" ...
## $ Genome.wide.discovery.of.multiple.sclerosis.genetic.risk.variant.allelic.regulatory.activity: chr "GSE293036" "Public on Nov 13 2025" "Mar 26 2025" "Nov 14 2025" ...
It is better to view it in your Rstudio panel and scroll to see the information.
information <- meta[!duplicated(meta$X.Series_title),]
str(information)
## 'data.frame': 144 obs. of 2 variables:
## $ X.Series_title : chr "!Series_geo_accession" "!Series_status" "!Series_submission_date" "!Series_last_update_date" ...
## $ Genome.wide.discovery.of.multiple.sclerosis.genetic.risk.variant.allelic.regulatory.activity: chr "GSE293036" "Public on Nov 13 2025" "Mar 26 2025" "Nov 14 2025" ...
The information is in row 141 and row 181 on what these barcodes are.
info <- meta[c(141,181),1]
info
## [1] "Total RNA was extracted by the RNeasy Midi Kit following the manufacturer’s instruction. Extracted RNA was subjected to DNase treatment. eGFP mRNA enrichment and elution were performed using biotinylated probes as in Lu et al (PMID: 33712590), and the mRNA was reverse transcribed to cDNA using SuperScript™ IV First-Strand Synthesis System with gene specific primer MPRA_v3_Amp2Sc_R, following the manufacturer’s instruction."
## [2] "Oligo-barcode association: Paired-end, 150bp reads were first quality filtered using Trimmomatic (v0.38) (flags: PE -phred33, LEADING:25, TRAILING:25, MINLEN:80). Read 1 was separated into the 20bp barcode region and the oligo-matching region. The trimmed oligo-matching regions of Read 1 and Read 2 were mapped back to the synthesized oligo sequences using Bowtie2 (v2.3.4.1) (flags: -X 250, -very-sensitive, -p 16). Barcodes were then associated with the oligo sequences using the read ID. Only uniquely mapped barcodes were used for downstream analysis."
These barcodes are complementary DNA that has been split into at least 80 bp reads of 150 bp reads with oligo sequences leading the first 25 and the last 25, to get the middle 30 as a region of the barcode. The read ID given each read was what would pair it back with its original oligo sequence. Sounds confusing. And two software programs used for this to split it and separate it and then to map it back to original oligo sequence. An internet search for what an oligo sequence is in cDNA barcode analysis said it is to protect the actual biological processes from artifacts the process can create that wouldn’t be there normally but added by scanning the genes. A unique oligonucleotide sequence is added to each mRNA molecule before doing reverse transcription into cDNA it is added to 5’ end of mRNA (DNA is copied in the 5’ to 3’ direction into mRNA and mRNA is translated into the 5’ to 3’ direction as it is the forward strand while the cDNA strand is the reverse strand in DNA). It is to ensure accurate quantification of gene expression as well as transcript variants as well as protect the PCR (polymerase chain reaction) bias from creating artifacts and instead showing the true biological variation in the samples. Each mRNA molecule has an oligo identifier attached.
We made the mRNA feature from the cDNA ID_REF feature barcodes, and from the mRNA attempted to gsub in the amino acid for each respective triplet codon of nucleotides earlier.
strands <- barcodes[,c(1,27,28)]
strands
## ID_REF mRNA aminoAcids_simulated
## 1 TTTCGGGGAACCGAGTCGAT AAAGCCCCUUGGCUCAGCUA -Lys-Ala-CC-Leu-G-Leu-AG-Leu-
## 2 TTTAGAGTCGGTGGTAGATC AAAUCUCAGCCACCAUCUAG -Lys-U-Leu-A-Ala-A-Pro-U-Leu-G
## 3 TTCACGGTCCTTTTGGTCAC AAGUGCCAGGAAAACCAGUG AA-Val-Pro-G-Glu-AA-Pro-Val-
## 4 TGCGGTCGCGACCTTTCAGC ACGCCAGCGCUGGAAAGUCG AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 5 TGCCCGCGCCTACAGTAGTG ACGGGCGCGGAUGUCAUCAC -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 6 TGAATTTTAGAGTCGGTTTC ACUUAAAAUCUCAGCCAAAG AC-Leu-Lys-U-Leu-A-Ala-Lys-G
## 7 TAGTGGCGTGAGATTTGCGT AUCACCGCACUCUAAACGCA -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 8 TAGAGTACCGTTTTTGAACT AUCUCAUGGCAAAAACUUGA AU-Leu-Met-Ala-Lys-AC-Leu-A
## 9 TAGACATGCAGTCGTTTCGA AUCUGUACGUCAGCAAAGCU AU-Leu-Tyr-Val-A-Ala-AA-Ala-
## 10 GTGATTCCACAGTCGTTAAT CACUAAGGUGUCAGCAAUUA CA-Leu-Arg-U-Val-A-Ala-A-Leu-
## 11 GTGAGGATACAGTCGGTTTT CACUCCUAUGUCAGCCAAAA CA-Leu-Leu-U-Val-A-Ala-Lys-A
## 12 GTAGAGTCGTTACCCGACAC CAUCUCAGCAAUGGGCUGUG -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 13 GTACCGTCGGTTGCTCGTGC CAUGGCAGCCAACGAGCACG -His-G-Ala-Ala-AA-Arg-Ala-CG
## 14 GGTGTTGTCAGAGTCGTTAA CCACAACAGUCUCAGCAAUU -Pro-CA-Thr-GU-Leu-A-Ala-Ile-
## 15 GGTCCTGTCTTTTCTGCTGA CCAGGACAGAAAAGACGACU -Pro-G-Asp-A-Glu-AA-Asp-Asp-U
## 16 GGGCGTGTTTTTCTGGAGTA CCCGCACAAAAAGACCUCAU CCC-Ala-C-Lys-AA-Asp-Leu-AU
## 17 GGCTCGGAGTCGCTGAAAAT CCGAGCCUCAGCGACUUUUA CC-Glu-C-Leu-A-Ala-AC-Phe-UA
## 18 GGAGTCGTCTTTTTATCCCC CCUCAGCAGAAAAAUAGGGG C-Leu-A-Ala-Glu-AA-Ile-Gln-G
## 19 GCTTCGCAGTCGTTAGAGTT CGAAGCGUCAGCAAUCUCAA C-Glu-GC-Val-A-Ala-AU-Leu-AA
## 20 GCCGTCCTGTCTTTCTCATT CGGCAGGACAGAAAGAGUAA CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 21 GCCGAGTCGTTATGGACCCA CGGCUCAGCAAUACCUGGGU CGG-Leu-A-Ala-Ile-C-Leu-Gln-
## 22 GAGTCGTTTAAAGGCTCTCT CUCAGCAAAUUUCCGAGAGA -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
## 23 GAGTCGTTCTCGTTTCGCAG CUCAGCAAGAGCAAAGCGUC -Leu-A-Ala-Arg-Ala-Lys-C-Val-
## 24 CTCCAGAGCCGTTTTCGGTG GAGGUCUCGGCAAAAGCCAC -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 25 CTATGGTCCCTTAGTGTTTA GAUACCAGGGAAUCACAAAU G-Ile-CC-Arg-GA-Ile-Thr-Asp-
## 26 CTATCAACAGAGTCGCTAAT GAUAGUUGUCUCAGCGAUUA G-Ile-G-Leu-U-Leu-A-Ala-A-Leu-
## 27 CGGTTAGAGTCGATAGCTTT GCCAAUCUCAGCUAUCGAAA -Ala-Asp-Leu-AG-Leu-UC-Glu-A
## 28 CGAGTCGTTTGACCGGCGCA GCUCAGCAAACUGGCCGCGU G-Leu-A-Ala-AA-Leu-Ala-Ala-U
## 29 CCTCTCACCAGTCGTTTTGG GGAGAGUGGUCAGCAAAACC G-Glu-A-Val-Val-A-Ala-AA-Thr-
## 30 CCAGTCGATTCTTTTCATAT GGUCAGCUAAGAAAAGUAUA G-Val-AG-Leu-A-Glu-Lys-U-Ile-
## 31 CCAGCAGAGTCGCTCGAAAT GGUCGUCUCAGCGAGCUUUA G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 32 CACCGTCGTTTTTGTGACCG GUGGCAGCAAAAACACUGGC -Val-Ala-Ala-Lys-Thr-Leu-GC
## 33 ATCGTCGTTTTAGCCGTAGG UAGCAGCAAAAUCGGCAUCC UA-Ala-Ala-AA-Ile-Gln-Ile-C
## 34 ATCGTCGGTCTTAGCGGTCA UAGCAGCCAGAAUCGCCAGU UA-Ala-Ala-Arg-Ile-Ala-AGU
## 35 AGATTAACCCAATACATTAT UCUAAUUGGGUUAUGUAAUA U-Leu-A-Leu-GG-Leu-U-Val-Ile-
## 36 AGAGTCGCTCGTTAGGATCT UCUCAGCGAGCAAUCCUAGA U-Leu-A-Ala-A-Ala-Ile-Leu-GA
## 37 AGAGTCGATTTGTCCAATCG UCUCAGCUAAACAGGUUAGC U-Leu-AG-Leu-Asp-Arg-Leu-GC
## 38 AGAGGCGTTCGATCTTAGAC UCUCCGCAAGCUAGAAUCUG U-Leu-C-Ala-AG-Leu-Glu-U-Leu-
## 39 ACTGTCGTTTCAACGTTGAA UGACAGCAAAGUUGCAACUU U-Asp-A-Ala-Lys-Leu-Gln-Leu-
## 40 ACCCCCGTCGTTAATTCGAC UGGGGGCAGCAAUUAAGCUG -Trp-GG-Ala-Ala-A-Leu-AG-Leu-
## 41 AACGCACGGGCGTGTTAGTC UUGCGUGCCCGCACAAUCAG -Leu-C-Val-CCC-Ala-CA-Ile-AG
It produced some singlet and double nucleotide gaps between triplet codons of amino acids that could be deletions, insertions, or translocations that happen when the nucleus of the cell has been damaged enough to destroy the DNA wrapped heavily around histones that encircle the chromosomes in the nucleus of the cell. The cell naturally repairs what it can or programs the cell to die and create another replacement from adjacent cells. When those cells get reproduced with damaged cells it can impact how genes are transcribed and proteins translated from those damaged cells. It can lead to systemic body failure and organ failure, as well as cancer and uncontrolled cell growth into tumors encapsulated which are survivable or non-encapsulated and deadlier as it mutates and turns adjacent cells into tumors and spreads until it damages major system functions of the body.
We want to see what these top strands in the barcodes of the ID_REF are and we found 41 that are up or down regulated. We can check the end features to. Lets see if most of these are up or down regulated in the MS samples that participated and the EBV cell line of B-cell commercially reproduced MS patient, when compared to the healthy control sample used.
foldchanges <- barcodes[order(barcodes$foldchange_commercialMS_vs_control, decreasing=T),c(1,27,28,24:26)]
foldchanges
## ID_REF mRNA aminoAcids_simulated
## 3 TTCACGGTCCTTTTGGTCAC AAGUGCCAGGAAAACCAGUG AA-Val-Pro-G-Glu-AA-Pro-Val-
## 4 TGCGGTCGCGACCTTTCAGC ACGCCAGCGCUGGAAAGUCG AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 7 TAGTGGCGTGAGATTTGCGT AUCACCGCACUCUAAACGCA -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 5 TGCCCGCGCCTACAGTAGTG ACGGGCGCGGAUGUCAUCAC -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 20 GCCGTCCTGTCTTTCTCATT CGGCAGGACAGAAAGAGUAA CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 41 AACGCACGGGCGTGTTAGTC UUGCGUGCCCGCACAAUCAG -Leu-C-Val-CCC-Ala-CA-Ile-AG
## 15 GGTCCTGTCTTTTCTGCTGA CCAGGACAGAAAAGACGACU -Pro-G-Asp-A-Glu-AA-Asp-Asp-U
## 8 TAGAGTACCGTTTTTGAACT AUCUCAUGGCAAAAACUUGA AU-Leu-Met-Ala-Lys-AC-Leu-A
## 16 GGGCGTGTTTTTCTGGAGTA CCCGCACAAAAAGACCUCAU CCC-Ala-C-Lys-AA-Asp-Leu-AU
## 25 CTATGGTCCCTTAGTGTTTA GAUACCAGGGAAUCACAAAU G-Ile-CC-Arg-GA-Ile-Thr-Asp-
## 23 GAGTCGTTCTCGTTTCGCAG CUCAGCAAGAGCAAAGCGUC -Leu-A-Ala-Arg-Ala-Lys-C-Val-
## 11 GTGAGGATACAGTCGGTTTT CACUCCUAUGUCAGCCAAAA CA-Leu-Leu-U-Val-A-Ala-Lys-A
## 6 TGAATTTTAGAGTCGGTTTC ACUUAAAAUCUCAGCCAAAG AC-Leu-Lys-U-Leu-A-Ala-Lys-G
## 29 CCTCTCACCAGTCGTTTTGG GGAGAGUGGUCAGCAAAACC G-Glu-A-Val-Val-A-Ala-AA-Thr-
## 27 CGGTTAGAGTCGATAGCTTT GCCAAUCUCAGCUAUCGAAA -Ala-Asp-Leu-AG-Leu-UC-Glu-A
## 33 ATCGTCGTTTTAGCCGTAGG UAGCAGCAAAAUCGGCAUCC UA-Ala-Ala-AA-Ile-Gln-Ile-C
## 38 AGAGGCGTTCGATCTTAGAC UCUCCGCAAGCUAGAAUCUG U-Leu-C-Ala-AG-Leu-Glu-U-Leu-
## 28 CGAGTCGTTTGACCGGCGCA GCUCAGCAAACUGGCCGCGU G-Leu-A-Ala-AA-Leu-Ala-Ala-U
## 21 GCCGAGTCGTTATGGACCCA CGGCUCAGCAAUACCUGGGU CGG-Leu-A-Ala-Ile-C-Leu-Gln-
## 35 AGATTAACCCAATACATTAT UCUAAUUGGGUUAUGUAAUA U-Leu-A-Leu-GG-Leu-U-Val-Ile-
## 10 GTGATTCCACAGTCGTTAAT CACUAAGGUGUCAGCAAUUA CA-Leu-Arg-U-Val-A-Ala-A-Leu-
## 26 CTATCAACAGAGTCGCTAAT GAUAGUUGUCUCAGCGAUUA G-Ile-G-Leu-U-Leu-A-Ala-A-Leu-
## 39 ACTGTCGTTTCAACGTTGAA UGACAGCAAAGUUGCAACUU U-Asp-A-Ala-Lys-Leu-Gln-Leu-
## 30 CCAGTCGATTCTTTTCATAT GGUCAGCUAAGAAAAGUAUA G-Val-AG-Leu-A-Glu-Lys-U-Ile-
## 19 GCTTCGCAGTCGTTAGAGTT CGAAGCGUCAGCAAUCUCAA C-Glu-GC-Val-A-Ala-AU-Leu-AA
## 40 ACCCCCGTCGTTAATTCGAC UGGGGGCAGCAAUUAAGCUG -Trp-GG-Ala-Ala-A-Leu-AG-Leu-
## 18 GGAGTCGTCTTTTTATCCCC CCUCAGCAGAAAAAUAGGGG C-Leu-A-Ala-Glu-AA-Ile-Gln-G
## 1 TTTCGGGGAACCGAGTCGAT AAAGCCCCUUGGCUCAGCUA -Lys-Ala-CC-Leu-G-Leu-AG-Leu-
## 37 AGAGTCGATTTGTCCAATCG UCUCAGCUAAACAGGUUAGC U-Leu-AG-Leu-Asp-Arg-Leu-GC
## 2 TTTAGAGTCGGTGGTAGATC AAAUCUCAGCCACCAUCUAG -Lys-U-Leu-A-Ala-A-Pro-U-Leu-G
## 17 GGCTCGGAGTCGCTGAAAAT CCGAGCCUCAGCGACUUUUA CC-Glu-C-Leu-A-Ala-AC-Phe-UA
## 36 AGAGTCGCTCGTTAGGATCT UCUCAGCGAGCAAUCCUAGA U-Leu-A-Ala-A-Ala-Ile-Leu-GA
## 13 GTACCGTCGGTTGCTCGTGC CAUGGCAGCCAACGAGCACG -His-G-Ala-Ala-AA-Arg-Ala-CG
## 14 GGTGTTGTCAGAGTCGTTAA CCACAACAGUCUCAGCAAUU -Pro-CA-Thr-GU-Leu-A-Ala-Ile-
## 34 ATCGTCGGTCTTAGCGGTCA UAGCAGCCAGAAUCGCCAGU UA-Ala-Ala-Arg-Ile-Ala-AGU
## 9 TAGACATGCAGTCGTTTCGA AUCUGUACGUCAGCAAAGCU AU-Leu-Tyr-Val-A-Ala-AA-Ala-
## 24 CTCCAGAGCCGTTTTCGGTG GAGGUCUCGGCAAAAGCCAC -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 12 GTAGAGTCGTTACCCGACAC CAUCUCAGCAAUGGGCUGUG -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 31 CCAGCAGAGTCGCTCGAAAT GGUCGUCUCAGCGAGCUUUA G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 32 CACCGTCGTTTTTGTGACCG GUGGCAGCAAAAACACUGGC -Val-Ala-Ala-Lys-Thr-Leu-GC
## 22 GAGTCGTTTAAAGGCTCTCT CUCAGCAAAUUUCCGAGAGA -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
## foldchange_MS1_vs_control foldchange_MS2_vs_control
## 3 0.06075949 0.06835443
## 4 0.08219178 0.14794521
## 7 0.08250000 0.11250000
## 5 0.10843373 0.09397590
## 20 0.08955224 0.10746269
## 41 0.08737864 0.12233010
## 15 0.07800000 0.12600000
## 8 0.11186441 0.13220339
## 16 0.09320388 0.13980583
## 25 0.09230769 0.08076923
## 23 54.00000000 52.60000000
## 11 63.00000000 47.04000000
## 6 54.45000000 41.17500000
## 29 75.80000000 54.50000000
## 27 66.36000000 52.20000000
## 33 60.80000000 47.40000000
## 38 54.90000000 42.67500000
## 28 57.23076923 50.86153846
## 21 54.40000000 42.20000000
## 35 68.52000000 48.60000000
## 10 84.60000000 49.20000000
## 26 64.40000000 41.00000000
## 39 51.60000000 44.00000000
## 30 76.35000000 62.10000000
## 19 71.76000000 46.68000000
## 40 61.60000000 64.20000000
## 18 73.05000000 61.50000000
## 1 75.00000000 44.16000000
## 37 75.40000000 60.40000000
## 2 88.80000000 53.20000000
## 17 87.60000000 70.40000000
## 36 74.10000000 63.45000000
## 13 77.55000000 45.90000000
## 14 63.90000000 61.20000000
## 34 74.60000000 77.40000000
## 9 68.85000000 71.40000000
## 24 94.80000000 71.55000000
## 12 80.00000000 71.60000000
## 31 69.40000000 58.00000000
## 32 103.40000000 61.40000000
## 22 161.40000000 110.40000000
## foldchange_commercialMS_vs_control
## 3 14.629629630
## 4 12.166666667
## 7 11.111111111
## 5 10.641025641
## 20 10.151515152
## 41 9.537037037
## 15 9.259259259
## 8 8.939393939
## 16 8.583333333
## 25 7.878787879
## 23 0.024038462
## 11 0.023607177
## 6 0.022259321
## 29 0.021052632
## 27 0.020990764
## 33 0.020833333
## 38 0.020544427
## 28 0.020420986
## 21 0.020325203
## 35 0.019794141
## 10 0.019707207
## 26 0.019193858
## 39 0.018939394
## 30 0.018365473
## 19 0.018315018
## 40 0.017730496
## 18 0.017590150
## 1 0.017182131
## 37 0.016501650
## 2 0.016129032
## 17 0.016129032
## 36 0.015948963
## 13 0.015741834
## 14 0.015290520
## 34 0.015243902
## 9 0.015151515
## 24 0.014556041
## 12 0.014245014
## 31 0.013927577
## 32 0.010989011
## 22 0.007022472
The commercial line has an inverse relationship with each of the participating MS patient samples in that the barcodes that are under expressed in the commercial line are over expressed or upregulated in each of the MS samples. We saw the randomForest() was able to predict with 100% accuracy the sample as healthy or MS using this data of 15 samples and 41 genes without the fold change or means added to this data. It was set on cross validation with 10,000 trees.
Up regulation or over expression is also called enhanced or enhancer gene while the down regulation or under expression of gene transcripts is called silenced or silencer genes.
In seeing if there are copy variants, what it is looking at are those genes that have different nucleotides in their mRNA made after dropping introns or non-coding regions of DNA and keeping the exons as the coding region of DNA. This is from the pre-mRNA that is actually transcribed from DNA when copied from the complementary strand or anti-sense strand to make the adjacent strand or sense strand of DNA using RNA that substitutes Thymine for Uracil as a nucleic acid.
Since DNA is a double helix strand of complementary bases made up of DNA nucleic acids, the sense strand of forward strand moves up but not towards the centromere as that is on the chromosome where the DNA is wrapped around beads very tightly called histones that are then wrapped around the chromosome. We have 23 chromosomes, the first 22 are autosomes and the 23rd chromosome is the sex chromosome for identity. There are 2 copies in each cell nucleus where 1 is a copy of mother and other of father of human. So, the sense strand is just moving they say 5’ to 3’ and not up or down, and when DNA is copied, it is copying the reverse strand or anti-sense strand in the 5’ to 3’ direction to make the molecule of 1 strand of DNA not the double strand that makes the helix shape. That strand is 3’ to 5’ and RNA is not copied in that direction. It is reiterated many times and this makes nonsense at first because you wonder why when first learning it and reviewing it again and again that you don’t even know which direction it moves, but its because it doesn’t move up or down its in space the relation of one strand to the other. So, mRNA is fed into the ribosome in 5’ to 3’ direction the same as the DNA strand is transcribed into the mRNa molecule in the 5’ to 3’ direction.
Environmental toxins, pollution, ultraviolet radiation, any harmful exposure to radiation like x-rays or gamma rays, or higher frequency radiation, viral infection, bacterial infection, immune responses, poisoning, inherited mutations, and stress that fights own body systems as in autoimmune disease can create enough disruption in the body to reach the inside of the cell and the nucleus where chromosomes and DNA is housed can damage the DNA. There are regulating mechanisms that do immune surveilance to protect DNA and repair damaged DNA when alerted to it. When the body survives some catastrophes like malaria, they can pass on gene mutations to their offspring such as sickle cell anemia that is basically the same hemoglobin molecule but the shape is no longer a discoid shape but a sickle or half moon shape to prevent the survival of the virus that causes malaria but it can cause death if left unregulated or not managed well. This example was due to a gene mutation or copy variant in the valine amino acid of the gene for hemoglobin that in normal hemoglobin gives it the discoid shape. The internet says that glutamic acid is what is replaced and it is due to the hydrophobic affect of valine so that the malaria virus won’t survive in the environment if there isn’t water in the cell.
I’m just spilling all I know about what I know from a course in genetics and biology courses. There are shapes to these proteins. They aren’t just single strand molecules, because once the protein is translated it is sent to another organelle the golgi apparatus to shape and form the final protein product that can have a single peptide shape, a double peptide shape of alpha and beta folds, a tertiary shape that bends and winds the same strand of alpha and beta folds, or a quaternary shape of more complex winds and turns of peptides formation. If the shape is destroyed, usually the protein is dysfunctional or defective such as the brain proteins in Jakob-Creutzfeld disease or AKA prion or AKA mad cow disease. This could be from a virus, pH changes, or even extremely elevated body temperatures. That information off hand was mentioned in various chiropractic courses in neurology, human biology, anatomy, physiology, pathology, and probably cell tissue anatomy and pathology. It is highly contagious once caught and the instrument used to dissect a person with it to diagnose them has to be thrown away because the tools cannot be sterilized. It is a Hazmat situation where personal protective equipment is fully suited up to prevent exposure. It is said cannibals got it from eating their own meat from instructer from India who taught Pathology and Immunity and Infection in chiropractic school.
Given what I spilled about everything I know in a ‘nut shell’ as the saying goes, when we find these genes that the barcodes represent in MS cases versus healthy cases from B-cells or peripheral mononucleiated blood cells or the plasma WBC portion of blood, we can link those copy variants or allele variations in those genes with the pathology of MS. We will then add those genes to our machine to predict various classes using same media and samples but normalized by dividing each gene by the max of all observations and compare to see if it can identify EBV infection, mononucleosis, fibromyalgia, multiple sclerosis, Burkett’s lymphoms, Hodgkin’s lymphoma, or head and neck or throat cancer or sarcoma.
I have not been having luck with that simple line of code for the bioconductor as I think it has to do more with NCBI not allowing scraping to go on as they ask as well as genecards if the request is human by checking a box after using their inquiry search box. So, we can try again and use another code or we can manually look them up. There is a ranking system that is done by barcode or 20 base pair cDNA to see if the gene is the top gene for that sequence. We will just use the top ranked gene for that sequence.
I will manually input that data and return to upload the dataset and explore the results with information on those genes.
write.csv(strands,'barcodes41_cDNA_mRNA_AAs.csv', row.names=F)
IDs <- barcodes$ID_REF
IDs
## [1] "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC"
## [4] "TGCGGTCGCGACCTTTCAGC" "TGCCCGCGCCTACAGTAGTG" "TGAATTTTAGAGTCGGTTTC"
## [7] "TAGTGGCGTGAGATTTGCGT" "TAGAGTACCGTTTTTGAACT" "TAGACATGCAGTCGTTTCGA"
## [10] "GTGATTCCACAGTCGTTAAT" "GTGAGGATACAGTCGGTTTT" "GTAGAGTCGTTACCCGACAC"
## [13] "GTACCGTCGGTTGCTCGTGC" "GGTGTTGTCAGAGTCGTTAA" "GGTCCTGTCTTTTCTGCTGA"
## [16] "GGGCGTGTTTTTCTGGAGTA" "GGCTCGGAGTCGCTGAAAAT" "GGAGTCGTCTTTTTATCCCC"
## [19] "GCTTCGCAGTCGTTAGAGTT" "GCCGTCCTGTCTTTCTCATT" "GCCGAGTCGTTATGGACCCA"
## [22] "GAGTCGTTTAAAGGCTCTCT" "GAGTCGTTCTCGTTTCGCAG" "CTCCAGAGCCGTTTTCGGTG"
## [25] "CTATGGTCCCTTAGTGTTTA" "CTATCAACAGAGTCGCTAAT" "CGGTTAGAGTCGATAGCTTT"
## [28] "CGAGTCGTTTGACCGGCGCA" "CCTCTCACCAGTCGTTTTGG" "CCAGTCGATTCTTTTCATAT"
## [31] "CCAGCAGAGTCGCTCGAAAT" "CACCGTCGTTTTTGTGACCG" "ATCGTCGTTTTAGCCGTAGG"
## [34] "ATCGTCGGTCTTAGCGGTCA" "AGATTAACCCAATACATTAT" "AGAGTCGCTCGTTAGGATCT"
## [37] "AGAGTCGATTTGTCCAATCG" "AGAGGCGTTCGATCTTAGAC" "ACTGTCGTTTCAACGTTGAA"
## [40] "ACCCCCGTCGTTAATTCGAC" "AACGCACGGGCGTGTTAGTC"
# Install Bioconductor packages if not already installed
#if (!requireNamespace("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
# Install required packages
#BiocManager::install(c("Biostrings"), ask = FALSE)
#install.packages("dplyr")
# Load libraries
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: Seqinfo
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
#library(dplyr) #needs higher version
#install.packages('dplyr')
Installing package into ‘C:/…/AppData/Local/R/win-library/4.5’ (as ‘lib’ is unspecified) trying URL ‘https://cran.rstudio.com/bin/windows/contrib/4.5/dplyr_1.1.4.zip’ Content type ‘application/zip’ length 1593322 bytes (1.5 MB) downloaded 1.5 MB
package ‘dplyr’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in C:...7ZA_packages Loading required package: BiocGenerics Loading required package: generics
Attaching package: ‘generics’
The following objects are masked from ‘package:base’:
as.difftime, as.factor, as.ordered, intersect, is.element,
setdiff, setequal, union
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename,
cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, is.unsorted,
lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
rownames, sapply, saveRDS, table, tapply, unique, unsplit,
which.max, which.min
Loading required package: S4Vectors Loading required package: stats4
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:grDevices’:
windows
Loading required package: XVector Loading required package: Seqinfo
Attaching package: ‘Biostrings’
The following object is masked from ‘package:base’:
strsplit
Error: package or namespace load failed for ‘dplyr’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace ‘rlang’ 1.1.6 is already loaded, but >= 1.1.7 is required
#Example: Known barcode-to-gene mapping table
barcode_map <- data.frame(
barcode = c("ACTGAC", "TTGACA", "GCTTGA"),
gene_id = c("GeneA", "GeneB", "GeneC"),
stringsAsFactors = FALSE
)
barcode_map
## barcode gene_id
## 1 ACTGAC GeneA
## 2 TTGACA GeneB
## 3 GCTTGA GeneC
barcode
TTGACA GeneB
GCTTGA GeneC
3 rows
# Example: Sequenced barcodes from experiment
sequenced_barcodes <- c("ACTGAC", "TTGACA", "GCTTGA", "ACTGAT") # last one has a mismatch
sequenced_barcodes
## [1] "ACTGAC" "TTGACA" "GCTTGA" "ACTGAT"
[1] “ACTGAC” “TTGACA” “GCTTGA” “ACTGAT”
#Convert to DNAStringSet for sequence operations
seq_set <- DNAStringSet(sequenced_barcodes)
ref_set <- DNAStringSet(barcode_map$barcode)
seq_set
## DNAStringSet object of length 4:
## width seq
## [1] 6 ACTGAC
## [2] 6 TTGACA
## [3] 6 GCTTGA
## [4] 6 ACTGAT
DNAStringSet object of length 4: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA [4] 6 ACTGAT
ref_set
## DNAStringSet object of length 3:
## width seq
## [1] 6 ACTGAC
## [2] 6 TTGACA
## [3] 6 GCTTGA
DNAStringSet object of length 3: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA
#Convert to DNAStringSet for sequence operations
seq_set1 <- DNAStringSet(IDs)
ref_set1 <- DNAStringSet(barcode_map$barcode)
seq_set1
## DNAStringSet object of length 41:
## width seq
## [1] 20 TTTCGGGGAACCGAGTCGAT
## [2] 20 TTTAGAGTCGGTGGTAGATC
## [3] 20 TTCACGGTCCTTTTGGTCAC
## [4] 20 TGCGGTCGCGACCTTTCAGC
## [5] 20 TGCCCGCGCCTACAGTAGTG
## ... ... ...
## [37] 20 AGAGTCGATTTGTCCAATCG
## [38] 20 AGAGGCGTTCGATCTTAGAC
## [39] 20 ACTGTCGTTTCAACGTTGAA
## [40] 20 ACCCCCGTCGTTAATTCGAC
## [41] 20 AACGCACGGGCGTGTTAGTC
DNAStringSet object of length 41: width seq [1] 20 TTTCGGGGAACCGAGTCGAT [2] 20 TTTAGAGTCGGTGGTAGATC [3] 20 TTCACGGTCCTTTTGGTCAC [4] 20 TGCGGTCGCGACCTTTCAGC [5] 20 TGCCCGCGCCTACAGTAGTG … … … [37] 20 AGAGTCGATTTGTCCAATCG [38] 20 AGAGGCGTTCGATCTTAGAC [39] 20 ACTGTCGTTTCAACGTTGAA [40] 20 ACCCCCGTCGTTAATTCGAC [41] 20 AACGCACGGGCGTGTTAGTC
ref_set1
## DNAStringSet object of length 3:
## width seq
## [1] 6 ACTGAC
## [2] 6 TTGACA
## [3] 6 GCTTGA
DNAStringSet object of length 3: width seq [1] 6 ACTGAC [2] 6 TTGACA [3] 6 GCTTGA
# Function to find closest matching barcode (allowing mismatches)
map_barcode <- function(query_seq, ref_set, ref_map, max_mismatch = 1) {
# Find matches allowing up to max_mismatch mismatches
hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch, with.indels = FALSE)
if (sum(hits) == 1) {
matched_barcode <- as.character(ref_set[which(hits == 1)])
return(ref_map$gene_id[ref_map$barcode == matched_barcode])
} else {
return(NA) # No unique match
}
}
map_barcode
## function (query_seq, ref_set, ref_map, max_mismatch = 1)
## {
## hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch,
## with.indels = FALSE)
## if (sum(hits) == 1) {
## matched_barcode <- as.character(ref_set[which(hits ==
## 1)])
## return(ref_map$gene_id[ref_map$barcode == matched_barcode])
## }
## else {
## return(NA)
## }
## }
function(query_seq, ref_set, ref_map, max_mismatch = 1) { # Find matches allowing up to max_mismatch mismatches hits <- vcountPattern(query_seq, ref_set, max.mismatch = max_mismatch, with.indels = FALSE)
if (sum(hits) == 1) { matched_barcode <- as.character(ref_set[which(hits == 1)]) return(ref_map\(gene_id[ref_map\)barcode == matched_barcode]) } else { return(NA) # No unique match } }
# Map all sequenced barcodes to genes
mapped_genes <- sapply(sequenced_barcodes, map_barcode, ref_set = ref_set, ref_map = barcode_map)
mapped_genes
## ACTGAC TTGACA GCTTGA ACTGAT
## "GeneA" "GeneB" "GeneC" "GeneA"
ACTGAC TTGACA GCTTGA ACTGAT “GeneA” “GeneB” “GeneC” “GeneA”
# Combine results into a data frame
result <- data.frame(
sequenced_barcode = sequenced_barcodes,
mapped_gene = mapped_genes,
stringsAsFactors = FALSE
)
print(result)
## sequenced_barcode mapped_gene
## ACTGAC ACTGAC GeneA
## TTGACA TTGACA GeneB
## GCTTGA GCTTGA GeneC
## ACTGAT ACTGAT GeneA
sequenced_barcode mapped_gene
ACTGAC ACTGAC GeneA TTGACA TTGACA GeneB GCTTGA GCTTGA GeneC ACTGAT ACTGAT GeneA
The internet gave me example above but my dplyr package won’t load now and needs a new version higher than 1.1.4
BiocManager::available()
‘getOption(“repos”)’ replaces Bioconductor standard repositories, see
‘help(“repositories”, package = “BiocManager”)’ for details. Replacement
repositories: CRAN: https://cran.rstudio.com/ [1] “a4”
[2] “a4Base”
[3] “a4Classif”
[4] “a4Core”
[5] “a4Preproc”
[6] “a4Reporting”
[7] “AalenJohansen”
[8] “aamatch”
[9] “AATtools”
…
[993] “atena”
[994] “ath1121501.db”
[995] “ath1121501cdf”
[996] “ath1121501frmavecs”
[997] “ath1121501probe”
[998] “Athlytics”
[999] “atime”
[1000] “atlas”
[ reached ‘max’ / getOption(“max.print”) – omitted 25643 entries ]
Different tutorial on biostrings
cDNA <- DNAStringSet(barcodes$ID_REF)
cDNA
## DNAStringSet object of length 41:
## width seq
## [1] 20 TTTCGGGGAACCGAGTCGAT
## [2] 20 TTTAGAGTCGGTGGTAGATC
## [3] 20 TTCACGGTCCTTTTGGTCAC
## [4] 20 TGCGGTCGCGACCTTTCAGC
## [5] 20 TGCCCGCGCCTACAGTAGTG
## ... ... ...
## [37] 20 AGAGTCGATTTGTCCAATCG
## [38] 20 AGAGGCGTTCGATCTTAGAC
## [39] 20 ACTGTCGTTTCAACGTTGAA
## [40] 20 ACCCCCGTCGTTAATTCGAC
## [41] 20 AACGCACGGGCGTGTTAGTC
DNAStringSet object of length 41: width seq [1] 20 TTTCGGGGAACCGAGTCGAT [2] 20 TTTAGAGTCGGTGGTAGATC [3] 20 TTCACGGTCCTTTTGGTCAC [4] 20 TGCGGTCGCGACCTTTCAGC [5] 20 TGCCCGCGCCTACAGTAGTG … … … [37] 20 AGAGTCGATTTGTCCAATCG [38] 20 AGAGGCGTTCGATCTTAGAC [39] 20 ACTGTCGTTTCAACGTTGAA [40] 20 ACCCCCGTCGTTAATTCGAC [41] 20 AACGCACGGGCGTGTTAGTC
proteins <- AAStringSet(barcodes$ID_REF)
proteins
## AAStringSet object of length 41:
## width seq
## [1] 20 TTTCGGGGAACCGAGTCGAT
## [2] 20 TTTAGAGTCGGTGGTAGATC
## [3] 20 TTCACGGTCCTTTTGGTCAC
## [4] 20 TGCGGTCGCGACCTTTCAGC
## [5] 20 TGCCCGCGCCTACAGTAGTG
## ... ... ...
## [37] 20 AGAGTCGATTTGTCCAATCG
## [38] 20 AGAGGCGTTCGATCTTAGAC
## [39] 20 ACTGTCGTTTCAACGTTGAA
## [40] 20 ACCCCCGTCGTTAATTCGAC
## [41] 20 AACGCACGGGCGTGTTAGTC
AAStringSet object of length 41: width seq [1] 20 TTTCGGGGAACCGAGTCGAT [2] 20 TTTAGAGTCGGTGGTAGATC [3] 20 TTCACGGTCCTTTTGGTCAC [4] 20 TGCGGTCGCGACCTTTCAGC [5] 20 TGCCCGCGCCTACAGTAGTG … … … [37] 20 AGAGTCGATTTGTCCAATCG [38] 20 AGAGGCGTTCGATCTTAGAC [39] 20 ACTGTCGTTTCAACGTTGAA [40] 20 ACCCCCGTCGTTAATTCGAC [41] 20 AACGCACGGGCGTGTTAGTC
RNAs <- RNAStringSet(barcodes$ID_REF)
RNAs
Error in .Call2(“new_XStringSet_from_CHARACTER”, class(x0), elementType(x0), : key 84 (char ‘T’) not in lookup table
Needs the strings to have U
Pro <- translate(cDNA)
Pro
## AAStringSet object of length 41:
## width seq
## [1] 6 FRGTES
## [2] 6 FRVGGR
## [3] 6 FTVLLV
## [4] 6 CGRDLS
## [5] 6 CPRLQ*
## ... ... ...
## [37] 6 RVDLSN
## [38] 6 RGVRS*
## [39] 6 TVVSTL
## [40] 6 TPVVNS
## [41] 6 NARAC*
AAStringSet object of length 41: width seq [1] 6 FRGTES [2] 6 FRVGGR [3] 6 FTVLLV [4] 6 CGRDLS [5] 6 CPRLQ … … … [37] 6 RVDLSN [38] 6 RGVRS [39] 6 TVVSTL [40] 6 TPVVNS [41] 6 NARAC*
Those are the one letter abbreviations for the 21 amino acids seen above.
It wasn’t working for what I needed, I believe it is due to the source not allowing what it thinks is web scraping off the site, or maybe user error. I didn’t get anything usable from AI generated code to retrieve the gene names for these 20 base long nucleotides of DNA as complementary DNA or cDNA. So, took a few days to enter them in manually.
==========================================
I had to use 2-3 days to tediously find all the gene associations top ranked for each 20 base pair cDNA strand using BLAST for Basic Local Alignment Scoring Tool or something similar in name.
Here are the details and you can get the file here.
To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, it is part of NIH or National Institute of Health. There is a text box at the top left to enter the query sequence where I pasted the cDNA 20bp string and then scrolled down to select the ‘Genomic + transcript databases’ in the ‘Database’ row, It selects Human Genomic + Transcript (Human G + T), then scroll down and select the option to open results in new window to easily get back to that window to enter in each of the 41 top genes used from our work earlier. Then select the tab at the bottom for the blue ‘BLAST’ for something similar to Basic Linear Alignmnet Scoring Tool that will give all results by a measure it ranks them by points in deductions for wrong and additions for correct in the string. I ignored all top section transcripts and scrolled down to 2nd section of genome results and selected the top score in ‘Per Ident’ column of 100%, then looked to other columns and the top listed max score, then the top score which was the highest, and sometimes if low and others in max score were higher in top score I would use those if the Per Ident field is 100% and the query cover was 100% or close. Query cover is 100% if all 20 base pairs found in the string of DNA. I found that many of the top gene base pair strands are from NC_000002.12, but it is a long gene coverage field and many other alternate genes overlap that section. I also avoided use of filling the geneSynonyms field of this database if there was no entry in genecards.org for an Ensembl ID. So some loci involved in other research but not named specifically as a gene or other were closer but had no Ensembl ID. And the populated first web link is the results of the top ranked genome range selected but some of the given or provided genes that were involved with that string on the range were not zoomed in as close to that string set but other genes were and so the closest gene was found from the chromosome view that is the blast web link found in selecting the ‘graphics’ link over the selected top ranked gene and then using the ‘link to this view’ option to copy the link to the graphic of the chromosome view specific to that target region. The bottom of the chromosome is zoomed in or panned in to the string searched and gene rank selected, from the top portion that has a tiny window of where that band of region is located and the green chunks along it also tiny are alternate genes or exons or mRNA, noncoding, uncharacterized loci, etc. that are in the neighborhood of the gene strip selected and searched. But not all were close to the string of nucleotides searched and some had overlap immediately when looking at the chromosome view. The reverse strand is the complementary DNA strand and the arrow points left, if the arrow points right it is the forward strand of DNA or opposite strand from the cDNA strand.
I didn’t go to genecards.org to actually put in the definition of each gene once I had all the information gathered that took an hour to get about five of the 41 strings each due to quality filtering and selection based on above procedure.
Lets go ahead and read in that table and list are top gene associations with the results that we couldn’t get from Bioconductor.
genes <- read.csv('barcodes41_cDNA_mRNA_AAs_BLAST_BLAT_ENSEMBL_added.csv',sep=',', header=T, na.strings=c('',' ','na','NA'))
str(genes)
## 'data.frame': 41 obs. of 9 variables:
## $ ID_REF : chr "TTTCGGGGAACCGAGTCGAT" "TTTAGAGTCGGTGGTAGATC" "TTCACGGTCCTTTTGGTCAC" "TGCGGTCGCGACCTTTCAGC" ...
## $ ensembl_blast_blat_short : chr "NC_000008.11" "NC_000002.12" "NC_000001.11" "NC_000002.12" ...
## $ geneSynonyms : chr "CPA6" "RPL7P61" "DDAH1" "KLHL29" ...
## $ location : chr "chromosome 8, GRCh38.p14 Primary Assembly, Range 1: 67542483 to 67542496, forward strand,\nSequence ID: NC_0000"| __truncated__ "chromosome 2, GRCh38.p14 Primary Assembly, Range 1: 163177186 to 163177200, reverse strand,\nSequence ID: NC_00"| __truncated__ "chromosome 1, GRCh38.p14 Primary Assembly, Range 1: 85369881 to 85369895, reverse strand\nSequence ID: NC_00000"| __truncated__ "chromosome 2, GRCh38.p14 Primary Assembly, Range 1: 23651127 to 23651139, reverse strand,\nSequence ID: NC_0000"| __truncated__ ...
## $ Ensembl_Name : chr "ENSG00000165078" "ENSG00000230282" "ENSG00000153904" "ENSG00000119771" ...
## $ weblink : chr "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815590" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815596" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815597" "https://blast.ncbi.nlm.nih.gov/Blast.cgi#alnHdr_568815596" ...
## $ blast_chromosome_graphics_view_link: chr "https://www.ncbi.nlm.nih.gov/nuccore/568815590?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815596?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815597?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ "https://www.ncbi.nlm.nih.gov/nuccore/568815596?report=graph&tracks=[key:sequence_track,name:Sequence,display_na"| __truncated__ ...
## $ note : chr "Alignment:\tQuery_780917 x NC_000008.11\nAnchor:\tNC_000008.11 (67,542,483..67,542,496)\nQuery:\tQuery_780917 ("| __truncated__ "Alignment:\tQuery_3907771 x NC_000002.12\nAnchor:\tNC_000002.12 (103,430,601..186,861,488)\nQuery:\tQuery_39077"| __truncated__ "Alignment:\tQuery_3572341 x NC_000001.11\nAnchor:\tNC_000001.11 (85,369,881..85,369,895)\nQuery:\tQuery_3572341"| __truncated__ "Alignment:\tQuery_3047309 x NC_000002.12\nAnchor:\tNC_000002.12 (23,651,127..23,651,139)\nQuery:\tQuery_3047309"| __truncated__ ...
## $ selecting_best_ranked_gene : chr "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ "To select the best ranked gene, I used the 20 base pair sequence of cDNA and the website, https://blast.ncbi.nl"| __truncated__ ...
Lets just view the genes by synonym name and by Ensembl ID
geneNames <- genes[,c(1,3,5)]
geneNames
## ID_REF geneSynonyms Ensembl_Name
## 1 TTTCGGGGAACCGAGTCGAT CPA6 ENSG00000165078
## 2 TTTAGAGTCGGTGGTAGATC RPL7P61 ENSG00000230282
## 3 TTCACGGTCCTTTTGGTCAC DDAH1 ENSG00000153904
## 4 TGCGGTCGCGACCTTTCAGC KLHL29 ENSG00000119771
## 5 TGCCCGCGCCTACAGTAGTG RPL31P30 ENSG00000230702
## 6 TGAATTTTAGAGTCGGTTTC ATP5MC3 ENSG00000154518
## 7 TAGTGGCGTGAGATTTGCGT RNU6-280P ENSG00000201015
## 8 TAGAGTACCGTTTTTGAACT TFPI ENSG00000003436
## 9 TAGACATGCAGTCGTTTCGA ST3GAL3 ENSG00000126091
## 10 GTGATTCCACAGTCGTTAAT CDH8 ENSG00000150394
## 11 GTGAGGATACAGTCGGTTTT CAMKMT ENSG00000143919
## 12 GTAGAGTCGTTACCCGACAC SMARCA2 ENSG00000080503
## 13 GTACCGTCGGTTGCTCGTGC POM121L15P ENSG00000161103
## 14 GGTGTTGTCAGAGTCGTTAA NLRP3 ENSG00000162711
## 15 GGTCCTGTCTTTTCTGCTGA ACTN4 ENSG00000130402
## 16 GGGCGTGTTTTTCTGGAGTA WDR35 ENSG00000118965
## 17 GGCTCGGAGTCGCTGAAAAT ESRP1 ENSG00000104413
## 18 GGAGTCGTCTTTTTATCCCC ERBB4 ENSG00000178568
## 19 GCTTCGCAGTCGTTAGAGTT TMEM200A ENSG00000164484
## 20 GCCGTCCTGTCTTTCTCATT MIR4432HG ENSG00000228590
## 21 GCCGAGTCGTTATGGACCCA KCNA6-AS1 ENSG00000256988
## 22 GAGTCGTTTAAAGGCTCTCT CLINT1 ENSG00000113282
## 23 GAGTCGTTCTCGTTTCGCAG AQP12B ENSG00000185176
## 24 CTCCAGAGCCGTTTTCGGTG HSD3B1 ENSG00000203857
## 25 CTATGGTCCCTTAGTGTTTA LZIC ENSG00000162441
## 26 CTATCAACAGAGTCGCTAAT CRLF3P3 ENSG00000228225
## 27 CGGTTAGAGTCGATAGCTTT HDAC4 ENSG00000068024
## 28 CGAGTCGTTTGACCGGCGCA SLIRPP1 ENSG00000227505
## 29 CCTCTCACCAGTCGTTTTGG SCHLAP1 ENSG00000281131
## 30 CCAGTCGATTCTTTTCATAT NDUFB5P1 ENSG00000251025
## 31 CCAGCAGAGTCGCTCGAAAT FAM20C ENSG00000177706
## 32 CACCGTCGTTTTTGTGACCG IGFBP7 ENSG00000163453
## 33 ATCGTCGTTTTAGCCGTAGG ST8SIA4 ENSG00000113532
## 34 ATCGTCGGTCTTAGCGGTCA TSHZ2 ENSG00000182463
## 35 AGATTAACCCAATACATTAT PIK3C2A ENSG00000011405
## 36 AGAGTCGCTCGTTAGGATCT CASC8 ENSG00000246228
## 37 AGAGTCGATTTGTCCAATCG PLSCR5 ENSG00000231213
## 38 AGAGGCGTTCGATCTTAGAC PDE4DIPP5 ENSG00000275064
## 39 ACTGTCGTTTCAACGTTGAA STX12 ENSG00000117758
## 40 ACCCCCGTCGTTAATTCGAC CACNA1E ENSG00000198216
## 41 AACGCACGGGCGTGTTAGTC TSNARE1 ENSG00000171045
Lets now combine this new information to our other database and see the results of the fold changes on the samples with the gene name, we will later add in the gene summaries and see what type of picture we can develop in our minds about the process of multiple sclerosis.
DATA <- merge(geneNames, barcodes, by.x='ID_REF', by.y="ID_REF")
str(DATA)
## 'data.frame': 41 obs. of 30 variables:
## $ ID_REF : chr "AACGCACGGGCGTGTTAGTC" "ACCCCCGTCGTTAATTCGAC" "ACTGTCGTTTCAACGTTGAA" "AGAGGCGTTCGATCTTAGAC" ...
## $ geneSynonyms : chr "TSNARE1" "CACNA1E" "STX12" "PDE4DIPP5" ...
## $ Ensembl_Name : chr "ENSG00000171045" "ENSG00000198216" "ENSG00000117758" "ENSG00000275064" ...
## $ control1.4362 : int 25 1 1 3 2 2 2 1 1 1 ...
## $ control2.4363 : int 46 1 1 4 2 1 1 1 1 1 ...
## $ control3.4364 : int 32 1 1 1 2 1 2 1 1 1 ...
## $ MS1_r1_4370 : int 1 25 49 63 70 94 160 17 78 103 ...
## $ MS1_r2_4371 : int 2 104 31 180 172 101 66 81 34 63 ...
## $ MS1_r3_4372 : int 1 34 35 107 68 49 75 70 50 85 ...
## $ MS1_r4_4373 : int 3 75 72 203 224 152 147 102 72 124 ...
## $ MS1_r5_4374 : int 8 70 71 179 220 98 123 103 70 142 ...
## $ MS2_r1_4375 : int 6 52 34 118 107 74 88 66 41 55 ...
## $ MS2_r2_4376 : int 6 60 51 94 126 85 75 110 44 61 ...
## $ MS2_r3_4377 : int 3 68 21 122 120 88 81 89 53 52 ...
## $ MS2_r4_4378 : int 3 75 67 112 132 95 75 74 49 75 ...
## $ MS2_r5_4379 : int 3 66 47 123 119 81 86 48 50 64 ...
## $ commercial1o.commercial_r1_4365 : int 5 65 56 143 108 70 62 68 69 96 ...
## $ commercial2o.commercial_r2_4366 : int 4 50 52 133 117 98 93 46 23 65 ...
## $ commercial3o.commercial_r3_4367 : int 3 79 65 153 181 103 102 85 66 123 ...
## $ commercial4o.commercial_r4_4368 : int 2 42 24 107 126 77 79 73 47 64 ...
## $ commercial5o.commercial_r5_4369 : int 4 46 67 113 74 70 85 56 35 107 ...
## $ controlMeans : num 34.33 1 1 2.67 2 ...
## $ MS1_Means : num 3 61.6 51.6 146.4 150.8 ...
## $ MS2_Means : num 4.2 64.2 44 113.8 120.8 ...
## $ commercial_Means : num 3.6 56.4 52.8 129.8 121.2 ...
## $ foldchange_MS1_vs_control : num 0.0874 61.6 51.6 54.9 75.4 ...
## $ foldchange_MS2_vs_control : num 0.122 64.2 44 42.675 60.4 ...
## $ foldchange_commercialMS_vs_control: num 9.537 0.0177 0.0189 0.0205 0.0165 ...
## $ mRNA : chr "UUGCGUGCCCGCACAAUCAG" "UGGGGGCAGCAAUUAAGCUG" "UGACAGCAAAGUUGCAACUU" "UCUCCGCAAGCUAGAAUCUG" ...
## $ aminoAcids_simulated : chr "-Leu-C-Val-CCC-Ala-CA-Ile-AG" "-Trp-GG-Ala-Ala-A-Leu-AG-Leu-" "U-Asp-A-Ala-Lys-Leu-Gln-Leu-" "U-Leu-C-Ala-AG-Leu-Glu-U-Leu-" ...
I know why the values are running opposite directions for the commercial MS and the actual participating MS patients. I accidentally put the control means over the commercial means in obtaining the fold change values. So we can go ahead and correct that right now.
Since its control means / commercial means, we have to multiply by commercial means/control means to get 1 then multiply by commercial means/control means.We still have those values in our data frame.
colnames(DATA)
## [1] "ID_REF" "geneSynonyms"
## [3] "Ensembl_Name" "control1.4362"
## [5] "control2.4363" "control3.4364"
## [7] "MS1_r1_4370" "MS1_r2_4371"
## [9] "MS1_r3_4372" "MS1_r4_4373"
## [11] "MS1_r5_4374" "MS2_r1_4375"
## [13] "MS2_r2_4376" "MS2_r3_4377"
## [15] "MS2_r4_4378" "MS2_r5_4379"
## [17] "commercial1o.commercial_r1_4365" "commercial2o.commercial_r2_4366"
## [19] "commercial3o.commercial_r3_4367" "commercial4o.commercial_r4_4368"
## [21] "commercial5o.commercial_r5_4369" "controlMeans"
## [23] "MS1_Means" "MS2_Means"
## [25] "commercial_Means" "foldchange_MS1_vs_control"
## [27] "foldchange_MS2_vs_control" "foldchange_commercialMS_vs_control"
## [29] "mRNA" "aminoAcids_simulated"
Change foldchange_commercialMS_vs_control so that it is multiplied by commercial_Means/controlMeans to get 1 and then multiply by commercial_Means/controlMeans again to get the actual fold change value we originally wanted. You can go back to the original publication posted earlier called, “Machine Learning on MS data 9M rows for Top Allele variant Genes” to see the error and why the values are not correlated with the other MS samples when they should be. Actually, you can just add the column now but it will be after selecting top genes. with just the actual commercial_Means/controlMeans.
DATA$RealFC_commercial_control_means <- DATA$commercial_Means / DATA$controlMeans
DATA[,c(26:28,31)]
## foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1 0.08737864 0.12233010
## 2 61.60000000 64.20000000
## 3 51.60000000 44.00000000
## 4 54.90000000 42.67500000
## 5 75.40000000 60.40000000
## 6 74.10000000 63.45000000
## 7 68.52000000 48.60000000
## 8 74.60000000 77.40000000
## 9 60.80000000 47.40000000
## 10 103.40000000 61.40000000
## 11 69.40000000 58.00000000
## 12 76.35000000 62.10000000
## 13 75.80000000 54.50000000
## 14 57.23076923 50.86153846
## 15 66.36000000 52.20000000
## 16 64.40000000 41.00000000
## 17 0.09230769 0.08076923
## 18 94.80000000 71.55000000
## 19 54.00000000 52.60000000
## 20 161.40000000 110.40000000
## 21 54.40000000 42.20000000
## 22 0.08955224 0.10746269
## 23 71.76000000 46.68000000
## 24 73.05000000 61.50000000
## 25 87.60000000 70.40000000
## 26 0.09320388 0.13980583
## 27 0.07800000 0.12600000
## 28 63.90000000 61.20000000
## 29 77.55000000 45.90000000
## 30 80.00000000 71.60000000
## 31 63.00000000 47.04000000
## 32 84.60000000 49.20000000
## 33 68.85000000 71.40000000
## 34 0.11186441 0.13220339
## 35 0.08250000 0.11250000
## 36 54.45000000 41.17500000
## 37 0.10843373 0.09397590
## 38 0.08219178 0.14794521
## 39 0.06075949 0.06835443
## 40 88.80000000 53.20000000
## 41 75.00000000 44.16000000
## foldchange_commercialMS_vs_control RealFC_commercial_control_means
## 1 9.537037037 0.10485437
## 2 0.017730496 56.40000000
## 3 0.018939394 52.80000000
## 4 0.020544427 48.67500000
## 5 0.016501650 60.60000000
## 6 0.015948963 62.70000000
## 7 0.019794141 50.52000000
## 8 0.015243902 65.60000000
## 9 0.020833333 48.00000000
## 10 0.010989011 91.00000000
## 11 0.013927577 71.80000000
## 12 0.018365473 54.45000000
## 13 0.021052632 47.50000000
## 14 0.020420986 48.96923077
## 15 0.020990764 47.64000000
## 16 0.019193858 52.10000000
## 17 7.878787879 0.12692308
## 18 0.014556041 68.70000000
## 19 0.024038462 41.60000000
## 20 0.007022472 142.40000000
## 21 0.020325203 49.20000000
## 22 10.151515152 0.09850746
## 23 0.018315018 54.60000000
## 24 0.017590150 56.85000000
## 25 0.016129032 62.00000000
## 26 8.583333333 0.11650485
## 27 9.259259259 0.10800000
## 28 0.015290520 65.40000000
## 29 0.015741834 63.52500000
## 30 0.014245014 70.20000000
## 31 0.023607177 42.36000000
## 32 0.019707207 50.74285714
## 33 0.015151515 66.00000000
## 34 8.939393939 0.11186441
## 35 11.111111111 0.09000000
## 36 0.022259321 44.92500000
## 37 10.641025641 0.09397590
## 38 12.166666667 0.08219178
## 39 14.629629630 0.06835443
## 40 0.016129032 62.00000000
## 41 0.017182131 58.20000000
You can see now that the values are correlated and not inversely related, as the foldchange is high in the commercial set when the participating MS patient samples are high, and low when those same samples are low. As it should be. Lets get rid of the other inverse relation fold change or name it inverse relation commercial. Lets do the latter just to see the inverse relation when control means are over the sample pathology means. Instead of vice versa.
colnames(DATA)[28] <- "inverse_FC_control_over_commercial_means"
DATA <- DATA[,c(1:27,31,28:30)]
colnames(DATA)
## [1] "ID_REF"
## [2] "geneSynonyms"
## [3] "Ensembl_Name"
## [4] "control1.4362"
## [5] "control2.4363"
## [6] "control3.4364"
## [7] "MS1_r1_4370"
## [8] "MS1_r2_4371"
## [9] "MS1_r3_4372"
## [10] "MS1_r4_4373"
## [11] "MS1_r5_4374"
## [12] "MS2_r1_4375"
## [13] "MS2_r2_4376"
## [14] "MS2_r3_4377"
## [15] "MS2_r4_4378"
## [16] "MS2_r5_4379"
## [17] "commercial1o.commercial_r1_4365"
## [18] "commercial2o.commercial_r2_4366"
## [19] "commercial3o.commercial_r3_4367"
## [20] "commercial4o.commercial_r4_4368"
## [21] "commercial5o.commercial_r5_4369"
## [22] "controlMeans"
## [23] "MS1_Means"
## [24] "MS2_Means"
## [25] "commercial_Means"
## [26] "foldchange_MS1_vs_control"
## [27] "foldchange_MS2_vs_control"
## [28] "RealFC_commercial_control_means"
## [29] "inverse_FC_control_over_commercial_means"
## [30] "mRNA"
## [31] "aminoAcids_simulated"
DATA[,c(26:29)]
## foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1 0.08737864 0.12233010
## 2 61.60000000 64.20000000
## 3 51.60000000 44.00000000
## 4 54.90000000 42.67500000
## 5 75.40000000 60.40000000
## 6 74.10000000 63.45000000
## 7 68.52000000 48.60000000
## 8 74.60000000 77.40000000
## 9 60.80000000 47.40000000
## 10 103.40000000 61.40000000
## 11 69.40000000 58.00000000
## 12 76.35000000 62.10000000
## 13 75.80000000 54.50000000
## 14 57.23076923 50.86153846
## 15 66.36000000 52.20000000
## 16 64.40000000 41.00000000
## 17 0.09230769 0.08076923
## 18 94.80000000 71.55000000
## 19 54.00000000 52.60000000
## 20 161.40000000 110.40000000
## 21 54.40000000 42.20000000
## 22 0.08955224 0.10746269
## 23 71.76000000 46.68000000
## 24 73.05000000 61.50000000
## 25 87.60000000 70.40000000
## 26 0.09320388 0.13980583
## 27 0.07800000 0.12600000
## 28 63.90000000 61.20000000
## 29 77.55000000 45.90000000
## 30 80.00000000 71.60000000
## 31 63.00000000 47.04000000
## 32 84.60000000 49.20000000
## 33 68.85000000 71.40000000
## 34 0.11186441 0.13220339
## 35 0.08250000 0.11250000
## 36 54.45000000 41.17500000
## 37 0.10843373 0.09397590
## 38 0.08219178 0.14794521
## 39 0.06075949 0.06835443
## 40 88.80000000 53.20000000
## 41 75.00000000 44.16000000
## RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 1 0.10485437 9.537037037
## 2 56.40000000 0.017730496
## 3 52.80000000 0.018939394
## 4 48.67500000 0.020544427
## 5 60.60000000 0.016501650
## 6 62.70000000 0.015948963
## 7 50.52000000 0.019794141
## 8 65.60000000 0.015243902
## 9 48.00000000 0.020833333
## 10 91.00000000 0.010989011
## 11 71.80000000 0.013927577
## 12 54.45000000 0.018365473
## 13 47.50000000 0.021052632
## 14 48.96923077 0.020420986
## 15 47.64000000 0.020990764
## 16 52.10000000 0.019193858
## 17 0.12692308 7.878787879
## 18 68.70000000 0.014556041
## 19 41.60000000 0.024038462
## 20 142.40000000 0.007022472
## 21 49.20000000 0.020325203
## 22 0.09850746 10.151515152
## 23 54.60000000 0.018315018
## 24 56.85000000 0.017590150
## 25 62.00000000 0.016129032
## 26 0.11650485 8.583333333
## 27 0.10800000 9.259259259
## 28 65.40000000 0.015290520
## 29 63.52500000 0.015741834
## 30 70.20000000 0.014245014
## 31 42.36000000 0.023607177
## 32 50.74285714 0.019707207
## 33 66.00000000 0.015151515
## 34 0.11186441 8.939393939
## 35 0.09000000 11.111111111
## 36 44.92500000 0.022259321
## 37 0.09397590 10.641025641
## 38 0.08219178 12.166666667
## 39 0.06835443 14.629629630
## 40 62.00000000 0.016129032
## 41 58.20000000 0.017182131
Lets now order this by decreasing order of fold change value by means of commercial samples over means of control samples and write it out to csv.
DATA1 <- DATA[order(DATA$RealFC_commercial_control_means, decreasing=T),]
head(DATA1)
## ID_REF geneSynonyms Ensembl_Name control1.4362
## 20 GAGTCGTTTAAAGGCTCTCT CLINT1 ENSG00000113282 1
## 10 CACCGTCGTTTTTGTGACCG IGFBP7 ENSG00000163453 1
## 11 CCAGCAGAGTCGCTCGAAAT FAM20C ENSG00000177706 1
## 30 GTAGAGTCGTTACCCGACAC SMARCA2 ENSG00000080503 1
## 18 CTCCAGAGCCGTTTTCGGTG HSD3B1 ENSG00000203857 1
## 33 TAGACATGCAGTCGTTTCGA ST3GAL3 ENSG00000126091 1
## control2.4363 control3.4364 MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## 20 1 1 125 185 97 209
## 10 1 1 103 63 85 124
## 11 1 1 85 47 65 83
## 30 1 1 82 67 55 99
## 18 2 1 141 84 61 190
## 33 1 2 61 114 65 103
## MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377 MS2_r4_4378 MS2_r5_4379
## 20 191 102 122 111 114 103
## 10 142 55 61 52 75 64
## 11 67 53 60 58 61 58
## 30 97 69 51 77 94 67
## 18 156 69 88 110 105 105
## 33 116 101 110 87 110 68
## commercial1o.commercial_r1_4365 commercial2o.commercial_r2_4366
## 20 132 155
## 10 96 65
## 11 55 84
## 30 78 82
## 18 104 92
## 33 88 105
## commercial3o.commercial_r3_4367 commercial4o.commercial_r4_4368
## 20 203 123
## 10 123 64
## 11 96 67
## 30 94 51
## 18 109 89
## 33 97 86
## commercial5o.commercial_r5_4369 controlMeans MS1_Means MS2_Means
## 20 99 1.000000 161.4 110.4
## 10 107 1.000000 103.4 61.4
## 11 57 1.000000 69.4 58.0
## 30 46 1.000000 80.0 71.6
## 18 64 1.333333 126.4 95.4
## 33 64 1.333333 91.8 95.2
## commercial_Means foldchange_MS1_vs_control foldchange_MS2_vs_control
## 20 142.4 161.40 110.40
## 10 91.0 103.40 61.40
## 11 71.8 69.40 58.00
## 30 70.2 80.00 71.60
## 18 91.6 94.80 71.55
## 33 88.0 68.85 71.40
## RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 20 142.4 0.007022472
## 10 91.0 0.010989011
## 11 71.8 0.013927577
## 30 70.2 0.014245014
## 18 68.7 0.014556041
## 33 66.0 0.015151515
## mRNA aminoAcids_simulated
## 20 CUCAGCAAAUUUCCGAGAGA -Leu-A-Ala-AA-Phe-CC-Glu-Arg-
## 10 GUGGCAGCAAAAACACUGGC -Val-Ala-Ala-Lys-Thr-Leu-GC
## 11 GGUCGUCUCAGCGAGCUUUA G-Val-GU-Leu-A-Ala-AGC-Phe-A
## 30 CAUCUCAGCAAUGGGCUGUG -His-Leu-A-Ala-A-Trp-G-Leu-UG
## 18 GAGGUCUCGGCAAAAGCCAC -Glu-GU-Leu-G-Ala-Lys-Ala-AC
## 33 AUCUGUACGUCAGCAAAGCU AU-Leu-Tyr-Val-A-Ala-AA-Ala-
tail(DATA1)
## ID_REF geneSynonyms Ensembl_Name control1.4362
## 1 AACGCACGGGCGTGTTAGTC TSNARE1 ENSG00000171045 25
## 22 GCCGTCCTGTCTTTCTCATT MIR4432HG ENSG00000228590 28
## 37 TGCCCGCGCCTACAGTAGTG RPL31P30 ENSG00000230702 27
## 35 TAGTGGCGTGAGATTTGCGT RNU6-280P ENSG00000201015 28
## 38 TGCGGTCGCGACCTTTCAGC KLHL29 ENSG00000119771 21
## 39 TTCACGGTCCTTTTGGTCAC DDAH1 ENSG00000153904 31
## control2.4363 control3.4364 MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## 1 46 32 1 2 1 3
## 22 16 23 1 3 2 3
## 37 24 32 1 1 4 4
## 35 29 23 1 1 3 3
## 38 22 30 1 2 2 2
## 39 23 25 2 2 2 1
## MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377 MS2_r4_4378 MS2_r5_4379
## 1 8 6 6 3 3 3
## 22 1 3 1 2 2 4
## 37 5 3 2 3 1 4
## 35 3 1 3 4 6 1
## 38 3 5 5 2 3 3
## 39 1 3 1 3 1 1
## commercial1o.commercial_r1_4365 commercial2o.commercial_r2_4366
## 1 5 4
## 22 2 2
## 37 5 1
## 35 3 4
## 38 2 3
## 39 3 2
## commercial3o.commercial_r3_4367 commercial4o.commercial_r4_4368
## 1 3 2
## 22 1 2
## 37 1 5
## 35 2 1
## 38 3 1
## 39 2 1
## commercial5o.commercial_r5_4369 controlMeans MS1_Means MS2_Means
## 1 4 34.33333 3.0 4.2
## 22 4 22.33333 2.0 2.4
## 37 1 27.66667 3.0 2.6
## 35 2 26.66667 2.2 3.0
## 38 1 24.33333 2.0 3.6
## 39 1 26.33333 1.6 1.8
## commercial_Means foldchange_MS1_vs_control foldchange_MS2_vs_control
## 1 3.6 0.08737864 0.12233010
## 22 2.2 0.08955224 0.10746269
## 37 2.6 0.10843373 0.09397590
## 35 2.4 0.08250000 0.11250000
## 38 2.0 0.08219178 0.14794521
## 39 1.8 0.06075949 0.06835443
## RealFC_commercial_control_means inverse_FC_control_over_commercial_means
## 1 0.10485437 9.537037
## 22 0.09850746 10.151515
## 37 0.09397590 10.641026
## 35 0.09000000 11.111111
## 38 0.08219178 12.166667
## 39 0.06835443 14.629630
## mRNA aminoAcids_simulated
## 1 UUGCGUGCCCGCACAAUCAG -Leu-C-Val-CCC-Ala-CA-Ile-AG
## 22 CGGCAGGACAGAAAGAGUAA CG-Ala-G-Asp-A-Glu-Arg-Val-A
## 37 ACGGGCGCGGAUGUCAUCAC -Thr-G-Ala-CG-Asp-Val-Ile-AC
## 35 AUCACCGCACUCUAAACGCA -Ile-Thr-Ala-Leu-U-Lys-C-Ala-
## 38 ACGCCAGCGCUGGAAAGUCG AC-Ala-A-Ala-Leu-Glu-A-Val-G
## 39 AAGUGCCAGGAAAACCAGUG AA-Val-Pro-G-Glu-AA-Pro-Val-
It seems to be ordered from enhancer genes at the top and silencer genes at the bottom. We will now write this out to csv and I will add in the gene summaries to these genes. These genes will be in our machine model to predict Lyme Disease, fibromyalgia, mononucleosis, multiple sclerosis, and when we get to it Burkett and Hodgkin lymphoma separately and then head and neck cancer or sarcoma related to EBV.
We have to get the data first on the lymphomas and the sarcoma of head and neck related to EBV, and then we have to see which media is the same and extracted the same or just do a whole other formatting of data extraction to get the same media in gene expression data from i.e. peripheral mononucleated blood cells, extracted with high throughput sequencing of same type, and counts or fragments, but all have to be one type and also not normalized by dividing by the max of each gene across all samples.
Lets list our genes now before writing it out to csv.
DATA1[,c(2,28)]
## geneSynonyms RealFC_commercial_control_means
## 20 CLINT1 142.40000000
## 10 IGFBP7 91.00000000
## 11 FAM20C 71.80000000
## 30 SMARCA2 70.20000000
## 18 HSD3B1 68.70000000
## 33 ST3GAL3 66.00000000
## 8 TSHZ2 65.60000000
## 28 NLRP3 65.40000000
## 29 POM121L15P 63.52500000
## 6 CASC8 62.70000000
## 25 ESRP1 62.00000000
## 40 RPL7P61 62.00000000
## 5 PLSCR5 60.60000000
## 41 CPA6 58.20000000
## 24 ERBB4 56.85000000
## 2 CACNA1E 56.40000000
## 23 TMEM200A 54.60000000
## 12 NDUFB5P1 54.45000000
## 3 STX12 52.80000000
## 16 CRLF3P3 52.10000000
## 32 CDH8 50.74285714
## 7 PIK3C2A 50.52000000
## 21 KCNA6-AS1 49.20000000
## 14 SLIRPP1 48.96923077
## 4 PDE4DIPP5 48.67500000
## 9 ST8SIA4 48.00000000
## 15 HDAC4 47.64000000
## 13 SCHLAP1 47.50000000
## 36 ATP5MC3 44.92500000
## 31 CAMKMT 42.36000000
## 19 AQP12B 41.60000000
## 17 LZIC 0.12692308
## 26 WDR35 0.11650485
## 34 TFPI 0.11186441
## 27 ACTN4 0.10800000
## 1 TSNARE1 0.10485437
## 22 MIR4432HG 0.09850746
## 37 RPL31P30 0.09397590
## 35 RNU6-280P 0.09000000
## 38 KLHL29 0.08219178
## 39 DDAH1 0.06835443
All genes above 1 in value are enhanced or upregulated, while those below the value of 1 are silenced or down regulated genes. We will write the DATA1 table to csv and you can get it here.
write.csv(DATA1, 'MS_top41genes_FCs_summariesNeeded.csv', row.names=F)
I will add in the gene summaries and share it as an attachment to same file link above when I do.
Thanks.