Top 41 barcodes common to the two participating multiple sclerosis patients and the commercially purchased multiple sclerosis comparison sample, taken from the top 50 and bottom 50 (top 50 enhancer and bottom 50 silencer) complementary DNA fragments, each 20 base pairs long and called barcodes, in NCBI GEO study GSE293036.
This analysis finds the top 100 complementary DNA (cDNA) fragments, each 20 base pairs long, ranked by mean values in two MS patients, a commercial MS patient sample for comparison, and a healthy control. The data come from study GSE293036; the details of how they were extracted are covered in the data extraction portion of this work. The resulting data frame, 8.8 million rows by 19 features, is loaded here.
data <- read.csv('allSampleRepeatsControlsMS1MS2Commercial.csv',sep=',',header=T, na.strings=c('',' ','na','NA'))
str(data)
## 'data.frame': 8838657 obs. of 19 variables:
## $ ID_REF : chr "TTTTTTTTTTTTTTCGTCCC" "TTTTTTTTTTTTTCCTTGCT" "TTTTTTTTTTTTGCAGTGAT" "TTTTTTTTTTTTCTGCTATG" ...
## $ control1.4362 : int 4 2 4 4 5 1 4 3 2 5 ...
## $ control2.4363 : int 4 1 1 1 5 2 3 1 3 2 ...
## $ control3.4364 : int 3 5 1 2 8 4 1 2 3 3 ...
## $ MS1_r1_4370 : int 3 5 3 3 15 23 2 4 11 11 ...
## $ MS1_r2_4371 : int 1 4 4 8 23 15 10 4 4 6 ...
## $ MS1_r3_4372 : int 3 3 3 2 16 7 5 10 7 1 ...
## $ MS1_r4_4373 : int 5 9 3 4 43 17 18 22 24 17 ...
## $ MS1_r5_4374 : int 6 12 5 9 26 12 21 8 5 8 ...
## $ MS2_r1_4375 : int 4 5 3 1 19 9 8 12 12 6 ...
## $ MS2_r2_4376 : int 8 3 7 8 27 19 6 7 13 5 ...
## $ MS2_r3_4377 : int 11 10 8 7 25 19 6 6 8 6 ...
## $ MS2_r4_4378 : int 3 8 4 4 19 22 11 8 20 4 ...
## $ MS2_r5_4379 : int 4 9 5 5 17 21 9 8 14 8 ...
## $ commercial1o.commercial_r1_4365: int 5 5 6 9 24 14 6 4 7 5 ...
## $ commercial2o.commercial_r2_4366: int 8 8 8 13 16 16 8 6 7 6 ...
## $ commercial3o.commercial_r3_4367: int 5 3 5 6 33 17 8 4 13 6 ...
## $ commercial4o.commercial_r4_4368: int 9 8 4 3 29 12 4 7 8 4 ...
## $ commercial5o.commercial_r5_4369: int 1 8 2 2 16 10 6 5 18 1 ...
colnames(data)
## [1] "ID_REF" "control1.4362"
## [3] "control2.4363" "control3.4364"
## [5] "MS1_r1_4370" "MS1_r2_4371"
## [7] "MS1_r3_4372" "MS1_r4_4373"
## [9] "MS1_r5_4374" "MS2_r1_4375"
## [11] "MS2_r2_4376" "MS2_r3_4377"
## [13] "MS2_r4_4378" "MS2_r5_4379"
## [15] "commercial1o.commercial_r1_4365" "commercial2o.commercial_r2_4366"
## [17] "commercial3o.commercial_r3_4367" "commercial4o.commercial_r4_4368"
## [19] "commercial5o.commercial_r5_4369"
data$controlMeans <- rowMeans(data[,2:4],na.rm=F,dims=1)
data$MS1_Means <- rowMeans(data[,5:9], na.rm=F, dims=1)
data$MS2_Means <- rowMeans(data[,10:14], na.rm=F, dims=1)
data$commercial_Means <- rowMeans(data[,15:19],na.rm=F, dims=1)
summary(data)
## ID_REF control1.4362 control2.4363 control3.4364
## Length:8838657 Min. : 1.00 Min. : 1.00 Min. : 1.00
## Class :character 1st Qu.: 6.00 1st Qu.: 5.00 1st Qu.: 6.00
## Mode :character Median : 10.00 Median : 10.00 Median : 10.00
## Mean : 12.06 Mean : 11.66 Mean : 11.76
## 3rd Qu.: 16.00 3rd Qu.: 16.00 3rd Qu.: 16.00
## Max. :724.00 Max. :634.00 Max. :693.00
## MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 12.00 1st Qu.: 14.00 1st Qu.: 10.00 1st Qu.: 21.00
## Median : 25.00 Median : 26.00 Median : 18.00 Median : 37.00
## Mean : 33.24 Mean : 33.42 Mean : 23.49 Mean : 47.73
## 3rd Qu.: 45.00 3rd Qu.: 44.00 3rd Qu.: 31.00 3rd Qu.: 62.00
## Max. :2287.00 Max. :2734.00 Max. :2089.00 Max. :3993.00
## MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 18.00 1st Qu.: 16.00 1st Qu.: 16.00 1st Qu.: 14.00
## Median : 32.00 Median : 28.00 Median : 27.00 Median : 25.00
## Mean : 40.83 Mean : 34.06 Mean : 33.71 Mean : 30.51
## 3rd Qu.: 53.00 3rd Qu.: 45.00 3rd Qu.: 44.00 3rd Qu.: 40.00
## Max. :3215.00 Max. :2398.00 Max. :2412.00 Max. :2127.00
## MS2_r4_4378 MS2_r5_4379 commercial1o.commercial_r1_4365
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 16.00 1st Qu.: 15.00 1st Qu.: 14.00
## Median : 28.00 Median : 25.00 Median : 25.00
## Mean : 34.03 Mean : 31.16 Mean : 31.17
## 3rd Qu.: 45.00 3rd Qu.: 41.00 3rd Qu.: 41.00
## Max. :2298.00 Max. :2173.00 Max. :2496.00
## commercial2o.commercial_r2_4366 commercial3o.commercial_r3_4367
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 13.00 1st Qu.: 15.00
## Median : 23.00 Median : 26.00
## Mean : 29.57 Mean : 34.86
## 3rd Qu.: 39.00 3rd Qu.: 45.00
## Max. :2226.00 Max. :3084.00
## commercial4o.commercial_r4_4368 commercial5o.commercial_r5_4369
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 11.00 1st Qu.: 11.00
## Median : 19.00 Median : 20.00
## Mean : 25.01 Mean : 25.54
## 3rd Qu.: 33.00 3rd Qu.: 34.00
## Max. :1908.00 Max. :1908.00
## controlMeans MS1_Means MS2_Means commercial_Means
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 6.00 1st Qu.: 17.20 1st Qu.: 16.40 1st Qu.: 13.80
## Median : 10.00 Median : 28.20 Median : 26.80 Median : 23.00
## Mean : 11.83 Mean : 35.74 Mean : 32.69 Mean : 29.23
## 3rd Qu.: 15.67 3rd Qu.: 45.60 3rd Qu.: 42.20 3rd Qu.: 37.60
## Max. :683.67 Max. :2853.20 Max. :2281.60 Max. :2324.40
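The group means above select replicate columns by position (2:4, 5:9, and so on), which silently breaks if the column order ever changes. A minimal sketch of selecting them by name instead, assuming the column naming shown in `colnames(data)`; the `toy` data frame and the `group_means` helper are hypothetical and exist only for illustration:

```r
# Toy stand-in for the real 8.8M-row data frame (values are made up).
toy <- data.frame(
  ID_REF        = c("AAAA", "CCCC"),
  control1.4362 = c(4, 2),
  control2.4363 = c(4, 1),
  control3.4364 = c(3, 5),
  MS1_r1_4370   = c(3, 5),
  MS1_r2_4371   = c(1, 4)
)

# Hypothetical helper: row means over every column whose name matches `pattern`.
group_means <- function(df, pattern) {
  cols <- grep(pattern, colnames(df), value = TRUE)
  rowMeans(df[, cols, drop = FALSE])
}

# The patterns require a digit or "_r" after the prefix, so derived columns
# such as controlMeans or MS1_Means are never picked up by accident.
toy$controlMeans <- group_means(toy, "^control[0-9]")
toy$MS1_Means    <- group_means(toy, "^MS1_r")
toy
```

On the real data the same call with patterns such as `"^MS2_r"` or `"^commercial[0-9]"` should pick out the five replicates of each group, though that has not been run here.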
Let's use the fold change of the MS1, MS2, and commercial MS patient samples relative to the control mean to measure how the pathology differs from healthy.
data$foldchange_MS1_vs_control <- data$MS1_Means/data$controlMeans
summary(data$foldchange_MS1_vs_control)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06076 2.14054 2.92258 3.31576 4.03333 161.40000
data$foldchange_MS2_vs_control <- data$MS2_Means/data$controlMeans
summary(data$foldchange_MS2_vs_control)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0625 2.1600 2.7120 2.9620 3.4645 110.4000
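# Note: this ratio is control/commercial, the inverse of the MS1 and MS2 ratios above,
# so values below 1 here mean higher counts in the commercial MS sample than in control.
data$foldchange_commercialMS_vs_control <- data$controlMeans/data$commercial_Means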
summary(data$foldchange_commercialMS_vs_control)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007023 0.302144 0.415225 0.470008 0.575540 17.916667
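The commercial ratio above is computed as control over commercial, while the MS1 and MS2 ratios are patient over control. A short sketch of why that only flips the sign on a log2 scale, using made-up means (the vectors `control_mean` and `ms_mean` are illustrative, not values from the study):

```r
# Made-up group means for three barcodes.
control_mean <- c(1, 2, 30)
ms_mean      <- c(60, 2, 3)

# Patient over control, as used for MS1 and MS2.
fc      <- ms_mean / control_mean
log2_fc <- log2(fc)

# Control over patient, as used for the commercial sample.
fc_inverted      <- control_mean / ms_mean
log2_fc_inverted <- log2(fc_inverted)

# The two log2 fold changes are mirror images: same magnitude, opposite sign.
data.frame(fc, log2_fc, fc_inverted, log2_fc_inverted)
```

Because sorting an inverted ratio only swaps the top and bottom of the ranking, the combined top 50 plus bottom 50 set is the same either way; only the direction of each value needs care when reading it.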
Next, take the top 50 and bottom 50 barcodes when ordered by each fold change.
top50bottom50_MS1_FC <- data[order(data$foldchange_MS1_vs_control,decreasing=T)[c(1:50,8838608:8838657)],]
top100_MS1_cDNA <- top50bottom50_MS1_FC$ID_REF
top100_MS1_cDNA
## [1] "GAGTCGTTTAAAGGCTCTCT" "CACCGTCGTTTTTGTGACCG" "CCCATAGCGATCTAACTTTT"
## [4] "CTCCAGAGCCGTTTTCGGTG" "TTTAGAGTCGGTGGTAGATC" "GGCTCGGAGTCGCTGAAAAT"
## [7] "GTGATTCCACAGTCGTTAAT" "TCTGCTCTCTTTACTTATAC" "GTAGAGTCGTTACCCGACAC"
## [10] "GTACCGTCGGTTGCTCGTGC" "CCAGTCGATTCTTTTCATAT" "CCTCTCACCAGTCGTTTTGG"
## [13] "AGAGTCGATTTGTCCAATCG" "TTTCGGGGAACCGAGTCGAT" "ATCGTCGGTCTTAGCGGTCA"
## [16] "AGAGTCGCTCGTTAGGATCT" "GGAGTCGTCTTTTTATCCCC" "GCTTCGCAGTCGTTAGAGTT"
## [19] "CCAGCAGAGTCGCTCGAAAT" "TAGACATGCAGTCGTTTCGA" "ATCTTCGTTTTTCTTTCGGA"
## [22] "AGATTAACCCAATACATTAT" "CGGTTAGAGTCGATAGCTTT" "CTATCAACAGAGTCGCTAAT"
## [25] "GGTGTTGTCAGAGTCGTTAA" "GTGAGGATACAGTCGGTTTT" "ACCCCCGTCGTTAATTCGAC"
## [28] "ATCGTCGTTTTAGCCGTAGG" "AGAGTCGCTCAACTCCGACT" "CTTGCTCCAGGTCAGAGCGA"
## [31] "ATGAGTCGTTTCGTGTTTGG" "GGATGATACTGTCGTTTTCG" "TTAGAGTCGTCGGTTTTACT"
## [34] "TTAATGTCGTTTTTTGGCGA" "CGAGTCGTTTGACCGGCGCA" "ATTCGGGTACCGTCGGTTTT"
## [37] "CATTGTCGTTTTGAGACCGG" "AGACCCGAGCCGTTTTCTTC" "AGAGGCGTTCGATCTTAGAC"
## [40] "AACATTTAAGATCCGGGTTG" "TGAATTTTAGAGTCGGTTTC" "GCCGAGTCGTTATGGACCCA"
## [43] "AGGAGTCGTTAATTCTGATC" "GAGTCGTTCTCGTTTCGCAG" "ACCGTCGCTTGAGGTCAGAT"
## [46] "CAGCGTCGTTTCTTCGTAGT" "GACACCGGTCGTTTGTCAGC" "GGGGGAGTCGTTTCGCTCCA"
## [49] "ACTGTCGTTTCAACGTTGAA" "GGTGCGTGGTCGTTTTGAGA" "TAGAGTACCGTTTTTGAACT"
## [52] "AACCAGAGTAGCGTTTGCTT" "ACCCGGACCCTTGACTCACC" "GCATCCCGGGCGCGTCTAAC"
## [55] "TGCCCGCGCCTACAGTAGTG" "AAGTGTGGCCCTTTGGGTTT" "CATGCCGGTCCCTTTATCTT"
## [58] "ACTTCCGGGTCCCTTTCGTC" "GGCTTTTTTTTTTCTTTGTG" "GCAGTCCTGTTTTACTCCCG"
## [61] "TAGGGCCCTTTCTTCGCCAG" "TCCGGTCGCGTGGCTCATAC" "CGCCTCCCCGGGCCTTAATT"
## [64] "CCGTGGCTTTTTTTCTTACG" "TGAGAGTACCCGGGCCTTTC" "ATCCGGGTGGCGCTTTTTTC"
## [67] "ACGGCCCCTCTTTGCCCATT" "CTGTCCGGCCCTGTCTTATT" "GCACAATTTTCATGTGGGAC"
## [70] "GGGCGTGTTTTTCTGGAGTA" "CAGGGGCGTGAGCTTTCTGT" "CTATGGTCCCTTAGTGTTTA"
## [73] "CGGGCCTTTCTAGTCATCAG" "GGCGGGTCTTGTGTTTTGCT" "GCCGTCCTGTCTTTCTCATT"
## [76] "GCGGTCCCTTAGCTCTTCCG" "ACCGGCCTTTTTGGCAGGTC" "GCCCGGGCTTGTAGGTCTTT"
## [79] "ATGCTGGCCTTTGTATTTAC" "CGGGTGGCGGGGTTTTTATC" "AACGCACGGGCGTGTTAGTC"
## [82] "GTGCGGGCCCTTCGTCCTGT" "ACACTGGCGCGTTTTTCCCA" "ACGGTGGCTTCTCTTACGTG"
## [85] "ATGTCGCGGCGTGTGGTTTT" "TAGTGGCGTGAGATTTGCGT" "TAGACACGGGCCTTTGCTAC"
## [88] "TGCGGTCGCGACCTTTCAGC" "GGGGTCCTTTTATCCTAATC" "GTGGCACAGGGTCGCGTAAA"
## [91] "TTGGTGTGGTGTTTGTTCCA" "GGTCCTGTCTTTTCTGCTGA" "CGGGCCTGAGTTTTCTACGC"
## [94] "GTGGGCCCCTTTGATTCTTC" "AGGGTCCTTTGGGGTCAGAA" "TTACGGCCGCGGTTTTACTG"
## [97] "TGTCGCGTATTTTCTCCAAA" "CCATGGTCGTGTACCGTTAA" "GGCCGGCCCTTTAGGCTTGA"
## [100] "TTCACGGTCCTTTTGGTCAC"
top50bottom50_MS2_FC <- data[order(data$foldchange_MS2_vs_control,decreasing=T)[c(1:50,8838608:8838657)],]
top100_MS2_cDNA <- top50bottom50_MS2_FC$ID_REF
top100_MS2_cDNA
## [1] "GAGTCGTTTAAAGGCTCTCT" "ATCGTCGGTCTTAGCGGTCA" "GTAGAGTCGTTACCCGACAC"
## [4] "CTCCAGAGCCGTTTTCGGTG" "TAGACATGCAGTCGTTTCGA" "GGCTCGGAGTCGCTGAAAAT"
## [7] "ACCCCCGTCGTTAATTCGAC" "AGAGTCGCTCGTTAGGATCT" "CCAGTCGATTCTTTTCATAT"
## [10] "ACCGCGAGTCGCTTGAACTC" "GGAGTCGTCTTTTTATCCCC" "CACCGTCGTTTTTGTGACCG"
## [13] "GGTGTTGTCAGAGTCGTTAA" "AGAGTCGATTTGTCCAATCG" "CCAGCAGAGTCGCTCGAAAT"
## [16] "CCTCTCACCAGTCGTTTTGG" "AGACCCGAGCCGTTTTCTTC" "TTTAGAGTCGGTGGTAGATC"
## [19] "GAGTCGTTCTCGTTTCGCAG" "CGGTTAGAGTCGATAGCTTT" "AGAGTCGCTCAACTCCGACT"
## [22] "TGTATCCACCCCCGCCCTAT" "CAGCGTCGTTTCTTCGTAGT" "CGAGTCGTTTGACCGGCGCA"
## [25] "TCCGAGTCGATTTCGCTAAC" "CGACCAGTCGTTTATACACC" "GTGATTCCACAGTCGTTAAT"
## [28] "TAACGGAGTCGTTTTTCAAG" "AGATTAACCCAATACATTAT" "ATCGTCGTTTTAGCCGTAGG"
## [31] "GTGAGGATACAGTCGGTTTT" "GCTTCGCAGTCGTTAGAGTT" "TAGAGTCGTTCTCTACGCGA"
## [34] "GTACCGTCGGTTGCTCGTGC" "GGGTTCCGAGTCGTTCAAGT" "GCTATCGGCGTTTTCGTATT"
## [37] "ATGAGTCGTTTCGTGTTTGG" "TTTCGGGGAACCGAGTCGAT" "ACTGTCGTTTCAACGTTGAA"
## [40] "TAGCGCCGTTGTTGTTCTTA" "TTAGAGTCGTCGGTTTTACT" "AGAGGCGTTCGATCTTAGAC"
## [43] "ACCGTCGCTTGAGGTCAGAT" "GCCGAGTCGTTATGGACCCA" "AGATGCCAGTCGTTTCTCTT"
## [46] "TGAATTTTAGAGTCGGTTTC" "CTAAAGCGTCGCTTGTAGTT" "TTTACCGGGGCCGAGTCGCT"
## [49] "CTATCAACAGAGTCGCTAAT" "ATTCGGGTACCGTCGGTTTT" "TTGTTATCGTTATAGGCGTG"
## [52] "TGAAAAGTGGCGAGTCTATT" "GGTGGCGGGCCTTTATACCT" "TGCGTATGGTCGCGTCTTGC"
## [55] "CTCGATGGCGTGTAGTGTAG" "CTTTATCTGATACAGTAGTG" "TAATAAACCCGATAGTGTAG"
## [58] "TGCGCGGGCGCGTTTCGATA" "GCCAGGGCCCCTTTCGTCAT" "GGTCACAGTAGTGTCGAGCT"
## [61] "GGAACCAGTGTAGTGAAGAG" "TGCGGTCGCGACCTTTCAGC" "ACTTCCACTTTTTAGTGGCG"
## [64] "ACGGCCCCTCTTTGCCCATT" "TGTAGTGCTATTGGCGTGTC" "TTGTAGGCGTGTATTTTCTA"
## [67] "TCGGTGTATTTTTAGCGGCG" "GGCTACCTCGAAGAGTAGTG" "GGGCGTGTTTTTCTGGAGTA"
## [70] "GTCAGTGGCCTGTACGTTTC" "CGCTCGGGCCTGTTTTCTCA" "TCATAGCGTAGTGTGGCTTA"
## [73] "TGAAGTGTAGTGGATCATTT" "GCTGATACCGCGTAGTGTAG" "AATTGCGGCCCTTCCATTTT"
## [76] "TAGAGTACCGTTTTTGAACT" "TAGTGAAGTGTCCCATCGCA" "AGCTCTAGGGCCCCTTTTCG"
## [79] "CGGTCTGTAGGAGTGTCGTG" "GGTCCTGTCTTTTCTGCTGA" "TCTATGTACTTACCGTAGTG"
## [82] "AACGCACGGGCGTGTTAGTC" "CGTTCCATGGTAGTCTAGTG" "CTATCCCAAGTAGTGTATTG"
## [85] "ACCGGCCTTTTTGGCAGGTC" "TGTAGATTACTGTAGTGGCG" "TAGTGGCGTGAGATTTGCGT"
## [88] "GTGGGCCCCTTTGATTCTTC" "GCCGTCCTGTCTTTCTCATT" "TCATACTTACCTGCCTTTAA"
## [91] "TTTGCCACGGGCGCGTTTCA" "TGAAATACGTCAGTGTAGTG" "TCCCGGGGCCTCTGTTTTAT"
## [94] "AACCAGAGTAGCGTTTGCTT" "TACAGTCCTTTCTGTTGACG" "TGCCCGCGCCTACAGTAGTG"
## [97] "TTCTAGTAGTGTCCTGTACC" "CTATGGTCCCTTAGTGTTTA" "TTCACGGTCCTTTTGGTCAC"
## [100] "CTTCTGTTAGTGTAGTGTTG"
top50bottom50_commercial_FC <- data[order(data$foldchange_commercialMS_vs_control,decreasing=T)[c(1:50,8838608:8838657)],]
top100_commercialMS_cDNA <- top50bottom50_commercial_FC$ID_REF
top100_commercialMS_cDNA
## [1] "TTACGGCCGCGGTTTTACTG" "TTAGCGACGTGTACAGCCTG" "TCTGCTTACGGTCCCTTTTA"
## [4] "TTCACGGTCCTTTTGGTCAC" "GCCAGGGCCCCTTTCGTCAT" "TGAAAAGTGGCGAGTCTATT"
## [7] "TGCGGTCGCGACCTTTCAGC" "GTCAGTGGCCTGTACGTTTC" "GACAGTGTAGTGAATATTGT"
## [10] "TCGGTGGTAGGGTCCTTTTC" "TAGTGGCGTGAGATTTGCGT" "TGTAGATTACTGTAGTGGCG"
## [13] "TGCCCGCGCCTACAGTAGTG" "GCATTCAGAGTAGTGTGTCT" "GGCGCCTAAATTTATCTTTT"
## [16] "GCCGTCCTGTCTTTCTCATT" "CAAATCAACCCTTAGTGGCG" "AACGCACGGGCGTGTTAGTC"
## [19] "ACAGGCCTGTCTTATGTTTG" "CGGTCTGTAGGAGTGTCGTG" "ACAGTAGGGTCTTGGCTGCT"
## [22] "GGTCCTGTCTTTTCTGCTGA" "TAGAGTACCGTTTTTGAACT" "ATGGACCTGTTTTCTTTTAG"
## [25] "CGCTCGGGCCTGTTTTCTCA" "CTTCTGTTAGTGTAGTGTTG" "CTGTCCGGCCCTGTCTTATT"
## [28] "CAATATCGGTCCTGTTTTTT" "CGGGCCTTTCTAGTCATCAG" "CGGGTGGCGGGGTTTTTATC"
## [31] "TGCGTATGGTCGCGTCTTGC" "GGTCACAGTAGTGTCGAGCT" "GGGCGTGTTTTTCTGGAGTA"
## [34] "GCTCGTGGGCCCTTTTTCGT" "ACTTCCGGGTCCCTTTCGTC" "ATCCGGGTGGCGCTTTTTTC"
## [37] "GGGGGTCCTTTTTGAATTCG" "ATTGGCCTGTATTATTGCGC" "ACGGGCCTCTTTGCTCGTGT"
## [40] "ACCCGGACCCTTGACTCACC" "AATACGGGCCCGTGTTACCC" "GTGCGGGCCCTTCGTCCTGT"
## [43] "CAAGCAGTCCTTTCTTTTAA" "CGCACCGGGGTCCCGTTTTT" "CGGACCCGGTAGTGTAGCTT"
## [46] "AGTTCAGGGGCCCTTTCTCG" "GCCCTGGCCCTTTATCTTGA" "GCCTCCGGCCCTTTTCCTTC"
## [49] "AAACCGCGGGCCCTTTAGGA" "CTATGGTCCCTTAGTGTTTA" "AACATTTAAGATCCGGGTTG"
## [52] "AGTGCACATTTTAACCGATC" "GGGGGAGTCGTTTCGCTCCA" "GAGTCGTTCTCGTTTCGCAG"
## [55] "AGTCTGTGGGCGGAAAGATG" "TTTTACAGTCGTTCGGATGT" "TGTCAAGTCGTTTGTGTTGA"
## [58] "GTGAGGATACAGTCGGTTTT" "TAACGGAGTCGTTTTTCAAG" "AGCTTCGTTTTTCGTTACGG"
## [61] "ACCGCGAGTCGCTTGAACTC" "CGGACCCGGTCGATTCGGTA" "CCAGTCGTTTTGACTAGGCC"
## [64] "TGAATTTTAGAGTCGGTTTC" "CCTCTCACCAGTCGTTTTGG" "CGGTTAGAGTCGATAGCTTT"
## [67] "ATCGTCGTTTTAGCCGTAGG" "TTAATGTCGTTTTTTGGCGA" "AGAGGCGTTCGATCTTAGAC"
## [70] "CGAGTCGTTTGACCGGCGCA" "GCCGAGTCGTTATGGACCCA" "AGATTAACCCAATACATTAT"
## [73] "GTGATTCCACAGTCGTTAAT" "CAAGGGATATCCACTTGCGT" "TAGAGTCGTTCTCTACGCGA"
## [76] "CTATCAACAGAGTCGCTAAT" "TCCGAGTCGATTTCGCTAAC" "ACTGTCGTTTCAACGTTGAA"
## [79] "CCAGTCGATTCTTTTCATAT" "GCTTCGCAGTCGTTAGAGTT" "CGGGGCTAGGTACAGTGATC"
## [82] "ACCCCCGTCGTTAATTCGAC" "GGAGTCGTCTTTTTATCCCC" "TTTCGGGGAACCGAGTCGAT"
## [85] "GGTCATGACCGTTCCGTTAA" "AGAGTCGATTTGTCCAATCG" "TTTAGAGTCGGTGGTAGATC"
## [88] "GGCTCGGAGTCGCTGAAAAT" "AGAGTCGCTCGTTAGGATCT" "GTACCGTCGGTTGCTCGTGC"
## [91] "GGTGTTGTCAGAGTCGTTAA" "ATCGTCGGTCTTAGCGGTCA" "TAGACATGCAGTCGTTTCGA"
## [94] "CTCCAGAGCCGTTTTCGGTG" "GTAGAGTCGTTACCCGACAC" "CCAGCAGAGTCGCTCGAAAT"
## [97] "ATGTTTTAATTGCTATAAGA" "CACCGTCGTTTTTGTGACCG" "CTATCCCGAGATCCGGCTGG"
## [100] "GAGTCGTTTAAAGGCTCTCT"
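The three selections above repeat the same pattern with hard-coded row positions (8838608:8838657). A minimal sketch of a reusable helper that derives the positions from the data instead; `top_bottom_ids` and the `toy` data frame are hypothetical names introduced for illustration:

```r
# Hypothetical helper: IDs of the n largest and n smallest values of `fc_col`.
top_bottom_ids <- function(df, fc_col, n = 50) {
  ord  <- order(df[[fc_col]], decreasing = TRUE)
  keep <- c(head(ord, n), tail(ord, n))   # first n and last n positions
  df$ID_REF[keep]
}

# Toy data: six barcodes with made-up fold changes.
toy <- data.frame(
  ID_REF = c("a", "b", "c", "d", "e", "f"),
  fc     = c(0.5, 9, 1, 0.1, 4, 2),
  stringsAsFactors = FALSE
)

ids <- top_bottom_ids(toy, "fc", n = 2)
ids
```

The helper assumes the data frame has at least `2 * n` rows; with fewer rows the top and bottom selections would overlap.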
Are there any strands common to all 3 sets of top 100 barcodes?
common1 <- top100_commercialMS_cDNA[which(top100_commercialMS_cDNA %in% top100_MS1_cDNA)]
common2 <- top100_MS1_cDNA[which(top100_MS1_cDNA %in% top100_MS2_cDNA)]
commonAll3 <- common1[which(common1 %in% common2 )]
commonAll3
## [1] "TTCACGGTCCTTTTGGTCAC" "TGCGGTCGCGACCTTTCAGC" "TAGTGGCGTGAGATTTGCGT"
## [4] "TGCCCGCGCCTACAGTAGTG" "GCCGTCCTGTCTTTCTCATT" "AACGCACGGGCGTGTTAGTC"
## [7] "GGTCCTGTCTTTTCTGCTGA" "TAGAGTACCGTTTTTGAACT" "GGGCGTGTTTTTCTGGAGTA"
## [10] "CTATGGTCCCTTAGTGTTTA" "GAGTCGTTCTCGTTTCGCAG" "GTGAGGATACAGTCGGTTTT"
## [13] "TGAATTTTAGAGTCGGTTTC" "CCTCTCACCAGTCGTTTTGG" "CGGTTAGAGTCGATAGCTTT"
## [16] "ATCGTCGTTTTAGCCGTAGG" "AGAGGCGTTCGATCTTAGAC" "CGAGTCGTTTGACCGGCGCA"
## [19] "GCCGAGTCGTTATGGACCCA" "AGATTAACCCAATACATTAT" "GTGATTCCACAGTCGTTAAT"
## [22] "CTATCAACAGAGTCGCTAAT" "ACTGTCGTTTCAACGTTGAA" "CCAGTCGATTCTTTTCATAT"
## [25] "GCTTCGCAGTCGTTAGAGTT" "ACCCCCGTCGTTAATTCGAC" "GGAGTCGTCTTTTTATCCCC"
## [28] "TTTCGGGGAACCGAGTCGAT" "AGAGTCGATTTGTCCAATCG" "TTTAGAGTCGGTGGTAGATC"
## [31] "GGCTCGGAGTCGCTGAAAAT" "AGAGTCGCTCGTTAGGATCT" "GTACCGTCGGTTGCTCGTGC"
## [34] "GGTGTTGTCAGAGTCGTTAA" "ATCGTCGGTCTTAGCGGTCA" "TAGACATGCAGTCGTTTCGA"
## [37] "CTCCAGAGCCGTTTTCGGTG" "GTAGAGTCGTTACCCGACAC" "CCAGCAGAGTCGCTCGAAAT"
## [40] "CACCGTCGTTTTTGTGACCG" "GAGTCGTTTAAAGGCTCTCT"
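The three-way overlap above is built from two pairwise `%in%` filters. An equivalent and more compact way to express it, sketched here on made-up sets, is to fold `intersect` over a list with `Reduce`:

```r
# Made-up barcode sets standing in for the three top-100 vectors.
set_a <- c("AAA", "CCC", "GGG", "TTT")
set_b <- c("CCC", "GGG", "TTT", "ACG")
set_c <- c("TTT", "CCC", "TGA")

# Intersect all sets in the list, two at a time.
common <- Reduce(intersect, list(set_a, set_b, set_c))
common
```

This form extends to any number of sets without extra intermediate variables.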
Great, some barcodes are common to all 3 top-100 fold change sets, 41 in total. Let's make a separate data frame of these 41 common cDNA barcodes.
top41 <- data[which(data$ID_REF %in% commonAll3),]
summary(top41)
## ID_REF control1.4362 control2.4363 control3.4364
## Length:41 Min. : 1.000 Min. : 1.000 Min. : 1.000
## Class :character 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Median : 2.000 Median : 1.000 Median : 2.000
## Mean : 7.512 Mean : 7.634 Mean : 7.927
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :33.000 Max. :46.000 Max. :38.000
## MS1_r1_4370 MS1_r2_4371 MS1_r3_4372 MS1_r4_4373
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1
## 1st Qu.: 17.00 1st Qu.: 31.00 1st Qu.: 34.00 1st Qu.: 61
## Median : 78.00 Median : 76.00 Median : 55.00 Median :103
## Mean : 78.76 Mean : 79.66 Mean : 55.98 Mean :115
## 3rd Qu.:120.00 3rd Qu.:114.00 3rd Qu.: 75.00 3rd Qu.:180
## Max. :225.00 Max. :247.00 Max. :206.00 Max. :298
## MS1_r5_4374 MS2_r1_4375 MS2_r2_4376 MS2_r3_4377
## Min. : 1.0 Min. : 1.00 Min. : 1.00 Min. : 2.00
## 1st Qu.: 42.0 1st Qu.: 34.00 1st Qu.: 43.00 1st Qu.: 21.00
## Median : 99.0 Median : 69.00 Median : 71.00 Median : 77.00
## Mean :100.1 Mean : 67.07 Mean : 67.17 Mean : 64.29
## 3rd Qu.:134.0 3rd Qu.: 93.00 3rd Qu.: 89.00 3rd Qu.: 91.00
## Max. :311.0 Max. :256.00 Max. :228.00 Max. :200.00
## MS2_r4_4378 MS2_r5_4379 commercial1o.commercial_r1_4365
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 36.00 1st Qu.: 26.00 1st Qu.: 50.00
## Median : 75.00 Median : 64.00 Median : 71.00
## Mean : 69.49 Mean : 60.73 Mean : 72.17
## 3rd Qu.: 95.00 3rd Qu.: 87.00 3rd Qu.: 98.00
## Max. :241.00 Max. :177.00 Max. :251.00
## commercial2o.commercial_r2_4366 commercial3o.commercial_r3_4367
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 23.00 1st Qu.: 54.00
## Median : 70.00 Median : 94.00
## Mean : 68.54 Mean : 85.85
## 3rd Qu.: 98.00 3rd Qu.:111.00
## Max. :200.00 Max. :254.00
## commercial4o.commercial_r4_4368 commercial5o.commercial_r5_4369
## Min. : 1.00 Min. : 1.00
## 1st Qu.: 24.00 1st Qu.: 29.00
## Median : 67.00 Median : 60.00
## Mean : 60.68 Mean : 57.95
## 3rd Qu.: 86.00 3rd Qu.: 77.00
## Max. :207.00 Max. :185.00
## controlMeans MS1_Means MS2_Means commercial_Means
## Min. : 1.000 Min. : 1.6 Min. : 1.40 Min. : 1.80
## 1st Qu.: 1.000 1st Qu.: 51.6 1st Qu.: 42.20 1st Qu.: 41.60
## Median : 1.667 Median : 88.8 Median : 73.60 Median : 71.80
## Mean : 7.691 Mean : 85.9 Mean : 65.75 Mean : 69.04
## 3rd Qu.: 4.333 3rd Qu.:125.0 3rd Qu.: 87.00 3rd Qu.: 91.60
## Max. :34.333 Max. :248.0 Max. :220.40 Max. :212.20
## foldchange_MS1_vs_control foldchange_MS2_vs_control
## Min. : 0.06076 Min. : 0.06835
## 1st Qu.: 51.60000 1st Qu.: 41.00000
## Median : 64.40000 Median : 48.60000
## Mean : 55.81724 Mean : 43.15910
## 3rd Qu.: 75.40000 3rd Qu.: 61.40000
## Max. :161.40000 Max. :110.40000
## foldchange_commercialMS_vs_control
## Min. : 0.007023
## 1st Qu.: 0.016129
## Median : 0.019707
## Mean : 2.523062
## 3rd Qu.: 0.024039
## Max. :14.629630
Let's write the full data and the top 41 barcodes out to CSV to use as needed.
write.csv(data,'foldchange3setsVsControlMeans.csv',row.names=F)
write.csv(top41,'top41genesCommonToAllFoldchangeValues3groupsMS.csv',row.names=F)
We will test these barcodes in Bioconductor to see if it is working today, and later with random forest modeling to see whether they can predict the class of a sample as healthy or multiple sclerosis pathology.
=====================================================================
Machine Learning with top 41 gene allele variants in multiple sclerosis data.
top41 <- read.csv('top41genesCommonToAllFoldchangeValues3groupsMS.csv',sep=',', header=T, na.strings=c('',' ','na','NA'))
ID_REF <- top41$ID_REF
samples <- colnames(top41)[2:19]
colnames(top41)
## [1] "ID_REF" "control1.4362"
## [3] "control2.4363" "control3.4364"
## [5] "MS1_r1_4370" "MS1_r2_4371"
## [7] "MS1_r3_4372" "MS1_r4_4373"
## [9] "MS1_r5_4374" "MS2_r1_4375"
## [11] "MS2_r2_4376" "MS2_r3_4377"
## [13] "MS2_r4_4378" "MS2_r5_4379"
## [15] "commercial1o.commercial_r1_4365" "commercial2o.commercial_r2_4366"
## [17] "commercial3o.commercial_r3_4367" "commercial4o.commercial_r4_4368"
## [19] "commercial5o.commercial_r5_4369" "controlMeans"
## [21] "MS1_Means" "MS2_Means"
## [23] "commercial_Means" "foldchange_MS1_vs_control"
## [25] "foldchange_MS2_vs_control" "foldchange_commercialMS_vs_control"
samplesOnly <- top41[,c(2:19)]
ML_data <- data.frame(t(samplesOnly))
colnames(ML_data) <- ID_REF
head(ML_data)
## TTTCGGGGAACCGAGTCGAT TTTAGAGTCGGTGGTAGATC TTCACGGTCCTTTTGGTCAC
## control1.4362 1 1 31
## control2.4363 1 1 23
## control3.4364 3 1 25
## MS1_r1_4370 119 99 2
## MS1_r2_4371 106 89 2
## MS1_r3_4372 101 49 2
## TGCGGTCGCGACCTTTCAGC TGCCCGCGCCTACAGTAGTG TGAATTTTAGAGTCGGTTTC
## control1.4362 21 27 2
## control2.4363 22 24 4
## control3.4364 30 32 2
## MS1_r1_4370 1 1 135
## MS1_r2_4371 2 1 124
## MS1_r3_4372 2 4 99
## TAGTGGCGTGAGATTTGCGT TAGAGTACCGTTTTTGAACT TAGACATGCAGTCGTTTCGA
## control1.4362 28 20 1
## control2.4363 29 26 1
## control3.4364 23 13 2
## MS1_r1_4370 1 1 61
## MS1_r2_4371 1 2 114
## MS1_r3_4372 3 4 65
## GTGATTCCACAGTCGTTAAT GTGAGGATACAGTCGGTTTT GTAGAGTCGTTACCCGACAC
## control1.4362 2 2 1
## control2.4363 1 1 1
## control3.4364 4 2 1
## MS1_r1_4370 209 76 82
## MS1_r2_4371 176 124 67
## MS1_r3_4372 135 52 55
## GTACCGTCGGTTGCTCGTGC GGTGTTGTCAGAGTCGTTAA GGTCCTGTCTTTTCTGCTGA
## control1.4362 3 1 33
## control2.4363 2 1 30
## control3.4364 3 2 37
## MS1_r1_4370 225 102 2
## MS1_r2_4371 167 62 2
## MS1_r3_4372 113 58 2
## GGGCGTGTTTTTCTGGAGTA GGCTCGGAGTCGCTGAAAAT GGAGTCGTCTTTTTATCCCC
## control1.4362 32 1 2
## control2.4363 33 1 1
## control3.4364 38 1 1
## MS1_r1_4370 7 70 155
## MS1_r2_4371 4 69 78
## MS1_r3_4372 2 53 44
## GCTTCGCAGTCGTTAGAGTT GCCGTCCTGTCTTTCTCATT GCCGAGTCGTTATGGACCCA
## control1.4362 3 28 1
## control2.4363 1 16 1
## control3.4364 1 23 1
## MS1_r1_4370 123 1 44
## MS1_r2_4371 123 3 52
## MS1_r3_4372 59 2 52
## GAGTCGTTTAAAGGCTCTCT GAGTCGTTCTCGTTTCGCAG CTCCAGAGCCGTTTTCGGTG
## control1.4362 1 1 1
## control2.4363 1 1 2
## control3.4364 1 1 1
## MS1_r1_4370 125 70 141
## MS1_r2_4371 185 59 84
## MS1_r3_4372 97 38 61
## CTATGGTCCCTTAGTGTTTA CTATCAACAGAGTCGCTAAT CGGTTAGAGTCGATAGCTTT
## control1.4362 15 2 1
## control2.4363 15 2 1
## control3.4364 22 2 3
## MS1_r1_4370 1 139 89
## MS1_r2_4371 2 105 107
## MS1_r3_4372 1 86 58
## CGAGTCGTTTGACCGGCGCA CCTCTCACCAGTCGTTTTGG CCAGTCGATTCTTTTCATAT
## control1.4362 4 1 1
## control2.4363 6 4 1
## control3.4364 3 1 2
## MS1_r1_4370 178 120 105
## MS1_r2_4371 247 152 76
## MS1_r3_4372 206 81 72
## CCAGCAGAGTCGCTCGAAAT CACCGTCGTTTTTGTGACCG ATCGTCGTTTTAGCCGTAGG
## control1.4362 1 1 1
## control2.4363 1 1 1
## control3.4364 1 1 1
## MS1_r1_4370 85 103 78
## MS1_r2_4371 47 63 34
## MS1_r3_4372 65 85 50
## ATCGTCGGTCTTAGCGGTCA AGATTAACCCAATACATTAT AGAGTCGCTCGTTAGGATCT
## control1.4362 1 2 2
## control2.4363 1 1 1
## control3.4364 1 2 1
## MS1_r1_4370 17 160 94
## MS1_r2_4371 81 66 101
## MS1_r3_4372 70 75 49
## AGAGTCGATTTGTCCAATCG AGAGGCGTTCGATCTTAGAC ACTGTCGTTTCAACGTTGAA
## control1.4362 2 3 1
## control2.4363 2 4 1
## control3.4364 2 1 1
## MS1_r1_4370 70 63 49
## MS1_r2_4371 172 180 31
## MS1_r3_4372 68 107 35
## ACCCCCGTCGTTAATTCGAC AACGCACGGGCGTGTTAGTC
## control1.4362 1 25
## control2.4363 1 46
## control3.4364 1 32
## MS1_r1_4370 25 1
## MS1_r2_4371 104 2
## MS1_r3_4372 34 1
sampleClassLabels <- row.names(ML_data)
sampleClassLabels
## [1] "control1.4362" "control2.4363"
## [3] "control3.4364" "MS1_r1_4370"
## [5] "MS1_r2_4371" "MS1_r3_4372"
## [7] "MS1_r4_4373" "MS1_r5_4374"
## [9] "MS2_r1_4375" "MS2_r2_4376"
## [11] "MS2_r3_4377" "MS2_r4_4378"
## [13] "MS2_r5_4379" "commercial1o.commercial_r1_4365"
## [15] "commercial2o.commercial_r2_4366" "commercial3o.commercial_r3_4367"
## [17] "commercial4o.commercial_r4_4368" "commercial5o.commercial_r5_4369"
There are 3 healthy control samples and 15 multiple sclerosis samples; let's add a class feature holding those labels.
class <- as.factor(c("healthy","healthy","healthy",
"MS","MS","MS","MS","MS",
"MS","MS","MS","MS","MS",
"MS","MS","MS","MS","MS"))
ML_data$class <- class
summary(ML_data$class)
## healthy MS
## 3 15
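The class labels above are typed by hand in sample order. A small sketch of deriving them from the sample names instead, assuming (as in this data) that every healthy sample name starts with "control"; the `sample_names` vector here is a shortened illustrative subset:

```r
# A few sample names in the style of the row names above.
sample_names <- c("control1.4362", "MS1_r1_4370", "MS2_r3_4377",
                  "commercial1o.commercial_r1_4365")

# Anything starting with "control" is healthy; everything else is MS.
class_labels <- factor(
  ifelse(grepl("^control", sample_names), "healthy", "MS"),
  levels = c("healthy", "MS")
)

table(class_labels)
```

Deriving labels from names keeps them correct if samples are reordered or added.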
write.csv(ML_data, "ML_data_15MS_3Healthy.csv",row.names=F)
rm( "class", "common1" ,
"common2" , "commonAll3" ,
"data" , "ID_REF" ,
"ML_data" , "sampleClassLabels",
"samples" , "samplesOnly" ,
"top100_commercialMS_cDNA" ,"top100_MS1_cDNA" ,
"top100_MS2_cDNA" ,"top41" ,
"top50bottom50_commercial_FC", "top50bottom50_MS1_FC" ,
"top50bottom50_MS2_FC"
)
ML41data <- read.csv("ML_data_15MS_3Healthy.csv",header=T, sep=',',na.strings=c('',' ','na','NA'))
ML41data$class <- as.factor(ML41data$class)
summary(ML41data)
## TTTCGGGGAACCGAGTCGAT TTTAGAGTCGGTGGTAGATC TTCACGGTCCTTTTGGTCAC
## Min. : 1.00 Min. : 1.00 Min. : 1.000
## 1st Qu.: 63.50 1st Qu.: 44.25 1st Qu.: 1.000
## Median : 90.00 Median : 52.50 Median : 2.000
## Mean : 82.39 Mean : 56.83 Mean : 5.833
## 3rd Qu.:104.75 3rd Qu.: 84.50 3rd Qu.: 3.000
## Max. :200.00 Max. :108.00 Max. :31.000
## TGCGGTCGCGACCTTTCAGC TGCCCGCGCCTACAGTAGTG TGAATTTTAGAGTCGGTTTC
## Min. : 1.000 Min. : 1.000 Min. : 2.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 98.25
## Median : 3.000 Median : 3.500 Median :112.50
## Mean : 6.167 Mean : 6.889 Mean :104.56
## 3rd Qu.: 4.500 3rd Qu.: 5.000 3rd Qu.:123.75
## Max. :30.000 Max. :32.000 Max. :235.00
## TAGTGGCGTGAGATTTGCGT TAGAGTACCGTTTTTGAACT TAGACATGCAGTCGTTTCGA
## Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 1.250 1st Qu.: 2.000 1st Qu.: 64.25
## Median : 3.000 Median : 2.000 Median : 87.50
## Mean : 6.556 Mean : 5.222 Mean : 76.61
## 3rd Qu.: 4.000 3rd Qu.: 3.750 3rd Qu.:104.50
## Max. :29.000 Max. :26.000 Max. :116.00
## GTGATTCCACAGTCGTTAAT GTGAGGATACAGTCGGTTTT GTAGAGTCGTTACCCGACAC
## Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.:103.2 1st Qu.: 53.75 1st Qu.:51.00
## Median :114.0 Median : 73.50 Median :68.00
## Mean :120.0 Mean : 70.83 Mean :61.78
## 3rd Qu.:151.5 3rd Qu.: 90.25 3rd Qu.:82.00
## Max. :242.0 Max. :166.00 Max. :99.00
## GTACCGTCGGTTGCTCGTGC GGTGTTGTCAGAGTCGTTAA GGTCCTGTCTTTTCTGCTGA
## Min. : 2.0 Min. : 1.00 Min. : 1.000
## 1st Qu.:112.2 1st Qu.: 63.25 1st Qu.: 2.250
## Median :145.5 Median : 77.00 Median : 3.500
## Mean :138.9 Mean : 70.78 Mean : 8.444
## 3rd Qu.:183.2 3rd Qu.: 97.50 3rd Qu.: 6.000
## Max. :269.0 Max. :111.00 Max. :37.000
## GGGCGTGTTTTTCTGGAGTA GGCTCGGAGTCGCTGAAAAT GGAGTCGTCTTTTTATCCCC
## Min. : 1.000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 2.250 1st Qu.: 51.50 1st Qu.: 49.00
## Median : 4.500 Median : 62.00 Median : 76.50
## Mean : 9.056 Mean : 61.28 Mean : 71.11
## 3rd Qu.: 7.000 3rd Qu.: 76.75 3rd Qu.: 95.75
## Max. :38.000 Max. :128.00 Max. :155.00
## GCTTCGCAGTCGTTAGAGTT GCCGTCCTGTCTTTCTCATT GCCGAGTCGTTATGGACCCA
## Min. : 1.00 Min. : 1.000 Min. : 1.00
## 1st Qu.: 62.00 1st Qu.: 2.000 1st Qu.:32.00
## Median : 86.50 Median : 2.000 Median :44.00
## Mean : 80.39 Mean : 5.556 Mean :40.67
## 3rd Qu.:108.00 3rd Qu.: 3.750 3rd Qu.:58.75
## Max. :166.00 Max. :28.000 Max. :69.00
## GAGTCGTTTAAAGGCTCTCT GAGTCGTTCTCGTTTCGCAG CTCCAGAGCCGTTTTCGGTG
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 99.75 1st Qu.:31.25 1st Qu.: 65.25
## Median :118.00 Median :46.50 Median : 90.50
## Mean :115.22 Mean :41.33 Mean : 87.28
## 3rd Qu.:149.25 3rd Qu.:54.00 3rd Qu.:108.00
## Max. :209.00 Max. :71.00 Max. :190.00
## CTATGGTCCCTTAGTGTTTA CTATCAACAGAGTCGCTAAT CGGTTAGAGTCGATAGCTTT
## Min. : 1.000 Min. : 2.00 Min. : 1.00
## 1st Qu.: 1.000 1st Qu.: 77.25 1st Qu.: 61.50
## Median : 2.000 Median : 88.00 Median : 86.00
## Mean : 4.333 Mean : 87.83 Mean : 77.22
## 3rd Qu.: 3.000 3rd Qu.:109.50 3rd Qu.: 96.50
## Max. :22.000 Max. :180.00 Max. :163.00
## CGAGTCGTTTGACCGGCGCA CCTCTCACCAGTCGTTTTGG CCAGTCGATTCTTTTCATAT
## Min. : 3.0 Min. : 1.00 Min. : 1.00
## 1st Qu.:177.2 1st Qu.: 81.50 1st Qu.: 69.00
## Median :206.5 Median : 96.00 Median : 75.50
## Mean :189.8 Mean : 99.11 Mean : 71.67
## 3rd Qu.:250.0 3rd Qu.:119.75 3rd Qu.: 84.00
## Max. :311.0 Max. :217.00 Max. :172.00
## CCAGCAGAGTCGCTCGAAAT CACCGTCGTTTTTGTGACCG ATCGTCGTTTTAGCCGTAGG
## Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.:53.5 1st Qu.: 56.50 1st Qu.:34.25
## Median :59.0 Median : 64.50 Median :48.00
## Mean :55.5 Mean : 71.22 Mean :43.56
## 3rd Qu.:67.0 3rd Qu.:101.25 3rd Qu.:62.75
## Max. :96.0 Max. :142.00 Max. :78.00
## ATCGTCGGTCTTAGCGGTCA AGATTAACCCAATACATTAT AGAGTCGCTCGTTAGGATCT
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 46.50 1st Qu.: 68.25 1st Qu.: 70.00
## Median : 69.00 Median : 80.00 Median : 83.00
## Mean : 60.61 Mean : 77.89 Mean : 74.39
## 3rd Qu.: 84.00 3rd Qu.: 91.75 3rd Qu.: 97.25
## Max. :110.00 Max. :160.00 Max. :152.00
## AGAGTCGATTTGTCCAATCG AGAGGCGTTCGATCTTAGAC ACTGTCGTTTCAACGTTGAA
## Min. : 2.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 71.0 1st Qu.: 97.25 1st Qu.:25.75
## Median :118.0 Median :115.50 Median :48.00
## Mean :109.4 Mean :108.78 Mean :41.39
## 3rd Qu.:130.5 3rd Qu.:140.50 3rd Qu.:62.75
## Max. :224.0 Max. :203.00 Max. :72.00
## ACCCCCGTCGTTAATTCGAC AACGCACGGGCGTGTTAGTC class
## Min. : 1.00 Min. : 1.000 healthy: 3
## 1st Qu.: 36.00 1st Qu.: 3.000 MS :15
## Median : 56.00 Median : 3.500
## Mean : 50.78 Mean : 8.722
## 3rd Qu.: 69.50 3rd Qu.: 6.000
## Max. :104.00 Max. :46.000
The data frame is ready to use in a simple random forest machine learning model to predict class. Random forest bootstraps the training rows for each tree, and accuracy is the metric.
# intrain <- sample(1:13,.8*13)
training <- ML41data[c(2,3,4,5,6,7,8,9,10,13,11,15),]
testing <- ML41data[c(1,11,14),]
table(testing$class)
##
## healthy MS
## 1 2
table(training$class)
##
## healthy MS
## 2 10
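The training and testing rows were picked manually to make sure at least one healthy sample is in the testing set. Note that row 11 appears in both index vectors, so only rows 1 and 14 of the testing set are truly held out, and rows 12, 16, 17, and 18 are not used at all.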
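The hand-picked index vectors share row 11. A minimal sketch, on labels only, of checking for that overlap and of drawing a split that keeps one healthy sample in the test set without reusing any row; the seed and the variable names are illustrative assumptions, not the split used in this analysis:

```r
# The index vectors used above, and the row they have in common.
train_manual <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 11, 15)
test_manual  <- c(1, 11, 14)
overlap <- intersect(train_manual, test_manual)
overlap

# Alternative: sample the test rows per class, then train on everything else.
set.seed(123)
labels <- factor(c(rep("healthy", 3), rep("MS", 15)))
idx_healthy <- which(labels == "healthy")
idx_ms      <- which(labels == "MS")

test_idx  <- c(sample(idx_healthy, 1), sample(idx_ms, 2))
train_idx <- setdiff(seq_along(labels), test_idx)

table(labels[test_idx])
table(labels[train_idx])
```

With this approach all 18 samples are used exactly once, 15 for training and 3 for testing.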
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
ML41.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,
importance=TRUE)
print(ML41.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## healthy MS class.error
## healthy 2 0 0
## MS 0 10 0
The model has a 0% out-of-bag (OOB) error on the training set with 10,000 trees. Predict on the testing set to see how it performs.
ML41rf_predict <- predict(ML41.rf, newdata = testing)
ML41rf_predict
## 1 11 14
## healthy MS MS
## Levels: healthy MS
predictResults <- data.frame(Predicted=ML41rf_predict,Actual=testing$class)
predictResults
## Predicted Actual
## 1 healthy healthy
## 11 MS MS
## 14 MS MS
sum(predictResults$Predicted==predictResults$Actual)/length(predictResults$Predicted)
## [1] 1
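The predictions are 100% correct on the testing set, which suggests these barcodes may mark risk loci in genes that are silenced or enhanced in multiple sclerosis. With only three test samples, and with the 41 barcodes chosen using fold changes computed from all 18 samples, this is a preliminary result rather than proof.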
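To put the test accuracy in context, an exact binomial confidence interval shows how much three correct predictions out of three can support. This sketch uses `binom.test` from base R's stats package and treats the three test samples as independent, which is an assumption:

```r
# Exact (Clopper-Pearson) 95% confidence interval for 3 correct out of 3.
ci <- binom.test(x = 3, n = 3)$conf.int
round(ci, 3)
```

The lower bound is about 0.29, so a perfect score on three samples is compatible with a true accuracy well below 100%.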
The next step is to find the genes that these 20-base-pair cDNA barcodes map to, in order to identify which gene therapies might be applied, for example in stem cell cultures, to the defective genes involved in demyelination of the CNS in MS patients.
Bioconductor wasn’t working earlier, but maybe it will when run again. Until then, a manual search for the gene each strand belongs to, using Ensembl, the UCSC Genome Browser, BLAST, or GeneCards, might turn up something.
Here are those top 41 genes next to their fold change values.
FC41 <- read.csv('top41genesCommonToAllFoldchangeValues3groupsMS.csv', header=T, sep=',', na.strings=c('',' ','na','NA'))
FCvalues41 <- FC41[order(FC41$foldchange_commercialMS_vs_control, decreasing=T),c(1,24:26)]
FCvalues41
## ID_REF foldchange_MS1_vs_control foldchange_MS2_vs_control
## 3 TTCACGGTCCTTTTGGTCAC 0.06075949 0.06835443
## 4 TGCGGTCGCGACCTTTCAGC 0.08219178 0.14794521
## 7 TAGTGGCGTGAGATTTGCGT 0.08250000 0.11250000
## 5 TGCCCGCGCCTACAGTAGTG 0.10843373 0.09397590
## 20 GCCGTCCTGTCTTTCTCATT 0.08955224 0.10746269
## 41 AACGCACGGGCGTGTTAGTC 0.08737864 0.12233010
## 15 GGTCCTGTCTTTTCTGCTGA 0.07800000 0.12600000
## 8 TAGAGTACCGTTTTTGAACT 0.11186441 0.13220339
## 16 GGGCGTGTTTTTCTGGAGTA 0.09320388 0.13980583
## 25 CTATGGTCCCTTAGTGTTTA 0.09230769 0.08076923
## 23 GAGTCGTTCTCGTTTCGCAG 54.00000000 52.60000000
## 11 GTGAGGATACAGTCGGTTTT 63.00000000 47.04000000
## 6 TGAATTTTAGAGTCGGTTTC 54.45000000 41.17500000
## 29 CCTCTCACCAGTCGTTTTGG 75.80000000 54.50000000
## 27 CGGTTAGAGTCGATAGCTTT 66.36000000 52.20000000
## 33 ATCGTCGTTTTAGCCGTAGG 60.80000000 47.40000000
## 38 AGAGGCGTTCGATCTTAGAC 54.90000000 42.67500000
## 28 CGAGTCGTTTGACCGGCGCA 57.23076923 50.86153846
## 21 GCCGAGTCGTTATGGACCCA 54.40000000 42.20000000
## 35 AGATTAACCCAATACATTAT 68.52000000 48.60000000
## 10 GTGATTCCACAGTCGTTAAT 84.60000000 49.20000000
## 26 CTATCAACAGAGTCGCTAAT 64.40000000 41.00000000
## 39 ACTGTCGTTTCAACGTTGAA 51.60000000 44.00000000
## 30 CCAGTCGATTCTTTTCATAT 76.35000000 62.10000000
## 19 GCTTCGCAGTCGTTAGAGTT 71.76000000 46.68000000
## 40 ACCCCCGTCGTTAATTCGAC 61.60000000 64.20000000
## 18 GGAGTCGTCTTTTTATCCCC 73.05000000 61.50000000
## 1 TTTCGGGGAACCGAGTCGAT 75.00000000 44.16000000
## 37 AGAGTCGATTTGTCCAATCG 75.40000000 60.40000000
## 2 TTTAGAGTCGGTGGTAGATC 88.80000000 53.20000000
## 17 GGCTCGGAGTCGCTGAAAAT 87.60000000 70.40000000
## 36 AGAGTCGCTCGTTAGGATCT 74.10000000 63.45000000
## 13 GTACCGTCGGTTGCTCGTGC 77.55000000 45.90000000
## 14 GGTGTTGTCAGAGTCGTTAA 63.90000000 61.20000000
## 34 ATCGTCGGTCTTAGCGGTCA 74.60000000 77.40000000
## 9 TAGACATGCAGTCGTTTCGA 68.85000000 71.40000000
## 24 CTCCAGAGCCGTTTTCGGTG 94.80000000 71.55000000
## 12 GTAGAGTCGTTACCCGACAC 80.00000000 71.60000000
## 31 CCAGCAGAGTCGCTCGAAAT 69.40000000 58.00000000
## 32 CACCGTCGTTTTTGTGACCG 103.40000000 61.40000000
## 22 GAGTCGTTTAAAGGCTCTCT 161.40000000 110.40000000
## foldchange_commercialMS_vs_control
## 3 14.629629630
## 4 12.166666667
## 7 11.111111111
## 5 10.641025641
## 20 10.151515152
## 41 9.537037037
## 15 9.259259259
## 8 8.939393939
## 16 8.583333333
## 25 7.878787879
## 23 0.024038462
## 11 0.023607177
## 6 0.022259321
## 29 0.021052632
## 27 0.020990764
## 33 0.020833333
## 38 0.020544427
## 28 0.020420986
## 21 0.020325203
## 35 0.019794141
## 10 0.019707207
## 26 0.019193858
## 39 0.018939394
## 30 0.018365473
## 19 0.018315018
## 40 0.017730496
## 18 0.017590150
## 1 0.017182131
## 37 0.016501650
## 2 0.016129032
## 17 0.016129032
## 36 0.015948963
## 13 0.015741834
## 14 0.015290520
## 34 0.015243902
## 9 0.015151515
## 24 0.014556041
## 12 0.014245014
## 31 0.013927577
## 32 0.010989011
## 22 0.007022472
At first glance the fold change values for the participating MS1 and MS2 patients do not run in the same direction as those for the commercial MS patient: many are low where the commercial value is high, and vice versa. This is because the commercial fold change was computed as control over commercial, the inverse of the other two ratios; taking the reciprocal puts all three in the same direction.
The random forest classifier predicted the test samples with 100% accuracy using these 41 cDNA barcodes.
To get the protein made, the barcode would first be written as mRNA by substituting U for T (taking the complementary strand where needed), and the mRNA would then be read in triplet codons that map to amino acids. Software already does this, but I haven’t personally been able to use it.
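The T-to-U substitution and the complementary strand can be sketched in base R with `chartr`, using the first barcode from the table above. Which strand a barcode represents is not stated in this analysis, so both the direct and the reverse-complement forms are shown; whether a 20-base-pair fragment falls inside a coding region at all is also an open assumption:

```r
barcode <- "GAGTCGTTTAAAGGCTCTCT"

# Complement each base, then reverse to get the opposite strand read 5' to 3'.
complement         <- chartr("ACGT", "TGCA", barcode)
reverse_complement <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")

# RNA version of the barcode itself: replace every T with U.
mrna_like <- chartr("T", "U", barcode)

reverse_complement
mrna_like
```

Translating to amino acids would additionally need a reading frame and a codon table, which dedicated packages provide.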