This data analysis is on lyme disease using GEO series data made readily available in its normalized state from GSE145974 on ncbi.nlm.nih.gov as the accession number. The data is from the platform GPL13667 and the series data. There are also some CEL/TAR files for Ubuntu but I couldn’t get my ubuntu machines to recognize it, and the instructions and tutorials for accessing the SRAtoolkit and using the Windows Ubuntu app, didn’t avail, so I am using the text files only.If you have a windows 10 tutorial on running SRAtoolkit using the ubuntu app for windows or getting the cel files to work on ubuntu with a VirtualBox disk image of ubuntu that works because you tried it within 24 hours and it worked exactly as explained, please share. I have yet to get those up and running. Possibly the new updates to virtualBox or my other apps, like docker or MongoDB or Tableau are interfering. I am not going to waste time figuring it out, its a time trap.
All the data was there as far as being filled out with values for the feature names, because I do recall exploring some platforms and series downloadable text files and only the header information was there and none of the values. The method used for processing was expression profiling by microarray on peripheral blood mononucleated cells (PBMC). The values seem to be scaled or normalized already as the values are inclusive of negative values.
Tableau Images of Charts Section
There is a research article available from the researchers on this study to accompany the data for free through pubMed:
“Global Transcriptome Analysis Identifies a Diagnostic Signature for Early Disseminated Lyme Disease and Its Resolution” authored by the following researchers: Mary M. Petzke,a Konstantin Volyanskyy,b Yong Mao,b Byron Arevalo,a Raphael Zohn,a Johanna Quituisaca,a Gary P. Wormser,c Nevenka Dimitrova,b Ira Schwartza
They are with the Department of Microbiology and Immunology, School of Medicine, New York Medical College, Valhalla, New York, USA bPhillips Research North America, Valhalla, New York, USA Division of Infectious Diseases, Department of Medicine, New York Medical College, Valhalla, New York, USA Citation Petzke MM, Volyanskyy K, Mao Y, Arevalo B, Zohn R, Quituisaca J, Wormser GP, Dimitrova N, Schwartz I. 2020. Global transcriptome analysis identifies a diagnostic signature for early disseminated Lyme disease and its resolution. mBio 11:e00047-20. https:// doi.org/10.1128/mBio.00047-20. Editor Steven J. Norris, McGovern Medical School Copyright © 2020 Petzke et al. This is an openaccess article distributed under the terms of the Creative Commons Attribution 4.0 International license. Address correspondence to Mary M. Petzke, mpetzke@nymc.edu. This article is a direct contribution from Ira Schwartz, a Fellow of the American Academy of Microbiology, who arranged for and secured reviews by Patricia Rosa, NIAID, NIH, and John Leong, Tufts University School of Medicine. Received 9 January 2020 Accepted 31 January 2020 Published 17 March 2020 “ABSTRACT A bioinformatics approach was employed to identify transcriptome alterations in the peripheral blood mononuclear cells of well-characterized human subjects who were diagnosed with early disseminated Lyme disease (LD) based on stringent microbiological and clinical criteria. Transcriptomes were assessed at the time of presentation and also at approximately 1 month (early convalescence) and 6 months (late convalescence) after initiation of an appropriate antibiotic regimen. Comparative transcriptomics identified 335 transcripts, representing 233 unique genes, with significant alterations of at least 2-fold expression in acute- or convalescent-phase blood samples from LD subjects relative to healthy donors. Acute-phase blood samples from LD subjects had the largest number of differentially expressed transcripts (187 induced, 54 repressed). This transcriptional profile, which was dominated by interferon-regulated genes, was sustained during early convalescence. 6 months after antibiotic treatment the transcriptome of LD subjects was indistinguishable from that of healthy controls based on two separate methods of analysis. Return of the LD expression profile to levels found in control subjects was concordant with disease outcome; 82% of subjects with LD experienced at least one symptom at the baseline visit compared to 43% at the early convalescence time point and only a single patient (9%) at the 6-month convalescence time point. Using the random forest machine learning algorithm, we developed an efficient computational framework to identify sets of 20 classifier genes that discriminated LD from other bacterial and viral infections. These novel LD biomarkers not only differentiated subjects with acute disseminated LD from healthy controls with 96% accuracy but also distinguished between subjects with acute and resolved (late convalescent) disease with 97% accuracy. IMPORTANCE Lyme disease (LD), caused by Borrelia burgdorferi, is the most common tick-borne infectious disease in the United States. We examined gene expression patterns in the blood of individuals with early disseminated LD at the time of diagnosis (acute) and also at approximately 1 month and 6 months following antibiotic treatment. A distinct acute LD profile was observed that was sustained during early convalescence (1 month) but returned to control levels 6 months after treatment. Using a computer learning algorithm, we identified sets of 20 classifier genes that discriminate LD from other bacterial and viral infections. In addition, these novel LD biomarkers are highly accurate in distinguishing patients with acute LD from healthy subjects and in discriminating between individuals with active and resolved infection. This computational approach offers the potential for more accurate diagnosis of early disseminated Lyme disease. It may also allow improved monitoring of treatment efficacy and disease resolution.” ***
The study authors used the same algorithms I always go to for analysis and scored well, random forest. It tends to always perform better in classification. But sometimes other algorithms perform better. Data scientists are suggested to not use just one type for all data as not all data is the same, but also some are almost as good and take much less time, depending on how many trees your algorithm is tuned to. I do want to see if these genes can be discovered that are similar to the genes they discovered and use them to predict samples from other studies, but this data is already normalized, and the method that was used was not given, so the first part of this study is an attempt at bringing back the original raw values. This is RNA blood samples, PBMC, and I do have some COVID-19 samples that are also peripheral Blood mononucleated cells type tissue, but the processing was high throughput expression profiling and not microarray. So, I would be able to split this data and see if it can predict the samples on unseen data of the testing set instead.
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(MASS)
#library(gbm)
library(RANN) #used in the tuning parameter of rf method of caret for 'oob' one out bag
## Warning: package 'RANN' was built under R version 3.6.3
ticks <- read.delim('GSE145974_series_matrix.txt',sep='\t',header=T,
comment.char = '!',na.strings=c('',' ','NA'))
GSM_IDs <- colnames(ticks)[2:87]
Affy_IDs <- ticks$ID_REF
comments <- read.delim('GSE145974_series_matrix.txt',sep='\n',header=T,
na.strings=c('',' ','NA'))
Sample GSM IDs and description
descriptors <- comments[27:28,]
head(descriptors)
## [1] !Sample_title\tPBMC total RNA-Healthy control 1\tPBMC total RNA-Healthy control 2\tPBMC total RNA-Healthy control 3\tPBMC total RNA-Healthy control 4\tPBMC total RNA-Healthy control 5\tPBMC total RNA-Healthy control 6\tPBMC total RNA-Healthy control 7\tPBMC total RNA-Healthy control 8\tPBMC total RNA-Healthy control 9\tPBMC total RNA-Healthy control 10\tPBMC total RNA-Healthy control 11\tPBMC total RNA-Healthy control 12\tPBMC total RNA-Healthy control 13\tPBMC total RNA-Healthy control 14\tPBMC total RNA-Healthy control 15\tPBMC total RNA-Healthy control 16\tPBMC total RNA-Healthy control 17\tPBMC total RNA-Healthy control 18\tPBMC total RNA-Healthy control 19\tPBMC total RNA-Healthy control 20\tPBMC total RNA-Healthy control 21\tPBMC total RNA-Acute Lyme disease subject 1\tPBMC total RNA-Acute Lyme disease subject 2\tPBMC total RNA-Acute Lyme disease subject 3\tPBMC total RNA-Acute Lyme disease subject 4\tPBMC total RNA-Acute Lyme disease subject 5\tPBMC total RNA-Acute Lyme disease subject 6\tPBMC total RNA-Acute Lyme disease subject 7\tPBMC total RNA-Acute Lyme disease subject 8\tPBMC total RNA-Acute Lyme disease subject 9\tPBMC total RNA-Acute Lyme disease subject 10\tPBMC total RNA-Acute Lyme disease subject 11\tPBMC total RNA-Acute Lyme disease subject 12\tPBMC total RNA-Acute Lyme disease subject 13\tPBMC total RNA-Acute Lyme disease subject 14\tPBMC total RNA-Acute Lyme disease subject 15\tPBMC total RNA-Acute Lyme disease subject 16\tPBMC total RNA-Acute Lyme disease subject 17\tPBMC total RNA-Acute Lyme disease subject 18\tPBMC total RNA-Acute Lyme disease subject 19\tPBMC total RNA-Acute Lyme disease subject 20\tPBMC total RNA-Acute Lyme disease subject 21\tPBMC total RNA-Acute Lyme disease subject 22\tPBMC total RNA-Acute Lyme disease subject 23\tPBMC total RNA-Acute Lyme disease subject 24\tPBMC total RNA-Acute Lyme disease subject 25\tPBMC total RNA-Acute Lyme disease subject 26\tPBMC total RNA-Acute Lyme disease subject 27\tPBMC total RNA-Acute Lyme disease subject 28\tPBMC total RNA-early convalescent Lyme disease subject 1\tPBMC total RNA-early convalescent Lyme disease subject 2\tPBMC total RNA-early convalescent Lyme disease subject 3\tPBMC total RNA-early convalescent Lyme disease subject 4\tPBMC total RNA-early convalescent Lyme disease subject 5\tPBMC total RNA-early convalescent Lyme disease subject 6\tPBMC total RNA-early convalescent Lyme disease subject 7\tPBMC total RNA-early convalescent Lyme disease subject 8\tPBMC total RNA-early convalescent Lyme disease subject 9\tPBMC total RNA-early convalescent Lyme disease subject 10\tPBMC total RNA-early convalescent Lyme disease subject 11\tPBMC total RNA-early convalescent Lyme disease subject 12\tPBMC total RNA-early convalescent Lyme disease subject 13\tPBMC total RNA-early convalescent Lyme disease subject 14\tPBMC total RNA-early convalescent Lyme disease subject 15\tPBMC total RNA-early convalescent Lyme disease subject 16\tPBMC total RNA-early convalescent Lyme disease subject 17\tPBMC total RNA-early convalescent Lyme disease subject 18\tPBMC total RNA-early convalescent Lyme disease subject 19\tPBMC total RNA-early convalescent Lyme disease subject 20\tPBMC total RNA-early convalescent Lyme disease subject 21\tPBMC total RNA-early convalescent Lyme disease subject 22\tPBMC total RNA-early convalescent Lyme disease subject 23\tPBMC total RNA-early convalescent Lyme disease subject 24\tPBMC total RNA-early convalescent Lyme disease subject 25\tPBMC total RNA-early convalescent Lyme disease subject 26\tPBMC total RNA-early convalescent Lyme disease subject 27\tPBMC total RNA-late convalescent Lyme disease subject 1\tPBMC total RNA-late convalescent Lyme disease subject 2\tPBMC total RNA-late convalescent Lyme disease subject 3\tPBMC total RNA-late convalescent Lyme disease subject 4\tPBMC total RNA-late convalescent Lyme disease subject 5\tPBMC total RNA-late convalescent Lyme disease subject 6\tPBMC total RNA-late convalescent Lyme disease subject 7\tPBMC total RNA-late convalescent Lyme disease subject 8\tPBMC total RNA-late convalescent Lyme disease subject 9\tPBMC total RNA-late convalescent Lyme disease subject 10
## [2] !Sample_geo_accession\tGSM4340492\tGSM4340493\tGSM4340494\tGSM4340495\tGSM4340496\tGSM4340497\tGSM4340498\tGSM4340499\tGSM4340500\tGSM4340501\tGSM4340502\tGSM4340503\tGSM4340504\tGSM4340505\tGSM4340506\tGSM4340507\tGSM4340508\tGSM4340509\tGSM4340510\tGSM4340511\tGSM4340512\tGSM4340513\tGSM4340514\tGSM4340515\tGSM4340516\tGSM4340517\tGSM4340518\tGSM4340519\tGSM4340520\tGSM4340521\tGSM4340522\tGSM4340523\tGSM4340524\tGSM4340525\tGSM4340526\tGSM4340527\tGSM4340528\tGSM4340529\tGSM4340530\tGSM4340531\tGSM4340532\tGSM4340533\tGSM4340534\tGSM4340535\tGSM4340536\tGSM4340537\tGSM4340538\tGSM4340539\tGSM4340540\tGSM4340541\tGSM4340542\tGSM4340543\tGSM4340544\tGSM4340545\tGSM4340546\tGSM4340547\tGSM4340548\tGSM4340549\tGSM4340550\tGSM4340551\tGSM4340552\tGSM4340553\tGSM4340554\tGSM4340555\tGSM4340556\tGSM4340557\tGSM4340558\tGSM4340559\tGSM4340560\tGSM4340561\tGSM4340562\tGSM4340563\tGSM4340564\tGSM4340565\tGSM4340566\tGSM4340567\tGSM4340568\tGSM4340569\tGSM4340570\tGSM4340571\tGSM4340572\tGSM4340573\tGSM4340574\tGSM4340575\tGSM4340576\tGSM4340577
## 49449 Levels: !Sample_channel_count\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1 ...
descriptors <- gsub('!','',descriptors)
descriptors <- gsub('\\t',',',descriptors)
split1 <- strsplit(descriptors[1],split=',')
type <- split1[1]
type2 <- as.data.frame(type)
colnames(type2) <- 'Sample_Title'
split2 <- strsplit(descriptors[2],split=',')
gsm <- split2[1]
gsm2 <- as.data.frame(gsm)
colnames(gsm2) <- 'Sample_GEO_Accession'
names <- cbind(type2,gsm2)
names$Sample_Title <- as.character(paste(names$Sample_Title))
names$Sample_GEO_Accession <- as.character(paste(names$Sample_GEO_Accession))
names2 <- names[-1,]
row.names(names2) <- NULL
write.csv(names2,'descriptors.csv',row.names=F)
descriptors2 <- read.csv('descriptors.csv',sep=',',na.strings=c('',' ','NA'),
header=TRUE)
head(descriptors2)
## Sample_Title Sample_GEO_Accession
## 1 PBMC total RNA-Healthy control 1 GSM4340492
## 2 PBMC total RNA-Healthy control 2 GSM4340493
## 3 PBMC total RNA-Healthy control 3 GSM4340494
## 4 PBMC total RNA-Healthy control 4 GSM4340495
## 5 PBMC total RNA-Healthy control 5 GSM4340496
## 6 PBMC total RNA-Healthy control 6 GSM4340497
descriptors2$Sample_Title
## [1] PBMC total RNA-Healthy control 1
## [2] PBMC total RNA-Healthy control 2
## [3] PBMC total RNA-Healthy control 3
## [4] PBMC total RNA-Healthy control 4
## [5] PBMC total RNA-Healthy control 5
## [6] PBMC total RNA-Healthy control 6
## [7] PBMC total RNA-Healthy control 7
## [8] PBMC total RNA-Healthy control 8
## [9] PBMC total RNA-Healthy control 9
## [10] PBMC total RNA-Healthy control 10
## [11] PBMC total RNA-Healthy control 11
## [12] PBMC total RNA-Healthy control 12
## [13] PBMC total RNA-Healthy control 13
## [14] PBMC total RNA-Healthy control 14
## [15] PBMC total RNA-Healthy control 15
## [16] PBMC total RNA-Healthy control 16
## [17] PBMC total RNA-Healthy control 17
## [18] PBMC total RNA-Healthy control 18
## [19] PBMC total RNA-Healthy control 19
## [20] PBMC total RNA-Healthy control 20
## [21] PBMC total RNA-Healthy control 21
## [22] PBMC total RNA-Acute Lyme disease subject 1
## [23] PBMC total RNA-Acute Lyme disease subject 2
## [24] PBMC total RNA-Acute Lyme disease subject 3
## [25] PBMC total RNA-Acute Lyme disease subject 4
## [26] PBMC total RNA-Acute Lyme disease subject 5
## [27] PBMC total RNA-Acute Lyme disease subject 6
## [28] PBMC total RNA-Acute Lyme disease subject 7
## [29] PBMC total RNA-Acute Lyme disease subject 8
## [30] PBMC total RNA-Acute Lyme disease subject 9
## [31] PBMC total RNA-Acute Lyme disease subject 10
## [32] PBMC total RNA-Acute Lyme disease subject 11
## [33] PBMC total RNA-Acute Lyme disease subject 12
## [34] PBMC total RNA-Acute Lyme disease subject 13
## [35] PBMC total RNA-Acute Lyme disease subject 14
## [36] PBMC total RNA-Acute Lyme disease subject 15
## [37] PBMC total RNA-Acute Lyme disease subject 16
## [38] PBMC total RNA-Acute Lyme disease subject 17
## [39] PBMC total RNA-Acute Lyme disease subject 18
## [40] PBMC total RNA-Acute Lyme disease subject 19
## [41] PBMC total RNA-Acute Lyme disease subject 20
## [42] PBMC total RNA-Acute Lyme disease subject 21
## [43] PBMC total RNA-Acute Lyme disease subject 22
## [44] PBMC total RNA-Acute Lyme disease subject 23
## [45] PBMC total RNA-Acute Lyme disease subject 24
## [46] PBMC total RNA-Acute Lyme disease subject 25
## [47] PBMC total RNA-Acute Lyme disease subject 26
## [48] PBMC total RNA-Acute Lyme disease subject 27
## [49] PBMC total RNA-Acute Lyme disease subject 28
## [50] PBMC total RNA-early convalescent Lyme disease subject 1
## [51] PBMC total RNA-early convalescent Lyme disease subject 2
## [52] PBMC total RNA-early convalescent Lyme disease subject 3
## [53] PBMC total RNA-early convalescent Lyme disease subject 4
## [54] PBMC total RNA-early convalescent Lyme disease subject 5
## [55] PBMC total RNA-early convalescent Lyme disease subject 6
## [56] PBMC total RNA-early convalescent Lyme disease subject 7
## [57] PBMC total RNA-early convalescent Lyme disease subject 8
## [58] PBMC total RNA-early convalescent Lyme disease subject 9
## [59] PBMC total RNA-early convalescent Lyme disease subject 10
## [60] PBMC total RNA-early convalescent Lyme disease subject 11
## [61] PBMC total RNA-early convalescent Lyme disease subject 12
## [62] PBMC total RNA-early convalescent Lyme disease subject 13
## [63] PBMC total RNA-early convalescent Lyme disease subject 14
## [64] PBMC total RNA-early convalescent Lyme disease subject 15
## [65] PBMC total RNA-early convalescent Lyme disease subject 16
## [66] PBMC total RNA-early convalescent Lyme disease subject 17
## [67] PBMC total RNA-early convalescent Lyme disease subject 18
## [68] PBMC total RNA-early convalescent Lyme disease subject 19
## [69] PBMC total RNA-early convalescent Lyme disease subject 20
## [70] PBMC total RNA-early convalescent Lyme disease subject 21
## [71] PBMC total RNA-early convalescent Lyme disease subject 22
## [72] PBMC total RNA-early convalescent Lyme disease subject 23
## [73] PBMC total RNA-early convalescent Lyme disease subject 24
## [74] PBMC total RNA-early convalescent Lyme disease subject 25
## [75] PBMC total RNA-early convalescent Lyme disease subject 26
## [76] PBMC total RNA-early convalescent Lyme disease subject 27
## [77] PBMC total RNA-late convalescent Lyme disease subject 1
## [78] PBMC total RNA-late convalescent Lyme disease subject 2
## [79] PBMC total RNA-late convalescent Lyme disease subject 3
## [80] PBMC total RNA-late convalescent Lyme disease subject 4
## [81] PBMC total RNA-late convalescent Lyme disease subject 5
## [82] PBMC total RNA-late convalescent Lyme disease subject 6
## [83] PBMC total RNA-late convalescent Lyme disease subject 7
## [84] PBMC total RNA-late convalescent Lyme disease subject 8
## [85] PBMC total RNA-late convalescent Lyme disease subject 9
## [86] PBMC total RNA-late convalescent Lyme disease subject 10
## 86 Levels: PBMC total RNA-Acute Lyme disease subject 1 ...
descriptors2$classDisease <- c(rep('healthyControl',21),
rep('acuteLymeDisease',28),
rep('Antibodies_1month',27),
rep('Antibodies_6months',10))
write.csv(descriptors2,'descriptors2.csv',row.names=F)
descriptors2
## Sample_Title
## 1 PBMC total RNA-Healthy control 1
## 2 PBMC total RNA-Healthy control 2
## 3 PBMC total RNA-Healthy control 3
## 4 PBMC total RNA-Healthy control 4
## 5 PBMC total RNA-Healthy control 5
## 6 PBMC total RNA-Healthy control 6
## 7 PBMC total RNA-Healthy control 7
## 8 PBMC total RNA-Healthy control 8
## 9 PBMC total RNA-Healthy control 9
## 10 PBMC total RNA-Healthy control 10
## 11 PBMC total RNA-Healthy control 11
## 12 PBMC total RNA-Healthy control 12
## 13 PBMC total RNA-Healthy control 13
## 14 PBMC total RNA-Healthy control 14
## 15 PBMC total RNA-Healthy control 15
## 16 PBMC total RNA-Healthy control 16
## 17 PBMC total RNA-Healthy control 17
## 18 PBMC total RNA-Healthy control 18
## 19 PBMC total RNA-Healthy control 19
## 20 PBMC total RNA-Healthy control 20
## 21 PBMC total RNA-Healthy control 21
## 22 PBMC total RNA-Acute Lyme disease subject 1
## 23 PBMC total RNA-Acute Lyme disease subject 2
## 24 PBMC total RNA-Acute Lyme disease subject 3
## 25 PBMC total RNA-Acute Lyme disease subject 4
## 26 PBMC total RNA-Acute Lyme disease subject 5
## 27 PBMC total RNA-Acute Lyme disease subject 6
## 28 PBMC total RNA-Acute Lyme disease subject 7
## 29 PBMC total RNA-Acute Lyme disease subject 8
## 30 PBMC total RNA-Acute Lyme disease subject 9
## 31 PBMC total RNA-Acute Lyme disease subject 10
## 32 PBMC total RNA-Acute Lyme disease subject 11
## 33 PBMC total RNA-Acute Lyme disease subject 12
## 34 PBMC total RNA-Acute Lyme disease subject 13
## 35 PBMC total RNA-Acute Lyme disease subject 14
## 36 PBMC total RNA-Acute Lyme disease subject 15
## 37 PBMC total RNA-Acute Lyme disease subject 16
## 38 PBMC total RNA-Acute Lyme disease subject 17
## 39 PBMC total RNA-Acute Lyme disease subject 18
## 40 PBMC total RNA-Acute Lyme disease subject 19
## 41 PBMC total RNA-Acute Lyme disease subject 20
## 42 PBMC total RNA-Acute Lyme disease subject 21
## 43 PBMC total RNA-Acute Lyme disease subject 22
## 44 PBMC total RNA-Acute Lyme disease subject 23
## 45 PBMC total RNA-Acute Lyme disease subject 24
## 46 PBMC total RNA-Acute Lyme disease subject 25
## 47 PBMC total RNA-Acute Lyme disease subject 26
## 48 PBMC total RNA-Acute Lyme disease subject 27
## 49 PBMC total RNA-Acute Lyme disease subject 28
## 50 PBMC total RNA-early convalescent Lyme disease subject 1
## 51 PBMC total RNA-early convalescent Lyme disease subject 2
## 52 PBMC total RNA-early convalescent Lyme disease subject 3
## 53 PBMC total RNA-early convalescent Lyme disease subject 4
## 54 PBMC total RNA-early convalescent Lyme disease subject 5
## 55 PBMC total RNA-early convalescent Lyme disease subject 6
## 56 PBMC total RNA-early convalescent Lyme disease subject 7
## 57 PBMC total RNA-early convalescent Lyme disease subject 8
## 58 PBMC total RNA-early convalescent Lyme disease subject 9
## 59 PBMC total RNA-early convalescent Lyme disease subject 10
## 60 PBMC total RNA-early convalescent Lyme disease subject 11
## 61 PBMC total RNA-early convalescent Lyme disease subject 12
## 62 PBMC total RNA-early convalescent Lyme disease subject 13
## 63 PBMC total RNA-early convalescent Lyme disease subject 14
## 64 PBMC total RNA-early convalescent Lyme disease subject 15
## 65 PBMC total RNA-early convalescent Lyme disease subject 16
## 66 PBMC total RNA-early convalescent Lyme disease subject 17
## 67 PBMC total RNA-early convalescent Lyme disease subject 18
## 68 PBMC total RNA-early convalescent Lyme disease subject 19
## 69 PBMC total RNA-early convalescent Lyme disease subject 20
## 70 PBMC total RNA-early convalescent Lyme disease subject 21
## 71 PBMC total RNA-early convalescent Lyme disease subject 22
## 72 PBMC total RNA-early convalescent Lyme disease subject 23
## 73 PBMC total RNA-early convalescent Lyme disease subject 24
## 74 PBMC total RNA-early convalescent Lyme disease subject 25
## 75 PBMC total RNA-early convalescent Lyme disease subject 26
## 76 PBMC total RNA-early convalescent Lyme disease subject 27
## 77 PBMC total RNA-late convalescent Lyme disease subject 1
## 78 PBMC total RNA-late convalescent Lyme disease subject 2
## 79 PBMC total RNA-late convalescent Lyme disease subject 3
## 80 PBMC total RNA-late convalescent Lyme disease subject 4
## 81 PBMC total RNA-late convalescent Lyme disease subject 5
## 82 PBMC total RNA-late convalescent Lyme disease subject 6
## 83 PBMC total RNA-late convalescent Lyme disease subject 7
## 84 PBMC total RNA-late convalescent Lyme disease subject 8
## 85 PBMC total RNA-late convalescent Lyme disease subject 9
## 86 PBMC total RNA-late convalescent Lyme disease subject 10
## Sample_GEO_Accession classDisease
## 1 GSM4340492 healthyControl
## 2 GSM4340493 healthyControl
## 3 GSM4340494 healthyControl
## 4 GSM4340495 healthyControl
## 5 GSM4340496 healthyControl
## 6 GSM4340497 healthyControl
## 7 GSM4340498 healthyControl
## 8 GSM4340499 healthyControl
## 9 GSM4340500 healthyControl
## 10 GSM4340501 healthyControl
## 11 GSM4340502 healthyControl
## 12 GSM4340503 healthyControl
## 13 GSM4340504 healthyControl
## 14 GSM4340505 healthyControl
## 15 GSM4340506 healthyControl
## 16 GSM4340507 healthyControl
## 17 GSM4340508 healthyControl
## 18 GSM4340509 healthyControl
## 19 GSM4340510 healthyControl
## 20 GSM4340511 healthyControl
## 21 GSM4340512 healthyControl
## 22 GSM4340513 acuteLymeDisease
## 23 GSM4340514 acuteLymeDisease
## 24 GSM4340515 acuteLymeDisease
## 25 GSM4340516 acuteLymeDisease
## 26 GSM4340517 acuteLymeDisease
## 27 GSM4340518 acuteLymeDisease
## 28 GSM4340519 acuteLymeDisease
## 29 GSM4340520 acuteLymeDisease
## 30 GSM4340521 acuteLymeDisease
## 31 GSM4340522 acuteLymeDisease
## 32 GSM4340523 acuteLymeDisease
## 33 GSM4340524 acuteLymeDisease
## 34 GSM4340525 acuteLymeDisease
## 35 GSM4340526 acuteLymeDisease
## 36 GSM4340527 acuteLymeDisease
## 37 GSM4340528 acuteLymeDisease
## 38 GSM4340529 acuteLymeDisease
## 39 GSM4340530 acuteLymeDisease
## 40 GSM4340531 acuteLymeDisease
## 41 GSM4340532 acuteLymeDisease
## 42 GSM4340533 acuteLymeDisease
## 43 GSM4340534 acuteLymeDisease
## 44 GSM4340535 acuteLymeDisease
## 45 GSM4340536 acuteLymeDisease
## 46 GSM4340537 acuteLymeDisease
## 47 GSM4340538 acuteLymeDisease
## 48 GSM4340539 acuteLymeDisease
## 49 GSM4340540 acuteLymeDisease
## 50 GSM4340541 Antibodies_1month
## 51 GSM4340542 Antibodies_1month
## 52 GSM4340543 Antibodies_1month
## 53 GSM4340544 Antibodies_1month
## 54 GSM4340545 Antibodies_1month
## 55 GSM4340546 Antibodies_1month
## 56 GSM4340547 Antibodies_1month
## 57 GSM4340548 Antibodies_1month
## 58 GSM4340549 Antibodies_1month
## 59 GSM4340550 Antibodies_1month
## 60 GSM4340551 Antibodies_1month
## 61 GSM4340552 Antibodies_1month
## 62 GSM4340553 Antibodies_1month
## 63 GSM4340554 Antibodies_1month
## 64 GSM4340555 Antibodies_1month
## 65 GSM4340556 Antibodies_1month
## 66 GSM4340557 Antibodies_1month
## 67 GSM4340558 Antibodies_1month
## 68 GSM4340559 Antibodies_1month
## 69 GSM4340560 Antibodies_1month
## 70 GSM4340561 Antibodies_1month
## 71 GSM4340562 Antibodies_1month
## 72 GSM4340563 Antibodies_1month
## 73 GSM4340564 Antibodies_1month
## 74 GSM4340565 Antibodies_1month
## 75 GSM4340566 Antibodies_1month
## 76 GSM4340567 Antibodies_1month
## 77 GSM4340568 Antibodies_6months
## 78 GSM4340569 Antibodies_6months
## 79 GSM4340570 Antibodies_6months
## 80 GSM4340571 Antibodies_6months
## 81 GSM4340572 Antibodies_6months
## 82 GSM4340573 Antibodies_6months
## 83 GSM4340574 Antibodies_6months
## 84 GSM4340575 Antibodies_6months
## 85 GSM4340576 Antibodies_6months
## 86 GSM4340577 Antibodies_6months
platform <- read.delim('GPL13667-15572.txt',sep='\t',header=T,
na.strings=c('',' ','NA'),
comment.char='#')
colnames(platform)
## [1] "ID" "GeneChip.Array"
## [3] "Species.Scientific.Name" "Annotation.Date"
## [5] "Sequence.Type" "Sequence.Source"
## [7] "Transcript.ID.Array.Design." "Target.Description"
## [9] "Representative.Public.ID" "Archival.UniGene.Cluster"
## [11] "UniGene.ID" "Genome.Version"
## [13] "Alignments" "Gene.Title"
## [15] "Gene.Symbol" "Chromosomal.Location"
## [17] "GB_LIST" "SPOT_ID"
## [19] "Unigene.Cluster.Type" "Ensembl"
## [21] "Entrez.Gene" "SwissProt"
## [23] "EC" "OMIM"
## [25] "RefSeq.Protein.ID" "RefSeq.Transcript.ID"
## [27] "FlyBase" "AGI"
## [29] "WormBase" "MGI.Name"
## [31] "RGD.Name" "SGD.accession.number"
## [33] "Gene.Ontology.Biological.Process" "Gene.Ontology.Cellular.Component"
## [35] "Gene.Ontology.Molecular.Function" "Pathway"
## [37] "InterPro" "Trans.Membrane"
## [39] "QTL" "Annotation.Description"
## [41] "Annotation.Transcript.Cluster" "Transcript.Assignments"
## [43] "Annotation.Notes"
platform2 <- platform[,c(1,15)]
head(platform2,10)
## ID Gene.Symbol
## 1 11715100_at HIST1H3G
## 2 11715101_s_at HIST1H3G
## 3 11715102_x_at HIST1H3G
## 4 11715103_x_at TNFAIP8L1
## 5 11715104_s_at OTOP2
## 6 11715105_at C17orf78
## 7 11715106_x_at CTAGE6
## 8 11715107_s_at F8A1 /// F8A2 /// F8A3
## 9 11715108_x_at LOC285501
## 10 11715109_at SAMD7
split3 <- strsplit(as.character(platform2$Gene.Symbol),split='///')
Gene1 <- lapply(split3,'[',1)
platform2$Gene <- as.character(paste(Gene1))
platform2$Gene <- trimws(platform2$Gene,which='both',whitespace=' ')
platform3 <- platform2[,c(1,3)]
Lyme <- merge(platform3,ticks,by.x='ID',by.y='ID_REF')
head(Lyme,10)
## ID Gene GSM4340492 GSM4340493 GSM4340494 GSM4340495
## 1 11715100_at HIST1H3G -0.59253310 -0.009284496 0.88924026 -0.59085226
## 2 11715101_s_at HIST1H3G 0.09195518 -0.286612030 -0.05651927 0.01545429
## 3 11715102_x_at HIST1H3G -0.30191730 -0.298989770 0.53580880 -0.05129719
## 4 11715103_x_at TNFAIP8L1 0.31854916 0.513157370 0.95201826 -0.17165422
## 5 11715104_s_at OTOP2 0.35021090 0.417993550 0.64977026 -0.87235403
## 6 11715105_at C17orf78 0.23255038 0.105412245 0.93498800 -0.38537788
## 7 11715106_x_at CTAGE6 -0.23309612 -0.247609620 0.05952883 0.16506481
## 8 11715107_s_at F8A1 0.21802092 0.263677600 0.25610542 -0.06133032
## 9 11715108_x_at LOC285501 0.15773225 0.230084420 -0.16884637 0.01592112
## 10 11715109_at SAMD7 -0.07625985 0.069449190 0.86671830 0.07166767
## GSM4340496 GSM4340497 GSM4340498 GSM4340499 GSM4340500 GSM4340501
## 1 -0.25674057 0.178862570 0.33442068 0.71101570 -0.39509892 0.46790314
## 2 0.46735048 -0.661887170 -0.19262838 -0.65387726 -0.23723197 -0.11107683
## 3 0.00169158 0.154232260 0.95216393 0.80829550 -0.22131062 -0.13876462
## 4 -0.49376535 -0.003461361 0.35323380 0.25973320 0.11914110 0.48709917
## 5 -0.27472186 0.518686800 -0.37734365 -0.18517780 -0.05672860 0.02594519
## 6 -0.18685770 0.038143635 -0.09946012 -0.19551945 -0.02428436 0.43764305
## 7 -0.17645860 0.284028300 -0.16674256 -0.03273916 -0.22399735 -0.35533237
## 8 0.24444056 0.127098080 0.35930157 -0.32224035 0.15163136 0.23986864
## 9 -0.07901430 0.194977760 0.01868057 0.40068722 0.20100140 0.01788568
## 10 -0.06297445 0.088138580 -0.11154175 0.31087565 -0.11259913 0.22273993
## GSM4340502 GSM4340503 GSM4340504 GSM4340505 GSM4340506 GSM4340507
## 1 -0.91596320 -0.21084070 0.507361400 -0.10268044 -0.268592120 -0.2066014
## 2 -0.53500676 -0.01545405 0.026301146 -0.22284293 -0.096830610 0.3286717
## 3 -0.32182740 -0.33841515 0.547380700 -0.27017474 -0.444824930 -0.2811055
## 4 0.37288857 -0.33219337 0.187441830 0.02667522 0.138779160 -0.2921214
## 5 0.84479380 -0.56166600 0.176565890 0.70575760 0.009498119 -0.2518997
## 6 0.47393608 -0.22646427 0.001132488 0.03431201 -0.122164965 0.1782284
## 7 0.52311490 0.42100382 0.003138065 -0.21974206 -0.107515570 -0.4599485
## 8 0.59180880 -0.09978533 -0.083554980 -0.35681129 0.452571400 -0.5096481
## 9 -0.07756519 0.12501192 0.252948280 0.25551580 -0.194641110 -0.0900197
## 10 0.38259960 0.20906234 0.245586870 0.81757355 0.399212840 0.3086305
## GSM4340508 GSM4340509 GSM4340510 GSM4340511 GSM4340512 GSM4340513
## 1 -0.03616428 1.39556170 0.9336066 -0.345187660 1.41630410 0.024940968
## 2 -0.10901141 0.26494336 0.2645502 -0.172512530 -0.01915169 0.590458400
## 3 -0.48442793 -0.00169158 0.3964074 -0.438740730 0.79996130 -0.003316164
## 4 -0.27666283 0.59313583 1.3088722 -0.078464985 -0.22184610 -0.125149250
## 5 -0.49455237 -0.12355471 1.2536860 0.005551815 0.14471460 -0.299937000
## 6 -0.26185846 0.03588915 1.0256069 -0.016168356 -0.08185172 0.283830400
## 7 0.26642323 0.49813986 0.6025591 0.103127960 0.28940630 0.305600400
## 8 -0.66656685 -0.04444027 -0.2184668 -0.607126950 -0.43012738 -0.034221650
## 9 -0.11579466 0.33119917 0.6969066 -0.205174210 0.44025946 0.700205300
## 10 0.47717070 -0.09540820 1.0605373 -0.025308609 0.09913993 -0.150889160
## GSM4340514 GSM4340515 GSM4340516 GSM4340517 GSM4340518 GSM4340519
## 1 0.36390543 -0.05049491 0.17156029 -0.17820406 -0.6384110 1.45310120
## 2 0.92116880 0.13653588 0.40749073 0.06032562 -0.6903899 -0.31139135
## 3 0.51725410 -0.09225488 0.17572045 -0.53029585 -0.4140344 1.46454930
## 4 -0.26830244 -0.42997742 0.15891123 0.73606540 0.3150458 0.25862217
## 5 -0.36973786 0.10511756 0.07034254 -0.32845616 -0.1203601 0.41244340
## 6 0.14113188 -0.04387975 -0.01858592 -0.08883715 -0.4785538 1.40855050
## 7 0.03930974 0.02128148 -0.35518550 -0.18491459 0.2934127 0.02635431
## 8 0.19977641 0.12956524 0.76010180 -0.25856924 0.1035805 -0.07077360
## 9 -0.34498215 0.18119955 0.18576646 -0.13998628 -0.2464888 1.01377010
## 10 0.27483344 0.29325080 -0.12200546 -0.01883483 0.0276742 0.91223645
## GSM4340520 GSM4340521 GSM4340522 GSM4340523 GSM4340524 GSM4340525
## 1 0.867766860 0.06011248 -0.04372168 0.99262430 0.32651900 1.221489000
## 2 1.184989500 -0.07329583 0.10649586 0.14782143 -0.04292679 0.122339725
## 3 1.107213300 -0.62243030 0.23995805 0.51144240 -0.21093988 0.734566900
## 4 0.399795530 -0.36117554 -0.14437151 0.07272816 0.40393830 -0.057187557
## 5 -0.190790410 0.21818352 0.03414512 0.24560475 0.25962280 -0.026625872
## 6 0.108215090 0.10102868 -0.21768450 -0.27324247 -0.06491280 0.093775510
## 7 -0.003138065 -0.03363609 -0.35512495 0.37538410 -0.48708916 -0.035444260
## 8 -0.350707530 -0.32779574 0.62094736 0.02097416 0.29126263 -0.032155037
## 9 -0.385637760 0.19942021 0.05432105 -0.35548830 -0.16131115 0.305632100
## 10 -0.340252160 0.67416120 -0.35456777 -0.07404351 0.22817540 0.003587961
## GSM4340526 GSM4340527 GSM4340528 GSM4340529 GSM4340530 GSM4340531
## 1 -0.079782010 0.16625214 -0.05562854 0.74712276 0.11671686 0.202030660
## 2 1.331762800 1.05427030 0.77380896 1.21485230 0.17339611 0.373914240
## 3 0.273911700 -0.08973885 0.18000436 0.59774710 0.13011074 0.019492626
## 4 -0.001162052 -0.15073442 -0.15721035 -0.13115883 0.05067396 0.009587288
## 5 -0.174499030 0.01978135 -0.42199445 -0.23907113 0.14861059 0.129543780
## 6 -0.022754430 -0.25086713 0.06694078 0.05301285 0.09123874 -0.136976960
## 7 -0.090754750 -0.25880456 -0.40618014 0.64868000 0.55330443 0.462115760
## 8 0.017302990 0.03397655 -0.46453524 -0.28433204 -0.53935814 0.035744190
## 9 -0.225182060 -0.12666178 -0.28527260 -0.21253347 0.40385842 0.159019710
## 10 0.002527237 0.01034808 -0.32083917 0.20892763 -0.15703940 0.117316484
## GSM4340532 GSM4340533 GSM4340534 GSM4340535 GSM4340536 GSM4340537
## 1 0.695035930 -0.07468033 0.42930126 0.762237100 -0.1346824 -0.66726850
## 2 0.644182700 1.74052330 0.81566286 1.983678300 0.1371200 -0.38933063
## 3 0.351636170 0.56654190 0.19219232 0.004929781 -0.3382273 -0.52081466
## 4 -0.137790200 -0.19500494 0.00483942 -0.624896050 0.2809162 -0.54733040
## 5 -0.005551815 0.17941070 0.21301961 1.075717400 0.2678673 -0.69308877
## 6 0.256294970 -0.19070745 0.01503134 -0.178430320 0.3682134 0.11680889
## 7 0.476877930 -0.13712406 -0.48446155 1.019529300 0.2419057 0.13968611
## 8 0.647078040 0.67237806 -0.27653360 0.587698000 0.1267602 -0.26808643
## 9 0.264019000 0.35026717 0.20412159 0.798325060 -0.1514852 -0.13640094
## 10 0.669338940 0.83077170 -0.47124220 0.010626555 0.2122192 -0.02050471
## GSM4340538 GSM4340539 GSM4340540 GSM4340541 GSM4340542 GSM4340543
## 1 -0.8034751 -0.08240318 -0.26496172 -0.55023000 0.139022350 0.19117546
## 2 -0.4086158 0.07601285 -0.23855066 1.20058390 -0.857179160 0.72668650
## 3 -0.4278896 0.18717742 -0.15261054 0.06187797 0.344895600 0.07341838
## 4 -0.1127582 -0.35107183 0.08686781 -0.11734247 -0.187285900 -0.34354544
## 5 -0.4702437 -0.01245666 -0.26411510 0.28339958 -0.232067350 -0.02993345
## 6 -0.2333312 -0.10420871 0.15422583 -0.09303212 -0.087192300 -0.43732095
## 7 -0.1069174 -0.19418597 -0.48339033 0.60018590 -0.008615255 -0.12579775
## 8 0.4021506 0.24575377 -0.20825243 0.17053008 -0.445592160 -0.09836698
## 9 -0.1217663 -0.21430850 -0.01568246 -0.31617308 -0.003108978 0.08321142
## 10 0.3645365 -0.32145430 -0.23980832 -0.21153498 -0.225210190 -0.16845512
## GSM4340544 GSM4340545 GSM4340546 GSM4340547 GSM4340548 GSM4340549
## 1 -0.13712788 -0.5064247 0.32296610 0.167175770 -0.007938385 0.265477180
## 2 -0.83762050 -0.5171139 0.06512356 1.502949700 0.820524700 -0.603322740
## 3 0.25898100 -0.1969023 -0.14703488 0.655670400 0.005455732 -0.004225254
## 4 0.09970617 -0.3247981 -0.16297817 -0.040671825 0.011886120 0.507034800
## 5 0.26365137 -0.4023223 -0.18715930 0.527691360 0.068028930 0.449331760
## 6 0.02165437 -0.1708701 -0.02961493 0.156007530 -0.175745730 0.357031350
## 7 -0.35575510 0.4802373 0.43824100 0.014536381 0.540118460 -0.238581660
## 8 0.04122806 -0.1636445 0.41594410 0.566439600 0.423860550 0.581760400
## 9 -0.07597423 0.1993938 -0.04255629 -0.026537180 0.096963880 0.349265580
## 10 -0.12926245 -0.1056190 0.22595882 -0.001816034 -0.009675026 -0.021210432
## GSM4340550 GSM4340551 GSM4340552 GSM4340553 GSM4340554 GSM4340555
## 1 0.310139660 0.14941120 0.06148148 -0.03426623 0.07923126 0.338356970
## 2 0.377018930 -0.08147907 -0.04239464 -0.16666675 0.43988752 0.569839950
## 3 0.511428600 -0.15845299 -0.26334380 -0.99466133 0.24187303 0.172201400
## 4 0.003585339 -0.03613806 1.10038520 0.07085276 0.11792040 0.243794920
## 5 0.188768390 0.24183226 -0.30591822 -0.36849856 -0.02074099 0.322824720
## 6 0.196948290 0.43674254 0.45061135 -0.31486250 -0.01961470 0.069894314
## 7 -0.102472780 -0.12055969 -0.38386060 0.09604788 0.00589323 0.616808650
## 8 -0.329080340 -0.07314849 -0.02191258 -0.10152102 -0.10343003 0.140138150
## 9 -0.258805500 -0.04596662 0.30989194 -0.15038610 -0.43206978 -0.006614685
## 10 0.698671800 -0.21091223 0.50165390 -0.34214520 -0.08751988 -0.007938623
## GSM4340556 GSM4340557 GSM4340558 GSM4340559 GSM4340560 GSM4340561
## 1 2.874059700 0.442127230 0.9249401 0.839578150 -0.38884664 -0.07621074
## 2 -0.110048770 -0.393716340 -0.8799987 0.684186460 0.30365920 0.88441324
## 3 1.739939900 0.296143770 1.2088530 0.370505570 -0.09722352 -0.05574894
## 4 0.259531970 0.001162052 1.4090724 0.464235300 -1.03242210 -0.09247875
## 5 -0.032611130 0.322963240 -0.1202502 0.496089460 0.30323625 -0.26851058
## 6 -0.001132488 0.085085150 -0.2871523 0.356799360 -0.27124834 -0.33872580
## 7 0.074517730 0.222347740 -0.2620816 0.525415200 0.45337200 -0.02035952
## 8 -0.022999763 -0.435444600 0.3761067 0.394931800 1.00498100 -0.22367430
## 9 -0.092698574 0.315870760 0.2548003 0.003108978 -0.08749151 -0.12848425
## 10 -0.005806685 -0.037118435 -0.2099145 0.491355180 0.21082091 -0.26856208
## GSM4340562 GSM4340563 GSM4340564 GSM4340565 GSM4340566 GSM4340567
## 1 -0.030527353 0.7965670 1.32892900 -0.70203495 0.17316818 -0.17138457
## 2 0.229472160 -0.5226088 -0.42213202 -0.42114854 -0.57192636 -0.09266257
## 3 0.467982530 0.7239873 1.06951830 -0.69173074 0.04611826 -0.20751357
## 4 -0.611855030 0.4623060 0.15209580 0.01517963 -0.12642765 -0.11141872
## 5 -0.190755370 0.3697445 -0.29396987 -0.04537916 0.17844630 -0.30581330
## 6 0.480561020 0.4825172 0.09900594 -0.34519982 -0.06524348 -0.23261500
## 7 -0.012398958 -0.3951211 0.34062195 0.03960013 -0.15827584 0.42929006
## 8 -0.004727602 0.3185978 -0.78857090 0.15007639 -0.31071950 -0.06571126
## 9 0.391360280 -0.1007588 -0.34945035 -0.17555260 0.21184110 -0.22346115
## 10 -0.384369130 -0.2083082 -0.23706675 -0.15146804 0.54136395 -0.40481090
## GSM4340568 GSM4340569 GSM4340570 GSM4340571 GSM4340572 GSM4340573
## 1 -0.47117710 -0.57729626 0.00291729 -0.00291729 -0.02237558 -0.215449810
## 2 -0.42650986 0.57595587 -0.06454945 -0.69773720 0.17670655 -1.009703200
## 3 -0.14878845 -0.02167058 -0.08139014 -0.14530134 -0.13251233 -0.116862774
## 4 -0.02393150 -0.02706146 1.04600050 -0.40366602 0.51856995 -0.090086940
## 5 -0.22946358 -0.23458219 1.23416950 0.19375610 -0.16548180 -0.057461023
## 6 0.05986667 -0.12558055 -0.11191845 0.47772480 0.11514950 0.773427000
## 7 0.55531836 0.30098010 -0.07369185 0.14053250 -0.02606392 -0.231655360
## 8 0.89040090 0.00472784 0.04761553 -0.11750078 0.75627136 -0.346018550
## 9 0.01146317 0.10802078 -0.14302516 -0.12559128 0.01791525 0.141523840
## 10 -0.26789665 -0.04320884 0.61968660 0.05324388 0.40543246 0.001815796
## GSM4340574 GSM4340575 GSM4340576 GSM4340577
## 1 -0.23842883 -0.13297105 -0.25816083 -0.65128374
## 2 -0.41535997 -0.36541247 -0.63268210 0.32752848
## 3 -0.01295447 0.06384516 -0.88006690 -0.50552154
## 4 -0.36591434 -0.05154228 -0.27018833 0.69949150
## 5 0.01363945 -0.04463029 -0.03419995 0.68252800
## 6 0.11831260 -0.01090026 -0.17179346 0.06035352
## 7 -0.08951592 0.15467095 -0.15713477 -0.21521902
## 8 -0.16519380 -0.02013493 -0.58750010 0.47252607
## 9 -0.01791692 -0.12587452 0.02695108 0.28917623
## 10 -0.20205832 0.02986431 -0.20121956 0.32708670
noGeneSymbol <- Lyme[grep('---',Lyme$Gene),]
platform4 <- platform[,c(1,20)]
Ensembl <- merge(platform4,noGeneSymbol,by.x='ID',by.y='ID')
Ensembl2 <- Ensembl[-grep('---',Ensembl$Ensembl),]
string5 <- strsplit(as.character(paste(Ensembl2$Ensembl)),'///')
Ensembl2$EnsemblID <- as.character(paste(lapply(string5,'[',1)))
Ensembl3 <- Ensembl2[,c(90,4:89)]
colnames(Ensembl3)[1] <- 'Gene'
LymeDisease <- Lyme[,-1]
Lets combine the Ensembl IDs data frame with the Gene Symbol data frame as they are some of the missing observations of the LymeDisease data frame with the gene symbols missing. Its only 75 out of 600 missing, but still replaces some missing values, and genecards.org will look up either gene symbol and we can grep out the Ensembl IDs with their prepended ‘ENSG’ ID names.
LymeDisease2 <- LymeDisease[-grep('---',LymeDisease$Gene),]
LymeDisease3 <- rbind(LymeDisease2,Ensembl3)
write.csv(LymeDisease3,'LymeDisease.csv',row.names=FALSE)
Our data is log2 normalized, and this means it is scaled to be between 0 and 1. There are many different ways to log2 normalize such as each x elements of a sample minus the mean of all x’s in the sample, then divided by the standard deviation of all x’s in the sample. Or take an element x of a sample then subtract the min(all x’s in sample) and divide that by the max(all x’s in sample)-min(all x’s in sample). To inverse log 2 you just take 2 and raise it to the output y of log2 normalized x. To inverse the normalized method, you reverse the operations. For the first inverse, you would multiply by the std error of x then subtract the mean of x and for the second method you would multiply by the max-min and then add the min. The normalization is done before the log2 according to Dr. Quackenbush on a posted question on biostars. I want to inverse the scaling, because when doing machine learning, the data is supposed to be scaled after splitting the data into training and testing sets. And Affymetrix data has more steps for normalization as well. Lets suppose that the normalization is the second method, because I could get back the original x by converting the decimal to a fraction, but couldn’t with the mean and std error method of scaling. Also if a value was zero I added 10^-8 to make it a value log2 would recognize and not quit on.
So, lets assume the formula is log2[(x-min(x))/(max(x)-min(x))]=y, then the inverse would be [2^(y)]*[max(y)-min(y)]+min(y)
a <- LymeDisease3$GSM4340492
head(a,10)
## [1] -0.59253310 0.09195518 -0.30191730 0.31854916 0.35021090 0.23255038
## [7] -0.23309612 0.21802092 0.15773225 -0.07625985
Inverse step 1 to take the base 2 and raise it by y we named A.
A <- (2^a)
head(A,10)
## [1] 0.6631775 1.0658136 0.8111737 1.2470758 1.2747470 1.1749101 0.8508070
## [8] 1.1631369 1.1155323 0.9485135
Step 2 of inverse is to inverse the standardization steps that set all values between 0 and 1. But we notice that the values above are not between 0 and 1 so they must not have been normalized with this method. And they likely weren’t because Dr. Quackenbush said the values are ‘background corrected,’‘quantile normalized,’‘probe summarisation (i.e. across transcripts),’ and ‘log (base 2) transformation.’-www.biostars.org/p/3121133/
AA <- A*(max(A)-min(A))+min(A)
head(AA,10)
## [1] 22.59247 36.29708 27.62984 42.46674 43.40859 40.01042 28.97885 39.60969
## [9] 37.98936 32.30451
Those values don’t look extreme, we could try to use the fractional method to get the original values back.
AAA <- as.fractions(AA)
head(AAA,10)
## [1] 618288/27367 22359/616 19258/697 156990101/3696778
## [5] 332640/7663 108901357/2721825 106758269/3684006 29430/743
## [9] 983198089/25880879 61637/1908
Multiply by the maximum value in the list of de-normalized or de-standardized values.The denominators are not all common, We need a common denominator and we might need these fractions to all have common denominators.
maxA <- max(AAA)
A4 <- AAA*maxA
A5 <- as.numeric(A4)
head(A5,10)
## [1] 26189.73 42076.44 32029.18 49228.46 50320.27 46381.04 33592.98 45916.50
## [9] 44038.18 37448.16
Those values are extremely high. We were better at stopping after de-standardizing the inverse log2 of y as our x.
%%%%%%%%%%%%% demonstration of what was expected %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Let me show you what I expected when using this on a different set of values. We start with x having 10 elements, but one is a 0, and then we standardize to fit between 0 and 1.
x <- c(1,2,3,4,5,5,43,0,23,11)
x_a <- (x-min(x))/(max(x)-min(x))
x_a
## [1] 0.02325581 0.04651163 0.06976744 0.09302326 0.11627907 0.11627907
## [7] 1.00000000 0.00000000 0.53488372 0.25581395
correct the 0 value for taking the log by adding a very small value, otherwise it will be a NaN or log error.
x_b <- x_a+10^-8
x_b
## [1] 0.02325582 0.04651164 0.06976745 0.09302327 0.11627908 0.11627908
## [7] 1.00000001 0.00000001 0.53488373 0.25581396
y <- log(x_b,2)
y
## [1] -5.426264e+00 -4.426264e+00 -3.841302e+00 -3.426265e+00 -3.104337e+00
## [6] -3.104337e+00 1.442695e-08 -2.657542e+01 -9.027028e-01 -1.966833e+00
The above is y, the log2 normalized output of x.
Lets get x back by reversing the operations.
x_c <- 2^y
x_c
## [1] 0.02325582 0.04651164 0.06976745 0.09302327 0.11627908 0.11627908
## [7] 1.00000001 0.00000001 0.53488373 0.25581396
The above is equal to x_b, the normalized value plus the 10^-8 small value.
x_d <- x_c-0.00000001
x_d
## [1] 2.325581e-02 4.651163e-02 6.976744e-02 9.302326e-02 1.162791e-01
## [6] 1.162791e-01 1.000000e+00 8.271806e-24 5.348837e-01 2.558140e-01
Notice that the zero is 10^-24, or a very small value, that is otherwise 0. That could be the size of the tiniest atom.
x_e <- x_d*(max(x_d)-min(x_d))+min(x_d)
x_e
## [1] 2.325581e-02 4.651163e-02 6.976744e-02 9.302326e-02 1.162791e-01
## [6] 1.162791e-01 1.000000e+00 1.654361e-23 5.348837e-01 2.558140e-01
#library(MASS)
X <- as.fractions(x_e)
X
## [1] 1/43 2/43 3/43 4/43 5/43 5/43 1 0 23/43 11/43
Notice, because its normalized the values aren’t the original values, but the denominator is the max value. We can multiply by that value and get our original values back.
X2 <- X*43
x
## [1] 1 2 3 4 5 5 43 0 23 11
X2
## [1] 1 2 3 4 5 5 43 0 23 11
We got back the original values using the second normalization method. %%%%%%%%%%%%%%%%%%%%%%%%%%% end of demonstration %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
There are 86 samples and we would have to do this to 85 more samples or create a function that will do those steps to each column in our data frame and write it out to a file we can read back in. We are going to forget about multiplying each entry by the max and just end the inverse log2 normalization after de-normalizing the y output vector assumedly back to x, the input vector. I tried to write a functional for loop to write this out to file, but it returned a long vector, and then when reading it in, the matrix() and the as.matrix() didn’t change the 4M+ long vector (48851*86 elements as rows) into the number of rows and columns, it just kept it as a very long vector. Online, the community says that the functions decide on their own.
LymeMX <- LymeDisease3[,2:87]
denormalize <- function(datatable){
for (i in LymeMX[,1:86]){
a <- i
A <- 2^a
AA <- A*(max(A)-min(A))+min(A)
write.table(AA,'lymeMX.csv',sep=',',append=TRUE,col.names=FALSE,row.names=FALSE)
}
}
if (file.exists('lymeMX.csv')){
file.remove('lymeMX.csv')
}
## [1] TRUE
denormalize(LymeMX)
lymeVector <- read.csv('lymeMX.csv',sep=',',header=F)
lymeMatrix <- as.matrix(lymeVector,nrow=48851,ncol=86)
lymeMatrix2 <- as.matrix(lymeVector,nrow=48851,ncol=86)
Both matrices are still just one long 4,201,186 X 1 matrix.
So, we must do this the long way, but technically copy and paste still make it somewhat fast. Just making sure to put in the right indices manually.
lymeMx <- as.data.frame(LymeDisease3[,1])
colnames(lymeMx) <- 'gene'
lymeMx$s1 <-2^(LymeDisease3[,2])*(max(2^(LymeDisease3[,2]))-min(2^(LymeDisease3[,2])))+min(2^(LymeDisease3[,2]))
lymeMx$s2 <-2^(LymeDisease3[,3])*(max(2^(LymeDisease3[,3]))-min(2^(LymeDisease3[,3])))+min(2^(LymeDisease3[,3]))
lymeMx$s3 <-2^(LymeDisease3[,4])*(max(2^(LymeDisease3[,4]))-min(2^(LymeDisease3[,4])))+min(2^(LymeDisease3[,4]))
lymeMx$s4 <-2^(LymeDisease3[,5])*(max(2^(LymeDisease3[,5]))-min(2^(LymeDisease3[,5])))+min(2^(LymeDisease3[,5]))
lymeMx$s5 <-2^(LymeDisease3[,6])*(max(2^(LymeDisease3[,6]))-min(2^(LymeDisease3[,6])))+min(2^(LymeDisease3[,6]))
lymeMx$s6 <-2^(LymeDisease3[,7])*(max(2^(LymeDisease3[,7]))-min(2^(LymeDisease3[,7])))+min(2^(LymeDisease3[,7]))
lymeMx$s7 <-2^(LymeDisease3[,8])*(max(2^(LymeDisease3[,8]))-min(2^(LymeDisease3[,8])))+min(2^(LymeDisease3[,8]))
lymeMx$s8 <-2^(LymeDisease3[,9])*(max(2^(LymeDisease3[,9]))-min(2^(LymeDisease3[,9])))+min(2^(LymeDisease3[,9]))
lymeMx$s9 <-2^(LymeDisease3[,10])*(max(2^(LymeDisease3[,10]))-min(2^(LymeDisease3[,10])))+min(2^(LymeDisease3[,10]))
lymeMx$s10 <-2^(LymeDisease3[,11])*(max(2^(LymeDisease3[,11]))-min(2^(LymeDisease3[,11])))+min(2^(LymeDisease3[,11]))
lymeMx$s11 <-2^(LymeDisease3[,12])*(max(2^(LymeDisease3[,12]))-min(2^(LymeDisease3[,12])))+min(2^(LymeDisease3[,12]))
lymeMx$s12 <-2^(LymeDisease3[,13])*(max(2^(LymeDisease3[,13]))-min(2^(LymeDisease3[,13])))+min(2^(LymeDisease3[,13]))
lymeMx$s13 <-2^(LymeDisease3[,14])*(max(2^(LymeDisease3[,14]))-min(2^(LymeDisease3[,14])))+min(2^(LymeDisease3[,14]))
lymeMx$s14 <-2^(LymeDisease3[,15])*(max(2^(LymeDisease3[,15]))-min(2^(LymeDisease3[,15])))+min(2^(LymeDisease3[,15]))
lymeMx$s15 <-2^(LymeDisease3[,16])*(max(2^(LymeDisease3[,16]))-min(2^(LymeDisease3[,16])))+min(2^(LymeDisease3[,16]))
lymeMx$s16 <-2^(LymeDisease3[,17])*(max(2^(LymeDisease3[,17]))-min(2^(LymeDisease3[,17])))+min(2^(LymeDisease3[,17]))
lymeMx$s17 <-2^(LymeDisease3[,18])*(max(2^(LymeDisease3[,18]))-min(2^(LymeDisease3[,18])))+min(2^(LymeDisease3[,18]))
lymeMx$s18 <-2^(LymeDisease3[,19])*(max(2^(LymeDisease3[,19]))-min(2^(LymeDisease3[,19])))+min(2^(LymeDisease3[,19]))
lymeMx$s19 <-2^(LymeDisease3[,20])*(max(2^(LymeDisease3[,20]))-min(2^(LymeDisease3[,20])))+min(2^(LymeDisease3[,20]))
lymeMx$s20 <-2^(LymeDisease3[,21])*(max(2^(LymeDisease3[,21]))-min(2^(LymeDisease3[,21])))+min(2^(LymeDisease3[,21]))
lymeMx$s21 <-2^(LymeDisease3[,22])*(max(2^(LymeDisease3[,22]))-min(2^(LymeDisease3[,22])))+min(2^(LymeDisease3[,22]))
lymeMx$s22 <-2^(LymeDisease3[,23])*(max(2^(LymeDisease3[,23]))-min(2^(LymeDisease3[,23])))+min(2^(LymeDisease3[,23]))
lymeMx$s23 <-2^(LymeDisease3[,24])*(max(2^(LymeDisease3[,24]))-min(2^(LymeDisease3[,24])))+min(2^(LymeDisease3[,24]))
lymeMx$s24 <-2^(LymeDisease3[,25])*(max(2^(LymeDisease3[,25]))-min(2^(LymeDisease3[,25])))+min(2^(LymeDisease3[,25]))
lymeMx$s25 <-2^(LymeDisease3[,26])*(max(2^(LymeDisease3[,26]))-min(2^(LymeDisease3[,26])))+min(2^(LymeDisease3[,26]))
lymeMx$s26 <-2^(LymeDisease3[,27])*(max(2^(LymeDisease3[,27]))-min(2^(LymeDisease3[,27])))+min(2^(LymeDisease3[,27]))
lymeMx$s27 <-2^(LymeDisease3[,28])*(max(2^(LymeDisease3[,28]))-min(2^(LymeDisease3[,28])))+min(2^(LymeDisease3[,28]))
lymeMx$s28 <-2^(LymeDisease3[,29])*(max(2^(LymeDisease3[,29]))-min(2^(LymeDisease3[,29])))+min(2^(LymeDisease3[,29]))
lymeMx$s29 <-2^(LymeDisease3[,30])*(max(2^(LymeDisease3[,30]))-min(2^(LymeDisease3[,30])))+min(2^(LymeDisease3[,30]))
lymeMx$s30 <-2^(LymeDisease3[,31])*(max(2^(LymeDisease3[,31]))-min(2^(LymeDisease3[,31])))+min(2^(LymeDisease3[,31]))
lymeMx$s31 <-2^(LymeDisease3[,32])*(max(2^(LymeDisease3[,32]))-min(2^(LymeDisease3[,32])))+min(2^(LymeDisease3[,32]))
lymeMx$s32 <-2^(LymeDisease3[,33])*(max(2^(LymeDisease3[,33]))-min(2^(LymeDisease3[,33])))+min(2^(LymeDisease3[,33]))
lymeMx$s33 <-2^(LymeDisease3[,34])*(max(2^(LymeDisease3[,34]))-min(2^(LymeDisease3[,34])))+min(2^(LymeDisease3[,34]))
lymeMx$s34 <-2^(LymeDisease3[,35])*(max(2^(LymeDisease3[,35]))-min(2^(LymeDisease3[,35])))+min(2^(LymeDisease3[,35]))
lymeMx$s35 <-2^(LymeDisease3[,36])*(max(2^(LymeDisease3[,36]))-min(2^(LymeDisease3[,36])))+min(2^(LymeDisease3[,36]))
lymeMx$s36 <-2^(LymeDisease3[,37])*(max(2^(LymeDisease3[,37]))-min(2^(LymeDisease3[,37])))+min(2^(LymeDisease3[,37]))
lymeMx$s37 <-2^(LymeDisease3[,38])*(max(2^(LymeDisease3[,38]))-min(2^(LymeDisease3[,38])))+min(2^(LymeDisease3[,38]))
lymeMx$s38 <-2^(LymeDisease3[,39])*(max(2^(LymeDisease3[,39]))-min(2^(LymeDisease3[,39])))+min(2^(LymeDisease3[,39]))
lymeMx$s39 <-2^(LymeDisease3[,40])*(max(2^(LymeDisease3[,40]))-min(2^(LymeDisease3[,40])))+min(2^(LymeDisease3[,40]))
lymeMx$s40 <-2^(LymeDisease3[,41])*(max(2^(LymeDisease3[,41]))-min(2^(LymeDisease3[,41])))+min(2^(LymeDisease3[,41]))
lymeMx$s41 <-2^(LymeDisease3[,42])*(max(2^(LymeDisease3[,42]))-min(2^(LymeDisease3[,42])))+min(2^(LymeDisease3[,42]))
lymeMx$s42 <-2^(LymeDisease3[,43])*(max(2^(LymeDisease3[,43]))-min(2^(LymeDisease3[,43])))+min(2^(LymeDisease3[,43]))
lymeMx$s43 <-2^(LymeDisease3[,44])*(max(2^(LymeDisease3[,44]))-min(2^(LymeDisease3[,44])))+min(2^(LymeDisease3[,44]))
lymeMx$s44 <-2^(LymeDisease3[,45])*(max(2^(LymeDisease3[,45]))-min(2^(LymeDisease3[,45])))+min(2^(LymeDisease3[,45]))
lymeMx$s45 <-2^(LymeDisease3[,46])*(max(2^(LymeDisease3[,46]))-min(2^(LymeDisease3[,46])))+min(2^(LymeDisease3[,46]))
lymeMx$s46 <-2^(LymeDisease3[,47])*(max(2^(LymeDisease3[,47]))-min(2^(LymeDisease3[,47])))+min(2^(LymeDisease3[,47]))
lymeMx$s47 <-2^(LymeDisease3[,48])*(max(2^(LymeDisease3[,48]))-min(2^(LymeDisease3[,48])))+min(2^(LymeDisease3[,48]))
lymeMx$s48 <-2^(LymeDisease3[,49])*(max(2^(LymeDisease3[,49]))-min(2^(LymeDisease3[,49])))+min(2^(LymeDisease3[,49]))
lymeMx$s49 <-2^(LymeDisease3[,50])*(max(2^(LymeDisease3[,50]))-min(2^(LymeDisease3[,50])))+min(2^(LymeDisease3[,50]))
lymeMx$s50 <-2^(LymeDisease3[,51])*(max(2^(LymeDisease3[,51]))-min(2^(LymeDisease3[,51])))+min(2^(LymeDisease3[,51]))
lymeMx$s51 <-2^(LymeDisease3[,52])*(max(2^(LymeDisease3[,52]))-min(2^(LymeDisease3[,52])))+min(2^(LymeDisease3[,52]))
lymeMx$s52 <-2^(LymeDisease3[,53])*(max(2^(LymeDisease3[,53]))-min(2^(LymeDisease3[,53])))+min(2^(LymeDisease3[,53]))
lymeMx$s53 <-2^(LymeDisease3[,54])*(max(2^(LymeDisease3[,54]))-min(2^(LymeDisease3[,54])))+min(2^(LymeDisease3[,54]))
lymeMx$s54 <-2^(LymeDisease3[,55])*(max(2^(LymeDisease3[,55]))-min(2^(LymeDisease3[,55])))+min(2^(LymeDisease3[,55]))
lymeMx$s55 <-2^(LymeDisease3[,56])*(max(2^(LymeDisease3[,56]))-min(2^(LymeDisease3[,56])))+min(2^(LymeDisease3[,56]))
lymeMx$s56 <-2^(LymeDisease3[,57])*(max(2^(LymeDisease3[,57]))-min(2^(LymeDisease3[,57])))+min(2^(LymeDisease3[,57]))
lymeMx$s57 <-2^(LymeDisease3[,58])*(max(2^(LymeDisease3[,58]))-min(2^(LymeDisease3[,58])))+min(2^(LymeDisease3[,58]))
lymeMx$s58 <-2^(LymeDisease3[,59])*(max(2^(LymeDisease3[,59]))-min(2^(LymeDisease3[,59])))+min(2^(LymeDisease3[,59]))
lymeMx$s59 <-2^(LymeDisease3[,60])*(max(2^(LymeDisease3[,60]))-min(2^(LymeDisease3[,60])))+min(2^(LymeDisease3[,60]))
lymeMx$s60 <-2^(LymeDisease3[,61])*(max(2^(LymeDisease3[,61]))-min(2^(LymeDisease3[,61])))+min(2^(LymeDisease3[,61]))
lymeMx$s61 <-2^(LymeDisease3[,62])*(max(2^(LymeDisease3[,62]))-min(2^(LymeDisease3[,62])))+min(2^(LymeDisease3[,62]))
lymeMx$s62 <-2^(LymeDisease3[,63])*(max(2^(LymeDisease3[,63]))-min(2^(LymeDisease3[,63])))+min(2^(LymeDisease3[,63]))
lymeMx$s63 <-2^(LymeDisease3[,64])*(max(2^(LymeDisease3[,64]))-min(2^(LymeDisease3[,64])))+min(2^(LymeDisease3[,64]))
lymeMx$s64 <-2^(LymeDisease3[,65])*(max(2^(LymeDisease3[,65]))-min(2^(LymeDisease3[,65])))+min(2^(LymeDisease3[,65]))
lymeMx$s65 <-2^(LymeDisease3[,66])*(max(2^(LymeDisease3[,66]))-min(2^(LymeDisease3[,66])))+min(2^(LymeDisease3[,66]))
lymeMx$s66 <-2^(LymeDisease3[,67])*(max(2^(LymeDisease3[,67]))-min(2^(LymeDisease3[,67])))+min(2^(LymeDisease3[,67]))
lymeMx$s67 <-2^(LymeDisease3[,68])*(max(2^(LymeDisease3[,68]))-min(2^(LymeDisease3[,68])))+min(2^(LymeDisease3[,68]))
lymeMx$s68 <-2^(LymeDisease3[,69])*(max(2^(LymeDisease3[,69]))-min(2^(LymeDisease3[,69])))+min(2^(LymeDisease3[,69]))
lymeMx$s69 <-2^(LymeDisease3[,70])*(max(2^(LymeDisease3[,70]))-min(2^(LymeDisease3[,70])))+min(2^(LymeDisease3[,70]))
lymeMx$s70 <-2^(LymeDisease3[,71])*(max(2^(LymeDisease3[,71]))-min(2^(LymeDisease3[,71])))+min(2^(LymeDisease3[,71]))
lymeMx$s71 <-2^(LymeDisease3[,72])*(max(2^(LymeDisease3[,72]))-min(2^(LymeDisease3[,72])))+min(2^(LymeDisease3[,72]))
lymeMx$s72 <-2^(LymeDisease3[,73])*(max(2^(LymeDisease3[,73]))-min(2^(LymeDisease3[,73])))+min(2^(LymeDisease3[,73]))
lymeMx$s73 <-2^(LymeDisease3[,74])*(max(2^(LymeDisease3[,74]))-min(2^(LymeDisease3[,74])))+min(2^(LymeDisease3[,74]))
lymeMx$s74 <-2^(LymeDisease3[,75])*(max(2^(LymeDisease3[,75]))-min(2^(LymeDisease3[,75])))+min(2^(LymeDisease3[,75]))
lymeMx$s75 <-2^(LymeDisease3[,76])*(max(2^(LymeDisease3[,76]))-min(2^(LymeDisease3[,76])))+min(2^(LymeDisease3[,76]))
lymeMx$s76 <-2^(LymeDisease3[,77])*(max(2^(LymeDisease3[,77]))-min(2^(LymeDisease3[,77])))+min(2^(LymeDisease3[,77]))
lymeMx$s77 <-2^(LymeDisease3[,78])*(max(2^(LymeDisease3[,78]))-min(2^(LymeDisease3[,78])))+min(2^(LymeDisease3[,78]))
lymeMx$s78 <-2^(LymeDisease3[,79])*(max(2^(LymeDisease3[,79]))-min(2^(LymeDisease3[,79])))+min(2^(LymeDisease3[,79]))
lymeMx$s79 <-2^(LymeDisease3[,80])*(max(2^(LymeDisease3[,80]))-min(2^(LymeDisease3[,80])))+min(2^(LymeDisease3[,80]))
lymeMx$s80 <-2^(LymeDisease3[,81])*(max(2^(LymeDisease3[,81]))-min(2^(LymeDisease3[,81])))+min(2^(LymeDisease3[,81]))
lymeMx$s81 <-2^(LymeDisease3[,82])*(max(2^(LymeDisease3[,82]))-min(2^(LymeDisease3[,82])))+min(2^(LymeDisease3[,82]))
lymeMx$s82 <-2^(LymeDisease3[,83])*(max(2^(LymeDisease3[,83]))-min(2^(LymeDisease3[,83])))+min(2^(LymeDisease3[,83]))
lymeMx$s83 <-2^(LymeDisease3[,84])*(max(2^(LymeDisease3[,84]))-min(2^(LymeDisease3[,84])))+min(2^(LymeDisease3[,84]))
lymeMx$s84 <-2^(LymeDisease3[,85])*(max(2^(LymeDisease3[,85]))-min(2^(LymeDisease3[,85])))+min(2^(LymeDisease3[,85]))
lymeMx$s85 <-2^(LymeDisease3[,86])*(max(2^(LymeDisease3[,86]))-min(2^(LymeDisease3[,86])))+min(2^(LymeDisease3[,86]))
lymeMx$s86 <-2^(LymeDisease3[,87])*(max(2^(LymeDisease3[,87]))-min(2^(LymeDisease3[,87])))+min(2^(LymeDisease3[,87]))
We now have our suspected original x values from taking the inverse of the log2(normalized x)
head(lymeMx,10)
## gene s1 s2 s3 s4 s5 s6 s7
## 1 HIST1H3G 22.59247 38.57177 39.54174 13.59257 26.04888 21.51069 50.33299
## 2 HIST1H3G 36.29708 31.84160 20.55844 20.66623 43.00957 12.01783 34.94171
## 3 HIST1H3G 27.62984 31.57033 30.96372 19.73411 31.15322 21.14685 77.21367
## 4 TNFAIP8L1 42.46674 55.36564 41.29757 18.15865 22.10683 18.95894 50.99312
## 5 OTOP2 43.40859 51.83702 33.50366 11.19206 25.72660 27.21961 30.74722
## 6 C17orf78 40.01042 41.75623 40.81368 15.66535 27.34023 19.51317 37.27006
## 7 CTAGE6 28.97885 32.71176 22.27521 22.91889 27.53780 23.13604 35.57361
## 8 F8A1 39.60969 46.58748 25.51772 19.59770 36.85641 20.75314 51.20788
## 9 LOC285501 37.98936 45.51724 19.02319 20.67290 29.45996 21.75213 40.44717
## 10 SAMD7 32.30451 40.73037 38.93021 21.48535 29.78898 20.20065 36.95958
## s8 s9 s10 s11 s12 s13 s14 s15
## 1 18.361496 16.17992 29.27993 36.74482 39.58568 17.62928 26.42135 16.19021
## 2 7.151825 18.04948 19.61416 47.83345 45.32257 12.66312 24.31543 18.22622
## 3 19.639702 18.24964 19.24207 55.44210 36.23800 18.12191 23.53288 14.33849
## 4 13.439440 23.10354 29.67159 89.70427 36.39449 14.14594 28.89328 21.44424
## 5 9.882871 20.45345 21.56443 124.39415 31.04687 14.04057 46.21978 19.61363
## 6 9.812547 20.91835 28.67301 96.20869 39.15962 12.44611 29.04626 17.91047
## 7 10.980143 18.21571 16.56545 99.54299 61.32365 12.46327 24.36760 18.09237
## 8 8.990543 23.62946 25.00492 104.39482 42.75090 11.74310 22.16534 26.63348
## 9 14.814970 24.45164 21.44452 65.66078 49.95431 14.79771 33.84786 17.03718
## 10 13.923062 19.67697 24.71026 90.30977 52.94930 14.72298 49.93891 25.66958
## s16 s17 s18 s19 s20 s21 s22 s23
## 1 18.65325 40.55269 48.72731 94.71925 27.68216 49.86249 53.64034 31.75436
## 2 26.99391 38.55864 22.27286 59.57215 31.19783 18.44950 79.37574 46.70039
## 3 17.71874 29.73707 18.52009 65.27301 25.94602 32.53405 52.60025 35.30951
## 4 17.58461 34.33457 27.95371 122.85707 33.29721 16.03413 48.34192 20.50662
## 5 18.07933 29.52951 17.02266 118.24643 35.29199 20.66606 42.82781 19.11798
## 6 24.32932 34.68813 19.00797 100.95596 34.76511 17.66578 64.18064 27.21852
## 7 15.66297 50.00273 26.17448 75.29876 37.75931 22.84390 65.15621 25.36739
## 8 15.13550 26.21678 17.98034 42.62383 23.09122 13.88158 51.48577 28.34550
## 9 20.21591 38.37803 23.31803 80.38736 30.50018 25.35947 85.64826 19.44793
## 10 26.62270 57.85869 17.35737 103.43002 34.54575 20.02411 47.48736 29.85636
## s24 s25 s26 s27 s28 s29 s30 s31
## 1 40.36005 24.71115 41.21683 21.97899 242.04619 28.97658 16.83710 53.92571
## 2 45.94425 29.09320 48.62383 21.20297 71.27186 36.09075 15.35622 59.84022
## 3 39.20903 24.78237 32.29525 25.66908 243.97417 34.19917 10.51733 65.63736
## 4 31.02927 24.49584 77.66133 42.51588 105.78490 20.96305 12.59117 50.29372
## 5 44.95489 23.04002 37.14204 31.45284 117.68216 13.93743 18.77854 56.91465
## 6 40.54546 21.66550 43.84952 24.54859 234.68729 17.13590 17.31942 47.80322
## 7 42.41794 17.16673 41.02564 41.88385 90.06063 15.86667 15.78225 43.46195
## 8 45.72288 37.13531 38.98464 36.72603 84.19979 12.48020 12.88423 85.46665
## 9 47.38832 24.95522 42.32277 28.82399 178.51466 12.18283 18.53681 57.71579
## 10 51.21420 20.16999 46.02872 34.84622 166.38602 12.57062 25.73242 43.47873
## s32 s33 s34 s35 s36 s37 s38 s39
## 1 90.70970 27.57647 19.531918 22.60302 29.23281 39.67376 19.79227 28.35322
## 2 50.52619 21.34949 9.144246 59.96040 54.04548 70.48516 27.35002 29.48876
## 3 64.99650 19.00399 13.951374 28.85456 24.49016 46.70942 17.85099 28.61755
## 4 47.96585 29.09598 8.080204 23.86327 23.47894 36.97766 10.79269 27.08516
## 5 54.06629 26.32759 8.252118 21.17318 26.41670 30.78056 10.01888 28.98670
## 6 37.74823 21.02680 8.965967 23.51028 21.90884 43.19002 12.25469 27.85714
## 7 59.15106 15.69574 8.202139 22.43253 21.78898 31.11961 18.49046 38.36838
## 8 46.27727 26.91106 8.220745 24.16935 26.67729 29.88674 9.71116 17.99804
## 9 35.65898 19.66863 10.376200 20.44577 23.87293 33.83838 10.20384 34.59415
## 10 43.33050 25.76022 8.425694 23.92411 26.24495 33.01483 13.64693 23.45511
## s40 s41 s42 s43 s44 s45 s46 s47
## 1 19.97120 29.13563 48.46683 47.87722 44.83273 24.58380 7.340767 16.65791
## 2 22.49588 28.12865 170.41287 62.57221 104.44944 29.65918 8.879782 21.87477
## 3 17.59975 22.97713 75.55876 40.62497 26.55102 21.36238 8.114731 21.58562
## 4 17.47945 16.38441 44.59320 35.68056 17.18297 32.75699 7.968716 26.83404
## 5 18.99342 17.95132 57.78958 41.21533 55.69708 32.46296 7.212277 20.96364
## 6 15.79263 21.51168 44.72606 35.93334 23.39033 34.79392 12.570699 24.68950
## 7 23.91300 25.05539 46.41627 25.42504 53.57219 31.88584 12.770074 26.94255
## 8 17.79893 28.18503 81.30571 29.36271 39.73188 29.44770 9.649790 38.30607
## 9 19.38511 21.62683 65.04796 40.96207 45.96660 24.30032 10.562778 26.66755
## 10 18.83328 28.62235 90.73399 25.65884 26.65580 31.23851 11.438219 37.32249
## s48 s49 s50 s51 s52 s53 s54 s55
## 1 21.84250 25.00497 15.03689 19.28145 20.95660 14.38273 15.33420 52.18946
## 2 24.37222 25.46586 50.56783 9.69547 30.34963 8.87291 15.22122 43.66884
## 3 26.32074 27.02532 22.97513 22.22991 19.31855 18.90868 18.99640 37.71431
## 4 18.13896 31.89456 20.29308 15.39031 14.48411 16.93831 17.38741 37.30122
## 5 22.92530 25.01962 26.78544 14.92172 17.98702 18.96980 16.47932 36.68333
## 6 21.51554 33.41618 20.63765 16.49177 13.57623 16.04933 19.34174 40.90143
## 7 20.21744 21.50006 33.35868 17.41162 16.83438 12.36843 30.35652 56.52007
## 8 27.40950 26.00508 24.77098 12.87700 17.15641 16.26777 19.43870 55.65522
## 9 19.93804 29.71015 17.68258 17.47798 19.44973 15.00299 24.99209 40.53731
## 10 18.51423 25.44372 19.01152 14.99254 16.34560 14.46104 20.23523 48.80388
## s56 s57 s58 s59 s60 s61 s62 s63
## 1 18.81358 38.60427 37.42814 27.37149 66.26418 51.51663 21.96050 41.78819
## 2 47.41096 68.53473 20.50622 28.66727 56.47553 47.93870 20.03999 53.65111
## 3 26.37484 38.96412 31.05025 31.46036 53.54524 41.13340 11.31445 46.77273
## 4 16.29607 39.13807 44.24589 22.14372 58.27621 105.83491 23.61590 42.92348
## 5 24.14021 40.69022 42.51203 25.16801 70.64305 39.93768 17.43137 38.99176
## 6 18.66889 34.36792 39.87882 25.31076 80.85079 67.46208 18.08939 39.02220
## 7 16.92980 56.43290 26.39818 20.57866 54.96826 37.83800 24.03091 39.71794
## 8 24.79599 52.06556 46.59680 17.59646 56.80215 48.62395 20.96293 36.82081
## 9 16.45601 41.51407 39.66486 18.47170 57.88106 61.19389 20.26675 29.32376
## 10 16.73956 38.55786 30.68710 35.81168 51.63592 69.89111 17.75162 37.22891
## s64 s65 s66 s67 s68 s69 s70 s71
## 1 10.586208 287.06210 47.57328 31.757893 82.46059 29.10355 38.66926 37.37517
## 2 12.411786 36.30340 26.66048 9.121163 74.04372 47.00593 75.20406 44.74178
## 3 9.445118 130.80362 42.99657 38.655164 59.58087 35.61306 39.22082 52.77229
## 4 9.920727 46.89542 35.04883 44.403146 63.57810 18.64624 38.23628 25.00343
## 5 10.473884 38.30379 43.80303 15.412572 64.99677 46.99216 33.85058 33.45397
## 6 8.805156 39.14815 37.14713 13.733739 59.01782 31.57138 32.24516 53.23378
## 7 12.819311 41.25430 40.85336 13.973693 66.33082 52.14100 40.19344 37.84687
## 8 9.239648 38.55965 25.90088 21.723133 60.59767 76.40293 34.91737 38.04827
## 9 8.355369 36.74231 43.58830 19.974912 46.19317 35.85380 37.29519 50.04637
## 10 8.347794 39.02161 34.13150 14.486557 64.78393 44.07917 33.84937 29.26158
## s72 s73 s74 s75 s76 s77 s78 s79
## 1 25.17885 83.46367 8.726264 31.65491 16.70587 16.53895 11.41146 32.06021
## 2 10.12092 24.80718 10.590134 18.90590 17.64200 17.05701 25.26796 30.59706
## 3 23.94595 69.73067 8.788421 28.99057 16.29304 20.66426 16.72934 30.24238
## 4 19.98203 36.92691 14.310777 25.72812 17.41430 22.52651 16.66729 66.02685
## 5 18.74339 27.11026 13.724748 31.77075 15.22076 19.54391 14.44662 75.22061
## 6 20.26322 35.59332 11.159615 26.84060 16.01224 23.86987 15.57318 29.60989
## 7 11.05134 42.07957 14.554143 25.16745 25.32619 33.62460 20.89906 30.40400
## 8 18.09229 19.24648 15.708073 22.64868 17.97442 42.39900 17.03657 33.06796
## 9 13.54135 26.08811 12.545321 32.51360 16.11407 23.08441 18.29423 28.97904
## 10 12.57219 28.20026 12.755591 40.84400 14.21224 19.03183 16.48280 49.14358
## s80 s81 s82 s83 s84 s85 s86
## 1 32.59497 36.59878 16.169422 17.04565 20.95848 23.88463 17.68828
## 2 20.17319 42.00942 9.332273 15.09083 17.84571 18.43729 34.83445
## 3 29.54068 33.91117 17.311605 19.91072 24.01601 15.54136 19.56592
## 4 24.71268 53.23373 17.635533 15.61322 22.17310 23.68683 45.07172
## 5 37.34162 33.14573 18.038448 20.27913 22.27939 27.88571 44.54518
## 6 45.44414 40.25601 32.071137 21.79697 22.80547 25.35448 28.95011
## 7 35.99257 36.50542 15.989030 18.88726 25.57395 25.61280 23.92105
## 8 30.11359 62.76273 14.772002 17.92761 22.66021 19.02194 38.51459
## 9 29.94572 37.63438 20.703257 19.84273 21.06163 29.09052 33.92133
## 10 33.88505 49.22103 18.794217 17.47806 23.45789 24.84377 34.82380
We can play around with the normalized data in some Tableau charts or this data right here that could be the raw values or close. Lets add in the actual names for our denormalized data.
colnames(lymeMx)[2:87] <- colnames(LymeDisease3)[2:87]
head(lymeMx,10)
## gene GSM4340492 GSM4340493 GSM4340494 GSM4340495 GSM4340496 GSM4340497
## 1 HIST1H3G 22.59247 38.57177 39.54174 13.59257 26.04888 21.51069
## 2 HIST1H3G 36.29708 31.84160 20.55844 20.66623 43.00957 12.01783
## 3 HIST1H3G 27.62984 31.57033 30.96372 19.73411 31.15322 21.14685
## 4 TNFAIP8L1 42.46674 55.36564 41.29757 18.15865 22.10683 18.95894
## 5 OTOP2 43.40859 51.83702 33.50366 11.19206 25.72660 27.21961
## 6 C17orf78 40.01042 41.75623 40.81368 15.66535 27.34023 19.51317
## 7 CTAGE6 28.97885 32.71176 22.27521 22.91889 27.53780 23.13604
## 8 F8A1 39.60969 46.58748 25.51772 19.59770 36.85641 20.75314
## 9 LOC285501 37.98936 45.51724 19.02319 20.67290 29.45996 21.75213
## 10 SAMD7 32.30451 40.73037 38.93021 21.48535 29.78898 20.20065
## GSM4340498 GSM4340499 GSM4340500 GSM4340501 GSM4340502 GSM4340503 GSM4340504
## 1 50.33299 18.361496 16.17992 29.27993 36.74482 39.58568 17.62928
## 2 34.94171 7.151825 18.04948 19.61416 47.83345 45.32257 12.66312
## 3 77.21367 19.639702 18.24964 19.24207 55.44210 36.23800 18.12191
## 4 50.99312 13.439440 23.10354 29.67159 89.70427 36.39449 14.14594
## 5 30.74722 9.882871 20.45345 21.56443 124.39415 31.04687 14.04057
## 6 37.27006 9.812547 20.91835 28.67301 96.20869 39.15962 12.44611
## 7 35.57361 10.980143 18.21571 16.56545 99.54299 61.32365 12.46327
## 8 51.20788 8.990543 23.62946 25.00492 104.39482 42.75090 11.74310
## 9 40.44717 14.814970 24.45164 21.44452 65.66078 49.95431 14.79771
## 10 36.95958 13.923062 19.67697 24.71026 90.30977 52.94930 14.72298
## GSM4340505 GSM4340506 GSM4340507 GSM4340508 GSM4340509 GSM4340510 GSM4340511
## 1 26.42135 16.19021 18.65325 40.55269 48.72731 94.71925 27.68216
## 2 24.31543 18.22622 26.99391 38.55864 22.27286 59.57215 31.19783
## 3 23.53288 14.33849 17.71874 29.73707 18.52009 65.27301 25.94602
## 4 28.89328 21.44424 17.58461 34.33457 27.95371 122.85707 33.29721
## 5 46.21978 19.61363 18.07933 29.52951 17.02266 118.24643 35.29199
## 6 29.04626 17.91047 24.32932 34.68813 19.00797 100.95596 34.76511
## 7 24.36760 18.09237 15.66297 50.00273 26.17448 75.29876 37.75931
## 8 22.16534 26.63348 15.13550 26.21678 17.98034 42.62383 23.09122
## 9 33.84786 17.03718 20.21591 38.37803 23.31803 80.38736 30.50018
## 10 49.93891 25.66958 26.62270 57.85869 17.35737 103.43002 34.54575
## GSM4340512 GSM4340513 GSM4340514 GSM4340515 GSM4340516 GSM4340517 GSM4340518
## 1 49.86249 53.64034 31.75436 40.36005 24.71115 41.21683 21.97899
## 2 18.44950 79.37574 46.70039 45.94425 29.09320 48.62383 21.20297
## 3 32.53405 52.60025 35.30951 39.20903 24.78237 32.29525 25.66908
## 4 16.03413 48.34192 20.50662 31.02927 24.49584 77.66133 42.51588
## 5 20.66606 42.82781 19.11798 44.95489 23.04002 37.14204 31.45284
## 6 17.66578 64.18064 27.21852 40.54546 21.66550 43.84952 24.54859
## 7 22.84390 65.15621 25.36739 42.41794 17.16673 41.02564 41.88385
## 8 13.88158 51.48577 28.34550 45.72288 37.13531 38.98464 36.72603
## 9 25.35947 85.64826 19.44793 47.38832 24.95522 42.32277 28.82399
## 10 20.02411 47.48736 29.85636 51.21420 20.16999 46.02872 34.84622
## GSM4340519 GSM4340520 GSM4340521 GSM4340522 GSM4340523 GSM4340524 GSM4340525
## 1 242.04619 28.97658 16.83710 53.92571 90.70970 27.57647 19.531918
## 2 71.27186 36.09075 15.35622 59.84022 50.52619 21.34949 9.144246
## 3 243.97417 34.19917 10.51733 65.63736 64.99650 19.00399 13.951374
## 4 105.78490 20.96305 12.59117 50.29372 47.96585 29.09598 8.080204
## 5 117.68216 13.93743 18.77854 56.91465 54.06629 26.32759 8.252118
## 6 234.68729 17.13590 17.31942 47.80322 37.74823 21.02680 8.965967
## 7 90.06063 15.86667 15.78225 43.46195 59.15106 15.69574 8.202139
## 8 84.19979 12.48020 12.88423 85.46665 46.27727 26.91106 8.220745
## 9 178.51466 12.18283 18.53681 57.71579 35.65898 19.66863 10.376200
## 10 166.38602 12.57062 25.73242 43.47873 43.33050 25.76022 8.425694
## GSM4340526 GSM4340527 GSM4340528 GSM4340529 GSM4340530 GSM4340531 GSM4340532
## 1 22.60302 29.23281 39.67376 19.79227 28.35322 19.97120 29.13563
## 2 59.96040 54.04548 70.48516 27.35002 29.48876 22.49588 28.12865
## 3 28.85456 24.49016 46.70942 17.85099 28.61755 17.59975 22.97713
## 4 23.86327 23.47894 36.97766 10.79269 27.08516 17.47945 16.38441
## 5 21.17318 26.41670 30.78056 10.01888 28.98670 18.99342 17.95132
## 6 23.51028 21.90884 43.19002 12.25469 27.85714 15.79263 21.51168
## 7 22.43253 21.78898 31.11961 18.49046 38.36838 23.91300 25.05539
## 8 24.16935 26.67729 29.88674 9.71116 17.99804 17.79893 28.18503
## 9 20.44577 23.87293 33.83838 10.20384 34.59415 19.38511 21.62683
## 10 23.92411 26.24495 33.01483 13.64693 23.45511 18.83328 28.62235
## GSM4340533 GSM4340534 GSM4340535 GSM4340536 GSM4340537 GSM4340538 GSM4340539
## 1 48.46683 47.87722 44.83273 24.58380 7.340767 16.65791 21.84250
## 2 170.41287 62.57221 104.44944 29.65918 8.879782 21.87477 24.37222
## 3 75.55876 40.62497 26.55102 21.36238 8.114731 21.58562 26.32074
## 4 44.59320 35.68056 17.18297 32.75699 7.968716 26.83404 18.13896
## 5 57.78958 41.21533 55.69708 32.46296 7.212277 20.96364 22.92530
## 6 44.72606 35.93334 23.39033 34.79392 12.570699 24.68950 21.51554
## 7 46.41627 25.42504 53.57219 31.88584 12.770074 26.94255 20.21744
## 8 81.30571 29.36271 39.73188 29.44770 9.649790 38.30607 27.40950
## 9 65.04796 40.96207 45.96660 24.30032 10.562778 26.66755 19.93804
## 10 90.73399 25.65884 26.65580 31.23851 11.438219 37.32249 18.51423
## GSM4340540 GSM4340541 GSM4340542 GSM4340543 GSM4340544 GSM4340545 GSM4340546
## 1 25.00497 15.03689 19.28145 20.95660 14.38273 15.33420 52.18946
## 2 25.46586 50.56783 9.69547 30.34963 8.87291 15.22122 43.66884
## 3 27.02532 22.97513 22.22991 19.31855 18.90868 18.99640 37.71431
## 4 31.89456 20.29308 15.39031 14.48411 16.93831 17.38741 37.30122
## 5 25.01962 26.78544 14.92172 17.98702 18.96980 16.47932 36.68333
## 6 33.41618 20.63765 16.49177 13.57623 16.04933 19.34174 40.90143
## 7 21.50006 33.35868 17.41162 16.83438 12.36843 30.35652 56.52007
## 8 26.00508 24.77098 12.87700 17.15641 16.26777 19.43870 55.65522
## 9 29.71015 17.68258 17.47798 19.44973 15.00299 24.99209 40.53731
## 10 25.44372 19.01152 14.99254 16.34560 14.46104 20.23523 48.80388
## GSM4340547 GSM4340548 GSM4340549 GSM4340550 GSM4340551 GSM4340552 GSM4340553
## 1 18.81358 38.60427 37.42814 27.37149 66.26418 51.51663 21.96050
## 2 47.41096 68.53473 20.50622 28.66727 56.47553 47.93870 20.03999
## 3 26.37484 38.96412 31.05025 31.46036 53.54524 41.13340 11.31445
## 4 16.29607 39.13807 44.24589 22.14372 58.27621 105.83491 23.61590
## 5 24.14021 40.69022 42.51203 25.16801 70.64305 39.93768 17.43137
## 6 18.66889 34.36792 39.87882 25.31076 80.85079 67.46208 18.08939
## 7 16.92980 56.43290 26.39818 20.57866 54.96826 37.83800 24.03091
## 8 24.79599 52.06556 46.59680 17.59646 56.80215 48.62395 20.96293
## 9 16.45601 41.51407 39.66486 18.47170 57.88106 61.19389 20.26675
## 10 16.73956 38.55786 30.68710 35.81168 51.63592 69.89111 17.75162
## GSM4340554 GSM4340555 GSM4340556 GSM4340557 GSM4340558 GSM4340559 GSM4340560
## 1 41.78819 10.586208 287.06210 47.57328 31.757893 82.46059 29.10355
## 2 53.65111 12.411786 36.30340 26.66048 9.121163 74.04372 47.00593
## 3 46.77273 9.445118 130.80362 42.99657 38.655164 59.58087 35.61306
## 4 42.92348 9.920727 46.89542 35.04883 44.403146 63.57810 18.64624
## 5 38.99176 10.473884 38.30379 43.80303 15.412572 64.99677 46.99216
## 6 39.02220 8.805156 39.14815 37.14713 13.733739 59.01782 31.57138
## 7 39.71794 12.819311 41.25430 40.85336 13.973693 66.33082 52.14100
## 8 36.82081 9.239648 38.55965 25.90088 21.723133 60.59767 76.40293
## 9 29.32376 8.355369 36.74231 43.58830 19.974912 46.19317 35.85380
## 10 37.22891 8.347794 39.02161 34.13150 14.486557 64.78393 44.07917
## GSM4340561 GSM4340562 GSM4340563 GSM4340564 GSM4340565 GSM4340566 GSM4340567
## 1 38.66926 37.37517 25.17885 83.46367 8.726264 31.65491 16.70587
## 2 75.20406 44.74178 10.12092 24.80718 10.590134 18.90590 17.64200
## 3 39.22082 52.77229 23.94595 69.73067 8.788421 28.99057 16.29304
## 4 38.23628 25.00343 19.98203 36.92691 14.310777 25.72812 17.41430
## 5 33.85058 33.45397 18.74339 27.11026 13.724748 31.77075 15.22076
## 6 32.24516 53.23378 20.26322 35.59332 11.159615 26.84060 16.01224
## 7 40.19344 37.84687 11.05134 42.07957 14.554143 25.16745 25.32619
## 8 34.91737 38.04827 18.09229 19.24648 15.708073 22.64868 17.97442
## 9 37.29519 50.04637 13.54135 26.08811 12.545321 32.51360 16.11407
## 10 33.84937 29.26158 12.57219 28.20026 12.755591 40.84400 14.21224
## GSM4340568 GSM4340569 GSM4340570 GSM4340571 GSM4340572 GSM4340573 GSM4340574
## 1 16.53895 11.41146 32.06021 32.59497 36.59878 16.169422 17.04565
## 2 17.05701 25.26796 30.59706 20.17319 42.00942 9.332273 15.09083
## 3 20.66426 16.72934 30.24238 29.54068 33.91117 17.311605 19.91072
## 4 22.52651 16.66729 66.02685 24.71268 53.23373 17.635533 15.61322
## 5 19.54391 14.44662 75.22061 37.34162 33.14573 18.038448 20.27913
## 6 23.86987 15.57318 29.60989 45.44414 40.25601 32.071137 21.79697
## 7 33.62460 20.89906 30.40400 35.99257 36.50542 15.989030 18.88726
## 8 42.39900 17.03657 33.06796 30.11359 62.76273 14.772002 17.92761
## 9 23.08441 18.29423 28.97904 29.94572 37.63438 20.703257 19.84273
## 10 19.03183 16.48280 49.14358 33.88505 49.22103 18.794217 17.47806
## GSM4340575 GSM4340576 GSM4340577
## 1 20.95848 23.88463 17.68828
## 2 17.84571 18.43729 34.83445
## 3 24.01601 15.54136 19.56592
## 4 22.17310 23.68683 45.07172
## 5 22.27939 27.88571 44.54518
## 6 22.80547 25.35448 28.95011
## 7 25.57395 25.61280 23.92105
## 8 22.66021 19.02194 38.51459
## 9 21.06163 29.09052 33.92133
## 10 23.45789 24.84377 34.82380
Actually, these column names aren’t going to do much justice to the sample identifiers in the charts, so we should align these column names up to their aliases or descriptive names. We named that table after creating it earlier as descriptors2.
head(descriptors2,10)
## Sample_Title Sample_GEO_Accession classDisease
## 1 PBMC total RNA-Healthy control 1 GSM4340492 healthyControl
## 2 PBMC total RNA-Healthy control 2 GSM4340493 healthyControl
## 3 PBMC total RNA-Healthy control 3 GSM4340494 healthyControl
## 4 PBMC total RNA-Healthy control 4 GSM4340495 healthyControl
## 5 PBMC total RNA-Healthy control 5 GSM4340496 healthyControl
## 6 PBMC total RNA-Healthy control 6 GSM4340497 healthyControl
## 7 PBMC total RNA-Healthy control 7 GSM4340498 healthyControl
## 8 PBMC total RNA-Healthy control 8 GSM4340499 healthyControl
## 9 PBMC total RNA-Healthy control 9 GSM4340500 healthyControl
## 10 PBMC total RNA-Healthy control 10 GSM4340501 healthyControl
Lets test the colnames of our denormalized and normalized data frames arethe same order as our descriptor names so we can replace the names.
descriptors2$denormalized <- as.factor(paste(colnames(lymeMx)[2:87]))
descriptors2$normalized <- as.factor(paste(colnames(LymeDisease3)[2:87]))
descriptors2[,1:5]
## Sample_Title
## 1 PBMC total RNA-Healthy control 1
## 2 PBMC total RNA-Healthy control 2
## 3 PBMC total RNA-Healthy control 3
## 4 PBMC total RNA-Healthy control 4
## 5 PBMC total RNA-Healthy control 5
## 6 PBMC total RNA-Healthy control 6
## 7 PBMC total RNA-Healthy control 7
## 8 PBMC total RNA-Healthy control 8
## 9 PBMC total RNA-Healthy control 9
## 10 PBMC total RNA-Healthy control 10
## 11 PBMC total RNA-Healthy control 11
## 12 PBMC total RNA-Healthy control 12
## 13 PBMC total RNA-Healthy control 13
## 14 PBMC total RNA-Healthy control 14
## 15 PBMC total RNA-Healthy control 15
## 16 PBMC total RNA-Healthy control 16
## 17 PBMC total RNA-Healthy control 17
## 18 PBMC total RNA-Healthy control 18
## 19 PBMC total RNA-Healthy control 19
## 20 PBMC total RNA-Healthy control 20
## 21 PBMC total RNA-Healthy control 21
## 22 PBMC total RNA-Acute Lyme disease subject 1
## 23 PBMC total RNA-Acute Lyme disease subject 2
## 24 PBMC total RNA-Acute Lyme disease subject 3
## 25 PBMC total RNA-Acute Lyme disease subject 4
## 26 PBMC total RNA-Acute Lyme disease subject 5
## 27 PBMC total RNA-Acute Lyme disease subject 6
## 28 PBMC total RNA-Acute Lyme disease subject 7
## 29 PBMC total RNA-Acute Lyme disease subject 8
## 30 PBMC total RNA-Acute Lyme disease subject 9
## 31 PBMC total RNA-Acute Lyme disease subject 10
## 32 PBMC total RNA-Acute Lyme disease subject 11
## 33 PBMC total RNA-Acute Lyme disease subject 12
## 34 PBMC total RNA-Acute Lyme disease subject 13
## 35 PBMC total RNA-Acute Lyme disease subject 14
## 36 PBMC total RNA-Acute Lyme disease subject 15
## 37 PBMC total RNA-Acute Lyme disease subject 16
## 38 PBMC total RNA-Acute Lyme disease subject 17
## 39 PBMC total RNA-Acute Lyme disease subject 18
## 40 PBMC total RNA-Acute Lyme disease subject 19
## 41 PBMC total RNA-Acute Lyme disease subject 20
## 42 PBMC total RNA-Acute Lyme disease subject 21
## 43 PBMC total RNA-Acute Lyme disease subject 22
## 44 PBMC total RNA-Acute Lyme disease subject 23
## 45 PBMC total RNA-Acute Lyme disease subject 24
## 46 PBMC total RNA-Acute Lyme disease subject 25
## 47 PBMC total RNA-Acute Lyme disease subject 26
## 48 PBMC total RNA-Acute Lyme disease subject 27
## 49 PBMC total RNA-Acute Lyme disease subject 28
## 50 PBMC total RNA-early convalescent Lyme disease subject 1
## 51 PBMC total RNA-early convalescent Lyme disease subject 2
## 52 PBMC total RNA-early convalescent Lyme disease subject 3
## 53 PBMC total RNA-early convalescent Lyme disease subject 4
## 54 PBMC total RNA-early convalescent Lyme disease subject 5
## 55 PBMC total RNA-early convalescent Lyme disease subject 6
## 56 PBMC total RNA-early convalescent Lyme disease subject 7
## 57 PBMC total RNA-early convalescent Lyme disease subject 8
## 58 PBMC total RNA-early convalescent Lyme disease subject 9
## 59 PBMC total RNA-early convalescent Lyme disease subject 10
## 60 PBMC total RNA-early convalescent Lyme disease subject 11
## 61 PBMC total RNA-early convalescent Lyme disease subject 12
## 62 PBMC total RNA-early convalescent Lyme disease subject 13
## 63 PBMC total RNA-early convalescent Lyme disease subject 14
## 64 PBMC total RNA-early convalescent Lyme disease subject 15
## 65 PBMC total RNA-early convalescent Lyme disease subject 16
## 66 PBMC total RNA-early convalescent Lyme disease subject 17
## 67 PBMC total RNA-early convalescent Lyme disease subject 18
## 68 PBMC total RNA-early convalescent Lyme disease subject 19
## 69 PBMC total RNA-early convalescent Lyme disease subject 20
## 70 PBMC total RNA-early convalescent Lyme disease subject 21
## 71 PBMC total RNA-early convalescent Lyme disease subject 22
## 72 PBMC total RNA-early convalescent Lyme disease subject 23
## 73 PBMC total RNA-early convalescent Lyme disease subject 24
## 74 PBMC total RNA-early convalescent Lyme disease subject 25
## 75 PBMC total RNA-early convalescent Lyme disease subject 26
## 76 PBMC total RNA-early convalescent Lyme disease subject 27
## 77 PBMC total RNA-late convalescent Lyme disease subject 1
## 78 PBMC total RNA-late convalescent Lyme disease subject 2
## 79 PBMC total RNA-late convalescent Lyme disease subject 3
## 80 PBMC total RNA-late convalescent Lyme disease subject 4
## 81 PBMC total RNA-late convalescent Lyme disease subject 5
## 82 PBMC total RNA-late convalescent Lyme disease subject 6
## 83 PBMC total RNA-late convalescent Lyme disease subject 7
## 84 PBMC total RNA-late convalescent Lyme disease subject 8
## 85 PBMC total RNA-late convalescent Lyme disease subject 9
## 86 PBMC total RNA-late convalescent Lyme disease subject 10
## Sample_GEO_Accession classDisease denormalized normalized
## 1 GSM4340492 healthyControl GSM4340492 GSM4340492
## 2 GSM4340493 healthyControl GSM4340493 GSM4340493
## 3 GSM4340494 healthyControl GSM4340494 GSM4340494
## 4 GSM4340495 healthyControl GSM4340495 GSM4340495
## 5 GSM4340496 healthyControl GSM4340496 GSM4340496
## 6 GSM4340497 healthyControl GSM4340497 GSM4340497
## 7 GSM4340498 healthyControl GSM4340498 GSM4340498
## 8 GSM4340499 healthyControl GSM4340499 GSM4340499
## 9 GSM4340500 healthyControl GSM4340500 GSM4340500
## 10 GSM4340501 healthyControl GSM4340501 GSM4340501
## 11 GSM4340502 healthyControl GSM4340502 GSM4340502
## 12 GSM4340503 healthyControl GSM4340503 GSM4340503
## 13 GSM4340504 healthyControl GSM4340504 GSM4340504
## 14 GSM4340505 healthyControl GSM4340505 GSM4340505
## 15 GSM4340506 healthyControl GSM4340506 GSM4340506
## 16 GSM4340507 healthyControl GSM4340507 GSM4340507
## 17 GSM4340508 healthyControl GSM4340508 GSM4340508
## 18 GSM4340509 healthyControl GSM4340509 GSM4340509
## 19 GSM4340510 healthyControl GSM4340510 GSM4340510
## 20 GSM4340511 healthyControl GSM4340511 GSM4340511
## 21 GSM4340512 healthyControl GSM4340512 GSM4340512
## 22 GSM4340513 acuteLymeDisease GSM4340513 GSM4340513
## 23 GSM4340514 acuteLymeDisease GSM4340514 GSM4340514
## 24 GSM4340515 acuteLymeDisease GSM4340515 GSM4340515
## 25 GSM4340516 acuteLymeDisease GSM4340516 GSM4340516
## 26 GSM4340517 acuteLymeDisease GSM4340517 GSM4340517
## 27 GSM4340518 acuteLymeDisease GSM4340518 GSM4340518
## 28 GSM4340519 acuteLymeDisease GSM4340519 GSM4340519
## 29 GSM4340520 acuteLymeDisease GSM4340520 GSM4340520
## 30 GSM4340521 acuteLymeDisease GSM4340521 GSM4340521
## 31 GSM4340522 acuteLymeDisease GSM4340522 GSM4340522
## 32 GSM4340523 acuteLymeDisease GSM4340523 GSM4340523
## 33 GSM4340524 acuteLymeDisease GSM4340524 GSM4340524
## 34 GSM4340525 acuteLymeDisease GSM4340525 GSM4340525
## 35 GSM4340526 acuteLymeDisease GSM4340526 GSM4340526
## 36 GSM4340527 acuteLymeDisease GSM4340527 GSM4340527
## 37 GSM4340528 acuteLymeDisease GSM4340528 GSM4340528
## 38 GSM4340529 acuteLymeDisease GSM4340529 GSM4340529
## 39 GSM4340530 acuteLymeDisease GSM4340530 GSM4340530
## 40 GSM4340531 acuteLymeDisease GSM4340531 GSM4340531
## 41 GSM4340532 acuteLymeDisease GSM4340532 GSM4340532
## 42 GSM4340533 acuteLymeDisease GSM4340533 GSM4340533
## 43 GSM4340534 acuteLymeDisease GSM4340534 GSM4340534
## 44 GSM4340535 acuteLymeDisease GSM4340535 GSM4340535
## 45 GSM4340536 acuteLymeDisease GSM4340536 GSM4340536
## 46 GSM4340537 acuteLymeDisease GSM4340537 GSM4340537
## 47 GSM4340538 acuteLymeDisease GSM4340538 GSM4340538
## 48 GSM4340539 acuteLymeDisease GSM4340539 GSM4340539
## 49 GSM4340540 acuteLymeDisease GSM4340540 GSM4340540
## 50 GSM4340541 Antibodies_1month GSM4340541 GSM4340541
## 51 GSM4340542 Antibodies_1month GSM4340542 GSM4340542
## 52 GSM4340543 Antibodies_1month GSM4340543 GSM4340543
## 53 GSM4340544 Antibodies_1month GSM4340544 GSM4340544
## 54 GSM4340545 Antibodies_1month GSM4340545 GSM4340545
## 55 GSM4340546 Antibodies_1month GSM4340546 GSM4340546
## 56 GSM4340547 Antibodies_1month GSM4340547 GSM4340547
## 57 GSM4340548 Antibodies_1month GSM4340548 GSM4340548
## 58 GSM4340549 Antibodies_1month GSM4340549 GSM4340549
## 59 GSM4340550 Antibodies_1month GSM4340550 GSM4340550
## 60 GSM4340551 Antibodies_1month GSM4340551 GSM4340551
## 61 GSM4340552 Antibodies_1month GSM4340552 GSM4340552
## 62 GSM4340553 Antibodies_1month GSM4340553 GSM4340553
## 63 GSM4340554 Antibodies_1month GSM4340554 GSM4340554
## 64 GSM4340555 Antibodies_1month GSM4340555 GSM4340555
## 65 GSM4340556 Antibodies_1month GSM4340556 GSM4340556
## 66 GSM4340557 Antibodies_1month GSM4340557 GSM4340557
## 67 GSM4340558 Antibodies_1month GSM4340558 GSM4340558
## 68 GSM4340559 Antibodies_1month GSM4340559 GSM4340559
## 69 GSM4340560 Antibodies_1month GSM4340560 GSM4340560
## 70 GSM4340561 Antibodies_1month GSM4340561 GSM4340561
## 71 GSM4340562 Antibodies_1month GSM4340562 GSM4340562
## 72 GSM4340563 Antibodies_1month GSM4340563 GSM4340563
## 73 GSM4340564 Antibodies_1month GSM4340564 GSM4340564
## 74 GSM4340565 Antibodies_1month GSM4340565 GSM4340565
## 75 GSM4340566 Antibodies_1month GSM4340566 GSM4340566
## 76 GSM4340567 Antibodies_1month GSM4340567 GSM4340567
## 77 GSM4340568 Antibodies_6months GSM4340568 GSM4340568
## 78 GSM4340569 Antibodies_6months GSM4340569 GSM4340569
## 79 GSM4340570 Antibodies_6months GSM4340570 GSM4340570
## 80 GSM4340571 Antibodies_6months GSM4340571 GSM4340571
## 81 GSM4340572 Antibodies_6months GSM4340572 GSM4340572
## 82 GSM4340573 Antibodies_6months GSM4340573 GSM4340573
## 83 GSM4340574 Antibodies_6months GSM4340574 GSM4340574
## 84 GSM4340575 Antibodies_6months GSM4340575 GSM4340575
## 85 GSM4340576 Antibodies_6months GSM4340576 GSM4340576
## 86 GSM4340577 Antibodies_6months GSM4340577 GSM4340577
descriptors2$Sample_GEO_Accession==descriptors2$denormalized
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
descriptors2$Sample_GEO_Accession==descriptors2$normalized
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The sample IDs are the same order as our aliases for the class they belong to. Here are our unique classes, there are four of them.
unique(descriptors2$classDisease)
## [1] "healthyControl" "acuteLymeDisease" "Antibodies_1month"
## [4] "Antibodies_6months"
We can still use our shorter names or gsub() the extended names with the information we don’t need. But we have to add a number to the end that makes each column name different.
n21 <- as.character(c(1:21))
n28 <- as.character(c(1:28))
n27 <- as.character(c(1:27))
n10 <- as.character(c(1:10))
descriptors2$classDisease[1:21] <- paste(descriptors2$classDisease[1:21],n21,sep='_')
descriptors2$classDisease[22:49] <- paste(descriptors2$classDisease[22:49],n28,sep='_')
descriptors2$classDisease[50:76] <- paste(descriptors2$classDisease[50:76],n27,sep='_')
descriptors2$classDisease[77:86] <- paste(descriptors2$classDisease[77:86],n10,sep='_')
head(descriptors2)
## Sample_Title Sample_GEO_Accession classDisease
## 1 PBMC total RNA-Healthy control 1 GSM4340492 healthyControl_1
## 2 PBMC total RNA-Healthy control 2 GSM4340493 healthyControl_2
## 3 PBMC total RNA-Healthy control 3 GSM4340494 healthyControl_3
## 4 PBMC total RNA-Healthy control 4 GSM4340495 healthyControl_4
## 5 PBMC total RNA-Healthy control 5 GSM4340496 healthyControl_5
## 6 PBMC total RNA-Healthy control 6 GSM4340497 healthyControl_6
## denormalized normalized
## 1 GSM4340492 GSM4340492
## 2 GSM4340493 GSM4340493
## 3 GSM4340494 GSM4340494
## 4 GSM4340495 GSM4340495
## 5 GSM4340496 GSM4340496
## 6 GSM4340497 GSM4340497
descriptors2$classDisease
## [1] "healthyControl_1" "healthyControl_2" "healthyControl_3"
## [4] "healthyControl_4" "healthyControl_5" "healthyControl_6"
## [7] "healthyControl_7" "healthyControl_8" "healthyControl_9"
## [10] "healthyControl_10" "healthyControl_11" "healthyControl_12"
## [13] "healthyControl_13" "healthyControl_14" "healthyControl_15"
## [16] "healthyControl_16" "healthyControl_17" "healthyControl_18"
## [19] "healthyControl_19" "healthyControl_20" "healthyControl_21"
## [22] "acuteLymeDisease_1" "acuteLymeDisease_2" "acuteLymeDisease_3"
## [25] "acuteLymeDisease_4" "acuteLymeDisease_5" "acuteLymeDisease_6"
## [28] "acuteLymeDisease_7" "acuteLymeDisease_8" "acuteLymeDisease_9"
## [31] "acuteLymeDisease_10" "acuteLymeDisease_11" "acuteLymeDisease_12"
## [34] "acuteLymeDisease_13" "acuteLymeDisease_14" "acuteLymeDisease_15"
## [37] "acuteLymeDisease_16" "acuteLymeDisease_17" "acuteLymeDisease_18"
## [40] "acuteLymeDisease_19" "acuteLymeDisease_20" "acuteLymeDisease_21"
## [43] "acuteLymeDisease_22" "acuteLymeDisease_23" "acuteLymeDisease_24"
## [46] "acuteLymeDisease_25" "acuteLymeDisease_26" "acuteLymeDisease_27"
## [49] "acuteLymeDisease_28" "Antibodies_1month_1" "Antibodies_1month_2"
## [52] "Antibodies_1month_3" "Antibodies_1month_4" "Antibodies_1month_5"
## [55] "Antibodies_1month_6" "Antibodies_1month_7" "Antibodies_1month_8"
## [58] "Antibodies_1month_9" "Antibodies_1month_10" "Antibodies_1month_11"
## [61] "Antibodies_1month_12" "Antibodies_1month_13" "Antibodies_1month_14"
## [64] "Antibodies_1month_15" "Antibodies_1month_16" "Antibodies_1month_17"
## [67] "Antibodies_1month_18" "Antibodies_1month_19" "Antibodies_1month_20"
## [70] "Antibodies_1month_21" "Antibodies_1month_22" "Antibodies_1month_23"
## [73] "Antibodies_1month_24" "Antibodies_1month_25" "Antibodies_1month_26"
## [76] "Antibodies_1month_27" "Antibodies_6months_1" "Antibodies_6months_2"
## [79] "Antibodies_6months_3" "Antibodies_6months_4" "Antibodies_6months_5"
## [82] "Antibodies_6months_6" "Antibodies_6months_7" "Antibodies_6months_8"
## [85] "Antibodies_6months_9" "Antibodies_6months_10"
write.csv(descriptors2,'descriptors2.csv',row.names=F)
LymeDisease4 <- LymeDisease3
colnames(LymeDisease4)[2:87] <- descriptors2$classDisease
lymeMx2 <- lymeMx
colnames(lymeMx2)[2:87] <- descriptors2$classDisease
write.csv(LymeDisease3,'LymeDisease3.csv',row.names=FALSE)
write.csv(LymeDisease4,'LymeDisease4normalized-easynames.csv',row.names=FALSE)
write.csv(lymeMx2,'lymeMx2-denormalized-easynames.csv',row.names=FALSE)
write.csv(lymeMx,'lymeMx-denormalized-originalnames.csv',row.names=FALSE)
Now, we can use this data to find the mean values across samples and get the fold change values, then plot the data in Tableau.
LymeDisease5 <- LymeDisease4 %>% group_by(Gene) %>% summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)
lymeMx3 <- lymeMx2 %>% group_by(gene) %>% summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)
Lyme6 <- LymeDisease5 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(healthyControl_1:healthyControl_21,na.rm=T),
acuteLymeDisease_Mean = mean(acuteLymeDisease_1:acuteLymeDisease_28,na.rm=T),
antibodies_1month_Mean = mean(Antibodies_1month_1:Antibodies_1month_27,na.rm=T),
antibodies_6month_Mean = mean(Antibodies_6months_1:Antibodies_6months_10,na.rm=T)
)
tail(colnames(Lyme6),5)
## [1] "Antibodies_6months_10" "healthy_Mean" "acuteLymeDisease_Mean"
## [4] "antibodies_1month_Mean" "antibodies_6month_Mean"
lymeMx4 <- lymeMx3 %>% group_by(gene) %>%
mutate(
healthy_Mean = mean(healthyControl_1:healthyControl_21,na.rm=T),
acuteLymeDisease_Mean = mean(acuteLymeDisease_1:acuteLymeDisease_28,na.rm=T),
antibodies_1month_Mean = mean(Antibodies_1month_1:Antibodies_1month_27,na.rm=T),
antibodies_6month_Mean = mean(Antibodies_6months_1:Antibodies_6months_10,na.rm=T)
)
tail(colnames(lymeMx4),5)
## [1] "Antibodies_6months_10" "healthy_Mean" "acuteLymeDisease_Mean"
## [4] "antibodies_1month_Mean" "antibodies_6month_Mean"
lymeMx5 <- lymeMx4 %>% group_by(gene) %>%
mutate(acuteHealthy_foldChange=acuteLymeDisease_Mean/healthy_Mean,
antibodies_1month_healthy_foldChange=antibodies_1month_Mean/healthy_Mean,
antibodies_6month_healthy_foldchange=antibodies_6month_Mean/healthy_Mean)
tail(colnames(lymeMx5),10)
## [1] "Antibodies_6months_8"
## [2] "Antibodies_6months_9"
## [3] "Antibodies_6months_10"
## [4] "healthy_Mean"
## [5] "acuteLymeDisease_Mean"
## [6] "antibodies_1month_Mean"
## [7] "antibodies_6month_Mean"
## [8] "acuteHealthy_foldChange"
## [9] "antibodies_1month_healthy_foldChange"
## [10] "antibodies_6month_healthy_foldchange"
Lyme7 <- Lyme6 %>% group_by(Gene) %>%
mutate(acuteHealthy_foldChange=acuteLymeDisease_Mean/healthy_Mean,
antibodies_1month_healthy_foldChange=antibodies_1month_Mean/healthy_Mean,
antibodies_6month_healthy_foldchange=antibodies_6month_Mean/healthy_Mean)
tail(colnames(Lyme7),10)
## [1] "Antibodies_6months_8"
## [2] "Antibodies_6months_9"
## [3] "Antibodies_6months_10"
## [4] "healthy_Mean"
## [5] "acuteLymeDisease_Mean"
## [6] "antibodies_1month_Mean"
## [7] "antibodies_6month_Mean"
## [8] "acuteHealthy_foldChange"
## [9] "antibodies_1month_healthy_foldChange"
## [10] "antibodies_6month_healthy_foldchange"
Our tables of unique genes grouped by genes to get their means of each gene within each sample for the duplicate genes, the added features of each class’s mean gene expression per gene, and the fold change ratio of the diseased or treated to the healthy gene expression values have been created. The normalized data or the original data is the Lyme7 data frame and the denormalized data is the lymeMx5 data frame. Now each shrunk from 48851 genes to 19526 genes when grouping by unique genes, but now that is still a lot of genes, so lets take the gene that have the top 10 most expressed and least expressed values in both data frames by acute/healthy fold change, and the top 10 and bottom 10 of the 1month of antibodies/healthy fold change values, and finally the top 10 and bottom 10 of the 6 month of antibodies/healthy fold change values. *** The denormalized group first:
Acute/healthy top 10 and bottom 10 genes by fold change data frame:
acuteHealthy20 <- lymeMx5[order(lymeMx5$acuteHealthy_foldChange,
decreasing=T)[c(1:10,19517:19526)],]
One month/healthy top 10 and bottom 10 genes by fold change data frame:
month1healthy20 <- lymeMx5[order(lymeMx5$antibodies_1month_healthy_foldChange,
decreasing=T)[c(1:10,19517:19526)],]
Six month/healthy top 10 and bottom 10 genes by fold change data frame:
month6healthy20 <- lymeMx5[order(lymeMx5$antibodies_6month_healthy_foldchange,
decreasing=T)[c(1:10,19517:19526)],]
lymeMx6 <- rbind(acuteHealthy20,month1healthy20,month6healthy20)
lymeMx7 <- lymeMx6[!duplicated(lymeMx6),]
There were 43 unique genes between all three fold change groups in the denormalized data out of 60 genes that were either the top 10 or bottom 10 of genes being expressed.
Now, for the normalized data:
Acute/healthy top 10 and bottom 10 genes by fold change data frame:
acuteHealthy20b <- Lyme7[order(Lyme7$acuteHealthy_foldChange,
decreasing=T)[c(1:10,19517:19526)],]
One month/healthy top 10 and bottom 10 genes by fold change data frame:
month1healthy20b <- Lyme7[order(Lyme7$antibodies_1month_healthy_foldChange,
decreasing=T)[c(1:10,19517:19526)],]
Six month/healthy top 10 and bottom 10 genes by fold change data frame:
month6healthy20b <- Lyme7[order(Lyme7$antibodies_6month_healthy_foldchange,
decreasing=T)[c(1:10,19517:19526)],]
Lyme8 <- rbind(acuteHealthy20b,month1healthy20b,month6healthy20b)
Lyme9 <- Lyme8[!duplicated(Lyme8),]
There are 33 genes unique to the normalized data, probably because this data had negative values. The scaling done to denormalize this data is probably not exactly what the true raw values are. But they should have the same number of genes, but this one has 10 less than the normalized data. We will see later which one can be split into training and testing sets with better prediction accuracy within each class and overall.
Lets also add the gene summaries to these data frames and create a field that will give the class of each sample. This file,genecards2.R, is an R file sourced for the functions made in previous scripts. We lose one of the genes in the original data frame because it isn’t in genecards.org and end up with 32 instead of 33 genes for that data frame.
source('geneCards2.R')
## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
LOC400657 (#23 in list) is a gene that genecards.org doesn’t recognize and it will throw an error, so we should skip it.
for (i in Lyme9$Gene[1:22]){
getSummaries2(i,'protein')
}
for (i in Lyme9$Gene[24:33]){
getSummaries2(i,'protein')
}
getGeneSummaries('protein')
summsLyme9 <- read.csv("proteinGeneSummaries_protein.csv")
for (i in lymeMx7$gene){
getSummaries2(i,'immune')
}
getGeneSummaries('immune')
summsLymeMx7 <- read.csv("proteinGeneSummaries_immune.csv")
Lyme10 <- merge(summsLyme9,Lyme9,by.x='gene', by.y='Gene')
lymeMx8 <- merge(summsLymeMx7,lymeMx7, by.x='gene',by.y='gene')
Lets create those classes for each data frame. But first we have to tidy the data.
Lyme11 <- gather(Lyme10, key='classSample',value='classValue',8:93)
lymeMx9 <- gather(lymeMx8,key='classSample',value='classValue',8:93)
Lyme11$class <- Lyme11$classSample
Lyme11$class <- gsub('^hea.*$','healthy',Lyme11$class, perl=T)
Lyme11$class <- gsub('^acute.*$','acute Lyme Disease',Lyme11$class,perl=T)
Lyme11$class <- gsub('^.*1month.*$','1 month treatment', Lyme11$class, perl=T)
Lyme11$class <- gsub('^.*6month.*$','6 months treatment',Lyme11$class,perl=T)
lymeMx9$class <- lymeMx9$classSample
lymeMx9$class <- gsub('^hea.*$','healthy',lymeMx9$class, perl=T)
lymeMx9$class <- gsub('^acute.*$','acute Lyme Disease',lymeMx9$class,perl=T)
lymeMx9$class <- gsub('^.*1month.*$','1 month treatment', lymeMx9$class, perl=T)
lymeMx9$class <- gsub('^.*6month.*$','6 months treatment',lymeMx9$class,perl=T)
unique(Lyme11$gene)
## factor(0)
## 61 Levels: AANAT ANKH APOA1 APOB CACNA1B CALCA CALCR CALCRL CASR ... VDR
unique(lymeMx9$gene)
## [1] AREG BEST1 BPI CAMP CEACAM8 CHI3L1 CKMT1B CTSG
## [9] CXCL2 DEFA1 DEFA4 DHX58 FCGR3B FSIP1 GAPT GZMH
## [17] HBG1 HLA-DRB4 HTR3C IL1B KIAA1245 KIR2DL3 KIR2DS1 LCN2
## [25] LIPN LSM2 LTF MS4A3 MUC12 MYOM2 OLR1 OR2B11
## [33] POLR2I RGS18 S100B SERPINB2 THBD THBS1 TNFSF10 TSIX
## [41] TXNL4A XIST
## 43 Levels: AREG BEST1 BPI C7ORF55 CAMP CEACAM8 CHI3L1 CKMT1B CTSG ... XIST
It looks like the genes aren’t even the same genes.
unique(Lyme11$Gene) %in% unique(lymeMx9)
## logical(0)
Apparently, they are not the same genes. Its ok, maybe they still offer some information. The techniques and methods are the same to inverse what was assumed to be the normalization method, but for typical studies. In bioinformatics, with gene expression data, there is usually more to it, like trimming the bottom and top outliers, and taking the quantile normalization, then scaling. We used the standardization method of normalizing values between 0 and 1 as log2 normalized is to f(x)=log2[(x-min(x))/(max(x)-min(x))]=y and the inverse would be: f(y)=21=x So, there is some logic to this, and at some point rounded values could lose information in the numer of scientific placeholders of precision is used in calculating the inverse of the base 2 log, or the exact values for max and min of X need to be used. Reminder, when I demonstrated this earlier, the method worked using this procedure for 10 values that included a 0 where a small value was added to take the log2 of x=0 without an error, but the exact values were still decimals at the final step. To fix this they were turned to fractions, where the denominator was the max(x), and so each value multiplied by the denominator at that point returned the original x values in our list of 10. When addidng that step to the last step we used on this data to denormalize the data, the values were extremely large, approximately 103-104 larger. So we stopped before taking the fractional values. We will continue with these genes in our machine learning to see if either set makes good gene targets for pathogenesis of lyme disease by how accurately the classes of: healthy, acute disease, 1 month convalescing or developing antibodies after being given a regimen of antibiotics, and 6 months convalescing after being given antibiotics. This is temporal or time specific data, and there were some discrepencies in the study when being done, because it spanned 2 years, some patients didn’t know how long they had it but if they had symptoms they were assumed to be suffering from lyme disease, like the facial paralysis or the skin lesion type marks. Also, some patients dropped out and if the study spanned two years, and only the last 6 months recorded the convalescing at 6 months then the first batches of patients in the acute phase weren’t being recorded or they were actually being monitored after six months and up to two years after being given antibiotics. So we can imagine the data might be skewed for these differences or discrepencies.
Lets write these two tables out to csv files to analyze visually in Tableau.
write.csv(Lyme11,'LymeDisease_originalValues_foldchages32.csv',row.names=F)
write.csv(lymeMx9,'LymeDisease_denormalizedValues_foldchanges43.csv',row.names=F)
Lets see what great charts were created in Tableau with this data using our de-normalized or de-standardized data.
Tableau Dashboard of Lime Disease De-Standardized
The link to this dashboard is at this site:https://public.tableau.com/profile/janis5126#!/vizhome/LymeDiseaseDashboardGSE145974/LymeDiseaseDashboardGSE145974?publish=yes
Dashboard De-standardized Lyme Disease Data GSE145974
In the above image and the dashboard if you click on the link above, You can see the genes to the right with the gene summaries if you hover over the text to the right of the dashboard in the ‘Gene Filtering’ box. It will select only the genes you select to show the median gene expression values within each class of healthy, acute lyme disease, one month after antibiotics treatment, and six months after antibiotics treatement, with varying class sizes due to changes in patient participation and methods during the study. The top chart of the warm colors is for the median gene expression values for each gene of 43 genes that were filtered from 19,000 genes as having the most or least fold change in disease or treatment to healthy ratios for all three classes with duplicates removed from the top 10 or bottom 10 genes in each class by fold change. The lower left chart with the greens is the fold change values for each gene within each class of acute lyme disease, one month of treatment, or six months of treatment compared to healthy samples by mean values of all samples in each class. The lower right chart of the purple colors is a tree map that is categorized by class and within each class each box is a gene with the average gene expression value within that class for that gene. The upper right box shows that the gene DEFA1 was selected and it is displayed in all three accompanying charts on the dashboard.
The following images are the charts that are in the dashboard above.
link to image of the bar chart of fold change values.
Bar chart of Fold Change Values
link to image of the bar chart of median gene expression values.
Bar chart of Median Gene Expression Values
link to image of the treechart of average gene expression values within each class.
Treemap chart of the average gene expression values for each gene within each of the four classes
We will perform machine learning on this data in the upcoming additions to this post. But first we will also look at those genes that are from the original data as the most or least expressed genes in our lyme disease data obtained from NCBI with accession ID, GSE145974.
The original data with log2 normalized values including negative values was used to make the same or similar dashboard of charts as was done above with the de-standardized lyme disease data of GSE145974. However, because of the negative values, the treemap chart was replaced with a highlight chart as the treemap removed 45 negative values. This data took a total of 32 genes (technically 33 but one didn’t have a gene summary so it was omitted), and compared the most or least expressed genes as up and down regulated in the disease (acute) or treatment (1 month or 6 months of antibiotics) to the healthy samples all as the fold change of class sample mean values per gene.
dashboard of the original GSE145974 top changing genes
image dashboard of the original GSE145974 top changing genes
Figure 5: The above image is a dashboard of the original log2 normalized data that has negative values as well as positive values. The filter at the top right will display only the gene or genes selected if you select one then use ctrl + click on each additional gene in that ‘Gene Filter’ box. The other three charts will show the respective gene as it relates in the top for median gene expression values within each class and the number of samples in each class, in the lower left the fold change values for that gene in the three classes of diseased or treatment to healthy as a ratio of means, and in the lower right the highlight table with a color gradient bar of oranges and grays for the gene or genes selected. The oranges are the lowest values, or down-regulated negative gene expression values from the median, and the grays are the genes with more positive gene expression values or up regulation in each class of acute, 1 month of treatment, 6 months of treatment, or healthy. The image above will link to this dashboard to try out the different genes associated with lyme disease and treatment. I just read an article about Gigi Hadid having lyme disease since age 14 and is now 21 years old with brain fog, joint pain and stiffness, light sensitivity, headaches, anxiety, and possibly other symptoms like face paralysis related to lyme disease.
Highlight chart of original GSE145974 data
image of highlight chart of original GSE145974 data
Figure 6: The above image is a highlight chart of the average gene expression values within each class of acute, one month of treatment, six months of treatment, and healthy classes. This chart was used instead of the treemap chart used in the de-standardized data because it allows negative values, and treemap charts do not and would have eliminated 45 gene values across these samples. The genes are gradient color coded, so that under expressed genes in samples having negative values are reddish-orange, and those genes with more gene expression values or up regulated are gray. Colors in between the reddish-orange and gray are for genes that didn’t change as much up or down in gene expression. After six months of treatments one gene is highly under expressed, ISG20, and it is in the middle of the chart in a reddish color that indicates it is the most under expressed gene or at the end of the lowest values for gene expression. The gene summary for this gene is in the dashboard that Figure 5 links to. The Entrez gene summary says Hepatitis C and Yellow fever are associated with abnormalities in this gene and its network involvements include the innate immune system. A gene with the highest up regulation is CENPF in the acute phase it is highly up regulated. The gene summary for this gene says it could possibly have some involvement with chromosome segregation during mitosis and also that it encodes a protein associated with the centromere-kinetochore complex. Also, autoantibodies in cancer patients have been found that target this gene, CENPF. And a quick online wikipedia search says autoantibodies are the antibodies your own body produces to attack your own body’s proteins.
Original Fold Change Values GSE145974
image of original GSE145974 fold change values
Figure 7: The image above is to the chart of original lyme disease data fold change values for each gene across all samples. It is bidirectional as is the other charts in direction or color gradient, because this data has negative values accomodating the log2 normalized data. Negative values indicate down regulation and positive values indicate up regulation or you can think suppression versus explosion if magnitude dramatic enough relative to the neighboring genes. All these genes are the most or least expressed of all genes in the data using the original values, so they should have some change visible. The gene CTXN3 is shown to be highly down regulated with -128,817 in fold change comparison of this gene in patients’ average gene expression after six months of treatment compared to the healthy samples. It was down regulated on average 10^-6 approximately less than healthy samples. That is a very large magnitude and could possibly be a target gene for having the disease or antibodies. It is important to get the treatment early to avoid symptoms, but some people still have symptoms and treatment might not work well or at all is what this could be indicating. Because the other classes of acute and one month of treatment as well as healthy don’t have this magnitude of down regulation at all. Keep in mind the patients in this class was nearly a third of the original sample size 10/28 of acute patients. The gene summary for this gene, CTXN3, in the dashboard says it is a protein encoding gene that Autotopagnosia and Clear Cell Adinoma are diseases associated to CTXN3. Autotopagnosia is the inability for one to identify his or her body parts or locate them on his or her body. And Clear Cell Adinoma is a vaginal/cervical cancer that is rare and usually diethylstilbestrol (DES) exposure in utero of a female’s mother. The daughters of moms exposed to DES are more likely to get Clear Cell Adinoma and a gene that is highly underexpressed in our 6 months of treatment group, CTXN3, is associated to that disease. Either by having lower risk by not producing as much, or increased risk due to not producing much of it as the healthy and acute disease phase are.
Median Gene Expression Values original GSE145974 data
image of median gene expression values of original GSE145974 data
Figure 8: The above image is to the bar chart that is bidirectional like Figure 7 of fold change values, but this chart is of the median gene expression values across all four samples for each gene. The number of samples in each class is also labeled on each bar. Scrolling through the genes in the chart you will see other genes like ENO1 which has a gene summary stating it encodes alpha-enolase, one of three enolase isoenzymes found in mammals. This gene is associated with an autoantigen in Hoshimoto encephalopothy, another autoimmune contributor it sounds like. We see it is dramatically under regulated in the healthy samples and also under regulated in the samples who received six months of antibiotic treatment. But in the acute phase it is up regulated almost 50% more than the healthy samples and in the acute samples it is also up regulated but by about 25% of the healthy samples. ISG20 is very highly under regulated in the 6 month class at about 10 fold the amount of the healthy class median values which is also under regulated. We saw this gene earlier in our highlight chart as being associated with yellow fever and hepatitis C as well as innate immunity network signaling. It is the most under regulated gene in all. A gene, RNF168, is also highly under regulated in the 6 month class, but the healthy class and 1 month class are up regulated in this gene by 3-4 fold more than the 6 month class by visual inspection. This gene, RNF168, has an Entrez gene summary that states it is involved in DNA Double-Strand Break (DSB) repair, and that it has mutations associated with Riddle syndrome. Wikipedia says this is a rare genetic disease that causes radiosensitivity, ImmunoDeficiency Dysmorphic features, and learning difficulties as an acronym meaning.
There are a lot of different genes with useful information and they are the top genes in changes in gene expression in either data, but we still need to test these genes to see how they compare using machine learning to see how well the classifications can be predicted by these genes. We will get to that later but soon.
Lets start the machine learning by first making the data frames with the class as the output or target feature and the samples as observations and the genes as predictors from both sets separately.
The 43 de-standardized genes will be created first then the 32 original genes that are completely different. Both are the filtered top or bottom 10 genes out of their respective 19526 unique gene sets of each class by fold change of acute/healthy, 1 month/healthy, or 6 months/healthy by means of their respective class samples.
The destandardized set. Lets just name our data sets something silly to keep track of them. Dance is the de-standardized set and Stand is the original log2 normalized set.
The Dance Machine Learning set, made from the lymeMx7, not-tidied, de-standardized data frame:
colnames(lymeMx7)
## [1] "gene"
## [2] "healthyControl_1"
## [3] "healthyControl_2"
## [4] "healthyControl_3"
## [5] "healthyControl_4"
## [6] "healthyControl_5"
## [7] "healthyControl_6"
## [8] "healthyControl_7"
## [9] "healthyControl_8"
## [10] "healthyControl_9"
## [11] "healthyControl_10"
## [12] "healthyControl_11"
## [13] "healthyControl_12"
## [14] "healthyControl_13"
## [15] "healthyControl_14"
## [16] "healthyControl_15"
## [17] "healthyControl_16"
## [18] "healthyControl_17"
## [19] "healthyControl_18"
## [20] "healthyControl_19"
## [21] "healthyControl_20"
## [22] "healthyControl_21"
## [23] "acuteLymeDisease_1"
## [24] "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3"
## [26] "acuteLymeDisease_4"
## [27] "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6"
## [29] "acuteLymeDisease_7"
## [30] "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9"
## [32] "acuteLymeDisease_10"
## [33] "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12"
## [35] "acuteLymeDisease_13"
## [36] "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15"
## [38] "acuteLymeDisease_16"
## [39] "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18"
## [41] "acuteLymeDisease_19"
## [42] "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21"
## [44] "acuteLymeDisease_22"
## [45] "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24"
## [47] "acuteLymeDisease_25"
## [48] "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27"
## [50] "acuteLymeDisease_28"
## [51] "Antibodies_1month_1"
## [52] "Antibodies_1month_2"
## [53] "Antibodies_1month_3"
## [54] "Antibodies_1month_4"
## [55] "Antibodies_1month_5"
## [56] "Antibodies_1month_6"
## [57] "Antibodies_1month_7"
## [58] "Antibodies_1month_8"
## [59] "Antibodies_1month_9"
## [60] "Antibodies_1month_10"
## [61] "Antibodies_1month_11"
## [62] "Antibodies_1month_12"
## [63] "Antibodies_1month_13"
## [64] "Antibodies_1month_14"
## [65] "Antibodies_1month_15"
## [66] "Antibodies_1month_16"
## [67] "Antibodies_1month_17"
## [68] "Antibodies_1month_18"
## [69] "Antibodies_1month_19"
## [70] "Antibodies_1month_20"
## [71] "Antibodies_1month_21"
## [72] "Antibodies_1month_22"
## [73] "Antibodies_1month_23"
## [74] "Antibodies_1month_24"
## [75] "Antibodies_1month_25"
## [76] "Antibodies_1month_26"
## [77] "Antibodies_1month_27"
## [78] "Antibodies_6months_1"
## [79] "Antibodies_6months_2"
## [80] "Antibodies_6months_3"
## [81] "Antibodies_6months_4"
## [82] "Antibodies_6months_5"
## [83] "Antibodies_6months_6"
## [84] "Antibodies_6months_7"
## [85] "Antibodies_6months_8"
## [86] "Antibodies_6months_9"
## [87] "Antibodies_6months_10"
## [88] "healthy_Mean"
## [89] "acuteLymeDisease_Mean"
## [90] "antibodies_1month_Mean"
## [91] "antibodies_6month_Mean"
## [92] "acuteHealthy_foldChange"
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
Lets remove the fold change and mean value features from our lymeMx7 data frame and save it as ‘Dance’ after we transpose it to get the unique genes as predictors and the samples as observations.
dance <- lymeMx7[,-c(88:94)]
danceSampleNames <- colnames(dance)[2:87]
month1 <- grep('1month',danceSampleNames)
month6 <- grep('6month',danceSampleNames)
healthy <- grep('healthy',danceSampleNames)
acute <- grep('acute',danceSampleNames)
class <- danceSampleNames
class[month1] <- '1 month'
class[month6] <- '6 months'
class[healthy] <- 'healthy'
class[acute] <- 'acute'
danceGeneNames <- dance$gene
Dance <- as.data.frame(t(dance[,-1]))
colnames(Dance) <- danceGeneNames
Dance$class <- class
Dance2 <- Dance[,c(44,1:43)]
head(Dance2)
## class LCN2 LTF CEACAM8 DEFA4 CAMP BPI
## healthyControl_1 healthy 18.345169 36.72210 40.13472 38.79050 25.56147 23.02797
## healthyControl_2 healthy 33.503142 75.86353 59.99411 68.67612 61.39548 44.71151
## healthyControl_3 healthy 10.400323 21.55983 29.53441 28.76057 15.62534 29.02589
## healthyControl_4 healthy 12.799352 13.72309 18.74716 11.98227 16.94088 18.36342
## healthyControl_5 healthy 20.690155 21.22504 26.42882 21.02267 27.18118 33.89164
## healthyControl_6 healthy 6.900668 18.82061 18.28476 19.84215 20.98911 33.41352
## MS4A3 TNFSF10 FCGR3B DEFA1 IL1B CKMT1B
## healthyControl_1 14.804557 3.757020 23.042961 48.895788 156.60191 219.10525
## healthyControl_2 45.414723 28.335535 48.768151 66.449316 46.72271 39.60314
## healthyControl_3 7.174274 8.998383 11.442811 23.156031 14.95010 83.29841
## healthyControl_4 39.871848 13.483297 8.406282 7.708771 25.60839 15.19072
## healthyControl_5 24.112126 24.915041 27.191459 13.763571 16.24523 24.02644
## healthyControl_6 6.043684 12.700514 35.130123 23.632009 26.42206 24.03571
## THBD HTR3C TXNL4A DHX58 MUC12 LSM2
## healthyControl_1 347.32608 224.84244 275.64991 314.02427 257.42280 714.84725
## healthyControl_2 69.89785 41.26142 27.09973 40.93505 30.79249 46.27268
## healthyControl_3 30.60276 28.55769 17.62511 30.02908 97.04293 24.66836
## healthyControl_4 21.77138 18.31557 23.46840 13.98155 17.95108 28.95942
## healthyControl_5 30.47450 30.04598 27.85309 33.09064 24.11742 27.05694
## healthyControl_6 25.34718 16.63337 18.20711 16.78156 16.57356 10.13407
## MYOM2 HBG1 HLA-DRB4 CTSG RGS18 GAPT
## healthyControl_1 238.652070 815.078715 14.21339 34.19820 9.651897 6.980782
## healthyControl_2 323.065372 167.289522 165.56168 53.10345 40.654359 26.524937
## healthyControl_3 65.086732 10.658862 11.97314 23.49343 24.912526 16.167267
## healthyControl_4 9.536578 29.211161 312.03920 11.58055 17.404908 27.702748
## healthyControl_5 46.234651 26.330291 245.62247 12.99521 24.331860 24.667356
## healthyControl_6 14.813989 8.593226 360.87281 74.44276 12.925465 34.413011
## SERPINB2 THBS1 AREG CXCL2 XIST OLR1
## healthyControl_1 468.25347 403.26591 261.23314 214.90236 5.636720 44.43886
## healthyControl_2 140.75193 138.67140 136.37357 90.08176 422.253222 81.50329
## healthyControl_3 24.34336 34.46742 10.75996 26.80778 206.491863 19.01474
## healthyControl_4 33.59483 20.47388 33.58836 12.26752 3.421391 22.83240
## healthyControl_5 23.84726 29.81402 44.00702 36.01149 5.291298 51.09338
## healthyControl_6 53.35440 33.16393 18.14765 29.94835 174.811375 20.10506
## OR2B11 FSIP1 TSIX C7orf55 CHI3L1 KIAA1245
## healthyControl_1 41.91948 70.20709 17.90936 29.88277 29.77403 33.17650
## healthyControl_2 32.42068 26.85640 433.24591 29.82837 58.14152 91.34233
## healthyControl_3 19.85783 27.90551 134.56488 19.38080 16.31402 22.47583
## healthyControl_4 21.78934 16.33424 10.88805 28.87899 12.99754 21.60191
## healthyControl_5 24.37846 31.07355 12.66765 45.15163 26.19157 27.89315
## healthyControl_6 14.81774 16.89240 53.40405 18.44080 20.26103 20.81851
## BEST1 LIPN GZMH KIR2DL3 KIR2DS1 POLR2I
## healthyControl_1 29.67359 20.03313 49.52473 176.990491 100.08657 77.65711
## healthyControl_2 31.69743 84.80267 14.86283 44.288607 61.10841 22.72473
## healthyControl_3 25.92331 16.76715 24.14236 15.927277 21.24897 17.52461
## healthyControl_4 20.75736 29.99576 16.10039 39.862912 65.46716 20.43849
## healthyControl_5 45.23739 40.94288 34.47135 96.261702 56.76246 34.97384
## healthyControl_6 22.38874 27.26274 11.90818 9.957902 14.84648 19.87330
## S100B
## healthyControl_1 280.46369
## healthyControl_2 25.98400
## healthyControl_3 20.74111
## healthyControl_4 48.50091
## healthyControl_5 123.08727
## healthyControl_6 21.27883
We have our machine learning ready data frame of de-standardized genes, and will be using the target, class, for predictions. We could use all 43 genes or just take those genes in the visualizations that we saw had very peculiar fold change values like in the 6 months of treatment or acute stages. Or we could test both. Might as well test bost as we will see how likely these genes are in predicting an acute disease stage, treatment time, or healthy class by blood analysis.
Lets refresh our memories on what those genes were. We put them in our notes on the visualizations above for the de-standardized Tableau charts. We might miss some, as those were scanned visuals, so I am going to make a list of those genes that have noticeable shifts in gene expression or fold change values compared to the other classes and make that are peculiar set of genes. We could even divide those genes up into the ones up or down regulated in the 6 month or the acute stage only or even the healthy samples only. I will revisit that dashboard and select the genes from the filter and compare across all charts available and bring back the findings here. We’ll call that set Dance-odd6, Dance-oddAcute, or Dance-oddHealthy. Possibly Dance-odd1, but I didn’t notice anything the first quick scan through the genes. There are only 43, so it shouldn’t be a problem.
Using the fold change values:
acute up, decreasing order of up is 1 month, healthy, 6 months
up acute
monotonically decreasing from acute -> 1 month -> 6 months:
BPI
CAMP CEACAM8 CTSG DEFA1 DEFA4 DHX58 FCGR3B
GAPT GZMH HBG1 HLA-DRB4 KIR2DL3 KIR2DS1 LCN2 LSM2 LTF MS4A3 POLR21 RGS18 S100B TNFSF10
THBD CHI3L1 SERPINB2
-up in acute and down 1/2 in 1 month with slight increase in 6 months by approx 5%
TXNL4A
HTR3C
%%%%% - up 6 months
AREG BEST1 KIAA1245 LIPN OR2B11
CKMT1B
IL1B MUC12 MYOM2 OLR1 THBS1 TSIX XIST
FSIP1
Few observations, many of the genes are monotonically decreasing from acute to 1 month to 6 months in gene expression levels, where they start high in the acute stage, then decrease gradually in 1 month: BPI, CAMP, CEACAM8, CTSG, DEFA1, DEFA4, DHX58, FCGR3B, GAPT, GZMH, HBG1, HLA-DRB4,KIR2DL3, KIR2DS1, LCN2, LSM2, LTF, MS4A3, POLR21, RGS18, S100B, TNFSF10
, and more decrease in 6 months. The we have some odd genes , THBD, CHI3L1, and SERPINB2 in the acute up regulated genes that don’t behave this way, but they indicate that maybe treatment is working, because they start high in acute, then drop in 1 month of treatment, and then increase almost to half the same levels as in the acute phase after 6 months of treatment. Also, TXNL4A, drops in 1 month and stays about the same after 6 months. And another gene, HTR3C, drops in 1 month, then increases to almost same acute levels in 6 months. Lets make those lists of the acute, then the lists for the 6 months genes.
#monotonically decreasing
Acute_md <- c('BPI', 'CAMP', 'CEACAM8', 'CTSG', 'DEFA1', 'DEFA4', 'DHX58', 'FCGR3B', 'GAPT', 'GZMH', 'HBG1', 'HLA-DRB4','KIR2DL3', 'KIR2DS1', 'LCN2', 'LSM2', 'LTF', 'MS4A3', 'POLR21', 'RGS18', 'S100B', 'TNFSF10')
#high in acute, drop after 1 month, then half as high as acute after 6 months
Acute_mayWork <- c('THBD', 'CHI3L1','SERPINB2')
#odd ones in acute, starts high in acute, then drops, and increases slightly in 6 months
Acute_dropsThenUpslightly <- 'TXNL4A'
Acute_dropsReturnsSame <- 'HTR3C'
For the six months of treatment, genes that were noticeably increased after 6 months compared to the acute stage before treatment, none monotonically increased from acute levels, to 1 month of treatment levels, to six months of treatment levels. But some did drop in 1 month, then increase in 6 months to levels much higher by 5-10 fold than the acute levels, IL1B, MUC12, MYOM2, OLR1, THBS1, TSIX, XIST. A few genes start lower, stays lower, then increase 5-10 fold of the acute level in 6 months, FSIP1 and CKMT1B. And there are those genes that are more up regulated after 6 months of treatment, but only slightly more than the acute phase and after decreasing in the 1 month of treatment phase, AREG, BEST1, KIAA1245, LIPN, and OR2B11. Lets now make those lists to show genes that are more up regulated in the 6 month samples.
month6_5foldup <- c('IL1B', 'MUC12', 'MYOM2', 'OLR1', 'THBS1', 'TSIX', 'XIST')
month6_5foldupStartLow <- c('FSIP1','CKMT1B')
month6_upMoreThanAcute <- c('AREG', 'BEST1', 'KIAA1245', 'LIPN', 'OR2B11')
Now that we have our lists, lets see about those data frames for the seven different groups of gene anomolies or similarities. The following are our ML ready dataframes for our seven groups in our de-standardized Lyme disease data.
Acute_md_DF <- Dance2[,colnames(Dance2) %in% Acute_md]
Acute_md_DF$class <- Dance2$class
Acute_mayWork_DF <- Dance2[,colnames(Dance2) %in% Acute_mayWork]
Acute_mayWork_DF$class <- Dance2$class
Acute_dropsThenUpslightly_DF <- data.frame(TXNL4A=Dance2[,colnames(Dance2) %in% Acute_dropsThenUpslightly], row.names=row.names(Dance2))
Acute_dropsThenUpslightly_DF$class <- Dance2$class
Acute_dropsReturnsSame_DF <- data.frame(HTR3C=Dance2[,colnames(Dance2) %in% Acute_dropsReturnsSame],row.names=row.names(Dance2))
Acute_dropsReturnsSame_DF$class <- Dance2$class
month6_5foldup_DF <- Dance2[,colnames(Dance2) %in% month6_5foldup]
month6_5foldup_DF$class <- Dance2$class
month6_5foldupStartLow_DF <- Dance2[,colnames(Dance2) %in% month6_5foldupStartLow]
month6_5foldupStartLow_DF$class <- Dance2$class
month6_upMoreThanAcute_DF <- Dance2[,colnames(Dance2) %in% month6_upMoreThanAcute]
month6_upMoreThanAcute_DF$class <- Dance2$class
Great, now we need to run through each of these 7 data frames and split into separate training and testing sets, and test a machine learning algorithm on. I tend to always use random forest to start with, or caret’s rpart.
Lets make sure we keep the same samples in our testing set and training set for each group to test machine learning algorithm(s) on. Lets keep the standard 70% training set and 30% testing set using a random sampling of our classes.
set.seed(34567)
train <- sample(1:86,.7*86)
training <- class[train]
testing <- class[-train]
t <- data.frame(train = training)
ts <- data.frame(test= testing)
t %>% group_by(train) %>% count(train)
## # A tibble: 4 x 2
## # Groups: train [4]
## train n
## <fct> <int>
## 1 1 month 21
## 2 6 months 8
## 3 acute 18
## 4 healthy 13
ts %>% group_by(test) %>% count(test)
## # A tibble: 4 x 2
## # Groups: test [4]
## test n
## <fct> <int>
## 1 1 month 6
## 2 6 months 2
## 3 acute 10
## 4 healthy 8
We can see we have a fair share of samples in our training set and at least one of each class in our testing set to make predictions based on the model we train. Lets keep these same samples in each of our 8 groups to classify with. Lets make our 8 training and testing sets with our indices labeled ‘train’ and note the numeric labeling of each correspongs to their data frame:
Training/Testing split 1: Acute_md_DF Training/Testing split 2: Acute_mayWork_DF Training/Testing split 3: Acute_dropsThenUpslightly_DF Training/Testing split 4: Acute_dropsReturnsSame_DF Training/Testing split 5: month6_5foldup_DF Training/Testing split 6: month6_5foldupStartLow_DF Training/Testing split 7: month6_upMoreThanAcute_DF Training/Testing split 8: Dance2
training1 <- Acute_md_DF[train,]
testing1 <- Acute_md_DF[-train,]
training2 <- Acute_mayWork_DF[train,]
testing2 <- Acute_mayWork_DF[-train,]
training3 <- Acute_dropsThenUpslightly_DF[train,]
testing3 <- Acute_dropsThenUpslightly_DF[-train,]
training4 <- Acute_dropsReturnsSame_DF[train,]
testing4 <- Acute_dropsReturnsSame_DF[-train,]
training5 <- month6_5foldup_DF[train,]
testing5 <- month6_5foldup_DF[-train,]
training6 <- month6_5foldupStartLow_DF[train,]
testing6 <- month6_5foldupStartLow_DF[-train,]
training7 <- month6_upMoreThanAcute_DF[train,]
testing7 <- month6_upMoreThanAcute_DF[-train,]
training8 <- Dance2[train,]
testing8 <- Dance2[-train,]
Lets make a function specific to our data frames to return the precision, recall, and accuracy of these four classes. I actually made this in a previous script,monotonicGenes.Rmd, when testing the COVID-19 samples with GSE152418 that also had four classes to classify. But those classes were healthy, moderate, severe, or ICU grades of severity of Covid19. Actually, I found out later, that the convalescent class was its own class even though it was only one sample. So there should have been five classes. But no need to alter that function now. There is also some other packages or in the caret package, that I never use that can return the precision and recall, but i don’t think as a confusion matrix. I thought the convalescent class was mislabeled, so had it relabeled as healthy, since the models pedicted it as such. I didn’t find out until this study, when the summary of this study, GSE145974, used ‘convalesced’ blood after 1 and 6 months of antibiotics, that the sample in GSE152418 was likely its own class. I assumed it was identifying the source of its patient sample,because another previous study on Rheumatoid Arthritis (RA), GSE151161, did use convalescent patients, and it preceded the analysis on GSE152418. Typically in research, you need a client consent and informed consent from people who aren’t incarcerated or in the care of another person or facility,because it violates the human research subjects guidelines for ethical research and not victimizing vulnerable populations or culpabe and incoherant populations. This stems from research that was criminal in the Tuskegee hospital on injecting black populations with syphilis or polio vaccines on inmates in other studies for some small reward or break from their punishment or lowered/free cost clinic for medical treatment. Any researcher knows this, especially if they are funded by government agencies. Also, due to the Nazi research done on Jewish victims during World War 2, the Nuremberg Code, was created, as well as later the Belmont report. “The Nuremberg Code states that”the voluntary consent of the human subject is absolutely essential" and it further explains the details implied by this requirement: capacity to consent, freedom from coercion, no penalty for withdrawal, and comprehension of the risks and benefits involved."-The Nuremberg Code, taken from a resource for getting certified in understanding compliance with human research experiments as part of my graduate research project this had to be completed. The agency who provided this, similar to HIPPA compliance for healthcare providers, is CITI.
precisionRecallAccuracy <- function(df){
colnames(df) <- c('pred','type')
df$pred <- as.character(paste(df$pred))
df$type <- as.character(paste(df$type))
classes <- unique(df$type)
class1a <- as.character(paste(classes[1]))
class2a <- as.character(paste(classes[2]))
class3a <- as.character(paste(classes[3]))
class4a <- as.character(paste(classes[4]))
#correct classes
class1 <- subset(df, df$type==class1a)
class2 <- subset(df, df$type==class2a)
class3 <- subset(df, df$type==class3a)
class4 <- subset(df, df$type==class4a)
#incorrect classes
notClass1 <- subset(df,df$type != class1a)
notClass2 <- subset(df,df$type != class2a)
notClass3 <- subset(df,df$type != class3a)
notClass4 <- subset(df, df$type != class4a)
#true positives (real positives predicted positive)
tp_1 <- sum(class1$pred==class1$type)
tp_2 <- sum(class2$pred==class2$type)
tp_3 <- sum(class3$pred==class3$type)
tp_4 <- sum(class4$pred==class4$type)
#false positives (real negatives predicted positive)
fp_1 <- sum(notClass1$pred==class1a)
fp_2 <- sum(notClass2$pred==class2a)
fp_3 <- sum(notClass3$pred==class3a)
fp_4 <- sum(notClass4$pred==class4a)
#false negatives (real positive predicted negative)
fn_1 <- sum(class1$pred!=class1$type)
fn_2 <- sum(class2$pred!=class2$type)
fn_3 <- sum(class3$pred!=class3$type)
fn_4 <- sum(class4$pred!=class4$type)
#true negatives (real negatives predicted negative)
tn_1 <- sum(notClass1$pred!=class1a)
tn_2 <- sum(notClass2$pred!=class2a)
tn_3 <- sum(notClass3$pred!=class3a)
tn_4 <- sum(notClass4$pred!=class4a)
#precision
p1 <- tp_1/(tp_1+fp_1)
p2 <- tp_2/(tp_2+fp_2)
p3 <- tp_3/(tp_3+fp_3)
p4 <- tp_4/(tp_4+fp_4)
p1 <- ifelse(p1=='NaN',0,p1)
p2 <- ifelse(p2=='NaN',0,p2)
p3 <- ifelse(p3=='NaN',0,p3)
p4 <- ifelse(p4=='NaN',0,p4)
#recall
r1 <- tp_1/(tp_1+fn_1)
r2 <- tp_2/(tp_2+fn_2)
r3 <- tp_3/(tp_3+fn_3)
r4 <- tp_4/(tp_4+fn_4)
r1 <- ifelse(r1=='NaN',0,r1)
r2 <- ifelse(r2=='NaN',0,r2)
r3 <- ifelse(r3=='NaN',0,r3)
r4 <- ifelse(r4=='NaN',0,r4)
#accuracy
ac1 <- (tp_1+tn_1)/(tp_1+tn_1+fp_1+fn_1)
ac2 <- (tp_2+tn_2)/(tp_2+tn_2+fp_2+fn_2)
ac3 <- (tp_3+tn_3)/(tp_3+tn_3+fp_3+fn_3)
ac4 <- (tp_4+tn_4)/(tp_4+tn_4+fp_4+fn_4)
table <- as.data.frame(rbind(c(class1a,p1,r1,ac1),
c(class2a,p2,r2,ac2),
c(class3a,p3,r3,ac3),
c(class4a,p4,r4,ac4)))
colnames(table) <- c('class','precision','recall','accuracy')
acc <- (sum(df$pred==df$type)/length(df$type))*100
cat('accuracy is: ',as.character(paste(acc)),'%')
return(table)
}
Lets start with the first group of genes using Training/Testing 1:
set.seed(589647)
rfMod1 <- train(class~., method='rf',
na.action=na.pass,
data=(training1), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF1 <- predict(rfMod1, testing1)
predDF1 <- data.frame(predRF1, type=testing1$class)
predDF1
## predRF1 type
## 1 healthy healthy
## 2 1 month healthy
## 3 acute healthy
## 4 healthy healthy
## 5 1 month healthy
## 6 1 month healthy
## 7 healthy healthy
## 8 healthy healthy
## 9 acute acute
## 10 1 month acute
## 11 acute acute
## 12 acute acute
## 13 1 month acute
## 14 1 month acute
## 15 acute acute
## 16 acute acute
## 17 acute acute
## 18 acute acute
## 19 1 month 1 month
## 20 acute 1 month
## 21 acute 1 month
## 22 acute 1 month
## 23 1 month 1 month
## 24 1 month 1 month
## 25 1 month 6 months
## 26 healthy 6 months
pra1 <- precisionRecallAccuracy(predDF1)
## accuracy is: 53.8461538461538 %
pra1
## class precision recall accuracy
## 1 healthy 0.8 0.5 0.807692307692308
## 2 acute 0.636363636363636 0.7 0.730769230769231
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 0 0 0.923076923076923
That set wasn’t so great. Lets run through the other 7 sets using the same format and compare the results at the end.
Training/Testing 2:
rfMod2 <- train(class~., method='rf',
na.action=na.pass,
data=(training2), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
predRF2 <- predict(rfMod2, testing2)
predDF2 <- data.frame(predRF2, type=testing2$class)
predDF2
## predRF2 type
## 1 healthy healthy
## 2 acute healthy
## 3 acute healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 acute healthy
## 7 healthy healthy
## 8 1 month healthy
## 9 acute acute
## 10 1 month acute
## 11 6 months acute
## 12 acute acute
## 13 1 month acute
## 14 1 month acute
## 15 6 months acute
## 16 healthy acute
## 17 1 month acute
## 18 healthy acute
## 19 1 month 1 month
## 20 acute 1 month
## 21 acute 1 month
## 22 acute 1 month
## 23 1 month 1 month
## 24 1 month 1 month
## 25 1 month 6 months
## 26 6 months 6 months
pra2 <- precisionRecallAccuracy(predDF2)
## accuracy is: 38.4615384615385 %
pra2
## class precision recall accuracy
## 1 healthy 0.666666666666667 0.5 0.769230769230769
## 2 acute 0.25 0.2 0.461538461538462
## 3 1 month 0.333333333333333 0.5 0.653846153846154
## 4 6 months 0.333333333333333 0.5 0.884615384615385
Training/Testing 3:
rfMod3 <- train(class~., method='rf',
na.action=na.pass,
data=(training3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF3 <- predict(rfMod3, testing3)
predDF3 <- data.frame(predRF3, type=testing3$class)
predDF3
## predRF3 type
## 1 1 month healthy
## 2 1 month healthy
## 3 1 month healthy
## 4 acute healthy
## 5 acute healthy
## 6 healthy healthy
## 7 acute healthy
## 8 healthy healthy
## 9 acute acute
## 10 healthy acute
## 11 healthy acute
## 12 acute acute
## 13 1 month acute
## 14 1 month acute
## 15 acute acute
## 16 acute acute
## 17 healthy acute
## 18 1 month acute
## 19 healthy 1 month
## 20 1 month 1 month
## 21 1 month 1 month
## 22 1 month 1 month
## 23 acute 1 month
## 24 acute 1 month
## 25 6 months 6 months
## 26 acute 6 months
pra3 <- precisionRecallAccuracy(predDF3)
## accuracy is: 38.4615384615385 %
pra3
## class precision recall accuracy
## 1 healthy 0.333333333333333 0.25 0.615384615384615
## 2 acute 0.4 0.4 0.538461538461538
## 3 1 month 0.333333333333333 0.5 0.653846153846154
## 4 6 months 1 0.5 0.961538461538462
Training/Testing 4:
rfMod4 <- train(class~., method='rf',
na.action=na.pass,
data=(training4), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF4 <- predict(rfMod4, testing4)
predDF4 <- data.frame(predRF4, type=testing4$class)
predDF4
## predRF4 type
## 1 6 months healthy
## 2 healthy healthy
## 3 1 month healthy
## 4 acute healthy
## 5 1 month healthy
## 6 healthy healthy
## 7 healthy healthy
## 8 healthy healthy
## 9 healthy acute
## 10 1 month acute
## 11 healthy acute
## 12 healthy acute
## 13 1 month acute
## 14 1 month acute
## 15 6 months acute
## 16 acute acute
## 17 1 month acute
## 18 healthy acute
## 19 1 month 1 month
## 20 1 month 1 month
## 21 acute 1 month
## 22 acute 1 month
## 23 acute 1 month
## 24 1 month 1 month
## 25 6 months 6 months
## 26 1 month 6 months
pra4 <- precisionRecallAccuracy(predDF4)
## accuracy is: 34.6153846153846 %
pra4
## class precision recall accuracy
## 1 healthy 0.5 0.5 0.692307692307692
## 2 acute 0.2 0.1 0.5
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 0.333333333333333 0.5 0.884615384615385
Training/Testing 5:
rfMod5 <- train(class~., method='rf',
na.action=na.pass,
data=(training5), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF5 <- predict(rfMod5, testing5)
predDF5 <- data.frame(predRF5, type=testing5$class)
predDF5
## predRF5 type
## 1 healthy healthy
## 2 6 months healthy
## 3 6 months healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 acute healthy
## 7 healthy healthy
## 8 acute healthy
## 9 1 month acute
## 10 1 month acute
## 11 healthy acute
## 12 6 months acute
## 13 1 month acute
## 14 1 month acute
## 15 6 months acute
## 16 1 month acute
## 17 1 month acute
## 18 healthy acute
## 19 acute 1 month
## 20 healthy 1 month
## 21 1 month 1 month
## 22 acute 1 month
## 23 acute 1 month
## 24 1 month 1 month
## 25 6 months 6 months
## 26 6 months 6 months
pra5 <- precisionRecallAccuracy(predDF5)
## accuracy is: 30.7692307692308 %
pra5
## class precision recall accuracy
## 1 healthy 0.571428571428571 0.5 0.730769230769231
## 2 acute 0 0 0.423076923076923
## 3 1 month 0.25 0.333333333333333 0.615384615384615
## 4 6 months 0.333333333333333 1 0.846153846153846
Training/Testing 6:
rfMod6 <- train(class~., method='rf',
na.action=na.pass,
data=(training6), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
predRF6 <- predict(rfMod6, testing6)
predDF6 <- data.frame(predRF6, type=testing6$class)
predDF6
## predRF6 type
## 1 1 month healthy
## 2 6 months healthy
## 3 6 months healthy
## 4 acute healthy
## 5 acute healthy
## 6 6 months healthy
## 7 6 months healthy
## 8 1 month healthy
## 9 1 month acute
## 10 acute acute
## 11 1 month acute
## 12 6 months acute
## 13 1 month acute
## 14 1 month acute
## 15 6 months acute
## 16 1 month acute
## 17 6 months acute
## 18 1 month acute
## 19 1 month 1 month
## 20 acute 1 month
## 21 1 month 1 month
## 22 6 months 1 month
## 23 healthy 1 month
## 24 acute 1 month
## 25 6 months 6 months
## 26 6 months 6 months
pra6 <- precisionRecallAccuracy(predDF6)
## accuracy is: 19.2307692307692 %
pra6
## class precision recall accuracy
## 1 healthy 0 0 0.653846153846154
## 2 acute 0.2 0.1 0.5
## 3 1 month 0.2 0.333333333333333 0.538461538461538
## 4 6 months 0.2 1 0.692307692307692
Training/Testing 7:
rfMod7 <- train(class~., method='rf',
na.action=na.pass,
data=(training7), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF7 <- predict(rfMod7, testing7)
predDF7 <- data.frame(predRF7, type=testing7$class)
predDF7
## predRF7 type
## 1 1 month healthy
## 2 acute healthy
## 3 acute healthy
## 4 acute healthy
## 5 1 month healthy
## 6 acute healthy
## 7 acute healthy
## 8 6 months healthy
## 9 6 months acute
## 10 acute acute
## 11 1 month acute
## 12 acute acute
## 13 1 month acute
## 14 1 month acute
## 15 acute acute
## 16 1 month acute
## 17 healthy acute
## 18 acute acute
## 19 1 month 1 month
## 20 acute 1 month
## 21 acute 1 month
## 22 6 months 1 month
## 23 acute 1 month
## 24 6 months 1 month
## 25 acute 6 months
## 26 6 months 6 months
pra7 <- precisionRecallAccuracy(predDF1)
## accuracy is: 53.8461538461538 %
pra7
## class precision recall accuracy
## 1 healthy 0.8 0.5 0.807692307692308
## 2 acute 0.636363636363636 0.7 0.730769230769231
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 0 0 0.923076923076923
Training/Testing 8:
rfMod8 <- train(class~., method='rf',
na.action=na.pass,
data=(training8), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF8 <- predict(rfMod8, testing8)
predDF8 <- data.frame(predRF8, type=testing8$class)
predDF8
## predRF8 type
## 1 healthy healthy
## 2 acute healthy
## 3 acute healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 6 months healthy
## 7 healthy healthy
## 8 healthy healthy
## 9 acute acute
## 10 1 month acute
## 11 acute acute
## 12 acute acute
## 13 1 month acute
## 14 1 month acute
## 15 acute acute
## 16 acute acute
## 17 1 month acute
## 18 acute acute
## 19 1 month 1 month
## 20 acute 1 month
## 21 acute 1 month
## 22 acute 1 month
## 23 1 month 1 month
## 24 1 month 1 month
## 25 6 months 6 months
## 26 6 months 6 months
pra8 <- precisionRecallAccuracy(predDF1)
## accuracy is: 53.8461538461538 %
pra8
## class precision recall accuracy
## 1 healthy 0.8 0.5 0.807692307692308
## 2 acute 0.636363636363636 0.7 0.730769230769231
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 0 0 0.923076923076923
The seed for randomness within the computations of this operating system and R has to be set before the models, because running different times after setting the seed when generating the random indices of the train vector didn’t work for the model generation. I reran 3-5 times and got inconsistent results unless using the set.seed before the 8 models were ran. It is supposed to work only once, and generate the same results everytime. But in either case it represents how the random forest works by randomly selecting samples within the sampels to test an ensemble of models and trees. This current seed still kept the 1st, 7th, and 8th groups as the highest scoring in accuracy. Overall accuracy was not good for any of the groups where it ranged from 19-54% accuracy in predicted being the same as the actual type. But there was class accuracy differences that could best be compared by combining the precision and recall accuracy tables then adding in a feature to identify which model the result came from. Note that the worst group for prediction accuracy was group 6, and the three best groups with 54% accuracy were groups 1, 7, and 8. Worst set of genes to keep as target genes for Lyme Disease are… These are the groups by gene behaviors in fold change of diseased or treated mean values compared to healthy mean values: Training/Testing split 1: Acute_md_DF Training/Testing split 2: Acute_mayWork_DF Training/Testing split 3: Acute_dropsThenUpslightly_DF Training/Testing split 4: Acute_dropsReturnsSame_DF Training/Testing split 5: month6_5foldup_DF Training/Testing split 6: month6_5foldupStartLow_DF Training/Testing split 7: month6_upMoreThanAcute_DF Training/Testing split 8: Dance2
So, without tuning our models or testing other algorithms, we can assume from this point on, all the 43 genes are best, as well as the set of genes with more up regulation after 6 months than in the acute phase, and the set of genes with the monotoncially decreasing gene values from acute to one month of treatment to six months of treatment with the acute phase having the highest gene expression values. The other genes are possibly noisy or add noise to our data that prevents the model from classifying greatly. But lets see if any of the sets did have better recall or precision in a class by class prediction accuracy first, before attempting to tune our random forest models.Also note, that I omitted the preprocessing step in the model training to begin with and then added it in and improved the range from a best score of 34% to a best score in overall accuracy of 50%.
pra_all <- rbind(pra1,pra2,pra3,pra4,pra5,pra6,pra7,pra8)
pra_all$GroupMembership <- c(rep(1,4),
rep(2,4),
rep(3,4),
rep(4,4),
rep(5,4),
rep(6,4),
rep(7,4),
rep(8,4))
pra_all2 <- pra_all %>% group_by(class) %>% mutate(max=
ifelse(accuracy==max(as.numeric(paste(accuracy))),'max','not max'))
max <- subset(pra_all2, pra_all2$max=='max')
max
## # A tibble: 9 x 6
## # Groups: class [4]
## class precision recall accuracy GroupMembership max
## <fct> <fct> <fct> <fct> <dbl> <chr>
## 1 healthy 0.8 0.5 0.807692307692308 1 max
## 2 acute 0.636363636363636 0.7 0.730769230769231 1 max
## 3 1 month 0.333333333333333 0.5 0.653846153846154 2 max
## 4 1 month 0.333333333333333 0.5 0.653846153846154 3 max
## 5 6 months 1 0.5 0.961538461538462 3 max
## 6 healthy 0.8 0.5 0.807692307692308 7 max
## 7 acute 0.636363636363636 0.7 0.730769230769231 7 max
## 8 healthy 0.8 0.5 0.807692307692308 8 max
## 9 acute 0.636363636363636 0.7 0.730769230769231 8 max
We can see from the above chart of class membership accuracies that some other groups also did make good gene targets for some classes. Group 2 and group 3 had the best accuracy in predicting 1 or 6 months for group 3 and only the 1 month class for group 2. The 1st, 7th, and 8th groups were better at predicting the healthy and acute class memberships. We had fewer of the 6 month class, but many 1 month samples, yet that class for 1 month didn’t have any noticeable changes in our 43 genes to distinguish with the random forest classification. We could try more trees or tuning the model to see if there is an improvement. These models were fast and that was likely due to the number of trees being small. Lets use the randomForest package and its randomForest() to tune our model and test our same 8 groups.
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training1) <- gsub('-','',colnames(training1))
colnames(testing1) <- gsub('-','',colnames(testing1))
testing1$class <- as.factor(paste(testing1$class))
training1$class <- as.factor(paste(training1$class))
RF1 <- randomForest(class ~ ., data=training1,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
predict1 <- predict(RF1,testing1)
predict1df <- data.frame(predict1, type=testing1$class)
predict1df
## predict1 type
## healthyControl_3 1 month healthy
## healthyControl_11 1 month healthy
## healthyControl_12 acute healthy
## healthyControl_13 healthy healthy
## healthyControl_18 1 month healthy
## healthyControl_19 1 month healthy
## healthyControl_20 healthy healthy
## healthyControl_21 healthy healthy
## acuteLymeDisease_1 acute acute
## acuteLymeDisease_4 1 month acute
## acuteLymeDisease_6 acute acute
## acuteLymeDisease_7 acute acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 1 month acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 healthy acute
## acuteLymeDisease_27 6 months acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 acute 1 month
## Antibodies_1month_11 acute 1 month
## Antibodies_1month_12 acute 1 month
## Antibodies_1month_13 1 month 1 month
## Antibodies_1month_26 1 month 1 month
## Antibodies_6months_1 1 month 6 months
## Antibodies_6months_10 healthy 6 months
PRA1 <- precisionRecallAccuracy(predict1df)
## accuracy is: 34.6153846153846 %
PRA1
## class precision recall accuracy
## 1 healthy 0.6 0.375 0.730769230769231
## 2 acute 0.428571428571429 0.3 0.576923076923077
## 3 1 month 0.230769230769231 0.5 0.5
## 4 6 months 0 0 0.884615384615385
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training2) <- gsub('-','',colnames(training2))
colnames(testing2) <- gsub('-','',colnames(testing2))
testing2$class <- as.factor(paste(testing2$class))
training2$class <- as.factor(paste(training2$class))
RF2 <- randomForest(class ~ ., data=training2,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
predict2 <- predict(RF2,testing2)
predict2df <- data.frame(predict2, type=testing2$class)
predict2df
## predict2 type
## healthyControl_3 healthy healthy
## healthyControl_11 acute healthy
## healthyControl_12 acute healthy
## healthyControl_13 healthy healthy
## healthyControl_18 healthy healthy
## healthyControl_19 acute healthy
## healthyControl_20 healthy healthy
## healthyControl_21 1 month healthy
## acuteLymeDisease_1 acute acute
## acuteLymeDisease_4 1 month acute
## acuteLymeDisease_6 6 months acute
## acuteLymeDisease_7 acute acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 6 months acute
## acuteLymeDisease_23 healthy acute
## acuteLymeDisease_24 1 month acute
## acuteLymeDisease_27 healthy acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 acute 1 month
## Antibodies_1month_11 acute 1 month
## Antibodies_1month_12 acute 1 month
## Antibodies_1month_13 1 month 1 month
## Antibodies_1month_26 1 month 1 month
## Antibodies_6months_1 1 month 6 months
## Antibodies_6months_10 acute 6 months
PRA2 <- precisionRecallAccuracy(predict2df)
## accuracy is: 34.6153846153846 %
PRA2
## class precision recall accuracy
## 1 healthy 0.666666666666667 0.5 0.769230769230769
## 2 acute 0.222222222222222 0.2 0.423076923076923
## 3 1 month 0.333333333333333 0.5 0.653846153846154
## 4 6 months 0 0 0.846153846153846
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training3) <- gsub('-','',colnames(training3))
colnames(testing3) <- gsub('-','',colnames(testing3))
testing3$class <- as.factor(paste(testing3$class))
training3$class <- as.factor(paste(training3$class))
RF3 <- randomForest(class ~ ., data=training3,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
predict3 <- predict(RF3,testing3)
predict3df <- data.frame(predict3, type=testing3$class)
predict3df
## predict3 type
## healthyControl_3 1 month healthy
## healthyControl_11 1 month healthy
## healthyControl_12 1 month healthy
## healthyControl_13 acute healthy
## healthyControl_18 acute healthy
## healthyControl_19 healthy healthy
## healthyControl_20 acute healthy
## healthyControl_21 healthy healthy
## acuteLymeDisease_1 acute acute
## acuteLymeDisease_4 healthy acute
## acuteLymeDisease_6 healthy acute
## acuteLymeDisease_7 acute acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 acute acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 healthy acute
## acuteLymeDisease_27 1 month acute
## Antibodies_1month_4 healthy 1 month
## Antibodies_1month_6 1 month 1 month
## Antibodies_1month_11 1 month 1 month
## Antibodies_1month_12 1 month 1 month
## Antibodies_1month_13 acute 1 month
## Antibodies_1month_26 acute 1 month
## Antibodies_6months_1 6 months 6 months
## Antibodies_6months_10 acute 6 months
PRA3 <- precisionRecallAccuracy(predict3df)
## accuracy is: 34.6153846153846 %
PRA3
## class precision recall accuracy
## 1 healthy 0.333333333333333 0.25 0.615384615384615
## 2 acute 0.333333333333333 0.3 0.5
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 1 0.5 0.961538461538462
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training4) <- gsub('-','',colnames(training4))
colnames(testing4) <- gsub('-','',colnames(testing4))
testing4$class <- as.factor(paste(testing4$class))
training4$class <- as.factor(paste(training4$class))
RF4 <- randomForest(class ~ ., data=training4,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
predict4 <- predict(RF4,testing4)
predict4df <- data.frame(predict4, type=testing4$class)
predict4df
## predict4 type
## healthyControl_3 6 months healthy
## healthyControl_11 healthy healthy
## healthyControl_12 1 month healthy
## healthyControl_13 acute healthy
## healthyControl_18 1 month healthy
## healthyControl_19 healthy healthy
## healthyControl_20 healthy healthy
## healthyControl_21 healthy healthy
## acuteLymeDisease_1 healthy acute
## acuteLymeDisease_4 1 month acute
## acuteLymeDisease_6 healthy acute
## acuteLymeDisease_7 healthy acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 6 months acute
## acuteLymeDisease_23 acute acute
## acuteLymeDisease_24 1 month acute
## acuteLymeDisease_27 healthy acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 1 month 1 month
## Antibodies_1month_11 acute 1 month
## Antibodies_1month_12 acute 1 month
## Antibodies_1month_13 acute 1 month
## Antibodies_1month_26 1 month 1 month
## Antibodies_6months_1 6 months 6 months
## Antibodies_6months_10 1 month 6 months
PRA4 <- precisionRecallAccuracy(predict4df)
## accuracy is: 34.6153846153846 %
PRA4
## class precision recall accuracy
## 1 healthy 0.5 0.5 0.692307692307692
## 2 acute 0.2 0.1 0.5
## 3 1 month 0.3 0.5 0.615384615384615
## 4 6 months 0.333333333333333 0.5 0.884615384615385
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training5) <- gsub('-','',colnames(training5))
colnames(testing5) <- gsub('-','',colnames(testing5))
testing5$class <- as.factor(paste(testing5$class))
training5$class <- as.factor(paste(training5$class))
RF5 <- randomForest(class ~ ., data=training5,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
predict5 <- predict(RF5,testing5)
predict5df <- data.frame(predict5, type=testing5$class)
predict5df
## predict5 type
## healthyControl_3 healthy healthy
## healthyControl_11 6 months healthy
## healthyControl_12 6 months healthy
## healthyControl_13 healthy healthy
## healthyControl_18 healthy healthy
## healthyControl_19 acute healthy
## healthyControl_20 healthy healthy
## healthyControl_21 1 month healthy
## acuteLymeDisease_1 6 months acute
## acuteLymeDisease_4 1 month acute
## acuteLymeDisease_6 healthy acute
## acuteLymeDisease_7 6 months acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 6 months acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 1 month acute
## acuteLymeDisease_27 healthy acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 healthy 1 month
## Antibodies_1month_11 1 month 1 month
## Antibodies_1month_12 acute 1 month
## Antibodies_1month_13 1 month 1 month
## Antibodies_1month_26 1 month 1 month
## Antibodies_6months_1 6 months 6 months
## Antibodies_6months_10 6 months 6 months
PRA5 <- precisionRecallAccuracy(predict5df)
## accuracy is: 38.4615384615385 %
PRA5
## class precision recall accuracy
## 1 healthy 0.571428571428571 0.5 0.730769230769231
## 2 acute 0 0 0.538461538461538
## 3 1 month 0.4 0.666666666666667 0.692307692307692
## 4 6 months 0.285714285714286 1 0.807692307692308
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training6) <- gsub('-','',colnames(training6))
colnames(testing6) <- gsub('-','',colnames(testing6))
testing6$class <- as.factor(paste(testing6$class))
training6$class <- as.factor(paste(training6$class))
RF6 <- randomForest(class ~ ., data=training6,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
predict6 <- predict(RF6,testing6)
predict6df <- data.frame(predict6, type=testing6$class)
predict6df
## predict6 type
## healthyControl_3 1 month healthy
## healthyControl_11 6 months healthy
## healthyControl_12 6 months healthy
## healthyControl_13 acute healthy
## healthyControl_18 acute healthy
## healthyControl_19 6 months healthy
## healthyControl_20 6 months healthy
## healthyControl_21 1 month healthy
## acuteLymeDisease_1 1 month acute
## acuteLymeDisease_4 acute acute
## acuteLymeDisease_6 1 month acute
## acuteLymeDisease_7 6 months acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 6 months acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 6 months acute
## acuteLymeDisease_27 1 month acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 acute 1 month
## Antibodies_1month_11 1 month 1 month
## Antibodies_1month_12 6 months 1 month
## Antibodies_1month_13 healthy 1 month
## Antibodies_1month_26 acute 1 month
## Antibodies_6months_1 6 months 6 months
## Antibodies_6months_10 6 months 6 months
PRA6 <- precisionRecallAccuracy(predict6df)
## accuracy is: 19.2307692307692 %
PRA6
## class precision recall accuracy
## 1 healthy 0 0 0.653846153846154
## 2 acute 0.2 0.1 0.5
## 3 1 month 0.2 0.333333333333333 0.538461538461538
## 4 6 months 0.2 1 0.692307692307692
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training7) <- gsub('-','',colnames(training7))
colnames(testing7) <- gsub('-','',colnames(testing7))
testing7$class <- as.factor(paste(testing7$class))
training7$class <- as.factor(paste(training7$class))
RF7 <- randomForest(class ~ ., data=training7,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
predict7 <- predict(RF7,testing7)
predict7df <- data.frame(predict7, type=testing7$class)
predict7df
## predict7 type
## healthyControl_3 1 month healthy
## healthyControl_11 acute healthy
## healthyControl_12 acute healthy
## healthyControl_13 1 month healthy
## healthyControl_18 1 month healthy
## healthyControl_19 acute healthy
## healthyControl_20 acute healthy
## healthyControl_21 6 months healthy
## acuteLymeDisease_1 6 months acute
## acuteLymeDisease_4 acute acute
## acuteLymeDisease_6 1 month acute
## acuteLymeDisease_7 acute acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 acute acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 healthy acute
## acuteLymeDisease_27 acute acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 acute 1 month
## Antibodies_1month_11 acute 1 month
## Antibodies_1month_12 6 months 1 month
## Antibodies_1month_13 acute 1 month
## Antibodies_1month_26 6 months 1 month
## Antibodies_6months_1 acute 6 months
## Antibodies_6months_10 6 months 6 months
PRA7 <- precisionRecallAccuracy(predict7df)
## accuracy is: 23.0769230769231 %
PRA7
## class precision recall accuracy
## 1 healthy 0 0 0.653846153846154
## 2 acute 0.333333333333333 0.4 0.461538461538462
## 3 1 month 0.125 0.166666666666667 0.538461538461538
## 4 6 months 0.2 0.5 0.807692307692308
#an error with hyphen in HLA-DRB4, so we will omit it in the testing and training set
set.seed(4567)
colnames(training8) <- gsub('-','',colnames(training8))
colnames(testing8) <- gsub('-','',colnames(testing8))
testing8$class <- as.factor(paste(testing8$class))
training8$class <- as.factor(paste(training8$class))
RF8 <- randomForest(class ~ ., data=training8,
importance=TRUE, nodesize=2, ntree=400,mtry=3)
predict8 <- predict(RF8,testing8)
predict8df <- data.frame(predict8, type=testing8$class)
predict8df
## predict8 type
## healthyControl_3 healthy healthy
## healthyControl_11 acute healthy
## healthyControl_12 acute healthy
## healthyControl_13 healthy healthy
## healthyControl_18 healthy healthy
## healthyControl_19 acute healthy
## healthyControl_20 healthy healthy
## healthyControl_21 healthy healthy
## acuteLymeDisease_1 acute acute
## acuteLymeDisease_4 1 month acute
## acuteLymeDisease_6 1 month acute
## acuteLymeDisease_7 acute acute
## acuteLymeDisease_9 1 month acute
## acuteLymeDisease_13 1 month acute
## acuteLymeDisease_22 acute acute
## acuteLymeDisease_23 1 month acute
## acuteLymeDisease_24 healthy acute
## acuteLymeDisease_27 healthy acute
## Antibodies_1month_4 1 month 1 month
## Antibodies_1month_6 acute 1 month
## Antibodies_1month_11 acute 1 month
## Antibodies_1month_12 acute 1 month
## Antibodies_1month_13 1 month 1 month
## Antibodies_1month_26 1 month 1 month
## Antibodies_6months_1 6 months 6 months
## Antibodies_6months_10 6 months 6 months
PRA8 <- precisionRecallAccuracy(predict8df)
## accuracy is: 50 %
PRA8
## class precision recall accuracy
## 1 healthy 0.714285714285714 0.625 0.807692307692308
## 2 acute 0.333333333333333 0.3 0.5
## 3 1 month 0.375 0.5 0.692307692307692
## 4 6 months 1 1 1
PRA_all <- rbind(PRA1,PRA2,PRA3,PRA4,PRA5,PRA6,PRA7,PRA8)
PRA_all$groupMembership <- c(rep(1,4),
rep(2,4),
rep(3,4),
rep(4,4),
rep(5,4),
rep(6,4),
rep(7,4),
rep(8,4))
PRA_all2 <- PRA_all %>% group_by(class) %>% mutate(max=
ifelse(accuracy==max(as.numeric(paste(accuracy))),'max','not max'))
max2 <- subset(PRA_all2, PRA_all2$max=='max')
max2
## # A tibble: 5 x 6
## # Groups: class [4]
## class precision recall accuracy groupMembership max
## <fct> <fct> <fct> <fct> <dbl> <chr>
## 1 acute 0.4285714285714~ 0.3 0.576923076923~ 1 max
## 2 1 month 0.4 0.66666666666~ 0.692307692307~ 5 max
## 3 healthy 0.7142857142857~ 0.625 0.807692307692~ 8 max
## 4 1 month 0.375 0.5 0.692307692307~ 8 max
## 5 6 months 1 1 1 8 max
The accuracy wasn’t as great using the randomForest() instead of caret’s built in random forest function. And we did still see group 8 or all 43 genes score the best or with the best. Group 5 made a class prediction best score that it didn’t in the other model. Group 5 scored the best on class ‘1 month’, and group 8 scored the best on the same class but only in accuracy, becasue group 5 had better recall and precision than Group 8 on that class. Group 8 did score 100% accuracy in the ‘6 months’ class in recall, precision, and total accuracy. Recall that our sets are split with the same share of training and testing samples and that there were 8 samples of the 6 months class to train our model and 2 to predict in the testing set with our model. Group 8 got all relevant 6 month class samples in the testing set (precision) and no other samples were misclassified as the 6 months class (recall). The ‘healthy’ class was also correctly predicted by Group 8 with 81% accuracy where the precision is 71% and recall was 63%. The acute class was predicted best by group 5 with 58% accuracy, 30% recall (misclassified 70%), and 43% precision (didn’t find 57%).
We could test out more algorithms or we could test out the original data of 32 completely different genes. and go through the same process of grouping the genes by those gene fold change ratios that we saw in our 7 groups above. We could also test out a data set of those genes in our top performing groups, 1,8, and 2,3,5,or 7 with groups 4 and 6 not being a better performer at any class prediction or overall accuracy. Those groups again are:
Training/Testing split 1: Acute_md_DF Training/Testing split 2: Acute_mayWork_DF Training/Testing split 3: Acute_dropsThenUpslightly_DF Training/Testing split 4: Acute_dropsReturnsSame_DF Training/Testing split 5: month6_5foldup_DF Training/Testing split 6: month6_5foldupStartLow_DF Training/Testing split 7: month6_upMoreThanAcute_DF Training/Testing split 8: Dance2
where md is monotonically decreasing from acute -> month 1 -> month 6 and Dance2 is all genes. There weren’t any monotonically increasing genes, all the remaining genes started higher than the 1 month class then increased to a level close to the acute levels, just under the acute levels, slightly more than the acute levels, or much higher than the acute levels. All 43 genes (Dance2), the monotonically decreasing genes, and all groups except for group 4 and 5 can be used. But really we are just picking the ones that aren’t useful from group 8. Groups 4 and 6, with group 6 seeming to always score the minimum accuracy will be in our data set to test our models on. This means neither of the genes that return to similar levels from acute to the 6th month levels or the genes from the group that starts low in the acute phase but end up increasing to about five fold the acute levels by month 6 will be used. And I would have thought those genes would be indicative of the class. We should just make two data sets, where one is Groups 4 and 6, and the other is groups 1,2,3,5, and 7. Because group 8 is all the genes in the set. Either set could have some noise. These are fold change values of the mean values across all samples. It is possible to go back to the dashboard and find some outlier samples that skew the gene values from group 4 and 6. Lets see what those genes are again. Acute_dropsReturnsSame, and month6_5foldupStartLow are those gene lists made earlier.
poorPathogenesisTargets <- c(month6_5foldupStartLow, Acute_dropsReturnsSame)
poorPathogenesisTargets
## [1] "FSIP1" "CKMT1B" "HTR3C"
We need to go back to the dashboard and see if FSIP1, CKMT1B, or HTR3C have some samples that are skewing their gene expression values greatly.
I actually didn’t post the individual gene expression values up in a chart on Tableau, so I just loaded one that shows there are some samples skewing the data with those three genes, FSIP1, CKMT1B, and HTR3C. I have decided to backtrack and see if I take the median fold change values instead of the mean across all samples if the results will be better.
individual samples’ gene expression values across all four classes.
FSIP1, CKMT1B, and HTR3C spread across all samples
Figure 9: I added the sample chart to see the groups of individual samples within each class of healthy, acute lyme disease, one month of antibiotic treatment, and six months of antibiotics treatments, after realizing, some of the genes’ fold change values are skewing the data greatly. And we can see in the above image of this chart (linked to through the image) that these samples in this set of genes that skewed our data when running some machine learning algorithms were samples: sample 7 of the acute class samples, sample 12 of the 1 month class samples, sample 10 of the 6 month class samples, and samples 1, 11, and 12 of the healthy class samples. I want to remove these samples and run some machine learning on the set, or just take the median sample values instead when deriving the fold change values.
I want to backtrack at this point and use the median values by switching to a new document to test the median, and referencing back to it in this document, with the machine learning results.
I did that work on the median sample values and dropped those six classes that seemed to skew the data, but the results weren’t better and had a best score of 42% accuracy, where here the best score was 54% accuracy so far. We still need to test the machine learning results on the original data that destandardization wasn’t made to. But to access the median sample values as fold changes and the results with the six outlier samples in the mean sample derived fold change data before getting the median derived fold changes, it is on rpubs as part 2 of the Lyme Disease Ticks document.
I took out six samples that were skewed in this set, but never tested if taking those samples out of this data would improve the classification accuracy in this data. We can do that fast with our testing and training sets. Lets use all the data of training and testing set 8.
row.names(training8)
## [1] "Antibodies_6months_5" "Antibodies_1month_2" "Antibodies_6months_9"
## [4] "Antibodies_1month_10" "Antibodies_1month_1" "healthyControl_4"
## [7] "Antibodies_6months_4" "acuteLymeDisease_19" "Antibodies_1month_25"
## [10] "Antibodies_1month_3" "Antibodies_1month_16" "acuteLymeDisease_21"
## [13] "healthyControl_8" "Antibodies_1month_8" "acuteLymeDisease_15"
## [16] "acuteLymeDisease_11" "healthyControl_5" "Antibodies_1month_5"
## [19] "Antibodies_1month_20" "acuteLymeDisease_12" "Antibodies_1month_9"
## [22] "acuteLymeDisease_16" "acuteLymeDisease_28" "acuteLymeDisease_25"
## [25] "Antibodies_6months_7" "healthyControl_2" "Antibodies_1month_14"
## [28] "healthyControl_1" "acuteLymeDisease_26" "Antibodies_1month_27"
## [31] "healthyControl_6" "acuteLymeDisease_18" "Antibodies_1month_19"
## [34] "healthyControl_16" "Antibodies_1month_21" "healthyControl_15"
## [37] "healthyControl_9" "acuteLymeDisease_17" "healthyControl_14"
## [40] "Antibodies_1month_24" "Antibodies_1month_15" "healthyControl_17"
## [43] "Antibodies_6months_6" "acuteLymeDisease_14" "acuteLymeDisease_20"
## [46] "Antibodies_1month_7" "healthyControl_10" "Antibodies_6months_8"
## [49] "acuteLymeDisease_8" "Antibodies_6months_2" "Antibodies_1month_17"
## [52] "acuteLymeDisease_3" "Antibodies_6months_3" "healthyControl_7"
## [55] "Antibodies_1month_23" "Antibodies_1month_18" "acuteLymeDisease_2"
## [58] "Antibodies_1month_22" "acuteLymeDisease_5" "acuteLymeDisease_10"
sample 7 of the acute class samples, sample 12 of the 1 month class samples, sample 10 of the 6 month class samples, and samples 1, 11, and 12 of the healthy class samples.
Check back for machine learning on the original data.
skewSamples <- c('Antibodies_6months_10','Antibodies_1month_12','acuteLymeDisease_7',
'healthyControl_1','healthyControl_11','healthyControl_12')
sort(row.names(training8))
## [1] "acuteLymeDisease_10" "acuteLymeDisease_11" "acuteLymeDisease_12"
## [4] "acuteLymeDisease_14" "acuteLymeDisease_15" "acuteLymeDisease_16"
## [7] "acuteLymeDisease_17" "acuteLymeDisease_18" "acuteLymeDisease_19"
## [10] "acuteLymeDisease_2" "acuteLymeDisease_20" "acuteLymeDisease_21"
## [13] "acuteLymeDisease_25" "acuteLymeDisease_26" "acuteLymeDisease_28"
## [16] "acuteLymeDisease_3" "acuteLymeDisease_5" "acuteLymeDisease_8"
## [19] "Antibodies_1month_1" "Antibodies_1month_10" "Antibodies_1month_14"
## [22] "Antibodies_1month_15" "Antibodies_1month_16" "Antibodies_1month_17"
## [25] "Antibodies_1month_18" "Antibodies_1month_19" "Antibodies_1month_2"
## [28] "Antibodies_1month_20" "Antibodies_1month_21" "Antibodies_1month_22"
## [31] "Antibodies_1month_23" "Antibodies_1month_24" "Antibodies_1month_25"
## [34] "Antibodies_1month_27" "Antibodies_1month_3" "Antibodies_1month_5"
## [37] "Antibodies_1month_7" "Antibodies_1month_8" "Antibodies_1month_9"
## [40] "Antibodies_6months_2" "Antibodies_6months_3" "Antibodies_6months_4"
## [43] "Antibodies_6months_5" "Antibodies_6months_6" "Antibodies_6months_7"
## [46] "Antibodies_6months_8" "Antibodies_6months_9" "healthyControl_1"
## [49] "healthyControl_10" "healthyControl_14" "healthyControl_15"
## [52] "healthyControl_16" "healthyControl_17" "healthyControl_2"
## [55] "healthyControl_4" "healthyControl_5" "healthyControl_6"
## [58] "healthyControl_7" "healthyControl_8" "healthyControl_9"
sort(skewSamples)
## [1] "acuteLymeDisease_7" "Antibodies_1month_12" "Antibodies_6months_10"
## [4] "healthyControl_1" "healthyControl_11" "healthyControl_12"
skewSamples %in% row.names(training8)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
skewSamples %in% row.names(testing8)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
dim(training8);dim(testing8)
## [1] 60 44
## [1] 26 44
training8b <- subset(training8,!(row.names(training8) %in% skewSamples))
testing8b <- subset(testing8, !(row.names(testing8) %in% skewSamples))
skewSamples %in% row.names(training8b)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
skewSamples %in% row.names(testing8b)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
dim(training8b);dim(testing8b)
## [1] 59 44
## [1] 21 44
Now, we can see if there is an improvement in accuracy in machine learning prediction. Training/Testing 1:
set.seed(589647)
rfMod8b <- train(class~., method='rf',
na.action=na.pass,
data=(training8b), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRF8b <- predict(rfMod8b, testing8b)
predDF8b <- data.frame(predRF8b, type=testing8b$class)
predDF8b
## predRF8b type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 acute healthy
## 5 healthy healthy
## 6 1 month healthy
## 7 acute acute
## 8 1 month acute
## 9 1 month acute
## 10 1 month acute
## 11 1 month acute
## 12 acute acute
## 13 1 month acute
## 14 6 months acute
## 15 healthy acute
## 16 1 month 1 month
## 17 acute 1 month
## 18 acute 1 month
## 19 1 month 1 month
## 20 1 month 1 month
## 21 6 months 6 months
pra8b <- precisionRecallAccuracy(predDF8b)
## accuracy is: 47.6190476190476 %
pra8b
## class precision recall accuracy
## 1 healthy 0.8 0.666666666666667 0.857142857142857
## 2 acute 0.4 0.222222222222222 0.523809523809524
## 3 1 month 0.333333333333333 0.6 0.619047619047619
## 4 6 months 0.5 1 0.952380952380952
The accuracy is 47.6% for all genes used on all samples except the six skewed ones. The highest Group 8 scored was 54% earlier. Maybe we could try training more samples and having a smaller test set to predict the classifications? Or making the classes more balanced. Lets see what are class counts are in each set. The healthy class isn’t oo bad at 80% precision, it only missed 20% of the healthy samples, but the recall is 67% on the healthy class, meaning it misclassified some samples as healthy. And the recall was 100% on the 6 month class, when there is only one 6 month class in the testing set, it correctly predicted the only class as 6 months out of the total number of classes there are (recall) is 100% but also incorrectly predicted one of the acute classes as a 6 month class so that the total predicted correctly/total predicted as correct or incorrect (precision) is 50%. When looking at recall and precision, both are the number of predicted for a class but the precision is a ratio of the number of classes it predicted as a class as the denominator while the recall is the true number of classes there are. People have tried shortening it and it leaves out those facts or putting prime over the P as P’ to condense the interpretation, but it just adds confusion, and really, needs to be fully written out as such, instead of assuming the readers know the shorthand abbreviations that could have been explained many pages or chapters prior to the current page. I will always bring this up. Because the shorthand differences are the reason for the inconsistencies and misinterpretation or confusion by people who don’t use these measures day in and out like their normal cup of coffee. You’ll see this in type I (count of false positives) and type II (count of false negatives) errors too for hypothesis testing. And people wired to paraphrase will also get confused there, because they want to condense that to false positives as negatives and false negatives as positives, but really its true negatives labeled positive, and true positives labeled negative. It’s not a simple rule as the derivative of a constant is always 0, or the derivative of a x^2 with respect to y is 0. Its a mnemonic of sorts that goes through many versions of shorthand text. I believe you can honestly pull the top 90th percentile of calculus III students aside at random and ask them to calculate the precision and recall and not get consistent results. Because its not used as much in calculus but statistics, and also not seen as really relevant until predictive analytics or machine learning. To them it seems trivial, because class imbalance is irrelevant until trying to improve the accuracy and test out ways to make the class balance produce higher prediction accuracy overall, or separate the classess, and make the model predict within a subset of the classess accurately, but not within the total set of classes. So, lets get to class balancing the best we can.
train8b <- training8b %>% group_by(class) %>% count(class)
test8b <- testing8b %>% group_by(class) %>% count(class)
train8b
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month 21
## 2 6 months 8
## 3 acute 18
## 4 healthy 12
test8b
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month 5
## 2 6 months 1
## 3 acute 9
## 4 healthy 6
Lets first try training the data on all 86 of the samples except for 1 in each class, then use that 1 sample from each class in the testing set. To select which one to include from each class, lets only keep the one closest to the mean of their class. But we’ll use the mean of all the samples in each class.
all <- rbind(testing8,training8)
dim(all)
## [1] 86 44
healthy <- subset(all, all$class=='healthy')
m1 <- subset(all, all$class=='1 month')
m6 <- subset(all, all$class=='6 months')
acute <- subset(all, all$class=='acute')
healthyMean <- order(apply(healthy[,-1],1,mean))
m1Mean <- order(apply(m1[,-1],1,mean))
m6Mean <- order(apply(m6[,-1],1,mean))
acuteMean <- order(apply(acute[,-1],1,mean))
H1 <- healthyMean[floor(length(healthyMean)/2)]
M1 <- m1Mean[floor(length(m1Mean)/2)]
M6 <- m6Mean[floor(length(m6Mean)/2)]
A1 <- acuteMean[floor(length(acuteMean)/2)]
testUno <- rbind(healthy[H1,],m1[M1,],m6[M6,],acute[A1,])
testUno
## class LCN2 LTF CEACAM8 DEFA4 CAMP
## healthyControl_15 healthy 17.13554 38.03209 28.97459 54.96061 18.54468
## Antibodies_1month_22 1 month 74.79538 30.14569 31.15147 23.96823 34.69399
## Antibodies_6months_6 6 months 14.03154 10.51058 24.86117 15.04481 15.14718
## acuteLymeDisease_28 acute 18.31185 24.21692 23.66863 24.37679 20.76201
## BPI MS4A3 TNFSF10 FCGR3B DEFA1 IL1B
## healthyControl_15 26.27302 30.679988 4.737728 10.43546 38.64095 15.05342
## Antibodies_1month_22 26.42632 69.686460 45.287117 137.63276 25.53752 20.64861
## Antibodies_6months_6 18.03675 3.824587 10.129959 15.14794 10.99052 80.27107
## acuteLymeDisease_28 23.55601 9.509528 47.924129 59.37641 18.46380 80.03062
## CKMT1B THBD HTR3C TXNL4A DHX58 MUC12
## healthyControl_15 21.77310 16.95726 22.03265 17.98004 17.52017 18.00291
## Antibodies_1month_22 37.74576 34.85232 36.38374 43.34992 34.07528 26.58173
## Antibodies_6months_6 47.40098 41.04443 26.94393 15.92046 13.26633 52.08020
## acuteLymeDisease_28 26.54026 66.61738 24.00392 23.84010 25.46579 23.86389
## LSM2 MYOM2 HBG1 HLADRB4 CTSG RGS18
## healthyControl_15 14.55579 18.88724 74.250569 134.11953 30.868166 14.23276
## Antibodies_1month_22 44.82963 130.37108 33.606807 29.88086 23.700120 64.04231
## Antibodies_6months_6 14.52782 10.68510 9.518176 158.95025 8.712228 8.22711
## acuteLymeDisease_28 26.99585 10.82321 27.438872 19.20605 28.341905 22.74309
## GAPT SERPINB2 THBS1 AREG CXCL2 XIST
## healthyControl_15 12.78011 21.86668 115.1202 44.29163 40.54945 264.995632
## Antibodies_1month_22 51.84625 15.36698 14.5359 18.25920 40.07748 5.577409
## Antibodies_6months_6 14.47655 56.79544 123.0045 49.87920 48.29439 245.154850
## acuteLymeDisease_28 27.25833 251.52046 115.4409 107.67192 238.19923 4.213683
## OLR1 OR2B11 FSIP1 TSIX C7orf55 CHI3L1
## healthyControl_15 29.20186 15.40234 17.00508 144.156847 19.09080 26.03393
## Antibodies_1month_22 29.22141 22.13845 43.96003 7.827089 37.14990 35.64124
## Antibodies_6months_6 118.89554 49.42415 207.96209 122.374346 22.68011 17.43028
## acuteLymeDisease_28 67.36953 87.49594 46.01348 15.058749 30.79871 29.23335
## KIAA1245 BEST1 LIPN GZMH KIR2DL3 KIR2DS1
## healthyControl_15 31.57280 18.19139 12.71581 5.92233 10.14529 9.736479
## Antibodies_1month_22 27.71482 29.81649 67.69835 49.63028 42.19264 48.858579
## Antibodies_6months_6 75.20643 62.78859 38.05924 12.85319 25.17363 23.375599
## acuteLymeDisease_28 54.28491 36.54691 51.24253 48.10395 32.97931 10.734790
## POLR2I S100B
## healthyControl_15 17.48189 13.097410
## Antibodies_1month_22 44.75784 223.073644
## Antibodies_6months_6 14.71374 7.099816
## acuteLymeDisease_28 25.42501 35.586478
t1Names <- row.names(testUno)
trainUno <- subset(all, !(row.names(all) %in% t1Names))
dim(trainUno);dim(testUno)
## [1] 82 44
## [1] 4 44
Now, lets see how well our model does.
set.seed(589647)
rfModUno <- train(class~., method='rf',
na.action=na.pass,
data=(trainUno), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFUno <- predict(rfModUno, testUno)
predDFUno <- data.frame(predRFUno, type=testUno$class)
predDFUno
## predRFUno type
## 1 healthy healthy
## 2 1 month 1 month
## 3 6 months 6 months
## 4 healthy acute
pra_uno <- precisionRecallAccuracy(predDFUno)
## accuracy is: 75 %
pra_uno
## class precision recall accuracy
## 1 healthy 0.5 1 0.75
## 2 1 month 1 1 1
## 3 6 months 1 1 1
## 4 acute 0 0 0.75
The overall accuracy was improved by taking the samples closer to the mean of each class to use in the test set and train all samples on. But there was one class that was not identified correctly, and another class that was identified correctly but misclassified another class as its own. Maybe we can improve the accuracy even more by removing those class samples that are outside of the standard deviation by the most and then reselecting our one sample per class testing set. Lets see if we can.
healthyStd <- order(apply(healthy[,-1],1,sd))
m1Std <- order(apply(m1[,-1],1,sd))
m6Std <- order(apply(m6[,-1],1,sd))
acuteStd <- order(apply(acute[,-1],1,sd))
H1b <- healthyStd[length(healthyStd)]
m1b <- m1Std[length(m1Std)]
m6b <- m6Std[length(m6Std)]
acuteb <- acuteStd[length(acuteStd)]
allb <- all[-c(H1b,m1b,m6b,acuteb),]
row.names(testUno)==row.names(allb)
## Warning in row.names(testUno) == row.names(allb): longer object length is not a
## multiple of shorter object length
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
testDos <- testUno
trainDos <- allb
Lets see if removing the outliers as far as the sample in each class with the most deviation from the mean of all gene values, will improve accuracy in prediction.
set.seed(589647)
rfModDos <- train(class~., method='rf',
na.action=na.pass,
data=(trainDos), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFDos <- predict(rfModDos, testDos)
predDFDos <- data.frame(predRFDos, type=testDos$class)
predDFDos
## predRFDos type
## 1 healthy healthy
## 2 1 month 1 month
## 3 6 months 6 months
## 4 acute acute
It worked, we see all classes were predicted correctly by removing the largest standard error sample across all genes, then using the sample closest to the mean of the genes as the samples to predict. We scored 100% accuracy. Lets see this for precision and recall too.
pra_dos <- precisionRecallAccuracy(predDFDos)
## accuracy is: 100 %
pra_dos
## class precision recall accuracy
## 1 healthy 1 1 1
## 2 1 month 1 1 1
## 3 6 months 1 1 1
## 4 acute 1 1 1
Great! all 100% as it should be since the accuracy was 100%
We could also see what the accuracy is with more samples, like the next closest to the mean, and iteratively after running our model on the growing test sets of samples closest to their class mean, until we can get a measure thats indicative of the population. There must be some discrepencies in the samples as far as what is in their systems, how long they had lyme disease before they began treatment, what vitamins and other medications their on, how old are they, condition of healthy, gender, recent injuries, etc. But otherwise, with all other variables being constant, these genes can identify separate classes of lyme disease or healthy in PBMC or blood.
Lets look at the original data that wasn’t de-standardized and see if we can get the same results, better, or filter out the most deviated samples from the average gene expressions. I have been saying this for a while, but more approaches occur in thought before leaving this de-standardized set of gene expression data.
stand <- Lyme9[,-c(88:94)]
standSampleNames <- colnames(stand)[2:87]
month1 <- grep('1month',standSampleNames)
month6 <- grep('6month',standSampleNames)
healthy <- grep('healthy',standSampleNames)
acute <- grep('acute',standSampleNames)
class <- standSampleNames
class[month1] <- '1 month'
class[month6] <- '6 months'
class[healthy] <- 'healthy'
class[acute] <- 'acute'
standGeneNames <- stand$Gene
stand <- as.data.frame(t(stand[,-1]))
colnames(stand) <- standGeneNames
stand$class <- class
stand2 <- stand[,c(34,1:33)]
head(stand2)
## class CABP1 POU3F2 CTXN3 CYP7B1
## healthyControl_1 healthy 0.00003390 -0.00003100 -2.068333e-06 0.00007770
## healthyControl_2 healthy 0.20398974 0.23925924 -4.171753e-02 -0.13945055
## healthyControl_3 healthy 0.53017545 0.12777781 2.462832e-01 0.48965120
## healthyControl_4 healthy 0.03769469 -0.07923627 -1.347235e-01 0.11624050
## healthyControl_5 healthy -0.22982526 -0.10329986 -2.256233e-01 -0.07544518
## healthyControl_6 healthy 0.14261413 -0.05624676 2.141688e-01 0.17962694
## CENPF PEX26 ISG20 CLEC2L
## healthyControl_1 0.000020504 8.380425e-05 -0.00000525 0.00003390
## healthyControl_2 -0.061241150 -2.901954e-02 -0.81921244 -0.05779099
## healthyControl_3 -0.331103565 1.364853e-01 -1.26617620 0.23713994
## healthyControl_4 0.225760225 -1.274686e-02 0.22721529 -0.29611158
## healthyControl_5 -0.158033967 1.454494e-01 0.11260080 -0.34715940
## healthyControl_6 -0.449312565 2.032977e-02 -0.07338428 0.17190409
## TMEM194A PDZRN3 NUDT18 DLG3
## healthyControl_1 -0.0002138617 5.150233e-05 0.001091003 -0.000759415
## healthyControl_2 0.0008786520 9.941888e-02 -0.235054730 -0.133482646
## healthyControl_3 -0.0146088577 3.188891e-01 0.881111150 -0.208074767
## healthyControl_4 0.0448924700 -1.342481e-01 -0.086553570 -0.164650345
## healthyControl_5 -0.2484165067 -1.581577e-01 -0.254269360 0.047922324
## healthyControl_6 -0.0732025320 8.387852e-02 1.138130700 0.024404717
## IGFALS SLC1A1 F2 OTOS
## healthyControl_1 -0.000343000 0.000299095 0.0001342283 0.0002070
## healthyControl_2 0.113092660 -0.032329679 -0.0762577053 -0.2211900
## healthyControl_3 0.423058270 0.131755352 0.4079195013 0.8309946
## healthyControl_4 0.101755140 -0.076371432 -0.2981442627 -0.1290300
## healthyControl_5 0.496858360 -0.239856365 -0.2190894300 -0.1959085
## healthyControl_6 0.007001638 0.177237512 0.3654349633 0.2405429
## ENO1 GATC FAM162A PSMF1
## healthyControl_1 -0.0001921633 -0.000256000 0.0000765325 -8.066483e-05
## healthyControl_2 -1.3440033333 -0.127287860 -0.6245107500 7.158557e-02
## healthyControl_3 -0.6280503333 -0.152866360 -0.7001379900 -4.501788e-01
## healthyControl_4 0.1182982100 -0.359013560 0.0864962350 5.658539e-02
## healthyControl_5 -0.3270891500 -0.038307190 0.8462015400 -5.105789e-02
## healthyControl_6 -0.4178791000 -0.002932549 -0.2148986000 4.468243e-02
## HECW1 MAP2K7 LOC400657 PRR24 OR52A4
## healthyControl_1 -0.000307797 -0.0002498642 -0.00065100 -0.0005050 -0.00027200
## healthyControl_2 -0.085825205 0.1400634320 -0.42919350 -0.5969741 0.06674075
## healthyControl_3 0.706030240 0.6052526480 -0.12039113 -0.5586877 0.27201056
## healthyControl_4 -0.165934685 0.0417925380 -0.45719910 -0.1085486 -0.26390958
## healthyControl_5 -0.276568890 -0.0204313716 -0.13633752 -0.3364162 0.01595378
## healthyControl_6 0.296823980 0.4235142192 -0.06637859 0.3638177 0.08342671
## RGPD3 FRS3 HPGD RNF168 KCNJ16
## healthyControl_1 -0.00024600 -0.0001400 0.0000453905 -0.00027800 0.00022185
## healthyControl_2 0.26228500 0.5222826 -0.0960063920 -0.08893681 -0.14524805
## healthyControl_3 -0.04079795 0.0001400 -0.1628251363 -0.10046721 0.11858976
## healthyControl_4 -0.37593746 0.1984134 0.3786586206 -0.23633862 -0.29451060
## healthyControl_5 0.19060660 -0.3893075 0.2273819169 -0.01087332 -0.22437679
## healthyControl_6 0.26550126 -0.6040711 -0.0666730125 0.18056202 0.63788391
## ESYT1 POU4F2 KHDRBS3
## healthyControl_1 -0.0005580 -0.000424625 -0.0002194645
## healthyControl_2 -0.2626066 0.046180010 0.3207392700
## healthyControl_3 0.1145935 0.909273735 0.1385450350
## healthyControl_4 0.2984104 -0.464481365 -0.0196878910
## healthyControl_5 0.2822208 -0.246123550 -0.0518192050
## healthyControl_6 0.5622082 0.152537220 0.0963140750
We could look through all the genes and note the subcategories of behaviors, but there wasn’t really an improvement in accuracy when doing this earlier for the de-standardized data. So we will just use all the genes. One gene doesn’t have a genes summary, LOC400657 gene, and it won’t be in the Tableau charts on this data.But we can still compare how it is in predicting accuracy of classification with our other genes.
Lets split the data into testing and training sets.
set.seed(1234)
train2 <- sample(1:86,.7*86)
trainingNorm <- stand2[train2,]
testingNorm <- stand2[-train2,]
dim(trainingNorm);dim(testingNorm)
## [1] 60 34
## [1] 26 34
Training/Testing 1:
set.seed(589647)
rfModNorm <- train(class~., method='rf',
na.action=na.pass,
data=(trainingNorm), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFNorm <- predict(rfModNorm, testingNorm)
predDFNorm <- data.frame(predRFNorm, type=testingNorm$class)
predDFNorm
## predRFNorm type
## 1 1 month healthy
## 2 1 month healthy
## 3 acute healthy
## 4 healthy healthy
## 5 1 month acute
## 6 acute acute
## 7 acute acute
## 8 healthy acute
## 9 acute acute
## 10 acute acute
## 11 acute acute
## 12 acute acute
## 13 1 month 1 month
## 14 acute 1 month
## 15 acute 1 month
## 16 1 month 1 month
## 17 acute 1 month
## 18 1 month 1 month
## 19 acute 1 month
## 20 acute 1 month
## 21 1 month 1 month
## 22 acute 1 month
## 23 1 month 1 month
## 24 6 months 6 months
## 25 1 month 6 months
## 26 healthy 6 months
praNorm <- precisionRecallAccuracy(predDFNorm)
## accuracy is: 50 %
praNorm
## class precision recall accuracy
## 1 healthy 0.333333333333333 0.25 0.807692307692308
## 2 acute 0.461538461538462 0.75 0.653846153846154
## 3 1 month 0.555555555555556 0.454545454545455 0.615384615384615
## 4 6 months 1 0.333333333333333 0.923076923076923
The accuracy was 50% on this log2 normalized data, which is in the same range of accuracy the de-standardized data scored. Lets try removing the samples with the highest standard deviation from the mean of the samples. But first lets see what the class balance or number of samples in each class for each of the training or testing set.The best precision was on the 6 month class, then the 1 month, acute, and healthy classes. The recall was best on the acute class, then the 1 month, 6 month, and healthy class.
train2Bal <- trainingNorm %>% group_by(class) %>% count(class)
test2Bal <- testingNorm %>% group_by(class) %>% count(class)
train2Bal
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 16
## 2 6 months 7
## 3 acute 20
## 4 healthy 17
test2Bal
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 11
## 2 6 months 3
## 3 acute 8
## 4 healthy 4
The balance seems to be good, and the 6 to 3 split of the 6 month class seemed to help it score 100% precision accuracy in classification. There must be a lot of variance in those other classes, especially the acute and 1 month, since they had a lot of samples to train and didn’t score well when predicting many of those samples.
Lets remove the sample from each class that has the highest deviation and see if it helps in predictions accuracy.
H1nb <- subset(stand2, stand2$class=='healthy')
m1nb <- subset(stand2,stand2$class=='1 month')
m6nb <- subset(stand2, stand2$class=='6 months')
acutenb <- subset(stand2, stand2$class=='acute')
dim(H1nb);dim(m1nb);dim(m6nb);dim(acutenb)
## [1] 21 34
## [1] 27 34
## [1] 10 34
## [1] 28 34
The dimensions are as they should be, 21 samples as healthy, 27 samples as acute, 10 samples as 6 months, and 28 samples as 1 month.
Hnc <- order(apply(H1nb[,-1],1,sd))
m1nc <- order(apply(m1nb[,-1],1,sd))
m6nc <- order(apply(m6nb[,-1],1,sd))
acnc <- order(apply(acutenb[,-1],1,sd))
ac_sd1 <- acnc[length(acnc)]
H_sd1 <- Hnc[length(Hnc)]
m1_sd1 <- m1nc[length(m1nc)]
m6_sd1 <- m6nc[length(m6nc)]
stdNormUno <- c(H_sd1,ac_sd1,m1_sd1,m6_sd1)
stand2_std <- stand2[-stdNormUno,]
dim(stand2);dim(stand2_std)
## [1] 86 34
## [1] 82 34
We removed the most deviated samples from each class, now lets split the data and test our classification model we train on it.
set.seed(1234)
s <-sample(1:82,.7*82)
trainingNormUno <- stand2_std[s,]
testingNormUno <- stand2_std[-s,]
dim(trainingNormUno);dim(testingNormUno)
## [1] 57 34
## [1] 25 34
Lets see the class samples of each set.
trainCounts <- trainingNormUno %>% group_by(class) %>% count(class)
trainCounts
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 18
## 2 6 months 7
## 3 acute 19
## 4 healthy 13
testCounts <- testingNormUno %>% group_by(class) %>% count(class)
testCounts
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 9
## 2 6 months 3
## 3 acute 8
## 4 healthy 5
There seems to be a fair distribution of samples in each set. Lets see how our model will classify after removing the most deviated samples of each class.
set.seed(589647)
rfModNormUno <- train(class~., method='rf',
na.action=na.pass,
data=(trainingNormUno), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFNormUno <- predict(rfModNormUno, testingNormUno)
predDFNormUno <- data.frame(predRFNormUno, type=testingNormUno$class)
predDFNormUno
## predRFNormUno type
## 1 1 month healthy
## 2 1 month healthy
## 3 1 month healthy
## 4 acute healthy
## 5 healthy healthy
## 6 acute acute
## 7 1 month acute
## 8 acute acute
## 9 acute acute
## 10 1 month acute
## 11 acute acute
## 12 acute acute
## 13 acute acute
## 14 1 month 1 month
## 15 1 month 1 month
## 16 1 month 1 month
## 17 1 month 1 month
## 18 acute 1 month
## 19 healthy 1 month
## 20 1 month 1 month
## 21 acute 1 month
## 22 healthy 1 month
## 23 6 months 6 months
## 24 6 months 6 months
## 25 6 months 6 months
pra_NormUno <- precisionRecallAccuracy(predDFNormUno)
## accuracy is: 60 %
pra_NormUno
## class precision recall accuracy
## 1 healthy 0.333333333333333 0.2 0.76
## 2 acute 0.666666666666667 0.75 0.8
## 3 1 month 0.5 0.555555555555556 0.64
## 4 6 months 1 1 1
The accuracy jumped up to 60% from the previous overall accuracy of 50%, so we increased the accuracy 10% better overall by removing a single sample from each class that was the most deviated.The 6 months class scored 100% for accuracy, precision, and recall. So based on these genes we could detect up to 100% accuracy in whether or not a blood sample has had 6 months of antibiotic treatment for lyme disease or not.
Lets not remove any more of the samples from the 6 month class, since we already scored 100% accurate readings. But we should definitely remove some from the other three classes. Lets remove the to 3 classes from those other classes and test the prediction accuracy.
ac_sd1b <- acnc[(length(acnc)-2):length(acnc)]
H_sd1b <- Hnc[(length(Hnc)-2):length(Hnc)]
m1_sd1b <- m1nc[(length(m1nc)-2):length(m1nc)]
m6_sd1b <- m6nc[length(m6nc)]
stdNormUnob <- c(H_sd1b,ac_sd1b,m1_sd1b,m6_sd1b)
stand2_stdb <- stand2[-stdNormUnob,]
dim(stand2);dim(stand2_stdb)
## [1] 86 34
## [1] 76 34
We removed the three most deviated samples from each class except the 6 months class, and now lets split the data and test our classification model we train on it.
set.seed(1234)
sb <-sample(1:76,.7*76)
trainingNormUnob <- stand2_stdb[sb,]
testingNormUnob <- stand2_stdb[-sb,]
dim(trainingNormUnob);dim(testingNormUnob)
## [1] 53 34
## [1] 23 34
Lets see the class samples of each set.
trainCountsb <- trainingNormUnob %>% group_by(class) %>% count(class)
trainCountsb
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 20
## 2 6 months 8
## 3 acute 17
## 4 healthy 8
testCountsb <- testingNormUnob %>% group_by(class) %>% count(class)
testCountsb
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 7
## 2 6 months 2
## 3 acute 9
## 4 healthy 5
There seems to be a fair distribution of samples in each set. Lets see how our model will classify after removing the most deviated samples of each class.
set.seed(589647)
rfModNormUnob <- train(class~., method='rf',
na.action=na.pass,
data=(trainingNormUnob), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFNormUnob <- predict(rfModNormUnob, testingNormUnob)
predDFNormUnob <- data.frame(predRFNormUnob, type=testingNormUnob$class)
predDFNormUnob
## predRFNormUnob type
## 1 1 month healthy
## 2 1 month healthy
## 3 acute healthy
## 4 acute healthy
## 5 1 month healthy
## 6 6 months acute
## 7 1 month acute
## 8 1 month acute
## 9 acute acute
## 10 1 month acute
## 11 1 month acute
## 12 1 month acute
## 13 acute acute
## 14 1 month acute
## 15 acute 1 month
## 16 1 month 1 month
## 17 1 month 1 month
## 18 acute 1 month
## 19 1 month 1 month
## 20 1 month 1 month
## 21 1 month 1 month
## 22 1 month 6 months
## 23 1 month 6 months
pra_NormUnob <- precisionRecallAccuracy(predDFNormUnob)
## accuracy is: 30.4347826086957 %
pra_NormUnob
## class precision recall accuracy
## 1 healthy 0 0 0.782608695652174
## 2 acute 0.333333333333333 0.222222222222222 0.521739130434783
## 3 1 month 0.3125 0.714285714285714 0.434782608695652
## 4 6 months 0 0 0.869565217391304
The accuracy of 30% was worse than keeping all samples (50%) and worse than removing the most deviated sample (60%). With one less sample in the 6 months class for the testing set it is not longer 100% but 0% in precision and recall, but still scored 87% accuracy overall for not misclassifying any samples as such. The healthy class also received 0% precision and recall. There were now much less data for the model to train on, so lets change the 70-30 split for training and testing to 95% approximately and 5% testing. Lets see how it does.
set.seed(1234)
sc <-sample(1:76,.95*76)
trainingNormUnoc <- stand2_stdb[sc,]
testingNormUnoc <- stand2_stdb[-sc,]
dim(trainingNormUnoc);dim(testingNormUnoc)
## [1] 72 34
## [1] 4 34
Lets see the class samples of each set.
trainCountsc <- trainingNormUnoc %>% group_by(class) %>% count(class)
trainCountsc
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 1 month 26
## 2 6 months 10
## 3 acute 24
## 4 healthy 12
testCountsc <- testingNormUnoc %>% group_by(class) %>% count(class)
testCountsc
## # A tibble: 3 x 2
## # Groups: class [3]
## class n
## <chr> <int>
## 1 1 month 1
## 2 acute 2
## 3 healthy 1
One of the 6 months class is missing, so lets take one from the training set and give the training set our extra acute class.
dim(trainingNormUnoc)
## [1] 72 34
dim(testingNormUnoc)
## [1] 4 34
s6 <- grep('6month',row.names(trainingNormUnoc))[1]
a6 <- grep('acute',row.names(testingNormUnoc))[2]
S6 <- trainingNormUnoc[s6,]
A6 <- testingNormUnoc[a6,]
testingNormUnoc2 <- testingNormUnoc[-a6,]
testingNormUnoc3 <- rbind(testingNormUnoc2,S6)
trainingNormUnoc2 <- trainingNormUnoc[-s6,]
trainingNormUnoc3 <- rbind(trainingNormUnoc2,A6)
dim(trainingNormUnoc3)
## [1] 72 34
dim(testingNormUnoc3)
## [1] 4 34
testingNormUnoc3
## class CABP1 POU3F2 CTXN3 CYP7B1
## healthyControl_15 healthy -0.07589889 0.23626685 -0.055404983 0.14204478
## acuteLymeDisease_13 acute -0.34953380 -0.14276838 -0.010607242 -0.19537401
## Antibodies_1month_4 1 month -0.00003390 -0.03967643 0.004391431 0.01038003
## Antibodies_6months_4 6 months -0.26161885 0.14507556 0.370101364 0.02928352
## CENPF PEX26 ISG20 CLEC2L
## healthyControl_15 -0.19094265 0.20897907 -0.51682330 0.01077294
## acuteLymeDisease_13 0.23698175 -0.18220122 -0.08197308 -0.47302198
## Antibodies_1month_4 -0.07768702 0.39442974 -0.14839697 -0.00003390
## Antibodies_6months_4 -0.05396521 0.06561916 -0.90647270 1.08304500
## TMEM194A PDZRN3 NUDT18 DLG3 IGFALS
## healthyControl_15 -0.10291028 -0.01527524 0.3245788 -0.195066205 -0.1421464
## acuteLymeDisease_13 0.14031196 -0.07497390 0.2230201 0.240245728 -0.3925624
## Antibodies_1month_4 -0.03302749 0.08099906 -0.6770477 -0.002440358 -0.2245874
## Antibodies_6months_4 0.60594010 0.09496323 -0.2761650 -0.269157226 0.2567589
## SLC1A1 F2 OTOS ENO1 GATC
## healthyControl_15 0.10206497 -0.0289009 0.30036592 -0.51248899 0.7165318
## acuteLymeDisease_13 0.08817935 0.2687805 -0.14195108 0.46049946 0.2304306
## Antibodies_1month_4 -0.15825224 0.6607375 0.79377030 0.07921139 -0.4782224
## Antibodies_6months_4 0.22317195 0.4966492 0.08883715 -0.94455562 -0.8335578
## FAM162A PSMF1 HECW1 MAP2K7 LOC400657
## healthyControl_15 -0.0756954 -0.41451812 0.59047568 -0.1142301 0.31817222
## acuteLymeDisease_13 0.6112867 0.14354845 -0.20705736 0.0973073 -0.03576565
## Antibodies_1month_4 0.7709609 0.09603031 0.26225078 -0.1698379 0.21037936
## Antibodies_6months_4 -0.6535579 -0.19522234 0.02570832 -0.5512561 0.03413296
## PRR24 OR52A4 RGPD3 FRS3 HPGD
## healthyControl_15 0.1925106 0.3949838 -0.07778883 -0.2036598 -0.31984014
## acuteLymeDisease_13 0.5176373 -0.2754617 -0.01048160 0.1514721 0.26282352
## Antibodies_1month_4 -0.2431202 0.4273884 -0.32242823 0.1591983 0.14059359
## Antibodies_6months_4 0.3079028 0.2469349 0.33410000 0.2525678 -0.04983094
## RNF168 KCNJ16 ESYT1 POU4F2 KHDRBS3
## healthyControl_15 0.84589290 0.00178826 0.3702583 0.08467209 0.4332647
## acuteLymeDisease_13 -0.09541321 -0.09380269 -0.5263796 -0.27526760 -0.2171967
## Antibodies_1month_4 -0.72275350 -0.03493690 -0.1543922 0.10672879 0.1636499
## Antibodies_6months_4 -0.78001500 0.70712567 -0.8132048 0.35219288 0.3668466
tail(trainingNormUnoc3)
## class CABP1 POU3F2 CTXN3 CYP7B1
## acuteLymeDisease_22 acute 0.198859930 0.5636461 2.113796e-01 -0.01237798
## Antibodies_6months_3 6 months 0.005481005 0.2960713 1.457065e-01 0.03596401
## Antibodies_1month_19 1 month 0.077300310 0.1283374 8.339159e-02 -0.07159638
## healthyControl_1 healthy 0.000033900 -0.0000310 -2.068333e-06 0.00007770
## Antibodies_1month_25 1 month -0.061007260 -0.3139806 -1.648021e-01 0.03461719
## acuteLymeDisease_16 acute 0.146200180 -0.2785492 -8.703899e-02 -0.03114438
## CENPF PEX26 ISG20 CLEC2L
## acuteLymeDisease_22 0.190623643 2.077085e-01 0.52346134 0.2109024
## Antibodies_6months_3 -0.189381601 1.805775e-01 -0.08145428 0.3529954
## Antibodies_1month_19 -0.021549701 -1.338840e-03 1.34653470 0.6721482
## healthyControl_1 0.000020504 8.380425e-05 -0.00000525 0.0000339
## Antibodies_1month_25 -0.038229700 1.857385e-01 -0.11857653 -0.2541506
## acuteLymeDisease_16 0.122709752 -1.675950e-01 0.25008965 -0.3953524
## TMEM194A PDZRN3 NUDT18 DLG3
## acuteLymeDisease_22 -0.2123386123 2.818368e-01 0.122541904 -0.613058094
## Antibodies_6months_3 -0.0632990217 2.783562e-01 -0.234975340 -0.181450794
## Antibodies_1month_19 0.2203699767 -7.045595e-02 0.775892260 -0.292146015
## healthyControl_1 -0.0002138617 5.150233e-05 0.001091003 -0.000759415
## Antibodies_1month_25 0.1204493800 -2.187032e-01 0.396648880 0.619860268
## acuteLymeDisease_16 0.3153448920 -3.467973e-02 -0.036186695 0.211573980
## IGFALS SLC1A1 F2 OTOS
## acuteLymeDisease_22 -0.08303142 -0.129606130 0.0308383307 0.055172443
## Antibodies_6months_3 0.87783310 0.247302785 0.2235975280 -0.392873050
## Antibodies_1month_19 -0.13888740 0.222857237 -0.0168207747 -0.006484032
## healthyControl_1 -0.00034300 0.000299095 0.0001342283 0.000207000
## Antibodies_1month_25 -0.51494503 0.089521049 -0.3120245933 -0.032780170
## acuteLymeDisease_16 -0.12205100 -0.127218849 -0.1225695383 -0.419476750
## ENO1 GATC FAM162A PSMF1
## acuteLymeDisease_22 0.1172231033 -0.42436218 0.2262662630 -1.502608e-01
## Antibodies_6months_3 -0.0127711307 0.05392551 -0.2357237500 -3.833545e-01
## Antibodies_1month_19 0.8786134567 0.69354630 -0.4382941600 3.992631e-01
## healthyControl_1 -0.0001921633 -0.00025600 0.0000765325 -8.066483e-05
## Antibodies_1month_25 0.1702928577 0.05868816 0.1851600450 1.426514e-02
## acuteLymeDisease_16 0.7285270767 0.06770325 -0.1586052250 5.073412e-02
## HECW1 MAP2K7 LOC400657 PRR24
## acuteLymeDisease_22 0.211710450 -0.3484804180 -0.1357880 -0.08162641
## Antibodies_6months_3 0.305571195 0.6139164520 -0.1460991 0.87414217
## Antibodies_1month_19 0.059431908 -0.0480062960 -0.1668129 0.06050730
## healthyControl_1 -0.000307797 -0.0002498642 -0.0006510 -0.00050500
## Antibodies_1month_25 -0.065520765 0.1364680820 -0.2043467 -0.13330030
## acuteLymeDisease_16 -0.396578185 0.2168077000 0.4075024 0.04976654
## OR52A4 RGPD3 FRS3 HPGD
## acuteLymeDisease_22 -0.19158888 0.025556326 0.324386120 0.0438663946
## Antibodies_6months_3 -0.07754993 -0.176086900 0.321854600 0.1335166987
## Antibodies_1month_19 0.12671065 0.006590605 -0.211962220 -0.1516500700
## healthyControl_1 -0.00027200 -0.000246000 -0.000140000 0.0000453905
## Antibodies_1month_25 0.28631163 0.009201050 -0.007893086 -0.1850745963
## acuteLymeDisease_16 0.16221523 -0.239856960 -0.139697070 0.0678496650
## RNF168 KCNJ16 ESYT1 POU4F2
## acuteLymeDisease_22 0.1008706 0.04425120 -0.33963203 0.058635474
## Antibodies_6months_3 -0.2761946 0.58044040 -0.17030620 -0.219005940
## Antibodies_1month_19 -0.2180100 0.41718853 -0.07835484 0.028754355
## healthyControl_1 -0.0002780 0.00022185 -0.00055800 -0.000424625
## Antibodies_1month_25 0.3044357 -0.07687879 0.07615852 -0.252467275
## acuteLymeDisease_16 0.2950745 0.08341658 0.02792740 -0.271346800
## KHDRBS3
## acuteLymeDisease_22 0.2132141600
## Antibodies_6months_3 0.3479696500
## Antibodies_1month_19 0.1818720115
## healthyControl_1 -0.0002194645
## Antibodies_1month_25 -0.0755381600
## acuteLymeDisease_16 -0.0995574015
Now we have at least one of each class in our testing set. This model will train on the data with the three most deviated from the mean samples removed in all classes except the 6 months class, which only has one most deviated sample removed.
set.seed(589647)
rfModNormUnoc3 <- train(class~., method='rf',
na.action=na.pass,
data=(trainingNormUnoc3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
predRFNormUnoc3 <- predict(rfModNormUnoc3, testingNormUnoc3)
predDFNormUnoc3 <- data.frame(predRFNormUnoc3, type=testingNormUnoc3$class)
predDFNormUnoc3
## predRFNormUnoc3 type
## 1 healthy healthy
## 2 acute acute
## 3 6 months 1 month
## 4 6 months 6 months
pra_NormUnoc3 <- precisionRecallAccuracy(predDFNormUnoc3)
## accuracy is: 75 %
pra_NormUnoc3
## class precision recall accuracy
## 1 healthy 1 1 1
## 2 acute 1 1 1
## 3 1 month 0 0 0.75
## 4 6 months 0.5 1 0.75
The overall accuracy is 75%, because one class was misclassified as 6 months when it wasn’t. That led to a precision of 50% on the 6 month class even though all 6 month classes were predicted accurately (recall of 100%). So this means that originally when keeping all the 6 month samples other than the most deviated we removed, because it scored 100% in prediction for all categories, we now know that removing those additional 6 samples by taking another 2 samples from each of the other three classes affected our model on predicting the 6 months class accurately. But, it did improve the recall and precision for the healthy and acute classes to 100% for precision, recall, and accuracy. but the 1 month class was not identified and was misclassified as the 6 month class. We could be more selective and take only those samples within three standard deviations of the mean and running our algorithms and see how accurate our model is. This is what a course on Linked in learning for recommender systems in python says to do when preprocessing and training your models for classification in that case for sentiment analysis. That course is recommended by Frank Kane, and he has some youtube channel under SunDog he referred to in that online course, but never visited by me. Lets use a little imagination here and say how this effects the model. When someone gives blood at a site, and the values are out of range for the model, then its values in all genes that aren’t within the certain selected range of values would throw an error, and the patient would be told his or her sample came out flawed and needs to be done again, with some questionaire on medications taking or not taking a certain vitamin or not drinking water, or drinking more water, or not eating 12-24 hours before, etc. Then if the next time their sample is taken and it is within the range of values, then it can be used to run the model on to predict whether or not the person has lyme disease in the acute phase or if they are healthy. Because this model scored 100% in precision, recall, and accuracy for the healthy and acute classes, but errored on the classes where the patient was taking antibiotics for 1 month and separately 6 months later. So, these genes are also a good set of genes to use as lyme disease pathogenesis, just like our other sample when excluding the most deviated samples from the training model.
This next section looks at the denormalized data set of 40,000+ genes with out the duplicates removed, and with the sample alias names instead of GSM IDs to look at our body system genes that we explored with the COVID-19 study GSE152418. We will use the lymeMX2-denormalized-easynames.csv file we created ealier.
systemsDF <- read.csv('lymeMX2-denormalized-easynames.csv',sep=',',
header=T, na.strings=c('',' ','NA'))
Lets go ahead and source our scripted file, because I am starting this section of this script after clearing out my objects and closing out the previous sections data. That script of our functions is geneCards2.R
source('geneCards2.R')
For our body systems. Lets look at lymphatic, integumentary, nervous, circulatory, musculature, endocrine, bone structures, and reproductive systems. We will first get the genes that are the top 3 ranked genes of some select genes in those systems. I also want to look up tetanis because the tetanis booster is something we are all supposed to have every 10 years. And alcohol and dopamine genes, as well as OTC drugs like Ibuprofen and aspirin, tylenol, NSAIDs, and cannabidiol genes for the toxic and non-toxic genes related to marijuana. For lymphatic, lets just enter lymphatic, for integumentary we should use some epithelial genes. Lets just see what it pulls up on the systems by what we type for the systems.
find25genes('integumentary')
find25genes('nervous')
getProteinGenes('integumentary')
getProteinGenes('nervous')
integumentary <- read.csv('Top25integumentarys.csv')
nervous <- read.csv('Top25nervouss.csv')
for (i in integumentary$proteinType){
getSummaries2(i,'integumentary')
}
for (i in nervous$proteinType){
getSummaries2(i,'nervous')
}
getGeneSummaries('integumentary')
getGeneSummaries('nervous')
integumentarySumms <- read.csv('proteinGeneSummaries_integumentary.csv')
nervousSumms <- read.csv('proteinGeneSummaries_nervous.csv')
Lets now look at the epithelial system which includes the skin and the lining of the organs and membranes.
find25genes('epithelial')
getProteinGenes('epithelial')
epithelial <- read.csv("Top25epithelials.csv")
for (i in epithelial$proteinType){
getSummaries2(i,'epithelial')
}
getGeneSummaries('epithelial')
epithelialSumms <- read.csv("proteinGeneSummaries_epithelial.csv")
find25genes('lymphatic')
getProteinGenes('lymphatic')
lymphatic <- read.csv("Top25lymphatics.csv")
for (i in lymphatic$proteinType){
getSummaries2(i,'lymphatic')
}
getGeneSummaries('lymphatic')
lymphaticSumms <- read.csv("proteinGeneSummaries_lymphatic.csv")
find25genes('circulatory')
getProteinGenes('circulatory')
circulatory <- read.csv("Top25circulatorys.csv")
for (i in circulatory$proteinType){
getSummaries2(i, 'circulatory')
}
getGeneSummaries('circulatory')
circulatorySumms <- read.csv("proteinGeneSummaries_circulatory.csv")
find25genes('musculature')
getProteinGenes('musculature')
musculature <- read.csv("Top25musculatures.csv")
for (i in musculature$proteinType){
getSummaries2(i,'musculature')
}
getGeneSummaries('musculature')
musculatureSumms <- read.csv("proteinGeneSummaries_musculature.csv")
find25genes('endocrine')
getProteinGenes('endocrine')
endocrine <- read.csv("Top25endocrines.csv")
for (i in endocrine$proteinType){
getSummaries2(i,'endocrine')
}
getGeneSummaries('endocrine')
endocrineSumms <- read.csv("proteinGeneSummaries_endocrine.csv")
find25genes('bone structure')
getProteinGenes('bone structure')
boneStructure <- read.csv("Top25bone-structures.csv")
for (i in boneStructure$proteinType){
getSummaries2(i,'bone structure')
}
getGeneSummaries('bone structure')
boneStructureSumms <- read.csv("proteinGeneSummaries_bone-structure.csv")
find25genes('reproductive')
getProteinGenes('reproductive')
reproductive <- read.csv("Top25reproductives.csv")
for (i in reproductive$proteinType){
getSummaries2(i,'reproductive')
}
getGeneSummaries('reproductive')
reproductiveSumms <- read.csv("proteinGeneSummaries_reproductive.csv")
find25genes('tetanis')
getProteinGenes('tetanis')
tetanis <- read.csv("Top25tetaniss.csv")
for (i in tetanis$proteinType){
getSummaries2(i,'tetanis')
}
getGeneSummaries('tetanis')
tetanisSumms <- read.csv("proteinGeneSummaries_tetanis.csv")
find25genes('alcohol')
getProteinGenes('alcohol')
alcohol <- read.csv("Top25alcohols.csv")
for (i in alcohol$proteinType){
getSummaries2(i,'alcohol')
}
getGeneSummaries('alcohol')
alcoholSumms <- read.csv("proteinGeneSummaries_alcohol.csv")
find25genes('dopamine')
getProteinGenes('dopamine')
dopamine <- read.csv("Top25dopamines.csv")
for (i in dopamine$proteinType){
getSummaries2(i, 'dopamine')
}
getGeneSummaries('dopamine')
dopamineSumms <- read.csv("proteinGeneSummaries_dopamine.csv")
find25genes('ibuprofen')
getProteinGenes('ibuprofen')
ibuprofen <- read.csv("Top25ibuprofens.csv")
for (i in ibuprofen$proteinType){
getSummaries2(i,'ibuprofen')
}
getGeneSummaries('ibuprofen')
ibuprofenSumms <- read.csv("proteinGeneSummaries_ibuprofen.csv")
find25genes('aspirin')
getProteinGenes('aspirin')
aspirin <- read.csv("Top25aspirins.csv")
for (i in aspirin$proteinType){
getSummaries2(i,'aspirin')
}
getGeneSummaries('aspirin')
aspirinSumms <- read.csv("proteinGeneSummaries_aspirin.csv")
find25genes('tylenol')
getProteinGenes('tylenol')
tylenol <- read.csv("Top25tylenols.csv")
for (i in tylenol$proteinType){
getSummaries2(i,'tylenol')
}
getGeneSummaries('tylenol')
tylenolSumms <- read.csv("proteinGeneSummaries_tylenol.csv")
find25genes('NSAIDs')
getProteinGenes('NSAIDs')
nsaid <- read.csv("Top25nsaidss.csv")
for (i in nsaid$proteinType){
getSummaries2(i,'NSAIDs')
}
getGeneSummaries('NSAIDs')
NSAID_summs <- read.csv("proteinGeneSummaries_nsaids.csv")
find25genes('cannabidiol')
getProteinGenes('cannabidiol')
cannabidiol <- read.csv("Top25cannabidiols.csv")
for (i in cannabidiol$proteinType){
getSummaries2(i,'cannabidiol')
}
getGeneSummaries('cannabidiol')
cannabidiolSumms <- read.csv("proteinGeneSummaries_cannabidiol.csv")
Lets combine all the genes from these data sets of 25 genes together for their data sets of gene summaries.
allSystemSumms <- rbind(lymphaticSumms, integumentarySumms,
circulatorySumms, musculatureSumms,
endocrineSumms,boneStructureSumms,
reproductiveSumms,tetanisSumms,
alcoholSumms,ibuprofenSumms,
aspirinSumms,tylenolSumms,
NSAID_summs,cannabidiolSumms)
Lets also just combine the top 3 of each body system gene into a separate data set.
allSystemSummsFirst3 <- rbind(lymphaticSumms[1:3,], integumentarySumms[1:3,],
nervousSumms[1:3,],
circulatorySumms[1:3,], musculatureSumms[1:3,],
endocrineSumms[1:3,],boneStructureSumms[1:3,],
reproductiveSumms[1:3,],tetanisSumms[1:3,],
alcoholSumms[1:3,],ibuprofenSumms[1:3,],
aspirinSumms[1:3,],tylenolSumms[1:3,],
NSAID_summs[1:3,],cannabidiolSumms[1:3,])
Lets also use the vitamin, mineral, and hormonal genes used in our analysis of COVID-19 of study GSE152418.
Lets not stop at the sun genes, as a massage therapist of more than 14 years of experience and having recently studied for and taken and passed my MBLEx or Massage and Bodywork Licensing Examination, I can tell you there are many fascinating items of the body systems and mineral as well as vitamin dependencies that lead to disease in some people. But when relearning the endocrine system and the hormones related to the pineal, hypothalamus, pituitary, adrenals, thyroid, and pancreas many other vitamins, steroids, and hormones should be looked at in studying these different cases of COVID-19.
We will look at the Vitamin C which helps the body absorb Vitamin D and make calcium in the bone blood, the glucagon that turn glucose into sugar and insulin that lowers glucose in the blood having to do with the pancreas hormones, dopamine that relates to parkinsons disease when the hypothalamus doesn’t produce enough, melatonin that regulates sleep and produced by the pineal gland near the pituitary and hypothalamus in the brain that regulates sleep, estrogen, prolactin, and progesterone regulated by the pituitary gland in the brain in females, testosterone regulated by the males in their testes, and corticosteroids and adrenaline regulated by the adrenals when in sympathetic response of danger in the body. Also, the vitamins that people are commonly told to take in addition to Vitamin C and Vitamin D, such as fish oil or omega 3s, vitamin B12 or zinc, and magnesium mineral.Also, calcitonin, a thyroid hormone that breaks down calcium so that the kidneys don’t get kidney stones nor other healthy problems.
find25genes('vitamin D')
find25genes('melanin')
find25genes('vitamin C')
find25genes('glucose')
find25genes('insulin')
find25genes('glucagon')
find25genes('dopamine')
find25genes('estrogen')
find25genes('progesterone')
find25genes('prolactin')
find25genes('testosterone')
find25genes('calcium')
find25genes('melatonin')
find25genes('vitamin B12')
find25genes('zinc')
find25genes('magnesium')
find25genes('fish oil')
find25genes('omega 3s')
find25genes('adrenaline')
find25genes('corticosteroids')
find25genes('calcitonine')
find25genes('iron')
getProteinGenes('vitamin D')
getProteinGenes('melanin')
getProteinGenes('vitamin C')
getProteinGenes('glucose')
getProteinGenes('insulin')
getProteinGenes('glucagon')
getProteinGenes('dopamine')
getProteinGenes('estrogen')
getProteinGenes('progesterone')
getProteinGenes('prolactin')
getProteinGenes('testosterone')
getProteinGenes('calcium')
getProteinGenes('melatonin')
getProteinGenes('vitamin B12')
getProteinGenes('zinc')
getProteinGenes('magnesium')
getProteinGenes('fish oil')
getProteinGenes('omega 3s')
getProteinGenes('adrenaline')
getProteinGenes('corticosteroids')
getProteinGenes('calcitonine')
getProteinGenes('iron')
vitD <- read.csv('Top25vitamin-ds.csv')
melanin <- read.csv('Top25melanins.csv')
vitC <- read.csv('Top25vitamin-cs.csv')
glucose <- read.csv('Top25glucoses.csv')
insulin <- read.csv('Top25insulins.csv')
glucagon <- read.csv('Top25glucagons.csv')
dopamine <- read.csv('Top25dopamines.csv')
estrogen <- read.csv('Top25estrogens.csv')
progesterone <- read.csv('Top25progesterones.csv')
prolactin <- read.csv('Top25prolactins.csv')
testosterone <- read.csv('Top25testosterones.csv')
calcium <- read.csv('Top25calciums.csv')
melatonin <- read.csv('Top25melatonins.csv')
vitB12 <- read.csv('Top25vitamin-b12s.csv')
zinc <- read.csv('Top25zincs.csv')
magnesium <- read.csv('Top25magnesiums.csv')
fishOil <- read.csv('Top25fish-oils.csv')
omega3s <- read.csv('Top25omega-3ss.csv')
adrenaline <- read.csv('Top25adrenalines.csv')
corticosteroid <- read.csv('Top25corticosteroidss.csv')
calcitonine <- read.csv('Top25calcitonines.csv')
iron <- read.csv('Top25irons.csv')
Lets only take the top 3 from each data frame of mineral, vitamin, or steroid.
vitMinSter <- rbind(vitD[1:3,1:2],melanin[1:3,1:2],
vitC[1:3,1:2],glucose[1:3,1:2],
insulin[1:3,1:2],glucagon[1:3,1:2],
dopamine[1:3,1:2],estrogen[1:3,1:2],
progesterone[1:3,1:2],prolactin[1:3,1:2],
testosterone[1:3,1:2],calcium[1:3,1:2],
calcitonine[1:3,1:2],melatonin[1:3,1:2],
vitB12[1:3,1:2],zinc[1:3,1:2],magnesium[1:3,1:2],
fishOil[1:3,1:2],omega3s[1:3,1:2],
adrenaline[1:3,1:2],iron[1:3,1:2],
corticosteroid[1:3,1:2])
head(vitMinSter)
## proteinType proteinSearched
## 1 VDR vitamin-d
## 2 CYP27B1 vitamin-d
## 3 PHEX vitamin-d
## 4 TYR melanin
## 5 TYRP1 melanin
## 6 OCA2 melanin
Some of the genes associated with one vitamin also associate with another. We will keep them this way for the visualizations or charting.We could make a link analysis with these genes that are associated with other vitamins and minerals, but if not then you should.
Lets now get the gene summaries of these genes.
for (i in vitMinSter$proteinType){
getSummaries2(i,'protein')
}
getGeneSummaries('protein')
vitMinSterSumms <- read.csv("proteinGeneSummaries_protein.csv"
)
vitMinSterSumms2 <- vitMinSterSumms[,c(2:7)]
head(vitMinSterSumms2)
## gene EnsemblID
## 1 VDR ENSG00000111424
## 2 CYP27B1 ENSG00000111012
## 3 PHEX ENSG00000102174
## 4 TYR ENSG00000077498
## 5 TYRP1 ENSG00000107165
## 6 OCA2 ENSG00000104044
## EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## GeneCardsSummary
## 1 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## 2 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 3 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 4 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 5 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 6 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## UniProtKB_Summary
## 1 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 3 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 4 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 5 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 6 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
## todaysDate
## 1 Thu Sep 03 14:11:07 2020
## 2 Thu Sep 03 14:11:09 2020
## 3 Thu Sep 03 14:11:12 2020
## 4 Thu Sep 03 14:11:13 2020
## 5 Thu Sep 03 14:11:15 2020
## 6 Thu Sep 03 14:11:16 2020
Combine the vitamin searched with the gene from the last two data frames.
vitamins <- merge(vitMinSter,vitMinSterSumms2,
by.x='proteinType',
by.y='gene')
vitamins2 <- vitamins[!duplicated(vitamins),]
colnames(vitamins2)[1] <- 'gene'
head(vitamins2)
## gene proteinSearched EnsemblID
## 1 AANAT melatonin ENSG00000129673
## 2 ANKH calcium ENSG00000154122
## 3 APOA1 fish-oil ENSG00000118137
## 4 APOB fish-oil ENSG00000084674
## 5 CACNA1B omega-3s ENSG00000148408
## 6 CALCA calcitonine ENSG00000110680
## EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 This gene encodes a multipass transmembrane protein that is expressed in joints and other tissues and controls pyrophosphate levels in cultured cells. Progressive ankylosis-mediated control of pyrophosphate levels has been suggested as a possible mechanism regulating tissue calcification and susceptibility to arthritis in higher animals. Mutations in this gene have been associated with autosomal dominant craniometaphyseal dysplasia. [provided by RefSeq, Jul 2008]
## 3 This gene encodes apolipoprotein A-I, which is the major protein component of high density lipoprotein (HDL) in plasma. The encoded preproprotein is proteolytically processed to generate the mature protein, which promotes cholesterol efflux from tissues to the liver for excretion, and is a cofactor for lecithin cholesterolacyltransferase (LCAT), an enzyme responsible for the formation of most plasma cholesteryl esters. This gene is closely linked with two other apolipoprotein genes on chromosome 11. Defects in this gene are associated with HDL deficiencies, including Tangier disease, and with systemic non-neuropathic amyloidosis. Alternative splicing results in multiple transcript variants, at least one of which encodes a preproprotein. [provided by RefSeq, Dec 2015]
## 4 This gene product is the main apolipoprotein of chylomicrons and low density lipoproteins (LDL), and is the ligand for the LDL receptor. It occurs in plasma as two main isoforms, apoB-48 and apoB-100: the former is synthesized exclusively in the gut and the latter in the liver. The intestinal and the hepatic forms of apoB are encoded by a single gene from a single, very long mRNA. The two isoforms share a common N-terminal sequence. The shorter apoB-48 protein is produced after RNA editing of the apoB-100 transcript at residue 2180 (CAA->UAA), resulting in the creation of a stop codon, and early translation termination. Mutations in this gene or its regulatory region cause hypobetalipoproteinemia, normotriglyceridemic hypobetalipoproteinemia, and hypercholesterolemia due to ligand-defective apoB, diseases affecting plasma cholesterol and apoB levels. [provided by RefSeq, Dec 2019]
## 5 The protein encoded by this gene is the pore-forming subunit of an N-type voltage-dependent calcium channel, which controls neurotransmitter release from neurons. The encoded protein forms a complex with alpha-2, beta, and delta subunits to form the high-voltage activated channel. This channel is sensitive to omega-conotoxin-GVIA and omega-agatoxin-IIIA but insensitive to dihydropyridines. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Aug 2011]
## 6 This gene encodes the peptide hormones calcitonin, calcitonin gene-related peptide and katacalcin by tissue-specific alternative RNA splicing of the gene transcripts and cleavage of inactive precursor proteins. Calcitonin is involved in calcium regulation and acts to regulate phosphorus metabolism. Calcitonin gene-related peptide functions as a vasodilator and as an antimicrobial peptide while katacalcin is a calcium-lowering peptide. Multiple transcript variants encoding different isoforms have been found for this gene.[provided by RefSeq, Aug 2014]
## GeneCardsSummary
## 1 AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene. Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome. Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism. Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.
## 2 ANKH (ANKH Inorganic Pyrophosphate Transport Regulator) is a Protein Coding gene. Diseases associated with ANKH include Craniometaphyseal Dysplasia, Autosomal Dominant and Chondrocalcinosis 2. Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Miscellaneous transport and binding events. Gene Ontology (GO) annotations related to this gene include inorganic phosphate transmembrane transporter activity and inorganic diphosphate transmembrane transporter activity.
## 3 APOA1 (Apolipoprotein A1) is a Protein Coding gene. Diseases associated with APOA1 include Hypoalphalipoproteinemia, Primary, 2 and Amyloidosis, Familial Visceral. Among its related pathways are Lipoprotein metabolism and Folate Metabolism. Gene Ontology (GO) annotations related to this gene include identical protein binding and lipid binding. An important paralog of this gene is APOA4.
## 4 APOB (Apolipoprotein B) is a Protein Coding gene. Diseases associated with APOB include Hypobetalipoproteinemia, Familial, 1 and Hypercholesterolemia, Familial, 2. Among its related pathways are Activated TLR4 signalling and Lipoprotein metabolism. Gene Ontology (GO) annotations related to this gene include binding and heparin binding.
## 5 CACNA1B (Calcium Voltage-Gated Channel Subunit Alpha1 B) is a Protein Coding gene. Diseases associated with CACNA1B include Neurodevelopmental Disorder With Seizures And Nonepileptic Hyperkinetic Movements and Undetermined Early-Onset Epileptic Encephalopathy. Among its related pathways are Nicotine addiction and ADP signalling through P2Y purinoceptor 12. Gene Ontology (GO) annotations related to this gene include calcium ion binding and ion channel activity. An important paralog of this gene is CACNA1A.
## 6 CALCA (Calcitonin Related Polypeptide Alpha) is a Protein Coding gene. Diseases associated with CALCA include Reflex Sympathetic Dystrophy and Spinal Stenosis. Among its related pathways are Neuroscience and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include identical protein binding. An important paralog of this gene is CALCB.
## UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n SNAT_HUMAN,Q16613\n
## 2 Regulates intra- and extracellular levels of inorganic pyrophosphate (PPi), probably functioning as PPi transporter.\n ANKH_HUMAN,Q9HCJ1\n
## 3 Participates in the reverse transport of cholesterol from tissues to the liver for excretion by promoting cholesterol efflux from tissues and by acting as a cofactor for the lecithin cholesterol acyltransferase (LCAT). As part of the SPAP complex, activates spermatozoa motility.\n APOA1_HUMAN,P02647\n
## 4 Apolipoprotein B is a major protein constituent of chylomicrons (apo B-48), LDL (apo B-100) and VLDL (apo B-100). Apo B-100 functions as a recognition signal for the cellular binding and internalization of LDL particles by the apoB/E receptor.\n APOB_HUMAN,P04114\n
## 5 Voltage-sensitive calcium channels (VSCC) mediate the entry of calcium ions into excitable cells and are also involved in a variety of calcium-dependent processes, including muscle contraction, hormone or neurotransmitter release, gene expression, cell motility, cell division and cell death. The isoform alpha-1B gives rise to N-type calcium currents. N-type calcium channels belong to the 'high-voltage activated' (HVA) group and are specifically blocked by omega-conotoxin-GVIA (AC P01522) (AC P01522) (By similarity). They are however insensitive to dihydropyridines (DHP). Calcium channels containing alpha-1B subunit may play a role in directed migration of immature neurons.\n CAC1B_HUMAN,Q00975\n
## 6 CGRP induces vasodilation. It dilates a variety of vessels including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.\n CALCA_HUMAN,P06881\n
## todaysDate
## 1 Thu Sep 03 14:12:01 2020
## 2 Thu Sep 03 14:11:52 2020
## 3 Thu Sep 03 14:12:14 2020
## 4 Thu Sep 03 14:12:16 2020
## 5 Thu Sep 03 14:12:20 2020
## 6 Thu Sep 03 14:11:54 2020
allwithVits <- rbind(allSystemSumms,vitamins2)
head(allwithVits)
## proteinSearched gene EnsemblID
## 1 lymphatic FLT4 ENSG00000037280
## 2 lymphatic VEGFC ENSG00000150630
## 3 lymphatic LYVE1 ENSG00000133800
## 4 lymphatic SOX18 ENSG00000203883
## 5 lymphatic PIK3CA ENSG00000121879
## 6 lymphatic CCBE1 ENSG00000183287
## EntrezSummary
## 1 This gene encodes a tyrosine kinase receptor for vascular endothelial growth factors C and D. The protein is thought to be involved in lymphangiogenesis and maintenance of the lymphatic endothelium. Mutations in this gene cause hereditary lymphedema type IA. [provided by RefSeq, Jul 2008]
## 2 The protein encoded by this gene is a member of the platelet-derived growth factor/vascular endothelial growth factor (PDGF/VEGF) family. The encoded protein promotes angiogenesis and endothelial cell growth, and can also affect the permeability of blood vessels. The proprotein is further cleaved into a fully processed form that can bind and activate VEGFR-2 and VEGFR-3 receptors. [provided by RefSeq, Apr 2014]
## 3 This gene encodes a type I integral membrane glycoprotein. The encoded protein acts as a receptor and binds to both soluble and immobilized hyaluronan. This protein may function in lymphatic hyaluronan transport and have a role in tumor metastasis. [provided by RefSeq, Jul 2008]
## 4 This gene encodes a member of the SOX (SRY-related HMG-box) family of transcription factors involved in the regulation of embryonic development and in the determination of the cell fate. The encoded protein may act as a transcriptional regulator after forming a protein complex with other proteins. This protein plays a role in hair, blood vessel, and lymphatic vessel development. Mutations in this gene have been associated with recessive and dominant forms of hypotrichosis-lymphedema-telangiectasia. [provided by RefSeq, Jul 2008]
## 5 Phosphatidylinositol 3-kinase is composed of an 85 kDa regulatory subunit and a 110 kDa catalytic subunit. The protein encoded by this gene represents the catalytic subunit, which uses ATP to phosphorylate PtdIns, PtdIns4P and PtdIns(4,5)P2. This gene has been found to be oncogenic and has been implicated in cervical cancers. A pseudogene of this gene has been defined on chromosome 22. [provided by RefSeq, Apr 2016]
## 6 This gene is thought to function in extracellular matrix remodeling and migration. It is predominantly expressed in the ovary, but down regulated in ovarian cancer cell lines and primary carcinomas, suggesting its role as a tumour suppressor. Mutations in this gene have been associated with Hennekam lymphangiectasia-lymphedema syndrome, a generalized lymphatic dysplasia in humans. [provided by RefSeq, Mar 2010]
## GeneCardsSummary
## 1 FLT4 (Fms Related Receptor Tyrosine Kinase 4) is a Protein Coding gene. Diseases associated with FLT4 include Lymphatic Malformation 1 and Congenital Heart Defects, Multiple Types, 7. Among its related pathways are Signaling by GPCR and NF-KappaB Family Pathway. Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein tyrosine kinase activity. An important paralog of this gene is KDR.
## 2 VEGFC (Vascular Endothelial Growth Factor C) is a Protein Coding gene. Diseases associated with VEGFC include Lymphatic Malformation 4 and Hereditary Lymphedema Id. Among its related pathways are HIF1Alpha Pathway and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include growth factor activity and vascular endothelial growth factor receptor 3 binding. An important paralog of this gene is VEGFD.
## 3 LYVE1 (Lymphatic Vessel Endothelial Hyaluronan Receptor 1) is a Protein Coding gene. Diseases associated with LYVE1 include Intramuscular Hemangioma and Middle Cerebral Artery Infarction. Among its related pathways are Cell adhesion_Cell-matrix glycoconjugates and Glycosaminoglycan metabolism. Gene Ontology (GO) annotations related to this gene include hyaluronic acid binding. An important paralog of this gene is CD44.
## 4 SOX18 (SRY-Box Transcription Factor 18) is a Protein Coding gene. Diseases associated with SOX18 include Hypotrichosis-Lymphedema-Telangiectasia-Renal Defect Syndrome and Hypotrichosis-Lymphedema-Telangiectasia Syndrome. Among its related pathways are ERK Signaling. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and protein heterodimerization activity. An important paralog of this gene is SOX17.
## 5 PIK3CA (Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha) is a Protein Coding gene. Diseases associated with PIK3CA include Hepatocellular Carcinoma and Megalencephaly-Capillary Malformation-Polymicrogyria Syndrome. Among its related pathways are GDNF-Family Ligands and Receptor Interactions and RET signaling. Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein serine/threonine kinase activity. An important paralog of this gene is PIK3CB.
## 6 CCBE1 (Collagen And Calcium Binding EGF Domains 1) is a Protein Coding gene. Diseases associated with CCBE1 include Hennekam Lymphangiectasia-Lymphedema Syndrome 1 and Hennekam Syndrome. Gene Ontology (GO) annotations related to this gene include calcium ion binding and collagen binding.
## UniProtKB_Summary
## 1 Tyrosine-protein kinase that acts as a cell-surface receptor for VEGFC and VEGFD, and plays an essential role in adult lymphangiogenesis and in the development of the vascular network and the cardiovascular system during embryonic development. Promotes proliferation, survival and migration of endothelial cells, and regulates angiogenic sprouting. Signaling by activated FLT4 leads to enhanced production of VEGFC, and to a lesser degree VEGFA, thereby creating a positive feedback loop that enhances FLT4 signaling. Modulates KDR signaling by forming heterodimers. The secreted isoform 3 may function as a decoy receptor for VEGFC and/or VEGFD and play an important role as a negative regulator of VEGFC-mediated lymphangiogenesis and angiogenesis. Binding of vascular growth factors to isoform 1 or isoform 2 leads to the activation of several signaling cascades; isoform 2 seems to be less efficient in signal transduction, because it has a truncated C-terminus and therefore lacks several phosphorylation sites. Mediates activation of the MAPK1/ERK2, MAPK3/ERK1 signaling pathway, of MAPK8 and the JUN signaling pathway, and of the AKT1 signaling pathway. Phosphorylates SHC1. Mediates phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Promotes phosphorylation of MAPK8 at 'Thr-183' and 'Tyr-185', and of AKT1 at 'Ser-473'.\n VGFR3_HUMAN,P35916\n
## 2 Growth factor active in angiogenesis, and endothelial cell growth, stimulating their proliferation and migration and also has effects on the permeability of blood vessels. May function in angiogenesis of the venous and lymphatic vascular systems during embryogenesis, and also in the maintenance of differentiated lymphatic endothelium in adults. Binds and activates KDR/VEGFR2 and FLT4/VEGFR3 receptors.\n VEGFC_HUMAN,P49767\n
## 3 Ligand-specific transporter trafficking between intracellular organelles (TGN) and the plasma membrane. Plays a role in autocrine regulation of cell growth mediated by growth regulators containing cell surface retention sequence binding (CRS). May act as a hyaluronan (HA) transporter, either mediating its uptake for catabolism within lymphatic endothelial cells themselves, or its transport into the lumen of afferent lymphatic vessels for subsequent re-uptake and degradation in lymph nodes.\n LYVE1_HUMAN,Q9Y5Y7\n
## 4 Transcriptional activator that binds to the consensus sequence 5'-AACAAAG-3' in the promoter of target genes and plays an essential role in embryonic cardiovascular development and lymphangiogenesis. Activates transcription of PROX1 and other genes coding for lymphatic endothelial markers. Plays an essential role in triggering the differentiation of lymph vessels, but is not required for the maintenance of differentiated lymphatic endothelial cells. Plays an important role in postnatal angiogenesis, where it is functionally redundant with SOX17. Interaction with MEF2C enhances transcriptional activation. Besides, required for normal hair development.\n SOX18_HUMAN,P35713\n
## 5 Phosphoinositide-3-kinase (PI3K) that phosphorylates PtdIns (Phosphatidylinositol), PtdIns4P (Phosphatidylinositol 4-phosphate) and PtdIns(4,5)P2 (Phosphatidylinositol 4,5-bisphosphate) to generate phosphatidylinositol 3,4,5-trisphosphate (PIP3). PIP3 plays a key role by recruiting PH domain-containing proteins to the membrane, including AKT1 and PDPK1, activating signaling cascades involved in cell growth, survival, proliferation, motility and morphology. Participates in cellular signaling in response to various growth factors. Involved in the activation of AKT1 upon stimulation by receptor tyrosine kinases ligands such as EGF, insulin, IGF1, VEGFA and PDGF. Involved in signaling via insulin-receptor substrate (IRS) proteins. Essential in endothelial cell migration during vascular development through VEGFA signaling, possibly by regulating RhoA activity. Required for lymphatic vasculature development, possibly by binding to RAS and by activation by EGF and FGF2, but not by PDGF. Regulates invadopodia formation through the PDPK1-AKT1 pathway. Participates in cardiomyogenesis in embryonic stem cells through a AKT1 pathway. Participates in vasculogenesis in embryonic stem cells through PDK1 and protein kinase C pathway. Also has serine-protein kinase activity: phosphorylates PIK3R1 (p85alpha regulatory subunit), EIF4EBP1 and HRAS. Plays a role in the positive regulation of phagocytosis and pinocytosis (By similarity).\n PK3CA_HUMAN,P42336\n
## 6 Required for lymphangioblast budding and angiogenic sprouting from venous endothelium during embryogenesis.\n CCBE1_HUMAN,Q6UXH8\n
## todaysDate
## 1 Thu Sep 03 13:49:53 2020
## 2 Thu Sep 03 13:49:56 2020
## 3 Thu Sep 03 13:49:57 2020
## 4 Thu Sep 03 13:49:58 2020
## 5 Thu Sep 03 13:49:59 2020
## 6 Thu Sep 03 13:50:00 2020
Lets merge the genes of 40k+ with both sets of genes.
all375plus <- merge(allwithVits,systemsDF, by.x='gene', by.y='gene')
head(all375plus)
## gene proteinSearched EnsemblID
## 1 AANAT melatonin ENSG00000129673
## 2 ABCB1 tylenol ENSG00000085563
## 3 ABCB1 tylenol ENSG00000085563
## 4 ABCB1 tylenol ENSG00000085563
## 5 ABCB1 tylenol ENSG00000085563
## 6 ABCC1 cannabidiol ENSG00000103222
## EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 3 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 4 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 5 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 6 The protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra-and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This full transporter is a member of the MRP subfamily which is involved in multi-drug resistance. This protein functions as a multispecific organic anion transporter, with oxidized glutatione, cysteinyl leukotrienes, and activated aflatoxin B1 as substrates. This protein also transports glucuronides and sulfate conjugates of steroid hormones and bile salts. Alternatively spliced variants of this gene have been described but their full-length nature is unknown. [provided by RefSeq, Apr 2012]
## GeneCardsSummary
## 1 AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene. Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome. Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism. Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.
## 2 ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene. Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13. Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics. Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances. An important paralog of this gene is ABCB4.
## 3 ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene. Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13. Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics. Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances. An important paralog of this gene is ABCB4.
## 4 ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene. Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13. Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics. Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances. An important paralog of this gene is ABCB4.
## 5 ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene. Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13. Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics. Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances. An important paralog of this gene is ABCB4.
## 6 ABCC1 (ATP Binding Cassette Subfamily C Member 1) is a Protein Coding gene. Diseases associated with ABCC1 include Dubin-Johnson Syndrome and Pseudoxanthoma Elasticum. Among its related pathways are Arachidonic acid metabolism and Sphingolipid signaling pathway. Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances. An important paralog of this gene is ABCC3.
## UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n SNAT_HUMAN,Q16613\n
## 2 Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n MDR1_HUMAN,P08183\n
## 3 Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n MDR1_HUMAN,P08183\n
## 4 Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n MDR1_HUMAN,P08183\n
## 5 Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n MDR1_HUMAN,P08183\n
## 6 Mediates export of organic anions and drugs from the cytoplasm (PubMed:7961706, PubMed:16230346, PubMed:9281595, PubMed:10064732, PubMed:11114332). Mediates ATP-dependent transport of glutathione and glutathione conjugates, leukotriene C4, estradiol-17-beta-o-glucuronide, methotrexate, antiviral drugs and other xenobiotics (PubMed:7961706, PubMed:16230346, PubMed:9281595, PubMed:10064732, PubMed:11114332). Confers resistance to anticancer drugs by decreasing accumulation of drug in cells, and by mediating ATP- and GSH-dependent drug export (PubMed:9281595). Hydrolyzes ATP with low efficiency (PubMed:16230346). Catalyzes the export of sphingosine 1-phosphate from mast cells independently of their degranulation (PubMed:17050692). Participates in inflammatory response by allowing export of leukotriene C4 from leukotriene C4-synthezing cells (By similarity).\n MRP1_HUMAN,P33527\n
## todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020 38.65447 43.30859 34.63334
## 2 Thu Sep 03 14:06:27 2020 41.22896 47.16127 21.53185
## 3 Thu Sep 03 14:06:27 2020 32.25172 31.97326 56.89866
## 4 Thu Sep 03 14:06:27 2020 38.24506 41.06669 28.37736
## 5 Thu Sep 03 14:06:27 2020 32.99834 39.03790 27.31621
## 6 Thu Sep 03 14:08:53 2020 28.91772 40.26726 16.87763
## healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1 19.88367 31.57302 17.33214 69.51032
## 2 21.29092 30.00952 20.07555 33.02105
## 3 18.42561 22.80082 30.40254 41.85742
## 4 19.97834 26.84866 21.20648 49.07782
## 5 30.76515 29.44986 48.65851 44.60380
## 6 24.20228 35.25948 19.15219 26.95437
## healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1 11.902382 29.29495 22.12536 84.22110
## 2 10.757634 18.82793 21.56646 79.84726
## 3 15.256492 16.62954 23.08130 92.62902
## 4 8.592051 19.40342 15.56581 64.05169
## 5 15.099013 21.15442 14.42682 74.99046
## 6 10.947289 25.50794 17.22340 46.21760
## healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1 50.67037 10.37422 31.11561 23.81043
## 2 36.34932 11.90963 49.19399 17.42491
## 3 38.90237 12.18897 28.95807 16.91220
## 4 49.74689 12.66472 35.52134 20.32528
## 5 69.92664 14.20984 22.59296 15.04298
## 6 42.64672 13.63434 33.65687 18.83468
## healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1 20.46186 41.48475 18.62474 46.30124
## 2 18.40882 49.98046 18.95894 49.55159
## 3 26.11347 32.81002 20.61510 106.03831
## 4 19.75054 34.48769 17.11111 108.70370
## 5 32.37899 33.79629 18.68831 43.80003
## 6 19.55225 40.03501 21.74411 40.66432
## healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1 45.55125 15.28551 71.80305 21.75414
## 2 36.03069 31.41368 38.05233 21.02175
## 3 38.48303 21.02176 45.82799 20.52045
## 4 45.94058 22.14821 48.30754 25.54595
## 5 35.39077 53.24209 77.74697 43.00508
## 6 30.89479 19.71880 46.64285 21.31349
## acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1 49.93791 20.91372 60.35461 29.56083
## 2 39.44675 21.38210 45.26092 41.05704
## 3 44.08240 18.79489 44.00047 34.33294
## 4 51.66513 26.43767 40.24261 34.65819
## 5 30.09120 22.49047 22.04645 35.70981
## 6 35.81335 17.20735 37.86800 40.76434
## acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1 115.26505 12.74919 14.58672 58.68188
## 2 817.18666 13.66956 14.09339 48.10711
## 3 505.81829 21.29795 13.70197 49.31229
## 4 132.25512 13.35872 11.50697 41.51591
## 5 106.28833 34.43221 15.08469 45.77373
## 6 60.77827 17.23546 18.66173 71.35843
## acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1 33.32225 23.25981 9.588795
## 2 40.30350 19.15355 7.888114
## 3 41.56268 21.22284 8.447833
## 4 40.41479 20.12256 7.350215
## 5 39.50821 29.78886 15.063060
## 6 53.69678 27.22513 8.894936
## acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1 20.86345 29.27038 43.14832
## 2 25.21845 19.74167 29.28886
## 3 25.32556 21.71775 41.87143
## 4 25.56478 24.96515 47.03945
## 5 69.40036 17.89554 69.75110
## 6 24.18120 24.62062 30.00687
## acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1 14.57474 28.45002 15.69350
## 2 11.87161 24.10970 16.67370
## 3 14.17475 22.54048 15.10654
## 4 10.21484 31.30573 20.25107
## 5 13.02415 48.07647 14.64475
## 6 13.35633 30.66082 17.37290
## acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1 21.95486 45.46997 28.11229
## 2 15.69664 57.35794 51.84340
## 3 15.44018 60.99973 34.66314
## 4 20.28179 44.40500 38.41623
## 5 14.21193 71.96156 42.69591
## 6 25.51457 44.92045 34.12562
## acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1 34.01966 36.43714 11.098142
## 2 25.09195 38.82158 10.132455
## 3 32.02725 28.75331 9.661386
## 4 22.30251 24.74467 11.006607
## 5 15.44757 17.43424 19.974906
## 6 23.61546 21.99793 15.807990
## acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1 20.95721 18.77368 22.93936
## 2 30.10835 20.40635 26.60660
## 3 23.14832 27.20347 29.15688
## 4 21.33121 26.27650 20.61601
## 5 26.04407 37.98274 43.11017
## 6 35.08638 18.89033 27.50780
## Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1 15.51901 14.49904 17.05954
## 2 23.35448 12.29871 17.04148
## 3 21.76531 20.58139 13.23527
## 4 18.64363 31.02243 18.31642
## 5 17.44667 15.86079 19.19898
## 6 18.17025 16.46885 15.18037
## Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1 15.50139 22.02124 59.06156
## 2 25.62144 16.42297 29.43557
## 3 18.78380 18.75276 33.07018
## 4 15.65744 22.42357 38.36952
## 5 12.03262 19.64487 51.37644
## 6 13.20924 26.51673 55.85404
## Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1 20.16316 36.54447 24.36854
## 2 20.60197 39.56771 27.53398
## 3 15.76464 60.34087 29.26406
## 4 22.36988 29.06728 23.79171
## 5 34.21045 28.25430 26.92548
## 6 14.08537 46.30949 32.91737
## Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1 16.58916 71.64483 43.23948
## 2 17.04288 80.96806 72.89021
## 3 18.46279 66.27261 56.96983
## 4 21.88265 60.36095 54.82444
## 5 20.85923 102.61530 55.42687
## 6 25.50856 51.15237 35.64776
## Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1 20.09754 33.01413 8.609319
## 2 25.08455 39.58836 7.249021
## 3 19.66783 38.49537 9.142852
## 4 21.90548 38.52790 10.809951
## 5 20.54839 28.82423 24.285132
## 6 19.48506 33.63112 10.733931
## Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1 31.82035 27.06322 22.189114
## 2 43.15264 27.76661 13.287931
## 3 43.18348 33.37388 15.686958
## 4 40.81558 60.06926 13.834432
## 5 50.33261 32.45963 9.912929
## 6 45.11738 29.03249 20.524764
## Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1 43.05822 35.09371 34.37424
## 2 47.68777 37.90771 50.50895
## 3 52.03077 30.40575 30.13461
## 4 47.55413 28.13396 30.64936
## 5 39.33369 22.83271 43.71672
## 6 40.20511 33.37773 33.46759
## Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1 32.27040 14.20441 23.45584
## 2 29.96935 23.32879 14.40262
## 3 38.58699 14.45487 51.44689
## 4 61.40487 14.97484 30.95148
## 5 55.44185 18.42424 30.75718
## 6 48.85811 15.79943 43.76389
## Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1 12.04484 33.61438 25.23156
## 2 19.42912 26.26096 16.56160
## 3 13.82278 24.13657 13.58578
## 4 14.95064 33.09392 21.99314
## 5 13.06079 21.28961 25.59947
## 6 13.23880 34.70037 19.66687
## Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1 23.81457 17.47219 51.39457
## 2 46.38081 18.70302 55.60095
## 3 28.62397 16.77057 65.63707
## 4 16.89601 12.10923 42.76211
## 5 17.95221 14.40513 31.01147
## 6 21.63674 13.62361 32.41461
## Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1 32.73614 29.56239 15.20396
## 2 78.31794 48.37211 26.58536
## 3 43.12171 39.48976 21.41928
## 4 29.90365 40.81556 15.23238
## 5 24.59404 68.37386 11.99703
## 6 36.70282 50.10856 30.27561
## Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1 16.99264 28.08311 39.58991
## 2 33.42081 37.71527 31.17144
## 3 32.91563 23.94722 30.45324
## 4 20.52140 24.96915 28.02090
## 5 13.33228 24.88406 23.71542
## 6 27.45445 22.96609 30.74962
## Antibodies_6months_10
## 1 39.65745
## 2 32.94535
## 3 29.12041
## 4 27.83572
## 5 20.28640
## 6 31.10116
all3withVits <- rbind(allSystemSummsFirst3,vitamins2)
head(all3withVits)
## proteinSearched gene EnsemblID
## 1 lymphatic FLT4 ENSG00000037280
## 2 lymphatic VEGFC ENSG00000150630
## 3 lymphatic LYVE1 ENSG00000133800
## 4 integumentary FLG ENSG00000143631
## 5 integumentary KIT ENSG00000157404
## 6 integumentary COL7A1 ENSG00000114270
## EntrezSummary
## 1 This gene encodes a tyrosine kinase receptor for vascular endothelial growth factors C and D. The protein is thought to be involved in lymphangiogenesis and maintenance of the lymphatic endothelium. Mutations in this gene cause hereditary lymphedema type IA. [provided by RefSeq, Jul 2008]
## 2 The protein encoded by this gene is a member of the platelet-derived growth factor/vascular endothelial growth factor (PDGF/VEGF) family. The encoded protein promotes angiogenesis and endothelial cell growth, and can also affect the permeability of blood vessels. The proprotein is further cleaved into a fully processed form that can bind and activate VEGFR-2 and VEGFR-3 receptors. [provided by RefSeq, Apr 2014]
## 3 This gene encodes a type I integral membrane glycoprotein. The encoded protein acts as a receptor and binds to both soluble and immobilized hyaluronan. This protein may function in lymphatic hyaluronan transport and have a role in tumor metastasis. [provided by RefSeq, Jul 2008]
## 4 The protein encoded by this gene is an intermediate filament-associated protein that aggregates keratin intermediate filaments in mammalian epidermis. It is initially synthesized as a polyprotein precursor, profilaggrin (consisting of multiple filaggrin units of 324 aa each), which is localized in keratohyalin granules, and is subsequently proteolytically processed into individual functional filaggrin molecules. Mutations in this gene are associated with ichthyosis vulgaris.[provided by RefSeq, Dec 2009]
## 5 This gene encodes a receptor tyrosine kinase. This gene was initially identified as a homolog of the feline sarcoma viral oncogene v-kit and is often referred to as proto-oncogene c-Kit. The canonical form of this glycosylated transmembrane protein has an N-terminal extracellular region with five immunoglobulin-like domains, a transmembrane region, and an intracellular tyrosine kinase domain at the C-terminus. Upon activation by its cytokine ligand, stem cell factor (SCF), this protein phosphorylates multiple intracellular proteins that play a role in in the proliferation, differentiation, migration and apoptosis of many cell types and thereby plays an important role in hematopoiesis, stem cell maintenance, gametogenesis, melanogenesis, and in mast cell development, migration and function. This protein can be a membrane-bound or soluble protein. Mutations in this gene are associated with gastrointestinal stromal tumors, mast cell disease, acute myelogenous leukemia, and piebaldism. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, May 2020]
## 6 This gene encodes the alpha chain of type VII collagen. The type VII collagen fibril, composed of three identical alpha collagen chains, is restricted to the basement zone beneath stratified squamous epithelia. It functions as an anchoring fibril between the external epithelia and the underlying stroma. Mutations in this gene are associated with all forms of dystrophic epidermolysis bullosa. In the absence of mutations, however, an acquired form of this disease can result from an autoimmune response made to type VII collagen. [provided by RefSeq, Jul 2008]
## GeneCardsSummary
## 1 FLT4 (Fms Related Receptor Tyrosine Kinase 4) is a Protein Coding gene. Diseases associated with FLT4 include Lymphatic Malformation 1 and Congenital Heart Defects, Multiple Types, 7. Among its related pathways are Signaling by GPCR and NF-KappaB Family Pathway. Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein tyrosine kinase activity. An important paralog of this gene is KDR.
## 2 VEGFC (Vascular Endothelial Growth Factor C) is a Protein Coding gene. Diseases associated with VEGFC include Lymphatic Malformation 4 and Hereditary Lymphedema Id. Among its related pathways are HIF1Alpha Pathway and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include growth factor activity and vascular endothelial growth factor receptor 3 binding. An important paralog of this gene is VEGFD.
## 3 LYVE1 (Lymphatic Vessel Endothelial Hyaluronan Receptor 1) is a Protein Coding gene. Diseases associated with LYVE1 include Intramuscular Hemangioma and Middle Cerebral Artery Infarction. Among its related pathways are Cell adhesion_Cell-matrix glycoconjugates and Glycosaminoglycan metabolism. Gene Ontology (GO) annotations related to this gene include hyaluronic acid binding. An important paralog of this gene is CD44.
## 4 FLG (Filaggrin) is a Protein Coding gene. Diseases associated with FLG include Dermatitis, Atopic, 2 and Ichthyosis Vulgaris. Among its related pathways are Keratinization and Developmental Biology. Gene Ontology (GO) annotations related to this gene include calcium ion binding and structural molecule activity. An important paralog of this gene is HRNR.
## 5 KIT (KIT Proto-Oncogene, Receptor Tyrosine Kinase) is a Protein Coding gene. Diseases associated with KIT include Gastrointestinal Stromal Tumor and Piebald Trait. Among its related pathways are RET signaling and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and protein kinase activity. An important paralog of this gene is CSF1R.
## 6 COL7A1 (Collagen Type VII Alpha 1 Chain) is a Protein Coding gene. Diseases associated with COL7A1 include Epidermolysis Bullosa Pruriginosa and Transient Bullous Dermolysis Of The Newborn. Among its related pathways are Integrin Pathway and Collagen chain trimerization. Gene Ontology (GO) annotations related to this gene include identical protein binding and serine-type endopeptidase inhibitor activity. An important paralog of this gene is COL2A1.
## UniProtKB_Summary
## 1 Tyrosine-protein kinase that acts as a cell-surface receptor for VEGFC and VEGFD, and plays an essential role in adult lymphangiogenesis and in the development of the vascular network and the cardiovascular system during embryonic development. Promotes proliferation, survival and migration of endothelial cells, and regulates angiogenic sprouting. Signaling by activated FLT4 leads to enhanced production of VEGFC, and to a lesser degree VEGFA, thereby creating a positive feedback loop that enhances FLT4 signaling. Modulates KDR signaling by forming heterodimers. The secreted isoform 3 may function as a decoy receptor for VEGFC and/or VEGFD and play an important role as a negative regulator of VEGFC-mediated lymphangiogenesis and angiogenesis. Binding of vascular growth factors to isoform 1 or isoform 2 leads to the activation of several signaling cascades; isoform 2 seems to be less efficient in signal transduction, because it has a truncated C-terminus and therefore lacks several phosphorylation sites. Mediates activation of the MAPK1/ERK2, MAPK3/ERK1 signaling pathway, of MAPK8 and the JUN signaling pathway, and of the AKT1 signaling pathway. Phosphorylates SHC1. Mediates phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Promotes phosphorylation of MAPK8 at 'Thr-183' and 'Tyr-185', and of AKT1 at 'Ser-473'.\n VGFR3_HUMAN,P35916\n
## 2 Growth factor active in angiogenesis, and endothelial cell growth, stimulating their proliferation and migration and also has effects on the permeability of blood vessels. May function in angiogenesis of the venous and lymphatic vascular systems during embryogenesis, and also in the maintenance of differentiated lymphatic endothelium in adults. Binds and activates KDR/VEGFR2 and FLT4/VEGFR3 receptors.\n VEGFC_HUMAN,P49767\n
## 3 Ligand-specific transporter trafficking between intracellular organelles (TGN) and the plasma membrane. Plays a role in autocrine regulation of cell growth mediated by growth regulators containing cell surface retention sequence binding (CRS). May act as a hyaluronan (HA) transporter, either mediating its uptake for catabolism within lymphatic endothelial cells themselves, or its transport into the lumen of afferent lymphatic vessels for subsequent re-uptake and degradation in lymph nodes.\n LYVE1_HUMAN,Q9Y5Y7\n
## 4 Aggregates keratin intermediate filaments and promotes disulfide-bond formation among the intermediate filaments during terminal differentiation of mammalian epidermis.\n FILA_HUMAN,P20930\n
## 5 Tyrosine-protein kinase that acts as cell-surface receptor for the cytokine KITLG/SCF and plays an essential role in the regulation of cell survival and proliferation, hematopoiesis, stem cell maintenance, gametogenesis, mast cell development, migration and function, and in melanogenesis. In response to KITLG/SCF binding, KIT can activate several signaling pathways. Phosphorylates PIK3R1, PLCG1, SH2B2/APS and CBL. Activates the AKT1 signaling pathway by phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Activated KIT also transmits signals via GRB2 and activation of RAS, RAF1 and the MAP kinases MAPK1/ERK2 and/or MAPK3/ERK1. Promotes activation of STAT family members STAT1, STAT3, STAT5A and STAT5B. Activation of PLCG1 leads to the production of the cellular signaling molecules diacylglycerol and inositol 1,4,5-trisphosphate. KIT signaling is modulated by protein phosphatases, and by rapid internalization and degradation of the receptor. Activated KIT promotes phosphorylation of the protein phosphatases PTPN6/SHP-1 and PTPRU, and of the transcription factors STAT1, STAT3, STAT5A and STAT5B. Promotes phosphorylation of PIK3R1, CBL, CRK (isoform Crk-II), LYN, MAPK1/ERK2 and/or MAPK3/ERK1, PLCG1, SRC and SHC1.\n KIT_HUMAN,P10721\n
## 6 Stratified squamous epithelial basement membrane protein that forms anchoring fibrils which may contribute to epithelial basement membrane organization and adherence by interacting with extracellular matrix (ECM) proteins such as type IV collagen.\n CO7A1_HUMAN,Q02388\n
## todaysDate
## 1 Thu Sep 03 13:49:53 2020
## 2 Thu Sep 03 13:49:56 2020
## 3 Thu Sep 03 13:49:57 2020
## 4 Thu Sep 03 13:45:05 2020
## 5 Thu Sep 03 13:45:06 2020
## 6 Thu Sep 03 13:45:07 2020
allTop3systems <- merge(all3withVits,systemsDF,by.x='gene',
by.y='gene')
There are more observations in the merged data, because there are more than one entry per gene in the original data to merge with the unique genes related to our body systems and OTC drugs, cannabidiol, alcohol, and dopamine.
dim(all375plus)
## [1] 1216 93
dim(allTop3systems)
## [1] 350 93
head(allTop3systems)
## gene proteinSearched EnsemblID
## 1 AANAT melatonin ENSG00000129673
## 2 ADH1B alcohol ENSG00000196616
## 3 ADH1B alcohol ENSG00000196616
## 4 ADH1B alcohol ENSG00000196616
## 5 ADH1B alcohol ENSG00000196616
## 6 ADH1B alcohol ENSG00000196616
## EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 3 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 4 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 6 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## GeneCardsSummary
## 1 AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene. Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome. Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism. Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.
## 2 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 3 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 4 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 5 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 6 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n SNAT_HUMAN,Q16613\n
## 2 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 3 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 4 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 5 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 6 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020 38.65447 43.30859 34.63334
## 2 Thu Sep 03 14:02:01 2020 37.89308 37.26308 23.94914
## 3 Thu Sep 03 14:02:01 2020 33.10215 39.16493 25.82138
## 4 Thu Sep 03 14:02:01 2020 36.52194 57.39502 35.30746
## 5 Thu Sep 03 14:02:01 2020 32.60534 36.08616 25.93888
## 6 Thu Sep 03 14:02:01 2020 29.32294 42.79331 21.29711
## healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1 19.88367 31.57302 17.33214 69.51032
## 2 17.88074 28.76127 24.04333 36.20929
## 3 18.49482 27.73537 18.57254 41.24310
## 4 20.50253 24.70425 21.05146 45.07237
## 5 20.11668 31.67117 19.19682 34.59911
## 6 18.51708 25.70072 19.08679 32.79410
## healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1 11.90238 29.29495 22.12536 84.22110
## 2 10.89609 23.64742 21.44924 70.99529
## 3 12.05248 15.14195 20.64977 66.46961
## 4 14.53013 17.09744 22.74018 96.51198
## 5 12.27169 18.57713 21.48818 85.75042
## 6 11.39472 18.52626 26.56640 83.33878
## healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1 50.67037 10.374223 31.11561 23.81043
## 2 51.22769 11.265804 41.29142 19.52040
## 3 44.22388 11.719167 26.24507 23.72086
## 4 50.89903 9.894555 37.57538 15.86959
## 5 55.35348 11.183972 38.27067 18.14704
## 6 49.70900 12.427741 34.60306 22.33815
## healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1 20.46186 41.48475 18.62474 46.30124
## 2 21.43320 41.84901 18.35966 111.66591
## 3 18.35441 43.20239 17.93534 55.34340
## 4 21.01759 37.74332 19.34599 65.39802
## 5 18.26951 45.02007 24.47236 96.89240
## 6 23.96481 42.46289 42.95787 75.40298
## healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1 45.55125 15.28551 71.80305 21.75414
## 2 46.01727 19.27535 60.70073 21.36817
## 3 43.43277 21.97501 69.80451 24.61685
## 4 27.45052 20.00639 67.25343 25.10416
## 5 31.17226 20.38369 43.01904 18.60533
## 6 28.97394 17.00365 47.91530 22.10635
## acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1 49.93791 20.91372 60.35461 29.56083
## 2 41.31194 22.03221 45.32658 35.77288
## 3 39.38423 22.70426 45.83045 33.39604
## 4 42.68764 24.91723 41.56295 26.09728
## 5 41.48576 17.26287 42.07435 35.45385
## 6 43.90134 20.33773 42.12905 29.83464
## acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1 115.26505 12.74919 14.58672 58.68188
## 2 81.45244 15.14020 17.59256 48.55106
## 3 95.05012 14.23976 15.65921 61.44041
## 4 225.20762 14.68151 13.63993 47.56154
## 5 85.27062 16.34465 17.65256 55.49571
## 6 120.56462 15.63079 15.93060 53.69469
## acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1 33.32225 23.25981 9.588795
## 2 42.21454 22.07492 7.248266
## 3 41.33231 21.94181 9.866293
## 4 40.75957 18.89717 7.216334
## 5 45.82252 19.04510 7.861977
## 6 44.52237 18.59789 7.733616
## acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1 20.86345 29.27038 43.14832
## 2 24.54406 25.49322 32.11470
## 3 19.76155 23.89585 59.11455
## 4 19.51075 23.89222 38.61305
## 5 29.57920 21.97918 37.74002
## 6 25.89079 22.59387 45.24864
## acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1 14.57474 28.45002 15.69350
## 2 11.75390 23.04194 15.93794
## 3 11.33043 22.03662 17.25698
## 4 13.01954 35.32621 21.17568
## 5 16.39203 23.02250 17.74235
## 6 10.52267 24.30442 17.48153
## acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1 21.95486 45.46997 28.11229
## 2 17.41908 55.92618 54.60819
## 3 19.55800 70.16983 44.13887
## 4 15.41068 50.07806 37.08963
## 5 18.09133 46.35208 47.28785
## 6 18.50505 52.38181 43.69408
## acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1 34.01966 36.43714 11.098142
## 2 21.31802 48.58598 11.585635
## 3 22.18329 31.00082 11.101384
## 4 24.19148 36.22127 7.693474
## 5 34.25241 25.66271 10.716630
## 6 35.25920 32.90474 11.448932
## acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1 20.95721 18.77368 22.93936
## 2 28.65290 21.22896 30.24262
## 3 19.88627 22.42390 31.73242
## 4 34.17640 23.67241 29.95222
## 5 30.62333 19.26661 28.75988
## 6 27.67493 23.11083 36.57107
## Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1 15.51901 14.49904 17.05954
## 2 27.99790 17.53823 17.45409
## 3 24.99516 16.98122 17.99978
## 4 25.08477 18.85538 15.41842
## 5 21.39371 18.72753 18.61650
## 6 18.73800 18.67212 15.78598
## Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1 15.50139 22.02124 59.06156
## 2 17.15096 20.74663 41.57386
## 3 26.85810 33.94878 43.68206
## 4 31.43632 17.65364 37.33166
## 5 20.21483 23.68388 45.04598
## 6 18.25358 22.51697 42.97516
## Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1 20.16316 36.54447 24.36854
## 2 17.66582 46.34294 26.26307
## 3 18.38415 38.46949 34.83685
## 4 15.72403 39.26543 27.46579
## 5 14.83868 37.63227 32.73706
## 6 15.53457 35.80330 33.23423
## Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1 16.58916 71.64483 43.23948
## 2 20.88941 69.77408 50.10176
## 3 21.46458 49.33115 49.48506
## 4 25.84191 66.44381 44.60148
## 5 20.71777 58.78334 49.35734
## 6 18.91623 51.90412 56.34813
## Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1 20.09754 33.01413 8.609319
## 2 22.28110 42.87377 7.016328
## 3 28.64540 47.57697 8.424463
## 4 25.71589 33.50081 6.436601
## 5 22.76641 37.61831 9.280659
## 6 19.49066 36.03596 7.416995
## Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1 31.82035 27.06322 22.18911
## 2 39.48848 33.98432 15.84832
## 3 34.23690 32.57032 13.98179
## 4 40.12788 35.27267 14.66785
## 5 54.02090 38.60195 32.63592
## 6 39.58097 34.13032 26.50713
## Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1 43.05822 35.09371 34.37424
## 2 36.98065 28.37421 42.29672
## 3 47.55496 45.95646 37.90742
## 4 44.62578 35.60238 36.38586
## 5 44.48999 37.91425 37.01849
## 6 43.65493 42.85820 34.90846
## Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1 32.27040 14.20441 23.45584
## 2 31.52575 21.26285 32.94306
## 3 39.12872 14.04135 28.71038
## 4 34.95027 19.32330 21.73639
## 5 38.18108 15.24316 36.73830
## 6 39.83798 13.71510 50.11779
## Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1 12.04484 33.61438 25.23156
## 2 15.10542 30.28620 27.50727
## 3 13.34428 28.29141 20.68341
## 4 13.17195 34.61330 16.93934
## 5 13.71118 27.07499 14.21209
## 6 13.10811 27.00124 18.82190
## Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1 23.81457 17.47219 51.39457
## 2 29.23864 18.47114 28.10985
## 3 27.43140 15.91317 45.41752
## 4 60.77292 14.68160 47.29560
## 5 26.86661 15.36001 47.90822
## 6 27.11552 16.07950 38.76281
## Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1 32.73614 29.56239 15.20396
## 2 35.55087 43.75245 13.49950
## 3 35.23614 38.91768 27.97887
## 4 50.26657 53.04640 23.69328
## 5 31.21551 47.96653 23.93062
## 6 34.74876 40.62482 21.30499
## Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1 16.99264 28.08311 39.58991
## 2 17.75465 21.52223 31.17582
## 3 24.47172 21.86029 28.76703
## 4 19.18802 19.77882 30.05590
## 5 21.74967 23.60711 24.50661
## 6 22.19385 25.53545 32.77911
## Antibodies_6months_10
## 1 39.65745
## 2 30.49779
## 3 39.56165
## 4 24.02602
## 5 26.39521
## 6 27.63658
bodySystems3_geneCounts <- allTop3systems %>% group_by(gene) %>%
count(gene)
bodySystems3_geneCounts <- bodySystems3_geneCounts[order(bodySystems3_geneCounts$n,decreasing=T),]
colnames(bodySystems3_geneCounts)[2] <- 'geneCounts'
bodySystems3_geneCounts
## # A tibble: 96 x 2
## # Groups: gene [96]
## gene geneCounts
## <fct> <int>
## 1 CYP19A1 25
## 2 ESR1 20
## 3 VDR 16
## 4 PTGS1 12
## 5 HFE 9
## 6 PTGS2 9
## 7 GFAP 8
## 8 ESR2 8
## 9 IGF1R 7
## 10 FLT4 6
## # ... with 86 more rows
bodySystemsTotal <- merge(bodySystems3_geneCounts,allTop3systems,by.x='gene',
by.y='gene')
head(bodySystemsTotal)
## gene geneCounts proteinSearched EnsemblID
## 1 AANAT 1 melatonin ENSG00000129673
## 2 ADH1B 5 alcohol ENSG00000196616
## 3 ADH1B 5 alcohol ENSG00000196616
## 4 ADH1B 5 alcohol ENSG00000196616
## 5 ADH1B 5 alcohol ENSG00000196616
## 6 ADH1B 5 alcohol ENSG00000196616
## EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 3 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 4 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 6 The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## GeneCardsSummary
## 1 AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene. Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome. Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism. Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.
## 2 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 3 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 4 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 5 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## 6 ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene. Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome. Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal). Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent. An important paralog of this gene is ADH1C.
## UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n SNAT_HUMAN,Q16613\n
## 2 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 3 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 4 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 5 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## 6 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n ADH1B_HUMAN,P00325\n
## todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020 38.65447 43.30859 34.63334
## 2 Thu Sep 03 14:02:01 2020 33.10215 39.16493 25.82138
## 3 Thu Sep 03 14:02:01 2020 29.32294 42.79331 21.29711
## 4 Thu Sep 03 14:02:01 2020 37.89308 37.26308 23.94914
## 5 Thu Sep 03 14:02:01 2020 32.60534 36.08616 25.93888
## 6 Thu Sep 03 14:02:01 2020 36.52194 57.39502 35.30746
## healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1 19.88367 31.57302 17.33214 69.51032
## 2 18.49482 27.73537 18.57254 41.24310
## 3 18.51708 25.70072 19.08679 32.79410
## 4 17.88074 28.76127 24.04333 36.20929
## 5 20.11668 31.67117 19.19682 34.59911
## 6 20.50253 24.70425 21.05146 45.07237
## healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1 11.90238 29.29495 22.12536 84.22110
## 2 12.05248 15.14195 20.64977 66.46961
## 3 11.39472 18.52626 26.56640 83.33878
## 4 10.89609 23.64742 21.44924 70.99529
## 5 12.27169 18.57713 21.48818 85.75042
## 6 14.53013 17.09744 22.74018 96.51198
## healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1 50.67037 10.374223 31.11561 23.81043
## 2 44.22388 11.719167 26.24507 23.72086
## 3 49.70900 12.427741 34.60306 22.33815
## 4 51.22769 11.265804 41.29142 19.52040
## 5 55.35348 11.183972 38.27067 18.14704
## 6 50.89903 9.894555 37.57538 15.86959
## healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1 20.46186 41.48475 18.62474 46.30124
## 2 18.35441 43.20239 17.93534 55.34340
## 3 23.96481 42.46289 42.95787 75.40298
## 4 21.43320 41.84901 18.35966 111.66591
## 5 18.26951 45.02007 24.47236 96.89240
## 6 21.01759 37.74332 19.34599 65.39802
## healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1 45.55125 15.28551 71.80305 21.75414
## 2 43.43277 21.97501 69.80451 24.61685
## 3 28.97394 17.00365 47.91530 22.10635
## 4 46.01727 19.27535 60.70073 21.36817
## 5 31.17226 20.38369 43.01904 18.60533
## 6 27.45052 20.00639 67.25343 25.10416
## acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1 49.93791 20.91372 60.35461 29.56083
## 2 39.38423 22.70426 45.83045 33.39604
## 3 43.90134 20.33773 42.12905 29.83464
## 4 41.31194 22.03221 45.32658 35.77288
## 5 41.48576 17.26287 42.07435 35.45385
## 6 42.68764 24.91723 41.56295 26.09728
## acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1 115.26505 12.74919 14.58672 58.68188
## 2 95.05012 14.23976 15.65921 61.44041
## 3 120.56462 15.63079 15.93060 53.69469
## 4 81.45244 15.14020 17.59256 48.55106
## 5 85.27062 16.34465 17.65256 55.49571
## 6 225.20762 14.68151 13.63993 47.56154
## acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1 33.32225 23.25981 9.588795
## 2 41.33231 21.94181 9.866293
## 3 44.52237 18.59789 7.733616
## 4 42.21454 22.07492 7.248266
## 5 45.82252 19.04510 7.861977
## 6 40.75957 18.89717 7.216334
## acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1 20.86345 29.27038 43.14832
## 2 19.76155 23.89585 59.11455
## 3 25.89079 22.59387 45.24864
## 4 24.54406 25.49322 32.11470
## 5 29.57920 21.97918 37.74002
## 6 19.51075 23.89222 38.61305
## acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1 14.57474 28.45002 15.69350
## 2 11.33043 22.03662 17.25698
## 3 10.52267 24.30442 17.48153
## 4 11.75390 23.04194 15.93794
## 5 16.39203 23.02250 17.74235
## 6 13.01954 35.32621 21.17568
## acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1 21.95486 45.46997 28.11229
## 2 19.55800 70.16983 44.13887
## 3 18.50505 52.38181 43.69408
## 4 17.41908 55.92618 54.60819
## 5 18.09133 46.35208 47.28785
## 6 15.41068 50.07806 37.08963
## acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1 34.01966 36.43714 11.098142
## 2 22.18329 31.00082 11.101384
## 3 35.25920 32.90474 11.448932
## 4 21.31802 48.58598 11.585635
## 5 34.25241 25.66271 10.716630
## 6 24.19148 36.22127 7.693474
## acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1 20.95721 18.77368 22.93936
## 2 19.88627 22.42390 31.73242
## 3 27.67493 23.11083 36.57107
## 4 28.65290 21.22896 30.24262
## 5 30.62333 19.26661 28.75988
## 6 34.17640 23.67241 29.95222
## Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1 15.51901 14.49904 17.05954
## 2 24.99516 16.98122 17.99978
## 3 18.73800 18.67212 15.78598
## 4 27.99790 17.53823 17.45409
## 5 21.39371 18.72753 18.61650
## 6 25.08477 18.85538 15.41842
## Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1 15.50139 22.02124 59.06156
## 2 26.85810 33.94878 43.68206
## 3 18.25358 22.51697 42.97516
## 4 17.15096 20.74663 41.57386
## 5 20.21483 23.68388 45.04598
## 6 31.43632 17.65364 37.33166
## Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1 20.16316 36.54447 24.36854
## 2 18.38415 38.46949 34.83685
## 3 15.53457 35.80330 33.23423
## 4 17.66582 46.34294 26.26307
## 5 14.83868 37.63227 32.73706
## 6 15.72403 39.26543 27.46579
## Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1 16.58916 71.64483 43.23948
## 2 21.46458 49.33115 49.48506
## 3 18.91623 51.90412 56.34813
## 4 20.88941 69.77408 50.10176
## 5 20.71777 58.78334 49.35734
## 6 25.84191 66.44381 44.60148
## Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1 20.09754 33.01413 8.609319
## 2 28.64540 47.57697 8.424463
## 3 19.49066 36.03596 7.416995
## 4 22.28110 42.87377 7.016328
## 5 22.76641 37.61831 9.280659
## 6 25.71589 33.50081 6.436601
## Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1 31.82035 27.06322 22.18911
## 2 34.23690 32.57032 13.98179
## 3 39.58097 34.13032 26.50713
## 4 39.48848 33.98432 15.84832
## 5 54.02090 38.60195 32.63592
## 6 40.12788 35.27267 14.66785
## Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1 43.05822 35.09371 34.37424
## 2 47.55496 45.95646 37.90742
## 3 43.65493 42.85820 34.90846
## 4 36.98065 28.37421 42.29672
## 5 44.48999 37.91425 37.01849
## 6 44.62578 35.60238 36.38586
## Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1 32.27040 14.20441 23.45584
## 2 39.12872 14.04135 28.71038
## 3 39.83798 13.71510 50.11779
## 4 31.52575 21.26285 32.94306
## 5 38.18108 15.24316 36.73830
## 6 34.95027 19.32330 21.73639
## Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1 12.04484 33.61438 25.23156
## 2 13.34428 28.29141 20.68341
## 3 13.10811 27.00124 18.82190
## 4 15.10542 30.28620 27.50727
## 5 13.71118 27.07499 14.21209
## 6 13.17195 34.61330 16.93934
## Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1 23.81457 17.47219 51.39457
## 2 27.43140 15.91317 45.41752
## 3 27.11552 16.07950 38.76281
## 4 29.23864 18.47114 28.10985
## 5 26.86661 15.36001 47.90822
## 6 60.77292 14.68160 47.29560
## Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1 32.73614 29.56239 15.20396
## 2 35.23614 38.91768 27.97887
## 3 34.74876 40.62482 21.30499
## 4 35.55087 43.75245 13.49950
## 5 31.21551 47.96653 23.93062
## 6 50.26657 53.04640 23.69328
## Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1 16.99264 28.08311 39.58991
## 2 24.47172 21.86029 28.76703
## 3 22.19385 25.53545 32.77911
## 4 17.75465 21.52223 31.17582
## 5 21.74967 23.60711 24.50661
## 6 19.18802 19.77882 30.05590
## Antibodies_6months_10
## 1 39.65745
## 2 39.56165
## 3 27.63658
## 4 30.49779
## 5 26.39521
## 6 24.02602
colnames(bodySystemsTotal)
## [1] "gene" "geneCounts" "proteinSearched"
## [4] "EnsemblID" "EntrezSummary" "GeneCardsSummary"
## [7] "UniProtKB_Summary" "todaysDate" "healthyControl_1"
## [10] "healthyControl_2" "healthyControl_3" "healthyControl_4"
## [13] "healthyControl_5" "healthyControl_6" "healthyControl_7"
## [16] "healthyControl_8" "healthyControl_9" "healthyControl_10"
## [19] "healthyControl_11" "healthyControl_12" "healthyControl_13"
## [22] "healthyControl_14" "healthyControl_15" "healthyControl_16"
## [25] "healthyControl_17" "healthyControl_18" "healthyControl_19"
## [28] "healthyControl_20" "healthyControl_21" "acuteLymeDisease_1"
## [31] "acuteLymeDisease_2" "acuteLymeDisease_3" "acuteLymeDisease_4"
## [34] "acuteLymeDisease_5" "acuteLymeDisease_6" "acuteLymeDisease_7"
## [37] "acuteLymeDisease_8" "acuteLymeDisease_9" "acuteLymeDisease_10"
## [40] "acuteLymeDisease_11" "acuteLymeDisease_12" "acuteLymeDisease_13"
## [43] "acuteLymeDisease_14" "acuteLymeDisease_15" "acuteLymeDisease_16"
## [46] "acuteLymeDisease_17" "acuteLymeDisease_18" "acuteLymeDisease_19"
## [49] "acuteLymeDisease_20" "acuteLymeDisease_21" "acuteLymeDisease_22"
## [52] "acuteLymeDisease_23" "acuteLymeDisease_24" "acuteLymeDisease_25"
## [55] "acuteLymeDisease_26" "acuteLymeDisease_27" "acuteLymeDisease_28"
## [58] "Antibodies_1month_1" "Antibodies_1month_2" "Antibodies_1month_3"
## [61] "Antibodies_1month_4" "Antibodies_1month_5" "Antibodies_1month_6"
## [64] "Antibodies_1month_7" "Antibodies_1month_8" "Antibodies_1month_9"
## [67] "Antibodies_1month_10" "Antibodies_1month_11" "Antibodies_1month_12"
## [70] "Antibodies_1month_13" "Antibodies_1month_14" "Antibodies_1month_15"
## [73] "Antibodies_1month_16" "Antibodies_1month_17" "Antibodies_1month_18"
## [76] "Antibodies_1month_19" "Antibodies_1month_20" "Antibodies_1month_21"
## [79] "Antibodies_1month_22" "Antibodies_1month_23" "Antibodies_1month_24"
## [82] "Antibodies_1month_25" "Antibodies_1month_26" "Antibodies_1month_27"
## [85] "Antibodies_6months_1" "Antibodies_6months_2" "Antibodies_6months_3"
## [88] "Antibodies_6months_4" "Antibodies_6months_5" "Antibodies_6months_6"
## [91] "Antibodies_6months_7" "Antibodies_6months_8" "Antibodies_6months_9"
## [94] "Antibodies_6months_10"
bodySystems3_geneMeans <- bodySystemsTotal %>% group_by(gene) %>%
summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)
BodySystems_countsAndMeans <- merge(bodySystems3_geneCounts,
bodySystems3_geneMeans,
by.x='gene',by.y='gene')
We just added the means of each gene per sample and the counts of each gene in the total data. Now we will get the group means for each of healthy, acute, 1 month of treatment, and 6 months of treatment.
BS1 <- BodySystems_countsAndMeans %>% group_by(gene) %>%
mutate(
healthyMean = mean(healthyControl_1:healthyControl_21),
acuteMean=mean(acuteLymeDisease_1:acuteLymeDisease_28),
month1 = mean(Antibodies_1month_1:Antibodies_1month_27),
month6=mean(Antibodies_6months_1:Antibodies_6months_10)
)
Lets get the fold change values of these genes per group.
BS1$acuteHealthy_foldChange <- BS1$acuteMean/BS1$healthyMean
BS1$month1Healthy_foldChange <- BS1$month1/BS1$healthyMean
BS1$month6Healthy_foldChange <- BS1$month6/BS1$healthyMean
library(tidyr)
BS1_tidy <- gather(BS1,key='sample',value='sampleValue',3:88)
BS1_tidy$group <- 'group'
healthy <- grep('healthy',BS1_tidy$sample)
acute <- grep('acute',BS1_tidy$sample)
month_1 <- grep('1month',BS1_tidy$sample)
month_6 <- grep('6month',BS1_tidy$sample)
BS1_tidy[healthy,12] <- 'healthy'
BS1_tidy[acute,12] <- 'acute'
BS1_tidy[month_1,12] <- 'month 1'
BS1_tidy[month_6,12] <- 'month 6'
summs3 <- all3withVits[,c(1,2,4)]
BS1_tidy2_summs <- merge(summs3,BS1_tidy,by.x='gene',by.y='gene')
colnames(BS1_tidy2_summs)
## [1] "gene" "proteinSearched"
## [3] "EntrezSummary" "geneCounts"
## [5] "healthyMean" "acuteMean"
## [7] "month1" "month6"
## [9] "acuteHealthy_foldChange" "month1Healthy_foldChange"
## [11] "month6Healthy_foldChange" "sample"
## [13] "sampleValue" "group"
write.csv(BS1_tidy2_summs,'bodySystemLymeDiseaseGenes.csv',row.names=F)
Lets also write out theses body system and vitamins/minerals/hormone genes to use in future gene expression analysis.
write.csv(all3withVits,'vitaminAndBodySystemSums.csv',row.names=F)
These genes were then analyzed into a fold change, mean value, sample value, and filters for selcting by gene, body system, or group in a Tableau dashboard.
Tableau Dashboard on Lyme Disease Body System Genes.
Tableau Dashboard of Body System Genes
Figure 11: The body system genes as they relate to Lyme disease after 1-6 months of treatment, in the acute phase or a person who is healthy and doesn’t have Lyme disease. The filters at the upper left can be used to select specific body systems, genes, or groups (acute, healthy, 1 month, or 6 months). The upper right corner is the Entrez gene summary of the genes. The middle left is the mean values of each gene in each group with the acute, 1 month, and 6 month mean values compared to the healthy mean values. The middle right is a bar chart of the fold change values of the acute/healthy, 1 month/healthy, and the 6 months/healthy mean value ratios for each gene. The bottom is the gene expression value per gene in each sample colored by group membership in healthy, acute, 1 month, or 6 months of treatment. To select multiple genes, use ctrl+click, to deselect click each gene again.
y*(max(y)-min(y))+min(y)↩