This analysis looks at Lyme disease using GEO series data that is available in normalized form under accession number GSE145974 on ncbi.nlm.nih.gov. The data comes from platform GPL13667 and the accompanying series matrix file. There are also CEL/TAR files, but I couldn't get my Ubuntu machines to recognize them, and the instructions and tutorials for the SRA Toolkit and the Ubuntu app for Windows didn't help, so I am using the text files only. If you have a Windows 10 tutorial for running the SRA Toolkit in the Ubuntu app for Windows, or for getting the CEL files to work in a VirtualBox disk image of Ubuntu, and it worked exactly as explained when you tried it within the last 24 hours, please share. I have yet to get those up and running. Possibly the recent updates to VirtualBox or my other apps, like Docker, MongoDB, or Tableau, are interfering. I am not going to spend time figuring it out; it's a time trap.

The series file was fully populated with values for the feature names. I mention this because I recall exploring other platform and series text files where only the header information was present and none of the values. The samples were processed by microarray expression profiling of peripheral blood mononuclear cells (PBMC). The values appear to be scaled or normalized already, since they include negative values.
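A quick sketch of how that impression about negative values could be checked once a matrix is loaded. The toy numbers and the names `expr`, `vals`, `value_range`, `has_negative`, and `neg_fraction` are placeholders for illustration and do not come from the series file.

```r
# Toy stand-in for the expression matrix: rows are probes, columns are samples.
# These numbers are made up for illustration; they are not from GSE145974.
expr <- data.frame(
  ID_REF = c("probe_a", "probe_b", "probe_c"),
  GSM_1  = c(-0.42, 1.30, 0.05),
  GSM_2  = c(0.77, -1.10, 0.00)
)

# Keep only the numeric sample columns.
vals <- as.matrix(expr[, -1])

# Overall range, whether any value is negative, and the share of negatives.
value_range  <- range(vals, na.rm = TRUE)
has_negative <- any(vals < 0, na.rm = TRUE)
neg_fraction <- mean(vals < 0, na.rm = TRUE)

print(value_range)
print(has_negative)
print(neg_fraction)
```

A range that dips below zero is only a hint that some centering or scaling was applied; it does not say which method was used.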

Body System Genes Section

There is a research article from the study's researchers that accompanies the data, available for free through PubMed:

“Global Transcriptome Analysis Identifies a Diagnostic Signature for Early Disseminated Lyme Disease and Its Resolution” authored by the following researchers: Mary M. Petzke (a), Konstantin Volyanskyy (b), Yong Mao (b), Byron Arevalo (a), Raphael Zohn (a), Johanna Quituisaca (a), Gary P. Wormser (c), Nevenka Dimitrova (b), Ira Schwartz (a)

They are with (a) the Department of Microbiology and Immunology, School of Medicine, New York Medical College, Valhalla, New York, USA; (b) Philips Research North America, Valhalla, New York, USA; and (c) the Division of Infectious Diseases, Department of Medicine, New York Medical College, Valhalla, New York, USA.

Citation: Petzke MM, Volyanskyy K, Mao Y, Arevalo B, Zohn R, Quituisaca J, Wormser GP, Dimitrova N, Schwartz I. 2020. Global transcriptome analysis identifies a diagnostic signature for early disseminated Lyme disease and its resolution. mBio 11:e00047-20. https://doi.org/10.1128/mBio.00047-20.

Editor: Steven J. Norris, McGovern Medical School.

Copyright © 2020 Petzke et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

Address correspondence to Mary M. Petzke, mpetzke@nymc.edu.

This article is a direct contribution from Ira Schwartz, a Fellow of the American Academy of Microbiology, who arranged for and secured reviews by Patricia Rosa, NIAID, NIH, and John Leong, Tufts University School of Medicine.

Received 9 January 2020. Accepted 31 January 2020. Published 17 March 2020.

“ABSTRACT A bioinformatics approach was employed to identify transcriptome alterations in the peripheral blood mononuclear cells of well-characterized human subjects who were diagnosed with early disseminated Lyme disease (LD) based on stringent microbiological and clinical criteria. Transcriptomes were assessed at the time of presentation and also at approximately 1 month (early convalescence) and 6 months (late convalescence) after initiation of an appropriate antibiotic regimen. Comparative transcriptomics identified 335 transcripts, representing 233 unique genes, with significant alterations of at least 2-fold expression in acute- or convalescent-phase blood samples from LD subjects relative to healthy donors. Acute-phase blood samples from LD subjects had the largest number of differentially expressed transcripts (187 induced, 54 repressed). This transcriptional profile, which was dominated by interferon-regulated genes, was sustained during early convalescence. 6 months after antibiotic treatment the transcriptome of LD subjects was indistinguishable from that of healthy controls based on two separate methods of analysis. Return of the LD expression profile to levels found in control subjects was concordant with disease outcome; 82% of subjects with LD experienced at least one symptom at the baseline visit compared to 43% at the early convalescence time point and only a single patient (9%) at the 6-month convalescence time point. Using the random forest machine learning algorithm, we developed an efficient computational framework to identify sets of 20 classifier genes that discriminated LD from other bacterial and viral infections. These novel LD biomarkers not only differentiated subjects with acute disseminated LD from healthy controls with 96% accuracy but also distinguished between subjects with acute and resolved (late convalescent) disease with 97% accuracy.

IMPORTANCE Lyme disease (LD), caused by Borrelia burgdorferi, is the most common tick-borne infectious disease in the United States. We examined gene expression patterns in the blood of individuals with early disseminated LD at the time of diagnosis (acute) and also at approximately 1 month and 6 months following antibiotic treatment. A distinct acute LD profile was observed that was sustained during early convalescence (1 month) but returned to control levels 6 months after treatment. Using a computer learning algorithm, we identified sets of 20 classifier genes that discriminate LD from other bacterial and viral infections. In addition, these novel LD biomarkers are highly accurate in distinguishing patients with acute LD from healthy subjects and in discriminating between individuals with active and resolved infection. This computational approach offers the potential for more accurate diagnosis of early disseminated Lyme disease. It may also allow improved monitoring of treatment efficacy and disease resolution.”

***

The study authors used random forest, the same algorithm I usually go to first, and it scored well. It tends to perform well in classification, though other algorithms sometimes perform better. Data scientists are advised not to rely on a single algorithm for all data, since not all data is the same; some alternatives are almost as good and take much less time, depending on how many trees the forest is tuned to grow. I want to see whether genes similar to the ones the authors found can be discovered, and then use them to predict samples from other studies. However, this data is already normalized and the normalization method was not given, so the first part of this study is an attempt to recover the original raw values. These are RNA samples from blood (PBMC). I do have some COVID-19 samples that are also peripheral blood mononuclear cells, but those were processed by high-throughput expression profiling rather than microarray. So instead, I can split this data and see whether a model can predict the held-out samples of the testing set.
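A minimal sketch of the kind of split described above, written in base R so it runs without extra packages. The label counts mirror the four groups in this series (21/28/27/10). The 70% training share, the seed, and the names `labels`, `train_idx`, and `test_idx` are assumptions chosen for illustration. The caret package loaded later also provides `createDataPartition()` for the same purpose.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Toy labels mirroring the four sample groups in this series (21/28/27/10).
labels <- factor(c(rep("healthyControl", 21),
                   rep("acuteLymeDisease", 28),
                   rep("Antibodies_1month", 27),
                   rep("Antibodies_6months", 10)))

# Draw roughly 70% of the row indices within each class (stratified sampling),
# so that every class is represented in both the training and testing sets.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}))
train_idx <- sort(unname(train_idx))
test_idx  <- setdiff(seq_along(labels), train_idx)

# Class counts in each part of the split.
print(table(labels[train_idx]))
print(table(labels[test_idx]))
```

Stratifying matters here because the smallest group has only 10 samples, and a plain random split could leave very few of them in the testing set.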

library(MASS)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(MASS)
#library(gbm)
library(RANN) #used with the rf method tuning in caret ('oob' = out-of-bag)

## Warning: package 'RANN' was built under R version 3.6.3

ticks <- read.delim('GSE145974_series_matrix.txt',sep='\t',header=T,
                    comment.char = '!',na.strings=c('',' ','NA'))

GSM_IDs <- colnames(ticks)[2:87]
Affy_IDs <- ticks$ID_REF

comments <- read.delim('GSE145974_series_matrix.txt',sep='\n',header=T,
                       na.strings=c('',' ','NA'))

Sample GSM IDs and description

descriptors <- comments[27:28,]
head(descriptors)

## [1] !Sample_title\tPBMC total RNA-Healthy control 1\tPBMC total RNA-Healthy control 2\tPBMC total RNA-Healthy control 3\tPBMC total RNA-Healthy control 4\tPBMC total RNA-Healthy control 5\tPBMC total RNA-Healthy control 6\tPBMC total RNA-Healthy control 7\tPBMC total RNA-Healthy control 8\tPBMC total RNA-Healthy control 9\tPBMC total RNA-Healthy control 10\tPBMC total RNA-Healthy control 11\tPBMC total RNA-Healthy control 12\tPBMC total RNA-Healthy control 13\tPBMC total RNA-Healthy control 14\tPBMC total RNA-Healthy control 15\tPBMC total RNA-Healthy control 16\tPBMC total RNA-Healthy control 17\tPBMC total RNA-Healthy control 18\tPBMC total RNA-Healthy control 19\tPBMC total RNA-Healthy control 20\tPBMC total RNA-Healthy control 21\tPBMC total RNA-Acute Lyme disease subject 1\tPBMC total RNA-Acute Lyme disease subject 2\tPBMC total RNA-Acute Lyme disease subject 3\tPBMC total RNA-Acute Lyme disease subject 4\tPBMC total RNA-Acute Lyme disease subject 5\tPBMC total RNA-Acute Lyme disease subject 6\tPBMC total RNA-Acute Lyme disease subject 7\tPBMC total RNA-Acute Lyme disease subject 8\tPBMC total RNA-Acute Lyme disease subject 9\tPBMC total RNA-Acute Lyme disease subject 10\tPBMC total RNA-Acute Lyme disease subject 11\tPBMC total RNA-Acute Lyme disease subject 12\tPBMC total RNA-Acute Lyme disease subject 13\tPBMC total RNA-Acute Lyme disease subject 14\tPBMC total RNA-Acute Lyme disease subject 15\tPBMC total RNA-Acute Lyme disease subject 16\tPBMC total RNA-Acute Lyme disease subject 17\tPBMC total RNA-Acute Lyme disease subject 18\tPBMC total RNA-Acute Lyme disease subject 19\tPBMC total RNA-Acute Lyme disease subject 20\tPBMC total RNA-Acute Lyme disease subject 21\tPBMC total RNA-Acute Lyme disease subject 22\tPBMC total RNA-Acute Lyme disease subject 23\tPBMC total RNA-Acute Lyme disease subject 24\tPBMC total RNA-Acute Lyme disease subject 25\tPBMC total RNA-Acute Lyme disease subject 26\tPBMC total RNA-Acute Lyme disease subject 27\tPBMC total 
RNA-Acute Lyme disease subject 28\tPBMC total RNA-early convalescent Lyme disease subject 1\tPBMC total RNA-early convalescent Lyme disease subject 2\tPBMC total RNA-early convalescent Lyme disease subject 3\tPBMC total RNA-early convalescent Lyme disease subject 4\tPBMC total RNA-early convalescent Lyme disease subject 5\tPBMC total RNA-early convalescent Lyme disease subject 6\tPBMC total RNA-early convalescent Lyme disease subject 7\tPBMC total RNA-early convalescent Lyme disease subject 8\tPBMC total RNA-early convalescent Lyme disease subject 9\tPBMC total RNA-early convalescent Lyme disease subject 10\tPBMC total RNA-early convalescent Lyme disease subject 11\tPBMC total RNA-early convalescent Lyme disease subject 12\tPBMC total RNA-early convalescent Lyme disease subject 13\tPBMC total RNA-early convalescent Lyme disease subject 14\tPBMC total RNA-early convalescent Lyme disease subject 15\tPBMC total RNA-early convalescent Lyme disease subject 16\tPBMC total RNA-early convalescent Lyme disease subject 17\tPBMC total RNA-early convalescent Lyme disease subject 18\tPBMC total RNA-early convalescent Lyme disease subject 19\tPBMC total RNA-early convalescent Lyme disease subject 20\tPBMC total RNA-early convalescent Lyme disease subject 21\tPBMC total RNA-early convalescent Lyme disease subject 22\tPBMC total RNA-early convalescent Lyme disease subject 23\tPBMC total RNA-early convalescent Lyme disease subject 24\tPBMC total RNA-early convalescent Lyme disease subject 25\tPBMC total RNA-early convalescent Lyme disease subject 26\tPBMC total RNA-early convalescent Lyme disease subject 27\tPBMC total RNA-late convalescent Lyme disease subject 1\tPBMC total RNA-late convalescent Lyme disease subject 2\tPBMC total RNA-late convalescent Lyme disease subject 3\tPBMC total RNA-late convalescent Lyme disease subject 4\tPBMC total RNA-late convalescent Lyme disease subject 5\tPBMC total RNA-late convalescent Lyme disease subject 6\tPBMC total RNA-late convalescent Lyme 
disease subject 7\tPBMC total RNA-late convalescent Lyme disease subject 8\tPBMC total RNA-late convalescent Lyme disease subject 9\tPBMC total RNA-late convalescent Lyme disease subject 10
## [2] !Sample_geo_accession\tGSM4340492\tGSM4340493\tGSM4340494\tGSM4340495\tGSM4340496\tGSM4340497\tGSM4340498\tGSM4340499\tGSM4340500\tGSM4340501\tGSM4340502\tGSM4340503\tGSM4340504\tGSM4340505\tGSM4340506\tGSM4340507\tGSM4340508\tGSM4340509\tGSM4340510\tGSM4340511\tGSM4340512\tGSM4340513\tGSM4340514\tGSM4340515\tGSM4340516\tGSM4340517\tGSM4340518\tGSM4340519\tGSM4340520\tGSM4340521\tGSM4340522\tGSM4340523\tGSM4340524\tGSM4340525\tGSM4340526\tGSM4340527\tGSM4340528\tGSM4340529\tGSM4340530\tGSM4340531\tGSM4340532\tGSM4340533\tGSM4340534\tGSM4340535\tGSM4340536\tGSM4340537\tGSM4340538\tGSM4340539\tGSM4340540\tGSM4340541\tGSM4340542\tGSM4340543\tGSM4340544\tGSM4340545\tGSM4340546\tGSM4340547\tGSM4340548\tGSM4340549\tGSM4340550\tGSM4340551\tGSM4340552\tGSM4340553\tGSM4340554\tGSM4340555\tGSM4340556\tGSM4340557\tGSM4340558\tGSM4340559\tGSM4340560\tGSM4340561\tGSM4340562\tGSM4340563\tGSM4340564\tGSM4340565\tGSM4340566\tGSM4340567\tGSM4340568\tGSM4340569\tGSM4340570\tGSM4340571\tGSM4340572\tGSM4340573\tGSM4340574\tGSM4340575\tGSM4340576\tGSM4340577
## 49449 Levels: !Sample_channel_count\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1 ...
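The two descriptor lines above are picked by row number (27 and 28), which depends on how this particular file is laid out. A hedged alternative is to find them by their field name. The sketch below writes a tiny made-up file in a simplified series-matrix shape (no quoting), so the file content and the names `tmp`, `all_lines`, `title_line`, and `gsm_line` are all placeholders.

```r
# Write a tiny, simplified stand-in for a series matrix file (made-up content).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "!Series_title\ttoy series",
  "!Sample_title\tcontrol 1\tcase 1",
  "!Sample_geo_accession\tGSM0000001\tGSM0000002",
  "!series_matrix_table_begin",
  "ID_REF\tGSM0000001\tGSM0000002",
  "probe_a\t0.1\t-0.2",
  "!series_matrix_table_end"
), tmp)

# Read every line as plain text, including the '!' metadata lines.
all_lines <- readLines(tmp)

# Pick the two metadata lines by their prefix rather than by row number.
title_line <- grep("^!Sample_title\t", all_lines, value = TRUE)
gsm_line   <- grep("^!Sample_geo_accession\t", all_lines, value = TRUE)

print(title_line)
print(gsm_line)
```

Matching on the field name keeps working if a file has more or fewer header lines than this one.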

descriptors <- gsub('!','',descriptors)
descriptors <- gsub('\\t',',',descriptors)

split1 <- strsplit(descriptors[1],split=',')
type <- split1[1]
type2 <- as.data.frame(type)
colnames(type2) <- 'Sample_Title'

split2 <- strsplit(descriptors[2],split=',')
gsm <- split2[1]
gsm2 <- as.data.frame(gsm)
colnames(gsm2) <- 'Sample_GEO_Accession'

names <- cbind(type2,gsm2)
names$Sample_Title <- as.character(paste(names$Sample_Title))
names$Sample_GEO_Accession <- as.character(paste(names$Sample_GEO_Accession))
names2 <- names[-1,]
row.names(names2) <- NULL

write.csv(names2,'descriptors.csv',row.names=F)
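The same table can be built a little more directly by splitting on the tab character and using `[[1]]` to take the character vector out of the list that `strsplit()` returns. This is only a sketch on two made-up lines; `title_line`, `gsm_line`, `titles`, `gsms`, and `sample_table` are illustrative names, not objects from the code above.

```r
# Two toy metadata lines in the same shape as the ones above (made-up values).
title_line <- "!Sample_title\tHealthy control 1\tAcute subject 1"
gsm_line   <- "!Sample_geo_accession\tGSM0000001\tGSM0000002"

# Split on the tab character; [[1]] pulls the character vector out of the list.
titles <- strsplit(title_line, "\t", fixed = TRUE)[[1]]
gsms   <- strsplit(gsm_line, "\t", fixed = TRUE)[[1]]

# Drop the first element of each (the field name) and bind the rest as columns.
sample_table <- data.frame(
  Sample_Title         = titles[-1],
  Sample_GEO_Accession = gsms[-1],
  stringsAsFactors     = FALSE
)
print(sample_table)
```

Splitting on the tab itself avoids the intermediate comma substitution, which would misbehave if a sample title ever contained a comma.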

descriptors2 <- read.csv('descriptors.csv',sep=',',na.strings=c('',' ','NA'),
                           header=TRUE)
head(descriptors2)

##                       Sample_Title Sample_GEO_Accession
## 1 PBMC total RNA-Healthy control 1           GSM4340492
## 2 PBMC total RNA-Healthy control 2           GSM4340493
## 3 PBMC total RNA-Healthy control 3           GSM4340494
## 4 PBMC total RNA-Healthy control 4           GSM4340495
## 5 PBMC total RNA-Healthy control 5           GSM4340496
## 6 PBMC total RNA-Healthy control 6           GSM4340497
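Since class labels get attached to samples by position later on, it may be worth confirming that the sample columns of the expression table line up with the accession column of the descriptor table. A small sketch with made-up data follows; `expr`, `desc`, `sample_cols`, `same_order`, and `desc_aligned` are placeholder names.

```r
# Toy stand-ins: an expression table and a descriptor table (made-up values).
expr <- data.frame(ID_REF     = c("probe_a", "probe_b"),
                   GSM0000001 = c(0.1, -0.3),
                   GSM0000002 = c(1.2, 0.4))
desc <- data.frame(Sample_Title         = c("control 1", "case 1"),
                   Sample_GEO_Accession = c("GSM0000001", "GSM0000002"),
                   stringsAsFactors     = FALSE)

# Sample columns are everything except the probe ID column.
sample_cols <- colnames(expr)[-1]

# TRUE only if the same IDs appear in the same order in both objects.
same_order <- identical(sample_cols, desc$Sample_GEO_Accession)
print(same_order)

# If only the order differs, match() gives the row order that realigns desc.
desc_aligned <- desc[match(sample_cols, desc$Sample_GEO_Accession), ]
print(desc_aligned)
```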

descriptors2$Sample_Title

##  [1] PBMC total RNA-Healthy control 1                         
##  [2] PBMC total RNA-Healthy control 2                         
##  [3] PBMC total RNA-Healthy control 3                         
##  [4] PBMC total RNA-Healthy control 4                         
##  [5] PBMC total RNA-Healthy control 5                         
##  [6] PBMC total RNA-Healthy control 6                         
##  [7] PBMC total RNA-Healthy control 7                         
##  [8] PBMC total RNA-Healthy control 8                         
##  [9] PBMC total RNA-Healthy control 9                         
## [10] PBMC total RNA-Healthy control 10                        
## [11] PBMC total RNA-Healthy control 11                        
## [12] PBMC total RNA-Healthy control 12                        
## [13] PBMC total RNA-Healthy control 13                        
## [14] PBMC total RNA-Healthy control 14                        
## [15] PBMC total RNA-Healthy control 15                        
## [16] PBMC total RNA-Healthy control 16                        
## [17] PBMC total RNA-Healthy control 17                        
## [18] PBMC total RNA-Healthy control 18                        
## [19] PBMC total RNA-Healthy control 19                        
## [20] PBMC total RNA-Healthy control 20                        
## [21] PBMC total RNA-Healthy control 21                        
## [22] PBMC total RNA-Acute Lyme disease subject 1              
## [23] PBMC total RNA-Acute Lyme disease subject 2              
## [24] PBMC total RNA-Acute Lyme disease subject 3              
## [25] PBMC total RNA-Acute Lyme disease subject 4              
## [26] PBMC total RNA-Acute Lyme disease subject 5              
## [27] PBMC total RNA-Acute Lyme disease subject 6              
## [28] PBMC total RNA-Acute Lyme disease subject 7              
## [29] PBMC total RNA-Acute Lyme disease subject 8              
## [30] PBMC total RNA-Acute Lyme disease subject 9              
## [31] PBMC total RNA-Acute Lyme disease subject 10             
## [32] PBMC total RNA-Acute Lyme disease subject 11             
## [33] PBMC total RNA-Acute Lyme disease subject 12             
## [34] PBMC total RNA-Acute Lyme disease subject 13             
## [35] PBMC total RNA-Acute Lyme disease subject 14             
## [36] PBMC total RNA-Acute Lyme disease subject 15             
## [37] PBMC total RNA-Acute Lyme disease subject 16             
## [38] PBMC total RNA-Acute Lyme disease subject 17             
## [39] PBMC total RNA-Acute Lyme disease subject 18             
## [40] PBMC total RNA-Acute Lyme disease subject 19             
## [41] PBMC total RNA-Acute Lyme disease subject 20             
## [42] PBMC total RNA-Acute Lyme disease subject 21             
## [43] PBMC total RNA-Acute Lyme disease subject 22             
## [44] PBMC total RNA-Acute Lyme disease subject 23             
## [45] PBMC total RNA-Acute Lyme disease subject 24             
## [46] PBMC total RNA-Acute Lyme disease subject 25             
## [47] PBMC total RNA-Acute Lyme disease subject 26             
## [48] PBMC total RNA-Acute Lyme disease subject 27             
## [49] PBMC total RNA-Acute Lyme disease subject 28             
## [50] PBMC total RNA-early convalescent Lyme disease subject 1 
## [51] PBMC total RNA-early convalescent Lyme disease subject 2 
## [52] PBMC total RNA-early convalescent Lyme disease subject 3 
## [53] PBMC total RNA-early convalescent Lyme disease subject 4 
## [54] PBMC total RNA-early convalescent Lyme disease subject 5 
## [55] PBMC total RNA-early convalescent Lyme disease subject 6 
## [56] PBMC total RNA-early convalescent Lyme disease subject 7 
## [57] PBMC total RNA-early convalescent Lyme disease subject 8 
## [58] PBMC total RNA-early convalescent Lyme disease subject 9 
## [59] PBMC total RNA-early convalescent Lyme disease subject 10
## [60] PBMC total RNA-early convalescent Lyme disease subject 11
## [61] PBMC total RNA-early convalescent Lyme disease subject 12
## [62] PBMC total RNA-early convalescent Lyme disease subject 13
## [63] PBMC total RNA-early convalescent Lyme disease subject 14
## [64] PBMC total RNA-early convalescent Lyme disease subject 15
## [65] PBMC total RNA-early convalescent Lyme disease subject 16
## [66] PBMC total RNA-early convalescent Lyme disease subject 17
## [67] PBMC total RNA-early convalescent Lyme disease subject 18
## [68] PBMC total RNA-early convalescent Lyme disease subject 19
## [69] PBMC total RNA-early convalescent Lyme disease subject 20
## [70] PBMC total RNA-early convalescent Lyme disease subject 21
## [71] PBMC total RNA-early convalescent Lyme disease subject 22
## [72] PBMC total RNA-early convalescent Lyme disease subject 23
## [73] PBMC total RNA-early convalescent Lyme disease subject 24
## [74] PBMC total RNA-early convalescent Lyme disease subject 25
## [75] PBMC total RNA-early convalescent Lyme disease subject 26
## [76] PBMC total RNA-early convalescent Lyme disease subject 27
## [77] PBMC total RNA-late convalescent Lyme disease subject 1  
## [78] PBMC total RNA-late convalescent Lyme disease subject 2  
## [79] PBMC total RNA-late convalescent Lyme disease subject 3  
## [80] PBMC total RNA-late convalescent Lyme disease subject 4  
## [81] PBMC total RNA-late convalescent Lyme disease subject 5  
## [82] PBMC total RNA-late convalescent Lyme disease subject 6  
## [83] PBMC total RNA-late convalescent Lyme disease subject 7  
## [84] PBMC total RNA-late convalescent Lyme disease subject 8  
## [85] PBMC total RNA-late convalescent Lyme disease subject 9  
## [86] PBMC total RNA-late convalescent Lyme disease subject 10 
## 86 Levels: PBMC total RNA-Acute Lyme disease subject 1 ...

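# Note: the 'Antibodies_*' labels mark the early (about 1 month) and late
# (about 6 months) convalescent samples taken after antibiotic treatment.
descriptors2$classDisease <- c(rep('healthyControl',21),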
                               rep('acuteLymeDisease',28),
                               rep('Antibodies_1month',27),
                               rep('Antibodies_6months',10))
write.csv(descriptors2,'descriptors2.csv',row.names=F)
descriptors2

##                                                 Sample_Title
## 1                           PBMC total RNA-Healthy control 1
## 2                           PBMC total RNA-Healthy control 2
## 3                           PBMC total RNA-Healthy control 3
## 4                           PBMC total RNA-Healthy control 4
## 5                           PBMC total RNA-Healthy control 5
## 6                           PBMC total RNA-Healthy control 6
## 7                           PBMC total RNA-Healthy control 7
## 8                           PBMC total RNA-Healthy control 8
## 9                           PBMC total RNA-Healthy control 9
## 10                         PBMC total RNA-Healthy control 10
## 11                         PBMC total RNA-Healthy control 11
## 12                         PBMC total RNA-Healthy control 12
## 13                         PBMC total RNA-Healthy control 13
## 14                         PBMC total RNA-Healthy control 14
## 15                         PBMC total RNA-Healthy control 15
## 16                         PBMC total RNA-Healthy control 16
## 17                         PBMC total RNA-Healthy control 17
## 18                         PBMC total RNA-Healthy control 18
## 19                         PBMC total RNA-Healthy control 19
## 20                         PBMC total RNA-Healthy control 20
## 21                         PBMC total RNA-Healthy control 21
## 22               PBMC total RNA-Acute Lyme disease subject 1
## 23               PBMC total RNA-Acute Lyme disease subject 2
## 24               PBMC total RNA-Acute Lyme disease subject 3
## 25               PBMC total RNA-Acute Lyme disease subject 4
## 26               PBMC total RNA-Acute Lyme disease subject 5
## 27               PBMC total RNA-Acute Lyme disease subject 6
## 28               PBMC total RNA-Acute Lyme disease subject 7
## 29               PBMC total RNA-Acute Lyme disease subject 8
## 30               PBMC total RNA-Acute Lyme disease subject 9
## 31              PBMC total RNA-Acute Lyme disease subject 10
## 32              PBMC total RNA-Acute Lyme disease subject 11
## 33              PBMC total RNA-Acute Lyme disease subject 12
## 34              PBMC total RNA-Acute Lyme disease subject 13
## 35              PBMC total RNA-Acute Lyme disease subject 14
## 36              PBMC total RNA-Acute Lyme disease subject 15
## 37              PBMC total RNA-Acute Lyme disease subject 16
## 38              PBMC total RNA-Acute Lyme disease subject 17
## 39              PBMC total RNA-Acute Lyme disease subject 18
## 40              PBMC total RNA-Acute Lyme disease subject 19
## 41              PBMC total RNA-Acute Lyme disease subject 20
## 42              PBMC total RNA-Acute Lyme disease subject 21
## 43              PBMC total RNA-Acute Lyme disease subject 22
## 44              PBMC total RNA-Acute Lyme disease subject 23
## 45              PBMC total RNA-Acute Lyme disease subject 24
## 46              PBMC total RNA-Acute Lyme disease subject 25
## 47              PBMC total RNA-Acute Lyme disease subject 26
## 48              PBMC total RNA-Acute Lyme disease subject 27
## 49              PBMC total RNA-Acute Lyme disease subject 28
## 50  PBMC total RNA-early convalescent Lyme disease subject 1
## 51  PBMC total RNA-early convalescent Lyme disease subject 2
## 52  PBMC total RNA-early convalescent Lyme disease subject 3
## 53  PBMC total RNA-early convalescent Lyme disease subject 4
## 54  PBMC total RNA-early convalescent Lyme disease subject 5
## 55  PBMC total RNA-early convalescent Lyme disease subject 6
## 56  PBMC total RNA-early convalescent Lyme disease subject 7
## 57  PBMC total RNA-early convalescent Lyme disease subject 8
## 58  PBMC total RNA-early convalescent Lyme disease subject 9
## 59 PBMC total RNA-early convalescent Lyme disease subject 10
## 60 PBMC total RNA-early convalescent Lyme disease subject 11
## 61 PBMC total RNA-early convalescent Lyme disease subject 12
## 62 PBMC total RNA-early convalescent Lyme disease subject 13
## 63 PBMC total RNA-early convalescent Lyme disease subject 14
## 64 PBMC total RNA-early convalescent Lyme disease subject 15
## 65 PBMC total RNA-early convalescent Lyme disease subject 16
## 66 PBMC total RNA-early convalescent Lyme disease subject 17
## 67 PBMC total RNA-early convalescent Lyme disease subject 18
## 68 PBMC total RNA-early convalescent Lyme disease subject 19
## 69 PBMC total RNA-early convalescent Lyme disease subject 20
## 70 PBMC total RNA-early convalescent Lyme disease subject 21
## 71 PBMC total RNA-early convalescent Lyme disease subject 22
## 72 PBMC total RNA-early convalescent Lyme disease subject 23
## 73 PBMC total RNA-early convalescent Lyme disease subject 24
## 74 PBMC total RNA-early convalescent Lyme disease subject 25
## 75 PBMC total RNA-early convalescent Lyme disease subject 26
## 76 PBMC total RNA-early convalescent Lyme disease subject 27
## 77   PBMC total RNA-late convalescent Lyme disease subject 1
## 78   PBMC total RNA-late convalescent Lyme disease subject 2
## 79   PBMC total RNA-late convalescent Lyme disease subject 3
## 80   PBMC total RNA-late convalescent Lyme disease subject 4
## 81   PBMC total RNA-late convalescent Lyme disease subject 5
## 82   PBMC total RNA-late convalescent Lyme disease subject 6
## 83   PBMC total RNA-late convalescent Lyme disease subject 7
## 84   PBMC total RNA-late convalescent Lyme disease subject 8
## 85   PBMC total RNA-late convalescent Lyme disease subject 9
## 86  PBMC total RNA-late convalescent Lyme disease subject 10
##    Sample_GEO_Accession       classDisease
## 1            GSM4340492     healthyControl
## 2            GSM4340493     healthyControl
## 3            GSM4340494     healthyControl
## 4            GSM4340495     healthyControl
## 5            GSM4340496     healthyControl
## 6            GSM4340497     healthyControl
## 7            GSM4340498     healthyControl
## 8            GSM4340499     healthyControl
## 9            GSM4340500     healthyControl
## 10           GSM4340501     healthyControl
## 11           GSM4340502     healthyControl
## 12           GSM4340503     healthyControl
## 13           GSM4340504     healthyControl
## 14           GSM4340505     healthyControl
## 15           GSM4340506     healthyControl
## 16           GSM4340507     healthyControl
## 17           GSM4340508     healthyControl
## 18           GSM4340509     healthyControl
## 19           GSM4340510     healthyControl
## 20           GSM4340511     healthyControl
## 21           GSM4340512     healthyControl
## 22           GSM4340513   acuteLymeDisease
## 23           GSM4340514   acuteLymeDisease
## 24           GSM4340515   acuteLymeDisease
## 25           GSM4340516   acuteLymeDisease
## 26           GSM4340517   acuteLymeDisease
## 27           GSM4340518   acuteLymeDisease
## 28           GSM4340519   acuteLymeDisease
## 29           GSM4340520   acuteLymeDisease
## 30           GSM4340521   acuteLymeDisease
## 31           GSM4340522   acuteLymeDisease
## 32           GSM4340523   acuteLymeDisease
## 33           GSM4340524   acuteLymeDisease
## 34           GSM4340525   acuteLymeDisease
## 35           GSM4340526   acuteLymeDisease
## 36           GSM4340527   acuteLymeDisease
## 37           GSM4340528   acuteLymeDisease
## 38           GSM4340529   acuteLymeDisease
## 39           GSM4340530   acuteLymeDisease
## 40           GSM4340531   acuteLymeDisease
## 41           GSM4340532   acuteLymeDisease
## 42           GSM4340533   acuteLymeDisease
## 43           GSM4340534   acuteLymeDisease
## 44           GSM4340535   acuteLymeDisease
## 45           GSM4340536   acuteLymeDisease
## 46           GSM4340537   acuteLymeDisease
## 47           GSM4340538   acuteLymeDisease
## 48           GSM4340539   acuteLymeDisease
## 49           GSM4340540   acuteLymeDisease
## 50           GSM4340541  Antibodies_1month
## 51           GSM4340542  Antibodies_1month
## 52           GSM4340543  Antibodies_1month
## 53           GSM4340544  Antibodies_1month
## 54           GSM4340545  Antibodies_1month
## 55           GSM4340546  Antibodies_1month
## 56           GSM4340547  Antibodies_1month
## 57           GSM4340548  Antibodies_1month
## 58           GSM4340549  Antibodies_1month
## 59           GSM4340550  Antibodies_1month
## 60           GSM4340551  Antibodies_1month
## 61           GSM4340552  Antibodies_1month
## 62           GSM4340553  Antibodies_1month
## 63           GSM4340554  Antibodies_1month
## 64           GSM4340555  Antibodies_1month
## 65           GSM4340556  Antibodies_1month
## 66           GSM4340557  Antibodies_1month
## 67           GSM4340558  Antibodies_1month
## 68           GSM4340559  Antibodies_1month
## 69           GSM4340560  Antibodies_1month
## 70           GSM4340561  Antibodies_1month
## 71           GSM4340562  Antibodies_1month
## 72           GSM4340563  Antibodies_1month
## 73           GSM4340564  Antibodies_1month
## 74           GSM4340565  Antibodies_1month
## 75           GSM4340566  Antibodies_1month
## 76           GSM4340567  Antibodies_1month
## 77           GSM4340568 Antibodies_6months
## 78           GSM4340569 Antibodies_6months
## 79           GSM4340570 Antibodies_6months
## 80           GSM4340571 Antibodies_6months
## 81           GSM4340572 Antibodies_6months
## 82           GSM4340573 Antibodies_6months
## 83           GSM4340574 Antibodies_6months
## 84           GSM4340575 Antibodies_6months
## 85           GSM4340576 Antibodies_6months
## 86           GSM4340577 Antibodies_6months

platform <- read.delim('GPL13667-15572.txt',sep='\t',header=T,
                       na.strings=c('',' ','NA'),
                       comment.char='#')

colnames(platform)

##  [1] "ID"                               "GeneChip.Array"                  
##  [3] "Species.Scientific.Name"          "Annotation.Date"                 
##  [5] "Sequence.Type"                    "Sequence.Source"                 
##  [7] "Transcript.ID.Array.Design."      "Target.Description"              
##  [9] "Representative.Public.ID"         "Archival.UniGene.Cluster"        
## [11] "UniGene.ID"                       "Genome.Version"                  
## [13] "Alignments"                       "Gene.Title"                      
## [15] "Gene.Symbol"                      "Chromosomal.Location"            
## [17] "GB_LIST"                          "SPOT_ID"                         
## [19] "Unigene.Cluster.Type"             "Ensembl"                         
## [21] "Entrez.Gene"                      "SwissProt"                       
## [23] "EC"                               "OMIM"                            
## [25] "RefSeq.Protein.ID"                "RefSeq.Transcript.ID"            
## [27] "FlyBase"                          "AGI"                             
## [29] "WormBase"                         "MGI.Name"                        
## [31] "RGD.Name"                         "SGD.accession.number"            
## [33] "Gene.Ontology.Biological.Process" "Gene.Ontology.Cellular.Component"
## [35] "Gene.Ontology.Molecular.Function" "Pathway"                         
## [37] "InterPro"                         "Trans.Membrane"                  
## [39] "QTL"                              "Annotation.Description"          
## [41] "Annotation.Transcript.Cluster"    "Transcript.Assignments"          
## [43] "Annotation.Notes"

platform2 <- platform[,c(1,15)]
head(platform2,10)

##               ID            Gene.Symbol
## 1    11715100_at               HIST1H3G
## 2  11715101_s_at               HIST1H3G
## 3  11715102_x_at               HIST1H3G
## 4  11715103_x_at              TNFAIP8L1
## 5  11715104_s_at                  OTOP2
## 6    11715105_at               C17orf78
## 7  11715106_x_at                 CTAGE6
## 8  11715107_s_at F8A1 /// F8A2 /// F8A3
## 9  11715108_x_at              LOC285501
## 10   11715109_at                  SAMD7

split3 <- strsplit(as.character(platform2$Gene.Symbol),split='///')
Gene1 <- lapply(split3,'[',1)
platform2$Gene <- as.character(paste(Gene1))
platform2$Gene <- trimws(platform2$Gene,which='both',whitespace=' ')
platform3 <- platform2[,c(1,3)]
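
As a hedged alternative to the strsplit/lapply/paste steps above, here is a minimal sketch that returns a plain character vector directly. One difference worth knowing: paste() turns a missing symbol into the string "NA", while this version keeps it as a real NA. The toy `symbols` vector and the helper name `first_symbol` are made up for illustration and are not part of the original analysis.

```r
# Hypothetical toy input mimicking the Gene.Symbol column
symbols <- c("HIST1H3G", "F8A1 /// F8A2 /// F8A3", NA, "---")

# Return the first '///'-separated symbol, trimmed; NA stays NA
first_symbol <- function(x) {
  parts <- strsplit(as.character(x), split = "///", fixed = TRUE)
  vapply(parts, function(p) trimws(p[1]), character(1))
}

first_symbol(symbols)
```

The '---' placeholder passes through unchanged, so the later grep('---', ...) steps would still find it.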

Lyme <- merge(platform3,ticks,by.x='ID',by.y='ID_REF')
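
merge() does an inner join by default, so any probe ID that is missing from either table is dropped without a message. A small sketch of how one might check for that, using hypothetical toy tables (`annot` and `expr`) in place of platform3 and ticks:

```r
# Hypothetical toy tables standing in for platform3 and ticks
annot <- data.frame(ID = c("p1", "p2", "p3"), Gene = c("A", "B", "C"))
expr  <- data.frame(ID_REF = c("p2", "p3", "p4"), GSM1 = c(0.1, -0.2, 0.3))

# Default merge: only IDs found in both tables survive
inner <- merge(annot, expr, by.x = "ID", by.y = "ID_REF")

# all.x = TRUE keeps every annotation row, filling NA where no expression exists
left <- merge(annot, expr, by.x = "ID", by.y = "ID_REF", all.x = TRUE)

nrow(inner)                       # probes present in both tables
setdiff(expr$ID_REF, annot$ID)    # expression probes with no annotation
```

Comparing nrow() before and after the merge on the real tables would show whether any probes were lost.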

head(Lyme,10)

##               ID      Gene  GSM4340492   GSM4340493  GSM4340494  GSM4340495
## 1    11715100_at  HIST1H3G -0.59253310 -0.009284496  0.88924026 -0.59085226
## 2  11715101_s_at  HIST1H3G  0.09195518 -0.286612030 -0.05651927  0.01545429
## 3  11715102_x_at  HIST1H3G -0.30191730 -0.298989770  0.53580880 -0.05129719
## 4  11715103_x_at TNFAIP8L1  0.31854916  0.513157370  0.95201826 -0.17165422
## 5  11715104_s_at     OTOP2  0.35021090  0.417993550  0.64977026 -0.87235403
## 6    11715105_at  C17orf78  0.23255038  0.105412245  0.93498800 -0.38537788
## 7  11715106_x_at    CTAGE6 -0.23309612 -0.247609620  0.05952883  0.16506481
## 8  11715107_s_at      F8A1  0.21802092  0.263677600  0.25610542 -0.06133032
## 9  11715108_x_at LOC285501  0.15773225  0.230084420 -0.16884637  0.01592112
## 10   11715109_at     SAMD7 -0.07625985  0.069449190  0.86671830  0.07166767
##     GSM4340496   GSM4340497  GSM4340498  GSM4340499  GSM4340500  GSM4340501
## 1  -0.25674057  0.178862570  0.33442068  0.71101570 -0.39509892  0.46790314
## 2   0.46735048 -0.661887170 -0.19262838 -0.65387726 -0.23723197 -0.11107683
## 3   0.00169158  0.154232260  0.95216393  0.80829550 -0.22131062 -0.13876462
## 4  -0.49376535 -0.003461361  0.35323380  0.25973320  0.11914110  0.48709917
## 5  -0.27472186  0.518686800 -0.37734365 -0.18517780 -0.05672860  0.02594519
## 6  -0.18685770  0.038143635 -0.09946012 -0.19551945 -0.02428436  0.43764305
## 7  -0.17645860  0.284028300 -0.16674256 -0.03273916 -0.22399735 -0.35533237
## 8   0.24444056  0.127098080  0.35930157 -0.32224035  0.15163136  0.23986864
## 9  -0.07901430  0.194977760  0.01868057  0.40068722  0.20100140  0.01788568
## 10 -0.06297445  0.088138580 -0.11154175  0.31087565 -0.11259913  0.22273993
##     GSM4340502  GSM4340503   GSM4340504  GSM4340505   GSM4340506 GSM4340507
## 1  -0.91596320 -0.21084070  0.507361400 -0.10268044 -0.268592120 -0.2066014
## 2  -0.53500676 -0.01545405  0.026301146 -0.22284293 -0.096830610  0.3286717
## 3  -0.32182740 -0.33841515  0.547380700 -0.27017474 -0.444824930 -0.2811055
## 4   0.37288857 -0.33219337  0.187441830  0.02667522  0.138779160 -0.2921214
## 5   0.84479380 -0.56166600  0.176565890  0.70575760  0.009498119 -0.2518997
## 6   0.47393608 -0.22646427  0.001132488  0.03431201 -0.122164965  0.1782284
## 7   0.52311490  0.42100382  0.003138065 -0.21974206 -0.107515570 -0.4599485
## 8   0.59180880 -0.09978533 -0.083554980 -0.35681129  0.452571400 -0.5096481
## 9  -0.07756519  0.12501192  0.252948280  0.25551580 -0.194641110 -0.0900197
## 10  0.38259960  0.20906234  0.245586870  0.81757355  0.399212840  0.3086305
##     GSM4340508  GSM4340509 GSM4340510   GSM4340511  GSM4340512   GSM4340513
## 1  -0.03616428  1.39556170  0.9336066 -0.345187660  1.41630410  0.024940968
## 2  -0.10901141  0.26494336  0.2645502 -0.172512530 -0.01915169  0.590458400
## 3  -0.48442793 -0.00169158  0.3964074 -0.438740730  0.79996130 -0.003316164
## 4  -0.27666283  0.59313583  1.3088722 -0.078464985 -0.22184610 -0.125149250
## 5  -0.49455237 -0.12355471  1.2536860  0.005551815  0.14471460 -0.299937000
## 6  -0.26185846  0.03588915  1.0256069 -0.016168356 -0.08185172  0.283830400
## 7   0.26642323  0.49813986  0.6025591  0.103127960  0.28940630  0.305600400
## 8  -0.66656685 -0.04444027 -0.2184668 -0.607126950 -0.43012738 -0.034221650
## 9  -0.11579466  0.33119917  0.6969066 -0.205174210  0.44025946  0.700205300
## 10  0.47717070 -0.09540820  1.0605373 -0.025308609  0.09913993 -0.150889160
##     GSM4340514  GSM4340515  GSM4340516  GSM4340517 GSM4340518  GSM4340519
## 1   0.36390543 -0.05049491  0.17156029 -0.17820406 -0.6384110  1.45310120
## 2   0.92116880  0.13653588  0.40749073  0.06032562 -0.6903899 -0.31139135
## 3   0.51725410 -0.09225488  0.17572045 -0.53029585 -0.4140344  1.46454930
## 4  -0.26830244 -0.42997742  0.15891123  0.73606540  0.3150458  0.25862217
## 5  -0.36973786  0.10511756  0.07034254 -0.32845616 -0.1203601  0.41244340
## 6   0.14113188 -0.04387975 -0.01858592 -0.08883715 -0.4785538  1.40855050
## 7   0.03930974  0.02128148 -0.35518550 -0.18491459  0.2934127  0.02635431
## 8   0.19977641  0.12956524  0.76010180 -0.25856924  0.1035805 -0.07077360
## 9  -0.34498215  0.18119955  0.18576646 -0.13998628 -0.2464888  1.01377010
## 10  0.27483344  0.29325080 -0.12200546 -0.01883483  0.0276742  0.91223645
##      GSM4340520  GSM4340521  GSM4340522  GSM4340523  GSM4340524   GSM4340525
## 1   0.867766860  0.06011248 -0.04372168  0.99262430  0.32651900  1.221489000
## 2   1.184989500 -0.07329583  0.10649586  0.14782143 -0.04292679  0.122339725
## 3   1.107213300 -0.62243030  0.23995805  0.51144240 -0.21093988  0.734566900
## 4   0.399795530 -0.36117554 -0.14437151  0.07272816  0.40393830 -0.057187557
## 5  -0.190790410  0.21818352  0.03414512  0.24560475  0.25962280 -0.026625872
## 6   0.108215090  0.10102868 -0.21768450 -0.27324247 -0.06491280  0.093775510
## 7  -0.003138065 -0.03363609 -0.35512495  0.37538410 -0.48708916 -0.035444260
## 8  -0.350707530 -0.32779574  0.62094736  0.02097416  0.29126263 -0.032155037
## 9  -0.385637760  0.19942021  0.05432105 -0.35548830 -0.16131115  0.305632100
## 10 -0.340252160  0.67416120 -0.35456777 -0.07404351  0.22817540  0.003587961
##      GSM4340526  GSM4340527  GSM4340528  GSM4340529  GSM4340530   GSM4340531
## 1  -0.079782010  0.16625214 -0.05562854  0.74712276  0.11671686  0.202030660
## 2   1.331762800  1.05427030  0.77380896  1.21485230  0.17339611  0.373914240
## 3   0.273911700 -0.08973885  0.18000436  0.59774710  0.13011074  0.019492626
## 4  -0.001162052 -0.15073442 -0.15721035 -0.13115883  0.05067396  0.009587288
## 5  -0.174499030  0.01978135 -0.42199445 -0.23907113  0.14861059  0.129543780
## 6  -0.022754430 -0.25086713  0.06694078  0.05301285  0.09123874 -0.136976960
## 7  -0.090754750 -0.25880456 -0.40618014  0.64868000  0.55330443  0.462115760
## 8   0.017302990  0.03397655 -0.46453524 -0.28433204 -0.53935814  0.035744190
## 9  -0.225182060 -0.12666178 -0.28527260 -0.21253347  0.40385842  0.159019710
## 10  0.002527237  0.01034808 -0.32083917  0.20892763 -0.15703940  0.117316484
##      GSM4340532  GSM4340533  GSM4340534   GSM4340535 GSM4340536  GSM4340537
## 1   0.695035930 -0.07468033  0.42930126  0.762237100 -0.1346824 -0.66726850
## 2   0.644182700  1.74052330  0.81566286  1.983678300  0.1371200 -0.38933063
## 3   0.351636170  0.56654190  0.19219232  0.004929781 -0.3382273 -0.52081466
## 4  -0.137790200 -0.19500494  0.00483942 -0.624896050  0.2809162 -0.54733040
## 5  -0.005551815  0.17941070  0.21301961  1.075717400  0.2678673 -0.69308877
## 6   0.256294970 -0.19070745  0.01503134 -0.178430320  0.3682134  0.11680889
## 7   0.476877930 -0.13712406 -0.48446155  1.019529300  0.2419057  0.13968611
## 8   0.647078040  0.67237806 -0.27653360  0.587698000  0.1267602 -0.26808643
## 9   0.264019000  0.35026717  0.20412159  0.798325060 -0.1514852 -0.13640094
## 10  0.669338940  0.83077170 -0.47124220  0.010626555  0.2122192 -0.02050471
##    GSM4340538  GSM4340539  GSM4340540  GSM4340541   GSM4340542  GSM4340543
## 1  -0.8034751 -0.08240318 -0.26496172 -0.55023000  0.139022350  0.19117546
## 2  -0.4086158  0.07601285 -0.23855066  1.20058390 -0.857179160  0.72668650
## 3  -0.4278896  0.18717742 -0.15261054  0.06187797  0.344895600  0.07341838
## 4  -0.1127582 -0.35107183  0.08686781 -0.11734247 -0.187285900 -0.34354544
## 5  -0.4702437 -0.01245666 -0.26411510  0.28339958 -0.232067350 -0.02993345
## 6  -0.2333312 -0.10420871  0.15422583 -0.09303212 -0.087192300 -0.43732095
## 7  -0.1069174 -0.19418597 -0.48339033  0.60018590 -0.008615255 -0.12579775
## 8   0.4021506  0.24575377 -0.20825243  0.17053008 -0.445592160 -0.09836698
## 9  -0.1217663 -0.21430850 -0.01568246 -0.31617308 -0.003108978  0.08321142
## 10  0.3645365 -0.32145430 -0.23980832 -0.21153498 -0.225210190 -0.16845512
##     GSM4340544 GSM4340545  GSM4340546   GSM4340547   GSM4340548   GSM4340549
## 1  -0.13712788 -0.5064247  0.32296610  0.167175770 -0.007938385  0.265477180
## 2  -0.83762050 -0.5171139  0.06512356  1.502949700  0.820524700 -0.603322740
## 3   0.25898100 -0.1969023 -0.14703488  0.655670400  0.005455732 -0.004225254
## 4   0.09970617 -0.3247981 -0.16297817 -0.040671825  0.011886120  0.507034800
## 5   0.26365137 -0.4023223 -0.18715930  0.527691360  0.068028930  0.449331760
## 6   0.02165437 -0.1708701 -0.02961493  0.156007530 -0.175745730  0.357031350
## 7  -0.35575510  0.4802373  0.43824100  0.014536381  0.540118460 -0.238581660
## 8   0.04122806 -0.1636445  0.41594410  0.566439600  0.423860550  0.581760400
## 9  -0.07597423  0.1993938 -0.04255629 -0.026537180  0.096963880  0.349265580
## 10 -0.12926245 -0.1056190  0.22595882 -0.001816034 -0.009675026 -0.021210432
##      GSM4340550  GSM4340551  GSM4340552  GSM4340553  GSM4340554   GSM4340555
## 1   0.310139660  0.14941120  0.06148148 -0.03426623  0.07923126  0.338356970
## 2   0.377018930 -0.08147907 -0.04239464 -0.16666675  0.43988752  0.569839950
## 3   0.511428600 -0.15845299 -0.26334380 -0.99466133  0.24187303  0.172201400
## 4   0.003585339 -0.03613806  1.10038520  0.07085276  0.11792040  0.243794920
## 5   0.188768390  0.24183226 -0.30591822 -0.36849856 -0.02074099  0.322824720
## 6   0.196948290  0.43674254  0.45061135 -0.31486250 -0.01961470  0.069894314
## 7  -0.102472780 -0.12055969 -0.38386060  0.09604788  0.00589323  0.616808650
## 8  -0.329080340 -0.07314849 -0.02191258 -0.10152102 -0.10343003  0.140138150
## 9  -0.258805500 -0.04596662  0.30989194 -0.15038610 -0.43206978 -0.006614685
## 10  0.698671800 -0.21091223  0.50165390 -0.34214520 -0.08751988 -0.007938623
##      GSM4340556   GSM4340557 GSM4340558  GSM4340559  GSM4340560  GSM4340561
## 1   2.874059700  0.442127230  0.9249401 0.839578150 -0.38884664 -0.07621074
## 2  -0.110048770 -0.393716340 -0.8799987 0.684186460  0.30365920  0.88441324
## 3   1.739939900  0.296143770  1.2088530 0.370505570 -0.09722352 -0.05574894
## 4   0.259531970  0.001162052  1.4090724 0.464235300 -1.03242210 -0.09247875
## 5  -0.032611130  0.322963240 -0.1202502 0.496089460  0.30323625 -0.26851058
## 6  -0.001132488  0.085085150 -0.2871523 0.356799360 -0.27124834 -0.33872580
## 7   0.074517730  0.222347740 -0.2620816 0.525415200  0.45337200 -0.02035952
## 8  -0.022999763 -0.435444600  0.3761067 0.394931800  1.00498100 -0.22367430
## 9  -0.092698574  0.315870760  0.2548003 0.003108978 -0.08749151 -0.12848425
## 10 -0.005806685 -0.037118435 -0.2099145 0.491355180  0.21082091 -0.26856208
##      GSM4340562 GSM4340563  GSM4340564  GSM4340565  GSM4340566  GSM4340567
## 1  -0.030527353  0.7965670  1.32892900 -0.70203495  0.17316818 -0.17138457
## 2   0.229472160 -0.5226088 -0.42213202 -0.42114854 -0.57192636 -0.09266257
## 3   0.467982530  0.7239873  1.06951830 -0.69173074  0.04611826 -0.20751357
## 4  -0.611855030  0.4623060  0.15209580  0.01517963 -0.12642765 -0.11141872
## 5  -0.190755370  0.3697445 -0.29396987 -0.04537916  0.17844630 -0.30581330
## 6   0.480561020  0.4825172  0.09900594 -0.34519982 -0.06524348 -0.23261500
## 7  -0.012398958 -0.3951211  0.34062195  0.03960013 -0.15827584  0.42929006
## 8  -0.004727602  0.3185978 -0.78857090  0.15007639 -0.31071950 -0.06571126
## 9   0.391360280 -0.1007588 -0.34945035 -0.17555260  0.21184110 -0.22346115
## 10 -0.384369130 -0.2083082 -0.23706675 -0.15146804  0.54136395 -0.40481090
##     GSM4340568  GSM4340569  GSM4340570  GSM4340571  GSM4340572   GSM4340573
## 1  -0.47117710 -0.57729626  0.00291729 -0.00291729 -0.02237558 -0.215449810
## 2  -0.42650986  0.57595587 -0.06454945 -0.69773720  0.17670655 -1.009703200
## 3  -0.14878845 -0.02167058 -0.08139014 -0.14530134 -0.13251233 -0.116862774
## 4  -0.02393150 -0.02706146  1.04600050 -0.40366602  0.51856995 -0.090086940
## 5  -0.22946358 -0.23458219  1.23416950  0.19375610 -0.16548180 -0.057461023
## 6   0.05986667 -0.12558055 -0.11191845  0.47772480  0.11514950  0.773427000
## 7   0.55531836  0.30098010 -0.07369185  0.14053250 -0.02606392 -0.231655360
## 8   0.89040090  0.00472784  0.04761553 -0.11750078  0.75627136 -0.346018550
## 9   0.01146317  0.10802078 -0.14302516 -0.12559128  0.01791525  0.141523840
## 10 -0.26789665 -0.04320884  0.61968660  0.05324388  0.40543246  0.001815796
##     GSM4340574  GSM4340575  GSM4340576  GSM4340577
## 1  -0.23842883 -0.13297105 -0.25816083 -0.65128374
## 2  -0.41535997 -0.36541247 -0.63268210  0.32752848
## 3  -0.01295447  0.06384516 -0.88006690 -0.50552154
## 4  -0.36591434 -0.05154228 -0.27018833  0.69949150
## 5   0.01363945 -0.04463029 -0.03419995  0.68252800
## 6   0.11831260 -0.01090026 -0.17179346  0.06035352
## 7  -0.08951592  0.15467095 -0.15713477 -0.21521902
## 8  -0.16519380 -0.02013493 -0.58750010  0.47252607
## 9  -0.01791692 -0.12587452  0.02695108  0.28917623
## 10 -0.20205832  0.02986431 -0.20121956  0.32708670

noGeneSymbol <- Lyme[grep('---',Lyme$Gene),]
platform4 <- platform[,c(1,20)]
Ensembl <- merge(platform4,noGeneSymbol,by.x='ID',by.y='ID')
Ensembl2 <- Ensembl[-grep('---',Ensembl$Ensembl),]

string5 <- strsplit(as.character(paste(Ensembl2$Ensembl)),'///')
Ensembl2$EnsemblID <- as.character(paste(lapply(string5,'[',1)))
Ensembl3 <- Ensembl2[,c(90,4:89)]
colnames(Ensembl3)[1] <- 'Gene'

LymeDisease <- Lyme[,-1]

Let’s combine the Ensembl ID data frame with the gene symbol data frame, since the Ensembl IDs fill in some of the observations in the LymeDisease data frame whose gene symbols are missing. It’s only 75 out of 600 missing, but that still replaces some missing values. genecards.org will look up either identifier, and we can grep out the Ensembl IDs by their ‘ENSG’ prefix.

LymeDisease2 <- LymeDisease[-grep('---',LymeDisease$Gene),]
LymeDisease3 <- rbind(LymeDisease2,Ensembl3)
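
One caution about the `-grep('---', ...)` pattern used here: if nothing matches, grep() returns an empty vector and the negative index selects zero rows, not all rows. A minimal sketch of the difference, on a hypothetical toy data frame `df`:

```r
# Hypothetical toy data frame with no '---' placeholders at all
df <- data.frame(Gene = c("A", "B", "C"), GSM1 = c(0.1, 0.2, 0.3))

# -grep() with zero matches selects zero rows
dropped_all <- df[-grep("---", df$Gene, fixed = TRUE), ]

# !grepl() keeps every non-matching row, even when nothing matches
kept <- df[!grepl("---", df$Gene, fixed = TRUE), ]

nrow(dropped_all)
nrow(kept)
```

In this data set there are '---' entries, so the original lines behave as intended; the logical form is simply safer to reuse.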

write.csv(LymeDisease3,'LymeDisease.csv',row.names=FALSE)

Our data is log2 transformed and normalized, and my working assumption is that it was first scaled to lie between 0 and 1. There are different ways to scale a sample. One is to subtract the mean of all x’s in the sample from each element x and then divide by the standard deviation of all x’s in the sample. Another is to take an element x, subtract min(all x’s in sample), and divide by max(all x’s in sample)-min(all x’s in sample). To invert log2, you raise 2 to the output y. To invert the scaling, you reverse the operations: for the first method you multiply by the standard deviation of x and then add the mean of x, and for the second method you multiply by max-min and then add the min. The normalization is done before the log2, according to Dr. Quackenbush in an answer to a question posted on biostars. I want to invert the scaling because, when doing machine learning, the data is supposed to be scaled after it is split into training and testing sets. Affymetrix data also has more normalization steps. Let’s suppose that the normalization is the second method, because I could get back the original x by converting the decimal to a fraction, but couldn’t with the mean and standard deviation method of scaling. Also, if a value was zero I added 10^-8 so that log2 would return a finite value.
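
To make the two scaling methods and their inverses concrete, here is a small sketch on toy values. The vector `x` is made up for illustration; nothing here uses the GEO data.

```r
x <- c(1, 2, 3, 4, 5, 5, 43, 0, 23, 11)  # toy values

# Method 1: z-score standardization and its inverse
z        <- (x - mean(x)) / sd(x)
x_from_z <- z * sd(x) + mean(x)

# Method 2: min-max scaling and its inverse
m        <- (x - min(x)) / (max(x) - min(x))
x_from_m <- m * (max(x) - min(x)) + min(x)

all.equal(x_from_z, x)
all.equal(x_from_m, x)
```

Both inverses need statistics of the original x (mean and standard deviation, or min and max), which are not stored in the normalized values themselves.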

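So, let’s assume the formula is y = log2[(x-min(x))/(max(x)-min(x))]. The exact inverse would then be x = [2^(y)]*[max(x)-min(x)]+min(x). The original max(x) and min(x) are not available to us, so the code below substitutes the max and min of 2^y.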

a <- LymeDisease3$GSM4340492
head(a,10)

##  [1] -0.59253310  0.09195518 -0.30191730  0.31854916  0.35021090  0.23255038
##  [7] -0.23309612  0.21802092  0.15773225 -0.07625985

Inverse step 1: raise 2 to the power of y. We name the result A.

A <- (2^a)
head(A,10)

##  [1] 0.6631775 1.0658136 0.8111737 1.2470758 1.2747470 1.1749101 0.8508070
##  [8] 1.1631369 1.1155323 0.9485135

Step 2 of the inverse is to undo the scaling step that set all values between 0 and 1. But notice that the values above are not between 0 and 1, so they must not have been normalized with this method. They likely weren’t, because Dr. Quackenbush said the values are ‘background corrected,’ ‘quantile normalized,’ ‘probe summarisation (i.e. across transcripts),’ and ‘log (base 2) transformation.’ (www.biostars.org/p/3121133/)
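
The reasoning can be checked directly: if the values had been min-max scaled into [0, 1] before the log2, every y would be at most 0 and every 2^y would be at most 1. A minimal sketch using the ten printed values of GSM4340492:

```r
# First ten values of GSM4340492 as printed earlier in this section
y <- c(-0.59253310, 0.09195518, -0.30191730, 0.31854916, 0.35021090,
       0.23255038, -0.23309612, 0.21802092, 0.15773225, -0.07625985)

# Under min-max scaling followed by log2, 2^y must lie in [0, 1]
unscaled    <- 2^y
within_unit <- unscaled >= 0 & unscaled <= 1

sum(!within_unit)   # number of values that contradict the assumption
any(y > 0)          # equivalent check on the log scale
```

Any positive y is enough to rule out the assumed formula for that column.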

AA <- A*(max(A)-min(A))+min(A)
head(AA,10)

##  [1] 22.59247 36.29708 27.62984 42.46674 43.40859 40.01042 28.97885 39.60969
##  [9] 37.98936 32.30451

Those values don’t look extreme, so we could try the fractional method to get the original values back.

library(MASS)  # provides as.fractions()
AAA <- as.fractions(AA)
head(AAA,10)

##  [1]       618288/27367          22359/616          19258/697  156990101/3696778
##  [5]        332640/7663  108901357/2721825  106758269/3684006          29430/743
##  [9] 983198089/25880879         61637/1908

Multiply by the maximum value in the list of de-normalized (de-standardized) values. The denominators are not all common, and we might need a common denominator across these fractions for this to work.

maxA <- max(AAA)
A4 <- AAA*maxA
A5 <- as.numeric(A4)
head(A5,10)

##  [1] 26189.73 42076.44 32029.18 49228.46 50320.27 46381.04 33592.98 45916.50
##  [9] 44038.18 37448.16

Those values are extremely high. We would have been better off stopping after de-standardizing the inverse log2 of y as our x.

%%%%%%%%%%%%% demonstration of what was expected %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Let me show you what I expected when using this on a different set of values. We start with x having 10 elements, but one is a 0, and then we standardize to fit between 0 and 1.

x <- c(1,2,3,4,5,5,43,0,23,11)
x_a <- (x-min(x))/(max(x)-min(x))
x_a

##  [1] 0.02325581 0.04651163 0.06976744 0.09302326 0.11627907 0.11627907
##  [7] 1.00000000 0.00000000 0.53488372 0.25581395

Correct the 0 value before taking the log by adding a very small value; otherwise log2(0) returns -Inf.

x_b <- x_a+10^-8
x_b

##  [1] 0.02325582 0.04651164 0.06976745 0.09302327 0.11627908 0.11627908
##  [7] 1.00000001 0.00000001 0.53488373 0.25581396
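
A quick sketch of why the small offset matters. The name `eps` is just a label for the 10^-8 offset used in this demonstration.

```r
log2(0)                 # -Inf rather than a finite number
is.finite(log2(0))      # FALSE

eps <- 1e-8             # same offset as the 10^-8 used in this demonstration
log2(0 + eps)           # finite, roughly -26.58
```

The offset keeps every value finite, at the cost of a tiny shift that has to be subtracted again when reversing the steps.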

y <- log(x_b,2)
y

##  [1] -5.426264e+00 -4.426264e+00 -3.841302e+00 -3.426265e+00 -3.104337e+00
##  [6] -3.104337e+00  1.442695e-08 -2.657542e+01 -9.027028e-01 -1.966833e+00

The above is y, the log2 normalized output of x.

Let’s get x back by reversing the operations.

x_c <- 2^y
x_c

##  [1] 0.02325582 0.04651164 0.06976745 0.09302327 0.11627908 0.11627908
##  [7] 1.00000001 0.00000001 0.53488373 0.25581396

The above is equal to x_b, the normalized value plus the 10^-8 small value.

x_d <- x_c-0.00000001
x_d

##  [1] 2.325581e-02 4.651163e-02 6.976744e-02 9.302326e-02 1.162791e-01
##  [6] 1.162791e-01 1.000000e+00 8.271806e-24 5.348837e-01 2.558140e-01

Notice that the zero comes back as roughly 8 x 10^-24 instead of exactly 0. That is floating-point rounding error, and the value is effectively 0.

x_e <- x_d*(max(x_d)-min(x_d))+min(x_d)
x_e

##  [1] 2.325581e-02 4.651163e-02 6.976744e-02 9.302326e-02 1.162791e-01
##  [6] 1.162791e-01 1.000000e+00 1.654361e-23 5.348837e-01 2.558140e-01

#library(MASS)
X <- as.fractions(x_e)
X

##  [1]  1/43  2/43  3/43  4/43  5/43  5/43     1     0 23/43 11/43

Notice that, because it’s normalized, the values aren’t the original values, but the denominator is the max value. Since min(x) is 0 here, we can multiply by that value and get our original values back.

X2 <- X*43
x

##  [1]  1  2  3  4  5  5 43  0 23 11

X2

##  [1]  1  2  3  4  5  5 43  0 23 11

We got back the original values using the second normalization method.

%%%%%%%%%%%%%%%%%%%%%%%%%%% end of demonstration %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
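
The demonstration gets x back by way of fractions because min(x) is 0 and the max shows up as the denominator. As a more general sketch, assuming the min and max are stored when the data is scaled, the round trip can be written as two small functions. The names `forward` and `inverse` are hypothetical helpers, not anything from the GEO processing pipeline.

```r
# Forward: min-max scale, offset zeros, then log2; keep min, max and offset
forward <- function(x, eps = 1e-8) {
  lo <- min(x)
  hi <- max(x)
  list(y = log2((x - lo) / (hi - lo) + eps), lo = lo, hi = hi, eps = eps)
}

# Inverse: undo log2, remove the offset, then undo the scaling
inverse <- function(f) {
  (2^f$y - f$eps) * (f$hi - f$lo) + f$lo
}

x <- c(1, 2, 3, 4, 5, 5, 43, 0, 23, 11)
f <- forward(x)
x_back <- inverse(f)
all.equal(x_back, x)
```

This only works when the stored min and max are available, which is not the case for the downloaded series matrix.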

There are 86 samples, so we would have to do this to 85 more samples, or create a function that does those steps to each column in our data frame and writes the result to a file we can read back in. We are going to forget about multiplying each entry by the max and just end the inverse log2 normalization after de-normalizing the y output vector, presumably back to x, the input vector. I tried to write a for loop inside a function to write this out to a file, but it produced one long vector. When reading it back in, matrix() and as.matrix() didn’t change the 4M+ long vector (48851*86 elements as rows) into the right number of rows and columns; it stayed a very long vector. Online, the community says that the functions decide on their own; in particular, as.matrix() ignores nrow and ncol and keeps the shape of the data frame it is given.

LymeMX <- LymeDisease3[,2:87]

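# Undo log2 and rescale each column, appending the result to lymeMX.csv.
# Each column is appended below the previous one, so the file holds one long column.
denormalize <- function(datatable){
  for (i in datatable){
    A <- 2^i
    AA <- A*(max(A)-min(A))+min(A)
    write.table(AA,'lymeMX.csv',sep=',',append=TRUE,col.names=FALSE,row.names=FALSE)
  }
}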

if (file.exists('lymeMX.csv')){
  file.remove('lymeMX.csv')
}

## [1] TRUE

denormalize(LymeMX)

lymeVector <- read.csv('lymeMX.csv',sep=',',header=F)

lymeMatrix <- as.matrix(lymeVector,nrow=48851,ncol=86)
lymeMatrix2 <- as.matrix(lymeVector,nrow=48851,ncol=86)

Both matrices are still just one long 4,201,186 X 1 matrix.
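
For what it’s worth, the long single column can be folded back into shape, because the file was written one sample after another and matrix() fills column by column. A minimal sketch on a hypothetical 3 x 2 toy table, writing to a temporary file so nothing in the working directory is touched:

```r
# Toy stand-in: 3 rows x 2 samples, written one column after another
toy <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6))
tmp <- tempfile(fileext = ".csv")
for (col in toy) {
  write.table(col, tmp, sep = ",", append = TRUE,
              col.names = FALSE, row.names = FALSE)
}

long <- read.csv(tmp, header = FALSE)      # 6 rows x 1 column

# as.matrix() ignores nrow/ncol, so the shape stays 6 x 1
dim(as.matrix(long, nrow = 3, ncol = 2))

# matrix() on the underlying vector fills column by column
wide <- matrix(long[[1]], nrow = 3, ncol = 2)
wide
```

The key detail is passing the vector `long[[1]]` to matrix(), not the data frame itself. On the real file the same idea would use nrow = 48851 and ncol = 86, assuming every column was written in full and in order.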

So, we must do this the long way, although copy and paste still makes it somewhat fast. Just make sure to put in the right indices manually.
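
As a hedged alternative to typing each column by hand, the same arithmetic as the AA step (2^y times max minus min, plus min) can be applied to every sample column with lapply(). This is only a sketch on a hypothetical toy table; `rescale_col` and `toy` are made-up names.

```r
# Toy stand-in for LymeDisease3: a gene column plus log2-scale sample columns
toy <- data.frame(gene = c("A", "B", "C"),
                  GSM1 = c(-1, 0, 1),
                  GSM2 = c(0.5, -0.5, 2))

# Per-column arithmetic: 2^y * (max - min) + min
rescale_col <- function(y) {
  a <- 2^y
  a * (max(a) - min(a)) + min(a)
}

# Apply to every sample column and name the results s1, s2, ...
vals <- lapply(toy[, -1], rescale_col)
names(vals) <- paste0("s", seq_along(vals))
result <- cbind(data.frame(gene = toy$gene), as.data.frame(vals))
result
```

Because the result never leaves memory, there is no file round trip and no reshaping step.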

lymeMx <- as.data.frame(LymeDisease3[,1])
colnames(lymeMx) <- 'gene'

lymeMx$s1 <-2^(LymeDisease3[,2])*(max(2^(LymeDisease3[,2]))-min(2^(LymeDisease3[,2])))+min(2^(LymeDisease3[,2]))
lymeMx$s2 <-2^(LymeDisease3[,3])*(max(2^(LymeDisease3[,3]))-min(2^(LymeDisease3[,3])))+min(2^(LymeDisease3[,3]))
lymeMx$s3 <-2^(LymeDisease3[,4])*(max(2^(LymeDisease3[,4]))-min(2^(LymeDisease3[,4])))+min(2^(LymeDisease3[,4]))
lymeMx$s4 <-2^(LymeDisease3[,5])*(max(2^(LymeDisease3[,5]))-min(2^(LymeDisease3[,5])))+min(2^(LymeDisease3[,5]))
lymeMx$s5 <-2^(LymeDisease3[,6])*(max(2^(LymeDisease3[,6]))-min(2^(LymeDisease3[,6])))+min(2^(LymeDisease3[,6]))
lymeMx$s6 <-2^(LymeDisease3[,7])*(max(2^(LymeDisease3[,7]))-min(2^(LymeDisease3[,7])))+min(2^(LymeDisease3[,7]))
lymeMx$s7 <-2^(LymeDisease3[,8])*(max(2^(LymeDisease3[,8]))-min(2^(LymeDisease3[,8])))+min(2^(LymeDisease3[,8]))
lymeMx$s8 <-2^(LymeDisease3[,9])*(max(2^(LymeDisease3[,9]))-min(2^(LymeDisease3[,9])))+min(2^(LymeDisease3[,9]))
lymeMx$s9 <-2^(LymeDisease3[,10])*(max(2^(LymeDisease3[,10]))-min(2^(LymeDisease3[,10])))+min(2^(LymeDisease3[,10]))
lymeMx$s10 <-2^(LymeDisease3[,11])*(max(2^(LymeDisease3[,11]))-min(2^(LymeDisease3[,11])))+min(2^(LymeDisease3[,11]))

lymeMx$s11 <-2^(LymeDisease3[,12])*(max(2^(LymeDisease3[,12]))-min(2^(LymeDisease3[,12])))+min(2^(LymeDisease3[,12]))
lymeMx$s12 <-2^(LymeDisease3[,13])*(max(2^(LymeDisease3[,13]))-min(2^(LymeDisease3[,13])))+min(2^(LymeDisease3[,13]))
lymeMx$s13 <-2^(LymeDisease3[,14])*(max(2^(LymeDisease3[,14]))-min(2^(LymeDisease3[,14])))+min(2^(LymeDisease3[,14]))
lymeMx$s14 <-2^(LymeDisease3[,15])*(max(2^(LymeDisease3[,15]))-min(2^(LymeDisease3[,15])))+min(2^(LymeDisease3[,15]))
lymeMx$s15 <-2^(LymeDisease3[,16])*(max(2^(LymeDisease3[,16]))-min(2^(LymeDisease3[,16])))+min(2^(LymeDisease3[,16]))
lymeMx$s16 <-2^(LymeDisease3[,17])*(max(2^(LymeDisease3[,17]))-min(2^(LymeDisease3[,17])))+min(2^(LymeDisease3[,17]))
lymeMx$s17 <-2^(LymeDisease3[,18])*(max(2^(LymeDisease3[,18]))-min(2^(LymeDisease3[,18])))+min(2^(LymeDisease3[,18]))
lymeMx$s18 <-2^(LymeDisease3[,19])*(max(2^(LymeDisease3[,19]))-min(2^(LymeDisease3[,19])))+min(2^(LymeDisease3[,19]))
lymeMx$s19 <-2^(LymeDisease3[,20])*(max(2^(LymeDisease3[,20]))-min(2^(LymeDisease3[,20])))+min(2^(LymeDisease3[,20]))
lymeMx$s20 <-2^(LymeDisease3[,21])*(max(2^(LymeDisease3[,21]))-min(2^(LymeDisease3[,21])))+min(2^(LymeDisease3[,21]))


lymeMx$s21 <-2^(LymeDisease3[,22])*(max(2^(LymeDisease3[,22]))-min(2^(LymeDisease3[,22])))+min(2^(LymeDisease3[,22]))
lymeMx$s22 <-2^(LymeDisease3[,23])*(max(2^(LymeDisease3[,23]))-min(2^(LymeDisease3[,23])))+min(2^(LymeDisease3[,23]))
lymeMx$s23 <-2^(LymeDisease3[,24])*(max(2^(LymeDisease3[,24]))-min(2^(LymeDisease3[,24])))+min(2^(LymeDisease3[,24]))
lymeMx$s24 <-2^(LymeDisease3[,25])*(max(2^(LymeDisease3[,25]))-min(2^(LymeDisease3[,25])))+min(2^(LymeDisease3[,25]))
lymeMx$s25 <-2^(LymeDisease3[,26])*(max(2^(LymeDisease3[,26]))-min(2^(LymeDisease3[,26])))+min(2^(LymeDisease3[,26]))
lymeMx$s26 <-2^(LymeDisease3[,27])*(max(2^(LymeDisease3[,27]))-min(2^(LymeDisease3[,27])))+min(2^(LymeDisease3[,27]))
lymeMx$s27 <-2^(LymeDisease3[,28])*(max(2^(LymeDisease3[,28]))-min(2^(LymeDisease3[,28])))+min(2^(LymeDisease3[,28]))
lymeMx$s28 <-2^(LymeDisease3[,29])*(max(2^(LymeDisease3[,29]))-min(2^(LymeDisease3[,29])))+min(2^(LymeDisease3[,29]))
lymeMx$s29 <-2^(LymeDisease3[,30])*(max(2^(LymeDisease3[,30]))-min(2^(LymeDisease3[,30])))+min(2^(LymeDisease3[,30]))
lymeMx$s30 <-2^(LymeDisease3[,31])*(max(2^(LymeDisease3[,31]))-min(2^(LymeDisease3[,31])))+min(2^(LymeDisease3[,31]))


lymeMx$s31 <-2^(LymeDisease3[,32])*(max(2^(LymeDisease3[,32]))-min(2^(LymeDisease3[,32])))+min(2^(LymeDisease3[,32]))
lymeMx$s32 <-2^(LymeDisease3[,33])*(max(2^(LymeDisease3[,33]))-min(2^(LymeDisease3[,33])))+min(2^(LymeDisease3[,33]))
lymeMx$s33 <-2^(LymeDisease3[,34])*(max(2^(LymeDisease3[,34]))-min(2^(LymeDisease3[,34])))+min(2^(LymeDisease3[,34]))
lymeMx$s34 <-2^(LymeDisease3[,35])*(max(2^(LymeDisease3[,35]))-min(2^(LymeDisease3[,35])))+min(2^(LymeDisease3[,35]))
lymeMx$s35 <-2^(LymeDisease3[,36])*(max(2^(LymeDisease3[,36]))-min(2^(LymeDisease3[,36])))+min(2^(LymeDisease3[,36]))
lymeMx$s36 <-2^(LymeDisease3[,37])*(max(2^(LymeDisease3[,37]))-min(2^(LymeDisease3[,37])))+min(2^(LymeDisease3[,37]))
lymeMx$s37 <-2^(LymeDisease3[,38])*(max(2^(LymeDisease3[,38]))-min(2^(LymeDisease3[,38])))+min(2^(LymeDisease3[,38]))
lymeMx$s38 <-2^(LymeDisease3[,39])*(max(2^(LymeDisease3[,39]))-min(2^(LymeDisease3[,39])))+min(2^(LymeDisease3[,39]))
lymeMx$s39 <-2^(LymeDisease3[,40])*(max(2^(LymeDisease3[,40]))-min(2^(LymeDisease3[,40])))+min(2^(LymeDisease3[,40]))
lymeMx$s40 <-2^(LymeDisease3[,41])*(max(2^(LymeDisease3[,41]))-min(2^(LymeDisease3[,41])))+min(2^(LymeDisease3[,41]))


lymeMx$s41 <-2^(LymeDisease3[,42])*(max(2^(LymeDisease3[,42]))-min(2^(LymeDisease3[,42])))+min(2^(LymeDisease3[,42]))
lymeMx$s42 <-2^(LymeDisease3[,43])*(max(2^(LymeDisease3[,43]))-min(2^(LymeDisease3[,43])))+min(2^(LymeDisease3[,43]))
lymeMx$s43 <-2^(LymeDisease3[,44])*(max(2^(LymeDisease3[,44]))-min(2^(LymeDisease3[,44])))+min(2^(LymeDisease3[,44]))
lymeMx$s44 <-2^(LymeDisease3[,45])*(max(2^(LymeDisease3[,45]))-min(2^(LymeDisease3[,45])))+min(2^(LymeDisease3[,45]))
lymeMx$s45 <-2^(LymeDisease3[,46])*(max(2^(LymeDisease3[,46]))-min(2^(LymeDisease3[,46])))+min(2^(LymeDisease3[,46]))
lymeMx$s46 <-2^(LymeDisease3[,47])*(max(2^(LymeDisease3[,47]))-min(2^(LymeDisease3[,47])))+min(2^(LymeDisease3[,47]))
lymeMx$s47 <-2^(LymeDisease3[,48])*(max(2^(LymeDisease3[,48]))-min(2^(LymeDisease3[,48])))+min(2^(LymeDisease3[,48]))
lymeMx$s48 <-2^(LymeDisease3[,49])*(max(2^(LymeDisease3[,49]))-min(2^(LymeDisease3[,49])))+min(2^(LymeDisease3[,49]))
lymeMx$s49 <-2^(LymeDisease3[,50])*(max(2^(LymeDisease3[,50]))-min(2^(LymeDisease3[,50])))+min(2^(LymeDisease3[,50]))
lymeMx$s50 <-2^(LymeDisease3[,51])*(max(2^(LymeDisease3[,51]))-min(2^(LymeDisease3[,51])))+min(2^(LymeDisease3[,51]))


lymeMx$s51 <-2^(LymeDisease3[,52])*(max(2^(LymeDisease3[,52]))-min(2^(LymeDisease3[,52])))+min(2^(LymeDisease3[,52]))
lymeMx$s52 <-2^(LymeDisease3[,53])*(max(2^(LymeDisease3[,53]))-min(2^(LymeDisease3[,53])))+min(2^(LymeDisease3[,53]))
lymeMx$s53 <-2^(LymeDisease3[,54])*(max(2^(LymeDisease3[,54]))-min(2^(LymeDisease3[,54])))+min(2^(LymeDisease3[,54]))
lymeMx$s54 <-2^(LymeDisease3[,55])*(max(2^(LymeDisease3[,55]))-min(2^(LymeDisease3[,55])))+min(2^(LymeDisease3[,55]))
lymeMx$s55 <-2^(LymeDisease3[,56])*(max(2^(LymeDisease3[,56]))-min(2^(LymeDisease3[,56])))+min(2^(LymeDisease3[,56]))
lymeMx$s56 <-2^(LymeDisease3[,57])*(max(2^(LymeDisease3[,57]))-min(2^(LymeDisease3[,57])))+min(2^(LymeDisease3[,57]))
lymeMx$s57 <-2^(LymeDisease3[,58])*(max(2^(LymeDisease3[,58]))-min(2^(LymeDisease3[,58])))+min(2^(LymeDisease3[,58]))
lymeMx$s58 <-2^(LymeDisease3[,59])*(max(2^(LymeDisease3[,59]))-min(2^(LymeDisease3[,59])))+min(2^(LymeDisease3[,59]))
lymeMx$s59 <-2^(LymeDisease3[,60])*(max(2^(LymeDisease3[,60]))-min(2^(LymeDisease3[,60])))+min(2^(LymeDisease3[,60]))
lymeMx$s60 <-2^(LymeDisease3[,61])*(max(2^(LymeDisease3[,61]))-min(2^(LymeDisease3[,61])))+min(2^(LymeDisease3[,61]))


lymeMx$s61 <-2^(LymeDisease3[,62])*(max(2^(LymeDisease3[,62]))-min(2^(LymeDisease3[,62])))+min(2^(LymeDisease3[,62]))
lymeMx$s62 <-2^(LymeDisease3[,63])*(max(2^(LymeDisease3[,63]))-min(2^(LymeDisease3[,63])))+min(2^(LymeDisease3[,63]))
lymeMx$s63 <-2^(LymeDisease3[,64])*(max(2^(LymeDisease3[,64]))-min(2^(LymeDisease3[,64])))+min(2^(LymeDisease3[,64]))
lymeMx$s64 <-2^(LymeDisease3[,65])*(max(2^(LymeDisease3[,65]))-min(2^(LymeDisease3[,65])))+min(2^(LymeDisease3[,65]))
lymeMx$s65 <-2^(LymeDisease3[,66])*(max(2^(LymeDisease3[,66]))-min(2^(LymeDisease3[,66])))+min(2^(LymeDisease3[,66]))
lymeMx$s66 <-2^(LymeDisease3[,67])*(max(2^(LymeDisease3[,67]))-min(2^(LymeDisease3[,67])))+min(2^(LymeDisease3[,67]))
lymeMx$s67 <-2^(LymeDisease3[,68])*(max(2^(LymeDisease3[,68]))-min(2^(LymeDisease3[,68])))+min(2^(LymeDisease3[,68]))
lymeMx$s68 <-2^(LymeDisease3[,69])*(max(2^(LymeDisease3[,69]))-min(2^(LymeDisease3[,69])))+min(2^(LymeDisease3[,69]))
lymeMx$s69 <-2^(LymeDisease3[,70])*(max(2^(LymeDisease3[,70]))-min(2^(LymeDisease3[,70])))+min(2^(LymeDisease3[,70]))
lymeMx$s70 <-2^(LymeDisease3[,71])*(max(2^(LymeDisease3[,71]))-min(2^(LymeDisease3[,71])))+min(2^(LymeDisease3[,71]))


lymeMx$s71 <-2^(LymeDisease3[,72])*(max(2^(LymeDisease3[,72]))-min(2^(LymeDisease3[,72])))+min(2^(LymeDisease3[,72]))
lymeMx$s72 <-2^(LymeDisease3[,73])*(max(2^(LymeDisease3[,73]))-min(2^(LymeDisease3[,73])))+min(2^(LymeDisease3[,73]))
lymeMx$s73 <-2^(LymeDisease3[,74])*(max(2^(LymeDisease3[,74]))-min(2^(LymeDisease3[,74])))+min(2^(LymeDisease3[,74]))
lymeMx$s74 <-2^(LymeDisease3[,75])*(max(2^(LymeDisease3[,75]))-min(2^(LymeDisease3[,75])))+min(2^(LymeDisease3[,75]))
lymeMx$s75 <-2^(LymeDisease3[,76])*(max(2^(LymeDisease3[,76]))-min(2^(LymeDisease3[,76])))+min(2^(LymeDisease3[,76]))
lymeMx$s76 <-2^(LymeDisease3[,77])*(max(2^(LymeDisease3[,77]))-min(2^(LymeDisease3[,77])))+min(2^(LymeDisease3[,77]))
lymeMx$s77 <-2^(LymeDisease3[,78])*(max(2^(LymeDisease3[,78]))-min(2^(LymeDisease3[,78])))+min(2^(LymeDisease3[,78]))
lymeMx$s78 <-2^(LymeDisease3[,79])*(max(2^(LymeDisease3[,79]))-min(2^(LymeDisease3[,79])))+min(2^(LymeDisease3[,79]))
lymeMx$s79 <-2^(LymeDisease3[,80])*(max(2^(LymeDisease3[,80]))-min(2^(LymeDisease3[,80])))+min(2^(LymeDisease3[,80]))
lymeMx$s80 <-2^(LymeDisease3[,81])*(max(2^(LymeDisease3[,81]))-min(2^(LymeDisease3[,81])))+min(2^(LymeDisease3[,81]))

lymeMx$s81 <-2^(LymeDisease3[,82])*(max(2^(LymeDisease3[,82]))-min(2^(LymeDisease3[,82])))+min(2^(LymeDisease3[,82]))
lymeMx$s82 <-2^(LymeDisease3[,83])*(max(2^(LymeDisease3[,83]))-min(2^(LymeDisease3[,83])))+min(2^(LymeDisease3[,83]))
lymeMx$s83 <-2^(LymeDisease3[,84])*(max(2^(LymeDisease3[,84]))-min(2^(LymeDisease3[,84])))+min(2^(LymeDisease3[,84]))
lymeMx$s84 <-2^(LymeDisease3[,85])*(max(2^(LymeDisease3[,85]))-min(2^(LymeDisease3[,85])))+min(2^(LymeDisease3[,85]))
lymeMx$s85 <-2^(LymeDisease3[,86])*(max(2^(LymeDisease3[,86]))-min(2^(LymeDisease3[,86])))+min(2^(LymeDisease3[,86]))
lymeMx$s86 <-2^(LymeDisease3[,87])*(max(2^(LymeDisease3[,87]))-min(2^(LymeDisease3[,87])))+min(2^(LymeDisease3[,87]))
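
The column-by-column assignments above all apply the same arithmetic, so they could also be written as a loop. The following is only a sketch on a small made-up table, not the code that produced the output in this write-up. The helper name denormalize and the toy values are assumptions for illustration, and the toy LymeDisease3 has three sample columns instead of 86.

```r
# Toy stand-in for LymeDisease3: a gene column followed by log2-scale sample columns.
# The values and the GSM1..GSM3 names are invented for this example.
LymeDisease3 <- data.frame(
  gene = c("A", "B", "C"),
  GSM1 = c(-1, 0, 1),
  GSM2 = c(0, 1, 2),
  GSM3 = c(-2, 0, 2)
)

# Hypothetical helper: the same arithmetic as each line above,
# 2^x * (max(2^x) - min(2^x)) + min(2^x), applied to one column.
denormalize <- function(x) {
  p <- 2^x
  p * (max(p) - min(p)) + min(p)
}

# Start from the gene column, then add s1, s2, ... one sample column at a time.
lymeMx <- data.frame(gene = LymeDisease3$gene)
for (j in 2:ncol(LymeDisease3)) {
  lymeMx[[paste0("s", j - 1)]] <- denormalize(LymeDisease3[, j])
}

print(lymeMx)
```

With the real table the loop bounds stay the same, since 2:ncol(LymeDisease3) covers columns 2 through 87 and names the results s1 through s86.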

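We now have our suspected original x values, obtained by raising 2 to the power of each normalized value to undo the log2 transform and then rescaling each sample column by its own range. These are estimates, not the confirmed raw intensities.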

head(lymeMx,10)

##         gene       s1       s2       s3       s4       s5       s6       s7
## 1   HIST1H3G 22.59247 38.57177 39.54174 13.59257 26.04888 21.51069 50.33299
## 2   HIST1H3G 36.29708 31.84160 20.55844 20.66623 43.00957 12.01783 34.94171
## 3   HIST1H3G 27.62984 31.57033 30.96372 19.73411 31.15322 21.14685 77.21367
## 4  TNFAIP8L1 42.46674 55.36564 41.29757 18.15865 22.10683 18.95894 50.99312
## 5      OTOP2 43.40859 51.83702 33.50366 11.19206 25.72660 27.21961 30.74722
## 6   C17orf78 40.01042 41.75623 40.81368 15.66535 27.34023 19.51317 37.27006
## 7     CTAGE6 28.97885 32.71176 22.27521 22.91889 27.53780 23.13604 35.57361
## 8       F8A1 39.60969 46.58748 25.51772 19.59770 36.85641 20.75314 51.20788
## 9  LOC285501 37.98936 45.51724 19.02319 20.67290 29.45996 21.75213 40.44717
## 10     SAMD7 32.30451 40.73037 38.93021 21.48535 29.78898 20.20065 36.95958
##           s8       s9      s10       s11      s12      s13      s14      s15
## 1  18.361496 16.17992 29.27993  36.74482 39.58568 17.62928 26.42135 16.19021
## 2   7.151825 18.04948 19.61416  47.83345 45.32257 12.66312 24.31543 18.22622
## 3  19.639702 18.24964 19.24207  55.44210 36.23800 18.12191 23.53288 14.33849
## 4  13.439440 23.10354 29.67159  89.70427 36.39449 14.14594 28.89328 21.44424
## 5   9.882871 20.45345 21.56443 124.39415 31.04687 14.04057 46.21978 19.61363
## 6   9.812547 20.91835 28.67301  96.20869 39.15962 12.44611 29.04626 17.91047
## 7  10.980143 18.21571 16.56545  99.54299 61.32365 12.46327 24.36760 18.09237
## 8   8.990543 23.62946 25.00492 104.39482 42.75090 11.74310 22.16534 26.63348
## 9  14.814970 24.45164 21.44452  65.66078 49.95431 14.79771 33.84786 17.03718
## 10 13.923062 19.67697 24.71026  90.30977 52.94930 14.72298 49.93891 25.66958
##         s16      s17      s18       s19      s20      s21      s22      s23
## 1  18.65325 40.55269 48.72731  94.71925 27.68216 49.86249 53.64034 31.75436
## 2  26.99391 38.55864 22.27286  59.57215 31.19783 18.44950 79.37574 46.70039
## 3  17.71874 29.73707 18.52009  65.27301 25.94602 32.53405 52.60025 35.30951
## 4  17.58461 34.33457 27.95371 122.85707 33.29721 16.03413 48.34192 20.50662
## 5  18.07933 29.52951 17.02266 118.24643 35.29199 20.66606 42.82781 19.11798
## 6  24.32932 34.68813 19.00797 100.95596 34.76511 17.66578 64.18064 27.21852
## 7  15.66297 50.00273 26.17448  75.29876 37.75931 22.84390 65.15621 25.36739
## 8  15.13550 26.21678 17.98034  42.62383 23.09122 13.88158 51.48577 28.34550
## 9  20.21591 38.37803 23.31803  80.38736 30.50018 25.35947 85.64826 19.44793
## 10 26.62270 57.85869 17.35737 103.43002 34.54575 20.02411 47.48736 29.85636
##         s24      s25      s26      s27       s28      s29      s30      s31
## 1  40.36005 24.71115 41.21683 21.97899 242.04619 28.97658 16.83710 53.92571
## 2  45.94425 29.09320 48.62383 21.20297  71.27186 36.09075 15.35622 59.84022
## 3  39.20903 24.78237 32.29525 25.66908 243.97417 34.19917 10.51733 65.63736
## 4  31.02927 24.49584 77.66133 42.51588 105.78490 20.96305 12.59117 50.29372
## 5  44.95489 23.04002 37.14204 31.45284 117.68216 13.93743 18.77854 56.91465
## 6  40.54546 21.66550 43.84952 24.54859 234.68729 17.13590 17.31942 47.80322
## 7  42.41794 17.16673 41.02564 41.88385  90.06063 15.86667 15.78225 43.46195
## 8  45.72288 37.13531 38.98464 36.72603  84.19979 12.48020 12.88423 85.46665
## 9  47.38832 24.95522 42.32277 28.82399 178.51466 12.18283 18.53681 57.71579
## 10 51.21420 20.16999 46.02872 34.84622 166.38602 12.57062 25.73242 43.47873
##         s32      s33       s34      s35      s36      s37      s38      s39
## 1  90.70970 27.57647 19.531918 22.60302 29.23281 39.67376 19.79227 28.35322
## 2  50.52619 21.34949  9.144246 59.96040 54.04548 70.48516 27.35002 29.48876
## 3  64.99650 19.00399 13.951374 28.85456 24.49016 46.70942 17.85099 28.61755
## 4  47.96585 29.09598  8.080204 23.86327 23.47894 36.97766 10.79269 27.08516
## 5  54.06629 26.32759  8.252118 21.17318 26.41670 30.78056 10.01888 28.98670
## 6  37.74823 21.02680  8.965967 23.51028 21.90884 43.19002 12.25469 27.85714
## 7  59.15106 15.69574  8.202139 22.43253 21.78898 31.11961 18.49046 38.36838
## 8  46.27727 26.91106  8.220745 24.16935 26.67729 29.88674  9.71116 17.99804
## 9  35.65898 19.66863 10.376200 20.44577 23.87293 33.83838 10.20384 34.59415
## 10 43.33050 25.76022  8.425694 23.92411 26.24495 33.01483 13.64693 23.45511
##         s40      s41       s42      s43       s44      s45       s46      s47
## 1  19.97120 29.13563  48.46683 47.87722  44.83273 24.58380  7.340767 16.65791
## 2  22.49588 28.12865 170.41287 62.57221 104.44944 29.65918  8.879782 21.87477
## 3  17.59975 22.97713  75.55876 40.62497  26.55102 21.36238  8.114731 21.58562
## 4  17.47945 16.38441  44.59320 35.68056  17.18297 32.75699  7.968716 26.83404
## 5  18.99342 17.95132  57.78958 41.21533  55.69708 32.46296  7.212277 20.96364
## 6  15.79263 21.51168  44.72606 35.93334  23.39033 34.79392 12.570699 24.68950
## 7  23.91300 25.05539  46.41627 25.42504  53.57219 31.88584 12.770074 26.94255
## 8  17.79893 28.18503  81.30571 29.36271  39.73188 29.44770  9.649790 38.30607
## 9  19.38511 21.62683  65.04796 40.96207  45.96660 24.30032 10.562778 26.66755
## 10 18.83328 28.62235  90.73399 25.65884  26.65580 31.23851 11.438219 37.32249
##         s48      s49      s50      s51      s52      s53      s54      s55
## 1  21.84250 25.00497 15.03689 19.28145 20.95660 14.38273 15.33420 52.18946
## 2  24.37222 25.46586 50.56783  9.69547 30.34963  8.87291 15.22122 43.66884
## 3  26.32074 27.02532 22.97513 22.22991 19.31855 18.90868 18.99640 37.71431
## 4  18.13896 31.89456 20.29308 15.39031 14.48411 16.93831 17.38741 37.30122
## 5  22.92530 25.01962 26.78544 14.92172 17.98702 18.96980 16.47932 36.68333
## 6  21.51554 33.41618 20.63765 16.49177 13.57623 16.04933 19.34174 40.90143
## 7  20.21744 21.50006 33.35868 17.41162 16.83438 12.36843 30.35652 56.52007
## 8  27.40950 26.00508 24.77098 12.87700 17.15641 16.26777 19.43870 55.65522
## 9  19.93804 29.71015 17.68258 17.47798 19.44973 15.00299 24.99209 40.53731
## 10 18.51423 25.44372 19.01152 14.99254 16.34560 14.46104 20.23523 48.80388
##         s56      s57      s58      s59      s60       s61      s62      s63
## 1  18.81358 38.60427 37.42814 27.37149 66.26418  51.51663 21.96050 41.78819
## 2  47.41096 68.53473 20.50622 28.66727 56.47553  47.93870 20.03999 53.65111
## 3  26.37484 38.96412 31.05025 31.46036 53.54524  41.13340 11.31445 46.77273
## 4  16.29607 39.13807 44.24589 22.14372 58.27621 105.83491 23.61590 42.92348
## 5  24.14021 40.69022 42.51203 25.16801 70.64305  39.93768 17.43137 38.99176
## 6  18.66889 34.36792 39.87882 25.31076 80.85079  67.46208 18.08939 39.02220
## 7  16.92980 56.43290 26.39818 20.57866 54.96826  37.83800 24.03091 39.71794
## 8  24.79599 52.06556 46.59680 17.59646 56.80215  48.62395 20.96293 36.82081
## 9  16.45601 41.51407 39.66486 18.47170 57.88106  61.19389 20.26675 29.32376
## 10 16.73956 38.55786 30.68710 35.81168 51.63592  69.89111 17.75162 37.22891
##          s64       s65      s66       s67      s68      s69      s70      s71
## 1  10.586208 287.06210 47.57328 31.757893 82.46059 29.10355 38.66926 37.37517
## 2  12.411786  36.30340 26.66048  9.121163 74.04372 47.00593 75.20406 44.74178
## 3   9.445118 130.80362 42.99657 38.655164 59.58087 35.61306 39.22082 52.77229
## 4   9.920727  46.89542 35.04883 44.403146 63.57810 18.64624 38.23628 25.00343
## 5  10.473884  38.30379 43.80303 15.412572 64.99677 46.99216 33.85058 33.45397
## 6   8.805156  39.14815 37.14713 13.733739 59.01782 31.57138 32.24516 53.23378
## 7  12.819311  41.25430 40.85336 13.973693 66.33082 52.14100 40.19344 37.84687
## 8   9.239648  38.55965 25.90088 21.723133 60.59767 76.40293 34.91737 38.04827
## 9   8.355369  36.74231 43.58830 19.974912 46.19317 35.85380 37.29519 50.04637
## 10  8.347794  39.02161 34.13150 14.486557 64.78393 44.07917 33.84937 29.26158
##         s72      s73       s74      s75      s76      s77      s78      s79
## 1  25.17885 83.46367  8.726264 31.65491 16.70587 16.53895 11.41146 32.06021
## 2  10.12092 24.80718 10.590134 18.90590 17.64200 17.05701 25.26796 30.59706
## 3  23.94595 69.73067  8.788421 28.99057 16.29304 20.66426 16.72934 30.24238
## 4  19.98203 36.92691 14.310777 25.72812 17.41430 22.52651 16.66729 66.02685
## 5  18.74339 27.11026 13.724748 31.77075 15.22076 19.54391 14.44662 75.22061
## 6  20.26322 35.59332 11.159615 26.84060 16.01224 23.86987 15.57318 29.60989
## 7  11.05134 42.07957 14.554143 25.16745 25.32619 33.62460 20.89906 30.40400
## 8  18.09229 19.24648 15.708073 22.64868 17.97442 42.39900 17.03657 33.06796
## 9  13.54135 26.08811 12.545321 32.51360 16.11407 23.08441 18.29423 28.97904
## 10 12.57219 28.20026 12.755591 40.84400 14.21224 19.03183 16.48280 49.14358
##         s80      s81       s82      s83      s84      s85      s86
## 1  32.59497 36.59878 16.169422 17.04565 20.95848 23.88463 17.68828
## 2  20.17319 42.00942  9.332273 15.09083 17.84571 18.43729 34.83445
## 3  29.54068 33.91117 17.311605 19.91072 24.01601 15.54136 19.56592
## 4  24.71268 53.23373 17.635533 15.61322 22.17310 23.68683 45.07172
## 5  37.34162 33.14573 18.038448 20.27913 22.27939 27.88571 44.54518
## 6  45.44414 40.25601 32.071137 21.79697 22.80547 25.35448 28.95011
## 7  35.99257 36.50542 15.989030 18.88726 25.57395 25.61280 23.92105
## 8  30.11359 62.76273 14.772002 17.92761 22.66021 19.02194 38.51459
## 9  29.94572 37.63438 20.703257 19.84273 21.06163 29.09052 33.92133
## 10 33.88505 49.22103 18.794217 17.47806 23.45789 24.84377 34.82380

We can play around in some Tableau charts with either the normalized data or this denormalized data, which could be the raw values or close to them. Let’s replace the placeholder column names (s1 through s86) in our denormalized data with the actual sample names.

colnames(lymeMx)[2:87] <- colnames(LymeDisease3)[2:87]

head(lymeMx,10)

##         gene GSM4340492 GSM4340493 GSM4340494 GSM4340495 GSM4340496 GSM4340497
## 1   HIST1H3G   22.59247   38.57177   39.54174   13.59257   26.04888   21.51069
## 2   HIST1H3G   36.29708   31.84160   20.55844   20.66623   43.00957   12.01783
## 3   HIST1H3G   27.62984   31.57033   30.96372   19.73411   31.15322   21.14685
## 4  TNFAIP8L1   42.46674   55.36564   41.29757   18.15865   22.10683   18.95894
## 5      OTOP2   43.40859   51.83702   33.50366   11.19206   25.72660   27.21961
## 6   C17orf78   40.01042   41.75623   40.81368   15.66535   27.34023   19.51317
## 7     CTAGE6   28.97885   32.71176   22.27521   22.91889   27.53780   23.13604
## 8       F8A1   39.60969   46.58748   25.51772   19.59770   36.85641   20.75314
## 9  LOC285501   37.98936   45.51724   19.02319   20.67290   29.45996   21.75213
## 10     SAMD7   32.30451   40.73037   38.93021   21.48535   29.78898   20.20065
##    GSM4340498 GSM4340499 GSM4340500 GSM4340501 GSM4340502 GSM4340503 GSM4340504
## 1    50.33299  18.361496   16.17992   29.27993   36.74482   39.58568   17.62928
## 2    34.94171   7.151825   18.04948   19.61416   47.83345   45.32257   12.66312
## 3    77.21367  19.639702   18.24964   19.24207   55.44210   36.23800   18.12191
## 4    50.99312  13.439440   23.10354   29.67159   89.70427   36.39449   14.14594
## 5    30.74722   9.882871   20.45345   21.56443  124.39415   31.04687   14.04057
## 6    37.27006   9.812547   20.91835   28.67301   96.20869   39.15962   12.44611
## 7    35.57361  10.980143   18.21571   16.56545   99.54299   61.32365   12.46327
## 8    51.20788   8.990543   23.62946   25.00492  104.39482   42.75090   11.74310
## 9    40.44717  14.814970   24.45164   21.44452   65.66078   49.95431   14.79771
## 10   36.95958  13.923062   19.67697   24.71026   90.30977   52.94930   14.72298
##    GSM4340505 GSM4340506 GSM4340507 GSM4340508 GSM4340509 GSM4340510 GSM4340511
## 1    26.42135   16.19021   18.65325   40.55269   48.72731   94.71925   27.68216
## 2    24.31543   18.22622   26.99391   38.55864   22.27286   59.57215   31.19783
## 3    23.53288   14.33849   17.71874   29.73707   18.52009   65.27301   25.94602
## 4    28.89328   21.44424   17.58461   34.33457   27.95371  122.85707   33.29721
## 5    46.21978   19.61363   18.07933   29.52951   17.02266  118.24643   35.29199
## 6    29.04626   17.91047   24.32932   34.68813   19.00797  100.95596   34.76511
## 7    24.36760   18.09237   15.66297   50.00273   26.17448   75.29876   37.75931
## 8    22.16534   26.63348   15.13550   26.21678   17.98034   42.62383   23.09122
## 9    33.84786   17.03718   20.21591   38.37803   23.31803   80.38736   30.50018
## 10   49.93891   25.66958   26.62270   57.85869   17.35737  103.43002   34.54575
##    GSM4340512 GSM4340513 GSM4340514 GSM4340515 GSM4340516 GSM4340517 GSM4340518
## 1    49.86249   53.64034   31.75436   40.36005   24.71115   41.21683   21.97899
## 2    18.44950   79.37574   46.70039   45.94425   29.09320   48.62383   21.20297
## 3    32.53405   52.60025   35.30951   39.20903   24.78237   32.29525   25.66908
## 4    16.03413   48.34192   20.50662   31.02927   24.49584   77.66133   42.51588
## 5    20.66606   42.82781   19.11798   44.95489   23.04002   37.14204   31.45284
## 6    17.66578   64.18064   27.21852   40.54546   21.66550   43.84952   24.54859
## 7    22.84390   65.15621   25.36739   42.41794   17.16673   41.02564   41.88385
## 8    13.88158   51.48577   28.34550   45.72288   37.13531   38.98464   36.72603
## 9    25.35947   85.64826   19.44793   47.38832   24.95522   42.32277   28.82399
## 10   20.02411   47.48736   29.85636   51.21420   20.16999   46.02872   34.84622
##    GSM4340519 GSM4340520 GSM4340521 GSM4340522 GSM4340523 GSM4340524 GSM4340525
## 1   242.04619   28.97658   16.83710   53.92571   90.70970   27.57647  19.531918
## 2    71.27186   36.09075   15.35622   59.84022   50.52619   21.34949   9.144246
## 3   243.97417   34.19917   10.51733   65.63736   64.99650   19.00399  13.951374
## 4   105.78490   20.96305   12.59117   50.29372   47.96585   29.09598   8.080204
## 5   117.68216   13.93743   18.77854   56.91465   54.06629   26.32759   8.252118
## 6   234.68729   17.13590   17.31942   47.80322   37.74823   21.02680   8.965967
## 7    90.06063   15.86667   15.78225   43.46195   59.15106   15.69574   8.202139
## 8    84.19979   12.48020   12.88423   85.46665   46.27727   26.91106   8.220745
## 9   178.51466   12.18283   18.53681   57.71579   35.65898   19.66863  10.376200
## 10  166.38602   12.57062   25.73242   43.47873   43.33050   25.76022   8.425694
##    GSM4340526 GSM4340527 GSM4340528 GSM4340529 GSM4340530 GSM4340531 GSM4340532
## 1    22.60302   29.23281   39.67376   19.79227   28.35322   19.97120   29.13563
## 2    59.96040   54.04548   70.48516   27.35002   29.48876   22.49588   28.12865
## 3    28.85456   24.49016   46.70942   17.85099   28.61755   17.59975   22.97713
## 4    23.86327   23.47894   36.97766   10.79269   27.08516   17.47945   16.38441
## 5    21.17318   26.41670   30.78056   10.01888   28.98670   18.99342   17.95132
## 6    23.51028   21.90884   43.19002   12.25469   27.85714   15.79263   21.51168
## 7    22.43253   21.78898   31.11961   18.49046   38.36838   23.91300   25.05539
## 8    24.16935   26.67729   29.88674    9.71116   17.99804   17.79893   28.18503
## 9    20.44577   23.87293   33.83838   10.20384   34.59415   19.38511   21.62683
## 10   23.92411   26.24495   33.01483   13.64693   23.45511   18.83328   28.62235
##    GSM4340533 GSM4340534 GSM4340535 GSM4340536 GSM4340537 GSM4340538 GSM4340539
## 1    48.46683   47.87722   44.83273   24.58380   7.340767   16.65791   21.84250
## 2   170.41287   62.57221  104.44944   29.65918   8.879782   21.87477   24.37222
## 3    75.55876   40.62497   26.55102   21.36238   8.114731   21.58562   26.32074
## 4    44.59320   35.68056   17.18297   32.75699   7.968716   26.83404   18.13896
## 5    57.78958   41.21533   55.69708   32.46296   7.212277   20.96364   22.92530
## 6    44.72606   35.93334   23.39033   34.79392  12.570699   24.68950   21.51554
## 7    46.41627   25.42504   53.57219   31.88584  12.770074   26.94255   20.21744
## 8    81.30571   29.36271   39.73188   29.44770   9.649790   38.30607   27.40950
## 9    65.04796   40.96207   45.96660   24.30032  10.562778   26.66755   19.93804
## 10   90.73399   25.65884   26.65580   31.23851  11.438219   37.32249   18.51423
##    GSM4340540 GSM4340541 GSM4340542 GSM4340543 GSM4340544 GSM4340545 GSM4340546
## 1    25.00497   15.03689   19.28145   20.95660   14.38273   15.33420   52.18946
## 2    25.46586   50.56783    9.69547   30.34963    8.87291   15.22122   43.66884
## 3    27.02532   22.97513   22.22991   19.31855   18.90868   18.99640   37.71431
## 4    31.89456   20.29308   15.39031   14.48411   16.93831   17.38741   37.30122
## 5    25.01962   26.78544   14.92172   17.98702   18.96980   16.47932   36.68333
## 6    33.41618   20.63765   16.49177   13.57623   16.04933   19.34174   40.90143
## 7    21.50006   33.35868   17.41162   16.83438   12.36843   30.35652   56.52007
## 8    26.00508   24.77098   12.87700   17.15641   16.26777   19.43870   55.65522
## 9    29.71015   17.68258   17.47798   19.44973   15.00299   24.99209   40.53731
## 10   25.44372   19.01152   14.99254   16.34560   14.46104   20.23523   48.80388
##    GSM4340547 GSM4340548 GSM4340549 GSM4340550 GSM4340551 GSM4340552 GSM4340553
## 1    18.81358   38.60427   37.42814   27.37149   66.26418   51.51663   21.96050
## 2    47.41096   68.53473   20.50622   28.66727   56.47553   47.93870   20.03999
## 3    26.37484   38.96412   31.05025   31.46036   53.54524   41.13340   11.31445
## 4    16.29607   39.13807   44.24589   22.14372   58.27621  105.83491   23.61590
## 5    24.14021   40.69022   42.51203   25.16801   70.64305   39.93768   17.43137
## 6    18.66889   34.36792   39.87882   25.31076   80.85079   67.46208   18.08939
## 7    16.92980   56.43290   26.39818   20.57866   54.96826   37.83800   24.03091
## 8    24.79599   52.06556   46.59680   17.59646   56.80215   48.62395   20.96293
## 9    16.45601   41.51407   39.66486   18.47170   57.88106   61.19389   20.26675
## 10   16.73956   38.55786   30.68710   35.81168   51.63592   69.89111   17.75162
##    GSM4340554 GSM4340555 GSM4340556 GSM4340557 GSM4340558 GSM4340559 GSM4340560
## 1    41.78819  10.586208  287.06210   47.57328  31.757893   82.46059   29.10355
## 2    53.65111  12.411786   36.30340   26.66048   9.121163   74.04372   47.00593
## 3    46.77273   9.445118  130.80362   42.99657  38.655164   59.58087   35.61306
## 4    42.92348   9.920727   46.89542   35.04883  44.403146   63.57810   18.64624
## 5    38.99176  10.473884   38.30379   43.80303  15.412572   64.99677   46.99216
## 6    39.02220   8.805156   39.14815   37.14713  13.733739   59.01782   31.57138
## 7    39.71794  12.819311   41.25430   40.85336  13.973693   66.33082   52.14100
## 8    36.82081   9.239648   38.55965   25.90088  21.723133   60.59767   76.40293
## 9    29.32376   8.355369   36.74231   43.58830  19.974912   46.19317   35.85380
## 10   37.22891   8.347794   39.02161   34.13150  14.486557   64.78393   44.07917
##    GSM4340561 GSM4340562 GSM4340563 GSM4340564 GSM4340565 GSM4340566 GSM4340567
## 1    38.66926   37.37517   25.17885   83.46367   8.726264   31.65491   16.70587
## 2    75.20406   44.74178   10.12092   24.80718  10.590134   18.90590   17.64200
## 3    39.22082   52.77229   23.94595   69.73067   8.788421   28.99057   16.29304
## 4    38.23628   25.00343   19.98203   36.92691  14.310777   25.72812   17.41430
## 5    33.85058   33.45397   18.74339   27.11026  13.724748   31.77075   15.22076
## 6    32.24516   53.23378   20.26322   35.59332  11.159615   26.84060   16.01224
## 7    40.19344   37.84687   11.05134   42.07957  14.554143   25.16745   25.32619
## 8    34.91737   38.04827   18.09229   19.24648  15.708073   22.64868   17.97442
## 9    37.29519   50.04637   13.54135   26.08811  12.545321   32.51360   16.11407
## 10   33.84937   29.26158   12.57219   28.20026  12.755591   40.84400   14.21224
##    GSM4340568 GSM4340569 GSM4340570 GSM4340571 GSM4340572 GSM4340573 GSM4340574
## 1    16.53895   11.41146   32.06021   32.59497   36.59878  16.169422   17.04565
## 2    17.05701   25.26796   30.59706   20.17319   42.00942   9.332273   15.09083
## 3    20.66426   16.72934   30.24238   29.54068   33.91117  17.311605   19.91072
## 4    22.52651   16.66729   66.02685   24.71268   53.23373  17.635533   15.61322
## 5    19.54391   14.44662   75.22061   37.34162   33.14573  18.038448   20.27913
## 6    23.86987   15.57318   29.60989   45.44414   40.25601  32.071137   21.79697
## 7    33.62460   20.89906   30.40400   35.99257   36.50542  15.989030   18.88726
## 8    42.39900   17.03657   33.06796   30.11359   62.76273  14.772002   17.92761
## 9    23.08441   18.29423   28.97904   29.94572   37.63438  20.703257   19.84273
## 10   19.03183   16.48280   49.14358   33.88505   49.22103  18.794217   17.47806
##    GSM4340575 GSM4340576 GSM4340577
## 1    20.95848   23.88463   17.68828
## 2    17.84571   18.43729   34.83445
## 3    24.01601   15.54136   19.56592
## 4    22.17310   23.68683   45.07172
## 5    22.27939   27.88571   44.54518
## 6    22.80547   25.35448   28.95011
## 7    25.57395   25.61280   23.92105
## 8    22.66021   19.02194   38.51459
## 9    21.06163   29.09052   33.92133
## 10   23.45789   24.84377   34.82380

Actually, these GSM accession numbers aren’t going to be very informative as sample identifiers in the charts, so we should line them up with their descriptive names. Those are in the table we created earlier and named descriptors2.

head(descriptors2,10)

##                         Sample_Title Sample_GEO_Accession   classDisease
## 1   PBMC total RNA-Healthy control 1           GSM4340492 healthyControl
## 2   PBMC total RNA-Healthy control 2           GSM4340493 healthyControl
## 3   PBMC total RNA-Healthy control 3           GSM4340494 healthyControl
## 4   PBMC total RNA-Healthy control 4           GSM4340495 healthyControl
## 5   PBMC total RNA-Healthy control 5           GSM4340496 healthyControl
## 6   PBMC total RNA-Healthy control 6           GSM4340497 healthyControl
## 7   PBMC total RNA-Healthy control 7           GSM4340498 healthyControl
## 8   PBMC total RNA-Healthy control 8           GSM4340499 healthyControl
## 9   PBMC total RNA-Healthy control 9           GSM4340500 healthyControl
## 10 PBMC total RNA-Healthy control 10           GSM4340501 healthyControl

Let’s check that the column names of our denormalized and normalized data frames are in the same order as our descriptor names, so we can replace the names.
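
As an alternative to reading the table by eye, the order check could be done programmatically. The sketch below uses miniature, made-up versions of lymeMx and descriptors2 (three samples, two genes, rounded values), so it is an assumption about how such a check might look, not part of the original analysis. It uses identical() for the order test and match() to look up each sample title.

```r
# Miniature stand-ins for the real tables; the values are rounded for illustration.
descriptors2 <- data.frame(
  Sample_Title = c("PBMC total RNA-Healthy control 1",
                   "PBMC total RNA-Healthy control 2",
                   "PBMC total RNA-Acute Lyme disease subject 1"),
  Sample_GEO_Accession = c("GSM4340492", "GSM4340493", "GSM4340513"),
  stringsAsFactors = FALSE
)
lymeMx <- data.frame(
  gene = c("HIST1H3G", "OTOP2"),
  GSM4340492 = c(22.6, 43.4),
  GSM4340493 = c(38.6, 51.8),
  GSM4340513 = c(53.6, 42.8)
)

# Sample columns are everything except the gene column.
sample_names <- colnames(lymeMx)[-1]

# TRUE only when both name vectors agree element by element, in the same order.
same_order <- identical(sample_names,
                        as.character(descriptors2$Sample_GEO_Accession))

# match() finds the descriptor row for each column, so the title lookup
# still works if the two orders ever differ.
idx <- match(sample_names, descriptors2$Sample_GEO_Accession)
stopifnot(!anyNA(idx))  # every column must have a descriptor row
titles <- as.character(descriptors2$Sample_Title[idx])

print(same_order)
print(titles)
```

The side-by-side columns built next give the same information visually; the match() lookup just avoids relying on the two tables staying in the same order.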

descriptors2$denormalized <- as.factor(paste(colnames(lymeMx)[2:87]))
descriptors2$normalized <- as.factor(paste(colnames(LymeDisease3)[2:87]))
descriptors2[,1:5]

##                                                 Sample_Title
## 1                           PBMC total RNA-Healthy control 1
## 2                           PBMC total RNA-Healthy control 2
## 3                           PBMC total RNA-Healthy control 3
## 4                           PBMC total RNA-Healthy control 4
## 5                           PBMC total RNA-Healthy control 5
## 6                           PBMC total RNA-Healthy control 6
## 7                           PBMC total RNA-Healthy control 7
## 8                           PBMC total RNA-Healthy control 8
## 9                           PBMC total RNA-Healthy control 9
## 10                         PBMC total RNA-Healthy control 10
## 11                         PBMC total RNA-Healthy control 11
## 12                         PBMC total RNA-Healthy control 12
## 13                         PBMC total RNA-Healthy control 13
## 14                         PBMC total RNA-Healthy control 14
## 15                         PBMC total RNA-Healthy control 15
## 16                         PBMC total RNA-Healthy control 16
## 17                         PBMC total RNA-Healthy control 17
## 18                         PBMC total RNA-Healthy control 18
## 19                         PBMC total RNA-Healthy control 19
## 20                         PBMC total RNA-Healthy control 20
## 21                         PBMC total RNA-Healthy control 21
## 22               PBMC total RNA-Acute Lyme disease subject 1
## 23               PBMC total RNA-Acute Lyme disease subject 2
## 24               PBMC total RNA-Acute Lyme disease subject 3
## 25               PBMC total RNA-Acute Lyme disease subject 4
## 26               PBMC total RNA-Acute Lyme disease subject 5
## 27               PBMC total RNA-Acute Lyme disease subject 6
## 28               PBMC total RNA-Acute Lyme disease subject 7
## 29               PBMC total RNA-Acute Lyme disease subject 8
## 30               PBMC total RNA-Acute Lyme disease subject 9
## 31              PBMC total RNA-Acute Lyme disease subject 10
## 32              PBMC total RNA-Acute Lyme disease subject 11
## 33              PBMC total RNA-Acute Lyme disease subject 12
## 34              PBMC total RNA-Acute Lyme disease subject 13
## 35              PBMC total RNA-Acute Lyme disease subject 14
## 36              PBMC total RNA-Acute Lyme disease subject 15
## 37              PBMC total RNA-Acute Lyme disease subject 16
## 38              PBMC total RNA-Acute Lyme disease subject 17
## 39              PBMC total RNA-Acute Lyme disease subject 18
## 40              PBMC total RNA-Acute Lyme disease subject 19
## 41              PBMC total RNA-Acute Lyme disease subject 20
## 42              PBMC total RNA-Acute Lyme disease subject 21
## 43              PBMC total RNA-Acute Lyme disease subject 22
## 44              PBMC total RNA-Acute Lyme disease subject 23
## 45              PBMC total RNA-Acute Lyme disease subject 24
## 46              PBMC total RNA-Acute Lyme disease subject 25
## 47              PBMC total RNA-Acute Lyme disease subject 26
## 48              PBMC total RNA-Acute Lyme disease subject 27
## 49              PBMC total RNA-Acute Lyme disease subject 28
## 50  PBMC total RNA-early convalescent Lyme disease subject 1
## 51  PBMC total RNA-early convalescent Lyme disease subject 2
## 52  PBMC total RNA-early convalescent Lyme disease subject 3
## 53  PBMC total RNA-early convalescent Lyme disease subject 4
## 54  PBMC total RNA-early convalescent Lyme disease subject 5
## 55  PBMC total RNA-early convalescent Lyme disease subject 6
## 56  PBMC total RNA-early convalescent Lyme disease subject 7
## 57  PBMC total RNA-early convalescent Lyme disease subject 8
## 58  PBMC total RNA-early convalescent Lyme disease subject 9
## 59 PBMC total RNA-early convalescent Lyme disease subject 10
## 60 PBMC total RNA-early convalescent Lyme disease subject 11
## 61 PBMC total RNA-early convalescent Lyme disease subject 12
## 62 PBMC total RNA-early convalescent Lyme disease subject 13
## 63 PBMC total RNA-early convalescent Lyme disease subject 14
## 64 PBMC total RNA-early convalescent Lyme disease subject 15
## 65 PBMC total RNA-early convalescent Lyme disease subject 16
## 66 PBMC total RNA-early convalescent Lyme disease subject 17
## 67 PBMC total RNA-early convalescent Lyme disease subject 18
## 68 PBMC total RNA-early convalescent Lyme disease subject 19
## 69 PBMC total RNA-early convalescent Lyme disease subject 20
## 70 PBMC total RNA-early convalescent Lyme disease subject 21
## 71 PBMC total RNA-early convalescent Lyme disease subject 22
## 72 PBMC total RNA-early convalescent Lyme disease subject 23
## 73 PBMC total RNA-early convalescent Lyme disease subject 24
## 74 PBMC total RNA-early convalescent Lyme disease subject 25
## 75 PBMC total RNA-early convalescent Lyme disease subject 26
## 76 PBMC total RNA-early convalescent Lyme disease subject 27
## 77   PBMC total RNA-late convalescent Lyme disease subject 1
## 78   PBMC total RNA-late convalescent Lyme disease subject 2
## 79   PBMC total RNA-late convalescent Lyme disease subject 3
## 80   PBMC total RNA-late convalescent Lyme disease subject 4
## 81   PBMC total RNA-late convalescent Lyme disease subject 5
## 82   PBMC total RNA-late convalescent Lyme disease subject 6
## 83   PBMC total RNA-late convalescent Lyme disease subject 7
## 84   PBMC total RNA-late convalescent Lyme disease subject 8
## 85   PBMC total RNA-late convalescent Lyme disease subject 9
## 86  PBMC total RNA-late convalescent Lyme disease subject 10
##    Sample_GEO_Accession       classDisease denormalized normalized
## 1            GSM4340492     healthyControl   GSM4340492 GSM4340492
## 2            GSM4340493     healthyControl   GSM4340493 GSM4340493
## 3            GSM4340494     healthyControl   GSM4340494 GSM4340494
## 4            GSM4340495     healthyControl   GSM4340495 GSM4340495
## 5            GSM4340496     healthyControl   GSM4340496 GSM4340496
## 6            GSM4340497     healthyControl   GSM4340497 GSM4340497
## 7            GSM4340498     healthyControl   GSM4340498 GSM4340498
## 8            GSM4340499     healthyControl   GSM4340499 GSM4340499
## 9            GSM4340500     healthyControl   GSM4340500 GSM4340500
## 10           GSM4340501     healthyControl   GSM4340501 GSM4340501
## 11           GSM4340502     healthyControl   GSM4340502 GSM4340502
## 12           GSM4340503     healthyControl   GSM4340503 GSM4340503
## 13           GSM4340504     healthyControl   GSM4340504 GSM4340504
## 14           GSM4340505     healthyControl   GSM4340505 GSM4340505
## 15           GSM4340506     healthyControl   GSM4340506 GSM4340506
## 16           GSM4340507     healthyControl   GSM4340507 GSM4340507
## 17           GSM4340508     healthyControl   GSM4340508 GSM4340508
## 18           GSM4340509     healthyControl   GSM4340509 GSM4340509
## 19           GSM4340510     healthyControl   GSM4340510 GSM4340510
## 20           GSM4340511     healthyControl   GSM4340511 GSM4340511
## 21           GSM4340512     healthyControl   GSM4340512 GSM4340512
## 22           GSM4340513   acuteLymeDisease   GSM4340513 GSM4340513
## 23           GSM4340514   acuteLymeDisease   GSM4340514 GSM4340514
## 24           GSM4340515   acuteLymeDisease   GSM4340515 GSM4340515
## 25           GSM4340516   acuteLymeDisease   GSM4340516 GSM4340516
## 26           GSM4340517   acuteLymeDisease   GSM4340517 GSM4340517
## 27           GSM4340518   acuteLymeDisease   GSM4340518 GSM4340518
## 28           GSM4340519   acuteLymeDisease   GSM4340519 GSM4340519
## 29           GSM4340520   acuteLymeDisease   GSM4340520 GSM4340520
## 30           GSM4340521   acuteLymeDisease   GSM4340521 GSM4340521
## 31           GSM4340522   acuteLymeDisease   GSM4340522 GSM4340522
## 32           GSM4340523   acuteLymeDisease   GSM4340523 GSM4340523
## 33           GSM4340524   acuteLymeDisease   GSM4340524 GSM4340524
## 34           GSM4340525   acuteLymeDisease   GSM4340525 GSM4340525
## 35           GSM4340526   acuteLymeDisease   GSM4340526 GSM4340526
## 36           GSM4340527   acuteLymeDisease   GSM4340527 GSM4340527
## 37           GSM4340528   acuteLymeDisease   GSM4340528 GSM4340528
## 38           GSM4340529   acuteLymeDisease   GSM4340529 GSM4340529
## 39           GSM4340530   acuteLymeDisease   GSM4340530 GSM4340530
## 40           GSM4340531   acuteLymeDisease   GSM4340531 GSM4340531
## 41           GSM4340532   acuteLymeDisease   GSM4340532 GSM4340532
## 42           GSM4340533   acuteLymeDisease   GSM4340533 GSM4340533
## 43           GSM4340534   acuteLymeDisease   GSM4340534 GSM4340534
## 44           GSM4340535   acuteLymeDisease   GSM4340535 GSM4340535
## 45           GSM4340536   acuteLymeDisease   GSM4340536 GSM4340536
## 46           GSM4340537   acuteLymeDisease   GSM4340537 GSM4340537
## 47           GSM4340538   acuteLymeDisease   GSM4340538 GSM4340538
## 48           GSM4340539   acuteLymeDisease   GSM4340539 GSM4340539
## 49           GSM4340540   acuteLymeDisease   GSM4340540 GSM4340540
## 50           GSM4340541  Antibodies_1month   GSM4340541 GSM4340541
## 51           GSM4340542  Antibodies_1month   GSM4340542 GSM4340542
## 52           GSM4340543  Antibodies_1month   GSM4340543 GSM4340543
## 53           GSM4340544  Antibodies_1month   GSM4340544 GSM4340544
## 54           GSM4340545  Antibodies_1month   GSM4340545 GSM4340545
## 55           GSM4340546  Antibodies_1month   GSM4340546 GSM4340546
## 56           GSM4340547  Antibodies_1month   GSM4340547 GSM4340547
## 57           GSM4340548  Antibodies_1month   GSM4340548 GSM4340548
## 58           GSM4340549  Antibodies_1month   GSM4340549 GSM4340549
## 59           GSM4340550  Antibodies_1month   GSM4340550 GSM4340550
## 60           GSM4340551  Antibodies_1month   GSM4340551 GSM4340551
## 61           GSM4340552  Antibodies_1month   GSM4340552 GSM4340552
## 62           GSM4340553  Antibodies_1month   GSM4340553 GSM4340553
## 63           GSM4340554  Antibodies_1month   GSM4340554 GSM4340554
## 64           GSM4340555  Antibodies_1month   GSM4340555 GSM4340555
## 65           GSM4340556  Antibodies_1month   GSM4340556 GSM4340556
## 66           GSM4340557  Antibodies_1month   GSM4340557 GSM4340557
## 67           GSM4340558  Antibodies_1month   GSM4340558 GSM4340558
## 68           GSM4340559  Antibodies_1month   GSM4340559 GSM4340559
## 69           GSM4340560  Antibodies_1month   GSM4340560 GSM4340560
## 70           GSM4340561  Antibodies_1month   GSM4340561 GSM4340561
## 71           GSM4340562  Antibodies_1month   GSM4340562 GSM4340562
## 72           GSM4340563  Antibodies_1month   GSM4340563 GSM4340563
## 73           GSM4340564  Antibodies_1month   GSM4340564 GSM4340564
## 74           GSM4340565  Antibodies_1month   GSM4340565 GSM4340565
## 75           GSM4340566  Antibodies_1month   GSM4340566 GSM4340566
## 76           GSM4340567  Antibodies_1month   GSM4340567 GSM4340567
## 77           GSM4340568 Antibodies_6months   GSM4340568 GSM4340568
## 78           GSM4340569 Antibodies_6months   GSM4340569 GSM4340569
## 79           GSM4340570 Antibodies_6months   GSM4340570 GSM4340570
## 80           GSM4340571 Antibodies_6months   GSM4340571 GSM4340571
## 81           GSM4340572 Antibodies_6months   GSM4340572 GSM4340572
## 82           GSM4340573 Antibodies_6months   GSM4340573 GSM4340573
## 83           GSM4340574 Antibodies_6months   GSM4340574 GSM4340574
## 84           GSM4340575 Antibodies_6months   GSM4340575 GSM4340575
## 85           GSM4340576 Antibodies_6months   GSM4340576 GSM4340576
## 86           GSM4340577 Antibodies_6months   GSM4340577 GSM4340577

descriptors2$Sample_GEO_Accession==descriptors2$denormalized

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

descriptors2$Sample_GEO_Accession==descriptors2$normalized

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
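The two element-wise comparisons above print 86 values each. A shorter check, sketched here on a toy stand-in for descriptors2 (the three-row data frame and its values are made up for illustration), wraps the comparison in all() so that a single TRUE or FALSE comes back:

```r
# Toy stand-in for descriptors2; the IDs are hypothetical
ids <- data.frame(
  Sample_GEO_Accession = c("GSM1", "GSM2", "GSM3"),
  denormalized         = c("GSM1", "GSM2", "GSM3"),
  normalized           = c("GSM1", "GSM2", "GSM3"),
  stringsAsFactors     = FALSE
)

# all() collapses the element-wise comparison into one logical value
same_denorm <- all(ids$Sample_GEO_Accession == ids$denormalized)
same_norm   <- all(ids$Sample_GEO_Accession == ids$normalized)

# stopifnot() halts the script if the orders ever disagree
stopifnot(same_denorm, same_norm)
```

The same pattern should apply to the real columns as long as they are the same length and contain no NA values.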

The sample IDs are in the same order as our aliases for the class they belong to. Here are our unique classes; there are four of them.

unique(descriptors2$classDisease)

## [1] "healthyControl"     "acuteLymeDisease"   "Antibodies_1month" 
## [4] "Antibodies_6months"

We can still use our shorter names, or use gsub() to strip the information we don’t need from the extended names. Either way, we have to add a number to the end so that each column name is unique.

n21 <- as.character(c(1:21))
n28 <- as.character(c(1:28))
n27 <- as.character(c(1:27))
n10 <- as.character(c(1:10))

descriptors2$classDisease[1:21] <- paste(descriptors2$classDisease[1:21],n21,sep='_')
descriptors2$classDisease[22:49] <- paste(descriptors2$classDisease[22:49],n28,sep='_')
descriptors2$classDisease[50:76] <- paste(descriptors2$classDisease[50:76],n27,sep='_')
descriptors2$classDisease[77:86] <- paste(descriptors2$classDisease[77:86],n10,sep='_')

head(descriptors2)

##                       Sample_Title Sample_GEO_Accession     classDisease
## 1 PBMC total RNA-Healthy control 1           GSM4340492 healthyControl_1
## 2 PBMC total RNA-Healthy control 2           GSM4340493 healthyControl_2
## 3 PBMC total RNA-Healthy control 3           GSM4340494 healthyControl_3
## 4 PBMC total RNA-Healthy control 4           GSM4340495 healthyControl_4
## 5 PBMC total RNA-Healthy control 5           GSM4340496 healthyControl_5
## 6 PBMC total RNA-Healthy control 6           GSM4340497 healthyControl_6
##   denormalized normalized
## 1   GSM4340492 GSM4340492
## 2   GSM4340493 GSM4340493
## 3   GSM4340494 GSM4340494
## 4   GSM4340495 GSM4340495
## 5   GSM4340496 GSM4340496
## 6   GSM4340497 GSM4340497

descriptors2$classDisease

##  [1] "healthyControl_1"      "healthyControl_2"      "healthyControl_3"     
##  [4] "healthyControl_4"      "healthyControl_5"      "healthyControl_6"     
##  [7] "healthyControl_7"      "healthyControl_8"      "healthyControl_9"     
## [10] "healthyControl_10"     "healthyControl_11"     "healthyControl_12"    
## [13] "healthyControl_13"     "healthyControl_14"     "healthyControl_15"    
## [16] "healthyControl_16"     "healthyControl_17"     "healthyControl_18"    
## [19] "healthyControl_19"     "healthyControl_20"     "healthyControl_21"    
## [22] "acuteLymeDisease_1"    "acuteLymeDisease_2"    "acuteLymeDisease_3"   
## [25] "acuteLymeDisease_4"    "acuteLymeDisease_5"    "acuteLymeDisease_6"   
## [28] "acuteLymeDisease_7"    "acuteLymeDisease_8"    "acuteLymeDisease_9"   
## [31] "acuteLymeDisease_10"   "acuteLymeDisease_11"   "acuteLymeDisease_12"  
## [34] "acuteLymeDisease_13"   "acuteLymeDisease_14"   "acuteLymeDisease_15"  
## [37] "acuteLymeDisease_16"   "acuteLymeDisease_17"   "acuteLymeDisease_18"  
## [40] "acuteLymeDisease_19"   "acuteLymeDisease_20"   "acuteLymeDisease_21"  
## [43] "acuteLymeDisease_22"   "acuteLymeDisease_23"   "acuteLymeDisease_24"  
## [46] "acuteLymeDisease_25"   "acuteLymeDisease_26"   "acuteLymeDisease_27"  
## [49] "acuteLymeDisease_28"   "Antibodies_1month_1"   "Antibodies_1month_2"  
## [52] "Antibodies_1month_3"   "Antibodies_1month_4"   "Antibodies_1month_5"  
## [55] "Antibodies_1month_6"   "Antibodies_1month_7"   "Antibodies_1month_8"  
## [58] "Antibodies_1month_9"   "Antibodies_1month_10"  "Antibodies_1month_11" 
## [61] "Antibodies_1month_12"  "Antibodies_1month_13"  "Antibodies_1month_14" 
## [64] "Antibodies_1month_15"  "Antibodies_1month_16"  "Antibodies_1month_17" 
## [67] "Antibodies_1month_18"  "Antibodies_1month_19"  "Antibodies_1month_20" 
## [70] "Antibodies_1month_21"  "Antibodies_1month_22"  "Antibodies_1month_23" 
## [73] "Antibodies_1month_24"  "Antibodies_1month_25"  "Antibodies_1month_26" 
## [76] "Antibodies_1month_27"  "Antibodies_6months_1"  "Antibodies_6months_2" 
## [79] "Antibodies_6months_3"  "Antibodies_6months_4"  "Antibodies_6months_5" 
## [82] "Antibodies_6months_6"  "Antibodies_6months_7"  "Antibodies_6months_8" 
## [85] "Antibodies_6months_9"  "Antibodies_6months_10"
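The numbering above relies on hard-coded row ranges (1:21, 22:49, and so on). As a minimal sketch of an alternative, assuming the class labels are stored in a plain character vector, base R's ave() can count within each class so the ranges never have to be typed out. The vector cls below is rebuilt from the class sizes shown in the output (21, 28, 27, 10) purely for illustration:

```r
# Rebuild the class vector from the group sizes seen above (illustrative only)
cls <- rep(
  c("healthyControl", "acuteLymeDisease", "Antibodies_1month", "Antibodies_6months"),
  times = c(21, 28, 27, 10)
)

# ave() applies seq_along() within each class, giving 1..n per class
idx <- ave(seq_along(cls), cls, FUN = seq_along)

# Paste class and running index together to get unique column names
labels <- paste(cls, idx, sep = "_")
```

If a class gains or loses samples, this version adjusts on its own, whereas the fixed index ranges would need to be edited by hand.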

write.csv(descriptors2,'descriptors2.csv',row.names=F)

LymeDisease4 <- LymeDisease3
colnames(LymeDisease4)[2:87] <- descriptors2$classDisease
lymeMx2 <- lymeMx
colnames(lymeMx2)[2:87] <- descriptors2$classDisease

write.csv(LymeDisease3,'LymeDisease3.csv',row.names=FALSE)
write.csv(LymeDisease4,'LymeDisease4normalized-easynames.csv',row.names=FALSE)
write.csv(lymeMx2,'lymeMx2-denormalized-easynames.csv',row.names=FALSE)
write.csv(lymeMx,'lymeMx-denormalized-originalnames.csv',row.names=FALSE)

Now, we can use this data to find the mean values across samples and get the fold change values, then plot the data in Tableau.

LymeDisease5 <- LymeDisease4 %>% group_by(Gene) %>% summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)

lymeMx3 <- lymeMx2 %>% group_by(gene) %>% summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)
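To make it clearer what collapsing duplicate genes by their mean does, here is a small base R sketch on made-up data (the gene names and values are hypothetical). It uses aggregate() rather than the dplyr calls above, but the idea is the same: one row per gene, with each sample column averaged over the duplicates.

```r
# Toy expression table with gene "A" appearing twice (hypothetical values)
expr <- data.frame(
  Gene = c("A", "A", "B"),
  s1   = c(1, 3, 5),
  s2   = c(2, 4, 6),
  stringsAsFactors = FALSE
)

# Average every sample column within each gene
collapsed <- aggregate(. ~ Gene, data = expr, FUN = mean)
```

Gene A ends up with the average of its two rows in each sample column, and gene B is left unchanged.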

Lyme6 <- LymeDisease5 %>% group_by(Gene) %>% 
  mutate(
    healthy_Mean = mean(healthyControl_1:healthyControl_21,na.rm=T),
    acuteLymeDisease_Mean = mean(acuteLymeDisease_1:acuteLymeDisease_28,na.rm=T),
    antibodies_1month_Mean = mean(Antibodies_1month_1:Antibodies_1month_27,na.rm=T),
    antibodies_6month_Mean = mean(Antibodies_6months_1:Antibodies_6months_10,na.rm=T)
  )

tail(colnames(Lyme6),5)

## [1] "Antibodies_6months_10"  "healthy_Mean"           "acuteLymeDisease_Mean" 
## [4] "antibodies_1month_Mean" "antibodies_6month_Mean"
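One thing worth double-checking in the class-mean step: inside mutate(), an expression of the form mean(col_a:col_b) applies R's sequence operator to the values stored in those two columns, rather than selecting all the columns between them. The toy sketch below (column names and values are made up) shows the difference, and one possible row-wise alternative using base rowMeans(). It is only an illustration under those assumptions, not a drop-in replacement for the pipeline above.

```r
# Toy data: two genes, three "healthy" sample columns (hypothetical values)
toy <- data.frame(
  gene = c("G1", "G2"),
  h_1  = c(1, 10),
  h_2  = c(50, 20),
  h_3  = c(3, 30),
  stringsAsFactors = FALSE
)

# `:` on the values of the first gene builds the sequence 1, 2, 3,
# so the middle column (50) never enters the calculation
seq_mean <- mean(toy$h_1[1]:toy$h_3[1])

# Row-wise mean across all three sample columns
row_mean <- rowMeans(toy[, c("h_1", "h_2", "h_3")], na.rm = TRUE)
```

For the first gene the sequence version gives 2, while the row-wise mean of 1, 50, and 3 is 18, so the two approaches can differ a lot.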

lymeMx4 <- lymeMx3 %>% group_by(gene) %>% 
  mutate(
    healthy_Mean = mean(healthyControl_1:healthyControl_21,na.rm=T),
    acuteLymeDisease_Mean = mean(acuteLymeDisease_1:acuteLymeDisease_28,na.rm=T),
    antibodies_1month_Mean = mean(Antibodies_1month_1:Antibodies_1month_27,na.rm=T),
    antibodies_6month_Mean = mean(Antibodies_6months_1:Antibodies_6months_10,na.rm=T)
  )

tail(colnames(lymeMx4),5)

## [1] "Antibodies_6months_10"  "healthy_Mean"           "acuteLymeDisease_Mean" 
## [4] "antibodies_1month_Mean" "antibodies_6month_Mean"

lymeMx5 <- lymeMx4 %>% group_by(gene) %>% 
  mutate(acuteHealthy_foldChange=acuteLymeDisease_Mean/healthy_Mean,
    antibodies_1month_healthy_foldChange=antibodies_1month_Mean/healthy_Mean,
    antibodies_6month_healthy_foldchange=antibodies_6month_Mean/healthy_Mean)

tail(colnames(lymeMx5),10)

##  [1] "Antibodies_6months_8"                
##  [2] "Antibodies_6months_9"                
##  [3] "Antibodies_6months_10"               
##  [4] "healthy_Mean"                        
##  [5] "acuteLymeDisease_Mean"               
##  [6] "antibodies_1month_Mean"              
##  [7] "antibodies_6month_Mean"              
##  [8] "acuteHealthy_foldChange"             
##  [9] "antibodies_1month_healthy_foldChange"
## [10] "antibodies_6month_healthy_foldchange"

Lyme7 <- Lyme6 %>% group_by(Gene) %>% 
  mutate(acuteHealthy_foldChange=acuteLymeDisease_Mean/healthy_Mean,
    antibodies_1month_healthy_foldChange=antibodies_1month_Mean/healthy_Mean,
    antibodies_6month_healthy_foldchange=antibodies_6month_Mean/healthy_Mean)

tail(colnames(Lyme7),10)

##  [1] "Antibodies_6months_8"                
##  [2] "Antibodies_6months_9"                
##  [3] "Antibodies_6months_10"               
##  [4] "healthy_Mean"                        
##  [5] "acuteLymeDisease_Mean"               
##  [6] "antibodies_1month_Mean"              
##  [7] "antibodies_6month_Mean"              
##  [8] "acuteHealthy_foldChange"             
##  [9] "antibodies_1month_healthy_foldChange"
## [10] "antibodies_6month_healthy_foldchange"
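Since the original values are described as log2 normalized, it may help to compare the ratio of means used above with another common convention for log-scale data, where the log2 fold change is the difference of the class means and the linear fold change is 2 raised to that difference. The sketch below uses made-up values for a single gene and assumes the inputs really are on a log2 scale:

```r
# Hypothetical log2 expression values for one gene in two classes
healthy_log2 <- c(-1.0, -0.5, -1.5)
acute_log2   <- c(1.0, 1.5, 0.5)

# Difference of means on the log2 scale
log2_fc <- mean(acute_log2) - mean(healthy_log2)

# Back-transform to a linear fold change
linear_fc <- 2^log2_fc

# Ratio of means on the same log2 values, for comparison
ratio_of_means <- mean(acute_log2) / mean(healthy_log2)
```

Here the difference of means is 2 (a 4-fold increase on the linear scale), while the ratio of the two log2 means is -1. The ratio also grows very large whenever the healthy mean is close to zero.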

Our tables are now built: duplicate genes have been collapsed by taking the mean of each gene within each sample, each class’s mean expression per gene has been added, and the fold change ratio of the diseased or treated class to the healthy class has been computed. The normalized (original) data is the Lyme7 data frame and the denormalized data is the lymeMx5 data frame. Each shrank from 48851 rows to 19526 unique genes after grouping by gene. That is still a lot of genes, so let’s take the genes with the 10 highest and 10 lowest acute/healthy fold change values in both data frames, then the top 10 and bottom 10 by 1 month of antibodies/healthy fold change, and finally the top 10 and bottom 10 by 6 months of antibodies/healthy fold change. The denormalized group first:

Acute/healthy top 10 and bottom 10 genes by fold change data frame:

acuteHealthy20 <- lymeMx5[order(lymeMx5$acuteHealthy_foldChange,
                                decreasing=T)[c(1:10,19517:19526)],]

One month/healthy top 10 and bottom 10 genes by fold change data frame:

month1healthy20 <- lymeMx5[order(lymeMx5$antibodies_1month_healthy_foldChange,
                                 decreasing=T)[c(1:10,19517:19526)],]

Six month/healthy top 10 and bottom 10 genes by fold change data frame:

month6healthy20 <- lymeMx5[order(lymeMx5$antibodies_6month_healthy_foldchange,
                                 decreasing=T)[c(1:10,19517:19526)],]

lymeMx6 <- rbind(acuteHealthy20,month1healthy20,month6healthy20)
lymeMx7 <- lymeMx6[!duplicated(lymeMx6),]

Out of the 60 genes selected (the top 10 and bottom 10 from each of the three fold change groups), 43 were unique in the denormalized data.
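The selections above index rows 1:10 and 19517:19526 directly, which depends on the table having exactly 19526 rows. Also, order() places NA values last by default, so any missing fold changes would land in the "bottom 10". A minimal sketch of a more defensive helper follows. The function name top_bottom and the toy data are hypothetical, and dropping non-finite values is an assumption about what is wanted:

```r
# Hypothetical helper: n highest and n lowest rows by a numeric column
top_bottom <- function(df, col, n = 10) {
  # Keep only finite values (drops NA, NaN, Inf and -Inf)
  ok  <- df[is.finite(df[[col]]), , drop = FALSE]
  ord <- ok[order(ok[[col]], decreasing = TRUE), , drop = FALSE]
  rbind(head(ord, n), tail(ord, n))
}

# Toy fold change table (made-up values)
toy <- data.frame(
  gene = paste0("g", 1:8),
  fc   = c(5, NA, -2, 9, 0.5, Inf, 3, -7),
  stringsAsFactors = FALSE
)

picked <- top_bottom(toy, "fc", n = 2)
```

With n = 2 the toy example keeps g4 and g1 from the top and g3 and g8 from the bottom, skipping the NA and Inf rows.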

Now, for the normalized data:

Acute/healthy top 10 and bottom 10 genes by fold change data frame:

acuteHealthy20b <- Lyme7[order(Lyme7$acuteHealthy_foldChange,
                                decreasing=T)[c(1:10,19517:19526)],]

One month/healthy top 10 and bottom 10 genes by fold change data frame:

month1healthy20b <- Lyme7[order(Lyme7$antibodies_1month_healthy_foldChange,
                                 decreasing=T)[c(1:10,19517:19526)],]

Six month/healthy top 10 and bottom 10 genes by fold change data frame:

month6healthy20b <- Lyme7[order(Lyme7$antibodies_6month_healthy_foldchange,
                                 decreasing=T)[c(1:10,19517:19526)],]

Lyme8 <- rbind(acuteHealthy20b,month1healthy20b,month6healthy20b)
Lyme9 <- Lyme8[!duplicated(Lyme8),]

There are 33 unique genes in the normalized data, probably because this data had negative values. The scaling done to denormalize the data probably does not reproduce the true raw values exactly. Both sets should have the same number of genes, but this one has 10 fewer than the denormalized data. We will see later which one can be split into training and testing sets with better prediction accuracy within each class and overall.

Let’s also add the gene summaries to these data frames and create a field that gives the class of each sample. The file geneCards2.R is an R script sourced for the functions made in previous scripts. We lose one of the genes in the original data frame because it isn’t in genecards.org, and end up with 32 instead of 33 genes for that data frame.

source('geneCards2.R')

## Warning: package 'rvest' was built under R version 3.6.3

## Loading required package: xml2

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

LOC400657 (#23 in list) is a gene that genecards.org doesn’t recognize and it will throw an error, so we should skip it.

for (i in Lyme9$Gene[1:22]){
  getSummaries2(i,'protein')
}

for (i in Lyme9$Gene[24:33]){
  getSummaries2(i,'protein')
}
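Skipping gene #23 by position works, but it has to be redone whenever the gene list changes. A minimal sketch of another option is to wrap each lookup in tryCatch() so that a failing gene is recorded and the loop carries on. Here fetch_summary() is a hypothetical stand-in for getSummaries2() (whose internals aren't shown in this post), written to fail on the one gene that genecards.org doesn't recognize:

```r
# Hypothetical stand-in for getSummaries2(); errors on an unknown gene
fetch_summary <- function(gene) {
  if (gene == "LOC400657") stop("gene not found")
  paste("summary for", gene)
}

genes     <- c("ISG20", "LOC400657", "CENPF")
failed    <- character(0)
summaries <- list()

for (g in genes) {
  # On error, return NULL instead of stopping the whole loop
  res <- tryCatch(fetch_summary(g), error = function(e) NULL)
  if (is.null(res)) {
    failed <- c(failed, g)      # remember which genes were skipped
  } else {
    summaries[[g]] <- res
  }
}
```

After the loop, failed lists the genes that could not be looked up, which makes it easy to report them rather than hard-coding their positions.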

getGeneSummaries('protein')

summsLyme9 <- read.csv("proteinGeneSummaries_protein.csv")

for (i in lymeMx7$gene){
  getSummaries2(i,'immune')
}

getGeneSummaries('immune')

summsLymeMx7 <- read.csv("proteinGeneSummaries_immune.csv")

Lyme10 <- merge(summsLyme9,Lyme9,by.x='gene', by.y='Gene')
lymeMx8 <- merge(summsLymeMx7,lymeMx7, by.x='gene',by.y='gene')
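By default merge() keeps only the rows whose keys appear in both tables, so a summaries file that holds a different set of genes gives an empty result without any warning. A quick sanity check, sketched on toy tables (the values are invented and the gene symbols are only used as labels), is to look at the row count and at which keys failed to match:

```r
# Toy summaries and expression tables with no gene in common
summs <- data.frame(
  gene    = c("AANAT", "VDR"),
  summary = c("s1", "s2"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  Gene = c("ISG20", "CENPF"),
  fc   = c(-3, 4),
  stringsAsFactors = FALSE
)

# Inner join: only genes present in both tables survive
merged <- merge(summs, expr, by.x = "gene", by.y = "Gene")

# Genes in the expression table that have no summary row
missing_keys <- setdiff(expr$Gene, summs$gene)
```

Checking nrow(merged) and missing_keys right after the merge can show whether the summaries file actually matches the genes being merged.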

Let’s create those classes for each data frame. But first we have to tidy the data.

Lyme11 <- gather(Lyme10, key='classSample',value='classValue',8:93)
lymeMx9 <- gather(lymeMx8,key='classSample',value='classValue',8:93)

Lyme11$class <- Lyme11$classSample
Lyme11$class <- gsub('^hea.*$','healthy',Lyme11$class, perl=T)
Lyme11$class <- gsub('^acute.*$','acute Lyme Disease',Lyme11$class,perl=T)
Lyme11$class <- gsub('^.*1month.*$','1 month treatment', Lyme11$class, perl=T)
Lyme11$class <- gsub('^.*6month.*$','6 months treatment',Lyme11$class,perl=T)

lymeMx9$class <- lymeMx9$classSample
lymeMx9$class <- gsub('^hea.*$','healthy',lymeMx9$class, perl=T)
lymeMx9$class <- gsub('^acute.*$','acute Lyme Disease',lymeMx9$class,perl=T)
lymeMx9$class <- gsub('^.*1month.*$','1 month treatment', lymeMx9$class, perl=T)
lymeMx9$class <- gsub('^.*6month.*$','6 months treatment',lymeMx9$class,perl=T)
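The four gsub() calls above map each sample label to its class by pattern. As a sketch of an equivalent approach, assuming every label ends in an underscore followed by a number, the numeric suffix can be stripped once and the remainder looked up in a named vector. The sample labels below are taken from the naming scheme used earlier, and the lookup vector is just one way to organize the mapping:

```r
samples <- c("healthyControl_3", "acuteLymeDisease_12",
             "Antibodies_1month_27", "Antibodies_6months_10")

# Drop the trailing "_<number>" to recover the base class name
base <- sub("_[0-9]+$", "", samples)

# Named vector mapping base names to display labels
lookup <- c(
  healthyControl     = "healthy",
  acuteLymeDisease   = "acute Lyme Disease",
  Antibodies_1month  = "1 month treatment",
  Antibodies_6months = "6 months treatment"
)

class_label <- unname(lookup[base])
```

A label that doesn't match any name in lookup comes back as NA, which makes unexpected sample names easy to spot.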

unique(Lyme11$gene)

## factor(0)
## 61 Levels: AANAT ANKH APOA1 APOB CACNA1B CALCA CALCR CALCRL CASR ... VDR

unique(lymeMx9$gene)

##  [1] AREG     BEST1    BPI      CAMP     CEACAM8  CHI3L1   CKMT1B   CTSG    
##  [9] CXCL2    DEFA1    DEFA4    DHX58    FCGR3B   FSIP1    GAPT     GZMH    
## [17] HBG1     HLA-DRB4 HTR3C    IL1B     KIAA1245 KIR2DL3  KIR2DS1  LCN2    
## [25] LIPN     LSM2     LTF      MS4A3    MUC12    MYOM2    OLR1     OR2B11  
## [33] POLR2I   RGS18    S100B    SERPINB2 THBD     THBS1    TNFSF10  TSIX    
## [41] TXNL4A   XIST    
## 43 Levels: AREG BEST1 BPI C7ORF55 CAMP CEACAM8 CHI3L1 CKMT1B CTSG ... XIST

It looks like the genes aren’t even the same genes.

unique(Lyme11$gene) %in% unique(lymeMx9$gene)

## logical(0)
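When comparing two gene lists, it helps to compare the gene columns as vectors and to look at the overlap directly. A minimal sketch with made-up gene names shows the base R set functions that do this:

```r
# Hypothetical gene lists from two filtered tables
genes_a <- c("G1", "G2", "G3")
genes_b <- c("G3", "G4")

shared     <- intersect(genes_a, genes_b)   # genes in both lists
only_in_a  <- setdiff(genes_a, genes_b)     # genes only in the first list
any_shared <- any(genes_a %in% genes_b)     # single TRUE/FALSE answer
```

A result of logical(0) from %in% means the left-hand vector was empty, so it is worth checking length() on both inputs before reading anything into the comparison.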

Apparently, they are not the same genes. That’s OK; maybe they still offer some information. The technique used to invert what was assumed to be the normalization method is the standard one for typical studies. In bioinformatics, with gene expression data, there is usually more to it, like trimming the bottom and top outliers, applying quantile normalization, and then scaling. We assumed the values were standardized to between 0 and 1 and then log2 transformed, f(x) = log2[(x - min(x)) / (max(x) - min(x))] = y, so the inverse would be f(y) = 2^y = x. There is some logic to this, but information can be lost along the way: rounding limits the number of significant digits available when inverting the base 2 log, and the exact max and min of x are needed to undo the scaling. As a reminder, when I demonstrated this earlier, the procedure worked on 10 values that included a 0, where a small value was added so that the log2 of x = 0 could be taken without an error, but the values at the final step were still decimals. To fix this they were turned into fractions whose denominator was max(x), so multiplying each value by that denominator returned the original x values in our list of 10. When adding that step to the last step we used to denormalize this data, the values were extremely large, approximately 10^3 to 10^4 times larger, so we stopped before taking the fractional values. We will continue with these genes in our machine learning to see whether either set makes good gene targets for the pathogenesis of Lyme disease, judged by how accurately the classes can be predicted: healthy, acute disease, 1 month convalescing or developing antibodies after being given a regimen of antibiotics, and 6 months convalescing after being given antibiotics.
This is temporal, or time-specific, data, and there were some discrepancies in how the study was carried out. It spanned 2 years, and some patients didn’t know how long they had been infected, but if they had symptoms such as facial paralysis or the skin lesion type marks they were assumed to be suffering from Lyme disease. Also, some patients dropped out. If the study spanned two years and only the last 6 months recorded the 6-month convalescent samples, then either the first batches of acute-phase patients weren’t being recorded, or they were actually being monitored anywhere from six months up to two years after being given antibiotics. So we can imagine the data might be skewed by these differences or discrepancies.
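The normalization and its inverse described above can be sketched as a round trip on a handful of made-up values. This assumes min-max scaling followed by log2, with a small offset eps (a hypothetical choice) added so that the minimum value does not produce log2(0). It is an illustration of the idea only, not a reconstruction of how GSE145974 was actually processed:

```r
x   <- c(2, 5, 9, 14, 20)     # made-up raw values
eps <- 1e-6                   # small offset, an assumption for this sketch

x_min <- min(x)
x_max <- max(x)

# Forward: scale to [0, 1], then take log2
scaled <- (x - x_min) / (x_max - x_min)
y      <- log2(scaled + eps)

# Inverse: undo the log2, remove the offset, then undo the scaling
scaled_back <- 2^y - eps
x_back      <- scaled_back * (x_max - x_min) + x_min
```

The round trip only recovers x because the exact x_min, x_max, and eps are known. With rounded or unknown values for any of them, 2^y alone returns the scaled values rather than the originals.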

Let’s write these two tables out to csv files to analyze visually in Tableau.

write.csv(Lyme11,'LymeDisease_originalValues_foldchages32.csv',row.names=F)

write.csv(lymeMx9,'LymeDisease_denormalizedValues_foldchanges43.csv',row.names=F)

Tableau Images of Charts

Let’s see what charts were created in Tableau with our de-normalized (de-standardized) data.

Tableau Dashboard of Lyme Disease De-Standardized

The dashboard is available at: https://public.tableau.com/profile/janis5126#!/vizhome/LymeDiseaseDashboardGSE145974/LymeDiseaseDashboardGSE145974?publish=yes

Dashboard De-standardized Lyme Disease Data GSE145974

In the image above, and in the dashboard if you follow the link, you can see the genes in the ‘Gene Filtering’ box on the right, with the gene summaries appearing when you hover over the text. The dashboard will show only the genes you select, displaying the median gene expression values within each class: healthy, acute Lyme disease, one month after antibiotic treatment, and six months after antibiotic treatment. Class sizes vary due to changes in patient participation and methods during the study. The top chart, in warm colors, shows the median gene expression values for each of the 43 genes that were filtered from roughly 19,000 genes as having the highest or lowest disease-to-healthy or treatment-to-healthy fold change, taking the top 10 and bottom 10 genes by fold change in each of the three comparisons and removing duplicates. The lower left chart, in greens, shows the fold change values for each gene within each class (acute Lyme disease, one month of treatment, or six months of treatment) compared to healthy samples, using the mean of all samples in each class. The lower right chart, in purples, is a tree map categorized by class; within each class, each box is a gene with its average gene expression value in that class. The upper right box shows that the gene DEFA1 was selected, and it is displayed in all three accompanying charts on the dashboard.

The following images are the charts that are in the dashboard above.

link to image of the bar chart of fold change values.

Bar chart of Fold Change Values

link to image of the bar chart of median gene expression values.

Bar chart of Median Gene Expression Values

link to image of the treechart of average gene expression values within each class.

Treemap chart of the average gene expression values for each gene within each of the four classes

We will perform machine learning on this data in upcoming additions to this post. But first we will also look at the genes from the original data that are the most or least expressed in our Lyme disease data, obtained from NCBI under accession ID GSE145974.

The original data, with log2 normalized values that include negative values, was used to make a similar dashboard of charts to the one above for the de-standardized GSE145974 Lyme disease data. However, because of the negative values, the treemap chart was replaced with a highlight chart, as the treemap removed 45 negative values. This data used a total of 32 genes (technically 33, but one didn’t have a gene summary so it was omitted) and compared the most and least expressed genes, as up or down regulated in the disease (acute) or treatment (1 month or 6 months of antibiotics) classes relative to the healthy samples, all as the fold change of class sample mean values per gene.

dashboard of the original GSE145974 top changing genes

image dashboard of the original GSE145974 top changing genes

Figure 5: The above image is a dashboard of the original log2 normalized data, which has negative as well as positive values. The filter at the top right will display only the gene or genes selected; select one, then use ctrl + click on each additional gene in that ‘Gene Filter’ box. The other three charts show the selected gene: at the top, the median gene expression values within each class along with the number of samples in each class; in the lower left, the fold change values for that gene in the three diseased or treatment classes relative to healthy, as a ratio of means; and in the lower right, the highlight table with a color gradient of oranges and grays. The oranges are the lowest values, meaning the down-regulated, negative gene expression values relative to the median, and the grays are the genes with more positive gene expression values, or up regulation, in each class of acute, 1 month of treatment, 6 months of treatment, or healthy. The image above links to this dashboard, where you can try out the different genes associated with Lyme disease and treatment. I just read an article about Gigi Hadid having had Lyme disease since age 14; she is now 21 years old, with brain fog, joint pain and stiffness, light sensitivity, headaches, anxiety, and possibly other symptoms like face paralysis related to Lyme disease.

Highlight chart of original GSE145974 data

image of highlight chart of original GSE145974 data

Figure 6: The above image is a highlight chart of the average gene expression values within each class: acute, one month of treatment, six months of treatment, and healthy. This chart was used instead of the treemap chart used for the de-standardized data because it allows negative values; treemap charts do not, and would have eliminated 45 gene values across these samples. The genes are color coded on a gradient, so that under expressed genes with negative values are reddish-orange and genes with higher, up regulated expression values are gray. Colors in between the reddish-orange and gray are for genes that didn’t change as much up or down in gene expression. After six months of treatment one gene, ISG20, is highly under expressed; it is in the middle of the chart in a reddish color indicating that it is the most under expressed gene, at the lowest end of the gene expression values. The gene summary for this gene is in the dashboard that Figure 5 links to. The Entrez gene summary says Hepatitis C and Yellow fever are associated with abnormalities in this gene and that its network involvements include the innate immune system. A gene with the highest up regulation is CENPF, which is highly up regulated in the acute phase. The gene summary for this gene says it could possibly have some involvement with chromosome segregation during mitosis and that it encodes a protein associated with the centromere-kinetochore complex. Also, autoantibodies that target this gene, CENPF, have been found in cancer patients. A quick Wikipedia search says autoantibodies are antibodies your own body produces that target your own body’s proteins.

Original Fold Change Values GSE145974

image of original GSE145974 fold change values

Figure 7: The image above links to the chart of fold change values for each gene in the original Lyme disease data, across all samples. Like the other charts, it is bidirectional in direction and color gradient, because this data has negative values from the log2 normalization. Negative values indicate down regulation and positive values indicate up regulation, or you can think of suppression versus explosion if the magnitude is dramatic enough relative to the neighboring genes. All these genes are the most or least expressed of all genes in the data using the original values, so they should show some visible change. The gene CTXN3 is shown to be highly down regulated, with a fold change of -128,817 when comparing patients’ average expression of this gene after six months of treatment to the healthy samples. On average it was down regulated by approximately 10^-6 relative to the healthy samples. That is a very large magnitude, and CTXN3 could possibly be a target gene for having the disease or antibodies. It is important to get treatment early to avoid symptoms, but this could be indicating that some people still have symptoms and that treatment might not work well, or at all, because the other classes (acute, one month of treatment, and healthy) don’t show this magnitude of down regulation at all. Keep in mind that this class had roughly a third of the acute sample size (10 versus 28 patients). The gene summary for CTXN3 in the dashboard says it is a protein coding gene and that Autotopagnosia and Clear Cell Adenoma are diseases associated with it. Autotopagnosia is the inability to identify or locate one’s own body parts. Clear Cell Adenoma is a rare vaginal/cervical cancer usually linked to diethylstilbestrol (DES) exposure in utero.
The daughters of mothers exposed to DES are more likely to get Clear Cell Adenoma, and CTXN3, a gene that is highly under expressed in our 6 months of treatment group, is associated with that disease. The association could mean either a lower risk or an increased risk from producing less of it than the healthy and acute disease classes do.

Median Gene Expression Values original GSE145974 data

image of median gene expression values of original GSE145974 data

Figure 8: The above image links to the bar chart that is bidirectional like the fold change chart in Figure 7, but this chart shows the median gene expression values across all four classes for each gene. The number of samples in each class is also labeled on each bar. Scrolling through the genes in the chart you will see other genes like ENO1, which has a gene summary stating it encodes alpha-enolase, one of three enolase isoenzymes found in mammals. This gene is associated with an autoantigen in Hashimoto encephalopathy, which sounds like another autoimmune contributor. We see it is dramatically under regulated in the healthy samples and also under regulated in the samples from patients who received six months of antibiotic treatment. But in the acute phase it is up regulated almost 50% more than the healthy samples and in the acute samples it is also up regulated but by about 25% of the healthy samples. ISG20 is very highly under regulated in the 6 month class, at about 10 fold the healthy class median, which is also under regulated. We saw this gene earlier in our highlight chart as being associated with yellow fever and hepatitis C as well as innate immunity network signaling. It is the most under regulated gene of all. Another gene, RNF168, is also highly under regulated in the 6 month class, while by visual inspection the healthy class and 1 month class are up regulated in this gene by 3-4 fold more than the 6 month class. RNF168 has an Entrez gene summary that states it is involved in DNA Double-Strand Break (DSB) repair and that it has mutations associated with Riddle syndrome. Wikipedia says this is a rare genetic disease whose name is an acronym for Radiosensitivity, ImmunoDeficiency, Dysmorphic features, and LEarning difficulties.

There are a lot of different genes with useful information, and they are the top genes by change in gene expression in either data set, but we still need to test these genes with machine learning to see how well they predict the classes. We will get to that shortly.

Let's start the machine learning by first making the data frames, with the class as the output or target feature, the samples as observations, and the genes as predictors, from both sets separately.

The data frame of the 43 de-standardized genes will be created first, then the one for the 32 original genes, which are a completely different set. Both sets are the top or bottom 10 genes by fold change (acute/healthy, 1 month/healthy, or 6 months/healthy, each computed from the means of the respective class samples), filtered from their respective 19526 unique genes.
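As a rough illustration of that kind of filter (a sketch under assumptions, not the code that produced the 43 or 32 genes), the fold change can be taken as a ratio of class means and the n largest and n smallest values kept. The table `toyMeans`, the column `acute_Mean`, and the helper `topBottom` are hypothetical, and the values are invented.

```r
# toy table of class means for six genes (values invented)
toyMeans <- data.frame(gene = paste0('g', 1:6),
                       healthy_Mean = c(10, 10, 10, 10, 10, 10),
                       acute_Mean   = c(40, 5, 10, 20, 2, 30))

# fold change of acute over healthy, as a ratio of the class means
toyMeans$acuteHealthy_foldChange <- toyMeans$acute_Mean / toyMeans$healthy_Mean

# keep the n most up regulated and the n most down regulated genes
topBottom <- function(df, fcCol, n) {
  ord <- df[order(df[[fcCol]], decreasing = TRUE), ]
  rbind(head(ord, n), tail(ord, n))
}

picked <- topBottom(toyMeans, 'acuteHealthy_foldChange', n = 2)
as.character(picked$gene)
```

Repeating this for each of the three fold change columns and taking the union of the picked genes gives fewer than 3 x 20 genes whenever a gene is extreme in more than one comparison.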

First, the de-standardized set. Let's just name our data sets something silly to keep track of them. Dance is the de-standardized set and Stand is the original log2 normalized set.

The Dance Machine Learning set, made from the lymeMx7, not-tidied, de-standardized data frame:

colnames(lymeMx7)

##  [1] "gene"                                
##  [2] "healthyControl_1"                    
##  [3] "healthyControl_2"                    
##  [4] "healthyControl_3"                    
##  [5] "healthyControl_4"                    
##  [6] "healthyControl_5"                    
##  [7] "healthyControl_6"                    
##  [8] "healthyControl_7"                    
##  [9] "healthyControl_8"                    
## [10] "healthyControl_9"                    
## [11] "healthyControl_10"                   
## [12] "healthyControl_11"                   
## [13] "healthyControl_12"                   
## [14] "healthyControl_13"                   
## [15] "healthyControl_14"                   
## [16] "healthyControl_15"                   
## [17] "healthyControl_16"                   
## [18] "healthyControl_17"                   
## [19] "healthyControl_18"                   
## [20] "healthyControl_19"                   
## [21] "healthyControl_20"                   
## [22] "healthyControl_21"                   
## [23] "acuteLymeDisease_1"                  
## [24] "acuteLymeDisease_2"                  
## [25] "acuteLymeDisease_3"                  
## [26] "acuteLymeDisease_4"                  
## [27] "acuteLymeDisease_5"                  
## [28] "acuteLymeDisease_6"                  
## [29] "acuteLymeDisease_7"                  
## [30] "acuteLymeDisease_8"                  
## [31] "acuteLymeDisease_9"                  
## [32] "acuteLymeDisease_10"                 
## [33] "acuteLymeDisease_11"                 
## [34] "acuteLymeDisease_12"                 
## [35] "acuteLymeDisease_13"                 
## [36] "acuteLymeDisease_14"                 
## [37] "acuteLymeDisease_15"                 
## [38] "acuteLymeDisease_16"                 
## [39] "acuteLymeDisease_17"                 
## [40] "acuteLymeDisease_18"                 
## [41] "acuteLymeDisease_19"                 
## [42] "acuteLymeDisease_20"                 
## [43] "acuteLymeDisease_21"                 
## [44] "acuteLymeDisease_22"                 
## [45] "acuteLymeDisease_23"                 
## [46] "acuteLymeDisease_24"                 
## [47] "acuteLymeDisease_25"                 
## [48] "acuteLymeDisease_26"                 
## [49] "acuteLymeDisease_27"                 
## [50] "acuteLymeDisease_28"                 
## [51] "Antibodies_1month_1"                 
## [52] "Antibodies_1month_2"                 
## [53] "Antibodies_1month_3"                 
## [54] "Antibodies_1month_4"                 
## [55] "Antibodies_1month_5"                 
## [56] "Antibodies_1month_6"                 
## [57] "Antibodies_1month_7"                 
## [58] "Antibodies_1month_8"                 
## [59] "Antibodies_1month_9"                 
## [60] "Antibodies_1month_10"                
## [61] "Antibodies_1month_11"                
## [62] "Antibodies_1month_12"                
## [63] "Antibodies_1month_13"                
## [64] "Antibodies_1month_14"                
## [65] "Antibodies_1month_15"                
## [66] "Antibodies_1month_16"                
## [67] "Antibodies_1month_17"                
## [68] "Antibodies_1month_18"                
## [69] "Antibodies_1month_19"                
## [70] "Antibodies_1month_20"                
## [71] "Antibodies_1month_21"                
## [72] "Antibodies_1month_22"                
## [73] "Antibodies_1month_23"                
## [74] "Antibodies_1month_24"                
## [75] "Antibodies_1month_25"                
## [76] "Antibodies_1month_26"                
## [77] "Antibodies_1month_27"                
## [78] "Antibodies_6months_1"                
## [79] "Antibodies_6months_2"                
## [80] "Antibodies_6months_3"                
## [81] "Antibodies_6months_4"                
## [82] "Antibodies_6months_5"                
## [83] "Antibodies_6months_6"                
## [84] "Antibodies_6months_7"                
## [85] "Antibodies_6months_8"                
## [86] "Antibodies_6months_9"                
## [87] "Antibodies_6months_10"               
## [88] "healthy_Mean"                        
## [89] "acuteLymeDisease_Mean"               
## [90] "antibodies_1month_Mean"              
## [91] "antibodies_6month_Mean"              
## [92] "acuteHealthy_foldChange"             
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"

Let's remove the fold change and mean value features from our lymeMx7 data frame and save it as ‘Dance’ after we transpose it to get the unique genes as predictors and the samples as observations.

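# drop the four class mean and three fold change columns (88 to 94)
dance <- lymeMx7[,-c(88:94)]
danceSampleNames <- colnames(dance)[2:87]

# positions of each class among the 86 sample names
month1 <- grep('1month',danceSampleNames)
month6 <- grep('6month',danceSampleNames)
healthy <- grep('healthy',danceSampleNames)
acute <- grep('acute',danceSampleNames)

# one class label per sample, in the same order as the sample names
class <- danceSampleNames
class[month1] <- '1 month'
class[month6] <- '6 months'
class[healthy] <- 'healthy'
class[acute] <- 'acute'


# transpose so the samples are rows and the genes are columns
danceGeneNames <- dance$gene
Dance <- as.data.frame(t(dance[,-1]))
colnames(Dance) <- danceGeneNames
Dance$class <- class
# move class (column 44) in front of the 43 genes
Dance2 <- Dance[,c(44,1:43)]
head(Dance2)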

##                    class      LCN2      LTF  CEACAM8    DEFA4     CAMP      BPI
## healthyControl_1 healthy 18.345169 36.72210 40.13472 38.79050 25.56147 23.02797
## healthyControl_2 healthy 33.503142 75.86353 59.99411 68.67612 61.39548 44.71151
## healthyControl_3 healthy 10.400323 21.55983 29.53441 28.76057 15.62534 29.02589
## healthyControl_4 healthy 12.799352 13.72309 18.74716 11.98227 16.94088 18.36342
## healthyControl_5 healthy 20.690155 21.22504 26.42882 21.02267 27.18118 33.89164
## healthyControl_6 healthy  6.900668 18.82061 18.28476 19.84215 20.98911 33.41352
##                      MS4A3   TNFSF10    FCGR3B     DEFA1      IL1B    CKMT1B
## healthyControl_1 14.804557  3.757020 23.042961 48.895788 156.60191 219.10525
## healthyControl_2 45.414723 28.335535 48.768151 66.449316  46.72271  39.60314
## healthyControl_3  7.174274  8.998383 11.442811 23.156031  14.95010  83.29841
## healthyControl_4 39.871848 13.483297  8.406282  7.708771  25.60839  15.19072
## healthyControl_5 24.112126 24.915041 27.191459 13.763571  16.24523  24.02644
## healthyControl_6  6.043684 12.700514 35.130123 23.632009  26.42206  24.03571
##                       THBD     HTR3C    TXNL4A     DHX58     MUC12      LSM2
## healthyControl_1 347.32608 224.84244 275.64991 314.02427 257.42280 714.84725
## healthyControl_2  69.89785  41.26142  27.09973  40.93505  30.79249  46.27268
## healthyControl_3  30.60276  28.55769  17.62511  30.02908  97.04293  24.66836
## healthyControl_4  21.77138  18.31557  23.46840  13.98155  17.95108  28.95942
## healthyControl_5  30.47450  30.04598  27.85309  33.09064  24.11742  27.05694
## healthyControl_6  25.34718  16.63337  18.20711  16.78156  16.57356  10.13407
##                       MYOM2       HBG1  HLA-DRB4     CTSG     RGS18      GAPT
## healthyControl_1 238.652070 815.078715  14.21339 34.19820  9.651897  6.980782
## healthyControl_2 323.065372 167.289522 165.56168 53.10345 40.654359 26.524937
## healthyControl_3  65.086732  10.658862  11.97314 23.49343 24.912526 16.167267
## healthyControl_4   9.536578  29.211161 312.03920 11.58055 17.404908 27.702748
## healthyControl_5  46.234651  26.330291 245.62247 12.99521 24.331860 24.667356
## healthyControl_6  14.813989   8.593226 360.87281 74.44276 12.925465 34.413011
##                   SERPINB2     THBS1      AREG     CXCL2       XIST     OLR1
## healthyControl_1 468.25347 403.26591 261.23314 214.90236   5.636720 44.43886
## healthyControl_2 140.75193 138.67140 136.37357  90.08176 422.253222 81.50329
## healthyControl_3  24.34336  34.46742  10.75996  26.80778 206.491863 19.01474
## healthyControl_4  33.59483  20.47388  33.58836  12.26752   3.421391 22.83240
## healthyControl_5  23.84726  29.81402  44.00702  36.01149   5.291298 51.09338
## healthyControl_6  53.35440  33.16393  18.14765  29.94835 174.811375 20.10506
##                    OR2B11    FSIP1      TSIX  C7orf55   CHI3L1 KIAA1245
## healthyControl_1 41.91948 70.20709  17.90936 29.88277 29.77403 33.17650
## healthyControl_2 32.42068 26.85640 433.24591 29.82837 58.14152 91.34233
## healthyControl_3 19.85783 27.90551 134.56488 19.38080 16.31402 22.47583
## healthyControl_4 21.78934 16.33424  10.88805 28.87899 12.99754 21.60191
## healthyControl_5 24.37846 31.07355  12.66765 45.15163 26.19157 27.89315
## healthyControl_6 14.81774 16.89240  53.40405 18.44080 20.26103 20.81851
##                     BEST1     LIPN     GZMH    KIR2DL3   KIR2DS1   POLR2I
## healthyControl_1 29.67359 20.03313 49.52473 176.990491 100.08657 77.65711
## healthyControl_2 31.69743 84.80267 14.86283  44.288607  61.10841 22.72473
## healthyControl_3 25.92331 16.76715 24.14236  15.927277  21.24897 17.52461
## healthyControl_4 20.75736 29.99576 16.10039  39.862912  65.46716 20.43849
## healthyControl_5 45.23739 40.94288 34.47135  96.261702  56.76246 34.97384
## healthyControl_6 22.38874 27.26274 11.90818   9.957902  14.84648 19.87330
##                      S100B
## healthyControl_1 280.46369
## healthyControl_2  25.98400
## healthyControl_3  20.74111
## healthyControl_4  48.50091
## healthyControl_5 123.08727
## healthyControl_6  21.27883
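Since the class labels are derived from the sample names by pattern matching, a quick check that every sample received exactly one label is cheap insurance. The sketch below uses a handful of made-up sample names in the same style as `colnames(lymeMx7)`; `toyNames`, `patterns`, and `toyLabels` are hypothetical names.

```r
# toy sample names in the same style as colnames(lymeMx7)
toyNames <- c('healthyControl_1', 'healthyControl_2',
              'acuteLymeDisease_1',
              'Antibodies_1month_1', 'Antibodies_6months_1')

# lookup: class label -> pattern that identifies it in a sample name
patterns <- c(healthy = 'healthy', acute = 'acute',
              '1 month' = '1month', '6 months' = '6month')

toyLabels <- rep(NA_character_, length(toyNames))
for (lab in names(patterns)) {
  toyLabels[grepl(patterns[[lab]], toyNames)] <- lab
}

# every sample should have received a label
stopifnot(!anyNA(toyLabels))
table(toyLabels)
```

On the full data, going by the column names listed above, the same table should show 21 healthy, 28 acute, 27 1 month, and 10 6 months samples.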

We have our machine learning ready data frame of de-standardized genes, and will be using the target, class, for predictions. We could use all 43 genes or just take those genes in the visualizations that we saw had very peculiar fold change values, like in the 6 months of treatment or acute stages. Or we could test both. Might as well test both, as we will see how well these genes predict an acute disease stage, treatment time, or healthy class by blood analysis.

Let's refresh our memories on what those genes were. We put them in our notes on the visualizations above for the de-standardized Tableau charts. We might have missed some, as those visuals were only scanned, so I am going to make a list of those genes that have noticeable shifts in gene expression or fold change values compared to the other classes and make that our peculiar set of genes. We could even divide those genes up into the ones up or down regulated in the 6 month stage only, the acute stage only, or even the healthy samples only. I will revisit that dashboard, select the genes from the filter, compare across all charts available, and bring back the findings here. We’ll call those sets Dance-odd6, Dance-oddAcute, or Dance-oddHealthy. Possibly Dance-odd1, but I didn’t notice anything on the first quick scan through the genes. There are only 43, so it shouldn’t be a problem.

Using the fold change values:

Up regulated in acute; decreasing order of up regulation is 1 month, healthy, 6 months.

monotonically decreasing from acute -> 1 month -> 6 months:

BPI CAMP CEACAM8 CTSG DEFA1 DEFA4 DHX58 FCGR3B GAPT GZMH HBG1 HLA-DRB4 KIR2DL3 KIR2DS1 LCN2 LSM2 LTF MS4A3 POLR2I RGS18 S100B TNFSF10

more up regulated in acute, drops in 1month then up regulated to approximately 1/2 acute in 6 months:

THBD CHI3L1 SERPINB2

up in acute and down 1/2 in 1 month with slight increase in 6 months by approx 5%:

TXNL4A

up reg in acute, drops in 1 month then almost up reg same amount in 6 months as in acute

HTR3C

Up regulated in 6 months.

starts up in acute, drops in 1 month then up in 6 months close to acute levels

AREG BEST1 KIAA1245 LIPN OR2B11

starts mid-level in acute, then stays about the same in 1 month, then noticeably up in 6 months about two fold as in acute

CKMT1B

starts low in acute, drops a little in 1 month, then up in 6 months much more noticeably than in the acute by 5-10 fold acute levels

IL1B MUC12 MYOM2 OLR1 THBS1 TSIX XIST

starts low, stays same approx in 1 month, then much more noticeably up in 6 months 5-10 fold

FSIP1

A few observations: many of the genes are monotonically decreasing in gene expression levels from acute to 1 month to 6 months, where they start high in the acute stage, decrease gradually in 1 month, and decrease more in 6 months: BPI, CAMP, CEACAM8, CTSG, DEFA1, DEFA4, DHX58, FCGR3B, GAPT, GZMH, HBG1, HLA-DRB4, KIR2DL3, KIR2DS1, LCN2, LSM2, LTF, MS4A3, POLR2I, RGS18, S100B, TNFSF10.

Then we have some odd genes, THBD, CHI3L1, and SERPINB2, among the acute up regulated genes that don’t behave this way, but they indicate that maybe treatment is working, because they start high in acute, then drop in 1 month of treatment, and then increase to almost half the acute phase levels after 6 months of treatment. Also, TXNL4A drops in 1 month and stays about the same after 6 months. And another gene, HTR3C, drops in 1 month, then increases to almost the same acute levels in 6 months. Let's make those lists for the acute genes, then the lists for the 6 months genes.

#monotonically decreasing
#the symbol is POLR2I (capital letter I), matching the column name in Dance2
Acute_md <- c('BPI', 'CAMP', 'CEACAM8', 'CTSG',  'DEFA1', 'DEFA4', 'DHX58', 'FCGR3B', 'GAPT', 'GZMH', 'HBG1', 'HLA-DRB4','KIR2DL3', 'KIR2DS1', 'LCN2', 'LSM2', 'LTF', 'MS4A3', 'POLR2I', 'RGS18', 'S100B', 'TNFSF10')

#high in acute, drop after 1 month, then half as high as acute after 6 months
Acute_mayWork <- c('THBD', 'CHI3L1','SERPINB2')

#odd ones in acute, starts high in acute, then drops, and increases slightly in 6 months
Acute_dropsThenUpslightly <- 'TXNL4A'
Acute_dropsReturnsSame <- 'HTR3C'
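The groupings above were made by eye from the Tableau charts. A numeric check of the same idea could look like the following sketch, which labels the shape of a gene's acute, 1 month, and 6 months fold changes. The data frame `toyFC`, its values, and the helper `trajectory` are hypothetical and only illustrate the comparison logic.

```r
# toy fold changes versus healthy for three invented genes
toyFC <- data.frame(gene   = c('gA', 'gB', 'gC'),
                    acute  = c(4.0, 3.0, 1.0),
                    month1 = c(2.5, 1.2, 1.2),
                    month6 = c(1.1, 2.8, 6.0))

# label the shape of each gene's acute -> 1 month -> 6 months trajectory
trajectory <- function(a, m1, m6) {
  if (a > m1 && m1 > m6) {
    'monotonically decreasing'
  } else if (a > m1 && m6 > m1) {
    'drops then rises'
  } else {
    'other'
  }
}

toyFC$pattern <- mapply(trajectory, toyFC$acute, toyFC$month1, toyFC$month6)
toyFC[, c('gene', 'pattern')]
```

Finer groups, such as a return to roughly half the acute level, would need an extra threshold on the ratio of the 6 months value to the acute value.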

For the six months of treatment, we look at genes that were noticeably increased after 6 months compared to the acute stage before treatment. None monotonically increased from acute levels, to 1 month of treatment levels, to six months of treatment levels. But some did drop in 1 month, then increase in 6 months to levels 5-10 fold higher than the acute levels: IL1B, MUC12, MYOM2, OLR1, THBS1, TSIX, and XIST. Two genes start lower, stay about the same in 1 month, then increase in 6 months: FSIP1 by 5-10 fold and CKMT1B by about two fold of the acute level. And there are those genes that are more up regulated after 6 months of treatment, but only slightly more than in the acute phase, and only after decreasing in the 1 month of treatment phase: AREG, BEST1, KIAA1245, LIPN, and OR2B11. Let's now make those lists to show genes that are more up regulated in the 6 month samples.

month6_5foldup <- c('IL1B', 'MUC12', 'MYOM2', 'OLR1', 'THBS1', 'TSIX', 'XIST')
month6_5foldupStartLow <- c('FSIP1','CKMT1B')
month6_upMoreThanAcute <- c('AREG', 'BEST1', 'KIAA1245', 'LIPN', 'OR2B11')

Now that we have our lists, let's make the data frames for the seven different groups of gene anomalies or similarities. The following are our ML ready data frames for our seven groups in our de-standardized Lyme disease data.

Acute_md_DF <- Dance2[,colnames(Dance2) %in% Acute_md]
Acute_md_DF$class <- Dance2$class

Acute_mayWork_DF <- Dance2[,colnames(Dance2) %in% Acute_mayWork]
Acute_mayWork_DF$class <- Dance2$class

Acute_dropsThenUpslightly_DF <- data.frame(TXNL4A=Dance2[,colnames(Dance2) %in% Acute_dropsThenUpslightly], row.names=row.names(Dance2))
Acute_dropsThenUpslightly_DF$class <- Dance2$class

Acute_dropsReturnsSame_DF <- data.frame(HTR3C=Dance2[,colnames(Dance2) %in%  Acute_dropsReturnsSame],row.names=row.names(Dance2))
Acute_dropsReturnsSame_DF$class <- Dance2$class

month6_5foldup_DF <- Dance2[,colnames(Dance2) %in% month6_5foldup]
month6_5foldup_DF$class <- Dance2$class

month6_5foldupStartLow_DF <- Dance2[,colnames(Dance2) %in% month6_5foldupStartLow]
month6_5foldupStartLow_DF$class <- Dance2$class

month6_upMoreThanAcute_DF <- Dance2[,colnames(Dance2) %in% month6_upMoreThanAcute]
month6_upMoreThanAcute_DF$class <- Dance2$class
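Two details of this subsetting pattern are worth a small demonstration: a single gene collapses to a plain vector unless `drop = FALSE` is used (which is why the two single gene groups are wrapped in `data.frame()` above), and `%in%` silently ignores a gene symbol that matches no column. The sketch below uses a toy stand-in for Dance2; `toyDance`, `toyGroups`, and `subsetGenes` are hypothetical names.

```r
# toy stand-in for Dance2: a class column plus three gene columns (values invented)
toyDance <- data.frame(class = c('healthy', 'acute', '1 month', '6 months'),
                       GENE1 = c(1, 2, 3, 4),
                       GENE2 = c(5, 6, 7, 8),
                       GENE3 = c(9, 10, 11, 12),
                       row.names = paste0('sample_', 1:4))

# named list of gene groups; 'GENEX' is deliberately absent from the columns
toyGroups <- list(pair = c('GENE1', 'GENE3'),
                  single = 'GENE2',
                  typo = c('GENE1', 'GENEX'))

# build one data frame per group, warning about symbols that match no column
subsetGenes <- function(df, genes) {
  absent <- setdiff(genes, colnames(df))
  if (length(absent) > 0) {
    warning('no column for: ', paste(absent, collapse = ', '))
  }
  out <- df[, colnames(df) %in% genes, drop = FALSE]  # drop = FALSE keeps a data frame
  out$class <- df$class
  out
}

toyDFs <- lapply(toyGroups, subsetGenes, df = toyDance)
sapply(toyDFs, ncol)
```

The `typo` group comes back with one gene fewer than requested and a warning, which is much easier to notice than a silently smaller data frame.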

Machine Learning

Great, now we need to run through each of these 7 data frames, split each into separate training and testing sets, and test a machine learning algorithm on them. I tend to start with random forest, or rpart through caret.

Let's make sure we keep the same samples in the testing set and training set for each group we test the machine learning algorithm(s) on. Let's keep the standard 70% training set and 30% testing set using a random sampling of our samples.

set.seed(34567)
train <- sample(1:86,.7*86)
training <- class[train]
testing <- class[-train]
t <- data.frame(train = training)
ts <- data.frame(test= testing)

t %>% group_by(train) %>% count(train)

## # A tibble: 4 x 2
## # Groups:   train [4]
##   train        n
##   <fct>    <int>
## 1 1 month     21
## 2 6 months     8
## 3 acute       18
## 4 healthy     13

ts %>% group_by(test) %>% count(test)

## # A tibble: 4 x 2
## # Groups:   test [4]
##   test         n
##   <fct>    <int>
## 1 1 month      6
## 2 6 months     2
## 3 acute       10
## 4 healthy      8
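The random split above leaves only 2 of the 10 six months samples in the testing set. One alternative, shown here only as a sketch and not what the rest of this analysis uses, is to sample about 70% within each class so that every class keeps the same proportion. The names `toyClass` and `stratTrain` are hypothetical; the class sizes are the ones implied by the sample names.

```r
# class sizes taken from the sample names: 21 healthy, 28 acute, 27 and 10 treated
toyClass <- rep(c('healthy', 'acute', '1 month', '6 months'),
                times = c(21, 28, 27, 10))

set.seed(34567)
# sample about 70% of the row indices within each class separately
stratTrain <- unlist(lapply(split(seq_along(toyClass), toyClass), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

table(toyClass[stratTrain])   # training counts per class
table(toyClass[-stratTrain])  # testing counts per class
```

With these sizes the training set holds 15 healthy, 20 acute, 19 1 month, and 7 6 months samples whatever the seed, leaving 3 six months samples for testing.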

We can see we have a fair share of samples in our training set and at least one of each class in our testing set to make predictions based on the model we train. Let's keep these same samples in each of our 8 groups to classify with. Let's make our 8 training and testing sets with our indices labeled ‘train’, and note that the numeric labeling of each corresponds to its data frame:

Training/Testing split 1: Acute_md_DF
Training/Testing split 2: Acute_mayWork_DF
Training/Testing split 3: Acute_dropsThenUpslightly_DF
Training/Testing split 4: Acute_dropsReturnsSame_DF
Training/Testing split 5: month6_5foldup_DF
Training/Testing split 6: month6_5foldupStartLow_DF
Training/Testing split 7: month6_upMoreThanAcute_DF
Training/Testing split 8: Dance2

training1 <- Acute_md_DF[train,]
testing1 <- Acute_md_DF[-train,]
training2 <- Acute_mayWork_DF[train,]
testing2 <- Acute_mayWork_DF[-train,]
training3 <- Acute_dropsThenUpslightly_DF[train,]
testing3 <- Acute_dropsThenUpslightly_DF[-train,]
training4 <- Acute_dropsReturnsSame_DF[train,]
testing4 <- Acute_dropsReturnsSame_DF[-train,]
training5 <- month6_5foldup_DF[train,]
testing5 <- month6_5foldup_DF[-train,]
training6 <- month6_5foldupStartLow_DF[train,]
testing6 <- month6_5foldupStartLow_DF[-train,]
training7 <- month6_upMoreThanAcute_DF[train,]
testing7 <- month6_upMoreThanAcute_DF[-train,]
training8 <- Dance2[train,]
testing8 <- Dance2[-train,]
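The sixteen assignments above all apply the same indices to different data frames, so they could also be generated from a named list. The following is a sketch on two toy data frames; `toyA`, `toyB`, `toySets`, `toyTrain`, and `toySplits` are hypothetical names.

```r
# two toy data frames standing in for the eight gene group data frames
toyA <- data.frame(x = 1:10, class = rep(c('a', 'b'), 5))
toyB <- data.frame(y = 11:20, class = rep(c('a', 'b'), 5))
toySets <- list(A = toyA, B = toyB)

# one shared set of row indices, so every group trains on the same samples
set.seed(1)
toyTrain <- sample(seq_len(10), 7)

# split every data frame with the same indices
toySplits <- lapply(toySets, function(df) {
  list(training = df[toyTrain, , drop = FALSE],
       testing  = df[-toyTrain, , drop = FALSE])
})

sapply(toySplits, function(s) c(train = nrow(s$training), test = nrow(s$testing)))
```

Keeping the splits in one list also makes it possible to fit and score every group in a loop, which removes the chance of pairing a model with the wrong data frame by hand.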

Let's make a function specific to our data frames to return the precision, recall, and accuracy of these four classes. I actually made this in a previous script, monotonicGenes.Rmd, when testing the COVID-19 samples in GSE152418, which also had four classes to classify. Those classes were healthy, moderate, severe, or ICU grades of COVID-19 severity. Actually, I found out later that the convalescent class was its own class even though it was only one sample, so there should have been five classes. But there is no need to alter that function now. There are also other packages, or functions in the caret package that I never use, that can return the precision and recall, but I don’t think as a confusion matrix.

I thought the convalescent class was mislabeled, so I had it relabeled as healthy, since the models predicted it as such. I didn’t find out until this study, GSE145974, whose summary describes ‘convalesced’ blood after 1 and 6 months of antibiotics, that the sample in GSE152418 was likely its own class. I had assumed the label was identifying the source of the patient sample, because a previous study on Rheumatoid Arthritis (RA), GSE151161, did use convalescent patients, and it preceded the analysis on GSE152418.

Typically in research, you need informed consent from participants, and recruiting people who are incarcerated or in the care of another person or facility is restricted, because it can violate the guidelines for ethical human subjects research, which exist to avoid victimizing vulnerable, culpable, or incoherent populations. This stems from criminal research such as the Tuskegee study, in which Black men with syphilis were observed and left untreated, and other studies that tested polio vaccines on inmates in exchange for some small reward, a break from their punishment, or lowered or free cost clinic medical treatment. Any researcher knows this, especially if they are funded by government agencies.

Also, due to the Nazi research done on Jewish victims during World War 2, the Nuremberg Code was created, as well as, later, the Belmont Report. “The Nuremberg Code states that “the voluntary consent of the human subject is absolutely essential” and it further explains the details implied by this requirement: capacity to consent, freedom from coercion, no penalty for withdrawal, and comprehension of the risks and benefits involved.” This passage on the Nuremberg Code is taken from a resource for getting certified in compliance with human research experiments, which I had to complete as part of my graduate research project. The agency that provided it, similar to HIPAA compliance training for healthcare providers, is CITI.

precisionRecallAccuracy <- function(df){
  
 colnames(df) <- c('pred','type')
  df$pred <- as.character(paste(df$pred))
  df$type <- as.character(paste(df$type))
  
 classes <- unique(df$type)
 
 class1a <- as.character(paste(classes[1]))
 class2a <- as.character(paste(classes[2]))
 class3a <- as.character(paste(classes[3]))
 class4a <- as.character(paste(classes[4]))
 
  #correct classes
  class1 <- subset(df, df$type==class1a)
  class2 <- subset(df, df$type==class2a)
  class3 <- subset(df, df$type==class3a)
  class4 <- subset(df, df$type==class4a)
  
  #incorrect classes
  notClass1 <- subset(df,df$type != class1a)
  notClass2 <- subset(df,df$type != class2a)
  notClass3 <- subset(df,df$type != class3a)
  notClass4 <- subset(df, df$type != class4a)
  
  #true positives (real positives predicted positive)
  tp_1 <- sum(class1$pred==class1$type)
  tp_2 <- sum(class2$pred==class2$type)
  tp_3 <- sum(class3$pred==class3$type)
  tp_4 <- sum(class4$pred==class4$type)
  
  #false positives (real negatives predicted positive)
  fp_1 <- sum(notClass1$pred==class1a)
  fp_2 <- sum(notClass2$pred==class2a)
  fp_3 <- sum(notClass3$pred==class3a)
  fp_4 <- sum(notClass4$pred==class4a)
  
  #false negatives (real positive predicted negative)
  fn_1 <- sum(class1$pred!=class1$type)
  fn_2 <- sum(class2$pred!=class2$type)
  fn_3 <- sum(class3$pred!=class3$type)
  fn_4 <- sum(class4$pred!=class4$type)
  
  #true negatives (real negatives predicted negative)
  tn_1 <- sum(notClass1$pred!=class1a)
  tn_2 <- sum(notClass2$pred!=class2a)
  tn_3 <- sum(notClass3$pred!=class3a)
  tn_4 <- sum(notClass4$pred!=class4a)
  
  
  #precision
  p1 <- tp_1/(tp_1+fp_1)
  p2 <- tp_2/(tp_2+fp_2)
  p3 <- tp_3/(tp_3+fp_3)
  p4 <- tp_4/(tp_4+fp_4)
  
  p1 <- ifelse(is.nan(p1),0,p1)
  p2 <- ifelse(is.nan(p2),0,p2)
  p3 <- ifelse(is.nan(p3),0,p3)
  p4 <- ifelse(is.nan(p4),0,p4)
  
  #recall
  r1 <- tp_1/(tp_1+fn_1)
  r2 <- tp_2/(tp_2+fn_2)
  r3 <- tp_3/(tp_3+fn_3)
  r4 <- tp_4/(tp_4+fn_4)
  
  r1 <- ifelse(is.nan(r1),0,r1)
  r2 <- ifelse(is.nan(r2),0,r2)
  r3 <- ifelse(is.nan(r3),0,r3)
  r4 <- ifelse(is.nan(r4),0,r4)
  
  #accuracy
  ac1 <- (tp_1+tn_1)/(tp_1+tn_1+fp_1+fn_1)
  ac2 <- (tp_2+tn_2)/(tp_2+tn_2+fp_2+fn_2)
  ac3 <- (tp_3+tn_3)/(tp_3+tn_3+fp_3+fn_3)
  ac4 <- (tp_4+tn_4)/(tp_4+tn_4+fp_4+fn_4)
  
  table <- as.data.frame(rbind(c(class1a,p1,r1,ac1),
                         c(class2a,p2,r2,ac2),
                         c(class3a,p3,r3,ac3),
                         c(class4a,p4,r4,ac4)))
  
  colnames(table) <- c('class','precision','recall','accuracy')
  acc <- (sum(df$pred==df$type)/length(df$type))*100
  cat('accuracy is: ',as.character(paste(acc)),'%')
  return(table)
  
  
}
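For comparison, the same per-class precision and recall can be read off a confusion matrix built with base R's `table()`. This is a compact sketch on invented predictions, not a replacement for the function above; `toyPred`, `toyTrue`, `lv`, and `cm` are hypothetical names.

```r
# toy predictions and true labels (invented)
toyPred <- c('healthy', 'acute', 'acute', 'healthy', '1 month', 'acute')
toyTrue <- c('healthy', 'healthy', 'acute', 'healthy', '1 month', '1 month')

# fix the levels so the table is square even if a class is never predicted
lv <- c('healthy', 'acute', '1 month', '6 months')
cm <- table(pred = factor(toyPred, levels = lv),
            type = factor(toyTrue, levels = lv))

tp <- diag(cm)
precision <- tp / rowSums(cm)   # correct predictions / all predictions of the class
recall    <- tp / colSums(cm)   # correct predictions / all true members of the class

# classes with a zero denominator give NaN; report them as 0, as the function above does
precision[is.nan(precision)] <- 0
recall[is.nan(recall)] <- 0

round(cbind(precision, recall), 3)
```

Because the rows are predictions and the columns are true classes, the off-diagonal cells also show which classes get confused with each other, which the precision and recall figures alone do not.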

Let's start with the first group of genes using Training/Testing 1:

set.seed(589647)
rfMod1 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training1),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF1 <- predict(rfMod1, testing1)

predDF1 <- data.frame(predRF1, type=testing1$class)
predDF1

##    predRF1     type
## 1  healthy  healthy
## 2  1 month  healthy
## 3    acute  healthy
## 4  healthy  healthy
## 5  1 month  healthy
## 6  1 month  healthy
## 7  healthy  healthy
## 8  healthy  healthy
## 9    acute    acute
## 10 1 month    acute
## 11   acute    acute
## 12   acute    acute
## 13 1 month    acute
## 14 1 month    acute
## 15   acute    acute
## 16   acute    acute
## 17   acute    acute
## 18   acute    acute
## 19 1 month  1 month
## 20   acute  1 month
## 21   acute  1 month
## 22   acute  1 month
## 23 1 month  1 month
## 24 1 month  1 month
## 25 1 month 6 months
## 26 healthy 6 months

pra1 <- precisionRecallAccuracy(predDF1)

## accuracy is:  53.8461538461538 %

pra1

##      class         precision recall          accuracy
## 1  healthy               0.8    0.5 0.807692307692308
## 2    acute 0.636363636363636    0.7 0.730769230769231
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months                 0      0 0.923076923076923
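One way to put the overall accuracy in context is to compare it with the accuracy of always guessing the most common class in the testing set. The sketch below uses the testing set counts printed earlier; `testCounts`, `baseline`, and `model1` are hypothetical names, and the 14 correct predictions are what 53.8% of 26 samples works out to.

```r
# testing set class counts as printed earlier
testCounts <- c('1 month' = 6, '6 months' = 2, acute = 10, healthy = 8)

# accuracy (%) of always predicting the most common class
baseline <- max(testCounts) / sum(testCounts) * 100
round(baseline, 1)

# the first model's overall accuracy (%), 14 correct out of 26
model1 <- 14 / 26 * 100
model1 > baseline
```

The same comparison explains the 6 months row above: its per-class accuracy is high only because 24 of the 26 testing samples are true negatives for that class, while its precision and recall are both 0.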

That set wasn’t so great. Let's run through the other 7 sets using the same format and compare the results at the end.

Training/Testing 2:

rfMod2 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training2),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .

predRF2 <- predict(rfMod2, testing2)

predDF2 <- data.frame(predRF2, type=testing2$class)
predDF2

##     predRF2     type
## 1   healthy  healthy
## 2     acute  healthy
## 3     acute  healthy
## 4   healthy  healthy
## 5   healthy  healthy
## 6     acute  healthy
## 7   healthy  healthy
## 8   1 month  healthy
## 9     acute    acute
## 10  1 month    acute
## 11 6 months    acute
## 12    acute    acute
## 13  1 month    acute
## 14  1 month    acute
## 15 6 months    acute
## 16  healthy    acute
## 17  1 month    acute
## 18  healthy    acute
## 19  1 month  1 month
## 20    acute  1 month
## 21    acute  1 month
## 22    acute  1 month
## 23  1 month  1 month
## 24  1 month  1 month
## 25  1 month 6 months
## 26 6 months 6 months

pra2 <- precisionRecallAccuracy(predDF2)

## accuracy is:  38.4615384615385 %

pra2

##      class         precision recall          accuracy
## 1  healthy 0.666666666666667    0.5 0.769230769230769
## 2    acute              0.25    0.2 0.461538461538462
## 3  1 month 0.333333333333333    0.5 0.653846153846154
## 4 6 months 0.333333333333333    0.5 0.884615384615385

Training/Testing 3:

rfMod3 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training3),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF3 <- predict(rfMod3, testing3)

predDF3 <- data.frame(predRF3, type=testing3$class)
predDF3

##     predRF3     type
## 1   1 month  healthy
## 2   1 month  healthy
## 3   1 month  healthy
## 4     acute  healthy
## 5     acute  healthy
## 6   healthy  healthy
## 7     acute  healthy
## 8   healthy  healthy
## 9     acute    acute
## 10  healthy    acute
## 11  healthy    acute
## 12    acute    acute
## 13  1 month    acute
## 14  1 month    acute
## 15    acute    acute
## 16    acute    acute
## 17  healthy    acute
## 18  1 month    acute
## 19  healthy  1 month
## 20  1 month  1 month
## 21  1 month  1 month
## 22  1 month  1 month
## 23    acute  1 month
## 24    acute  1 month
## 25 6 months 6 months
## 26    acute 6 months

pra3 <- precisionRecallAccuracy(predDF3)

## accuracy is:  38.4615384615385 %

pra3

##      class         precision recall          accuracy
## 1  healthy 0.333333333333333   0.25 0.615384615384615
## 2    acute               0.4    0.4 0.538461538461538
## 3  1 month 0.333333333333333    0.5 0.653846153846154
## 4 6 months                 1    0.5 0.961538461538462

Training/Testing 4:

rfMod4 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training4),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF4 <- predict(rfMod4, testing4)

predDF4 <- data.frame(predRF4, type=testing4$class)
predDF4

##     predRF4     type
## 1  6 months  healthy
## 2   healthy  healthy
## 3   1 month  healthy
## 4     acute  healthy
## 5   1 month  healthy
## 6   healthy  healthy
## 7   healthy  healthy
## 8   healthy  healthy
## 9   healthy    acute
## 10  1 month    acute
## 11  healthy    acute
## 12  healthy    acute
## 13  1 month    acute
## 14  1 month    acute
## 15 6 months    acute
## 16    acute    acute
## 17  1 month    acute
## 18  healthy    acute
## 19  1 month  1 month
## 20  1 month  1 month
## 21    acute  1 month
## 22    acute  1 month
## 23    acute  1 month
## 24  1 month  1 month
## 25 6 months 6 months
## 26  1 month 6 months

pra4 <- precisionRecallAccuracy(predDF4)

## accuracy is:  34.6153846153846 %

pra4

##      class         precision recall          accuracy
## 1  healthy               0.5    0.5 0.692307692307692
## 2    acute               0.2    0.1               0.5
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months 0.333333333333333    0.5 0.884615384615385

Training/Testing 5:

rfMod5 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training5),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF5 <- predict(rfMod5, testing5)

predDF5 <- data.frame(predRF5, type=testing5$class)
predDF5

##     predRF5     type
## 1   healthy  healthy
## 2  6 months  healthy
## 3  6 months  healthy
## 4   healthy  healthy
## 5   healthy  healthy
## 6     acute  healthy
## 7   healthy  healthy
## 8     acute  healthy
## 9   1 month    acute
## 10  1 month    acute
## 11  healthy    acute
## 12 6 months    acute
## 13  1 month    acute
## 14  1 month    acute
## 15 6 months    acute
## 16  1 month    acute
## 17  1 month    acute
## 18  healthy    acute
## 19    acute  1 month
## 20  healthy  1 month
## 21  1 month  1 month
## 22    acute  1 month
## 23    acute  1 month
## 24  1 month  1 month
## 25 6 months 6 months
## 26 6 months 6 months

pra5 <- precisionRecallAccuracy(predDF5)

## accuracy is:  30.7692307692308 %

pra5

##      class         precision            recall          accuracy
## 1  healthy 0.571428571428571               0.5 0.730769230769231
## 2    acute                 0                 0 0.423076923076923
## 3  1 month              0.25 0.333333333333333 0.615384615384615
## 4 6 months 0.333333333333333                 1 0.846153846153846

Training/Testing 6:

rfMod6 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training6),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

predRF6 <- predict(rfMod6, testing6)

predDF6 <- data.frame(predRF6, type=testing6$class)
predDF6

##     predRF6     type
## 1   1 month  healthy
## 2  6 months  healthy
## 3  6 months  healthy
## 4     acute  healthy
## 5     acute  healthy
## 6  6 months  healthy
## 7  6 months  healthy
## 8   1 month  healthy
## 9   1 month    acute
## 10    acute    acute
## 11  1 month    acute
## 12 6 months    acute
## 13  1 month    acute
## 14  1 month    acute
## 15 6 months    acute
## 16  1 month    acute
## 17 6 months    acute
## 18  1 month    acute
## 19  1 month  1 month
## 20    acute  1 month
## 21  1 month  1 month
## 22 6 months  1 month
## 23  healthy  1 month
## 24    acute  1 month
## 25 6 months 6 months
## 26 6 months 6 months

pra6 <- precisionRecallAccuracy(predDF6)

## accuracy is:  19.2307692307692 %

pra6

##      class precision            recall          accuracy
## 1  healthy         0                 0 0.653846153846154
## 2    acute       0.2               0.1               0.5
## 3  1 month       0.2 0.333333333333333 0.538461538461538
## 4 6 months       0.2                 1 0.692307692307692

Training/Testing 7:

rfMod7 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training7),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF7 <- predict(rfMod7, testing7)

predDF7 <- data.frame(predRF7, type=testing7$class)
predDF7

##     predRF7     type
## 1   1 month  healthy
## 2     acute  healthy
## 3     acute  healthy
## 4     acute  healthy
## 5   1 month  healthy
## 6     acute  healthy
## 7     acute  healthy
## 8  6 months  healthy
## 9  6 months    acute
## 10    acute    acute
## 11  1 month    acute
## 12    acute    acute
## 13  1 month    acute
## 14  1 month    acute
## 15    acute    acute
## 16  1 month    acute
## 17  healthy    acute
## 18    acute    acute
## 19  1 month  1 month
## 20    acute  1 month
## 21    acute  1 month
## 22 6 months  1 month
## 23    acute  1 month
## 24 6 months  1 month
## 25    acute 6 months
## 26 6 months 6 months

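# Bug fix: the original run passed predDF1 here, so the output printed below repeats the split 1 metrics.
# Tallying the predDF7 table above gives 6 of 26 correct predictions (about 23.1%).
pra7 <- precisionRecallAccuracy(predDF7)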

## accuracy is:  53.8461538461538 %

pra7

##      class         precision recall          accuracy
## 1  healthy               0.8    0.5 0.807692307692308
## 2    acute 0.636363636363636    0.7 0.730769230769231
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months                 0      0 0.923076923076923

Training/Testing 8:

rfMod8 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training8),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF8 <- predict(rfMod8, testing8)

predDF8 <- data.frame(predRF8, type=testing8$class)
predDF8

##     predRF8     type
## 1   healthy  healthy
## 2     acute  healthy
## 3     acute  healthy
## 4   healthy  healthy
## 5   healthy  healthy
## 6  6 months  healthy
## 7   healthy  healthy
## 8   healthy  healthy
## 9     acute    acute
## 10  1 month    acute
## 11    acute    acute
## 12    acute    acute
## 13  1 month    acute
## 14  1 month    acute
## 15    acute    acute
## 16    acute    acute
## 17  1 month    acute
## 18    acute    acute
## 19  1 month  1 month
## 20    acute  1 month
## 21    acute  1 month
## 22    acute  1 month
## 23  1 month  1 month
## 24  1 month  1 month
## 25 6 months 6 months
## 26 6 months 6 months

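# Bug fix: the original run passed predDF1 here, so the output printed below repeats the split 1 metrics.
# Tallying the predDF8 table above gives 16 of 26 correct predictions (about 61.5%).
pra8 <- precisionRecallAccuracy(predDF8)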

## accuracy is:  53.8461538461538 %

pra8

##      class         precision recall          accuracy
## 1  healthy               0.8    0.5 0.807692307692308
## 2    acute 0.636363636363636    0.7 0.730769230769231
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months                 0      0 0.923076923076923

The random seed for this operating system and R has to be set immediately before the models are trained. Setting it only before generating the random indices of the training vector did not make the model generation reproducible: I reran 3-5 times and got inconsistent results unless set.seed() was called before the 8 models were run. Setting the seed once is supposed to be enough to generate the same results every time. Either way, the variation shows how the random forest works, by randomly selecting samples from within the samples to build an ensemble of trees.

With the current seed, the 1st, 7th, and 8th groups still have the highest reported accuracy. Overall accuracy was poor for every group, ranging from 19% to 54% of predictions matching the actual type. The worst group for prediction accuracy was group 6, and the three best groups, at 54%, were groups 1, 7, and 8. Note, however, that the reported metrics for splits 7 and 8 come from scoring predDF1, so they repeat the split 1 numbers; tallying the predDF7 and predDF8 tables directly gives 6 of 26 (about 23%) and 16 of 26 (about 62%) correct. There were also differences in per-class accuracy, which are easiest to compare by combining the precision, recall, and accuracy tables and then adding a feature that identifies which model each result came from.

Worst set of genes to keep as target genes for Lyme Disease are…

These are the groups by gene behavior, in fold change of the diseased or treated mean values compared to the healthy mean values:

Training/Testing split 1: Acute_md_DF
Training/Testing split 2: Acute_mayWork_DF
Training/Testing split 3: Acute_dropsThenUpslightly_DF
Training/Testing split 4: Acute_dropsReturnsSame_DF
Training/Testing split 5: month6_5foldup_DF
Training/Testing split 6: month6_5foldupStartLow_DF
Training/Testing split 7: month6_upMoreThanAcute_DF
Training/Testing split 8: Dance2
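The seed behavior described above can be illustrated with a minimal base R sketch. It assumes 86 samples in total (60 training plus 26 testing) and uses sample() as a stand-in for both the index draw and the forest's internal resampling; the variable names are hypothetical.

```r
# Every call that draws random numbers advances the generator state, so a seed
# set before the train/test split does not pin down draws made later on.
set.seed(4567)
idx1 <- sample(1:86, 60)        # stands in for drawing the training indices
fitDraw1 <- sample(1:60, 5)     # stands in for the forest's internal sampling

set.seed(4567)
idx2 <- sample(1:86, 60)
extra <- runif(1)               # any extra draw in between shifts the state
fitDraw2 <- sample(1:60, 5)     # so this draw typically differs from fitDraw1

identical(idx1, idx2)           # TRUE: same seed, same first draw

# Resetting the seed immediately before the step of interest makes it repeatable.
set.seed(4567); fitDraw3 <- sample(1:60, 5)
set.seed(4567); fitDraw4 <- sample(1:60, 5)
identical(fitDraw3, fitDraw4)   # TRUE
```

This is why calling set.seed() directly before each model fit, rather than only before the split, gives stable results from run to run.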

So, without tuning our models or testing other algorithms, we can assume from this point on that the best sets are all 43 genes, the set of genes with more upregulation after 6 months than in the acute phase, and the set of genes whose values decrease monotonically from the acute phase to one month of treatment to six months of treatment, with the acute phase having the highest gene expression values. The other genes are possibly noisy, or add noise to our data that keeps the model from classifying well. But let’s first see whether any of the sets had better recall or precision on a class-by-class basis before attempting to tune our random forest models. Also note that I initially omitted the preprocessing step in the model training; adding it improved the best overall accuracy from 34% to 50%.
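For readers unfamiliar with the preprocessing options named in the train() calls ("center", "scale", "medianImpute"), here is a base R illustration of what those three steps do. It is not caret's implementation, the order of steps is an assumption, and the gene names and values are made up.

```r
# Toy expression table: 5 samples x 2 genes, with one missing value.
train <- data.frame(geneA = c(1.2, 0.8, NA, 1.5, 0.9),
                    geneB = c(-0.4, 0.1, 0.3, -0.2, 0.7))

# Center and scale each gene with statistics taken from the training data only.
mu  <- sapply(train, mean, na.rm = TRUE)
sdv <- sapply(train, sd,   na.rm = TRUE)
scaled <- as.data.frame(scale(train, center = mu, scale = sdv))

# Replace each remaining NA with that gene's training median.
med <- sapply(scaled, median, na.rm = TRUE)
for (g in names(scaled)) scaled[[g]][is.na(scaled[[g]])] <- med[[g]]

# New samples reuse the stored training statistics instead of their own.
newSample <- data.frame(geneA = 1.0, geneB = NA_real_)
newScaled <- as.data.frame(scale(newSample, center = mu, scale = sdv))
for (g in names(newScaled)) newScaled[[g]][is.na(newScaled[[g]])] <- med[[g]]
```

Putting every gene on a comparable scale and filling gaps with a robust value is one plausible reason the preprocessed models scored higher, though the sketch does not prove that for this data.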

pra_all <- rbind(pra1,pra2,pra3,pra4,pra5,pra6,pra7,pra8)
pra_all$GroupMembership <- c(rep(1,4),
                             rep(2,4),
                             rep(3,4),
                             rep(4,4),
                             rep(5,4),
                             rep(6,4),
                             rep(7,4),
                             rep(8,4))
pra_all2 <- pra_all %>% group_by(class) %>% mutate(max=
                  ifelse(accuracy==max(as.numeric(paste(accuracy))),'max','not max'))
max <- subset(pra_all2, pra_all2$max=='max')
max

## # A tibble: 9 x 6
## # Groups:   class [4]
##   class    precision         recall accuracy          GroupMembership max  
##   <fct>    <fct>             <fct>  <fct>                       <dbl> <chr>
## 1 healthy  0.8               0.5    0.807692307692308               1 max  
## 2 acute    0.636363636363636 0.7    0.730769230769231               1 max  
## 3 1 month  0.333333333333333 0.5    0.653846153846154               2 max  
## 4 1 month  0.333333333333333 0.5    0.653846153846154               3 max  
## 5 6 months 1                 0.5    0.961538461538462               3 max  
## 6 healthy  0.8               0.5    0.807692307692308               7 max  
## 7 acute    0.636363636363636 0.7    0.730769230769231               7 max  
## 8 healthy  0.8               0.5    0.807692307692308               8 max  
## 9 acute    0.636363636363636 0.7    0.730769230769231               8 max

We can see from the above table of per-class accuracies that some other groups also made good gene targets for some classes. Group 3 had the best accuracy in predicting both the 1 month and 6 months classes, and group 2 tied it on the 1 month class only. The 1st, 7th, and 8th groups were better at predicting the healthy and acute classes. We had few 6 month samples but many 1 month samples, yet the 1 month class didn’t show changes in our 43 genes that were noticeable enough for the random forest classification to distinguish it. We could try more trees or tune the model to see if there is an improvement. These models were fast, likely because the number of trees was small. Let’s use the randomForest package and its randomForest() function to tune our model and test the same 8 groups.
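One detail is worth checking before fixing mtry at a single value for all 8 groups: randomForest() warns "invalid mtry: reset to within valid range" when mtry is larger than the number of predictor columns, which can happen for gene groups that hold only one or two genes. A minimal base R sketch of capping the value first follows; pickMtry and the toy data frames are hypothetical.

```r
# Hypothetical helper: keep the requested mtry inside [1, number of predictors].
pickMtry <- function(df, response = "class", wanted = 3) {
  nPredictors <- sum(colnames(df) != response)
  max(1, min(wanted, nPredictors))
}

# Toy data frames standing in for a small and a larger gene group.
smallGroup <- data.frame(FSIP1 = c(0.1, 0.4), class = c("healthy", "acute"))
largeGroup <- data.frame(g1 = 1:2, g2 = 3:4, g3 = 5:6, g4 = 7:8,
                         class = c("healthy", "acute"))

pickMtry(smallGroup)   # 1: only one predictor column is available
pickMtry(largeGroup)   # 3: the requested value already fits
```

The result could then be passed along as, for example, mtry = pickMtry(training3) so that each group is fitted with a value that is valid for its own number of genes.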

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training1) <- gsub('-','',colnames(training1))
colnames(testing1) <- gsub('-','',colnames(testing1))
testing1$class <- as.factor(paste(testing1$class))
training1$class <- as.factor(paste(training1$class))
RF1 <- randomForest(class ~ ., data=training1, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

predict1 <- predict(RF1,testing1)
predict1df <- data.frame(predict1, type=testing1$class)
predict1df

##                       predict1     type
## healthyControl_3       1 month  healthy
## healthyControl_11      1 month  healthy
## healthyControl_12        acute  healthy
## healthyControl_13      healthy  healthy
## healthyControl_18      1 month  healthy
## healthyControl_19      1 month  healthy
## healthyControl_20      healthy  healthy
## healthyControl_21      healthy  healthy
## acuteLymeDisease_1       acute    acute
## acuteLymeDisease_4     1 month    acute
## acuteLymeDisease_6       acute    acute
## acuteLymeDisease_7       acute    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22    1 month    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24    healthy    acute
## acuteLymeDisease_27   6 months    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6      acute  1 month
## Antibodies_1month_11     acute  1 month
## Antibodies_1month_12     acute  1 month
## Antibodies_1month_13   1 month  1 month
## Antibodies_1month_26   1 month  1 month
## Antibodies_6months_1   1 month 6 months
## Antibodies_6months_10  healthy 6 months

PRA1 <- precisionRecallAccuracy(predict1df)

## accuracy is:  34.6153846153846 %

PRA1

##      class         precision recall          accuracy
## 1  healthy               0.6  0.375 0.730769230769231
## 2    acute 0.428571428571429    0.3 0.576923076923077
## 3  1 month 0.230769230769231    0.5               0.5
## 4 6 months                 0      0 0.884615384615385

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training2) <- gsub('-','',colnames(training2))
colnames(testing2) <- gsub('-','',colnames(testing2))
testing2$class <- as.factor(paste(testing2$class))
training2$class <- as.factor(paste(training2$class))
RF2 <- randomForest(class ~ ., data=training2, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

predict2 <- predict(RF2,testing2)
predict2df <- data.frame(predict2, type=testing2$class)
predict2df

##                       predict2     type
## healthyControl_3       healthy  healthy
## healthyControl_11        acute  healthy
## healthyControl_12        acute  healthy
## healthyControl_13      healthy  healthy
## healthyControl_18      healthy  healthy
## healthyControl_19        acute  healthy
## healthyControl_20      healthy  healthy
## healthyControl_21      1 month  healthy
## acuteLymeDisease_1       acute    acute
## acuteLymeDisease_4     1 month    acute
## acuteLymeDisease_6    6 months    acute
## acuteLymeDisease_7       acute    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22   6 months    acute
## acuteLymeDisease_23    healthy    acute
## acuteLymeDisease_24    1 month    acute
## acuteLymeDisease_27    healthy    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6      acute  1 month
## Antibodies_1month_11     acute  1 month
## Antibodies_1month_12     acute  1 month
## Antibodies_1month_13   1 month  1 month
## Antibodies_1month_26   1 month  1 month
## Antibodies_6months_1   1 month 6 months
## Antibodies_6months_10    acute 6 months

PRA2 <- precisionRecallAccuracy(predict2df)

## accuracy is:  34.6153846153846 %

PRA2

##      class         precision recall          accuracy
## 1  healthy 0.666666666666667    0.5 0.769230769230769
## 2    acute 0.222222222222222    0.2 0.423076923076923
## 3  1 month 0.333333333333333    0.5 0.653846153846154
## 4 6 months                 0      0 0.846153846153846

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training3) <- gsub('-','',colnames(training3))
colnames(testing3) <- gsub('-','',colnames(testing3))
testing3$class <- as.factor(paste(testing3$class))
training3$class <- as.factor(paste(training3$class))
RF3 <- randomForest(class ~ ., data=training3, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range

predict3 <- predict(RF3,testing3)
predict3df <- data.frame(predict3, type=testing3$class)
predict3df

##                       predict3     type
## healthyControl_3       1 month  healthy
## healthyControl_11      1 month  healthy
## healthyControl_12      1 month  healthy
## healthyControl_13        acute  healthy
## healthyControl_18        acute  healthy
## healthyControl_19      healthy  healthy
## healthyControl_20        acute  healthy
## healthyControl_21      healthy  healthy
## acuteLymeDisease_1       acute    acute
## acuteLymeDisease_4     healthy    acute
## acuteLymeDisease_6     healthy    acute
## acuteLymeDisease_7       acute    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22      acute    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24    healthy    acute
## acuteLymeDisease_27    1 month    acute
## Antibodies_1month_4    healthy  1 month
## Antibodies_1month_6    1 month  1 month
## Antibodies_1month_11   1 month  1 month
## Antibodies_1month_12   1 month  1 month
## Antibodies_1month_13     acute  1 month
## Antibodies_1month_26     acute  1 month
## Antibodies_6months_1  6 months 6 months
## Antibodies_6months_10    acute 6 months

PRA3 <- precisionRecallAccuracy(predict3df)

## accuracy is:  34.6153846153846 %

PRA3

##      class         precision recall          accuracy
## 1  healthy 0.333333333333333   0.25 0.615384615384615
## 2    acute 0.333333333333333    0.3               0.5
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months                 1    0.5 0.961538461538462

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training4) <- gsub('-','',colnames(training4))
colnames(testing4) <- gsub('-','',colnames(testing4))
testing4$class <- as.factor(paste(testing4$class))
training4$class <- as.factor(paste(training4$class))
RF4 <- randomForest(class ~ ., data=training4, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range

predict4 <- predict(RF4,testing4)
predict4df <- data.frame(predict4, type=testing4$class)
predict4df

##                       predict4     type
## healthyControl_3      6 months  healthy
## healthyControl_11      healthy  healthy
## healthyControl_12      1 month  healthy
## healthyControl_13        acute  healthy
## healthyControl_18      1 month  healthy
## healthyControl_19      healthy  healthy
## healthyControl_20      healthy  healthy
## healthyControl_21      healthy  healthy
## acuteLymeDisease_1     healthy    acute
## acuteLymeDisease_4     1 month    acute
## acuteLymeDisease_6     healthy    acute
## acuteLymeDisease_7     healthy    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22   6 months    acute
## acuteLymeDisease_23      acute    acute
## acuteLymeDisease_24    1 month    acute
## acuteLymeDisease_27    healthy    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6    1 month  1 month
## Antibodies_1month_11     acute  1 month
## Antibodies_1month_12     acute  1 month
## Antibodies_1month_13     acute  1 month
## Antibodies_1month_26   1 month  1 month
## Antibodies_6months_1  6 months 6 months
## Antibodies_6months_10  1 month 6 months

PRA4 <- precisionRecallAccuracy(predict4df)

## accuracy is:  34.6153846153846 %

PRA4

##      class         precision recall          accuracy
## 1  healthy               0.5    0.5 0.692307692307692
## 2    acute               0.2    0.1               0.5
## 3  1 month               0.3    0.5 0.615384615384615
## 4 6 months 0.333333333333333    0.5 0.884615384615385

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training5) <- gsub('-','',colnames(training5))
colnames(testing5) <- gsub('-','',colnames(testing5))
testing5$class <- as.factor(paste(testing5$class))
training5$class <- as.factor(paste(training5$class))
RF5 <- randomForest(class ~ ., data=training5, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

predict5 <- predict(RF5,testing5)
predict5df <- data.frame(predict5, type=testing5$class)
predict5df

##                       predict5     type
## healthyControl_3       healthy  healthy
## healthyControl_11     6 months  healthy
## healthyControl_12     6 months  healthy
## healthyControl_13      healthy  healthy
## healthyControl_18      healthy  healthy
## healthyControl_19        acute  healthy
## healthyControl_20      healthy  healthy
## healthyControl_21      1 month  healthy
## acuteLymeDisease_1    6 months    acute
## acuteLymeDisease_4     1 month    acute
## acuteLymeDisease_6     healthy    acute
## acuteLymeDisease_7    6 months    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22   6 months    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24    1 month    acute
## acuteLymeDisease_27    healthy    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6    healthy  1 month
## Antibodies_1month_11   1 month  1 month
## Antibodies_1month_12     acute  1 month
## Antibodies_1month_13   1 month  1 month
## Antibodies_1month_26   1 month  1 month
## Antibodies_6months_1  6 months 6 months
## Antibodies_6months_10 6 months 6 months

PRA5 <- precisionRecallAccuracy(predict5df)

## accuracy is:  38.4615384615385 %

PRA5

##      class         precision            recall          accuracy
## 1  healthy 0.571428571428571               0.5 0.730769230769231
## 2    acute                 0                 0 0.538461538461538
## 3  1 month               0.4 0.666666666666667 0.692307692307692
## 4 6 months 0.285714285714286                 1 0.807692307692308

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training6) <- gsub('-','',colnames(training6))
colnames(testing6) <- gsub('-','',colnames(testing6))
testing6$class <- as.factor(paste(testing6$class))
training6$class <- as.factor(paste(training6$class))
RF6 <- randomForest(class ~ ., data=training6, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range

predict6 <- predict(RF6,testing6)
predict6df <- data.frame(predict6, type=testing6$class)
predict6df

##                       predict6     type
## healthyControl_3       1 month  healthy
## healthyControl_11     6 months  healthy
## healthyControl_12     6 months  healthy
## healthyControl_13        acute  healthy
## healthyControl_18        acute  healthy
## healthyControl_19     6 months  healthy
## healthyControl_20     6 months  healthy
## healthyControl_21      1 month  healthy
## acuteLymeDisease_1     1 month    acute
## acuteLymeDisease_4       acute    acute
## acuteLymeDisease_6     1 month    acute
## acuteLymeDisease_7    6 months    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22   6 months    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24   6 months    acute
## acuteLymeDisease_27    1 month    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6      acute  1 month
## Antibodies_1month_11   1 month  1 month
## Antibodies_1month_12  6 months  1 month
## Antibodies_1month_13   healthy  1 month
## Antibodies_1month_26     acute  1 month
## Antibodies_6months_1  6 months 6 months
## Antibodies_6months_10 6 months 6 months

PRA6 <- precisionRecallAccuracy(predict6df)

## accuracy is:  19.2307692307692 %

PRA6

##      class precision            recall          accuracy
## 1  healthy         0                 0 0.653846153846154
## 2    acute       0.2               0.1               0.5
## 3  1 month       0.2 0.333333333333333 0.538461538461538
## 4 6 months       0.2                 1 0.692307692307692

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training7) <- gsub('-','',colnames(training7))
colnames(testing7) <- gsub('-','',colnames(testing7))
testing7$class <- as.factor(paste(testing7$class))
training7$class <- as.factor(paste(training7$class))
RF7 <- randomForest(class ~ ., data=training7, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

predict7 <- predict(RF7,testing7)
predict7df <- data.frame(predict7, type=testing7$class)
predict7df

##                       predict7     type
## healthyControl_3       1 month  healthy
## healthyControl_11        acute  healthy
## healthyControl_12        acute  healthy
## healthyControl_13      1 month  healthy
## healthyControl_18      1 month  healthy
## healthyControl_19        acute  healthy
## healthyControl_20        acute  healthy
## healthyControl_21     6 months  healthy
## acuteLymeDisease_1    6 months    acute
## acuteLymeDisease_4       acute    acute
## acuteLymeDisease_6     1 month    acute
## acuteLymeDisease_7       acute    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22      acute    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24    healthy    acute
## acuteLymeDisease_27      acute    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6      acute  1 month
## Antibodies_1month_11     acute  1 month
## Antibodies_1month_12  6 months  1 month
## Antibodies_1month_13     acute  1 month
## Antibodies_1month_26  6 months  1 month
## Antibodies_6months_1     acute 6 months
## Antibodies_6months_10 6 months 6 months

PRA7 <- precisionRecallAccuracy(predict7df)

## accuracy is:  23.0769230769231 %

PRA7

##      class         precision            recall          accuracy
## 1  healthy                 0                 0 0.653846153846154
## 2    acute 0.333333333333333               0.4 0.461538461538462
## 3  1 month             0.125 0.166666666666667 0.538461538461538
## 4 6 months               0.2               0.5 0.807692307692308

# The hyphen in HLA-DRB4 causes an error in the model formula, so strip hyphens from the column names of the training and testing sets
set.seed(4567)
colnames(training8) <- gsub('-','',colnames(training8))
colnames(testing8) <- gsub('-','',colnames(testing8))
testing8$class <- as.factor(paste(testing8$class))
training8$class <- as.factor(paste(training8$class))
RF8 <- randomForest(class ~ ., data=training8, 
                    importance=TRUE, nodesize=2, ntree=400,mtry=3)

predict8 <- predict(RF8,testing8)
predict8df <- data.frame(predict8, type=testing8$class)
predict8df

##                       predict8     type
## healthyControl_3       healthy  healthy
## healthyControl_11        acute  healthy
## healthyControl_12        acute  healthy
## healthyControl_13      healthy  healthy
## healthyControl_18      healthy  healthy
## healthyControl_19        acute  healthy
## healthyControl_20      healthy  healthy
## healthyControl_21      healthy  healthy
## acuteLymeDisease_1       acute    acute
## acuteLymeDisease_4     1 month    acute
## acuteLymeDisease_6     1 month    acute
## acuteLymeDisease_7       acute    acute
## acuteLymeDisease_9     1 month    acute
## acuteLymeDisease_13    1 month    acute
## acuteLymeDisease_22      acute    acute
## acuteLymeDisease_23    1 month    acute
## acuteLymeDisease_24    healthy    acute
## acuteLymeDisease_27    healthy    acute
## Antibodies_1month_4    1 month  1 month
## Antibodies_1month_6      acute  1 month
## Antibodies_1month_11     acute  1 month
## Antibodies_1month_12     acute  1 month
## Antibodies_1month_13   1 month  1 month
## Antibodies_1month_26   1 month  1 month
## Antibodies_6months_1  6 months 6 months
## Antibodies_6months_10 6 months 6 months

PRA8 <- precisionRecallAccuracy(predict8df)

## accuracy is:  50 %

PRA8

##      class         precision recall          accuracy
## 1  healthy 0.714285714285714  0.625 0.807692307692308
## 2    acute 0.333333333333333    0.3               0.5
## 3  1 month             0.375    0.5 0.692307692307692
## 4 6 months                 1      1                 1

PRA_all <- rbind(PRA1,PRA2,PRA3,PRA4,PRA5,PRA6,PRA7,PRA8)
PRA_all$groupMembership <- c(rep(1,4),
                             rep(2,4),
                             rep(3,4),
                             rep(4,4),
                             rep(5,4),
                             rep(6,4),
                             rep(7,4),
                             rep(8,4))
PRA_all2 <- PRA_all %>% group_by(class) %>% mutate(max=
    ifelse(accuracy==max(as.numeric(paste(accuracy))),'max','not max'))
max2 <- subset(PRA_all2, PRA_all2$max=='max')
max2

## # A tibble: 5 x 6
## # Groups:   class [4]
##   class    precision        recall         accuracy        groupMembership max  
##   <fct>    <fct>            <fct>          <fct>                     <dbl> <chr>
## 1 acute    0.4285714285714~ 0.3            0.576923076923~               1 max  
## 2 1 month  0.4              0.66666666666~ 0.692307692307~               5 max  
## 3 healthy  0.7142857142857~ 0.625          0.807692307692~               8 max  
## 4 1 month  0.375            0.5            0.692307692307~               8 max  
## 5 6 months 1                1              1                             8 max

The overall accuracy wasn’t as good when calling randomForest() directly as it was with caret’s train() using method='rf'. Group 8, the set of all 43 genes, still scored best or tied for best. Group 5 had a best class score here that it didn’t have in the caret models: it tied group 8 for the best accuracy on the ‘1 month’ class, and it had better recall and precision than group 8 on that class. Group 8 scored 100% on the ‘6 months’ class in recall, precision, and accuracy. Recall that all of our sets use the same split of samples into training and testing, and that there were 8 samples of the 6 months class to train our model and 2 in the testing set to predict. Group 8 found all of the 6 month samples in the testing set (recall), and no other samples were misclassified as the 6 months class (precision). Group 8 also predicted the ‘healthy’ class best, with 81% accuracy, 71% precision, and 63% recall. The acute class was predicted best by group 1, with 58% accuracy, 30% recall (70% of the acute samples were missed), and 43% precision (57% of the samples predicted as acute belonged to another class).
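To make the precision and recall definitions above concrete, here is a minimal sketch of one-vs-rest metrics computed from predicted and actual labels. classMetrics is a hypothetical stand-in and the labels are toy data; the precisionRecallAccuracy() helper used in this analysis is defined elsewhere and may differ in detail.

```r
# Hypothetical helper: one-vs-rest precision, recall, and accuracy per class.
classMetrics <- function(predicted, actual) {
  rows <- lapply(unique(actual), function(cl) {
    tp <- sum(predicted == cl & actual == cl)   # correctly assigned to cl
    fp <- sum(predicted == cl & actual != cl)   # wrongly assigned to cl
    fn <- sum(predicted != cl & actual == cl)   # members of cl that were missed
    tn <- sum(predicted != cl & actual != cl)   # correctly kept out of cl
    data.frame(class     = cl,
               precision = if (tp + fp > 0) tp / (tp + fp) else 0,
               recall    = if (tp + fn > 0) tp / (tp + fn) else 0,
               accuracy  = (tp + tn) / length(actual))
  })
  do.call(rbind, rows)
}

# Toy labels: both 6 month samples are found and nothing else is called 6 months.
actual    <- c("healthy", "healthy", "acute", "acute",   "6 months", "6 months")
predicted <- c("healthy", "acute",   "acute", "healthy", "6 months", "6 months")
m <- classMetrics(predicted, actual)
m
```

In the toy data the ‘6 months’ row comes out at 1 for precision, recall, and accuracy, while ‘healthy’ and ‘acute’ each have precision and recall of 0.5 because one sample of each was swapped.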

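We could test more algorithms, or we could test the original data of 32 completely different genes and go through the same process of grouping the genes by the fold change ratios that we saw in our 7 groups above. We could also test a data set made of the genes in our top performing groups: 1 and 8, plus 2, 3, 5, or 7. Groups 4 and 6 were not the better performer on any class prediction or on overall accuracy. Those groups again are:

Training/Testing split 1: Acute_md_DF
Training/Testing split 2: Acute_mayWork_DF
Training/Testing split 3: Acute_dropsThenUpslightly_DF
Training/Testing split 4: Acute_dropsReturnsSame_DF
Training/Testing split 5: month6_5foldup_DF
Training/Testing split 6: month6_5foldupStartLow_DF
Training/Testing split 7: month6_upMoreThanAcute_DF
Training/Testing split 8: Dance2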

where md is monotonically decreasing from acute -> month 1 -> month 6 and Dance2 is all genes. There weren’t any monotonically increasing genes; all the remaining genes started higher than the 1 month class and then increased to a level close to the acute levels, just under the acute levels, slightly more than the acute levels, or much higher than the acute levels.

All 43 genes (Dance2), the monotonically decreasing genes, and all groups except for groups 4 and 6 can be used. But really we are just picking out the genes from group 8 that aren’t useful. Groups 4 and 6, with group 6 seeming to always score the minimum accuracy, will be left out of the data set we test our models on. This means that neither the genes that return to similar levels from the acute phase to the 6th month, nor the genes that start low in the acute phase but increase to about five fold the acute levels by month 6, will be used. I would have thought those genes would be indicative of the class.

We should just make two data sets, where one is groups 4 and 6 and the other is groups 1, 2, 3, 5, and 7, because group 8 is all the genes in the set. Either set could have some noise. These are fold change values of the mean values across all samples, so it is possible to go back to the dashboard and find some outlier samples that skew the gene values in groups 4 and 6. Let’s see what those genes are again. Acute_dropsReturnsSame and month6_5foldupStartLow are the gene lists made earlier.

poorPathogenesisTargets <- c(month6_5foldupStartLow, Acute_dropsReturnsSame)
poorPathogenesisTargets

## [1] "FSIP1"  "CKMT1B" "HTR3C"
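Splitting the gene columns into the two data sets described above could look like the following base R sketch. The data frame, its values, and the names geneX and geneY are made up for illustration; only the three poor target gene names come from the output above.

```r
# Toy stand-in for the gene table: a class column plus a few gene columns.
genes <- data.frame(class  = c("healthy", "acute", "1 month", "6 months"),
                    FSIP1  = c(0.1, 0.9, 0.2, 0.8),
                    CKMT1B = c(0.3, 0.2, 0.1, 1.4),
                    HTR3C  = c(0.5, 0.4, 0.6, 2.1),
                    geneX  = c(1.0, 2.0, 1.5, 1.1),   # hypothetical gene name
                    geneY  = c(0.7, 0.1, 0.3, 0.6))   # hypothetical gene name

poorTargets <- c("FSIP1", "CKMT1B", "HTR3C")

# Data set 1: the class label plus only the poorly performing genes.
poorSet <- genes[, c("class", intersect(colnames(genes), poorTargets))]

# Data set 2: the class label plus every remaining gene.
keptSet <- genes[, c("class", setdiff(colnames(genes), c("class", poorTargets)))]

colnames(poorSet)
colnames(keptSet)
```

Selecting by name with intersect() and setdiff() avoids relying on column positions, which shift whenever genes are added or removed.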

We need to go back to the dashboard and see if FSIP1, CKMT1B, or HTR3C have some samples that are skewing their gene expression values greatly.

I hadn’t actually posted the individual gene expression values in a chart on Tableau, so I just loaded one. It shows that some samples are skewing the data for those three genes, FSIP1, CKMT1B, and HTR3C. I have decided to backtrack and see whether the results improve if I take the median fold change values across all samples instead of the mean.

[Figure: individual samples’ gene expression values for FSIP1, CKMT1B, and HTR3C across all four classes]

Figure 9: I added the sample chart to see the individual samples within each class (healthy, acute Lyme disease, one month of antibiotic treatment, and six months of antibiotic treatment) after realizing that some of the genes’ fold change values are skewing the data greatly. In the above image of this chart (linked to through the image), the samples that skewed this set of genes when running the machine learning algorithms were: sample 7 of the acute class samples, sample 12 of the 1 month class samples, sample 10 of the 6 month class samples, and samples 1, 11, and 12 of the healthy class samples. I want to either remove these samples and run some machine learning on the set, or take the median sample values instead when deriving the fold change values.

At this point I want to backtrack and use the median values. I will test the median in a new document and reference it from this document, along with the machine learning results.
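The motivation for switching from the mean to the median can be seen in a small base R example. The expression values are made up, and the fold change is assumed to be the ratio of the class summary to the healthy summary.

```r
# Toy expression values for one gene; the last acute sample is an outlier.
healthy <- c(1.0, 1.1, 0.9, 1.0, 1.2)
acute   <- c(1.9, 2.1, 2.0, 2.2, 9.5)

# Fold change of the acute class relative to healthy, using each summary.
fcMean   <- mean(acute)   / mean(healthy)     # pulled upward by the outlier
fcMedian <- median(acute) / median(healthy)   # reflects the typical samples

c(mean = fcMean, median = fcMedian)
```

A single extreme sample moves the mean-based fold change well away from what most samples show, while the median-based value stays near the typical ratio.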

I did that work on the median sample values and dropped the six samples that seemed to skew the data, but the results weren’t better: the best score was 42% accuracy, whereas the best score here has been 54% accuracy so far. We still need to test the machine learning results on the original data to which destandardization wasn’t applied. The median-derived fold changes, and the results with the six outlier samples in the mean-derived fold change data before getting the median-derived fold changes, are on RPubs as part 2 of the Lyme Disease Ticks document.

I took six skewed samples out of that set, but never tested whether taking those samples out of this data would improve the classification accuracy here. We can do that quickly with our testing and training sets. Let’s use all the data of training and testing set 8.

row.names(training8)

##  [1] "Antibodies_6months_5" "Antibodies_1month_2"  "Antibodies_6months_9"
##  [4] "Antibodies_1month_10" "Antibodies_1month_1"  "healthyControl_4"    
##  [7] "Antibodies_6months_4" "acuteLymeDisease_19"  "Antibodies_1month_25"
## [10] "Antibodies_1month_3"  "Antibodies_1month_16" "acuteLymeDisease_21" 
## [13] "healthyControl_8"     "Antibodies_1month_8"  "acuteLymeDisease_15" 
## [16] "acuteLymeDisease_11"  "healthyControl_5"     "Antibodies_1month_5" 
## [19] "Antibodies_1month_20" "acuteLymeDisease_12"  "Antibodies_1month_9" 
## [22] "acuteLymeDisease_16"  "acuteLymeDisease_28"  "acuteLymeDisease_25" 
## [25] "Antibodies_6months_7" "healthyControl_2"     "Antibodies_1month_14"
## [28] "healthyControl_1"     "acuteLymeDisease_26"  "Antibodies_1month_27"
## [31] "healthyControl_6"     "acuteLymeDisease_18"  "Antibodies_1month_19"
## [34] "healthyControl_16"    "Antibodies_1month_21" "healthyControl_15"   
## [37] "healthyControl_9"     "acuteLymeDisease_17"  "healthyControl_14"   
## [40] "Antibodies_1month_24" "Antibodies_1month_15" "healthyControl_17"   
## [43] "Antibodies_6months_6" "acuteLymeDisease_14"  "acuteLymeDisease_20" 
## [46] "Antibodies_1month_7"  "healthyControl_10"    "Antibodies_6months_8"
## [49] "acuteLymeDisease_8"   "Antibodies_6months_2" "Antibodies_1month_17"
## [52] "acuteLymeDisease_3"   "Antibodies_6months_3" "healthyControl_7"    
## [55] "Antibodies_1month_23" "Antibodies_1month_18" "acuteLymeDisease_2"  
## [58] "Antibodies_1month_22" "acuteLymeDisease_5"   "acuteLymeDisease_10"

As a reminder, the skewed samples were sample 7 of the acute class, sample 12 of the 1 month class, sample 10 of the 6 month class, and samples 1, 11, and 12 of the healthy class.

Check back for machine learning on the original data.

skewSamples <- c('Antibodies_6months_10','Antibodies_1month_12','acuteLymeDisease_7',
                 'healthyControl_1','healthyControl_11','healthyControl_12')

sort(row.names(training8))

##  [1] "acuteLymeDisease_10"  "acuteLymeDisease_11"  "acuteLymeDisease_12" 
##  [4] "acuteLymeDisease_14"  "acuteLymeDisease_15"  "acuteLymeDisease_16" 
##  [7] "acuteLymeDisease_17"  "acuteLymeDisease_18"  "acuteLymeDisease_19" 
## [10] "acuteLymeDisease_2"   "acuteLymeDisease_20"  "acuteLymeDisease_21" 
## [13] "acuteLymeDisease_25"  "acuteLymeDisease_26"  "acuteLymeDisease_28" 
## [16] "acuteLymeDisease_3"   "acuteLymeDisease_5"   "acuteLymeDisease_8"  
## [19] "Antibodies_1month_1"  "Antibodies_1month_10" "Antibodies_1month_14"
## [22] "Antibodies_1month_15" "Antibodies_1month_16" "Antibodies_1month_17"
## [25] "Antibodies_1month_18" "Antibodies_1month_19" "Antibodies_1month_2" 
## [28] "Antibodies_1month_20" "Antibodies_1month_21" "Antibodies_1month_22"
## [31] "Antibodies_1month_23" "Antibodies_1month_24" "Antibodies_1month_25"
## [34] "Antibodies_1month_27" "Antibodies_1month_3"  "Antibodies_1month_5" 
## [37] "Antibodies_1month_7"  "Antibodies_1month_8"  "Antibodies_1month_9" 
## [40] "Antibodies_6months_2" "Antibodies_6months_3" "Antibodies_6months_4"
## [43] "Antibodies_6months_5" "Antibodies_6months_6" "Antibodies_6months_7"
## [46] "Antibodies_6months_8" "Antibodies_6months_9" "healthyControl_1"    
## [49] "healthyControl_10"    "healthyControl_14"    "healthyControl_15"   
## [52] "healthyControl_16"    "healthyControl_17"    "healthyControl_2"    
## [55] "healthyControl_4"     "healthyControl_5"     "healthyControl_6"    
## [58] "healthyControl_7"     "healthyControl_8"     "healthyControl_9"

sort(skewSamples)

## [1] "acuteLymeDisease_7"    "Antibodies_1month_12"  "Antibodies_6months_10"
## [4] "healthyControl_1"      "healthyControl_11"     "healthyControl_12"

skewSamples %in% row.names(training8)

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

skewSamples %in% row.names(testing8)

## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

dim(training8);dim(testing8)

## [1] 60 44

## [1] 26 44

training8b <- subset(training8,!(row.names(training8) %in% skewSamples))
testing8b <- subset(testing8, !(row.names(testing8) %in% skewSamples))

skewSamples %in% row.names(training8b)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

skewSamples %in% row.names(testing8b)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

dim(training8b);dim(testing8b)

## [1] 59 44

## [1] 21 44

Now, we can see if there is an improvement in accuracy in machine learning prediction. Training/Testing 1:

set.seed(589647)
rfMod8b <- train(class~., method='rf', 
               na.action=na.pass,
               data=(training8b),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRF8b <- predict(rfMod8b, testing8b)

predDF8b <- data.frame(predRF8b, type=testing8b$class)
predDF8b

##    predRF8b     type
## 1   healthy  healthy
## 2   healthy  healthy
## 3   healthy  healthy
## 4     acute  healthy
## 5   healthy  healthy
## 6   1 month  healthy
## 7     acute    acute
## 8   1 month    acute
## 9   1 month    acute
## 10  1 month    acute
## 11  1 month    acute
## 12    acute    acute
## 13  1 month    acute
## 14 6 months    acute
## 15  healthy    acute
## 16  1 month  1 month
## 17    acute  1 month
## 18    acute  1 month
## 19  1 month  1 month
## 20  1 month  1 month
## 21 6 months 6 months

pra8b <- precisionRecallAccuracy(predDF8b)

## accuracy is:  47.6190476190476 %

pra8b

##      class         precision            recall          accuracy
## 1  healthy               0.8 0.666666666666667 0.857142857142857
## 2    acute               0.4 0.222222222222222 0.523809523809524
## 3  1 month 0.333333333333333               0.6 0.619047619047619
## 4 6 months               0.5                 1 0.952380952380952

The accuracy is 47.6% for all genes used on all samples except the six skewed ones. The highest Group 8 scored earlier was 54%, so removing the six samples did not improve on it. Maybe we could try training on more samples and keeping a smaller test set, or making the classes more balanced. Let’s see what our class counts are in each set.

The healthy class isn’t too bad at 80% precision: of the five samples the model predicted as healthy, four really were healthy and one was an acute sample. The recall on the healthy class is 67%, meaning the model found four of the six true healthy samples and missed the other two. The recall was 100% on the 6 month class because there is only one 6 month sample in the testing set and it was predicted correctly. But the model also predicted one of the acute samples as 6 months, so of the two samples predicted as 6 months only one was right, and the precision is 50%.

Precision and recall share the same numerator, the number of samples correctly predicted as a class. Precision divides it by the number of samples the model predicted as that class, while recall divides it by the true number of samples in that class. People have tried shortening this, or putting a prime over the P as P’ to condense the interpretation, but that leaves out those facts and just adds confusion. It really needs to be fully written out, instead of assuming the readers know shorthand abbreviations that could have been explained many pages or chapters before the current page. I will always bring this up, because the shorthand differences are the reason for the inconsistencies, misinterpretation, and confusion among people who don’t use these measures day in and day out like their normal cup of coffee. You’ll see this with type I errors (false positives) and type II errors (false negatives) in hypothesis testing too.

People wired to paraphrase also get confused there, because they want to condense that to false positives as negatives and false negatives as positives, but really a false positive is an actual negative labeled positive, and a false negative is an actual positive labeled negative. It’s not a simple rule like the derivative of a constant is always 0, or the derivative of x^2 with respect to y is 0. It’s a mnemonic of sorts that goes through many versions of shorthand text. I believe you could honestly pull students at or above the 90th percentile of calculus III aside at random, ask them to calculate precision and recall, and not get consistent results, because these measures belong to statistics rather than calculus and aren’t seen as really relevant until predictive analytics or machine learning. They seem trivial, because class imbalance is irrelevant until you try to improve the accuracy, either by balancing the classes or by separating them so that the model predicts accurately within a subset of the classes rather than across the total set. So, let’s get to class balancing the best we can.
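
The precision and recall arithmetic can be checked by hand from the predDF8b printout above. The sketch below copies the 21 predicted and true labels from that printout and recomputes the per-class values. The helper `prf_by_class` is a hypothetical name used only for illustration; it is not the `precisionRecallAccuracy` function used in this document, whose internals are not shown here.

```r
# Predicted and true labels copied row by row from the predDF8b printout.
pred <- c("healthy", "healthy", "healthy", "acute", "healthy", "1 month",
          "acute", "1 month", "1 month", "1 month", "1 month", "acute",
          "1 month", "6 months", "healthy",
          "1 month", "acute", "acute", "1 month", "1 month",
          "6 months")
truth <- c(rep("healthy", 6), rep("acute", 9), rep("1 month", 5), "6 months")

# Hypothetical helper (an assumption, not the document's own function).
prf_by_class <- function(pred, truth) {
  classes <- unique(truth)
  out <- lapply(classes, function(k) {
    tp <- sum(pred == k & truth == k)  # predicted k and really k
    fp <- sum(pred == k & truth != k)  # predicted k but really another class
    fn <- sum(pred != k & truth == k)  # really k but predicted as another class
    data.frame(class = k,
               precision = tp / (tp + fp),  # denominator: all predicted as k
               recall    = tp / (tp + fn))  # denominator: all truly k
  })
  do.call(rbind, out)
}

res <- prf_by_class(pred, truth)
print(res)

# Overall accuracy: share of rows where prediction and truth agree.
overall_accuracy <- mean(pred == truth)
print(overall_accuracy)
```

If a class is never predicted, `tp + fp` is zero and the precision comes out as `NaN` in this sketch, so that case would need its own handling.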

train8b <- training8b %>% group_by(class) %>% count(class)
test8b <- testing8b %>% group_by(class) %>% count(class)

train8b

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <fct>    <int>
## 1 1 month     21
## 2 6 months     8
## 3 acute       18
## 4 healthy     12

test8b

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <fct>    <int>
## 1 1 month      5
## 2 6 months     1
## 3 acute        9
## 4 healthy      6

Let’s first try training on all 86 of the samples except for 1 in each class, then use that 1 sample from each class as the testing set. To select which one to hold out from each class, let’s keep one that is typical of its class: we take each sample’s mean across all genes, rank the samples within the class by that mean, and pick the one in the middle of the ranking. The ranking uses all the samples in each class, including the six skewed ones.

all <- rbind(testing8,training8)
dim(all)

## [1] 86 44

healthy <- subset(all, all$class=='healthy')
m1 <- subset(all, all$class=='1 month')
m6 <- subset(all, all$class=='6 months')
acute <- subset(all, all$class=='acute')

healthyMean <- order(apply(healthy[,-1],1,mean))
m1Mean <- order(apply(m1[,-1],1,mean))
m6Mean <- order(apply(m6[,-1],1,mean))
acuteMean <- order(apply(acute[,-1],1,mean))

H1 <- healthyMean[floor(length(healthyMean)/2)]
M1 <- m1Mean[floor(length(m1Mean)/2)]
M6 <- m6Mean[floor(length(m6Mean)/2)]
A1 <- acuteMean[floor(length(acuteMean)/2)]

testUno <- rbind(healthy[H1,],m1[M1,],m6[M6,],acute[A1,])
testUno

##                         class     LCN2      LTF  CEACAM8    DEFA4     CAMP
## healthyControl_15     healthy 17.13554 38.03209 28.97459 54.96061 18.54468
## Antibodies_1month_22  1 month 74.79538 30.14569 31.15147 23.96823 34.69399
## Antibodies_6months_6 6 months 14.03154 10.51058 24.86117 15.04481 15.14718
## acuteLymeDisease_28     acute 18.31185 24.21692 23.66863 24.37679 20.76201
##                           BPI     MS4A3   TNFSF10    FCGR3B    DEFA1     IL1B
## healthyControl_15    26.27302 30.679988  4.737728  10.43546 38.64095 15.05342
## Antibodies_1month_22 26.42632 69.686460 45.287117 137.63276 25.53752 20.64861
## Antibodies_6months_6 18.03675  3.824587 10.129959  15.14794 10.99052 80.27107
## acuteLymeDisease_28  23.55601  9.509528 47.924129  59.37641 18.46380 80.03062
##                        CKMT1B     THBD    HTR3C   TXNL4A    DHX58    MUC12
## healthyControl_15    21.77310 16.95726 22.03265 17.98004 17.52017 18.00291
## Antibodies_1month_22 37.74576 34.85232 36.38374 43.34992 34.07528 26.58173
## Antibodies_6months_6 47.40098 41.04443 26.94393 15.92046 13.26633 52.08020
## acuteLymeDisease_28  26.54026 66.61738 24.00392 23.84010 25.46579 23.86389
##                          LSM2     MYOM2      HBG1   HLADRB4      CTSG    RGS18
## healthyControl_15    14.55579  18.88724 74.250569 134.11953 30.868166 14.23276
## Antibodies_1month_22 44.82963 130.37108 33.606807  29.88086 23.700120 64.04231
## Antibodies_6months_6 14.52782  10.68510  9.518176 158.95025  8.712228  8.22711
## acuteLymeDisease_28  26.99585  10.82321 27.438872  19.20605 28.341905 22.74309
##                          GAPT  SERPINB2    THBS1      AREG     CXCL2       XIST
## healthyControl_15    12.78011  21.86668 115.1202  44.29163  40.54945 264.995632
## Antibodies_1month_22 51.84625  15.36698  14.5359  18.25920  40.07748   5.577409
## Antibodies_6months_6 14.47655  56.79544 123.0045  49.87920  48.29439 245.154850
## acuteLymeDisease_28  27.25833 251.52046 115.4409 107.67192 238.19923   4.213683
##                           OLR1   OR2B11     FSIP1       TSIX  C7orf55   CHI3L1
## healthyControl_15     29.20186 15.40234  17.00508 144.156847 19.09080 26.03393
## Antibodies_1month_22  29.22141 22.13845  43.96003   7.827089 37.14990 35.64124
## Antibodies_6months_6 118.89554 49.42415 207.96209 122.374346 22.68011 17.43028
## acuteLymeDisease_28   67.36953 87.49594  46.01348  15.058749 30.79871 29.23335
##                      KIAA1245    BEST1     LIPN     GZMH  KIR2DL3   KIR2DS1
## healthyControl_15    31.57280 18.19139 12.71581  5.92233 10.14529  9.736479
## Antibodies_1month_22 27.71482 29.81649 67.69835 49.63028 42.19264 48.858579
## Antibodies_6months_6 75.20643 62.78859 38.05924 12.85319 25.17363 23.375599
## acuteLymeDisease_28  54.28491 36.54691 51.24253 48.10395 32.97931 10.734790
##                        POLR2I      S100B
## healthyControl_15    17.48189  13.097410
## Antibodies_1month_22 44.75784 223.073644
## Antibodies_6months_6 14.71374   7.099816
## acuteLymeDisease_28  25.42501  35.586478

t1Names <- row.names(testUno)
trainUno <- subset(all, !(row.names(all) %in% t1Names))
dim(trainUno);dim(testUno)

## [1] 82 44

## [1]  4 44
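
The selection above ranks each class by the per-sample mean across genes and takes the middle-ranked sample. Another reading of "closest to the class mean" is the sample nearest the class centroid, gene by gene. The sketch below shows that alternative on toy data; `toy` and `closest_to_centroid` are hypothetical names, and the layout (a class column followed by numeric gene columns) is assumed to mirror the `all` data frame.

```r
set.seed(1)
# Toy stand-in: a class column followed by numeric gene columns.
toy <- data.frame(class = rep(c("healthy", "acute"), each = 5),
                  geneA = rnorm(10, 20, 5),
                  geneB = rnorm(10, 40, 10))
row.names(toy) <- paste0(toy$class, "_", rep(1:5, times = 2))

# Hypothetical helper: row name of the sample nearest its class centroid.
closest_to_centroid <- function(df, class_col = "class") {
  genes <- setdiff(names(df), class_col)
  sapply(split(df, df[[class_col]]), function(d) {
    m <- as.matrix(d[, genes, drop = FALSE])
    centroid <- colMeans(m)                    # per-gene mean of the class
    dist2 <- rowSums(sweep(m, 2, centroid)^2)  # squared Euclidean distance
    row.names(d)[which.min(dist2)]
  })
}

test_names <- closest_to_centroid(toy)
test_set   <- toy[test_names, ]
train_set  <- toy[!(row.names(toy) %in% test_names), ]
dim(train_set); dim(test_set)
```

Because the helper returns row names rather than positions, the same names can be used to subset any data frame that shares those row names.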

Now, let’s see how well our model does.

set.seed(589647)
rfModUno <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainUno),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFUno <- predict(rfModUno, testUno)

predDFUno <- data.frame(predRFUno, type=testUno$class)
predDFUno

##   predRFUno     type
## 1   healthy  healthy
## 2   1 month  1 month
## 3  6 months 6 months
## 4   healthy    acute

pra_uno <- precisionRecallAccuracy(predDFUno)

## accuracy is:  75 %

pra_uno

##      class precision recall accuracy
## 1  healthy       0.5      1     0.75
## 2  1 month         1      1        1
## 3 6 months         1      1        1
## 4    acute         0      0     0.75
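
With only four test samples, a single prediction moves the accuracy by 25 percentage points. One way to see how much uncertainty that leaves is an exact binomial confidence interval, using `binom.test` from base R's stats package. This is a rough sketch that treats each test prediction as an independent trial, which is an assumption; the 10 out of 21 comparison is the earlier testing8b result.

```r
# Exact (Clopper-Pearson) 95% interval for 3 correct out of 4 test samples.
ci_3of4 <- binom.test(x = 3, n = 4)$conf.int
print(ci_3of4)

# For comparison: 10 correct out of 21 on the larger testing8b set.
ci_10of21 <- binom.test(x = 10, n = 21)$conf.int
print(ci_10of21)

# Width of each interval; the four-sample interval is much wider.
c(width_3of4 = ci_3of4[2] - ci_3of4[1],
  width_10of21 = ci_10of21[2] - ci_10of21[1])
```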

The overall accuracy improved to 75% when holding out one middle-ranked sample per class and training on the rest, although with only four test samples each prediction is worth 25 percentage points. The acute sample was misclassified as healthy, so the acute class has 0% precision and recall, and the healthy class has 100% recall but only 50% precision. Maybe we can improve the accuracy even more by removing, from each class, the sample whose gene values have the largest standard deviation, and then predicting our one sample per class testing set again. Let’s see if we can.

healthyStd <- order(apply(healthy[,-1],1,sd))
m1Std <- order(apply(m1[,-1],1,sd))
m6Std <- order(apply(m6[,-1],1,sd))
acuteStd <- order(apply(acute[,-1],1,sd))

H1b <- healthyStd[length(healthyStd)]
m1b <- m1Std[length(m1Std)]
m6b <- m6Std[length(m6Std)]
acuteb <- acuteStd[length(acuteStd)]

allb <- all[-c(H1b,m1b,m6b,acuteb),]

row.names(testUno)==row.names(allb)

## Warning in row.names(testUno) == row.names(allb): longer object length is not a
## multiple of shorter object length

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

testDos <- testUno
trainDos <- allb
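
Two checks are worth running before training on a set like this. The `==` comparison above lines the two name vectors up position by position and recycles the shorter one, so it does not say whether a test sample appears anywhere in the training set; `%in%` does. The sketch below uses toy sample names (all hypothetical) to show the difference and to build a training set by name so that it cannot overlap the test set.

```r
# Toy row names standing in for the real sample names.
all_names  <- paste0("sample_", 1:10)
test_names <- c("sample_4", "sample_7")
outliers   <- c("sample_1", "sample_9")  # hypothetical outlier picks, by name

# `==` compares position by position, recycling the shorter vector,
# so it can be all FALSE even when every test name is present.
print(test_names == all_names)

# `%in%` answers the membership question directly.
print(test_names %in% all_names)

# Build the training names: drop outliers and test samples by name.
train_names <- setdiff(all_names, union(outliers, test_names))

# Leakage check: training and test sets should share no samples.
stopifnot(length(intersect(train_names, test_names)) == 0)
print(train_names)
```

With a real data frame the same names would be used as `df[train_names, ]` and `df[test_names, ]`.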

Let’s see if removing the outlier from each class, meaning the sample with the largest standard deviation across all gene values, will improve accuracy in prediction.

set.seed(589647)
rfModDos <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainDos),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFDos <- predict(rfModDos, testDos)

predDFDos <- data.frame(predRFDos, type=testDos$class)
predDFDos

##   predRFDos     type
## 1   healthy  healthy
## 2   1 month  1 month
## 3  6 months 6 months
## 4     acute    acute

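It worked: all four classes were predicted correctly, so we scored 100% accuracy. This score should be read with caution, though. The indices H1b, m1b, m6b, and acuteb are positions within each class subset, but they were used to drop rows of all, and trainDos was never filtered against testUno. The largest class has 28 samples, so those indices cannot exceed 28, while the four test samples sit in rows 49 to 84 of all. The model was therefore trained on the same four samples it then predicted. Let’s see this for precision and recall too.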

pra_dos <- precisionRecallAccuracy(predDFDos)

## accuracy is:  100 %

pra_dos

##      class precision recall accuracy
## 1  healthy         1      1        1
## 2  1 month         1      1        1
## 3 6 months         1      1        1
## 4    acute         1      1        1

Great! Precision and recall are all 100%, as they should be since the accuracy was 100%.

We could also see what the accuracy is with more test samples, like the next closest to the mean, and keep growing the test set with the samples closest to their class mean and rerunning our model until we get a measure that is indicative of the population. There must be some discrepancies among the samples as far as what is in their systems, how long they had Lyme disease before they began treatment, what vitamins and other medications they’re on, how old they are, their overall health, gender, recent injuries, etc. But otherwise, with all other variables held constant, these genes look promising for separating the Lyme disease classes from healthy in PBMC or blood, though four test samples are too few to be sure.

Let’s look at the original data that wasn’t de-standardized and see if we can get the same or better results, or filter out the samples that deviate most from the average gene expressions. I have been saying this for a while, but more approaches kept occurring to me before leaving this de-standardized set of gene expression data.

stand <- Lyme9[,-c(88:94)]
standSampleNames <- colnames(stand)[2:87]

month1 <- grep('1month',standSampleNames)
month6 <- grep('6month',standSampleNames)
healthy <- grep('healthy',standSampleNames)
acute <- grep('acute',standSampleNames)

class <- standSampleNames
class[month1] <- '1 month'
class[month6] <- '6 months'
class[healthy] <- 'healthy'
class[acute] <- 'acute'


standGeneNames <- stand$Gene
stand <- as.data.frame(t(stand[,-1]))
colnames(stand) <- standGeneNames
stand$class <- class
stand2 <- stand[,c(34,1:33)]
head(stand2)

##                    class       CABP1      POU3F2         CTXN3      CYP7B1
## healthyControl_1 healthy  0.00003390 -0.00003100 -2.068333e-06  0.00007770
## healthyControl_2 healthy  0.20398974  0.23925924 -4.171753e-02 -0.13945055
## healthyControl_3 healthy  0.53017545  0.12777781  2.462832e-01  0.48965120
## healthyControl_4 healthy  0.03769469 -0.07923627 -1.347235e-01  0.11624050
## healthyControl_5 healthy -0.22982526 -0.10329986 -2.256233e-01 -0.07544518
## healthyControl_6 healthy  0.14261413 -0.05624676  2.141688e-01  0.17962694
##                         CENPF         PEX26       ISG20      CLEC2L
## healthyControl_1  0.000020504  8.380425e-05 -0.00000525  0.00003390
## healthyControl_2 -0.061241150 -2.901954e-02 -0.81921244 -0.05779099
## healthyControl_3 -0.331103565  1.364853e-01 -1.26617620  0.23713994
## healthyControl_4  0.225760225 -1.274686e-02  0.22721529 -0.29611158
## healthyControl_5 -0.158033967  1.454494e-01  0.11260080 -0.34715940
## healthyControl_6 -0.449312565  2.032977e-02 -0.07338428  0.17190409
##                       TMEM194A        PDZRN3       NUDT18         DLG3
## healthyControl_1 -0.0002138617  5.150233e-05  0.001091003 -0.000759415
## healthyControl_2  0.0008786520  9.941888e-02 -0.235054730 -0.133482646
## healthyControl_3 -0.0146088577  3.188891e-01  0.881111150 -0.208074767
## healthyControl_4  0.0448924700 -1.342481e-01 -0.086553570 -0.164650345
## healthyControl_5 -0.2484165067 -1.581577e-01 -0.254269360  0.047922324
## healthyControl_6 -0.0732025320  8.387852e-02  1.138130700  0.024404717
##                        IGFALS       SLC1A1            F2       OTOS
## healthyControl_1 -0.000343000  0.000299095  0.0001342283  0.0002070
## healthyControl_2  0.113092660 -0.032329679 -0.0762577053 -0.2211900
## healthyControl_3  0.423058270  0.131755352  0.4079195013  0.8309946
## healthyControl_4  0.101755140 -0.076371432 -0.2981442627 -0.1290300
## healthyControl_5  0.496858360 -0.239856365 -0.2190894300 -0.1959085
## healthyControl_6  0.007001638  0.177237512  0.3654349633  0.2405429
##                           ENO1         GATC       FAM162A         PSMF1
## healthyControl_1 -0.0001921633 -0.000256000  0.0000765325 -8.066483e-05
## healthyControl_2 -1.3440033333 -0.127287860 -0.6245107500  7.158557e-02
## healthyControl_3 -0.6280503333 -0.152866360 -0.7001379900 -4.501788e-01
## healthyControl_4  0.1182982100 -0.359013560  0.0864962350  5.658539e-02
## healthyControl_5 -0.3270891500 -0.038307190  0.8462015400 -5.105789e-02
## healthyControl_6 -0.4178791000 -0.002932549 -0.2148986000  4.468243e-02
##                         HECW1        MAP2K7   LOC400657      PRR24      OR52A4
## healthyControl_1 -0.000307797 -0.0002498642 -0.00065100 -0.0005050 -0.00027200
## healthyControl_2 -0.085825205  0.1400634320 -0.42919350 -0.5969741  0.06674075
## healthyControl_3  0.706030240  0.6052526480 -0.12039113 -0.5586877  0.27201056
## healthyControl_4 -0.165934685  0.0417925380 -0.45719910 -0.1085486 -0.26390958
## healthyControl_5 -0.276568890 -0.0204313716 -0.13633752 -0.3364162  0.01595378
## healthyControl_6  0.296823980  0.4235142192 -0.06637859  0.3638177  0.08342671
##                        RGPD3       FRS3          HPGD      RNF168      KCNJ16
## healthyControl_1 -0.00024600 -0.0001400  0.0000453905 -0.00027800  0.00022185
## healthyControl_2  0.26228500  0.5222826 -0.0960063920 -0.08893681 -0.14524805
## healthyControl_3 -0.04079795  0.0001400 -0.1628251363 -0.10046721  0.11858976
## healthyControl_4 -0.37593746  0.1984134  0.3786586206 -0.23633862 -0.29451060
## healthyControl_5  0.19060660 -0.3893075  0.2273819169 -0.01087332 -0.22437679
## healthyControl_6  0.26550126 -0.6040711 -0.0666730125  0.18056202  0.63788391
##                       ESYT1       POU4F2       KHDRBS3
## healthyControl_1 -0.0005580 -0.000424625 -0.0002194645
## healthyControl_2 -0.2626066  0.046180010  0.3207392700
## healthyControl_3  0.1145935  0.909273735  0.1385450350
## healthyControl_4  0.2984104 -0.464481365 -0.0196878910
## healthyControl_5  0.2822208 -0.246123550 -0.0518192050
## healthyControl_6  0.5622082  0.152537220  0.0963140750

We could look through all the genes and note the subcategories of behaviors, but there wasn’t really an improvement in accuracy when doing this earlier for the de-standardized data. So we will just use all the genes. One gene, LOC400657, doesn’t have a gene summary, so it won’t be in the Tableau charts on this data. But we can still compare how it does in predicting the classification alongside our other genes.

Let’s split the data into testing and training sets.

set.seed(1234)
train2 <- sample(1:86,.7*86)
trainingNorm <- stand2[train2,]
testingNorm <- stand2[-train2,]
dim(trainingNorm);dim(testingNorm)

## [1] 60 34

## [1] 26 34
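
A plain `sample()` split like the one above leaves the per-class counts to chance. A stratified split samples the same fraction within each class instead. The sketch below is a base R illustration under stated assumptions: `stratified_train_idx` is a hypothetical helper, and the toy `labels` vector only reuses the class sizes of this data set (21 healthy, 28 acute, 27 one month, 10 six months), not the real sample order.

```r
set.seed(1234)
# Toy labels with the same class sizes as this data set.
labels <- rep(c("healthy", "acute", "1 month", "6 months"),
              times = c(21, 28, 27, 10))

# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratified_train_idx <- function(labels, frac = 0.7) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(i) {
    # Index into i so that a class with a single row is handled safely.
    i[sample.int(length(i), size = floor(frac * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_train_idx(labels)
table(labels[train_idx])   # training counts per class
table(labels[-train_idx])  # testing counts per class
```

The returned indices can be used the same way as `train2` above, for example `stand2[train_idx, ]` and `stand2[-train_idx, ]`, provided the labels vector is in the same row order as the data frame.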

Training/Testing 1:

set.seed(589647)
rfModNorm <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainingNorm),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFNorm <- predict(rfModNorm, testingNorm)

predDFNorm <- data.frame(predRFNorm, type=testingNorm$class)
predDFNorm

##    predRFNorm     type
## 1     1 month  healthy
## 2     1 month  healthy
## 3       acute  healthy
## 4     healthy  healthy
## 5     1 month    acute
## 6       acute    acute
## 7       acute    acute
## 8     healthy    acute
## 9       acute    acute
## 10      acute    acute
## 11      acute    acute
## 12      acute    acute
## 13    1 month  1 month
## 14      acute  1 month
## 15      acute  1 month
## 16    1 month  1 month
## 17      acute  1 month
## 18    1 month  1 month
## 19      acute  1 month
## 20      acute  1 month
## 21    1 month  1 month
## 22      acute  1 month
## 23    1 month  1 month
## 24   6 months 6 months
## 25    1 month 6 months
## 26    healthy 6 months

praNorm <- precisionRecallAccuracy(predDFNorm)

## accuracy is:  50 %

praNorm

##      class         precision            recall          accuracy
## 1  healthy 0.333333333333333              0.25 0.807692307692308
## 2    acute 0.461538461538462              0.75 0.653846153846154
## 3  1 month 0.555555555555556 0.454545454545455 0.615384615384615
## 4 6 months                 1 0.333333333333333 0.923076923076923

The accuracy was 50% on this log2 normalized data, which is in the same range of accuracy the de-standardized data scored. The best precision was on the 6 month class, then the 1 month, acute, and healthy classes. The recall was best on the acute class, then the 1 month, 6 month, and healthy classes. Let’s try removing the samples with the highest standard deviation. But first let’s see the class balance, or number of samples in each class, for the training and testing sets.

train2Bal <- trainingNorm %>% group_by(class) %>% count(class)
test2Bal <- testingNorm %>% group_by(class) %>% count(class)
train2Bal

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month     16
## 2 6 months     7
## 3 acute       20
## 4 healthy     17

test2Bal

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month     11
## 2 6 months     3
## 3 acute        8
## 4 healthy      4

The balance seems to be good, and the 7 to 3 split of the 6 month class seemed to help it score 100% precision, although its recall was only 33%. There must be a lot of variance in the other classes, especially the acute and 1 month classes, since they had a lot of samples to train on and still didn’t score well when predicting many of those samples.

Let’s remove the sample from each class that has the highest standard deviation and see if it helps the prediction accuracy.

H1nb <- subset(stand2, stand2$class=='healthy')
m1nb <- subset(stand2,stand2$class=='1 month')
m6nb <- subset(stand2, stand2$class=='6 months')
acutenb <- subset(stand2, stand2$class=='acute')
dim(H1nb);dim(m1nb);dim(m6nb);dim(acutenb)

## [1] 21 34

## [1] 27 34

## [1] 10 34

## [1] 28 34

The dimensions are as they should be: 21 samples as healthy, 27 samples as 1 month, 10 samples as 6 months, and 28 samples as acute.

Hnc <- order(apply(H1nb[,-1],1,sd))
m1nc <- order(apply(m1nb[,-1],1,sd))
m6nc <- order(apply(m6nb[,-1],1,sd))
acnc <- order(apply(acutenb[,-1],1,sd))

ac_sd1 <- acnc[length(acnc)]
H_sd1 <- Hnc[length(Hnc)]
m1_sd1 <- m1nc[length(m1nc)]
m6_sd1 <- m6nc[length(m6nc)]

stdNormUno <- c(H_sd1,ac_sd1,m1_sd1,m6_sd1)

stand2_std <- stand2[-stdNormUno,]
dim(stand2);dim(stand2_std)

## [1] 86 34

## [1] 82 34
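
Positions returned by `order()` on a class subset refer to rows of that subset, not to rows of the full data frame, so carrying row names instead of positions is a safer way to drop samples per class. The sketch below shows one way to do that on toy data; `toy` and `top_sd_names` are hypothetical names, and the layout (class column first, numeric gene columns after) is assumed to mirror `stand2`.

```r
set.seed(2)
# Toy stand-in: class column first, then numeric gene columns.
toy <- data.frame(class = rep(c("healthy", "acute", "1 month"), each = 4),
                  matrix(rnorm(12 * 5), nrow = 12))
row.names(toy) <- paste0("s", 1:12)

# Hypothetical helper: names of the k highest-sd samples within each class.
top_sd_names <- function(df, k = 1, class_col = "class") {
  genes <- setdiff(names(df), class_col)
  # Per-sample standard deviation across genes, named by row name.
  row_sd <- apply(as.matrix(df[, genes, drop = FALSE]), 1, sd)
  by_class <- split(row_sd, df[[class_col]])
  unlist(lapply(by_class, function(s) {
    names(sort(s, decreasing = TRUE))[seq_len(k)]
  }), use.names = FALSE)
}

drop_names <- top_sd_names(toy, k = 1)
toy_trim   <- toy[!(row.names(toy) %in% drop_names), ]

# Each class should have lost exactly k samples.
table(toy$class) - table(toy_trim$class)
```

Comparing the class counts before and after, as in the last line, is a quick check that the intended number of samples left each class.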

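We removed four samples, now let’s split the data and test the classification model we train on it. Note that the indices in stdNormUno are positions within each class subset, but they were used to drop rows of stand2, so the four rows removed are not the most deviated sample of each class. The class counts below add up to 27 for 1 month, 10 for 6 months, 27 for acute, and 18 for healthy, which means three healthy samples and one acute sample were dropped.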

set.seed(1234)
s <-sample(1:82,.7*82)
trainingNormUno <- stand2_std[s,]
testingNormUno <- stand2_std[-s,]
dim(trainingNormUno);dim(testingNormUno)

## [1] 57 34

## [1] 25 34

Let’s see the class counts of each set.

trainCounts <- trainingNormUno %>% group_by(class) %>% count(class)
trainCounts

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month     18
## 2 6 months     7
## 3 acute       19
## 4 healthy     13

testCounts <- testingNormUno %>% group_by(class) %>% count(class)
testCounts

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month      9
## 2 6 months     3
## 3 acute        8
## 4 healthy      5

There seems to be a fair distribution of samples in each set. Let’s see how our model will classify after removing these four samples.

set.seed(589647)
rfModNormUno <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainingNormUno),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFNormUno <- predict(rfModNormUno, testingNormUno)

predDFNormUno <- data.frame(predRFNormUno, type=testingNormUno$class)
predDFNormUno

##    predRFNormUno     type
## 1        1 month  healthy
## 2        1 month  healthy
## 3        1 month  healthy
## 4          acute  healthy
## 5        healthy  healthy
## 6          acute    acute
## 7        1 month    acute
## 8          acute    acute
## 9          acute    acute
## 10       1 month    acute
## 11         acute    acute
## 12         acute    acute
## 13         acute    acute
## 14       1 month  1 month
## 15       1 month  1 month
## 16       1 month  1 month
## 17       1 month  1 month
## 18         acute  1 month
## 19       healthy  1 month
## 20       1 month  1 month
## 21         acute  1 month
## 22       healthy  1 month
## 23      6 months 6 months
## 24      6 months 6 months
## 25      6 months 6 months

pra_NormUno <- precisionRecallAccuracy(predDFNormUno)

## accuracy is:  60 %

pra_NormUno

##      class         precision            recall accuracy
## 1  healthy 0.333333333333333               0.2     0.76
## 2    acute 0.666666666666667              0.75      0.8
## 3  1 month               0.5 0.555555555555556     0.64
## 4 6 months                 1                 1        1

The accuracy jumped up to 60% from the previous overall accuracy of 50%, an increase of 10 percentage points after removing four samples. The 6 months class scored 100% for accuracy, precision, and recall, though that is based on only three 6 month samples in the testing set. So based on these genes we may be able to detect whether or not a blood sample has had 6 months of antibiotic treatment for Lyme disease.

Let’s not remove any more of the samples from the 6 month class, since we already scored 100% on it. But we should definitely remove some from the other three classes. Let’s remove the top 3 samples from each of those other classes and test the prediction accuracy.

ac_sd1b <- acnc[(length(acnc)-2):length(acnc)]
H_sd1b <- Hnc[(length(Hnc)-2):length(Hnc)]
m1_sd1b <- m1nc[(length(m1nc)-2):length(m1nc)]

m6_sd1b <- m6nc[length(m6nc)]

stdNormUnob <- c(H_sd1b,ac_sd1b,m1_sd1b,m6_sd1b)

stand2_stdb <- stand2[-stdNormUnob,]
dim(stand2);dim(stand2_stdb)

## [1] 86 34

## [1] 76 34

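We removed ten samples, and now let’s split the data and test the classification model we train on it. Note that the indices in stdNormUnob are positions within each class subset, but they were used to drop rows of stand2, so the rows removed are not the three most deviated samples of each class. The class counts below add up to 27 for 1 month, 10 for 6 months, 26 for acute, and 13 for healthy, which means eight healthy samples and two acute samples were dropped.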

set.seed(1234)
sb <-sample(1:76,.7*76)
trainingNormUnob <- stand2_stdb[sb,]
testingNormUnob <- stand2_stdb[-sb,]
dim(trainingNormUnob);dim(testingNormUnob)

## [1] 53 34

## [1] 23 34

Let’s see the class counts of each set.

trainCountsb <- trainingNormUnob %>% group_by(class) %>% count(class)
trainCountsb

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month     20
## 2 6 months     8
## 3 acute       17
## 4 healthy      8

testCountsb <- testingNormUnob %>% group_by(class) %>% count(class)
testCountsb

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month      7
## 2 6 months     2
## 3 acute        9
## 4 healthy      5

There seems to be a fair distribution of samples in each set, although the healthy class is down to 8 training samples. Let’s see how our model will classify after removing these ten samples.

set.seed(589647)
rfModNormUnob <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainingNormUnob),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFNormUnob <- predict(rfModNormUnob, testingNormUnob)

predDFNormUnob <- data.frame(predRFNormUnob, type=testingNormUnob$class)
predDFNormUnob

##    predRFNormUnob     type
## 1         1 month  healthy
## 2         1 month  healthy
## 3           acute  healthy
## 4           acute  healthy
## 5         1 month  healthy
## 6        6 months    acute
## 7         1 month    acute
## 8         1 month    acute
## 9           acute    acute
## 10        1 month    acute
## 11        1 month    acute
## 12        1 month    acute
## 13          acute    acute
## 14        1 month    acute
## 15          acute  1 month
## 16        1 month  1 month
## 17        1 month  1 month
## 18          acute  1 month
## 19        1 month  1 month
## 20        1 month  1 month
## 21        1 month  1 month
## 22        1 month 6 months
## 23        1 month 6 months

pra_NormUnob <- precisionRecallAccuracy(predDFNormUnob)

## accuracy is:  30.4347826086957 %

pra_NormUnob

##      class         precision            recall          accuracy
## 1  healthy                 0                 0 0.782608695652174
## 2    acute 0.333333333333333 0.222222222222222 0.521739130434783
## 3  1 month            0.3125 0.714285714285714 0.434782608695652
## 4 6 months                 0                 0 0.869565217391304

The accuracy of 30% was worse than keeping all samples (50%) and worse than removing only the single most deviated sample (60%). With one fewer 6 months sample in the testing set, that class dropped from 100% to 0% in precision and recall. It still scored 87% class accuracy, because 20 of the 23 test samples were correctly left out of that class; both true 6 months samples were missed and one acute sample was wrongly labeled 6 months. The healthy class also received 0% precision and recall. There was now much less data for the model to train on, so let's change the 70-30 split to approximately 95% training and 5% testing and see how it does.

set.seed(1234)
sc <-sample(1:76,.95*76)
trainingNormUnoc <- stand2_stdb[sc,]
testingNormUnoc <- stand2_stdb[-sc,]
dim(trainingNormUnoc);dim(testingNormUnoc)

## [1] 72 34

## [1]  4 34

Let's see the class counts in each set.

trainCountsc <- trainingNormUnoc %>% group_by(class) %>% count(class)
trainCountsc

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class        n
##   <chr>    <int>
## 1 1 month     26
## 2 6 months    10
## 3 acute       24
## 4 healthy     12

testCountsc <- testingNormUnoc %>% group_by(class) %>% count(class)
testCountsc

## # A tibble: 3 x 2
## # Groups:   class [3]
##   class       n
##   <chr>   <int>
## 1 1 month     1
## 2 acute       2
## 3 healthy     1

The 6 months class is missing from the testing set, so let's move one 6 months sample over from the training set and give the training set our extra acute sample in exchange.

dim(trainingNormUnoc)

## [1] 72 34

dim(testingNormUnoc)

## [1]  4 34

s6 <- grep('6month',row.names(trainingNormUnoc))[1]
a6 <- grep('acute',row.names(testingNormUnoc))[2]

S6 <- trainingNormUnoc[s6,]
A6 <- testingNormUnoc[a6,]

testingNormUnoc2 <- testingNormUnoc[-a6,]
testingNormUnoc3 <- rbind(testingNormUnoc2,S6)

trainingNormUnoc2 <- trainingNormUnoc[-s6,]
trainingNormUnoc3 <- rbind(trainingNormUnoc2,A6)

dim(trainingNormUnoc3)

## [1] 72 34

dim(testingNormUnoc3)

## [1]  4 34
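
The manual swap above could also be avoided with a stratified split that holds out a fixed number of samples per class. Below is a minimal base R sketch of that idea; the helper name stratified_test_rows and the toy label vector are hypothetical, and only the class sizes (27, 10, 26, 13) are taken from the counts printed above. The caret function createDataPartition() offers a similar stratified split.

```r
# Hypothetical helper: pick n_test row indices from every class for testing.
stratified_test_rows <- function(classes, n_test = 1, seed = 1234) {
  set.seed(seed)
  idx_by_class <- split(seq_along(classes), classes)
  picked <- lapply(idx_by_class, function(idx) {
    # index into idx so a class with a single row is handled correctly
    idx[sample.int(length(idx), min(n_test, length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy labels with the class sizes of the 76-sample data set
toy_class <- rep(c("1 month", "6 months", "acute", "healthy"),
                 times = c(27, 10, 26, 13))

test_rows  <- stratified_test_rows(toy_class, n_test = 1)
train_rows <- setdiff(seq_along(toy_class), test_rows)

table(toy_class[test_rows])  # one sample of each class
```

This keeps the 72/4 sizes while guaranteeing that every class appears in the testing set.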

testingNormUnoc3

##                         class       CABP1      POU3F2        CTXN3      CYP7B1
## healthyControl_15     healthy -0.07589889  0.23626685 -0.055404983  0.14204478
## acuteLymeDisease_13     acute -0.34953380 -0.14276838 -0.010607242 -0.19537401
## Antibodies_1month_4   1 month -0.00003390 -0.03967643  0.004391431  0.01038003
## Antibodies_6months_4 6 months -0.26161885  0.14507556  0.370101364  0.02928352
##                            CENPF       PEX26       ISG20      CLEC2L
## healthyControl_15    -0.19094265  0.20897907 -0.51682330  0.01077294
## acuteLymeDisease_13   0.23698175 -0.18220122 -0.08197308 -0.47302198
## Antibodies_1month_4  -0.07768702  0.39442974 -0.14839697 -0.00003390
## Antibodies_6months_4 -0.05396521  0.06561916 -0.90647270  1.08304500
##                         TMEM194A      PDZRN3     NUDT18         DLG3     IGFALS
## healthyControl_15    -0.10291028 -0.01527524  0.3245788 -0.195066205 -0.1421464
## acuteLymeDisease_13   0.14031196 -0.07497390  0.2230201  0.240245728 -0.3925624
## Antibodies_1month_4  -0.03302749  0.08099906 -0.6770477 -0.002440358 -0.2245874
## Antibodies_6months_4  0.60594010  0.09496323 -0.2761650 -0.269157226  0.2567589
##                           SLC1A1         F2        OTOS        ENO1       GATC
## healthyControl_15     0.10206497 -0.0289009  0.30036592 -0.51248899  0.7165318
## acuteLymeDisease_13   0.08817935  0.2687805 -0.14195108  0.46049946  0.2304306
## Antibodies_1month_4  -0.15825224  0.6607375  0.79377030  0.07921139 -0.4782224
## Antibodies_6months_4  0.22317195  0.4966492  0.08883715 -0.94455562 -0.8335578
##                         FAM162A       PSMF1       HECW1     MAP2K7   LOC400657
## healthyControl_15    -0.0756954 -0.41451812  0.59047568 -0.1142301  0.31817222
## acuteLymeDisease_13   0.6112867  0.14354845 -0.20705736  0.0973073 -0.03576565
## Antibodies_1month_4   0.7709609  0.09603031  0.26225078 -0.1698379  0.21037936
## Antibodies_6months_4 -0.6535579 -0.19522234  0.02570832 -0.5512561  0.03413296
##                           PRR24     OR52A4       RGPD3       FRS3        HPGD
## healthyControl_15     0.1925106  0.3949838 -0.07778883 -0.2036598 -0.31984014
## acuteLymeDisease_13   0.5176373 -0.2754617 -0.01048160  0.1514721  0.26282352
## Antibodies_1month_4  -0.2431202  0.4273884 -0.32242823  0.1591983  0.14059359
## Antibodies_6months_4  0.3079028  0.2469349  0.33410000  0.2525678 -0.04983094
##                           RNF168      KCNJ16      ESYT1      POU4F2    KHDRBS3
## healthyControl_15     0.84589290  0.00178826  0.3702583  0.08467209  0.4332647
## acuteLymeDisease_13  -0.09541321 -0.09380269 -0.5263796 -0.27526760 -0.2171967
## Antibodies_1month_4  -0.72275350 -0.03493690 -0.1543922  0.10672879  0.1636499
## Antibodies_6months_4 -0.78001500  0.70712567 -0.8132048  0.35219288  0.3668466

tail(trainingNormUnoc3)

##                         class        CABP1     POU3F2         CTXN3      CYP7B1
## acuteLymeDisease_22     acute  0.198859930  0.5636461  2.113796e-01 -0.01237798
## Antibodies_6months_3 6 months  0.005481005  0.2960713  1.457065e-01  0.03596401
## Antibodies_1month_19  1 month  0.077300310  0.1283374  8.339159e-02 -0.07159638
## healthyControl_1      healthy  0.000033900 -0.0000310 -2.068333e-06  0.00007770
## Antibodies_1month_25  1 month -0.061007260 -0.3139806 -1.648021e-01  0.03461719
## acuteLymeDisease_16     acute  0.146200180 -0.2785492 -8.703899e-02 -0.03114438
##                             CENPF         PEX26       ISG20     CLEC2L
## acuteLymeDisease_22   0.190623643  2.077085e-01  0.52346134  0.2109024
## Antibodies_6months_3 -0.189381601  1.805775e-01 -0.08145428  0.3529954
## Antibodies_1month_19 -0.021549701 -1.338840e-03  1.34653470  0.6721482
## healthyControl_1      0.000020504  8.380425e-05 -0.00000525  0.0000339
## Antibodies_1month_25 -0.038229700  1.857385e-01 -0.11857653 -0.2541506
## acuteLymeDisease_16   0.122709752 -1.675950e-01  0.25008965 -0.3953524
##                           TMEM194A        PDZRN3       NUDT18         DLG3
## acuteLymeDisease_22  -0.2123386123  2.818368e-01  0.122541904 -0.613058094
## Antibodies_6months_3 -0.0632990217  2.783562e-01 -0.234975340 -0.181450794
## Antibodies_1month_19  0.2203699767 -7.045595e-02  0.775892260 -0.292146015
## healthyControl_1     -0.0002138617  5.150233e-05  0.001091003 -0.000759415
## Antibodies_1month_25  0.1204493800 -2.187032e-01  0.396648880  0.619860268
## acuteLymeDisease_16   0.3153448920 -3.467973e-02 -0.036186695  0.211573980
##                           IGFALS       SLC1A1            F2         OTOS
## acuteLymeDisease_22  -0.08303142 -0.129606130  0.0308383307  0.055172443
## Antibodies_6months_3  0.87783310  0.247302785  0.2235975280 -0.392873050
## Antibodies_1month_19 -0.13888740  0.222857237 -0.0168207747 -0.006484032
## healthyControl_1     -0.00034300  0.000299095  0.0001342283  0.000207000
## Antibodies_1month_25 -0.51494503  0.089521049 -0.3120245933 -0.032780170
## acuteLymeDisease_16  -0.12205100 -0.127218849 -0.1225695383 -0.419476750
##                               ENO1        GATC       FAM162A         PSMF1
## acuteLymeDisease_22   0.1172231033 -0.42436218  0.2262662630 -1.502608e-01
## Antibodies_6months_3 -0.0127711307  0.05392551 -0.2357237500 -3.833545e-01
## Antibodies_1month_19  0.8786134567  0.69354630 -0.4382941600  3.992631e-01
## healthyControl_1     -0.0001921633 -0.00025600  0.0000765325 -8.066483e-05
## Antibodies_1month_25  0.1702928577  0.05868816  0.1851600450  1.426514e-02
## acuteLymeDisease_16   0.7285270767  0.06770325 -0.1586052250  5.073412e-02
##                             HECW1        MAP2K7  LOC400657       PRR24
## acuteLymeDisease_22   0.211710450 -0.3484804180 -0.1357880 -0.08162641
## Antibodies_6months_3  0.305571195  0.6139164520 -0.1460991  0.87414217
## Antibodies_1month_19  0.059431908 -0.0480062960 -0.1668129  0.06050730
## healthyControl_1     -0.000307797 -0.0002498642 -0.0006510 -0.00050500
## Antibodies_1month_25 -0.065520765  0.1364680820 -0.2043467 -0.13330030
## acuteLymeDisease_16  -0.396578185  0.2168077000  0.4075024  0.04976654
##                           OR52A4        RGPD3         FRS3          HPGD
## acuteLymeDisease_22  -0.19158888  0.025556326  0.324386120  0.0438663946
## Antibodies_6months_3 -0.07754993 -0.176086900  0.321854600  0.1335166987
## Antibodies_1month_19  0.12671065  0.006590605 -0.211962220 -0.1516500700
## healthyControl_1     -0.00027200 -0.000246000 -0.000140000  0.0000453905
## Antibodies_1month_25  0.28631163  0.009201050 -0.007893086 -0.1850745963
## acuteLymeDisease_16   0.16221523 -0.239856960 -0.139697070  0.0678496650
##                          RNF168      KCNJ16       ESYT1       POU4F2
## acuteLymeDisease_22   0.1008706  0.04425120 -0.33963203  0.058635474
## Antibodies_6months_3 -0.2761946  0.58044040 -0.17030620 -0.219005940
## Antibodies_1month_19 -0.2180100  0.41718853 -0.07835484  0.028754355
## healthyControl_1     -0.0002780  0.00022185 -0.00055800 -0.000424625
## Antibodies_1month_25  0.3044357 -0.07687879  0.07615852 -0.252467275
## acuteLymeDisease_16   0.2950745  0.08341658  0.02792740 -0.271346800
##                            KHDRBS3
## acuteLymeDisease_22   0.2132141600
## Antibodies_6months_3  0.3479696500
## Antibodies_1month_19  0.1818720115
## healthyControl_1     -0.0002194645
## Antibodies_1month_25 -0.0755381600
## acuteLymeDisease_16  -0.0995574015

Now we have at least one sample of each class in our testing set. This model will train on the data with the three samples most deviated from the mean removed from every class except the 6 months class, which has only its single most deviated sample removed.

set.seed(589647)
rfModNormUnoc3 <- train(class~., method='rf', 
               na.action=na.pass,
               data=(trainingNormUnoc3),  preProc = c("center", "scale","medianImpute"),
               trControl=trainControl(method='oob'), number=5)

predRFNormUnoc3 <- predict(rfModNormUnoc3, testingNormUnoc3)

predDFNormUnoc3 <- data.frame(predRFNormUnoc3, type=testingNormUnoc3$class)
predDFNormUnoc3

##   predRFNormUnoc3     type
## 1         healthy  healthy
## 2           acute    acute
## 3        6 months  1 month
## 4        6 months 6 months

pra_NormUnoc3 <- precisionRecallAccuracy(predDFNormUnoc3)

## accuracy is:  75 %

pra_NormUnoc3

##      class precision recall accuracy
## 1  healthy         1      1        1
## 2    acute         1      1        1
## 3  1 month         0      0     0.75
## 4 6 months       0.5      1     0.75
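
To make the per-class figures in this table easier to check by hand, here is a minimal sketch that recomputes them one class at a time from the four predictions printed above. The helper per_class_metrics is hypothetical and is not the precisionRecallAccuracy() function used in this script; reporting 0 when a class is never predicted is an assumption chosen to match the table.

```r
# Hypothetical helper: one-vs-rest precision, recall, and accuracy per class.
per_class_metrics <- function(pred, truth) {
  # assumption: a 0/0 ratio is reported as 0, as in the table above
  safe_div <- function(a, b) if (b == 0) 0 else a / b
  rows <- lapply(unique(truth), function(cl) {
    tp <- sum(pred == cl & truth == cl)  # correctly labeled cl
    fp <- sum(pred == cl & truth != cl)  # wrongly labeled cl
    fn <- sum(pred != cl & truth == cl)  # missed cl samples
    tn <- sum(pred != cl & truth != cl)  # correctly left out of cl
    data.frame(class = cl,
               precision = safe_div(tp, tp + fp),
               recall    = safe_div(tp, tp + fn),
               accuracy  = (tp + tn) / length(truth))
  })
  do.call(rbind, rows)
}

# The four test predictions printed above
truth <- c("healthy", "acute", "1 month", "6 months")
pred  <- c("healthy", "acute", "6 months", "6 months")

metrics <- per_class_metrics(pred, truth)
overall_accuracy <- mean(pred == truth)
metrics
```

For the 6 months class this gives one true positive and one false positive, so precision is 1/2 while recall is 1/1.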

The overall accuracy is 75%, because one sample, the 1 month sample, was misclassified as 6 months. That led to a precision of 50% on the 6 months class even though the true 6 months sample was predicted correctly (recall of 100%). With only four test samples, each prediction moves the overall accuracy by 25 percentage points. Earlier, when we removed only the single most deviated sample from every class, the 6 months class scored 100% on all three measures. So removing the additional 6 samples, another 2 from each of the other three classes, hurt the model's precision on the 6 months class. It did, however, raise the healthy and acute classes to 100% for precision, recall, and accuracy, while the 1 month sample was not identified and was misclassified as 6 months.

We could be more selective and keep only those samples within three standard deviations of the mean, then rerun our algorithms and see how accurate the model is. This is what a LinkedIn Learning course on recommender systems in Python says to do when preprocessing and training classification models, in that case for sentiment analysis. That course is recommended by Frank Kane, who referred in it to his YouTube channel under SunDog, which I have not visited.

Let's use a little imagination here and say how this affects the model in practice. When someone gives blood at a site and the gene values fall outside the range selected for the model, the sample would throw an error. The patient would be told the sample came out flawed and needs to be done again, and would be given a questionnaire about medications, taking or not taking a certain vitamin, drinking more or less water, not eating 12-24 hours beforehand, etc. If the next sample is within the range of values, it can be run through the model to predict whether the person has Lyme disease in the acute phase or is healthy. That is because this model scored 100% in precision, recall, and accuracy for the healthy and acute classes, but erred on the classes where the patient had started antibiotics 1 month and, separately, 6 months earlier. So these genes are also a good set to use for Lyme disease pathogenesis, just like our other run that excluded the most deviated samples from the training data.
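
The three-standard-deviation idea and the "out of range" check on a new blood sample can be sketched as follows. This is only an illustration under stated assumptions: the helper names fit_gene_ranges and in_range are hypothetical, the toy gene columns g1 and g2 are made up, and the ranges are computed per gene from the training data alone.

```r
# Hypothetical helper: per-gene lower/upper bounds at mean +/- k standard deviations.
fit_gene_ranges <- function(train, gene_cols, k = 3) {
  m  <- as.matrix(train[, gene_cols, drop = FALSE])
  mu <- colMeans(m, na.rm = TRUE)
  s  <- apply(m, 2, sd, na.rm = TRUE)
  list(lower = mu - k * s, upper = mu + k * s)
}

# Hypothetical helper: TRUE for samples whose genes all fall inside the bounds.
in_range <- function(samples, ranges) {
  genes <- names(ranges$lower)
  m <- as.matrix(samples[, genes, drop = FALSE])
  outside <- sweep(m, 2, ranges$lower, "<") | sweep(m, 2, ranges$upper, ">")
  rowSums(outside, na.rm = TRUE) == 0
}

# Toy training data (made up) and two new samples
toy_train <- data.frame(g1 = seq(-1, 1, length.out = 21),
                        g2 = seq(-2, 2, length.out = 21))
ranges <- fit_gene_ranges(toy_train, c("g1", "g2"))

new_samples <- data.frame(g1 = c(0.2, 5), g2 = c(-0.5, 0.1))
ok <- in_range(new_samples, ranges)
ok  # the second sample has g1 far outside the training range
```

A sample flagged FALSE here would be the one sent back for a redraw rather than passed to the classifier.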

Body System Genes and Added Genes

This next section looks at the denormalized data set of 40,000+ genes without the duplicates removed, and with the sample alias names instead of GSM IDs, to look at the body system genes that we explored with the COVID-19 study GSE152418. We will use the lymeMX2-denormalized-easynames.csv file we created earlier.

systemsDF <- read.csv('lymeMX2-denormalized-easynames.csv',sep=',',
                      header=T, na.strings=c('',' ','NA'))

Let's go ahead and source our script file, because I am starting this section of the script after clearing out my objects and closing out the previous section's data. The script holding our functions is geneCards2.R.

source('geneCards2.R')

For our body systems, let's look at the lymphatic, integumentary, nervous, circulatory, musculature, endocrine, bone structure, and reproductive systems. We will first get the top 3 ranked genes of some select genes in those systems. I also want to look up tetanus (entered in the searches below as 'tetanis'), because the tetanus booster is something we are all supposed to have every 10 years. We will also look up alcohol and dopamine genes, genes for OTC drugs like ibuprofen, aspirin, Tylenol, and NSAIDs, and cannabidiol genes for the toxic and non-toxic genes related to marijuana. For lymphatic, let's just enter lymphatic; for integumentary, we should also use some epithelial genes. Let's just see what it pulls up on the systems by what we type for them.

find25genes('integumentary')
find25genes('nervous')

getProteinGenes('integumentary')
getProteinGenes('nervous')

integumentary <- read.csv('Top25integumentarys.csv')
nervous <- read.csv('Top25nervouss.csv')

for (i in integumentary$proteinType){
  getSummaries2(i,'integumentary')
}
for (i in nervous$proteinType){
  getSummaries2(i,'nervous')
}

getGeneSummaries('integumentary')
getGeneSummaries('nervous')

integumentarySumms <- read.csv('proteinGeneSummaries_integumentary.csv')
nervousSumms <- read.csv('proteinGeneSummaries_nervous.csv')

Let's now look at the epithelial system, which includes the skin and the lining of the organs and membranes.

find25genes('epithelial')

getProteinGenes('epithelial')

epithelial <- read.csv("Top25epithelials.csv")

for (i in epithelial$proteinType){
  getSummaries2(i,'epithelial')
}

getGeneSummaries('epithelial')

epithelialSumms <- read.csv("proteinGeneSummaries_epithelial.csv")

find25genes('lymphatic')

getProteinGenes('lymphatic')

lymphatic <- read.csv("Top25lymphatics.csv")

for (i in lymphatic$proteinType){
  getSummaries2(i,'lymphatic')
}

getGeneSummaries('lymphatic')

lymphaticSumms <- read.csv("proteinGeneSummaries_lymphatic.csv")

find25genes('circulatory')

getProteinGenes('circulatory')

circulatory <- read.csv("Top25circulatorys.csv")

for (i in circulatory$proteinType){
  getSummaries2(i, 'circulatory')
}

getGeneSummaries('circulatory')

circulatorySumms <- read.csv("proteinGeneSummaries_circulatory.csv")

find25genes('musculature')

getProteinGenes('musculature')

musculature <- read.csv("Top25musculatures.csv")

for (i in musculature$proteinType){
  getSummaries2(i,'musculature')
}

getGeneSummaries('musculature')

musculatureSumms <- read.csv("proteinGeneSummaries_musculature.csv")

find25genes('endocrine')

getProteinGenes('endocrine')

endocrine <- read.csv("Top25endocrines.csv")

for (i in endocrine$proteinType){
  getSummaries2(i,'endocrine')
}

getGeneSummaries('endocrine')

endocrineSumms <- read.csv("proteinGeneSummaries_endocrine.csv")

find25genes('bone structure')

getProteinGenes('bone structure')

boneStructure <- read.csv("Top25bone-structures.csv")

for (i in boneStructure$proteinType){
  getSummaries2(i,'bone structure')
}

getGeneSummaries('bone structure')

boneStructureSumms <- read.csv("proteinGeneSummaries_bone-structure.csv")

find25genes('reproductive')

getProteinGenes('reproductive')

reproductive <- read.csv("Top25reproductives.csv")

for (i in reproductive$proteinType){
  getSummaries2(i,'reproductive')
}

getGeneSummaries('reproductive')

reproductiveSumms <- read.csv("proteinGeneSummaries_reproductive.csv")

find25genes('tetanis')

getProteinGenes('tetanis')

tetanis <- read.csv("Top25tetaniss.csv")

for (i in tetanis$proteinType){
  getSummaries2(i,'tetanis')
}

getGeneSummaries('tetanis')

tetanisSumms <- read.csv("proteinGeneSummaries_tetanis.csv")

find25genes('alcohol')

getProteinGenes('alcohol')

alcohol <- read.csv("Top25alcohols.csv")

for (i in alcohol$proteinType){
  getSummaries2(i,'alcohol')
}

getGeneSummaries('alcohol')

alcoholSumms <- read.csv("proteinGeneSummaries_alcohol.csv")

find25genes('dopamine')

getProteinGenes('dopamine')

dopamine <- read.csv("Top25dopamines.csv")

for (i in dopamine$proteinType){
  getSummaries2(i, 'dopamine')
}

getGeneSummaries('dopamine')

dopamineSumms <- read.csv("proteinGeneSummaries_dopamine.csv")

find25genes('ibuprofen')

getProteinGenes('ibuprofen')

ibuprofen <- read.csv("Top25ibuprofens.csv")

for (i in ibuprofen$proteinType){
  getSummaries2(i,'ibuprofen')
}

getGeneSummaries('ibuprofen')

ibuprofenSumms <- read.csv("proteinGeneSummaries_ibuprofen.csv")

find25genes('aspirin')

getProteinGenes('aspirin')

aspirin <- read.csv("Top25aspirins.csv")

for (i in aspirin$proteinType){
  getSummaries2(i,'aspirin')
}

getGeneSummaries('aspirin')

aspirinSumms <- read.csv("proteinGeneSummaries_aspirin.csv")

find25genes('tylenol')

getProteinGenes('tylenol')

tylenol <- read.csv("Top25tylenols.csv")

for (i in tylenol$proteinType){
  getSummaries2(i,'tylenol')
}

getGeneSummaries('tylenol')

tylenolSumms <- read.csv("proteinGeneSummaries_tylenol.csv")

find25genes('NSAIDs')

getProteinGenes('NSAIDs')

nsaid <- read.csv("Top25nsaidss.csv")

for (i in nsaid$proteinType){
  getSummaries2(i,'NSAIDs')
}

getGeneSummaries('NSAIDs')

NSAID_summs <- read.csv("proteinGeneSummaries_nsaids.csv")

find25genes('cannabidiol')

getProteinGenes('cannabidiol')

cannabidiol <- read.csv("Top25cannabidiols.csv")

for (i in cannabidiol$proteinType){
  getSummaries2(i,'cannabidiol')
}

getGeneSummaries('cannabidiol')

cannabidiolSumms <- read.csv("proteinGeneSummaries_cannabidiol.csv")
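
The blocks above repeat the same five steps for every search term, and the file names follow a visible pattern: the term is lowercased, spaces become hyphens, and an 's' is appended for the Top25 file. As a minimal sketch, the hypothetical helpers below (term_slug, top25_file, summaries_file) rebuild those names; the pattern is inferred from the file names used above, not taken from geneCards2.R itself.

```r
# Hypothetical helpers that rebuild the file names used above.
term_slug <- function(term) tolower(gsub(" ", "-", term))
top25_file <- function(term) paste0("Top25", term_slug(term), "s.csv")
summaries_file <- function(term) {
  paste0("proteinGeneSummaries_", term_slug(term), ".csv")
}

terms <- c("lymphatic", "bone structure", "NSAIDs")
file_table <- data.frame(term = terms,
                         top25 = top25_file(terms),
                         summaries = summaries_file(terms))
file_table

# With geneCards2.R sourced, one pass per term could then look like this.
# It is left commented out because it depends on the author's functions:
# for (term in terms) {
#   find25genes(term)
#   getProteinGenes(term)
#   top25 <- read.csv(top25_file(term))
#   for (g in top25$proteinType) getSummaries2(g, term)
#   getGeneSummaries(term)
# }
```

Looping this way would shorten the script, at the cost of holding the results in a list instead of one named object per term.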

Let's combine the gene summaries from these 25-gene data sets into one data set.

allSystemSumms <- rbind(lymphaticSumms, integumentarySumms,
                        circulatorySumms, musculatureSumms,
                        endocrineSumms,boneStructureSumms,
                        reproductiveSumms,tetanisSumms,
                        alcoholSumms,ibuprofenSumms,
                        aspirinSumms,tylenolSumms,
                        NSAID_summs,cannabidiolSumms)

Let's also combine just the top 3 genes of each body system into a separate data set.

allSystemSummsFirst3 <- rbind(lymphaticSumms[1:3,], integumentarySumms[1:3,],
                              nervousSumms[1:3,],
                        circulatorySumms[1:3,], musculatureSumms[1:3,],
                        endocrineSumms[1:3,],boneStructureSumms[1:3,],
                        reproductiveSumms[1:3,],tetanisSumms[1:3,],
                        alcoholSumms[1:3,],ibuprofenSumms[1:3,],
                        aspirinSumms[1:3,],tylenolSumms[1:3,],
                        NSAID_summs[1:3,],cannabidiolSumms[1:3,])
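
The same top 3 stacking can be written once over a list of data frames. A minimal sketch, assuming every summary data frame has the same columns; the helper top_n_combined and the toy data frames toyA and toyB are hypothetical stand-ins for the summary data sets above.

```r
# Hypothetical helper: keep the first n rows of each data frame and stack them.
top_n_combined <- function(summ_list, n = 3) {
  firsts <- lapply(summ_list, head, n)
  out <- do.call(rbind, firsts)
  rownames(out) <- NULL
  out
}

# Toy stand-ins (made up) for two of the summary data sets
toyA <- data.frame(gene = paste0("A", 1:5), proteinSearched = "lymphatic")
toyB <- data.frame(gene = paste0("B", 1:4), proteinSearched = "nervous")

first3 <- top_n_combined(list(toyA, toyB))
first3
```

Adding or dropping a body system then only means changing the list passed in.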

Let's also use the vitamin, mineral, and hormonal genes from our analysis of the COVID-19 study GSE152418.

Let's not stop at the sun genes. As a massage therapist with more than 14 years of experience who recently studied for and passed the MBLEx, or Massage and Bodywork Licensing Examination, I can tell you there are many fascinating aspects of the body systems, and mineral and vitamin dependencies that lead to disease in some people. After relearning the endocrine system and the hormones related to the pineal, hypothalamus, pituitary, adrenals, thyroid, and pancreas, I think many other vitamins, steroids, and hormones should be looked at in studying these different cases of COVID-19.

We will look at Vitamin C along with Vitamin D, which helps the body absorb calcium for the bones and blood; the pancreas hormones glucagon, which raises blood glucose by releasing stored glycogen, and insulin, which lowers glucose in the blood; dopamine, which relates to Parkinson's disease when the brain doesn't produce enough; melatonin, which regulates sleep and is produced by the pineal gland near the pituitary and hypothalamus in the brain; estrogen, prolactin, and progesterone, regulated by the pituitary gland in females; testosterone, produced by males in their testes; and corticosteroids and adrenaline, released by the adrenals during the sympathetic response to danger in the body. Also, the supplements that people are commonly told to take in addition to Vitamin C and Vitamin D, such as fish oil or omega 3s, vitamin B12, zinc, and the mineral magnesium. Also, calcitonin, a thyroid hormone that lowers calcium levels in the blood, which matters for kidney stones and other health problems.

find25genes('vitamin D')
find25genes('melanin')
find25genes('vitamin C')
find25genes('glucose')
find25genes('insulin')
find25genes('glucagon')
find25genes('dopamine')
find25genes('estrogen')
find25genes('progesterone')
find25genes('prolactin')
find25genes('testosterone')
find25genes('calcium')
find25genes('melatonin')
find25genes('vitamin B12')
find25genes('zinc')
find25genes('magnesium')
find25genes('fish oil')
find25genes('omega 3s')
find25genes('adrenaline')
find25genes('corticosteroids')
find25genes('calcitonine')
find25genes('iron')

getProteinGenes('vitamin D')
getProteinGenes('melanin')
getProteinGenes('vitamin C')
getProteinGenes('glucose')
getProteinGenes('insulin')
getProteinGenes('glucagon')
getProteinGenes('dopamine')
getProteinGenes('estrogen')
getProteinGenes('progesterone')
getProteinGenes('prolactin')
getProteinGenes('testosterone')
getProteinGenes('calcium')
getProteinGenes('melatonin')
getProteinGenes('vitamin B12')
getProteinGenes('zinc')
getProteinGenes('magnesium')
getProteinGenes('fish oil')
getProteinGenes('omega 3s')
getProteinGenes('adrenaline')
getProteinGenes('corticosteroids')
getProteinGenes('calcitonine')
getProteinGenes('iron')

vitD <- read.csv('Top25vitamin-ds.csv')
melanin <- read.csv('Top25melanins.csv')
vitC <- read.csv('Top25vitamin-cs.csv')
glucose <- read.csv('Top25glucoses.csv')
insulin <- read.csv('Top25insulins.csv')
glucagon <- read.csv('Top25glucagons.csv')
dopamine <- read.csv('Top25dopamines.csv')
estrogen <- read.csv('Top25estrogens.csv')
progesterone <- read.csv('Top25progesterones.csv')
prolactin <- read.csv('Top25prolactins.csv')
testosterone <- read.csv('Top25testosterones.csv')
calcium <- read.csv('Top25calciums.csv')
melatonin <- read.csv('Top25melatonins.csv')
vitB12 <- read.csv('Top25vitamin-b12s.csv')
zinc <- read.csv('Top25zincs.csv')
magnesium <- read.csv('Top25magnesiums.csv')
fishOil <- read.csv('Top25fish-oils.csv')
omega3s <- read.csv('Top25omega-3ss.csv')
adrenaline <- read.csv('Top25adrenalines.csv')
corticosteroid <- read.csv('Top25corticosteroidss.csv')
calcitonine <- read.csv('Top25calcitonines.csv')
iron <- read.csv('Top25irons.csv')

Let's take only the top 3 genes from each data frame of mineral, vitamin, or steroid.

vitMinSter <- rbind(vitD[1:3,1:2],melanin[1:3,1:2],
                    vitC[1:3,1:2],glucose[1:3,1:2],
                    insulin[1:3,1:2],glucagon[1:3,1:2],
                    dopamine[1:3,1:2],estrogen[1:3,1:2],
                    progesterone[1:3,1:2],prolactin[1:3,1:2],
                    testosterone[1:3,1:2],calcium[1:3,1:2],
                    calcitonine[1:3,1:2],melatonin[1:3,1:2],
                    vitB12[1:3,1:2],zinc[1:3,1:2],magnesium[1:3,1:2],
                    fishOil[1:3,1:2],omega3s[1:3,1:2],
                    adrenaline[1:3,1:2],iron[1:3,1:2],
                    corticosteroid[1:3,1:2])
head(vitMinSter)

##   proteinType proteinSearched
## 1         VDR       vitamin-d
## 2     CYP27B1       vitamin-d
## 3        PHEX       vitamin-d
## 4         TYR         melanin
## 5       TYRP1         melanin
## 6        OCA2         melanin

Some of the genes associated with one vitamin are also associated with another. We will keep them this way for the visualizations and charting. We could make a link analysis with the genes that are shared across vitamins and minerals; if we don't, then you should.
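
As a starting point for that link analysis, the sketch below pulls out the genes that appear under more than one search term and returns them as a gene-to-term edge list. The column names proteinType and proteinSearched are the ones in vitMinSter; the helper shared_genes and the toy gene and term names are hypothetical placeholders, not results from this data.

```r
# Hypothetical helper: gene/term pairs for genes found under more than one term.
shared_genes <- function(df) {
  pairs <- unique(df[, c("proteinType", "proteinSearched")])
  n_terms <- table(pairs$proteinType)     # number of distinct terms per gene
  shared <- names(n_terms)[n_terms > 1]
  pairs[pairs$proteinType %in% shared, ]
}

# Toy data with placeholder names (made up)
toy <- data.frame(
  proteinType     = c("GENE_A", "GENE_B", "GENE_C", "GENE_A", "GENE_D"),
  proteinSearched = c("term-1", "term-1", "term-1", "term-2", "term-2")
)

edges <- shared_genes(toy)
edges  # GENE_A links term-1 and term-2
```

Each row of the result is one edge, so it could be passed to a graph or network plotting package of your choice.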

Let's now get the gene summaries of these genes.

for (i in vitMinSter$proteinType){
  getSummaries2(i,'protein')
}

getGeneSummaries('protein')

vitMinSterSumms <- read.csv("proteinGeneSummaries_protein.csv")

vitMinSterSumms2 <- vitMinSterSumms[,c(2:7)]
head(vitMinSterSumms2)

##      gene       EnsemblID
## 1     VDR ENSG00000111424
## 2 CYP27B1 ENSG00000111012
## 3    PHEX ENSG00000102174
## 4     TYR ENSG00000077498
## 5   TYRP1 ENSG00000107165
## 6    OCA2 ENSG00000104044
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
## 2                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 3                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 4                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 5                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 6                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 3                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n                         
##                 todaysDate
## 1 Thu Sep 03 14:11:07 2020
## 2 Thu Sep 03 14:11:09 2020
## 3 Thu Sep 03 14:11:12 2020
## 4 Thu Sep 03 14:11:13 2020
## 5 Thu Sep 03 14:11:15 2020
## 6 Thu Sep 03 14:11:16 2020

Merge the last two data frames on the gene symbol so that each gene's summaries are paired with the term that was searched to find it.

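# Join the searched terms to the gene summaries; the gene symbol is stored
# as 'proteinType' in vitMinSter and as 'gene' in vitMinSterSumms2.
vitamins <- merge(vitMinSter, vitMinSterSumms2,
                  by.x = 'proteinType',
                  by.y = 'gene')
# Drop exact duplicate rows, then rename the key column to 'gene'.
vitamins2 <- vitamins[!duplicated(vitamins), ]
colnames(vitamins2)[1] <- 'gene'
head(vitamins2)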

##      gene proteinSearched       EnsemblID
## 1   AANAT       melatonin ENSG00000129673
## 2    ANKH         calcium ENSG00000154122
## 3   APOA1        fish-oil ENSG00000118137
## 4    APOB        fish-oil ENSG00000084674
## 5 CACNA1B        omega-3s ENSG00000148408
## 6   CALCA     calcitonine ENSG00000110680
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   EntrezSummary
## 1                                                                                                                                                                                   The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes a multipass transmembrane protein that is expressed in joints and other tissues and controls pyrophosphate levels in cultured cells. Progressive ankylosis-mediated control of pyrophosphate levels has been suggested as a possible mechanism regulating tissue calcification and susceptibility to arthritis in higher animals. Mutations in this gene have been associated with autosomal dominant craniometaphyseal dysplasia. [provided by RefSeq, Jul 2008]
## 3                                                                                                                     This gene encodes apolipoprotein A-I, which is the major protein component of high density lipoprotein (HDL) in plasma. The encoded preproprotein is proteolytically processed to generate the mature protein, which promotes cholesterol efflux from tissues to the liver for excretion, and is a cofactor for lecithin cholesterolacyltransferase (LCAT), an enzyme responsible for the formation of most plasma cholesteryl esters. This gene is closely linked with two other apolipoprotein genes on chromosome 11. Defects in this gene are associated with HDL deficiencies, including Tangier disease, and with systemic non-neuropathic amyloidosis. Alternative splicing results in multiple transcript variants, at least one of which encodes a preproprotein. [provided by RefSeq, Dec 2015]
## 4 This gene product is the main apolipoprotein of chylomicrons and low density lipoproteins (LDL), and is the ligand for the LDL receptor. It occurs in plasma as two main isoforms, apoB-48 and apoB-100: the former is synthesized exclusively in the gut and the latter in the liver. The intestinal and the hepatic forms of apoB are encoded by a single gene from a single, very long mRNA. The two isoforms share a common N-terminal sequence. The shorter apoB-48 protein is produced after RNA editing of the apoB-100 transcript at residue 2180 (CAA->UAA), resulting in the creation of a stop codon, and early translation termination. Mutations in this gene or its regulatory region cause hypobetalipoproteinemia, normotriglyceridemic hypobetalipoproteinemia, and hypercholesterolemia due to ligand-defective apoB, diseases affecting plasma cholesterol and apoB levels. [provided by RefSeq, Dec 2019]
## 5                                                                                                                                                                                                                                                                                                                                                                                                    The protein encoded by this gene is the pore-forming subunit of an N-type voltage-dependent calcium channel, which controls neurotransmitter release from neurons. The encoded protein forms a complex with alpha-2, beta, and delta subunits to form the high-voltage activated channel. This channel is sensitive to omega-conotoxin-GVIA and omega-agatoxin-IIIA but insensitive to dihydropyridines. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Aug 2011]
## 6                                                                                                                                                                                                                                                                                                                                                  This gene encodes the peptide hormones calcitonin, calcitonin gene-related peptide and katacalcin by tissue-specific alternative RNA splicing of the gene transcripts and cleavage of inactive precursor proteins. Calcitonin is involved in calcium regulation and acts to regulate phosphorus metabolism. Calcitonin gene-related peptide functions as a vasodilator and as an antimicrobial peptide while katacalcin is a calcium-lowering peptide. Multiple transcript variants encoding different isoforms have been found for this gene.[provided by RefSeq, Aug 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                GeneCardsSummary
## 1                                                                                                                                                                                                                    AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ANKH (ANKH Inorganic Pyrophosphate Transport Regulator) is a Protein Coding gene.                                            Diseases associated with ANKH include Craniometaphyseal Dysplasia, Autosomal Dominant and Chondrocalcinosis 2.                                            Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Miscellaneous transport and binding events.                                            Gene Ontology (GO) annotations related to this gene include inorganic phosphate transmembrane transporter activity and inorganic diphosphate transmembrane transporter activity.                                            
## 3                                                                                                                                                                                                             APOA1 (Apolipoprotein A1) is a Protein Coding gene.                                            Diseases associated with APOA1 include Hypoalphalipoproteinemia, Primary, 2 and Amyloidosis, Familial Visceral.                                            Among its related pathways are Lipoprotein metabolism and Folate Metabolism.                                            Gene Ontology (GO) annotations related to this gene include identical protein binding and lipid binding.                                            An important paralog of this gene is APOA4.
## 4                                                                                                                                                                                                                                                                APOB (Apolipoprotein B) is a Protein Coding gene.                                            Diseases associated with APOB include Hypobetalipoproteinemia, Familial, 1 and Hypercholesterolemia, Familial, 2.                                            Among its related pathways are Activated TLR4 signalling and Lipoprotein metabolism.                                            Gene Ontology (GO) annotations related to this gene include binding and heparin binding.                                            
## 5                                                                                    CACNA1B (Calcium Voltage-Gated Channel Subunit Alpha1 B) is a Protein Coding gene.                                            Diseases associated with CACNA1B include Neurodevelopmental Disorder With Seizures And Nonepileptic Hyperkinetic Movements and Undetermined Early-Onset Epileptic Encephalopathy.                                            Among its related pathways are Nicotine addiction and ADP signalling through P2Y purinoceptor 12.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and ion channel activity.                                            An important paralog of this gene is CACNA1A.
## 6                                                                                                                                                                                                                                             CALCA (Calcitonin Related Polypeptide Alpha) is a Protein Coding gene.                                            Diseases associated with CALCA include Reflex Sympathetic Dystrophy and Spinal Stenosis.                                            Among its related pathways are Neuroscience and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include identical protein binding.                                            An important paralog of this gene is CALCB.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Regulates intra- and extracellular levels of inorganic pyrophosphate (PPi), probably functioning as PPi transporter.\n                         ANKH_HUMAN,Q9HCJ1\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                   Participates in the reverse transport of cholesterol from tissues to the liver for excretion by promoting cholesterol efflux from tissues and by acting as a cofactor for the lecithin cholesterol acyltransferase (LCAT). As part of the SPAP complex, activates spermatozoa motility.\n                         APOA1_HUMAN,P02647\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                        Apolipoprotein B is a major protein constituent of chylomicrons (apo B-48), LDL (apo B-100) and VLDL (apo B-100). Apo B-100 functions as a recognition signal for the cellular binding and internalization of LDL particles by the apoB/E receptor.\n                         APOB_HUMAN,P04114\n                         
## 5 Voltage-sensitive calcium channels (VSCC) mediate the entry of calcium ions into excitable cells and are also involved in a variety of calcium-dependent processes, including muscle contraction, hormone or neurotransmitter release, gene expression, cell motility, cell division and cell death. The isoform alpha-1B gives rise to N-type calcium currents. N-type calcium channels belong to the 'high-voltage activated' (HVA) group and are specifically blocked by omega-conotoxin-GVIA (AC P01522) (AC P01522) (By similarity). They are however insensitive to dihydropyridines (DHP). Calcium channels containing alpha-1B subunit may play a role in directed migration of immature neurons.\n                         CAC1B_HUMAN,Q00975\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                              CGRP induces vasodilation. It dilates a variety of vessels including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.\n                         CALCA_HUMAN,P06881\n                         
##                 todaysDate
## 1 Thu Sep 03 14:12:01 2020
## 2 Thu Sep 03 14:11:52 2020
## 3 Thu Sep 03 14:12:14 2020
## 4 Thu Sep 03 14:12:16 2020
## 5 Thu Sep 03 14:12:20 2020
## 6 Thu Sep 03 14:11:54 2020
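
The next step stacks the two annotation tables with rbind(), which only works when both data frames have the same column names. Here is a minimal sketch of that check, using two made-up data frames (sysToy and vitToy are hypothetical stand-ins, not objects from this analysis):

```r
# Toy stand-ins for the two annotation tables (hypothetical data)
sysToy <- data.frame(proteinSearched = c("lymphatic", "lymphatic"),
                     gene = c("FLT4", "VEGFC"),
                     stringsAsFactors = FALSE)
vitToy <- data.frame(proteinSearched = "melatonin",
                     gene = "AANAT",
                     stringsAsFactors = FALSE)

# rbind() on data frames matches columns by name, so confirm the names agree first
stopifnot(setequal(names(sysToy), names(vitToy)))

# Stack the rows of both tables into one data frame
stackedToy <- rbind(sysToy, vitToy)
nrow(stackedToy)
```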

# Stack the two annotation tables into one data frame (column names must match)
allwithVits <- rbind(allSystemSumms, vitamins2)
head(allwithVits)

##   proteinSearched   gene       EnsemblID
## 1       lymphatic   FLT4 ENSG00000037280
## 2       lymphatic  VEGFC ENSG00000150630
## 3       lymphatic  LYVE1 ENSG00000133800
## 4       lymphatic  SOX18 ENSG00000203883
## 5       lymphatic PIK3CA ENSG00000121879
## 6       lymphatic  CCBE1 ENSG00000183287
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            EntrezSummary
## 1                                                                                                                                                                                                                                                      This gene encodes a tyrosine kinase receptor for vascular endothelial growth factors C and D. The protein is thought to be involved in lymphangiogenesis and maintenance of the lymphatic endothelium. Mutations in this gene cause hereditary lymphedema type IA. [provided by RefSeq, Jul 2008]
## 2                                                                                                                         The protein encoded by this gene is a member of the platelet-derived growth factor/vascular endothelial growth factor (PDGF/VEGF) family. The encoded protein promotes angiogenesis and endothelial cell growth, and can also affect the permeability of blood vessels. The proprotein is further cleaved into a fully processed form that can bind and activate VEGFR-2 and VEGFR-3 receptors. [provided by RefSeq, Apr 2014]
## 3                                                                                                                                                                                                                                                                This gene encodes a type I integral membrane glycoprotein. The encoded protein acts as a receptor and binds to both soluble and immobilized hyaluronan. This protein may function in lymphatic hyaluronan transport and have a role in tumor metastasis. [provided by RefSeq, Jul 2008]
## 4 This gene encodes a member of the SOX (SRY-related HMG-box) family of transcription factors involved in the regulation of embryonic development and in the determination of the cell fate. The encoded protein may act as a transcriptional regulator after forming a protein complex with other proteins. This protein plays a role in hair, blood vessel, and lymphatic vessel development. Mutations in this gene have been associated with recessive and dominant forms of hypotrichosis-lymphedema-telangiectasia. [provided by RefSeq, Jul 2008]
## 5                                                                                                                    Phosphatidylinositol 3-kinase is composed of an 85 kDa regulatory subunit and a 110 kDa catalytic subunit. The protein encoded by this gene represents the catalytic subunit, which uses ATP to phosphorylate PtdIns, PtdIns4P and PtdIns(4,5)P2. This gene has been found to be oncogenic and has been implicated in cervical cancers. A pseudogene of this gene has been defined on chromosome 22. [provided by RefSeq, Apr 2016]
## 6                                                                                                                         This gene is thought to function in extracellular matrix remodeling and migration. It is predominantly expressed in the ovary, but down regulated in ovarian cancer cell lines and primary carcinomas, suggesting its role as a tumour suppressor. Mutations in this gene have been associated with Hennekam lymphangiectasia-lymphedema syndrome, a generalized lymphatic dysplasia in humans. [provided by RefSeq, Mar 2010]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               GeneCardsSummary
## 1                                                                                                                               FLT4 (Fms Related Receptor Tyrosine Kinase 4) is a Protein Coding gene.                                            Diseases associated with FLT4 include Lymphatic Malformation 1 and Congenital Heart Defects, Multiple Types, 7.                                            Among its related pathways are Signaling by GPCR and NF-KappaB Family Pathway.                                            Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein tyrosine kinase activity.                                            An important paralog of this gene is KDR.
## 2                                                                                                                                                                           VEGFC (Vascular Endothelial Growth Factor C) is a Protein Coding gene.                                            Diseases associated with VEGFC include Lymphatic Malformation 4 and Hereditary Lymphedema Id.                                            Among its related pathways are HIF1Alpha Pathway and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include growth factor activity and vascular endothelial growth factor receptor 3 binding.                                            An important paralog of this gene is VEGFD.
## 3                                                                                                                                                                           LYVE1 (Lymphatic Vessel Endothelial Hyaluronan Receptor 1) is a Protein Coding gene.                                            Diseases associated with LYVE1 include Intramuscular Hemangioma and Middle Cerebral Artery Infarction.                                            Among its related pathways are Cell adhesion_Cell-matrix glycoconjugates and Glycosaminoglycan metabolism.                                            Gene Ontology (GO) annotations related to this gene include hyaluronic acid binding.                                            An important paralog of this gene is CD44.
## 4                                                                                                                                            SOX18 (SRY-Box Transcription Factor 18) is a Protein Coding gene.                                            Diseases associated with SOX18 include Hypotrichosis-Lymphedema-Telangiectasia-Renal Defect Syndrome and Hypotrichosis-Lymphedema-Telangiectasia Syndrome.                                            Among its related pathways are ERK Signaling.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and protein heterodimerization activity.                                            An important paralog of this gene is SOX17.
## 5                                             PIK3CA (Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha) is a Protein Coding gene.                                            Diseases associated with PIK3CA include Hepatocellular Carcinoma and Megalencephaly-Capillary Malformation-Polymicrogyria Syndrome.                                            Among its related pathways are GDNF-Family Ligands and Receptor Interactions and RET signaling.                                            Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein serine/threonine kinase activity.                                            An important paralog of this gene is PIK3CB.
## 6                                                                                                                                                                                                                                                                                                               CCBE1 (Collagen And Calcium Binding EGF Domains 1) is a Protein Coding gene.                                            Diseases associated with CCBE1 include Hennekam Lymphangiectasia-Lymphedema Syndrome 1 and Hennekam Syndrome.                                                                                        Gene Ontology (GO) annotations related to this gene include calcium ion binding and collagen binding.                                            
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         UniProtKB_Summary
## 1                                                                              Tyrosine-protein kinase that acts as a cell-surface receptor for VEGFC and VEGFD, and plays an essential role in adult lymphangiogenesis and in the development of the vascular network and the cardiovascular system during embryonic development. Promotes proliferation, survival and migration of endothelial cells, and regulates angiogenic sprouting. Signaling by activated FLT4 leads to enhanced production of VEGFC, and to a lesser degree VEGFA, thereby creating a positive feedback loop that enhances FLT4 signaling. Modulates KDR signaling by forming heterodimers. The secreted isoform 3 may function as a decoy receptor for VEGFC and/or VEGFD and play an important role as a negative regulator of VEGFC-mediated lymphangiogenesis and angiogenesis. Binding of vascular growth factors to isoform 1 or isoform 2 leads to the activation of several signaling cascades; isoform 2 seems to be less efficient in signal transduction, because it has a truncated C-terminus and therefore lacks several phosphorylation sites. Mediates activation of the MAPK1/ERK2, MAPK3/ERK1 signaling pathway, of MAPK8 and the JUN signaling pathway, and of the AKT1 signaling pathway. Phosphorylates SHC1. Mediates phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Promotes phosphorylation of MAPK8 at 'Thr-183' and 'Tyr-185', and of AKT1 at 'Ser-473'.\n                         VGFR3_HUMAN,P35916\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Growth factor active in angiogenesis, and endothelial cell growth, stimulating their proliferation and migration and also has effects on the permeability of blood vessels. May function in angiogenesis of the venous and lymphatic vascular systems during embryogenesis, and also in the maintenance of differentiated lymphatic endothelium in adults. Binds and activates KDR/VEGFR2 and FLT4/VEGFR3 receptors.\n                         VEGFC_HUMAN,P49767\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Ligand-specific transporter trafficking between intracellular organelles (TGN) and the plasma membrane. Plays a role in autocrine regulation of cell growth mediated by growth regulators containing cell surface retention sequence binding (CRS). May act as a hyaluronan (HA) transporter, either mediating its uptake for catabolism within lymphatic endothelial cells themselves, or its transport into the lumen of afferent lymphatic vessels for subsequent re-uptake and degradation in lymph nodes.\n                         LYVE1_HUMAN,Q9Y5Y7\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Transcriptional activator that binds to the consensus sequence 5'-AACAAAG-3' in the promoter of target genes and plays an essential role in embryonic cardiovascular development and lymphangiogenesis. Activates transcription of PROX1 and other genes coding for lymphatic endothelial markers. Plays an essential role in triggering the differentiation of lymph vessels, but is not required for the maintenance of differentiated lymphatic endothelial cells. Plays an important role in postnatal angiogenesis, where it is functionally redundant with SOX17. Interaction with MEF2C enhances transcriptional activation. Besides, required for normal hair development.\n                         SOX18_HUMAN,P35713\n                         
## 5 Phosphoinositide-3-kinase (PI3K) that phosphorylates PtdIns (Phosphatidylinositol), PtdIns4P (Phosphatidylinositol 4-phosphate) and PtdIns(4,5)P2 (Phosphatidylinositol 4,5-bisphosphate) to generate phosphatidylinositol 3,4,5-trisphosphate (PIP3). PIP3 plays a key role by recruiting PH domain-containing proteins to the membrane, including AKT1 and PDPK1, activating signaling cascades involved in cell growth, survival, proliferation, motility and morphology. Participates in cellular signaling in response to various growth factors. Involved in the activation of AKT1 upon stimulation by receptor tyrosine kinases ligands such as EGF, insulin, IGF1, VEGFA and PDGF. Involved in signaling via insulin-receptor substrate (IRS) proteins. Essential in endothelial cell migration during vascular development through VEGFA signaling, possibly by regulating RhoA activity. Required for lymphatic vasculature development, possibly by binding to RAS and by activation by EGF and FGF2, but not by PDGF. Regulates invadopodia formation through the PDPK1-AKT1 pathway. Participates in cardiomyogenesis in embryonic stem cells through a AKT1 pathway. Participates in vasculogenesis in embryonic stem cells through PDK1 and protein kinase C pathway. Also has serine-protein kinase activity: phosphorylates PIK3R1 (p85alpha regulatory subunit), EIF4EBP1 and HRAS. Plays a role in the positive regulation of phagocytosis and pinocytosis (By similarity).\n                         PK3CA_HUMAN,P42336\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Required for lymphangioblast budding and angiogenic sprouting from venous endothelium during embryogenesis.\n                         CCBE1_HUMAN,Q6UXH8\n                         
##                 todaysDate
## 1 Thu Sep 03 13:49:53 2020
## 2 Thu Sep 03 13:49:56 2020
## 3 Thu Sep 03 13:49:57 2020
## 4 Thu Sep 03 13:49:58 2020
## 5 Thu Sep 03 13:49:59 2020
## 6 Thu Sep 03 13:50:00 2020

Let's merge the 40k+ genes with both sets of annotated genes combined above, joining on the gene symbol.
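
Because merge() returns one row for every matching pair of keys, a gene that appears several times in either table comes back as several rows, which may be why some genes repeat in the head() output. A small sketch with made-up data frames (annotToy, exprToy and the probe IDs are all hypothetical) shows the behavior:

```r
# Toy annotation table: one row per gene (hypothetical data)
annotToy <- data.frame(gene = c("AANAT", "ABCB1"),
                       proteinSearched = c("melatonin", "tylenol"),
                       stringsAsFactors = FALSE)

# Toy expression table: ABCB1 is measured by two probes (made-up IDs)
exprToy <- data.frame(gene = c("ABCB1", "ABCB1", "AANAT", "TP53"),
                      probe = c("p1", "p2", "p3", "p4"),
                      stringsAsFactors = FALSE)

# Default merge() is an inner join: TP53 has no annotation and is dropped,
# while ABCB1 appears once per matching probe
mergedToy <- merge(annotToy, exprToy, by = "gene")
table(mergedToy$gene)
```

If one row per gene is wanted, duplicated(mergedToy$gene) can flag the repeats after the merge.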

# Inner join on the gene symbol; a gene repeated in either table yields repeated rows
all375plus <- merge(allwithVits, systemsDF, by = 'gene')
head(all375plus)

##    gene proteinSearched       EnsemblID
## 1 AANAT       melatonin ENSG00000129673
## 2 ABCB1         tylenol ENSG00000085563
## 3 ABCB1         tylenol ENSG00000085563
## 4 ABCB1         tylenol ENSG00000085563
## 5 ABCB1         tylenol ENSG00000085563
## 6 ABCC1     cannabidiol ENSG00000103222
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              EntrezSummary
## 1                                                                                                                                                                                                                                                                                                              The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 3 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 4 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 5 The membrane-associated protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug resistance. The protein encoded by this gene is an ATP-dependent drug efflux pump for xenobiotic compounds with broad substrate specificity. It is responsible for decreased drug accumulation in multidrug-resistant cells and often mediates the development of resistance to anticancer drugs. This protein also functions as a transporter in the blood-brain barrier. Mutations in this gene are associated with colchicine resistance and Inflammatory bowel disease 13. Alternative splicing and the use of alternative promoters results in multiple transcript variants. [provided by RefSeq, Feb 2017]
## 6                                                                                                                                                                                                                                    The protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra-and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This full transporter is a member of the MRP subfamily which is involved in multi-drug resistance. This protein functions as a multispecific organic anion transporter, with oxidized glutatione, cysteinyl leukotrienes, and activated aflatoxin B1 as substrates. This protein also transports glucuronides and sulfate conjugates of steroid hormones and bile salts. Alternatively spliced variants of this gene have been described but their full-length nature is unknown. [provided by RefSeq, Apr 2012]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GeneCardsSummary
## 1                                                                                                                                                                            AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13.                                            Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances.                                            An important paralog of this gene is ABCB4.
## 3                                             ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13.                                            Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances.                                            An important paralog of this gene is ABCB4.
## 4                                             ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13.                                            Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances.                                            An important paralog of this gene is ABCB4.
## 5                                             ABCB1 (ATP Binding Cassette Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with ABCB1 include Colchicine Resistance and Inflammatory Bowel Disease 13.                                            Among its related pathways are Zidovudine Pathway, Pharmacokinetics/Pharmacodynamics and Ponatinib Pathway, Pharmacokinetics/Pharmacodynamics.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances.                                            An important paralog of this gene is ABCB4.
## 6                                                                                                 ABCC1 (ATP Binding Cassette Subfamily C Member 1) is a Protein Coding gene.                                            Diseases associated with ABCC1 include Dubin-Johnson Syndrome and Pseudoxanthoma Elasticum.                                            Among its related pathways are Arachidonic acid metabolism and Sphingolipid signaling pathway.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and ATPase activity, coupled to transmembrane movement of substances.                                            An important paralog of this gene is ABCC3.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                      Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n                         MDR1_HUMAN,P08183\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                      Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n                         MDR1_HUMAN,P08183\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                      Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n                         MDR1_HUMAN,P08183\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                      Translocates drugs and phospholipids across the membrane (PubMed:8898203, PubMed:2897240, PubMed:9038218). Catalyzes the flop of phospholipids from the cytoplasmic to the exoplasmic leaflet of the apical membrane. Participates mainly to the flop of phosphatidylcholine, phosphatidylethanolamine, beta-D-glucosylceramides and sphingomyelins (PubMed:8898203). Energy-dependent efflux pump responsible for decreased drug accumulation in multidrug-resistant cells (PubMed:2897240, PubMed:9038218).\n                         MDR1_HUMAN,P08183\n                         
## 6 Mediates export of organic anions and drugs from the cytoplasm (PubMed:7961706, PubMed:16230346, PubMed:9281595, PubMed:10064732, PubMed:11114332). Mediates ATP-dependent transport of glutathione and glutathione conjugates, leukotriene C4, estradiol-17-beta-o-glucuronide, methotrexate, antiviral drugs and other xenobiotics (PubMed:7961706, PubMed:16230346, PubMed:9281595, PubMed:10064732, PubMed:11114332). Confers resistance to anticancer drugs by decreasing accumulation of drug in cells, and by mediating ATP- and GSH-dependent drug export (PubMed:9281595). Hydrolyzes ATP with low efficiency (PubMed:16230346). Catalyzes the export of sphingosine 1-phosphate from mast cells independently of their degranulation (PubMed:17050692). Participates in inflammatory response by allowing export of leukotriene C4 from leukotriene C4-synthezing cells (By similarity).\n                         MRP1_HUMAN,P33527\n                         
##                 todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020         38.65447         43.30859         34.63334
## 2 Thu Sep 03 14:06:27 2020         41.22896         47.16127         21.53185
## 3 Thu Sep 03 14:06:27 2020         32.25172         31.97326         56.89866
## 4 Thu Sep 03 14:06:27 2020         38.24506         41.06669         28.37736
## 5 Thu Sep 03 14:06:27 2020         32.99834         39.03790         27.31621
## 6 Thu Sep 03 14:08:53 2020         28.91772         40.26726         16.87763
##   healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1         19.88367         31.57302         17.33214         69.51032
## 2         21.29092         30.00952         20.07555         33.02105
## 3         18.42561         22.80082         30.40254         41.85742
## 4         19.97834         26.84866         21.20648         49.07782
## 5         30.76515         29.44986         48.65851         44.60380
## 6         24.20228         35.25948         19.15219         26.95437
##   healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1        11.902382         29.29495          22.12536          84.22110
## 2        10.757634         18.82793          21.56646          79.84726
## 3        15.256492         16.62954          23.08130          92.62902
## 4         8.592051         19.40342          15.56581          64.05169
## 5        15.099013         21.15442          14.42682          74.99046
## 6        10.947289         25.50794          17.22340          46.21760
##   healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1          50.67037          10.37422          31.11561          23.81043
## 2          36.34932          11.90963          49.19399          17.42491
## 3          38.90237          12.18897          28.95807          16.91220
## 4          49.74689          12.66472          35.52134          20.32528
## 5          69.92664          14.20984          22.59296          15.04298
## 6          42.64672          13.63434          33.65687          18.83468
##   healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1          20.46186          41.48475          18.62474          46.30124
## 2          18.40882          49.98046          18.95894          49.55159
## 3          26.11347          32.81002          20.61510         106.03831
## 4          19.75054          34.48769          17.11111         108.70370
## 5          32.37899          33.79629          18.68831          43.80003
## 6          19.55225          40.03501          21.74411          40.66432
##   healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1          45.55125          15.28551           71.80305           21.75414
## 2          36.03069          31.41368           38.05233           21.02175
## 3          38.48303          21.02176           45.82799           20.52045
## 4          45.94058          22.14821           48.30754           25.54595
## 5          35.39077          53.24209           77.74697           43.00508
## 6          30.89479          19.71880           46.64285           21.31349
##   acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1           49.93791           20.91372           60.35461           29.56083
## 2           39.44675           21.38210           45.26092           41.05704
## 3           44.08240           18.79489           44.00047           34.33294
## 4           51.66513           26.43767           40.24261           34.65819
## 5           30.09120           22.49047           22.04645           35.70981
## 6           35.81335           17.20735           37.86800           40.76434
##   acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1          115.26505           12.74919           14.58672            58.68188
## 2          817.18666           13.66956           14.09339            48.10711
## 3          505.81829           21.29795           13.70197            49.31229
## 4          132.25512           13.35872           11.50697            41.51591
## 5          106.28833           34.43221           15.08469            45.77373
## 6           60.77827           17.23546           18.66173            71.35843
##   acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1            33.32225            23.25981            9.588795
## 2            40.30350            19.15355            7.888114
## 3            41.56268            21.22284            8.447833
## 4            40.41479            20.12256            7.350215
## 5            39.50821            29.78886           15.063060
## 6            53.69678            27.22513            8.894936
##   acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1            20.86345            29.27038            43.14832
## 2            25.21845            19.74167            29.28886
## 3            25.32556            21.71775            41.87143
## 4            25.56478            24.96515            47.03945
## 5            69.40036            17.89554            69.75110
## 6            24.18120            24.62062            30.00687
##   acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1            14.57474            28.45002            15.69350
## 2            11.87161            24.10970            16.67370
## 3            14.17475            22.54048            15.10654
## 4            10.21484            31.30573            20.25107
## 5            13.02415            48.07647            14.64475
## 6            13.35633            30.66082            17.37290
##   acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1            21.95486            45.46997            28.11229
## 2            15.69664            57.35794            51.84340
## 3            15.44018            60.99973            34.66314
## 4            20.28179            44.40500            38.41623
## 5            14.21193            71.96156            42.69591
## 6            25.51457            44.92045            34.12562
##   acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1            34.01966            36.43714           11.098142
## 2            25.09195            38.82158           10.132455
## 3            32.02725            28.75331            9.661386
## 4            22.30251            24.74467           11.006607
## 5            15.44757            17.43424           19.974906
## 6            23.61546            21.99793           15.807990
##   acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1            20.95721            18.77368            22.93936
## 2            30.10835            20.40635            26.60660
## 3            23.14832            27.20347            29.15688
## 4            21.33121            26.27650            20.61601
## 5            26.04407            37.98274            43.11017
## 6            35.08638            18.89033            27.50780
##   Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1            15.51901            14.49904            17.05954
## 2            23.35448            12.29871            17.04148
## 3            21.76531            20.58139            13.23527
## 4            18.64363            31.02243            18.31642
## 5            17.44667            15.86079            19.19898
## 6            18.17025            16.46885            15.18037
##   Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1            15.50139            22.02124            59.06156
## 2            25.62144            16.42297            29.43557
## 3            18.78380            18.75276            33.07018
## 4            15.65744            22.42357            38.36952
## 5            12.03262            19.64487            51.37644
## 6            13.20924            26.51673            55.85404
##   Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1            20.16316            36.54447            24.36854
## 2            20.60197            39.56771            27.53398
## 3            15.76464            60.34087            29.26406
## 4            22.36988            29.06728            23.79171
## 5            34.21045            28.25430            26.92548
## 6            14.08537            46.30949            32.91737
##   Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1             16.58916             71.64483             43.23948
## 2             17.04288             80.96806             72.89021
## 3             18.46279             66.27261             56.96983
## 4             21.88265             60.36095             54.82444
## 5             20.85923            102.61530             55.42687
## 6             25.50856             51.15237             35.64776
##   Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1             20.09754             33.01413             8.609319
## 2             25.08455             39.58836             7.249021
## 3             19.66783             38.49537             9.142852
## 4             21.90548             38.52790            10.809951
## 5             20.54839             28.82423            24.285132
## 6             19.48506             33.63112            10.733931
##   Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1             31.82035             27.06322            22.189114
## 2             43.15264             27.76661            13.287931
## 3             43.18348             33.37388            15.686958
## 4             40.81558             60.06926            13.834432
## 5             50.33261             32.45963             9.912929
## 6             45.11738             29.03249            20.524764
##   Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1             43.05822             35.09371             34.37424
## 2             47.68777             37.90771             50.50895
## 3             52.03077             30.40575             30.13461
## 4             47.55413             28.13396             30.64936
## 5             39.33369             22.83271             43.71672
## 6             40.20511             33.37773             33.46759
##   Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1             32.27040             14.20441             23.45584
## 2             29.96935             23.32879             14.40262
## 3             38.58699             14.45487             51.44689
## 4             61.40487             14.97484             30.95148
## 5             55.44185             18.42424             30.75718
## 6             48.85811             15.79943             43.76389
##   Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1             12.04484             33.61438             25.23156
## 2             19.42912             26.26096             16.56160
## 3             13.82278             24.13657             13.58578
## 4             14.95064             33.09392             21.99314
## 5             13.06079             21.28961             25.59947
## 6             13.23880             34.70037             19.66687
##   Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1             23.81457             17.47219             51.39457
## 2             46.38081             18.70302             55.60095
## 3             28.62397             16.77057             65.63707
## 4             16.89601             12.10923             42.76211
## 5             17.95221             14.40513             31.01147
## 6             21.63674             13.62361             32.41461
##   Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1             32.73614             29.56239             15.20396
## 2             78.31794             48.37211             26.58536
## 3             43.12171             39.48976             21.41928
## 4             29.90365             40.81556             15.23238
## 5             24.59404             68.37386             11.99703
## 6             36.70282             50.10856             30.27561
##   Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1             16.99264             28.08311             39.58991
## 2             33.42081             37.71527             31.17144
## 3             32.91563             23.94722             30.45324
## 4             20.52140             24.96915             28.02090
## 5             13.33228             24.88406             23.71542
## 6             27.45445             22.96609             30.74962
##   Antibodies_6months_10
## 1              39.65745
## 2              32.94535
## 3              29.12041
## 4              27.83572
## 5              20.28640
## 6              31.10116
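
The next step row-binds two summary tables with `rbind()`, which for data frames expects both inputs to carry the same column names. A minimal sketch of a guard for that step follows. It assumes nothing about the real `allSystemSummsFirst3` or `vitamins2` objects: the toy tables, their values, and the helper name `safe_rbind` are all hypothetical, and only the column names mirror those printed in this report.

```r
# Toy stand-ins for the two summary tables (hypothetical values; only the
# column names mirror those printed in this report).
systems_demo <- data.frame(
  proteinSearched = c("lymphatic", "integumentary"),
  gene = c("FLT4", "FLG"),
  EnsemblID = c("ENSG00000037280", "ENSG00000143631"),
  stringsAsFactors = FALSE
)
vitamins_demo <- data.frame(
  gene = "ABCB1",
  proteinSearched = "vitamin",   # hypothetical label
  EnsemblID = NA_character_,     # left missing rather than guessed
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): stop early if the column sets differ,
# then bind with the second table reordered to the first table's columns.
safe_rbind <- function(a, b) {
  missing_in_b <- setdiff(names(a), names(b))
  missing_in_a <- setdiff(names(b), names(a))
  if (length(missing_in_b) > 0 || length(missing_in_a) > 0) {
    stop("Column mismatch: ",
         paste(c(missing_in_b, missing_in_a), collapse = ", "))
  }
  rbind(a, b[, names(a), drop = FALSE])
}

combined_demo <- safe_rbind(systems_demo, vitamins_demo)
dim(combined_demo)
```

The early `stop()` is there so a mismatch names the offending columns up front, which is easier to act on when the tables are as wide as the ones in this analysis.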

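# Row-bind the first three body-system summaries with the vitamins table;
# rbind() needs both data frames to have the same column names.
all3withVits <- rbind(allSystemSummsFirst3, vitamins2)
head(all3withVits)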

##   proteinSearched   gene       EnsemblID
## 1       lymphatic   FLT4 ENSG00000037280
## 2       lymphatic  VEGFC ENSG00000150630
## 3       lymphatic  LYVE1 ENSG00000133800
## 4   integumentary    FLG ENSG00000143631
## 5   integumentary    KIT ENSG00000157404
## 6   integumentary COL7A1 ENSG00000114270
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes a tyrosine kinase receptor for vascular endothelial growth factors C and D. The protein is thought to be involved in lymphangiogenesis and maintenance of the lymphatic endothelium. Mutations in this gene cause hereditary lymphedema type IA. [provided by RefSeq, Jul 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The protein encoded by this gene is a member of the platelet-derived growth factor/vascular endothelial growth factor (PDGF/VEGF) family. The encoded protein promotes angiogenesis and endothelial cell growth, and can also affect the permeability of blood vessels. The proprotein is further cleaved into a fully processed form that can bind and activate VEGFR-2 and VEGFR-3 receptors. [provided by RefSeq, Apr 2014]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     This gene encodes a type I integral membrane glycoprotein. The encoded protein acts as a receptor and binds to both soluble and immobilized hyaluronan. This protein may function in lymphatic hyaluronan transport and have a role in tumor metastasis. [provided by RefSeq, Jul 2008]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is an intermediate filament-associated protein that aggregates keratin intermediate filaments in mammalian epidermis. It is initially synthesized as a polyprotein precursor, profilaggrin (consisting of multiple filaggrin units of 324 aa each), which is localized in keratohyalin granules, and is subsequently proteolytically processed into individual functional filaggrin molecules. Mutations in this gene are associated with ichthyosis vulgaris.[provided by RefSeq, Dec 2009]
## 5 This gene encodes a receptor tyrosine kinase. This gene was initially identified as a homolog of the feline sarcoma viral oncogene v-kit and is often referred to as proto-oncogene c-Kit. The canonical form of this glycosylated transmembrane protein has an N-terminal extracellular region with five immunoglobulin-like domains, a transmembrane region, and an intracellular tyrosine kinase domain at the C-terminus. Upon activation by its cytokine ligand, stem cell factor (SCF), this protein phosphorylates multiple intracellular proteins that play a role in in the proliferation, differentiation, migration and apoptosis of many cell types and thereby plays an important role in hematopoiesis, stem cell maintenance, gametogenesis, melanogenesis, and in mast cell development, migration and function. This protein can be a membrane-bound or soluble protein. Mutations in this gene are associated with gastrointestinal stromal tumors, mast cell disease, acute myelogenous leukemia, and piebaldism. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, May 2020]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          This gene encodes the alpha chain of type VII collagen. The type VII collagen fibril, composed of three identical alpha collagen chains, is restricted to the basement zone beneath stratified squamous epithelia. It functions as an anchoring fibril between the external epithelia and the underlying stroma. Mutations in this gene are associated with all forms of dystrophic epidermolysis bullosa. In the absence of mutations, however, an acquired form of this disease can result from an autoimmune response made to type VII collagen. [provided by RefSeq, Jul 2008]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GeneCardsSummary
## 1                                             FLT4 (Fms Related Receptor Tyrosine Kinase 4) is a Protein Coding gene.                                            Diseases associated with FLT4 include Lymphatic Malformation 1 and Congenital Heart Defects, Multiple Types, 7.                                            Among its related pathways are Signaling by GPCR and NF-KappaB Family Pathway.                                            Gene Ontology (GO) annotations related to this gene include transferase activity, transferring phosphorus-containing groups and protein tyrosine kinase activity.                                            An important paralog of this gene is KDR.
## 2                                                                                         VEGFC (Vascular Endothelial Growth Factor C) is a Protein Coding gene.                                            Diseases associated with VEGFC include Lymphatic Malformation 4 and Hereditary Lymphedema Id.                                            Among its related pathways are HIF1Alpha Pathway and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include growth factor activity and vascular endothelial growth factor receptor 3 binding.                                            An important paralog of this gene is VEGFD.
## 3                                                                                         LYVE1 (Lymphatic Vessel Endothelial Hyaluronan Receptor 1) is a Protein Coding gene.                                            Diseases associated with LYVE1 include Intramuscular Hemangioma and Middle Cerebral Artery Infarction.                                            Among its related pathways are Cell adhesion_Cell-matrix glycoconjugates and Glycosaminoglycan metabolism.                                            Gene Ontology (GO) annotations related to this gene include hyaluronic acid binding.                                            An important paralog of this gene is CD44.
## 4                                                                                                                                                            FLG (Filaggrin) is a Protein Coding gene.                                            Diseases associated with FLG include Dermatitis, Atopic, 2 and Ichthyosis Vulgaris.                                            Among its related pathways are Keratinization and Developmental Biology.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and structural molecule activity.                                            An important paralog of this gene is HRNR.
## 5                                                                                                                 KIT (KIT Proto-Oncogene, Receptor Tyrosine Kinase) is a Protein Coding gene.                                            Diseases associated with KIT include Gastrointestinal Stromal Tumor and Piebald Trait.                                            Among its related pathways are RET signaling and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and protein kinase activity.                                            An important paralog of this gene is CSF1R.
## 6                                                           COL7A1 (Collagen Type VII Alpha 1 Chain) is a Protein Coding gene.                                            Diseases associated with COL7A1 include Epidermolysis Bullosa Pruriginosa and Transient Bullous Dermolysis Of The Newborn.                                            Among its related pathways are Integrin Pathway and Collagen chain trimerization.                                            Gene Ontology (GO) annotations related to this gene include identical protein binding and serine-type endopeptidase inhibitor activity.                                            An important paralog of this gene is COL2A1.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            UniProtKB_Summary
## 1 Tyrosine-protein kinase that acts as a cell-surface receptor for VEGFC and VEGFD, and plays an essential role in adult lymphangiogenesis and in the development of the vascular network and the cardiovascular system during embryonic development. Promotes proliferation, survival and migration of endothelial cells, and regulates angiogenic sprouting. Signaling by activated FLT4 leads to enhanced production of VEGFC, and to a lesser degree VEGFA, thereby creating a positive feedback loop that enhances FLT4 signaling. Modulates KDR signaling by forming heterodimers. The secreted isoform 3 may function as a decoy receptor for VEGFC and/or VEGFD and play an important role as a negative regulator of VEGFC-mediated lymphangiogenesis and angiogenesis. Binding of vascular growth factors to isoform 1 or isoform 2 leads to the activation of several signaling cascades; isoform 2 seems to be less efficient in signal transduction, because it has a truncated C-terminus and therefore lacks several phosphorylation sites. Mediates activation of the MAPK1/ERK2, MAPK3/ERK1 signaling pathway, of MAPK8 and the JUN signaling pathway, and of the AKT1 signaling pathway. Phosphorylates SHC1. Mediates phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Promotes phosphorylation of MAPK8 at 'Thr-183' and 'Tyr-185', and of AKT1 at 'Ser-473'.\n                         VGFR3_HUMAN,P35916\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Growth factor active in angiogenesis, and endothelial cell growth, stimulating their proliferation and migration and also has effects on the permeability of blood vessels. May function in angiogenesis of the venous and lymphatic vascular systems during embryogenesis, and also in the maintenance of differentiated lymphatic endothelium in adults. Binds and activates KDR/VEGFR2 and FLT4/VEGFR3 receptors.\n                         VEGFC_HUMAN,P49767\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Ligand-specific transporter trafficking between intracellular organelles (TGN) and the plasma membrane. Plays a role in autocrine regulation of cell growth mediated by growth regulators containing cell surface retention sequence binding (CRS). May act as a hyaluronan (HA) transporter, either mediating its uptake for catabolism within lymphatic endothelial cells themselves, or its transport into the lumen of afferent lymphatic vessels for subsequent re-uptake and degradation in lymph nodes.\n                         LYVE1_HUMAN,Q9Y5Y7\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Aggregates keratin intermediate filaments and promotes disulfide-bond formation among the intermediate filaments during terminal differentiation of mammalian epidermis.\n                         FILA_HUMAN,P20930\n                         
## 5                                                                                                          Tyrosine-protein kinase that acts as cell-surface receptor for the cytokine KITLG/SCF and plays an essential role in the regulation of cell survival and proliferation, hematopoiesis, stem cell maintenance, gametogenesis, mast cell development, migration and function, and in melanogenesis. In response to KITLG/SCF binding, KIT can activate several signaling pathways. Phosphorylates PIK3R1, PLCG1, SH2B2/APS and CBL. Activates the AKT1 signaling pathway by phosphorylation of PIK3R1, the regulatory subunit of phosphatidylinositol 3-kinase. Activated KIT also transmits signals via GRB2 and activation of RAS, RAF1 and the MAP kinases MAPK1/ERK2 and/or MAPK3/ERK1. Promotes activation of STAT family members STAT1, STAT3, STAT5A and STAT5B. Activation of PLCG1 leads to the production of the cellular signaling molecules diacylglycerol and inositol 1,4,5-trisphosphate. KIT signaling is modulated by protein phosphatases, and by rapid internalization and degradation of the receptor. Activated KIT promotes phosphorylation of the protein phosphatases PTPN6/SHP-1 and PTPRU, and of the transcription factors STAT1, STAT3, STAT5A and STAT5B. Promotes phosphorylation of PIK3R1, CBL, CRK (isoform Crk-II), LYN, MAPK1/ERK2 and/or MAPK3/ERK1, PLCG1, SRC and SHC1.\n                         KIT_HUMAN,P10721\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Stratified squamous epithelial basement membrane protein that forms anchoring fibrils which may contribute to epithelial basement membrane organization and adherence by interacting with extracellular matrix (ECM) proteins such as type IV collagen.\n                         CO7A1_HUMAN,Q02388\n                         
##                 todaysDate
## 1 Thu Sep 03 13:49:53 2020
## 2 Thu Sep 03 13:49:56 2020
## 3 Thu Sep 03 13:49:57 2020
## 4 Thu Sep 03 13:45:05 2020
## 5 Thu Sep 03 13:45:06 2020
## 6 Thu Sep 03 13:45:07 2020

allTop3systems <- merge(all3withVits, systemsDF, by = 'gene')

The merged data has more observations than there are unique genes, because the original data holds more than one entry per gene, and every one of those entries is matched to the unique genes related to our body systems and to OTC drugs, cannabidiol, alcohol, and dopamine.
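To see why the row count grows, here is a minimal sketch with toy data. The data frames `expr` and `systems` and all of their values are hypothetical stand-ins, not the study data; only `merge()` and the `gene` key mirror the code above.

```r
# Toy stand-ins for the real tables (made-up values, not study data)
expr <- data.frame(
  gene  = c("ADH1B", "ADH1B", "ADH1B", "AANAT"),
  value = c(37.9, 33.1, 36.5, 38.7),
  stringsAsFactors = FALSE
)
systems <- data.frame(
  gene            = c("ADH1B", "AANAT", "KIT"),
  proteinSearched = c("alcohol", "melatonin", "placeholder"),
  stringsAsFactors = FALSE
)

# merge() defaults to an inner join: each row of 'systems' is repeated
# once for every row of 'expr' that shares its gene, and genes with no
# match in 'expr' are dropped
merged <- merge(expr, systems, by = "gene")

nrow(systems)       # number of unique genes in the lookup table
nrow(merged)        # number of rows after the merge
table(merged$gene)  # rows contributed by each gene
```

In this toy case the lookup table has three genes but the merge returns four rows, because the one gene with three entries is repeated three times and the unmatched gene disappears.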

dim(all375plus)

## [1] 1216   93

dim(allTop3systems)

## [1] 350  93
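The row counts from `dim()` can be compared with the number of distinct genes to see how many rows are repeats. The sketch below assumes a hypothetical helper, `gene_counts()`, and a toy table with made-up values; neither is part of the original analysis.

```r
# Hypothetical helper: report total rows and distinct genes in a table
gene_counts <- function(df, key = "gene") {
  c(rows = nrow(df), unique_genes = length(unique(df[[key]])))
}

# Toy merged table (made-up values) standing in for allTop3systems
toy <- data.frame(
  gene             = c("AANAT", "ADH1B", "ADH1B", "ADH1B"),
  healthyControl_1 = c(38.7, 37.9, 33.1, 36.5),
  stringsAsFactors = FALSE
)

gene_counts(toy)

# Genes that occupy more than one row
per_gene <- table(toy$gene)
names(per_gene)[per_gene > 1]
```

Running the same helper on `allTop3systems` would show how many of its 350 rows belong to repeated genes.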

head(allTop3systems)

##    gene proteinSearched       EnsemblID
## 1 AANAT       melatonin ENSG00000129673
## 2 ADH1B         alcohol ENSG00000196616
## 3 ADH1B         alcohol ENSG00000196616
## 4 ADH1B         alcohol ENSG00000196616
## 5 ADH1B         alcohol ENSG00000196616
## 6 ADH1B         alcohol ENSG00000196616
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 3                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 4                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 6                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                               AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 3                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 4                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 5                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 6                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                          Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 3 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 4 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 5 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 6 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
##                 todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020         38.65447         43.30859         34.63334
## 2 Thu Sep 03 14:02:01 2020         37.89308         37.26308         23.94914
## 3 Thu Sep 03 14:02:01 2020         33.10215         39.16493         25.82138
## 4 Thu Sep 03 14:02:01 2020         36.52194         57.39502         35.30746
## 5 Thu Sep 03 14:02:01 2020         32.60534         36.08616         25.93888
## 6 Thu Sep 03 14:02:01 2020         29.32294         42.79331         21.29711
##   healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1         19.88367         31.57302         17.33214         69.51032
## 2         17.88074         28.76127         24.04333         36.20929
## 3         18.49482         27.73537         18.57254         41.24310
## 4         20.50253         24.70425         21.05146         45.07237
## 5         20.11668         31.67117         19.19682         34.59911
## 6         18.51708         25.70072         19.08679         32.79410
##   healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1         11.90238         29.29495          22.12536          84.22110
## 2         10.89609         23.64742          21.44924          70.99529
## 3         12.05248         15.14195          20.64977          66.46961
## 4         14.53013         17.09744          22.74018          96.51198
## 5         12.27169         18.57713          21.48818          85.75042
## 6         11.39472         18.52626          26.56640          83.33878
##   healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1          50.67037         10.374223          31.11561          23.81043
## 2          51.22769         11.265804          41.29142          19.52040
## 3          44.22388         11.719167          26.24507          23.72086
## 4          50.89903          9.894555          37.57538          15.86959
## 5          55.35348         11.183972          38.27067          18.14704
## 6          49.70900         12.427741          34.60306          22.33815
##   healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1          20.46186          41.48475          18.62474          46.30124
## 2          21.43320          41.84901          18.35966         111.66591
## 3          18.35441          43.20239          17.93534          55.34340
## 4          21.01759          37.74332          19.34599          65.39802
## 5          18.26951          45.02007          24.47236          96.89240
## 6          23.96481          42.46289          42.95787          75.40298
##   healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1          45.55125          15.28551           71.80305           21.75414
## 2          46.01727          19.27535           60.70073           21.36817
## 3          43.43277          21.97501           69.80451           24.61685
## 4          27.45052          20.00639           67.25343           25.10416
## 5          31.17226          20.38369           43.01904           18.60533
## 6          28.97394          17.00365           47.91530           22.10635
##   acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1           49.93791           20.91372           60.35461           29.56083
## 2           41.31194           22.03221           45.32658           35.77288
## 3           39.38423           22.70426           45.83045           33.39604
## 4           42.68764           24.91723           41.56295           26.09728
## 5           41.48576           17.26287           42.07435           35.45385
## 6           43.90134           20.33773           42.12905           29.83464
##   acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1          115.26505           12.74919           14.58672            58.68188
## 2           81.45244           15.14020           17.59256            48.55106
## 3           95.05012           14.23976           15.65921            61.44041
## 4          225.20762           14.68151           13.63993            47.56154
## 5           85.27062           16.34465           17.65256            55.49571
## 6          120.56462           15.63079           15.93060            53.69469
##   acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1            33.32225            23.25981            9.588795
## 2            42.21454            22.07492            7.248266
## 3            41.33231            21.94181            9.866293
## 4            40.75957            18.89717            7.216334
## 5            45.82252            19.04510            7.861977
## 6            44.52237            18.59789            7.733616
##   acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1            20.86345            29.27038            43.14832
## 2            24.54406            25.49322            32.11470
## 3            19.76155            23.89585            59.11455
## 4            19.51075            23.89222            38.61305
## 5            29.57920            21.97918            37.74002
## 6            25.89079            22.59387            45.24864
##   acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1            14.57474            28.45002            15.69350
## 2            11.75390            23.04194            15.93794
## 3            11.33043            22.03662            17.25698
## 4            13.01954            35.32621            21.17568
## 5            16.39203            23.02250            17.74235
## 6            10.52267            24.30442            17.48153
##   acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1            21.95486            45.46997            28.11229
## 2            17.41908            55.92618            54.60819
## 3            19.55800            70.16983            44.13887
## 4            15.41068            50.07806            37.08963
## 5            18.09133            46.35208            47.28785
## 6            18.50505            52.38181            43.69408
##   acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1            34.01966            36.43714           11.098142
## 2            21.31802            48.58598           11.585635
## 3            22.18329            31.00082           11.101384
## 4            24.19148            36.22127            7.693474
## 5            34.25241            25.66271           10.716630
## 6            35.25920            32.90474           11.448932
##   acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1            20.95721            18.77368            22.93936
## 2            28.65290            21.22896            30.24262
## 3            19.88627            22.42390            31.73242
## 4            34.17640            23.67241            29.95222
## 5            30.62333            19.26661            28.75988
## 6            27.67493            23.11083            36.57107
##   Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1            15.51901            14.49904            17.05954
## 2            27.99790            17.53823            17.45409
## 3            24.99516            16.98122            17.99978
## 4            25.08477            18.85538            15.41842
## 5            21.39371            18.72753            18.61650
## 6            18.73800            18.67212            15.78598
##   Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1            15.50139            22.02124            59.06156
## 2            17.15096            20.74663            41.57386
## 3            26.85810            33.94878            43.68206
## 4            31.43632            17.65364            37.33166
## 5            20.21483            23.68388            45.04598
## 6            18.25358            22.51697            42.97516
##   Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1            20.16316            36.54447            24.36854
## 2            17.66582            46.34294            26.26307
## 3            18.38415            38.46949            34.83685
## 4            15.72403            39.26543            27.46579
## 5            14.83868            37.63227            32.73706
## 6            15.53457            35.80330            33.23423
##   Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1             16.58916             71.64483             43.23948
## 2             20.88941             69.77408             50.10176
## 3             21.46458             49.33115             49.48506
## 4             25.84191             66.44381             44.60148
## 5             20.71777             58.78334             49.35734
## 6             18.91623             51.90412             56.34813
##   Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1             20.09754             33.01413             8.609319
## 2             22.28110             42.87377             7.016328
## 3             28.64540             47.57697             8.424463
## 4             25.71589             33.50081             6.436601
## 5             22.76641             37.61831             9.280659
## 6             19.49066             36.03596             7.416995
##   Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1             31.82035             27.06322             22.18911
## 2             39.48848             33.98432             15.84832
## 3             34.23690             32.57032             13.98179
## 4             40.12788             35.27267             14.66785
## 5             54.02090             38.60195             32.63592
## 6             39.58097             34.13032             26.50713
##   Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1             43.05822             35.09371             34.37424
## 2             36.98065             28.37421             42.29672
## 3             47.55496             45.95646             37.90742
## 4             44.62578             35.60238             36.38586
## 5             44.48999             37.91425             37.01849
## 6             43.65493             42.85820             34.90846
##   Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1             32.27040             14.20441             23.45584
## 2             31.52575             21.26285             32.94306
## 3             39.12872             14.04135             28.71038
## 4             34.95027             19.32330             21.73639
## 5             38.18108             15.24316             36.73830
## 6             39.83798             13.71510             50.11779
##   Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1             12.04484             33.61438             25.23156
## 2             15.10542             30.28620             27.50727
## 3             13.34428             28.29141             20.68341
## 4             13.17195             34.61330             16.93934
## 5             13.71118             27.07499             14.21209
## 6             13.10811             27.00124             18.82190
##   Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1             23.81457             17.47219             51.39457
## 2             29.23864             18.47114             28.10985
## 3             27.43140             15.91317             45.41752
## 4             60.77292             14.68160             47.29560
## 5             26.86661             15.36001             47.90822
## 6             27.11552             16.07950             38.76281
##   Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1             32.73614             29.56239             15.20396
## 2             35.55087             43.75245             13.49950
## 3             35.23614             38.91768             27.97887
## 4             50.26657             53.04640             23.69328
## 5             31.21551             47.96653             23.93062
## 6             34.74876             40.62482             21.30499
##   Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1             16.99264             28.08311             39.58991
## 2             17.75465             21.52223             31.17582
## 3             24.47172             21.86029             28.76703
## 4             19.18802             19.77882             30.05590
## 5             21.74967             23.60711             24.50661
## 6             22.19385             25.53545             32.77911
##   Antibodies_6months_10
## 1              39.65745
## 2              30.49779
## 3              39.56165
## 4              24.02602
## 5              26.39521
## 6              27.63658

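# Count how many rows of allTop3systems each gene contributes.
bodySystems3_geneCounts <- allTop3systems %>% group_by(gene) %>%
  count(gene)
# Sort genes from most to least frequent; count() names its column 'n'.
bodySystems3_geneCounts <- bodySystems3_geneCounts[order(bodySystems3_geneCounts$n, decreasing = TRUE), ]
# Rename the count column to something more descriptive.
colnames(bodySystems3_geneCounts)[2] <- 'geneCounts'
bodySystems3_geneCounts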

## # A tibble: 96 x 2
## # Groups:   gene [96]
##    gene    geneCounts
##    <fct>        <int>
##  1 CYP19A1         25
##  2 ESR1            20
##  3 VDR             16
##  4 PTGS1           12
##  5 HFE              9
##  6 PTGS2            9
##  7 GFAP             8
##  8 ESR2             8
##  9 IGF1R            7
## 10 FLT4             6
## # ... with 86 more rows
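
As a minimal sketch of what the counting step above does, and of what to expect when such counts are joined back onto the per-row table, the base-R example below uses a small invented data frame. The names `toy`, `toyCounts` and `toyTotal` are hypothetical, and the rows are made up for illustration; only the column names `gene`, `proteinSearched` and `geneCounts` mirror the analysis.

```r
# Toy stand-in for allTop3systems: one row per (gene, search hit).
# These rows are invented for illustration and are not taken from the data.
toy <- data.frame(
  gene = c("ADH1B", "ADH1B", "ADH1B", "AANAT", "VDR", "VDR"),
  proteinSearched = c("alcohol", "alcohol", "alcohol",
                      "melatonin", "vitamin D", "vitamin D"),
  stringsAsFactors = FALSE
)

# Count rows per gene (a base-R analogue of group_by(gene) %>% count(gene)).
toyCounts <- as.data.frame(table(gene = toy$gene),
                           responseName = "geneCounts",
                           stringsAsFactors = FALSE)

# Sort from most to least frequent gene.
toyCounts <- toyCounts[order(toyCounts$geneCounts, decreasing = TRUE), ]

# Join the counts back onto every original row. Because each gene has one
# row in toyCounts and several in toy, the join is one-to-many: a gene with
# geneCounts == k shows up k times in the merged table.
toyTotal <- merge(toyCounts, toy, by = "gene")
print(toyCounts)
print(toyTotal)
```

The merged table has the same number of rows as the original table, so a gene that repeats in the head of a merged result is simply being shown once per original row, each time carrying the same count.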

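# Attach each gene's count to every row of allTop3systems; a gene that
# occurs k times in allTop3systems appears k times in the merged table.
bodySystemsTotal <- merge(bodySystems3_geneCounts, allTop3systems, by = 'gene')
head(bodySystemsTotal)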

##    gene geneCounts proteinSearched       EnsemblID
## 1 AANAT          1       melatonin ENSG00000129673
## 2 ADH1B          5         alcohol ENSG00000196616
## 3 ADH1B          5         alcohol ENSG00000196616
## 4 ADH1B          5         alcohol ENSG00000196616
## 5 ADH1B          5         alcohol ENSG00000196616
## 6 ADH1B          5         alcohol ENSG00000196616
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 3                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 4                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 6                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                               AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 3                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 4                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 5                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 6                                             ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                          Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 3 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 4 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 5 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 6 Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
##                 todaysDate healthyControl_1 healthyControl_2 healthyControl_3
## 1 Thu Sep 03 14:12:01 2020         38.65447         43.30859         34.63334
## 2 Thu Sep 03 14:02:01 2020         33.10215         39.16493         25.82138
## 3 Thu Sep 03 14:02:01 2020         29.32294         42.79331         21.29711
## 4 Thu Sep 03 14:02:01 2020         37.89308         37.26308         23.94914
## 5 Thu Sep 03 14:02:01 2020         32.60534         36.08616         25.93888
## 6 Thu Sep 03 14:02:01 2020         36.52194         57.39502         35.30746
##   healthyControl_4 healthyControl_5 healthyControl_6 healthyControl_7
## 1         19.88367         31.57302         17.33214         69.51032
## 2         18.49482         27.73537         18.57254         41.24310
## 3         18.51708         25.70072         19.08679         32.79410
## 4         17.88074         28.76127         24.04333         36.20929
## 5         20.11668         31.67117         19.19682         34.59911
## 6         20.50253         24.70425         21.05146         45.07237
##   healthyControl_8 healthyControl_9 healthyControl_10 healthyControl_11
## 1         11.90238         29.29495          22.12536          84.22110
## 2         12.05248         15.14195          20.64977          66.46961
## 3         11.39472         18.52626          26.56640          83.33878
## 4         10.89609         23.64742          21.44924          70.99529
## 5         12.27169         18.57713          21.48818          85.75042
## 6         14.53013         17.09744          22.74018          96.51198
##   healthyControl_12 healthyControl_13 healthyControl_14 healthyControl_15
## 1          50.67037         10.374223          31.11561          23.81043
## 2          44.22388         11.719167          26.24507          23.72086
## 3          49.70900         12.427741          34.60306          22.33815
## 4          51.22769         11.265804          41.29142          19.52040
## 5          55.35348         11.183972          38.27067          18.14704
## 6          50.89903          9.894555          37.57538          15.86959
##   healthyControl_16 healthyControl_17 healthyControl_18 healthyControl_19
## 1          20.46186          41.48475          18.62474          46.30124
## 2          18.35441          43.20239          17.93534          55.34340
## 3          23.96481          42.46289          42.95787          75.40298
## 4          21.43320          41.84901          18.35966         111.66591
## 5          18.26951          45.02007          24.47236          96.89240
## 6          21.01759          37.74332          19.34599          65.39802
##   healthyControl_20 healthyControl_21 acuteLymeDisease_1 acuteLymeDisease_2
## 1          45.55125          15.28551           71.80305           21.75414
## 2          43.43277          21.97501           69.80451           24.61685
## 3          28.97394          17.00365           47.91530           22.10635
## 4          46.01727          19.27535           60.70073           21.36817
## 5          31.17226          20.38369           43.01904           18.60533
## 6          27.45052          20.00639           67.25343           25.10416
##   acuteLymeDisease_3 acuteLymeDisease_4 acuteLymeDisease_5 acuteLymeDisease_6
## 1           49.93791           20.91372           60.35461           29.56083
## 2           39.38423           22.70426           45.83045           33.39604
## 3           43.90134           20.33773           42.12905           29.83464
## 4           41.31194           22.03221           45.32658           35.77288
## 5           41.48576           17.26287           42.07435           35.45385
## 6           42.68764           24.91723           41.56295           26.09728
##   acuteLymeDisease_7 acuteLymeDisease_8 acuteLymeDisease_9 acuteLymeDisease_10
## 1          115.26505           12.74919           14.58672            58.68188
## 2           95.05012           14.23976           15.65921            61.44041
## 3          120.56462           15.63079           15.93060            53.69469
## 4           81.45244           15.14020           17.59256            48.55106
## 5           85.27062           16.34465           17.65256            55.49571
## 6          225.20762           14.68151           13.63993            47.56154
##   acuteLymeDisease_11 acuteLymeDisease_12 acuteLymeDisease_13
## 1            33.32225            23.25981            9.588795
## 2            41.33231            21.94181            9.866293
## 3            44.52237            18.59789            7.733616
## 4            42.21454            22.07492            7.248266
## 5            45.82252            19.04510            7.861977
## 6            40.75957            18.89717            7.216334
##   acuteLymeDisease_14 acuteLymeDisease_15 acuteLymeDisease_16
## 1            20.86345            29.27038            43.14832
## 2            19.76155            23.89585            59.11455
## 3            25.89079            22.59387            45.24864
## 4            24.54406            25.49322            32.11470
## 5            29.57920            21.97918            37.74002
## 6            19.51075            23.89222            38.61305
##   acuteLymeDisease_17 acuteLymeDisease_18 acuteLymeDisease_19
## 1            14.57474            28.45002            15.69350
## 2            11.33043            22.03662            17.25698
## 3            10.52267            24.30442            17.48153
## 4            11.75390            23.04194            15.93794
## 5            16.39203            23.02250            17.74235
## 6            13.01954            35.32621            21.17568
##   acuteLymeDisease_20 acuteLymeDisease_21 acuteLymeDisease_22
## 1            21.95486            45.46997            28.11229
## 2            19.55800            70.16983            44.13887
## 3            18.50505            52.38181            43.69408
## 4            17.41908            55.92618            54.60819
## 5            18.09133            46.35208            47.28785
## 6            15.41068            50.07806            37.08963
##   acuteLymeDisease_23 acuteLymeDisease_24 acuteLymeDisease_25
## 1            34.01966            36.43714           11.098142
## 2            22.18329            31.00082           11.101384
## 3            35.25920            32.90474           11.448932
## 4            21.31802            48.58598           11.585635
## 5            34.25241            25.66271           10.716630
## 6            24.19148            36.22127            7.693474
##   acuteLymeDisease_26 acuteLymeDisease_27 acuteLymeDisease_28
## 1            20.95721            18.77368            22.93936
## 2            19.88627            22.42390            31.73242
## 3            27.67493            23.11083            36.57107
## 4            28.65290            21.22896            30.24262
## 5            30.62333            19.26661            28.75988
## 6            34.17640            23.67241            29.95222
##   Antibodies_1month_1 Antibodies_1month_2 Antibodies_1month_3
## 1            15.51901            14.49904            17.05954
## 2            24.99516            16.98122            17.99978
## 3            18.73800            18.67212            15.78598
## 4            27.99790            17.53823            17.45409
## 5            21.39371            18.72753            18.61650
## 6            25.08477            18.85538            15.41842
##   Antibodies_1month_4 Antibodies_1month_5 Antibodies_1month_6
## 1            15.50139            22.02124            59.06156
## 2            26.85810            33.94878            43.68206
## 3            18.25358            22.51697            42.97516
## 4            17.15096            20.74663            41.57386
## 5            20.21483            23.68388            45.04598
## 6            31.43632            17.65364            37.33166
##   Antibodies_1month_7 Antibodies_1month_8 Antibodies_1month_9
## 1            20.16316            36.54447            24.36854
## 2            18.38415            38.46949            34.83685
## 3            15.53457            35.80330            33.23423
## 4            17.66582            46.34294            26.26307
## 5            14.83868            37.63227            32.73706
## 6            15.72403            39.26543            27.46579
##   Antibodies_1month_10 Antibodies_1month_11 Antibodies_1month_12
## 1             16.58916             71.64483             43.23948
## 2             21.46458             49.33115             49.48506
## 3             18.91623             51.90412             56.34813
## 4             20.88941             69.77408             50.10176
## 5             20.71777             58.78334             49.35734
## 6             25.84191             66.44381             44.60148
##   Antibodies_1month_13 Antibodies_1month_14 Antibodies_1month_15
## 1             20.09754             33.01413             8.609319
## 2             28.64540             47.57697             8.424463
## 3             19.49066             36.03596             7.416995
## 4             22.28110             42.87377             7.016328
## 5             22.76641             37.61831             9.280659
## 6             25.71589             33.50081             6.436601
##   Antibodies_1month_16 Antibodies_1month_17 Antibodies_1month_18
## 1             31.82035             27.06322             22.18911
## 2             34.23690             32.57032             13.98179
## 3             39.58097             34.13032             26.50713
## 4             39.48848             33.98432             15.84832
## 5             54.02090             38.60195             32.63592
## 6             40.12788             35.27267             14.66785
##   Antibodies_1month_19 Antibodies_1month_20 Antibodies_1month_21
## 1             43.05822             35.09371             34.37424
## 2             47.55496             45.95646             37.90742
## 3             43.65493             42.85820             34.90846
## 4             36.98065             28.37421             42.29672
## 5             44.48999             37.91425             37.01849
## 6             44.62578             35.60238             36.38586
##   Antibodies_1month_22 Antibodies_1month_23 Antibodies_1month_24
## 1             32.27040             14.20441             23.45584
## 2             39.12872             14.04135             28.71038
## 3             39.83798             13.71510             50.11779
## 4             31.52575             21.26285             32.94306
## 5             38.18108             15.24316             36.73830
## 6             34.95027             19.32330             21.73639
##   Antibodies_1month_25 Antibodies_1month_26 Antibodies_1month_27
## 1             12.04484             33.61438             25.23156
## 2             13.34428             28.29141             20.68341
## 3             13.10811             27.00124             18.82190
## 4             15.10542             30.28620             27.50727
## 5             13.71118             27.07499             14.21209
## 6             13.17195             34.61330             16.93934
##   Antibodies_6months_1 Antibodies_6months_2 Antibodies_6months_3
## 1             23.81457             17.47219             51.39457
## 2             27.43140             15.91317             45.41752
## 3             27.11552             16.07950             38.76281
## 4             29.23864             18.47114             28.10985
## 5             26.86661             15.36001             47.90822
## 6             60.77292             14.68160             47.29560
##   Antibodies_6months_4 Antibodies_6months_5 Antibodies_6months_6
## 1             32.73614             29.56239             15.20396
## 2             35.23614             38.91768             27.97887
## 3             34.74876             40.62482             21.30499
## 4             35.55087             43.75245             13.49950
## 5             31.21551             47.96653             23.93062
## 6             50.26657             53.04640             23.69328
##   Antibodies_6months_7 Antibodies_6months_8 Antibodies_6months_9
## 1             16.99264             28.08311             39.58991
## 2             24.47172             21.86029             28.76703
## 3             22.19385             25.53545             32.77911
## 4             17.75465             21.52223             31.17582
## 5             21.74967             23.60711             24.50661
## 6             19.18802             19.77882             30.05590
##   Antibodies_6months_10
## 1              39.65745
## 2              39.56165
## 3              27.63658
## 4              30.49779
## 5              26.39521
## 6              24.02602

colnames(bodySystemsTotal)

##  [1] "gene"                  "geneCounts"            "proteinSearched"      
##  [4] "EnsemblID"             "EntrezSummary"         "GeneCardsSummary"     
##  [7] "UniProtKB_Summary"     "todaysDate"            "healthyControl_1"     
## [10] "healthyControl_2"      "healthyControl_3"      "healthyControl_4"     
## [13] "healthyControl_5"      "healthyControl_6"      "healthyControl_7"     
## [16] "healthyControl_8"      "healthyControl_9"      "healthyControl_10"    
## [19] "healthyControl_11"     "healthyControl_12"     "healthyControl_13"    
## [22] "healthyControl_14"     "healthyControl_15"     "healthyControl_16"    
## [25] "healthyControl_17"     "healthyControl_18"     "healthyControl_19"    
## [28] "healthyControl_20"     "healthyControl_21"     "acuteLymeDisease_1"   
## [31] "acuteLymeDisease_2"    "acuteLymeDisease_3"    "acuteLymeDisease_4"   
## [34] "acuteLymeDisease_5"    "acuteLymeDisease_6"    "acuteLymeDisease_7"   
## [37] "acuteLymeDisease_8"    "acuteLymeDisease_9"    "acuteLymeDisease_10"  
## [40] "acuteLymeDisease_11"   "acuteLymeDisease_12"   "acuteLymeDisease_13"  
## [43] "acuteLymeDisease_14"   "acuteLymeDisease_15"   "acuteLymeDisease_16"  
## [46] "acuteLymeDisease_17"   "acuteLymeDisease_18"   "acuteLymeDisease_19"  
## [49] "acuteLymeDisease_20"   "acuteLymeDisease_21"   "acuteLymeDisease_22"  
## [52] "acuteLymeDisease_23"   "acuteLymeDisease_24"   "acuteLymeDisease_25"  
## [55] "acuteLymeDisease_26"   "acuteLymeDisease_27"   "acuteLymeDisease_28"  
## [58] "Antibodies_1month_1"   "Antibodies_1month_2"   "Antibodies_1month_3"  
## [61] "Antibodies_1month_4"   "Antibodies_1month_5"   "Antibodies_1month_6"  
## [64] "Antibodies_1month_7"   "Antibodies_1month_8"   "Antibodies_1month_9"  
## [67] "Antibodies_1month_10"  "Antibodies_1month_11"  "Antibodies_1month_12" 
## [70] "Antibodies_1month_13"  "Antibodies_1month_14"  "Antibodies_1month_15" 
## [73] "Antibodies_1month_16"  "Antibodies_1month_17"  "Antibodies_1month_18" 
## [76] "Antibodies_1month_19"  "Antibodies_1month_20"  "Antibodies_1month_21" 
## [79] "Antibodies_1month_22"  "Antibodies_1month_23"  "Antibodies_1month_24" 
## [82] "Antibodies_1month_25"  "Antibodies_1month_26"  "Antibodies_1month_27" 
## [85] "Antibodies_6months_1"  "Antibodies_6months_2"  "Antibodies_6months_3" 
## [88] "Antibodies_6months_4"  "Antibodies_6months_5"  "Antibodies_6months_6" 
## [91] "Antibodies_6months_7"  "Antibodies_6months_8"  "Antibodies_6months_9" 
## [94] "Antibodies_6months_10"

bodySystems3_geneMeans <- bodySystemsTotal %>% group_by(gene) %>% 
  summarise_at(vars('healthyControl_1':'Antibodies_6months_10'),mean)

BodySystems_countsAndMeans <- merge(bodySystems3_geneCounts,
                                    bodySystems3_geneMeans,
                                    by.x='gene',by.y='gene')
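The two steps above (averaging rows that share a gene, then merging with the per-gene counts) can be sketched on toy data. The values and the `toy_*` names below are made up for illustration, and the sketch assumes `geneCounts` is simply the number of rows per gene, which may differ from how `bodySystems3_geneCounts` was actually built.

```r
library(dplyr)

# Toy data (hypothetical values): two rows map to geneA, one to geneB
toy <- data.frame(
  gene = c("geneA", "geneA", "geneB"),
  healthyControl_1 = c(10, 20, 5),
  acuteLymeDisease_1 = c(30, 50, 7)
)

# Assumption: geneCounts is the number of rows for each gene
toy_counts <- toy %>% count(gene, name = "geneCounts")

# Mean of every sample column within each gene
toy_means <- toy %>%
  group_by(gene) %>%
  summarise(across(healthyControl_1:acuteLymeDisease_1, mean))

# One row per gene: the count followed by the per-sample means
toy_countsAndMeans <- merge(toy_counts, toy_means, by = "gene")
print(toy_countsAndMeans)
```

Here geneA ends up with a count of 2 and per-sample means of 15 and 40, while geneB keeps its single row of values.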

We have now added, for each gene, its mean value in each sample and the number of times it appears in the full data. Next we compute the group means for the healthy, acute, 1 month of treatment, and 6 months of treatment groups.

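BS1 <- BodySystems_countsAndMeans %>% group_by(gene) %>% 
  mutate(
  # rowMeans(across(...)) averages over every sample column in the range;
  # mean(a:b) would instead average the numeric sequence running from a to b
  healthyMean = rowMeans(across(healthyControl_1:healthyControl_21)),
  acuteMean = rowMeans(across(acuteLymeDisease_1:acuteLymeDisease_28)),
  month1 = rowMeans(across(Antibodies_1month_1:Antibodies_1month_27)),
  month6 = rowMeans(across(Antibodies_6months_1:Antibodies_6months_10))
  )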
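A small sketch on toy data of averaging across a range of sample columns for each gene. The values are hypothetical, and `across()` assumes dplyr 1.0.0 or later. Inside `mutate()`, `a:b` applied to two numeric columns builds a numeric sequence from the first value to the last instead of selecting the columns in between, so the sketch uses `rowMeans(across(...))` for the row-wise mean.

```r
library(dplyr)

# Toy data (hypothetical values): three healthy and two acute samples
toy <- data.frame(
  gene = c("geneA", "geneB"),
  healthyControl_1 = c(10, 4),
  healthyControl_2 = c(40, 6),
  healthyControl_3 = c(20, 8),
  acuteLymeDisease_1 = c(30, 8),
  acuteLymeDisease_2 = c(50, 12)
)

toy_groupMeans <- toy %>%
  mutate(
    # across() selects the column range; rowMeans() averages it per row
    healthyMean = rowMeans(across(healthyControl_1:healthyControl_3)),
    acuteMean = rowMeans(across(acuteLymeDisease_1:acuteLymeDisease_2))
  )
print(toy_groupMeans[, c("gene", "healthyMean", "acuteMean")])

# For geneA the row-wise mean is (10 + 40 + 20) / 3, about 23.33.
# The colon operator on the values themselves ignores the middle column:
sequenceMean <- mean(10:20)  # mean of 10, 11, ..., 20, which is 15
```

The gap between 23.33 and 15 for geneA shows how much the two approaches can disagree when the middle samples differ from the first and last.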

Let's get the fold change values of these genes for each group relative to the healthy group.

BS1$acuteHealthy_foldChange <- BS1$acuteMean/BS1$healthyMean
BS1$month1Healthy_foldChange <- BS1$month1/BS1$healthyMean
BS1$month6Healthy_foldChange <- BS1$month6/BS1$healthyMean
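As a worked example of the ratios above, here is a minimal sketch on made-up group means. A fold change above 1 means the gene is higher in the acute group than in the healthy group, and below 1 means it is lower. The `acuteHealthy_log2FC` column is an addition for illustration and is not part of the original analysis. It puts up and down changes on a symmetric scale, and it assumes both means are positive.

```r
# Toy group means (hypothetical values)
toy <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  healthyMean = c(20, 10, 8),
  acuteMean = c(40, 5, 8)
)

# Ratio of the acute mean to the healthy mean: 2, 0.5, 1
toy$acuteHealthy_foldChange <- toy$acuteMean / toy$healthyMean

# Illustrative extra column: log2 of the ratio gives 1, -1, 0, so a
# doubling and a halving are the same distance from zero
toy$acuteHealthy_log2FC <- log2(toy$acuteHealthy_foldChange)

print(toy)
```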

library(tidyr)

BS1_tidy <- gather(BS1,key='sample',value='sampleValue',3:88)

BS1_tidy$group <- 'group'
healthy <- grep('healthy',BS1_tidy$sample)
acute <- grep('acute',BS1_tidy$sample)
month_1 <- grep('1month',BS1_tidy$sample)
month_6 <- grep('6month',BS1_tidy$sample)

BS1_tidy[healthy,12] <- 'healthy'
BS1_tidy[acute,12] <- 'acute'
BS1_tidy[month_1,12] <- 'month 1'
BS1_tidy[month_6,12] <- 'month 6'
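The reshape and labeling steps above can also be sketched on toy data with `pivot_longer()` (the newer tidyr replacement for `gather()`, assuming tidyr 1.0.0 or later) and `case_when()`. This is an alternative formulation under those assumptions, not the code used for the results here. It assigns the group by column name instead of by column position.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values): one sample from each group
toy <- data.frame(
  gene = c("geneA", "geneB"),
  healthyControl_1 = c(10, 4),
  acuteLymeDisease_1 = c(30, 8),
  Antibodies_1month_1 = c(25, 7),
  Antibodies_6months_1 = c(12, 5)
)

toy_tidy <- toy %>%
  # One row per gene and sample
  pivot_longer(cols = -gene, names_to = "sample", values_to = "sampleValue") %>%
  # Label each row from a pattern in its sample name
  mutate(group = case_when(
    grepl("healthy", sample) ~ "healthy",
    grepl("acute", sample) ~ "acute",
    grepl("1month", sample) ~ "month 1",
    grepl("6month", sample) ~ "month 6"
  ))

print(toy_tidy)
```

Any sample name that matches none of the patterns gets `NA` in `group`, which makes an unlabeled sample easy to spot.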

summs3 <- all3withVits[,c(1,2,4)]

BS1_tidy2_summs <- merge(summs3,BS1_tidy,by.x='gene',by.y='gene')

colnames(BS1_tidy2_summs)

##  [1] "gene"                     "proteinSearched"         
##  [3] "EntrezSummary"            "geneCounts"              
##  [5] "healthyMean"              "acuteMean"               
##  [7] "month1"                   "month6"                  
##  [9] "acuteHealthy_foldChange"  "month1Healthy_foldChange"
## [11] "month6Healthy_foldChange" "sample"                  
## [13] "sampleValue"              "group"

write.csv(BS1_tidy2_summs,'bodySystemLymeDiseaseGenes.csv',row.names=F)

Let's also write out these body system and vitamin/mineral/hormone genes to use in future gene expression analyses.

write.csv(all3withVits,'vitaminAndBodySystemSums.csv',row.names=F)

These genes were then loaded into a Tableau dashboard that shows the fold change, mean value, and sample value for each gene, with filters for selecting by gene, body system, or group.

Tableau Dashboard on Lyme Disease Body System Genes.

Tableau Dashboard of Body System Genes

Figure 11: Body system genes in Lyme disease for healthy people without Lyme disease, the acute phase, and after 1 and 6 months of treatment. The filters at the upper left select specific body systems, genes, or groups (acute, healthy, 1 month, or 6 months). The upper right shows the Entrez gene summary of the genes. The middle left shows the mean value of each gene in each group, with the acute, 1 month, and 6 month means compared to the healthy mean. The middle right is a bar chart of the fold change values (the acute/healthy, 1 month/healthy, and 6 months/healthy mean value ratios) for each gene. The bottom shows the gene expression value of each gene in each sample, colored by group membership in healthy, acute, 1 month, or 6 months of treatment. To select multiple genes, use ctrl+click; to deselect, click each gene again.

y*(max(y)-min(y))+min(y)

Lyme Disease Ticks

Janis Corona

8/26/2020-9/3/2020

Tableau Images of Charts

Machine Learning

Body System Genes and Added Genes